Introduction to How the Data X-Ray Scans Data
The most basic function of the Data X-Ray is to scan your data and give you an overview of where certain types of data are located. We support two basic data types (plus several subtypes, which are covered in another article): structured data and unstructured data.
Structured data is (normally) what you will find in an SQL database. It is data that has context and is pre-parsed into consumable elements. Unstructured data is free-form text and is often encountered in files such as Word documents, PDFs, and more.
Interpreting Structured Data (SQL Databases, Spreadsheets, etc.)
For structured datasources like SQL databases, you will get a result similar to the below.
Your first view is a high-level overview of the data you have scanned, shown in a visualization called a treemap. It is meant to show you how much data you have in each table within the schemas found in the database. You can then prioritize actions by the amount and types of data found and deep-dive into certain tables and columns in the Detail View section of the page. You may find data in there that you knew about and some that you did not.
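To make the treemap idea concrete, here is a minimal sketch of how per-table sizes could be derived from scan findings. It is an illustration only and not the Data X-Ray's actual implementation or export format: the tuple layout, the data type labels, and the function names treemap_sizes and largest_first are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical scan findings: (schema, table, column, data_type, value_count).
# These rows and labels are invented for illustration.
findings = [
    ("public", "customers", "email", "email_address", 1200),
    ("public", "customers", "full_name", "person_name", 1200),
    ("public", "orders", "shipping_address", "postal_address", 3400),
    ("billing", "invoices", "iban", "bank_account", 800),
]


def treemap_sizes(rows):
    """Sum the number of classified values per (schema, table)."""
    sizes = defaultdict(int)
    for schema, table, _column, _data_type, count in rows:
        sizes[(schema, table)] += count
    return dict(sizes)


def largest_first(sizes):
    """Order tables by amount of classified data, largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


sizes = treemap_sizes(findings)
ranked = largest_first(sizes)

for (schema, table), count in ranked:
    print(f"{schema}.{table}: {count} classified values")
```

Each (schema, table) pair corresponds to one rectangle in the treemap, and the summed count determines its area. Sorting largest first mirrors the way you might prioritize which tables to inspect in the Detail View.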
Structured data is supported in all languages (depending on the training data you provide, of course).
Interpreting Unstructured Data (Documents, PDFs, txt files, etc.)
Unstructured data is data found in text documents and records that is not naturally pre-parsed. That is, a computer does not know where to begin or end its analysis. For unstructured data types, the Data X-Ray has a second algorithm to accommodate such data and automatically switches to that algorithm when appropriate.
As with structured data, you are presented with a high-level treemap showing which documents contain which types of data. However, the Detail View is a bit more nuanced in the case of unstructured data. The example below, from a sample Google Drive scan, shows a mix of unstructured data in a document and structured data in a spreadsheet.
For unstructured data, you can also deep-dive with the Data Viewer*, which is accessed below the Detail View. In the Data Viewer, you can go through the scanned files and look through a word-by-word analysis of how the data was categorized. (Note: for security reasons, the Data Viewer only stores data in your browser and never stores your data on Ohalo servers.)
At the time of writing, unstructured data is supported only in Western languages, but we will be supporting Japanese and Mandarin soon. (Please contact us if there is a specific language you would like supported for unstructured data types.)
Notes on Semi-Structured Data (long text in databases, JSON, etc.)
Some text looks structured but is in fact more akin to unstructured data, for instance when a long text field is stored in an SQL database column. In the background, the Data X-Ray automatically determines whether any particular data element should be parsed as structured data or as unstructured data and presents the correct results to the user.
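As a rough illustration of this kind of decision, the sketch below routes a database column to a "structured" or "unstructured" parser based on how many of its values read like free-form text. This is not how the Data X-Ray makes the decision internally (that logic is not described here). The word-count rule, the thresholds, and the function names looks_unstructured and choose_parser are all assumptions chosen to keep the example small.

```python
def looks_unstructured(value, max_words=8):
    """Guess whether a single field value reads like free-form text.

    Assumption for this sketch: values with more than max_words
    whitespace-separated words are treated as free text.
    """
    return len(str(value).split()) > max_words


def choose_parser(column_values, threshold=0.5):
    """Return 'unstructured' when most non-empty values look like free text."""
    values = [v for v in column_values if v is not None and str(v).strip()]
    if not values:
        # Nothing to inspect, so fall back to treating the column as structured.
        return "structured"
    share = sum(looks_unstructured(v) for v in values) / len(values)
    return "unstructured" if share > threshold else "structured"


emails = ["alice@example.com", "bob@example.com", None]
notes = [
    "Customer called to say the parcel arrived late and asked for a refund to her card.",
    "",
]

print(choose_parser(emails))  # short, single-token values
print(choose_parser(notes))   # long free-text value
```

A word-count rule is deliberately crude. A real classifier would likely look at more signals, but the shape of the decision is the same: inspect the values in a column, then pick the parsing strategy that fits them.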
Taking Next Steps
You can use this information to take various actions on the data such as establishing consent, deleting data, merging data, and more. Contact us to let us know about your use cases!
* The Data Viewer is currently available only for SMB (Windows and Linux file systems) connectors and will be rolled out to our other unstructured datasources soon.
If you have any questions, do not hesitate to email us at [email protected] or simply click the Intercom button to start a chat with our support team.