By now, you should be familiar with what Classifiers and Classes are. If you have not read the previous article, we recommend that you do so.
Building a Classifier
Identifying areas for optimization and reducing false positives
We provide a few Classifiers out of the box. As an example, below is the result of scanning one of our SQL databases at Ohalo with the default Personal Data Classifier. The Details view of this SQL datasource drills down into the database contents, in this case into a table called "Orders" in the "Northwind" schema.
You will note that the Personal Data Classifier correctly classified one column as "Name (English)" and the ship_address column as "Addresses (USA & Canada)". However, it incorrectly classified the ship_city column as "First Name (English)" instead of as a city. The machine learning algorithm, choosing among the Personal Data Classifier's Classes, determined that the closest match for the text elements in this column was "First Name (English)".
This is known as a false positive: something that was captured but incorrectly classified. It happened because the Personal Data Classifier has no data on what a city is, so it can only pick the closest of the Classes it does know. To remove this false positive, we have to teach it what a city is. So let's dig in!
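To make the cause of the false positive concrete, here is a minimal sketch of a closed-set classifier. It is not the algorithm the Data X-Ray uses; the character-trigram scoring, the function names, and the tiny sample lists are all assumptions chosen for illustration. The point it shows is that a classifier can only answer with a Class it has been given, so a city name lands in the nearest existing Class until a "Cities" Class exists.

```python
from collections import Counter


def trigrams(text):
    """Return the character trigrams of a value, padded so word edges count."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(samples):
    """Count the trigrams seen across all training samples of one class."""
    profile = Counter()
    for sample in samples:
        profile.update(trigrams(sample))
    return profile


def score(value, profile):
    """Fraction of the value's trigrams that appear in the class profile."""
    grams = trigrams(value)
    return sum(1 for g in grams if g in profile) / len(grams)


def classify(value, profiles):
    """Closed-set decision: always return the best-scoring known class."""
    return max(profiles, key=lambda name: score(value, profiles[name]))


# Hypothetical toy training data, not the product's real Classes or samples.
without_cities = {
    "First Name (English)": build_profile(["Madison", "Austin", "Charlotte"]),
    "Emails": build_profile(["alice@example.com", "bob@example.org"]),
}

# Same classifier plus one extra class trained on a few city names.
with_cities = dict(without_cities)
with_cities["Cities"] = build_profile(
    ["Boston", "Kingston", "Charleroi", "Manchester"]
)

print(classify("Charleston", without_cities))
print(classify("Charleston", with_cities))
```

In this toy setup, "Charleston" is assigned to "First Name (English)" while no city Class exists, and to "Cities" once one is added, which mirrors the fix described in the rest of this tutorial.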
Optimizing and Improving: Adding a New Class
The first step is to view the Classes in the default Personal Data Classifier: in the Rules tab, click the "Personal Data" Classifier.
You will be taken to a page that describes the rules that the Classifier includes. You will see that the Classifier includes Classes like Addresses, Data Formats, Emails, Financial Data, Names, National ID Numbers, and more, but notably no Class for City is included (although we'll probably add one soon for our users since that is fairly common text to find in databases and documents). So let's add one.
- Click the "Create New Class" button
- Type in a name for the Class (in this case, "Cities")
- Select an appropriate category that reflects the context of this class within your workflow
- Select a detection rule (use "AI" for now)
Now we have a blank Class available to us where we can upload training data. You should see a screen something like the below.
Scrolling down, you will see a "Load Samples" button. Click it to open a text box where you can copy and paste training data (other methods for adding training data via an API are also available, but for this tutorial we'll use this text box). We have a list of world cities that we found on the internet; it includes about 23,000 city names. We're going to copy and paste these samples and click "Save Samples".
Note: As a general rule, machine learning works better the more data you give it. If you have more clean data, it is usually better to use all of it to increase the resolution of the algorithms. However, we use an algorithm that gets fairly good results with even a few hundred or a few thousand samples, so don't be shy about experimenting with fewer samples.
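Since the note above stresses clean data, it can help to tidy a list before pasting it into the text box. The sketch below is one possible approach, not part of the product: it assumes one sample per line (the tutorial does not specify the expected layout), and the function name and the small inline list are hypothetical. It trims whitespace, drops blank lines, and removes case-insensitive duplicates while keeping the original order.

```python
def clean_samples(lines):
    """Normalize whitespace, skip blanks, and drop case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        # Collapse runs of whitespace and strip leading/trailing space.
        value = " ".join(line.split())
        key = value.casefold()
        if not value or key in seen:
            continue
        seen.add(key)
        cleaned.append(value)
    return cleaned


# Hypothetical raw input; in practice this could be the lines of a text file.
raw = ["Paris", " paris ", "", "New  York", "Reims\n"]
samples = clean_samples(raw)

# One sample per line, ready to copy (layout is an assumption).
print("\n".join(samples))
```

Deduplicating before upload keeps repeated entries from being counted more than once, and it gives you an accurate count of how many distinct samples you are actually providing.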
Going back out to the rules page, we now have a screen that looks like the below, with a new Cities Class available to us.
Optimizing and Improving: Building a New Classifier
On the rules page, we can click the "Create a New Classifier" button that brings us to the page below.
You only need to do three things to build a new Classifier:
- Give your Classifier a name (we'll call it "Personal Data (English Language) Improved" for this tutorial)
- Select the Classes that you want to include in your Classifier (in this case we are adding a new Class, "Cities", to the existing personal data Classes that are available).
- Click "Create Classifier" at the bottom.
Using our New Classifier
As the final step to building a Classifier, you want to actually use it. Now we can go back to our Datasource that we were scanning (viewable in the datasources page). You will see that next to the scan button there is a dropdown available to you.
You can now select the new Classifier to scan your data. When you select the new Classifier, the Data X-Ray will automatically build and retrain it and then rescan your datasource with it. Depending on the number of Classes that you have chosen and the amount of data within each Class, this may take a bit of time. You will get an email when it is complete. In our tutorial, the result was that the ship_city column is now correctly classified as "Cities" instead of "First Name (English)".
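If you want to track whether a rescan actually removed a false positive, one simple option is to compare the labels you expect for each column with the labels each scan produced. The sketch below is an illustration only: the dictionaries are typed in by hand from what this tutorial describes, and it does not assume any particular export format or API from the Data X-Ray. The function name is hypothetical.

```python
def misclassified(expected, predicted):
    """Return {column: (predicted_label, expected_label)} for every mismatch."""
    return {
        column: (predicted.get(column), wanted)
        for column, wanted in expected.items()
        if predicted.get(column) != wanted
    }


# Labels we expect for the two columns discussed in this tutorial.
expected = {
    "ship_address": "Addresses (USA & Canada)",
    "ship_city": "Cities",
}

# Results with the default Classifier, as described earlier in the tutorial.
before = {
    "ship_address": "Addresses (USA & Canada)",
    "ship_city": "First Name (English)",
}

# Results after rescanning with the improved Classifier.
after = {
    "ship_address": "Addresses (USA & Canada)",
    "ship_city": "Cities",
}

print(misclassified(expected, before))
print(misclassified(expected, after))
```

With these inputs the first comparison reports only ship_city as a mismatch and the second reports none. The same check extends naturally to more columns as you add further Classes.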
If you have any questions, do not hesitate to email us at [email protected] or simply click the Intercom button to start a chat with our support team.