Modern-day organizations are generating vast amounts of data that hold immense potential for making informed decisions. However, with the ever-growing volume of data, the greater challenge lies in how these organizations can generate actionable insights. Data classification plays a vital role in addressing this challenge.
Until now, organizations have relied on traditional methods for data classification, including natural language processing (NLP) and machine learning (ML), but these methods have shown limited success.
In this blog, we explore the journey of implementing SADL (Scribble Automated Data Labeller), a configurable and scalable system designed to classify data fields accurately. We also discuss the limitations of traditional classification methods, the move to large language models (LLMs), and the role of OpenAI’s GPT-3 in achieving breakthrough accuracy.
Preparation and methodology
The project uses a diverse dataset obtained from publicly available sources, specifically 1,300 randomly selected datasets provided by the Government of the United States. The implementation process began with data cleaning and exploratory data analysis. Over 300 generic features were extracted from the dataset to train the model. These features include textual and numerical attributes, such as a bag of words, histograms, percentage of special characters, and maximum/minimum length. Random Forest models were employed first for the classification tasks, but limitations in accuracy and contextual information led us to explore transformer models. The following sections describe the two methods in detail:
Our implementation involved training two separate Random Forest models: one for textual records and another for numerical records.
The following are the features extracted from the textual records:
- The bag-of-words feature was reduced to 300 features using TruncatedSVD
- The frequency of occurrence of each item was extracted and scaled down to 20 samples
- Percentage of special characters
- Percentage of digits
- Percentage of alphabetic characters
- Maximum length
- Minimum length
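The character-level features in this list can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `text_features` is hypothetical, and treating "special character" as anything that is not a letter, digit, or whitespace is an assumption. The bag-of-words and frequency features are left out.

```python
def text_features(values):
    """Compute character-level features for one column of text values."""
    joined = "".join(values)
    total = len(joined)
    if total == 0:
        return {"pct_special": 0.0, "pct_digits": 0.0, "pct_alpha": 0.0,
                "max_len": 0, "min_len": 0}
    digits = sum(ch.isdigit() for ch in joined)
    alpha = sum(ch.isalpha() for ch in joined)
    # Assumption: "special" means not a letter, digit, or whitespace.
    special = sum(not (ch.isalnum() or ch.isspace()) for ch in joined)
    lengths = [len(v) for v in values]
    return {
        "pct_special": 100 * special / total,
        "pct_digits": 100 * digits / total,
        "pct_alpha": 100 * alpha / total,
        "max_len": max(lengths),
        "min_len": min(lengths),
    }
```

Each column of a dataset would be passed in as a list of strings, producing one feature row per column.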
The following are the features extracted from numerical records:
- Maximum Value
- Minimum Value
- Mean Value
- Standard Deviation
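The four numerical features can be computed with Python's standard library. This is a sketch under assumptions: the name `numeric_features` is hypothetical, and the post does not say whether the population or the sample standard deviation was used, so the population form is chosen here.

```python
import statistics

def numeric_features(values):
    """Summary statistics for one numerical column (non-empty list of numbers)."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        # Assumption: population standard deviation; use statistics.stdev
        # instead for the sample standard deviation.
        "std": statistics.pstdev(values),
    }
```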
These features were used to train the two models, using a 30-70 test-train split.
Accuracy – Exact Match: The predicted label is exactly the same as the given label
Accuracy – At Least 1 word: At least one word is common to the predicted label and the given label
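The two accuracy measures can be expressed in a few lines. This is an illustrative sketch: the function names are hypothetical, and the case-insensitive, whitespace-split comparison is an assumption, since the post does not describe how labels were normalized.

```python
def exact_match(predicted, given):
    """True if the two labels are identical (ignoring case and outer spaces)."""
    return predicted.strip().lower() == given.strip().lower()

def at_least_one_word(predicted, given):
    """True if the two labels share at least one word."""
    return bool(set(predicted.lower().split()) & set(given.lower().split()))

def accuracy(predictions, given_labels, match_fn):
    """Fraction of label pairs for which match_fn returns True."""
    if not given_labels:
        return 0.0
    hits = sum(match_fn(p, g) for p, g in zip(predictions, given_labels))
    return hits / len(given_labels)
```

By construction, the "at least 1 word" score can never be lower than the exact-match score on the same predictions.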
Results for Text Data
|Accuracy – Exact Match
|Accuracy – At Least 1 word
Results for Numerical Data
|Accuracy – Exact Match
|Accuracy – At Least 1 word
This approach reached up to 49% accuracy and could have been improved by restricting predictions to a predetermined set of labels using business rules. However, it was limited in its ability to capture contextual information.
We realized that more complex models trained on much larger and more diverse data would be able to generalize better.
This is why we decided to use LLMs, which are based on the transformer architecture.
The transformer model is a type of neural network architecture that was introduced in 2017 to address the limitations of earlier approaches to processing sequential data. It uses a self-attention mechanism to process input data in parallel and can capture long-range dependencies and contextual relationships between words in a sentence.
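To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification for illustration only: a real transformer derives the queries, keys, and values from learned projections and uses multiple attention heads, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Each output row is a weighted average of all value rows."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

Because every position attends to every other position directly, a word at the start of a sentence can influence the representation of a word at the end without passing through the steps in between.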
These models are huge, require a large amount of data to train, and can cost millions of dollars to build. However, many pre-trained models are publicly available, either as open-source downloads or through API services.
These large models are the state of the art in many NLP tasks, such as sentiment analysis and text classification.
The task given to the model during inference is completely determined by the prompt and parameters that you pass in through the API. The API will respond to your prompt and you can use a series of prompts to get your task done. The next section will share details on how we used such an API-based service to complete the task.
We chose OpenAI’s GPT-3 because it is easy to use and more flexible about the categories it can work with.
Hyperparameters used for the experiment
- Model: text-davinci-003
- Temperature: 0.1
- Max_tokens: 3500
- Top_p: 1
- Frequency_penalty: 0
- Presence_penalty: 0
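As a rough sketch, these settings map onto the parameters of a completion request as shown below. The helper name `build_completion_request` is hypothetical, and no API call is made; the dictionary only collects the values listed above under the lowercase names the OpenAI completions endpoint uses (`top_p` is the API's name for the nucleus sampling setting).

```python
def build_completion_request(prompt):
    """Collect the experiment's settings into one request payload."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "temperature": 0.1,       # low value keeps outputs close to deterministic
        "max_tokens": 3500,       # upper bound on the length of the completion
        "top_p": 1,               # nucleus sampling disabled (full distribution)
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }
```

Keeping the settings in one place makes it easier to reuse exactly the same configuration for the domain, category, and subcategory prompts.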
We followed a three-level hierarchy for classification. The top level is the domain, which describes the domain of the dataset as a whole (not of the column label). The following are the domains that we finalized after multiple iterations:
The next level is the category, which describes the category of the column label. The following categories were selected after multiple iterations:
- Education level
The third level is the subcategory, which describes the subcategory of the column label. The subcategories were optimized to fit the diversity within each category, so the list of subcategories varies from category to category.
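One way to represent such a hierarchy in code is a set of domains plus a mapping from each category to its own subcategory list. The sketch below is hypothetical: apart from "Education level", every label is a placeholder invented for illustration, and `validate_labels` is not part of SADL.

```python
# Placeholder labels for illustration only; not the project's actual lists.
DOMAINS = {"Health", "Education"}
SUBCATEGORIES_BY_CATEGORY = {
    "Education level": {"Primary", "Secondary", "Tertiary"},
    "Location": {"City", "State"},
}

def validate_labels(domain, category, subcategory):
    """Return a list of problems with a predicted label triple (empty if valid)."""
    problems = []
    if domain not in DOMAINS:
        problems.append(f"unknown domain: {domain}")
    allowed = SUBCATEGORIES_BY_CATEGORY.get(category)
    if allowed is None:
        problems.append(f"unknown category: {category}")
    elif subcategory not in allowed:
        # Subcategories are only valid within their own category.
        problems.append(f"subcategory {subcategory!r} not allowed for {category!r}")
    return problems
```

A check like this can flag model outputs that fall outside the agreed label set before they are stored.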
The experiments and trials showed an acceptability rate of over 90% for both domain and category-related tasks. This was manually verified on a randomly selected subset of the dataset.
The following table summarizes the results for the domain:
The following table summarizes the results for the categories:
The following table gives an example of how the GPT-3 model can be used to determine the category, sub-category, and description for a column label.
|Year of the Morbidity and Mortality Weekly Report
|Number of live births per 1000 people
|The authority of a legal body to exercise its power over a certain area.
Comparison of Approaches:
Language models like GPT-3 excel at text completion tasks and exhibit high accuracy thanks to their extensive training data. However, their performance may decrease when they are prompted with highly specific queries. We could fine-tune LLMs with domain-specific data, but this can be compute-intensive. Few-shot learning, in which the prompt contains a few examples of the completed task, is another viable approach: it lets the model pick up a new task without any additional training.
In conclusion, our SADL project has demonstrated the power of large language models in data classification. OpenAI’s GPT-3 delivered an acceptability rate of over 90% and was able to generate descriptions for previously unknown data. Fine-tuning models and exploring LLMs available for download are promising avenues for future improvements. As the volume of available data continues to grow, efficient techniques like LLMs are essential for extracting meaningful insights. SADL is one example of using advanced technologies to unlock the untapped potential of data.
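A few-shot prompt can be assembled as plain text. The sketch below is an assumption about how such a prompt might look, not the prompt used in SADL: the function name, the wording of the instruction, and the example labels are all hypothetical.

```python
def build_few_shot_prompt(examples, column_label, categories):
    """Build a prompt with solved examples followed by the unsolved query."""
    lines = [
        "Classify the column label into one of these categories: "
        + ", ".join(categories) + ".",
        "",
    ]
    for label, category in examples:
        lines.append(f"Column label: {label}")
        lines.append(f"Category: {category}")
        lines.append("")
    # The final entry is left open so the model completes the category.
    lines.append(f"Column label: {column_label}")
    lines.append("Category:")
    return "\n".join(lines)
```

For example, `build_few_shot_prompt([("highest_degree", "Education level")], "school_grade", ["Education level", "Location"])` yields a prompt that ends with an open `Category:` line for the model to complete.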