LLMs for data classification: How Scribble built SADL for achieving breakthrough accuracy

Raj Krishnan Vijayaraj

June 8, 2023

Modern-day organizations are generating vast amounts of data that hold immense potential for making informed decisions. However, with the ever-growing volume of data, the greater challenge lies in how these organizations can generate actionable insights. Data classification plays a vital role in addressing this challenge.

Until now, organizations have relied on traditional methods for data classification, including natural language processing (NLP) and machine learning (ML), but these methods have shown limited success.

In this blog, we explore the journey of implementing SADL (Scribble Automated Data Labeller), a configurable and scalable system designed to classify data fields accurately. We also discuss the limitations of traditional classification methods, the move to large language models (LLMs), and the role of OpenAI’s GPT-3 in achieving breakthrough accuracy.

Preparation and methodology

The project utilizes a diverse dataset obtained from publicly available sources, specifically 1,300 randomly selected datasets provided by the Government of the United States. The implementation process begins with data cleaning and exploratory data analysis. Over 300 generic features are extracted from the dataset to train the model. These features include textual and numerical attributes, such as a bag of words, histograms, percentage of special characters, and maximum/minimum length. Random Forest models are initially employed for classification tasks, but limitations in accuracy and contextual information lead to the exploration of transformer models. We discuss the details of the two methods that were employed in the following sections:

Method 1

Our implementation involved training two separate Random Forest models: one for textual records and another for numerical records.

Implementation

The following are the features extracted from the textual records:

The bag of words feature was reduced to 300 features using TruncatedSVD
The frequency of occurrence of each item was extracted and scaled down to 20 samples
Percentage of special characters
Percentage of digits
Percentage of alphabets
Maximum length
Minimum length
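The character-level features in this list can be sketched in a few lines. This is a minimal illustration, not the SADL implementation: the function name `text_column_features` is hypothetical, and treating every character that is not a letter, digit, or whitespace as "special" is an assumption, since the post does not define the term. The bag of words and frequency features are left out.

```python
def text_column_features(values):
    """Compute simple character-level features for one textual column."""
    texts = [str(v) for v in values]
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        # Empty column: return zeros rather than dividing by zero.
        return {"pct_special": 0.0, "pct_digits": 0.0, "pct_alpha": 0.0,
                "max_length": 0, "min_length": 0}

    digits = sum(ch.isdigit() for t in texts for ch in t)
    alphas = sum(ch.isalpha() for t in texts for ch in t)
    # Assumption: "special" = not a letter, not a digit, not whitespace.
    specials = sum(not (ch.isalnum() or ch.isspace()) for t in texts for ch in t)

    return {
        "pct_special": 100 * specials / total_chars,
        "pct_digits": 100 * digits / total_chars,
        "pct_alpha": 100 * alphas / total_chars,
        "max_length": max(len(t) for t in texts),
        "min_length": min(len(t) for t in texts),
    }
```

Percentages here are computed over all characters in the column; computing them per record and averaging would be an equally plausible reading.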

The following are the features extracted from numerical records:

Maximum Value
Minimum Value
Mean Value
Count
Standard Deviation
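As a rough sketch, the five numerical features map directly onto Python's standard library. The function name `numeric_column_features` is hypothetical, and the use of the population standard deviation is an assumption, because the post does not say which variant was used.

```python
import statistics

def numeric_column_features(values):
    """Summary statistics for one numerical column."""
    nums = [float(v) for v in values]
    if not nums:
        raise ValueError("column has no values")
    return {
        "max": max(nums),
        "min": min(nums),
        "mean": statistics.fmean(nums),
        "count": len(nums),
        # Assumption: population (not sample) standard deviation.
        "std": statistics.pstdev(nums),
    }
```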

These features were used to train the two models, using a 30-70 test-train split.

Metrics

Accuracy – Exact Match: The predicted label is exactly the same as the given label

Accuracy – At Least 1 word: At least one word is common between the predicted label and the given label
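The two metrics can be expressed as follows. This is a sketch under stated assumptions: labels are compared case-insensitively and split into words on whitespace, neither of which is specified in the post, and the function names are illustrative.

```python
def _words(label):
    # Assumption: words are whitespace-separated and case-insensitive.
    return set(label.lower().split())

def exact_match_accuracy(predicted, actual):
    """Share of predictions identical to the given label (ignoring case)."""
    hits = sum(p.strip().lower() == a.strip().lower()
               for p, a in zip(predicted, actual))
    return hits / len(actual)

def at_least_one_word_accuracy(predicted, actual):
    """Share of predictions sharing at least one word with the given label."""
    hits = sum(bool(_words(p) & _words(a)) for p, a in zip(predicted, actual))
    return hits / len(actual)
```

For example, a prediction of "zip code" against a given label of "postal code" fails the exact-match test but counts under the at-least-one-word metric, which is why the second metric is always at least as high as the first.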

Results for Text Data

Trees	Depth	Accuracy – Exact Match	Accuracy – At Least 1 word
25	25	18.70%	36.21%
25	50	20.24%	47.47%
25	75	21.27%	49.05%

Results for Numerical Data

Trees	Depth	Accuracy – Exact Match	Accuracy – At Least 1 word
25	25	14.05%	35.80%
25	50	13.31%	34.90%
25	75	14.05%	35.31%

This approach produced modest results (at best 21.27% exact-match accuracy and 49.05% at-least-one-word accuracy, both on text data) and could have been improved by restricting predictions to a predetermined set of labels using business rules. Its main limitation was its inability to capture contextual information.

We realized that more complex models trained on much larger and more diverse data would be able to generalize better.

This is why we decided to use LLMs, which are based on transformer architecture.

Transformer model

The transformer model is a type of neural network architecture that was introduced in 2017 to address the limitations of earlier approaches to processing sequential data. It uses a self-attention mechanism [2] to process input data in parallel and has the ability to capture long-range dependencies and contextual relationships between words in a sentence.

These models are huge: training them requires a large amount of data and can cost millions of dollars. However, many pre-trained models have been made publicly available, either as open-source releases or through API services.

These large models are the state of the art in many NLP tasks like sentiment analysis or text classification.

The task given to the model during inference is completely determined by the prompt and parameters that you pass in through the API. The API will respond to your prompt and you can use a series of prompts to get your task done. The next section will share details on how we used such an API-based service to complete the task.

Method 2

We chose OpenAI’s GPT-3 as it is easy to use and is more flexible with the categories.

Implementation

Hyperparameters used for the experiment

Model: text-davinci-003
Temperature: 0.1
Max_tokens: 3500
Top_p: 1
Frequency_penalty: 0
Presence_penalty: 0
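To make the settings concrete, here is how they might be collected into a request payload. This is only a sketch: the key names follow the parameters of OpenAI's legacy Completions endpoint, the "Top" entry in the list is assumed to be the nucleus-sampling parameter `top_p`, and the helper name and prompt wording are hypothetical. No API call is made.

```python
def build_completion_request(prompt):
    """Bundle the experiment's hyperparameters with a prompt."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "temperature": 0.1,      # low temperature for near-deterministic labels
        "max_tokens": 3500,
        "top_p": 1,              # assumption: the "Top" setting in the list above
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }

# Hypothetical prompt; the post does not publish the prompts that were used.
request = build_completion_request(
    "Give the category of the column label 'Birth Rate'."
)
```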

We followed a three-level hierarchy for classification. The top level is the domain, which describes the domain of the dataset as a whole (not of the column label). The following are the domains that we finalized after multiple iterations:

Technology
Government
Manufacturing
Finance
Healthcare
Retail
Business

The next level is the category, which describes the category of the column label. The following categories were selected after multiple iterations:

Date/Time
Education level
Events
Healthcare
Location
Organizations
Other
People
Products
Services

The third level is the subcategory, which describes the subcategory of the column label. The subcategories were chosen to fit the diversity within each category, so the list of subcategories varies from category to category.
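One way to represent this hierarchy, and to map a free-text model answer back onto it, is sketched below. The domain and category lists come from this post; the subcategory entries are only the three that appear in the example table later on, not the full lists. The helper `normalize_label` and the choice to fall back to "Other" for unrecognized categories are assumptions.

```python
DOMAINS = ["Technology", "Government", "Manufacturing", "Finance",
           "Healthcare", "Retail", "Business"]

CATEGORIES = ["Date/Time", "Education level", "Events", "Healthcare",
              "Location", "Organizations", "Other", "People",
              "Products", "Services"]

# Partial and illustrative: only subcategories shown in the post's example table.
SUBCATEGORIES = {
    "Date/Time": ["Year"],
    "Healthcare": ["Measurement"],
    "Location": ["Region"],
}

def normalize_label(raw, allowed, fallback=None):
    """Map a raw model answer onto an allowed label, ignoring case,
    surrounding whitespace, and a trailing period."""
    cleaned = raw.strip().strip(".").lower()
    for label in allowed:
        if label.lower() == cleaned:
            return label
    return fallback
```

Constraining answers to a fixed list in this way keeps the output of a generative model comparable across columns, at the cost of discarding answers that fall outside the list.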

Results

The experiments and trials showed an overall acceptability rate of over 90% for both domain and category-related tasks. This was manually verified for a randomly selected subset of the dataset.

The following table summarizes the results for the domain:

Domain	Acceptability
Technology	80.95%
Government	96.00%
Manufacturing	87.10%
Finance	100%
Healthcare	96.04%
Retail	94.85%
Business	74.60%
Total	91.08%

The following table summarizes the results for the categories:

Category	Acceptability
Date/Time	95.56%
Events	97.50%
Healthcare	91.38%
Location	94.25%
Organizations	94.89%
Other	91.43%
People	92.92%
Products	81.33%
Services	78.95%
Education level	100.00%
Total	91.22%
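The acceptability figures above can be read as the share of manually reviewed predictions that a reviewer accepted. The sketch below shows that calculation on made-up review records; the function name is hypothetical, and computing the total over all pooled reviews (rather than averaging the per-class rates) is an assumption, though it would be consistent with the Total rows not being the plain average of the rows above them.

```python
from collections import Counter

def acceptability_by_class(reviews):
    """reviews: iterable of (class_label, accepted) pairs from manual review.
    Returns the acceptance rate per class, in percent, plus a pooled total."""
    reviewed = Counter()
    accepted = Counter()
    for label, ok in reviews:
        reviewed[label] += 1
        if ok:
            accepted[label] += 1

    rates = {label: 100 * accepted[label] / reviewed[label] for label in reviewed}
    total = sum(reviewed.values())
    # Assumption: the total is pooled over all reviews, so large classes weigh more.
    rates["Total"] = 100 * sum(accepted.values()) / total if total else 0.0
    return rates

# Made-up records for illustration only; these are not the SADL review data.
example = [("Finance", True), ("Finance", True),
           ("Business", True), ("Business", False)]
rates = acceptability_by_class(example)
```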

The following table gives an example of how the GPT-3 model can be used to determine the category, sub-category, and description for a column label.

Label	Category	Sub Category	Description
MMWR Year	Date/Time	Year	Year of the Morbidity and Mortality Weekly Report
Birth Rate	Healthcare	Measurement	Number of live births per 1000 people
Jurisdiction	Location	Region	The authority of a legal body to exercise its power over a certain area.

Comparison of Approaches:

Language models like GPT-3 excel in text completion tasks and exhibit high accuracy due to their extensive training data. However, their performance may decrease on narrow, domain-specific queries. While we could fine-tune LLMs with domain-specific data, this could be compute-intensive. Few-shot learning, where the prompt contains a few examples of the completed task, is another viable approach: it lets the model pick up new concepts without any additional training.
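A few-shot prompt for the category task could be assembled as in the sketch below. The example pairs are taken from the table earlier in this post; the prompt wording, the `Label:` / `Category:` layout, and the function name are assumptions, not the prompts used in SADL.

```python
# Label/category pairs taken from the example table in this post.
FEW_SHOT_EXAMPLES = [
    ("MMWR Year", "Date/Time"),
    ("Birth Rate", "Healthcare"),
    ("Jurisdiction", "Location"),
]

def build_few_shot_prompt(column_label, categories, examples=FEW_SHOT_EXAMPLES):
    """Build a prompt that shows solved examples, then asks for one new label."""
    lines = [
        "Classify each column label into one of these categories: "
        + ", ".join(categories) + ".",
        "",
    ]
    for label, category in examples:
        lines.append(f"Label: {label}")
        lines.append(f"Category: {category}")
        lines.append("")
    # The prompt ends with an open "Category:" line for the model to complete.
    lines.append(f"Label: {column_label}")
    lines.append("Category:")
    return "\n".join(lines)
```

Because the examples live in the prompt rather than in the model weights, changing the category list only requires editing the prompt, not retraining.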

Conclusion:

In conclusion, our SADL project has successfully demonstrated the power of language model-based learning in data classification. OpenAI’s GPT-3 has showcased exceptional accuracy and the ability to generate descriptions for previously unknown data. Fine-tuning models and exploring LLMs available for download are promising avenues for future improvements. As the volume of available data continues to grow, innovative and efficient techniques like LLMs are essential for extracting meaningful insights. SADL represents one such example of leveraging advanced technologies to unlock the untapped potential of data.

References:

[Dataset Repository] – United States Government. [Online]. Available here
Chen, Z., et al. (2018). Generating Schema Labels for Data-centric Tasks. [Online]. Available here
Vaswani, A., et al. (2017). Attention Is All You Need. [Online]. Available here
