A Primer on Feature Engineering

Feature engineering is the process of selecting, interpreting, and transforming structured or unstructured raw data into attributes (features) that represent the problem at hand accurately enough to build effective machine learning models. In this context, a “feature” is any measurable input that may be used in a predictive model, such as the sound of someone’s voice or the texture of an object.

Feature engineering is a critical component of machine learning models because the accuracy of the ML model directly depends upon the features you select for your data. High-quality features allow you to build models that are faster, more flexible, and easier to maintain.

The feature engineering process broadly involves two main phases:

  1. Feature Generation: This is the process of automatically creating new variables from a data set by extracting meaningful descriptors (statistical, linguistic, numerical, etc.) without losing any significant information present in the original raw data. The aim of this phase is to reduce the raw input data to a manageable quantity that an algorithm can process in a reasonable time.

  2. Feature Selection: This is the process of choosing a subset of the original set of features to reduce the complexity of predictive models. It increases the computational efficiency of the model and makes it less vulnerable to generalization errors or noise caused by irrelevant features.
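The two phases above can be sketched in a few lines of Python. This is a minimal illustration rather than a production approach: the customer data, the feature names, and the variance threshold are all hypothetical, and dropping near-constant columns is only one of many possible selection heuristics.

```python
from statistics import mean, pvariance

# Hypothetical raw data: purchase amounts per customer (illustrative values only).
raw_purchases = {
    "cust_a": [20.0, 35.0, 50.0],
    "cust_b": [5.0, 5.0, 5.0, 5.0],
    "cust_c": [100.0, 80.0],
}


def generate_features(amounts):
    """Feature generation: summarise a raw list as a few numeric descriptors."""
    return {
        "n_purchases": float(len(amounts)),
        "mean_amount": mean(amounts),
        "max_amount": max(amounts),
        "currency_is_usd": 1.0,  # constant in this toy data, so it carries no signal
    }


def select_features(rows, min_variance=1e-9):
    """Feature selection: keep columns whose variance across rows exceeds a threshold."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


feature_rows = [generate_features(a) for a in raw_purchases.values()]
selected = select_features(feature_rows)
print(selected)
```

Here the constant column is discarded because it cannot help a model tell one customer from another, while the three summary statistics are kept.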

Optionally, a Feature Transformation phase may be included wherein features are modified or represented in such a way as to optimize machine learning algorithms. The goal here is to modify, plot or visualize data in a way that makes more sense, speeds up the algorithm or makes a model more accurate.
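As a rough sketch of what such a transformation can look like, the snippet below applies two common techniques to a skewed numeric feature: a log transform followed by min-max scaling. The `prices` values are made up for illustration, and whether either step helps depends on the algorithm and the data.

```python
import math


def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale a feature to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [0.0, 9.0, 99.0, 999.0]  # hypothetical skewed feature
scaled = min_max_scale(log_transform(prices))
print(scaled)
```

After the log step the values are evenly spaced, so scaling places them at roughly equal intervals between 0 and 1 instead of crowding the first three near zero.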

 

Why is feature engineering so complex?

Feature engineering requires intimate knowledge of machine learning algorithms and a high level of technical expertise. When you consider that building an effective Artificial Intelligence (AI) model requires the use of multiple algorithms, each with its own feature engineering techniques applied to input data, the undertaking becomes extremely complicated.

It requires skill across multiple disciplines – databases, programming, statistics, machine learning, and testing. Building new features, testing them, and analysing their impact is a tedious iterative process which often reveals that the newly added features are actually making the model worse.

Another crucial requirement is domain expertise – a contextual understanding of how the data reflects real-world industry conditions. It takes years to build up the levels of domain knowledge and technical skill that feature engineering demands.

 

What is ETL, and how is it different from Feature Engineering?

Raw data is rarely usable as is. It has very little context or structure and is often filled with errors or missing information.

To make data usable for feature engineering, it must be cleaned and structured. The process of performing basic data cleaning and data wrangling on large volumes of input data is called Data Preprocessing or ETL (Extract, Transform, Load).

Let us take a closer look at the ETL Process.

 

1. Extract

Most businesses use multiple sources and formats of data – spreadsheets, databases, data lakes, object stores etc. Input data can also consist of unstructured information such as images, audio files, documents, or emails. Before this data can be analysed, it first needs to be moved to a central location.

The Extract phase thus involves locating, copying, and moving raw data to a central data store where it can be structured.

2. Transform

Once the data arrives at a central datastore, it needs to be processed. Since the input data is in several different formats and comes from different systems, it needs to be modified to maintain its integrity and make it possible to run queries on the data.

Based upon a series of predefined rules, the Transform process is responsible for cleaning, verifying, standardizing, mapping and then sorting the data to make it usable for the next stage.
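To make the Transform step concrete, here is a small sketch that cleans, verifies, standardizes, maps, and sorts a handful of records. The field names, the country lookup table, and the rule of dropping rows with unparseable amounts are assumptions made for the example, not a description of any particular ETL tool.

```python
# Hypothetical rows as they might arrive from different source systems.
raw_rows = [
    {"id": "2", "country": " in ", "amount": "1,200.50"},
    {"id": "1", "country": "US", "amount": "300"},
    {"id": "3", "country": "us", "amount": "n/a"},  # fails verification below
]

# Assumed mapping rule: country code to display name.
COUNTRY_NAMES = {"US": "United States", "IN": "India"}


def transform(rows):
    cleaned = []
    for row in rows:
        # Clean and standardize: trim whitespace, use upper-case codes.
        code = row["country"].strip().upper()
        # Verify: the amount must parse as a number, otherwise skip the row.
        try:
            amount = float(row["amount"].replace(",", ""))
        except ValueError:
            continue
        cleaned.append(
            {
                "id": int(row["id"]),
                "country": COUNTRY_NAMES.get(code, "Unknown"),  # map
                "amount": amount,
            }
        )
    # Sort so that downstream steps see a stable order.
    return sorted(cleaned, key=lambda r: r["id"])


print(transform(raw_rows))
```

In a real pipeline, rejected rows would usually be logged or routed to a review queue rather than silently dropped.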

 

3. Load

To load the newly transformed data into the datastore, you can choose one of the following two methods.

  • Full loading: All of the data points collected during the extract and transform phases are copied over as new entries in the datastore. This process is simpler, but the datastore grows with every load, including duplicates, and may become difficult to manage over time.

  • Incremental loading: Here, incoming data is compared to the existing records and only unique information is stored. This may be a less comprehensive process, but it is more resource friendly and requires less maintenance.
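The difference between the two loading methods can be illustrated with an in-memory list standing in for the datastore. This is a simplified sketch: it assumes each record carries a unique `id` key, which is a hypothetical choice for the example, and real datastores typically rely on primary keys, timestamps, or change-data-capture instead.

```python
def full_load(store, batch):
    """Full loading: append every incoming record as a new entry."""
    store.extend(batch)


def incremental_load(store, batch, key="id"):
    """Incremental loading: append only records whose key is not yet stored."""
    seen = {rec[key] for rec in store}
    added = 0
    for rec in batch:
        if rec[key] not in seen:
            store.append(rec)
            seen.add(rec[key])
            added += 1
    return added


batch_1 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
batch_2 = [{"id": 2, "amount": 20.0}, {"id": 3, "amount": 30.0}]

full_store, incremental_store = [], []
for batch in (batch_1, batch_2):
    full_load(full_store, batch)
    incremental_load(incremental_store, batch)

print(len(full_store), len(incremental_store))
```

With the overlapping record in the second batch, the full load stores it twice while the incremental load keeps a single copy.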

 

How Business Leaders can Benefit from ETL

In an increasingly data-driven world, businesses that are able to effectively manage multiple data sources can unlock powerful opportunities for improving profitability, collaboration, and communication.

As businesses grow, the usage of sprawling systems like ERP and CRM increases. In turn, this greatly increases the volume of incoming data. Without an ETL process to interpret and consolidate incoming data to a central location, there is a risk of missing out on crucial business insights.

ETL services can empower business leaders in multiple ways, such as:

  1. More strategic decision making

  2. Insight into the historical context of data

  3. Consistent data

  4. Faster decision making

How Feature Engineering Builds On The ETL Process

After raw input data has been processed by data engineers in the ETL phase to remove duplicates and fix missing values and formatting problems, it moves on to the Feature Engineering phase. In this phase, preprocessed data is transformed into features that accurately reflect the problem the ML model is trying to solve.

For example, let’s say you are building a model to predict the price of a car. At the end of the ETL phase, you may have a long list of input variables such as number of valves, horsepower, torque, traction, drivetrain, fuel economy, trunk capacity etc.

Subsequently, the Feature Engineering phase might involve asking questions such as

  • “Do we need horsepower and torque as separate variables, or do they essentially provide the same information making one of them redundant?”

  • “Should there be a separate variable which takes into account the fuel injection system of the vehicle?”
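One simple way to start answering the first question is to check how strongly the two variables are correlated. The sketch below computes the Pearson correlation for each pair of columns and flags pairs above a threshold. The car figures and the 0.95 cutoff are invented for illustration, and a high correlation is only a hint of redundancy, not proof that one variable can be dropped.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(pearson(a, b)) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical values for five cars (not real vehicle data).
cars = {
    "horsepower": [110, 150, 200, 300, 420],
    "torque": [140, 190, 260, 380, 520],
    "trunk_capacity": [450, 380, 510, 400, 470],
}

print(redundant_pairs(cars))
```

In this toy data horsepower and torque move almost in lockstep, so the pair is flagged as a candidate for merging or dropping, while trunk capacity is left alone.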

In practice, Feature Engineering is an iterative process where data scientists ask questions, build models and test them repeatedly to check whether the features they have selected allow the model to achieve the desired results.

This advanced process is where the model really begins to take shape, as each iteration of adding, removing, changing and running experiments with the selected features inches the data scientists closer to an optimal feature set. By the end of this process, additional important features have been derived from the existing feature set, and these are then used to build better models.

 

Feature Engineering Case Studies

To further illustrate the potential of feature engineering, we will talk about two real-world examples where our own feature engineering platform Enrich has been used to enable expansive business solutions.

TerraPay

TerraPay is a cross-border B2B payment infrastructure solution provider with a vast geographic reach and strategic partnerships across the globe. To leverage their data across the entire expanse of operations, TerraPay and Scribble Data developed the TerraPay Intelligence Platform (TIP).

How does Enrich fit into the TIP network?

Within the TIP framework, Enrich is responsible for building and managing data transformation pipelines, dataset tracking and monitoring, and custom data workflow apps for decision-making.

Use Cases powered by Enrich

1. Forecasting

A crucial part of TIP was a forecasting solution based on the Enrich feature store. It helped TerraPay manage daily liquidity across the different destination countries in the payment architecture.

Some of the challenges the Enrich-based forecasting application helped to solve are

  • Accurately estimating the total liquidity required daily

  • Identifying region-specific and partner-specific consumer behaviour patterns

  • Empowering decision makers with efficient visualization

The solution allowed for diverse kinds of statistical modelling and ML approaches for forecasting.

2. Identity resolution

Enrich enabled the effective segmentation of end users based on behaviour and preferences for TerraPay. This prompted the introduction of several new products and offerings, and further improvements to the existing AML rules.

 

A National Retail Chain

This client is a nationwide retail chain in India. When they contacted Scribble Data, they were in the process of expanding their small-format stores by a factor of 25, to around 10,000 stores. They were also in the process of converting the stores to a membership-based model a la Costco.

To scale effectively, they needed a platform that could help them accurately understand customer personas and predict distribution and demand to help them plan their resource allocation. They needed a platform that would easily be able to pick relevant data streams and then accurately segment customers, codify, and model their personas and help them evolve over time with modern ML techniques.

The client was considering multiple solution alternatives, but they all had some drawbacks such as cost, complexity, over-engineering, obsolescence, and roll-out timeframes. They needed a solution that was at once non-intrusive, granular enough to account for an individual customer and cost-effective to scale. They chose Scribble Enrich.

 

How does Enrich fit into the retail chain’s network?

Scribble Enrich was deployed behind the firewall in this client’s private network. Architecturally, it sits above the client’s data lake.

Enrich was customized with pipeline components, tasks and background services based on the challenges faced by the client. It integrated with the client’s S3-based data lake and processed data sets which were made available to end users through the data lake and existing BI.

 

Use Cases powered by Enrich

Enrich is powering multiple opportunities for this retail client.

1. Improved real-time inventory management and tracking

The Enrich-powered solution matches sales data with EOD inventory figures and points out the gaps in real time to avoid potential out-of-stock issues.

2. More effective supplier negotiations

Enrich can track product feedback, discounts, wastages, and sales at the SKU level, which makes it easier to project ROI and predict the quality and timeliness of different suppliers.

 

3. Identifying shopping trends and cross-sell opportunities

Enrich monitors the SKU uptake and makes cross-sell recommendations based on enriched data.

 

4. Personalized recommendations

Enrich enables the retail chain to create a dynamic, rich profile for each customer that constantly evolves based on their behaviour and purchase history. It enables the chain to create several internal tiers of rewards for different customers.

 

This implementation was a resounding success, achieving positive ROI within three months. Every month, the number of use cases that the solution is helping this client solve just keeps growing.

In conclusion, a well-thought-out and strategic feature engineering process is the key differentiator between a good machine learning model and a bad one. While the process itself may be complex, iterative, and demanding, when done right it can dramatically improve the quality of your business decision making.

If you are considering a feature engineering solution for your business, take a moment to learn more about Enrich, our modular feature store that can easily fit into your existing data stack and fill in the missing pieces of the puzzle.

For more articles about MLOps, feature engineering and the science of data, make sure you bookmark the Scribble Data blog and revisit often!
