Preprocessing machine learning datasets

☕️ 6 minute read

Data is the key ingredient of any analytical exercise. Its quality has an immediate impact on the results we obtain when conducting research, whatever it is we are looking for.

When conducting data science, we start with a seemingly small first step in building our analytical models: preprocessing the gathered data and refining the attribute values in order to obtain better results. Worth mentioning is the garbage in, garbage out principle, which essentially states that messy data will yield messy analytical models. It is of the utmost importance that every data preprocessing step is carefully justified, carried out, validated and documented before proceeding with further analysis. In the business-driven rush to obtain results as quickly as possible, preprocessing is a step that is all too easily disregarded. Its added value, however, cannot be overstated. We will see (in an upcoming article) that, depending on the model used, it is even a mathematical crime not to preprocess.

But why exactly do we want to conduct this dreadful and often time-consuming task, you might ask. In short: “It allows models to generalize better, which eventually leads to better predictions.”

Data Aggregation

The application of analytics typically requires or presumes the data to be presented in a single table. Several normalized source tables have to be joined/merged in order to construct this aggregated, denormalized table. An individual entity can usually be identified and selected across the different tables by means of (primary) keys. Following the golden rule of data science, “the more data, the better”, it may be worthwhile to be on the lookout for additional datasets that can enrich your findings. For retailers, such information can often be bought from independent data farms.
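A minimal pandas sketch of this step, assuming two hypothetical normalized tables (customers and transactions) that share a customer_id key:

```python
import pandas as pd

# Hypothetical normalized source tables sharing a (primary) key: customer_id
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["BE", "NL", "FR"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "amount": [20.0, 35.5, 12.0, 50.0, 7.5],
})

# Join/merge into a single denormalized table (cf. a SQL JOIN)
denormalized = customers.merge(transactions, on="customer_id", how="left")
print(denormalized)
```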

Closely related to data aggregation is data grouping. One may think of aggregation as being similar to the JOIN operation in SQL, and of grouping as GROUP BY. Working on both the Data Engineering and the Data Science side, I would consider data aggregation to be part of the former field and data grouping of the latter. Data aggregation often entails importing data from several (sometimes entirely different) sources. Before we can conduct any useful research, data engineers therefore need to transform it into a uniform format. This could be a simple DataFrame in Python, or e.g. a more advanced Apache Parquet table. To be more efficient, I always look for tools at hand that solve such problems in a human-friendly way. Two example frameworks that are commonly used are:

Apache Spark is a must-have for every self-respecting data scientist operating in a larger business. Spark makes the aggregation step bearable without the need for much ‘data engineering’, as it comes equipped with multiple database drivers and integrations that facilitate the latter. A minimal PySpark sketch follows below these descriptions.

Dremio is one of the most human-friendly GUIs for solving the data aggregation problem. It is part of the Open Source movement and comes with instantly deployable production templates for Azure Kubernetes Service.
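As a rough illustration of the Spark route mentioned above, the sketch below reads two (hypothetical) sources, joins them and groups per customer; the file paths and column names are assumptions for illustration only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregation-example").getOrCreate()

# Illustrative sources: a Parquet table and a CSV export (paths are assumptions)
customers = spark.read.parquet("data/customers.parquet")
transactions = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)

# Denormalize by joining on the shared key, then group per customer (cf. GROUP BY)
aggregated = (
    customers.join(transactions, on="customer_id", how="left")
    .groupBy("customer_id", "region")
    .agg(
        F.sum("amount").alias("total_spent"),
        F.count("amount").alias("n_transactions"),
    )
)
aggregated.show()
```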

Sampling

Less relevant today, given the sheer processing power at our disposal, but still an honorable mention: data sampling. The aim is to take a subset of historical data on which to build a model. Key requirements for a good sample are relevance, representativeness and recency. Choosing the optimal time window for the sample involves a trade-off between having lots of data and having recent data. A classic example would be scraping historical stock data from the internet in order to forecast market prices. When scraping Twitter and other news sites for stock feeds, sampling can prove useful when operational efficiency is a concern.
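A minimal sketch of window-based sampling in pandas, using a hypothetical stock_df table with a date column; the cut-off date and sampling fraction are illustrative assumptions:

```python
import pandas as pd

# Hypothetical historical stock table with a timestamp column
stock_df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=2000, freq="D"),
    "close": range(2000),
})

# Trade-off: a wider window gives more data, a narrower one gives more recent data
window = stock_df[stock_df["date"] >= "2019-01-01"]

# Draw a representative random sample from the window for model building
sample = window.sample(frac=0.2, random_state=42)
print(len(window), len(sample))
```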

Exploratory Analysis

Data exploration means gaining initial insights through data visualization and summarization (using descriptive statistics). There are a lot of frameworks out there to help you gather insights into your data. Here is a shortlist of the ones I have worked with:

  • For use in a business context
  • For rapid prototyping

Another, more barebones option is simply using a pandas profiling report in Python. This is included in my simple starter toolkit:
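A sketch of that starter, assuming the pandas_profiling package (published as ydata-profiling in newer releases) and an illustrative titanic.csv file:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # newer releases ship as ydata-profiling

# Illustrative dataset path; any tabular dataset works
df = pd.read_csv("titanic.csv")

# Generate an HTML report with distributions, correlations and missing-value overviews
profile = ProfileReport(df, title="Titanic Profiling Report", explorative=True)
profile.to_file("titanic_report.html")
```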

When conducting exploratory analysis, researchers tend to deduce certain patterns from their datasets. Distributions can easily be spotted and accounted for when plotting the different attributes. Feel free to take a look at an example report (profiling the Titanic dataset).

Missing Values & Outliers

We should avoid discarding data rows with missing values, as the values of the other attributes are still valuable. Replacing missing values through imputation techniques such as the mean/median (numeric) or the most common value (categorical) is generally preferred. When dealing with outliers, we can use statistics to determine whether they are significant and valid. If an outlier is valid, prefer a capping technique; otherwise, treat it as a missing value.
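A minimal sketch of both ideas, using scikit-learn’s SimpleImputer for imputation and simple IQR fences for capping; the table and its values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with missing numeric and categorical values plus an extreme value
df = pd.DataFrame({
    "age": [22, 38, np.nan, 26, 350],          # 350 is an extreme (but valid) value
    "embarked": ["S", np.nan, "C", "S", "Q"],
})

# Impute numeric columns with the mean, categorical columns with the most frequent value
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["embarked"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["embarked"]]).ravel()

# Cap valid but extreme values at the IQR fences instead of discarding the row
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(df)
```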

Feature Engineering

This task deserves an article of its own. An example often used in practice is PCA. Principal Component Analysis (or PCA for friends) is a popular technique for reducing dimensionality, studying linear relationships and visualising complex datasets. It has its roots in linear algebra and is based on the concept of constructing an orthogonal basis for a dataset.

Four properties need to hold:

  1. Each principal component should capture as much variance as possible.
  2. The captured variance should decrease with each subsequent component.
  3. The transformation should respect the distances between observations and the angles they form.
  4. The coordinates should not be correlated with each other.

PCA helps in getting a quick overview of the data and the most important variables in the dataset.

Application areas:

  • Dimensionality Reduction
  • Input Selection: calculating factor loadings
  • Visualisation
  • Text mining and information retrieval

The most efficient way to compute the principal components is via a Singular Value Decomposition (SVD).
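As a rough sketch of this, the snippet below compares scikit-learn’s PCA (which relies on an SVD internally) with an explicit NumPy SVD on mean-centred data; the synthetic matrix is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # synthetic dataset: 100 observations, 5 attributes

# PCA via scikit-learn (computed with an SVD under the hood)
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Equivalent computation via an explicit SVD on the mean-centred data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt[:2].T

# Explained variance per component should decrease, as required above
print(pca.explained_variance_ratio_)
print(np.allclose(np.abs(scores), np.abs(scores_svd)))  # same components up to sign
```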

Disclaimer: PCA only captures linear relationships; for non-linear relationships, Kernel PCA methods can be used.

Reference material

This book was my course material for Business Intelligence and Analytics, lectured by prof. Wouter Verbeke at KU Leuven. It covers a lot of the fundamental techniques I use when conducting data science in a business context.