What Is Data Preprocessing in ML?

Data preparation plays an important role in any machine learning workflow. You need to transform raw data into a form that a computer can work with.

Steps in data preprocessing

Any database is a collection of data objects, which you can also call data samples, events, observations, or records. Each of them is described by a number of characteristics, known in data science lingo as attributes or features.

Data preprocessing is a necessary step before building a model with these features.

Data preprocessing stages

It usually happens in stages. Let us have a closer look at each of them.

Data quality assessment

First of all, you need to take a good look at your database and perform a data quality assessment. Raw data collected from various sources often contains irrelevant or inconsistent bits. Here are some examples.

Mismatching in data types

Quite often, you end up mixing together datasets that use different data formats. Hence the mismatches: integer vs. float, or UTF-8 vs. ASCII.
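
A minimal sketch of how you might spot such type mismatches with pandas before merging; the two DataFrames and their columns are made up for illustration:

```python
import pandas as pd

# Two hypothetical sources describing the same records with mismatched types.
df_a = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})
df_b = pd.DataFrame({"user_id": ["4", "5", "6"], "score": [1, 0, 1]})

# Compare column dtypes side by side to spot mismatches before merging.
report = pd.DataFrame({"source_a": df_a.dtypes, "source_b": df_b.dtypes})
print(report[report["source_a"] != report["source_b"]])

# Cast to a common type once the intended one is known.
df_b["user_id"] = df_b["user_id"].astype("int64")
df_b["score"] = df_b["score"].astype("float64")
```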

Different dimensions of data arrays

When you aggregate data from different datasets, for example, from five arrays of voice recognition data, fields that are present in one of them may be missing from the other four.
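
A minimal sketch, with made-up field names, of how you might detect which fields are missing from which array before aggregating:

```python
import pandas as pd

# Hypothetical stand-ins for five voice recognition arrays with uneven schemas.
frames = [pd.DataFrame(columns=["speaker", "duration", "pitch"])] + \
         [pd.DataFrame(columns=["speaker", "duration"]) for _ in range(4)]

# Fields that are present in only some of the arrays.
all_cols = set().union(*(f.columns for f in frames))
common_cols = set.intersection(*(set(f.columns) for f in frames))
print("present only in some arrays:", all_cols - common_cols)

# One option: keep only the shared fields before concatenating.
merged = pd.concat([f[sorted(common_cols)] for f in frames], ignore_index=True)
```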

Mixture of data values

Let’s imagine that you have data collected from two independent sources. As a result, the gender field has two different values for women: woman and female.

To clean this dataset, you have to make sure that the same descriptor is used throughout the dataset (female, in our case).
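
One way to unify such descriptors, sketched with pandas on a hypothetical gender column:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["female", "woman", "male", "man", "female"]})

# Map every spelling that denotes the same category onto one canonical descriptor.
canonical = {"woman": "female", "man": "male"}
df["gender"] = df["gender"].replace(canonical)

print(df["gender"].unique())  # ['female' 'male']
```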

Outliers in the dataset

For example, within 200 years of daily temperature observations for New York, there were several summer days with abnormally low temperatures.

Outliers are dangerous because they can strongly influence the output of a machine learning model. Usually, researchers evaluate the outliers to identify whether each particular record is the result of an error in data collection or a genuine phenomenon that should be taken into account during processing.
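
A minimal sketch of one common way to flag such records, the interquartile range rule, on made-up temperature values (the 1.5 × IQR threshold is a convention, not a law):

```python
import pandas as pd

# Hypothetical daily summer temperatures in °C, with two suspiciously cold days.
temps = pd.Series([24.1, 25.3, 23.8, 26.0, 2.5, 24.9, 25.7, 3.1, 24.4])

# Flag values that fall far outside the middle 50% of the data.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
mask = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)
print(temps[mask])  # the abnormally cold days
```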

Missing data

You may also notice that some important values are missing. Such gaps arise from human error, program bugs, or other causes. They affect the accuracy of predictions, so before going any further with your database, you need to do data cleaning.
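
A quick way to see how much data is missing, sketched with pandas on a small made-up table:

```python
import numpy as np
import pandas as pd

# Small hypothetical table with gaps in it.
df = pd.DataFrame({
    "name": ["Nik", "Jane", None, "Helen"],
    "age": [31, np.nan, 27, np.nan],
})

# Count missing values per column and their share of all rows.
print(df.isna().sum())
print(df.isna().mean().round(2))
```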

Why do we need to preprocess data?

By preprocessing data, we:

  • Make our database more accurate. We eliminate incorrect or missing values that appear there as a result of human error or bugs.
  • Boost consistency. When there are inconsistencies in data or duplicates, it affects the accuracy of the results.
  • Make the database more complete. We can fill in the attributes that are missing if needed.
  • Smooth the data. This way we make it easier to use and interpret.

Data cleaning

The goal of data cleaning is to provide simple, complete, and clear sets of examples for machine learning.

Missing data

Missing data in a dataset is quite a common situation. In this case, you can look for additional datasets or collect more observations.

When you concatenate two or more datasets into one database to get a bigger training set, some data field mismatches are quite common.

When not all the fields are represented in the arrays being joined, it is better to delete such fields before merging.

What to do: if more than 50% of the values are missing for any row or column, you have to delete the whole row or column unless it is possible to fill in the missing values.

Imagine you build a database of Haskell lovers. The values in the gender column are missing for several records: Nik, Jane, Julia, and Helen. In this case, the researcher can fill in the missing data based on their own conclusions. However, this method has flaws and introduces the risk of an inaccurate model.
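
A minimal pandas sketch of the 50% rule and a simple fallback imputation; the table and the median strategy are assumptions for illustration, not a recommendation for every dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Nik", "Jane", "Julia", "Helen", "Tom", "Anna"],
    "gender": [None, None, None, None, "male", "female"],
    "age":    [31, np.nan, 27, np.nan, 40, 35],
})

# Drop columns where more than 50% of the values are missing (here: gender).
df = df.loc[:, df.isna().mean() <= 0.5]

# Drop rows where more than 50% of the values are missing.
df = df[df.isna().mean(axis=1) <= 0.5].copy()

# Fill the remaining numeric gaps with the column median as a simple fallback.
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```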

Noisy data

A large amount of additional meaningless data is called noise.

This can be:

  • duplicates or semi-duplicates of the data records;
  • data segments that have no value for a particular research task;
  • unnecessary information fields for each of the variables.

For example, you only need to know whether a person speaks English or not, but you get a whole set of features, including the color of their eyes, shoe size, pulse, blood pressure, etc.

You can apply one of the following methods to solve this problem:

  • Binning. Use binning if you have a pool of sorted data. Divide all the data into smaller segments of the same size and apply your preparation methods separately to each segment. For example, you can bin the values for Age into categories such as 21-35, 36-59, and 60-79 (see the sketch after this list).
  • Regression. Regression analysis helps to decide which variables actually have an impact. Apply regression analysis to smooth large volumes of data. This will allow you to work only with the key features instead of trying to analyze an overwhelming number of variables. In our post about regression, you can learn more about how to conduct a regression analysis step by step.
  • Clustering. Finally, you can apply clustering algorithms to group the data. Here you need to be careful with the outliers.
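
Here is the binning sketch mentioned in the list above: a minimal pandas example that splits Age into the listed ranges and then smooths each bin by its mean (the values are made up):

```python
import pandas as pd

ages = pd.Series([22, 29, 34, 41, 47, 58, 61, 66, 72, 79])

# Split the values into the ranges named above.
bins = [20, 35, 59, 79]
labels = ["21-35", "36-59", "60-79"]
binned = pd.cut(ages, bins=bins, labels=labels)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = ages.groupby(binned, observed=True).transform("mean")
print(pd.DataFrame({"age": ages, "bin": binned, "smoothed": smoothed}))
```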

Outliers are singular data points that are dissimilar to the rest of the dataset.

It’s important not to discard outliers by mistaking them for noise. For example, imagine we are building an algorithm that sorts different varieties of apples. We can encounter two types of outliers in our dataset:

  • Images that contain exotic fruits like pineapples and kiwis. They can end up in your data due to a sampling mistake and represent noise in your dataset.
  • Photos of “weird apples”, for example, ones with a strange shape. When our goal is to teach the machine to recognize apple varieties, deviation from the main groups is important. Such outliers help the ML model learn to recognize distinctive characteristics and increase the accuracy of its predictions.

When we are not dealing with obvious cases like apples and pineapples, it can be quite hard to decide whether an item is important or just noise. Here, the expertise of the data scientist strongly influences the success of ML modeling.

Data transformation

In fact, by cleaning and smoothing the data, we have already modified it. However, data transformation refers specifically to the methods of turning data into a format that the computer can learn from.

Example: for research about smog around the globe, you have data about wind speeds. However, the data got mixed up, and the figures come in three different units: meters per second, miles per hour, and kilometers per hour. We need to bring these values to the same scale for ML modeling.
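
A minimal sketch of bringing such mixed wind speed figures to one unit (meters per second); the records and the unit labels are made up:

```python
import pandas as pd

# Hypothetical wind speed records with mixed units.
df = pd.DataFrame({
    "speed": [5.0, 18.0, 11.2],
    "unit":  ["m/s", "km/h", "mph"],
})

# Conversion factors to meters per second.
to_mps = {"m/s": 1.0, "km/h": 1000 / 3600, "mph": 1609.344 / 3600}
df["speed_mps"] = df["speed"] * df["unit"].map(to_mps)
print(df)
```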

Here are the techniques for data transformation or data scaling:

Aggregation

In the case of data aggregation, the data is pooled together and presented in a unified format for data analysis.

Working with a large amount of high-quality data allows for getting more reliable results from the ML model.

If we want to build a neural network algorithm that imitates the style of Vincent Van Gogh, we need to supply as many paintings by this famous artist as we can to have enough material for training. The images need to have the same digital format, and we will use data transformation techniques to achieve that.
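
For tabular data, aggregation can be as simple as pooling records from several sources into one table with a common schema and summarizing them; a minimal pandas sketch with made-up records:

```python
import pandas as pd

# Hypothetical records of the same kind coming from two different sources.
source_a = pd.DataFrame({"artist": ["Van Gogh", "Van Gogh"], "year": [1888, 1889]})
source_b = pd.DataFrame({"artist": ["Van Gogh"], "year": [1890]})

# Pool everything into one table with a unified format...
paintings = pd.concat([source_a, source_b], ignore_index=True)

# ...and aggregate it, for example by counting records per year.
print(paintings.groupby("year").size())
```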

Normalization

Normalization helps you scale the data within a range to avoid building incorrect ML models during training and/or data analysis. If the data range is very wide, it is hard to compare the figures. With various normalization techniques, you can transform the original data linearly, perform decimal scaling, or apply Z-score normalization.

For example, to compare the population growth of city X (over a million citizens) with that of city Y (a thousand new citizens), we need to normalize these figures.
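
A minimal sketch of two of these techniques, min-max scaling (a linear transform into [0, 1]) and Z-score normalization, on made-up population growth figures:

```python
import pandas as pd

growth = pd.Series([1_200_000, 1_000, 85_000, 4_300], name="new_citizens")

# Min-max scaling: linear transform into the [0, 1] range.
min_max = (growth - growth.min()) / (growth.max() - growth.min())

# Z-score normalization: zero mean, unit standard deviation.
z_score = (growth - growth.mean()) / growth.std()

print(pd.DataFrame({"raw": growth, "min_max": min_max, "z_score": z_score}))
```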

Feature selection

Feature selection means choosing the variables in the data that are the best predictors of the variable we want to predict.

If there are a lot of features, the classifier takes longer to run. In addition, prediction accuracy often decreases, especially if the data contains many garbage features that are not correlated with the target variable. In the Machine Learning Mastery blog, you can learn how to perform feature selection for your ML database.
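
A minimal sketch of one simple feature selection approach, scoring features by a univariate statistic with scikit-learn; the synthetic data and the choice of k = 5 are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Keep the 5 features with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                      # (500, 5)
print(selector.get_support(indices=True))   # indices of the selected features
```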

Discretization

During discretization, you transform the data into sets of small intervals, for example, putting people into the categories “young”, “middle age”, and “senior” rather than working with continuous age values. Discretization helps to improve efficiency.
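
A minimal pandas sketch of discretizing a continuous age column into those categories (the cut points are assumptions):

```python
import pandas as pd

ages = pd.Series([19, 27, 35, 44, 52, 63, 71])

# Assumed cut points: under 35 = young, 35-59 = middle age, 60 and over = senior.
categories = pd.cut(ages, bins=[0, 34, 59, 120],
                    labels=["young", "middle age", "senior"])
print(pd.DataFrame({"age": ages, "category": categories}))
```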

Concept hierarchy generation

The concept hierarchy generation method lets you build a hierarchy between attributes where one was not specified. For example, if your location information includes a street, city, province, and country but without any hierarchical order, this method helps you transform the data.

Generalization

With the help of generalization, it is possible to convert low-level data features to high-level data features. For example, house addresses can be generalized to higher-level definitions, such as town or country.
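
A minimal sketch of generalizing low-level address data into a higher-level feature, using a hypothetical city-to-country lookup table:

```python
import pandas as pd

addresses = pd.Series([
    "12 Baker Street, London",
    "221B Baker Street, London",
    "5 Main Street, Dublin",
])

# Hypothetical lookup table; in practice this could come from a reference
# dataset or a geocoding service.
city_to_country = {"London": "United Kingdom", "Dublin": "Ireland"}

city = addresses.str.split(", ").str[-1]
country = city.map(city_to_country)
print(pd.DataFrame({"address": addresses, "city": city, "country": country}))
```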

Data reduction

When you work with large amounts of data, it becomes harder to come up with reliable solutions. Data reduction can be used to reduce the amount of data and decrease the costs of analysis.

Researchers especially need data reduction when working with speech datasets. Massive arrays contain individual features of the speakers, for example, interjections and filler words. In this case, huge databases can be reduced to a representative sample for analysis.

Here are a few techniques for data reduction:

Attribute feature selection

Techniques for data transformation can also be used for data reduction. Constructing a new feature that combines the given features to make the data mining process more efficient is called attribute selection. For example, the features male/female and student can be combined into male student/female student. This can be useful if we are researching how many men and women are students but their field of study doesn’t interest us.
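
A minimal pandas sketch of constructing such a combined attribute; the table is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["male", "female", "female", "male"],
    "student": [True, True, False, True],
})

# Combine the two original attributes into a single one, e.g. "male student".
suffix = df["student"].map({True: " student", False: " non-student"})
df["group"] = df["gender"] + suffix
print(df["group"].value_counts())
```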

Dimensionality reduction

Datasets used to solve real-life tasks often have a huge number of features. Computer vision, speech generation, translation, and many other tasks cannot sacrifice speed of operation for the sake of quality. Dimensionality reduction makes it possible to cut the number of features used.
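
A minimal sketch of dimensionality reduction with principal component analysis in scikit-learn; the synthetic data and the choice of 10 components are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 1,000 samples with 50 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))

# Project onto the 10 directions that preserve the most variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (1000, 10)
print(pca.explained_variance_ratio_.sum())  # share of variance kept
```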

Numerosity reduction

Numerosity reduction is a data reduction method that replaces the original data with a smaller representation. There are two types of numerosity reduction methods: parametric and non-parametric.

Parametric methods

Parametric methods use models to represent data. Commonly, regression is used to build such models.
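
A minimal sketch of the parametric idea: instead of storing every point, fit a regression model and keep only its parameters. Here numpy's polyfit stands in for a real modeling pipeline, and the data is synthetic:

```python
import numpy as np

# Hypothetical measurements that follow a roughly linear trend.
x = np.arange(100)
y = 3.2 * x + 7.0 + np.random.default_rng(0).normal(scale=2.0, size=100)

# Store two parameters (slope and intercept) instead of 100 raw values.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)

# The data can later be approximated from the parameters alone.
y_approx = slope * x + intercept
```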

Non-parametric methods

These techniques store reduced representations of the data through histograms, data sampling, and data cube aggregation.
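
And a minimal sketch of the non-parametric idea: represent the values by a histogram, i.e. a handful of bin counts and edges instead of the raw data (the values are synthetic):

```python
import numpy as np

values = np.random.default_rng(1).normal(loc=50, scale=10, size=10_000)

# Keep 20 bin counts and 21 bin edges instead of 10,000 raw values.
counts, edges = np.histogram(values, bins=20)
print(counts)
print(edges)
```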

A good resource to explore if you are interested in data reduction techniques is GeeksforGeeks.

Final thoughts

Now you know all the steps you need to take to preprocess your data for analysis. Check out our blog to learn where to find the best datasets for your research and how to choose an ML algorithm for your project.

Serokell provides database administration and data storage monitoring services that help you optimize your workflow and make your business more efficient. Learn more about our IT consultancy services.
