Scikit-Learn’s New API Simplifies Data Preprocessing

Soner Yıldırım
3 min read · Oct 17, 2022

For transforming, scaling, and normalizing.

Photo by Nick Fewings on Unsplash

Scikit-learn is a widely used Python library for machine learning. In fact, it is usually one of the first libraries we learn about in data science.

Scikit-learn provides functions and methods that cover the entire machine learning workflow. It is used not only for implementing machine learning algorithms but also for tasks like feature preprocessing and model evaluation.

In this article, we will talk about a new API related to the data preprocessing functions. In machine learning, we rarely use features exactly as they appear in the raw data.

Raw features usually require a lot of preprocessing for optimal results. For instance, some algorithms, such as those that rely on distances between data points, do not perform well if feature value ranges are very different. They tend to give more importance to the features with larger values, so the results become biased.

Consider a house price prediction problem. The area of a house is around 200 square meters, whereas its age is usually less than 20 years. The number of bedrooms is 1, 2, or 3 in most cases. All of these features are important in determining the price of a house.
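
To make the scale problem concrete, here is a minimal sketch with made-up house data (the values and the `distance` helper are assumptions for illustration only). It compares Euclidean distances between houses before and after standardizing the features with scikit-learn's long-standing `StandardScaler`. It does not use the new API this article introduces; it only illustrates why scaling matters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [area in square meters, age in years, bedrooms]
X = np.array([
    [200.0, 5.0, 3.0],   # reference house
    [230.0, 5.0, 3.0],   # somewhat larger, otherwise identical
    [200.0, 20.0, 1.0],  # same area, but much older with fewer bedrooms
    [170.0, 12.0, 2.0],
])


def distance(rows, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(rows[i] - rows[j]))


# On the raw data, the area column dominates the distance.
raw_larger = distance(X, 0, 1)
raw_older = distance(X, 0, 2)
print(f"raw:    larger house {raw_larger:.2f}, older house {raw_older:.2f}")

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# After scaling, every feature contributes on a comparable scale.
scaled_larger = distance(X_scaled, 0, 1)
scaled_older = distance(X_scaled, 0, 2)
print(f"scaled: larger house {scaled_larger:.2f}, older house {scaled_older:.2f}")
```

On the raw values, the house that is only 30 square meters larger looks farther from the reference than the one that is 15 years older with two fewer bedrooms. After standardization the ordering flips, because area no longer outweighs the other two features purely through its units.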

However, if we use them without any scaling, machine learning models might give more importance to the…
