How to Normalize Data in Python

How to Normalize Data in Python

In this article we will explore how to normalize data in Python.

Table of Contents


Introduction

One of the first steps in feature engineering for many machine learning models is ensuring that the data is scaled properly.

Some models, such as linear regression, KNN, and SVM, for example, are heavily affected by features with different scales.

While others, such as decision trees, bagging, and boosting algorithms generally do not require any data scaling.

The level of effect of features’ scales on mentioned models is high, and features with larger ranges of values will play a bigger role in the decision making of the algorithm since impacts they produce have larger effect on the outputs.

In such cases, we turn to feature scaling to help us find common level for all these features to be evaluated equally when training the model.

Two most popular feature scaling techniques are:

  1. Z-Score Standardization
  2. Min-Max Normalization

In this article, we will discuss how to perform min-max normalization of data using Python.

To continue following this tutorial we will need the following two Python libraries: pandas and sklearn.

If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code:

PowerShell
pip install pandas
pip install sklearn

What is normalization?

In statistics and machine learning, min-max normalization of data is a process of converting original range of data to the range between 0 and 1.

The resulting normalized values represent the original data on 0-1 scale.

This will allow us to compare multiple features together and get more relevant information since now all the data will be on the same scale.

In min-max normalization, for every feature, its minimum value gets transformed into 0 and its maximum value gets transformed into 1. All values in-between get scaled to be within 0-1 range based on the original value relative to minimum and maximum values of the feature.

Suppose you have an array of numbers: \(A = [v_1, v_2, …, v_i]\).

We will first find the minimum and maximum values of the array: \(min_A\) and \(max_A\).

Then, using the min and max values we will transform each original value \(v_i\) into a min-max normalized value \(v’_i\) using the follwoing formula:

$$v’_i = \frac{v_i – min_A}{max_A – min_A}$$


Normalization example

In this section we will take a look at a simple example of data normalization.

Consider the following dataset with prices of different apples:

Weight in gPrice in $
3003
2502
8005

And plotting this dataset should look like this:

not normalized data

Here we see a much larger variation of the weight compared to price, but it appears to looks like this because of different scales of the data.

The prices range is between $2 and $5, whereas the weight range is between 250g and 800g.

Let’s normalize this data!

Start with the weight feature:

Observation\(v_i\)\(v’_i = \frac{v_i – min_W}{max_W – min_W}\)
1300\(\frac{300-250}{800-250} = 0.09\)
2250\(\frac{250-250}{800-250} = 0\)
3800\(\frac{800-250}{800-250} = 1\)
\(min_W\)250
\(max_W\)800

And do the same for the price feature:

Observation\(v_i\)\(v’_i = \frac{v_i – min_P}{max_P – min_P}\)
13\(\frac{3-2}{5-2} = 0.33\)
22\(\frac{2-2}{5-2} = 0\)
35\(\frac{5-2}{5-2} = 1\)
\(min_P\)2
\(max_P\)5

Then combine the two features into one dataset:

Weight (normalized)Price (normalized)
0.090.33
00
11

We can now see that the scale of the features in the dataset is very similar, and when visualizing the data, the spread between the points will be smaller:

Normalized data

The graph looks almost identical with the only difference being the scale of the each axis.


How to normalize data in Python

Let’s start by creating a dataframe that we used in the example above:

Python
import pandas as pd

data = {'weight':[300, 250, 800],
        'price':[3, 2, 5]}

df = pd.DataFrame(data)

print(df)

and you should get:

PowerShell
   weight  price
0     300      3
1     250      2
2     800      5

Once we have the data ready, we can use the MinMaxScaler() class and its methods (from sklearn library) to normalize the data:

Python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

normalized_data = scaler.fit_transform(df)

print(normalized_data)

and you should get:

PowerShell
[[0.09090909 0.33333333]
 [0.         0.        ]
 [1.         1.        ]]

As you can see, the above code returned an array, so the last step would be to convert it to dataframe:

Python
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)

print(normalized_df)

and you should get:

PowerShell
     weight     price
0  0.090909  0.333333
1  0.000000  0.000000
2  1.000000  1.000000

which is identical to the result in the example which we calculated manually.


Conclusion

In this tutorial we discussed how to normalize data in Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Machine Learning articles.

Leave a Comment

Your email address will not be published. Required fields are marked *