Outlier Detection and Treatment Using Python

Introduction

What are outliers?

Outliers in a dataset are data points that differ significantly from the rest of the data. They lie outside the overall pattern of the distribution. In rough terms, outliers are data points with extremely high or low values compared to the rest of the data points.

What is outlier detection and treatment?

The process of identifying/flagging the outliers in a dataset is known as outlier detection.

The process of changing/deleting the outliers in a dataset is known as outlier treatment.

Outlier treatment is necessary because data-related analyses are sensitive to the range and distribution of data points. Outliers can skew statistical techniques and lead to less accurate results.

The statistical techniques that this article uses to detect and treat outliers are outlined below:

Article Flow

We will now go through each of the phases above in detail.

The dataset that has been used for the examples illustrated in this article is publicly available and can be downloaded from this link.

Ground Work

  1. Before we jump into outlier detection and treatment, let us first view the data
Head Output
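
The step above can be sketched as follows. Since the download link is not reproduced here, this sketch builds a small, made-up stand-in DataFrame with only the three numeric columns named in this article (“age”, “player_height”, “player_weight”); the values are assumptions for illustration, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset: made-up values, and only
# the three numeric columns that the article discusses.
df = pd.DataFrame({
    "age": [22, 25, 27, 31, 41, 24],
    "player_height": [198.1, 205.7, 190.5, 213.4, 175.3, 201.9],
    "player_weight": [95.3, 108.9, 88.5, 117.9, 74.8, 99.8],
})
# With the real file you would load it instead, for example with
# pd.read_csv(path), where path points to the downloaded CSV (assumed name).

# head() returns the first five rows by default
first_rows = df.head()
print(first_rows)
```

Passing a number, for example `df.head(10)`, shows that many rows instead of the default five.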

  1. Now check how many records the data has using the “shape” attribute
Shape Output
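
A minimal sketch of this check, again on a made-up stand-in DataFrame (an assumption, so the numbers it prints will not match the real dataset). Note that “shape” is an attribute of the DataFrame, so it is read without parentheses.

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset is much larger.
df = pd.DataFrame({
    "age": [22, 25, 27, 31, 41],
    "player_height": [198.1, 205.7, 190.5, 213.4, 175.3],
    "player_weight": [95.3, 108.9, 88.5, 117.9, 74.8],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows and {n_cols} columns")
```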

We can conclude from the above output that the data consists of 11,145 rows and 8 columns.

  1. Let’s now get some more general information about the data-frame using the “info” method and see how many null values the dataset contains
Info Output
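
One way to run this check is sketched below on a made-up stand-in DataFrame (an assumption). The explicit per-column null count with `isnull().sum()` is an addition for clarity; the article itself only mentions “info”.

```python
import pandas as pd

# Hypothetical stand-in data with no missing values.
df = pd.DataFrame({
    "age": [22, 25, 27, 31, 41],
    "player_height": [198.1, 205.7, 190.5, 213.4, 175.3],
    "player_weight": [95.3, 108.9, 88.5, 117.9, 74.8],
})

# info() prints column names, dtypes and non-null counts
df.info()

# Count the missing values in each column explicitly
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

If every non-null count reported by `info()` equals the number of rows, the dataset has no missing values.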

Our dataset has no missing values, so we can proceed with outlier detection and treatment.

Outlier Detection

  1. Visually identifying the outliers using box-plots
Boxplot Output
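
A possible way to draw such box-plots is sketched below with matplotlib; the plotting library used for the original figure is not stated, so this choice is an assumption. The data is randomly generated stand-in data, not the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data drawn from normal distributions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(27, 4, 500).round(),
    "player_height": rng.normal(200, 9, 500),
    "player_weight": rng.normal(100, 12, 500),
})

numeric_cols = ["age", "player_height", "player_weight"]

# One box-plot per numeric column, side by side
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 4))
for ax, col in zip(axes, numeric_cols):
    # By default the whiskers extend to 1.5 * IQR beyond the quartiles;
    # points beyond the whiskers are drawn individually as dots.
    ax.boxplot(df[col].to_numpy())
    ax.set_title(col)
fig.tight_layout()
# In a script, call plt.show() to display the figure.
```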

The dots in the box-plots represent outliers, so we can conclude the following from the above boxplots:

    • For the “age” column, outliers lie only at the higher end of the age range
    • For the “player_height” column, outliers lie at both the upper and lower ends of the height range, with more outliers at the lower end than at the upper end
    • For the “player_weight” column, outliers lie at both the upper and lower ends of the weight range, with more outliers at the upper end than at the lower end
  1. Now let’s see how these numeric columns are distributed and whether we can get any insights from their distribution plots
Distribution Plots
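
The sketch below shows one way to plot the distributions and to put a number on the skew. It uses plain matplotlib histograms and the pandas `skew()` method, which may differ from the plotting call behind the original figure. The data is made-up stand-in data, generated so that “age” has a long right tail.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data: age is right skewed, height is left skewed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": 19 + rng.gamma(shape=4.0, scale=2.0, size=500),
    "player_height": 215 - rng.gamma(shape=4.0, scale=4.0, size=500),
    "player_weight": rng.normal(100, 12, 500),
})

numeric_cols = ["age", "player_height", "player_weight"]

# Positive skewness means a longer right tail, negative a longer left tail
skewness = df[numeric_cols].skew()
print(skewness)

# One histogram per numeric column
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 4))
for ax, col in zip(axes, numeric_cols):
    ax.hist(df[col], bins=30)
    ax.set_title(col)
fig.tight_layout()
```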

We can conclude the following points from the distribution-plots above:

    • The “age” feature is right skewed, which suggests that there are outliers towards the higher end (right end in above plots) of the age column
    • The “player_height” feature is slightly left skewed, which suggests that there are outliers towards the lower end (left end in above plots) of the height column
    • The “player_weight” feature looks slightly right skewed, which suggests that there are outliers towards the higher end (right end in above plots) of the weight column

Outlier Treatment

  1. Treating outliers in the age column by deleting the outliers
    • Finding out the upper cut-off limit for the “age” feature to delete outliers
Age Upper Limit

    • Now let’s find out the percentage and absolute count of players that are aged above 39 years
Age Outliers

    • Let’s now delete these outliers and replot the box-plot for the “age” feature
Age after treatment
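
The three age steps above can be sketched as follows. The rule used to compute the upper limit is not stated in the text, so this sketch assumes the common 1.5 × IQR rule, the same rule that box-plot whiskers use by default. The ages are made-up stand-in values, so the limit computed here will differ from the cut-off of 39 reported for the real data.

```python
import pandas as pd

# Hypothetical stand-in ages, not the real dataset.
df = pd.DataFrame({"age": [19, 22, 24, 25, 26, 27, 28, 30, 31, 33, 40, 44]})

# Assumed rule: upper limit = Q3 + 1.5 * IQR
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr

# Absolute count and percentage of records above the limit
is_outlier = df["age"] > upper_limit
outlier_count = int(is_outlier.sum())
outlier_pct = 100 * is_outlier.mean()
print(f"upper limit: {upper_limit}, outliers: {outlier_count} ({outlier_pct:.2f}%)")

# Delete the outlier rows and keep the rest
df_clean = df[~is_outlier].reset_index(drop=True)
```

Deleting rows is the simplest treatment, but it shrinks the dataset, so it is worth checking the outlier percentage before dropping anything.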

  1. Treating outliers in the height column by capping the values at a maximum and minimum limit
    • Finding out the upper and lower limits for the height column
Height Limit Output

    • Capping the outliers at the maximum and minimum limits and re-plotting the box-plot after outlier treatment
Height After Treatment
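
A minimal sketch of capping, assuming the limits come from the 1.5 × IQR rule (the text does not say how the limits were derived) and using made-up stand-in heights. Values below the lower limit are raised to it and values above the upper limit are lowered to it, so no rows are lost.

```python
import pandas as pd

# Hypothetical stand-in heights with one low and one high extreme value.
df = pd.DataFrame({
    "player_height": [160.0, 185.4, 190.5, 193.0, 198.1, 200.7,
                      203.2, 205.7, 208.3, 210.8, 213.4, 240.0]
})

# Assumed rule: limits at 1.5 * IQR beyond the quartiles
q1 = df["player_height"].quantile(0.25)
q3 = df["player_height"].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

n_before = len(df)

# clip() replaces values outside the limits with the nearest limit
df["player_height"] = df["player_height"].clip(lower=lower_limit, upper=upper_limit)
```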

  1. Treating outliers in the weight column by imputing them with the mean
    • Finding out the upper and lower limits for the weight column
Weight Limit Output

    • Imputing the outliers using the mean and re-plotting the box-plot after outlier treatment
Weight after Treatment
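
Mean imputation can be sketched as below. Two things here are assumptions: the limits again use the 1.5 × IQR rule, and the mean is taken over the non-outlier values only, so that the extreme values do not distort the replacement value. Using the mean of the whole column is the other common choice. The weights are made-up stand-in values.

```python
import pandas as pd

# Hypothetical stand-in weights with one low and one high extreme value.
df = pd.DataFrame({
    "player_weight": [60.0, 88.5, 90.7, 93.0, 95.3, 97.5,
                      99.8, 102.1, 104.3, 108.9, 113.4, 150.0]
})

# Assumed rule: limits at 1.5 * IQR beyond the quartiles
q1 = df["player_weight"].quantile(0.25)
q3 = df["player_weight"].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Flag values outside the limits
is_outlier = (df["player_weight"] < lower_limit) | (df["player_weight"] > upper_limit)

# Mean of the non-outlier values (an assumption, see the note above)
mean_weight = df.loc[~is_outlier, "player_weight"].mean()

# Replace each outlier with that mean
df.loc[is_outlier, "player_weight"] = mean_weight
```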

All the outliers have been treated in the dataset and the data is now ready to be processed further.

Complete Code with Github Link

Github Link
