# Outlier Detection and Treatment Using Python

## Introduction

What are outliers?

Outliers in a dataset are data points that differ significantly from the rest of the data. They lie outside the overall pattern of the distribution. In rough terms, outliers are data points with extremely high or low values compared to the rest of the data points.

What is outlier detection and treatment?

The process of identifying or flagging the outliers in a dataset is known as outlier detection.

The process of changing/deleting the outliers in a dataset is known as outlier treatment.

Outlier treatment is necessary because data-related analyses are sensitive to the range and distribution of data points. Outliers can skew statistical techniques and lead to less accurate results.

This article uses the following statistical techniques to detect and treat outliers: box-plots and distribution plots for detection, and deletion, capping, and mean imputation for treatment.

We will now go through each of these phases in detail.

The dataset that has been used for the examples illustrated in this article is publicly available and can be downloaded from this link.

## Ground Work

1. Before we jump into outlier detection and treatment, let us first view the data
1. Now check how many records our data has using the “shape” attribute
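A minimal sketch of these two steps follows. The original file is not reproduced here, so the snippet builds a small synthetic DataFrame with the three numeric columns discussed later (`age`, `player_height`, `player_weight`); the values, row count, and seed are assumptions for illustration only. With the real data you would load the downloaded file with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
n_rows = 200
df = pd.DataFrame({
    "age": rng.integers(19, 41, size=n_rows),
    "player_height": rng.normal(200, 9, size=n_rows).round(2),
    "player_weight": rng.normal(100, 12, size=n_rows).round(2),
})

print(df.head())   # first five rows
print(df.shape)    # tuple of (number of rows, number of columns)
```

On this synthetic frame `df.shape` is `(200, 3)`; the real dataset is larger and has more columns.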

We can conclude from the above output that the data consists of 11,145 rows and 8 columns.

1. Let’s now get some more general information about the DataFrame using the “info” method and see how many null values the dataset contains
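As a rough sketch, `DataFrame.info()` prints the dtype and non-null count of each column, and `isnull().sum()` gives the missing-value count per column directly. The DataFrame below is again a synthetic stand-in with assumed values, not the article's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
n_rows = 200
df = pd.DataFrame({
    "age": rng.integers(19, 41, size=n_rows),
    "player_height": rng.normal(200, 9, size=n_rows),
    "player_weight": rng.normal(100, 12, size=n_rows),
})

df.info()                     # dtypes and non-null counts for every column
missing = df.isnull().sum()   # number of missing values in each column
print(missing)
```

If any entry of `missing` were non-zero, the missing values would need handling before the outlier steps.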

Our dataset has no missing values, so we can proceed with outlier detection and treatment.

## Outlier Detection

1. Visually identifying the outliers using box-plots
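One way to draw such box-plots is sketched below with matplotlib. The plotting library used for the original figures is not stated, and the data here is synthetic, so treat this as an illustration under those assumptions. By default matplotlib places the whiskers at 1.5 times the interquartile range and draws any point beyond them as an individual marker.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
n_rows = 500
df = pd.DataFrame({
    "age": rng.normal(27, 4, size=n_rows),
    "player_height": rng.normal(200, 9, size=n_rows),
    "player_weight": rng.normal(100, 12, size=n_rows),
})

numeric_cols = ["age", "player_height", "player_weight"]
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 4))
for ax, col in zip(axes, numeric_cols):
    ax.boxplot(df[col].to_numpy())  # points beyond the whiskers are drawn as dots
    ax.set_title(col)
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```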

The dots in the box-plots represent outliers, so we can conclude the following points from the above box-plots:

• For the “age” column, outliers lie only at the higher end of the age range
• For the “player_height” column, outliers lie at both the top and bottom ends of the height range, but there are more outliers at the lower end than at the upper end
• For the “player_weight” column, outliers lie at both the top and bottom ends of the weight range, but there are more outliers at the upper end than at the lower end
1. Now let’s see how these numeric columns are distributed and whether their distribution plots offer any insights
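A simple way to look at the distributions is a histogram per column, sketched here with matplotlib. The bin count of 30 and the synthetic data are assumptions for illustration; the original plots may have been produced differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset (an assumption, not the real data).
rng = np.random.default_rng(seed=0)
n_rows = 500
df = pd.DataFrame({
    "age": rng.normal(27, 4, size=n_rows),
    "player_height": rng.normal(200, 9, size=n_rows),
    "player_weight": rng.normal(100, 12, size=n_rows),
})

numeric_cols = ["age", "player_height", "player_weight"]
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 4))
for ax, col in zip(axes, numeric_cols):
    ax.hist(df[col].to_numpy(), bins=30)  # one bar per bin
    ax.set_title(col)
    ax.set_xlabel(col)
    ax.set_ylabel("count")
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```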

We can conclude the following points from the distribution-plots above:

• The “age” feature is right-skewed, which suggests that there are outliers towards the higher end (right end in the above plots) of the age column
• The “player_height” feature is slightly left-skewed, which suggests that there are outliers towards the lower end (left end in the above plots) of the height column
• The “player_weight” feature looks slightly right-skewed, which suggests that there are outliers towards the higher end (right end in the above plots) of the weight column
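The direction of skew can also be checked numerically with pandas' `skew()` method: a positive value points to a longer right tail and a negative value to a longer left tail. The tiny frame below uses made-up values chosen to mimic the three patterns described above; it is not the article's data, and a skewed distribution only hints at outliers rather than proving they exist.

```python
import pandas as pd

# Hand-made illustrative values (assumptions), each with one extreme point.
df = pd.DataFrame({
    "age": [22, 23, 24, 24, 25, 26, 27, 40],                     # long right tail
    "player_height": [170, 195, 198, 200, 201, 203, 205, 206],   # long left tail
    "player_weight": [80, 85, 88, 90, 92, 95, 98, 140],          # long right tail
})

skewness = df.skew()  # sample skewness of each column
print(skewness)
```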

## Outlier Treatment

1. Treating outliers in the age column by deleting the outliers
• Finding out the upper cut-off limit for the “age” feature to delete outliers
• Now let’s find out the percentage and absolute count of players that are aged above 39 years
• Let’s now delete these outliers and replot the box-plot for the “age” feature
1. Treating outliers in the height column by capping the values at a maximum and minimum limit
• Finding out the upper and lower limits for the height column
• Capping the outliers at the maximum and minimum limits and re-plotting the box-plot after outlier treatment
1. Treating outliers in the weight column by imputing them with the mean
• Finding out the upper and lower limits for the weight column
• Imputing the outliers using the mean and re-plotting the box-plot after outlier treatment
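The three treatments above can be sketched as follows. How the original cut-off limits were derived is not stated, so this example assumes the common 1.5 × IQR rule; the helper `iqr_limits` and the small hand-made DataFrame are hypothetical and only serve the illustration. The resulting limits therefore differ from those of the real dataset.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits as Q1 - k*IQR and Q3 + k*IQR (assumed rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hand-made illustrative values (assumptions), not the article's data.
df = pd.DataFrame({
    "age": [22, 23, 24, 24, 25, 26, 27, 40],
    "player_height": [170.0, 195.0, 198.0, 200.0, 230.0, 203.0, 205.0, 201.0],
    "player_weight": [88.0, 85.0, 140.0, 90.0, 92.0, 95.0, 60.0, 98.0],
})

# 1. Deletion: count, then drop, the rows whose age exceeds the upper limit.
_, age_upper = iqr_limits(df["age"])
is_age_outlier = df["age"] > age_upper
print(f"{is_age_outlier.sum()} rows ({is_age_outlier.mean():.1%}) above {age_upper}")
df = df[~is_age_outlier].copy()

# 2. Capping: clip heights so nothing falls outside [lower, upper].
h_lower, h_upper = iqr_limits(df["player_height"])
df["player_height"] = df["player_height"].clip(lower=h_lower, upper=h_upper)

# 3. Mean imputation: replace out-of-range weights with the mean of the
#    in-range weights (a design choice; the full-column mean is another option).
w_lower, w_upper = iqr_limits(df["player_weight"])
is_weight_outlier = ~df["player_weight"].between(w_lower, w_upper)
weight_mean = df.loc[~is_weight_outlier, "player_weight"].mean()
df.loc[is_weight_outlier, "player_weight"] = weight_mean

print(df)
```

Deletion shrinks the dataset, whereas capping and imputation keep every row and only change the extreme values. Re-plotting the box-plot of each column after its treatment is a quick way to confirm the effect.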

All the outliers have been treated in the dataset and the data is now ready to be processed further.

Github Link