Outlier Detection and Treatment Using Python
Introduction
What are outliers?
Outliers in a dataset are data points that differ significantly from the rest of the data. They lie outside the overall pattern of the distribution. In rough terms, outliers are data points with extremely high or low values compared to the rest of the data points.
What is outlier detection and treatment?
The process of identifying/flagging the outliers in a dataset is known as outlier detection.
The process of changing/deleting the outliers in a dataset is known as outlier treatment.
Outlier treatment is necessary because data-related analyses are sensitive to the range and distribution of data points. Outliers can skew statistical techniques and lead to less accurate results.
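To make this sensitivity concrete, here is a minimal sketch using a handful of made-up ages (the values are hypothetical and not taken from the dataset used in this article). It compares how far the mean and the median move when a single extreme value is appended.

```python
import numpy as np

# Hypothetical ages, invented purely for illustration
ages = np.array([22, 24, 25, 26, 27, 28, 29, 30])

# The same ages with one artificially extreme value appended
ages_with_outlier = np.append(ages, 95)

mean_before, mean_after = ages.mean(), ages_with_outlier.mean()
median_before, median_after = np.median(ages), np.median(ages_with_outlier)

# The mean shifts by several years, while the median barely moves
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

Statistics built on sums, such as the mean and the standard deviation, are pulled towards extreme points, which is why those points are usually inspected before any modelling.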
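The techniques that we will be discussing in this article fall into two phases. For outlier detection, we will use box-plots and distribution plots. For outlier treatment, we will use limits based on the interquartile range (IQR) and then delete the outliers, cap them at the limits, or impute them with the mean.
We will now go through each of these phases in detail.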
The dataset that has been used for the examples illustrated in this article is publicly available and can be downloaded from this link.
Ground Work
- Before we jump into outlier detection and treatment, let us first view the data
    # Importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np

    # Loading the dataset
    df = pd.read_csv("all_seasons.csv")

    # Viewing a snapshot of the dataset
    df.head()
- Now check how many records the data has using the “shape” attribute
    # Getting the number of rows and columns of the dataframe using the "shape" attribute
    print(df.shape)
The output shows that the data consists of 11,145 rows and 8 columns.
- Let’s now get some more general information about the dataframe using the “info” method and see how many null values there are in the dataset
    # Getting information about null values and data types using the "info" method
    df.info()
Our dataset has no missing values, so we can proceed with outlier detection and treatment.
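The absence of missing values can also be confirmed with an explicit count per column. The sketch below uses a small hypothetical frame (the values and the deliberately missing age are invented) rather than the article's dataset; the same two calls would apply to df.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data; one age is missing on purpose
sample = pd.DataFrame({
    "age": [25, 31, np.nan, 28],
    "player_height": [198.1, 205.7, 190.5, 201.9],
})

# isnull() marks missing cells; sum() counts them column by column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# True only when no cell in the whole frame is missing
has_no_missing = not sample.isnull().values.any()
print(has_no_missing)
```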
Outlier Detection
- Visually identifying the outliers using box-plots
    # Creating box-plots for numeric columns to visually identify outliers
    plt.figure(figsize=(18, 5))
    plt.subplot(1, 3, 1)
    sns.boxplot(y=df["age"]);
    plt.subplot(1, 3, 2)
    sns.boxplot(y=df["player_height"]);
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df["player_weight"]);
The dots in the box-plots represent outliers, so we can conclude the following points from the above box-plots:
- For the “age” column, outliers lie only at the upper end of the age range
- For the “player_height” column, outliers lie at both ends of the height range, but there are more outliers at the lower end than at the upper end
- For the “player_weight” column, outliers lie at both ends of the weight range, but there are more outliers at the upper end than at the lower end
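The dots can also be counted rather than read off the plot. By default, seaborn's box-plot draws as individual points anything lying more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below applies that rule to a few hypothetical heights; the helper name iqr_fences and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical heights in cm, invented for illustration
heights = pd.Series([160.0, 195.0, 198.0, 200.0, 201.0, 203.0, 205.0, 208.0, 231.0])

lower, upper = iqr_fences(heights)

# Count the points that fall outside each fence
n_low = int((heights < lower).sum())
n_high = int((heights > upper).sum())

print(f"fences: [{lower}, {upper}]")
print(f"outliers below: {n_low}, above: {n_high}")
```

Running the same helper on df["age"], df["player_height"] and df["player_weight"] would give the per-end counts that the box-plots only show visually.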
- Now let’s see how these numeric columns are distributed and whether we can get any insights from their distribution plots
    # Creating distribution plots to better understand the spread of the numeric columns
    fig, ax = plt.subplots(1, 3, figsize=(18, 7))
    sns.histplot(df["age"], kde=True, color="red", bins=50, ax=ax[0])
    sns.histplot(df["player_height"], kde=True, color="red", bins=50, ax=ax[1])
    sns.histplot(df["player_weight"], kde=True, color="red", bins=50, ax=ax[2])
    fig.show()
We can conclude the following points from the distribution plots above:
- The “age” feature is right skewed. This suggests that there are outliers towards the higher end (right end in the above plots) of the age column
- The “player_height” feature is slightly left skewed. This suggests that there are outliers towards the lower end (left end in the above plots) of the height column
- The “player_weight” feature looks slightly right skewed. This suggests that there are outliers towards the higher end (right end in the above plots) of the weight column
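Skew judged by eye can be backed up with a number. The sketch below uses pandas' skew() method on two hypothetical series (the values are invented): a positive result points to a longer right tail and a negative result to a longer left tail. Skewness only summarises asymmetry, so it hints at where outliers may sit rather than proving that they exist.

```python
import pandas as pd

# Hypothetical values with one large value stretching the right tail
right_tailed = pd.Series([20, 21, 21, 22, 22, 23, 24, 25, 40])

# Mirror image of the same values, so the long tail is on the left
left_tailed = 60 - right_tailed

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()

print(f"right-tailed skew: {skew_right:.3f}")  # positive
print(f"left-tailed skew:  {skew_left:.3f}")   # negative
```

On the article's data, the equivalent call would be df[["age", "player_height", "player_weight"]].skew().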
Outlier Treatment
- Treating outliers in the age column by deleting the outliers
- Finding out the upper cut-off limit for the “age” feature to delete outliers
    # Identifying the upper limit for age using IQR
    IQR = df.age.quantile(0.75) - df.age.quantile(0.25)
    upper_limit = df.age.quantile(0.75) + (IQR * 1.5)
    upper_limit
- Now let’s find out the percentage and absolute count of players that are aged above 39 years
    # np.float was removed from recent NumPy releases; the built-in float does the same job
    total = float(df.shape[0])
    print('Total Players: {}'.format(df.age.shape[0]))
    print('Players aged more than 39 years: {}'.format(df[df.age > 39].shape[0]))
    print('Percentage of players aged more than 39 years: {}'.format(df[df.age > 39].shape[0] * 100 / total))
- Let’s now delete these outliers and replot the box-plot for the “age” feature
    # Keeping players aged 39 or younger; only ages above the upper limit count as outliers
    df_ageremoved = df[df['age'] <= 39]
    df_ageremoved
    sns.boxplot(y=df_ageremoved["age"]);
- Treating outliers in the height column by capping the values at a maximum and minimum limit
- Finding out the upper and lower limits for the height column
    # Identifying upper and lower limit for height
    # (the column is named "player_height"; "height" does not exist in the dataframe)
    IQR = df.player_height.quantile(0.75) - df.player_height.quantile(0.25)
    upper_limit = df.player_height.quantile(0.75) + (IQR * 1.5)
    lower_limit = df.player_height.quantile(0.25) - (IQR * 1.5)

    print("Upper height limit:" + str(upper_limit))
    print("Lower height limit:" + str(lower_limit))
- Capping the outliers at the maximum and minimum limits and re-plotting the box-plot after outlier treatment
    # Let us cap these outliers at the upper and lower limits and print the box-plot after outlier treatment
    df['Height_treated'] = np.where(df['player_height'] > 217.28, 217.28, df['player_height'])
    df['Height_treated'] = np.where(df['Height_treated'] < 186.58, 186.58, df['Height_treated'])
    sns.boxplot(y=df["Height_treated"]);
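As an alternative to two np.where calls with typed-in limits, pandas offers Series.clip, which caps both ends in one step. The sketch below computes the limits and applies clip to a few hypothetical heights; the values are invented, and this is a variation on the approach above rather than a replacement for it.

```python
import pandas as pd

# Hypothetical heights in cm, invented for illustration
heights = pd.Series([160.0, 195.0, 198.0, 200.0, 201.0, 203.0, 205.0, 208.0, 231.0])

# Limits from the IQR rule
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() raises values below the lower limit and lowers values above the upper limit
height_capped = heights.clip(lower=lower_limit, upper=upper_limit)

print(height_capped.tolist())
```

Capping keeps every row, so the record count is unchanged; only the extreme values are pulled in to the limits.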
- Treating outliers in the weight column by imputing them with the mean
- Finding out the upper and lower limits for the weight column
    # Identifying upper and lower limit for weight
    IQR = df.player_weight.quantile(0.75) - df.player_weight.quantile(0.25)
    upper_limit = df.player_weight.quantile(0.75) + (IQR * 1.5)
    lower_limit = df.player_weight.quantile(0.25) - (IQR * 1.5)

    print("Upper weight limit:" + str(upper_limit))
    print("Lower weight limit:" + str(lower_limit))
- Imputing the outliers using the mean and re-plotting the box-plot after outlier treatment
    # Let us impute these outliers using the mean and print the box-plot after outlier treatment
    df['Weight_treated'] = np.where(df['player_weight'] > 137.21, df['player_weight'].mean(), df['player_weight'])
    df['Weight_treated'] = np.where(df['Weight_treated'] < 62.82, df['player_weight'].mean(), df['Weight_treated'])
    sns.boxplot(y=df["Weight_treated"]);
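The code above imputes with the mean of the whole column, which the outliers themselves influence. One possible variation, sketched below on hypothetical weights (the values are invented), is to compute the mean from the non-outlier values only and substitute it with Series.mask. Whether this is preferable depends on the analysis; it is shown here only as an option.

```python
import pandas as pd

# Hypothetical weights in kg, invented for illustration
weights = pd.Series([60.0, 92.0, 95.0, 98.0, 100.0, 102.0, 105.0, 108.0, 150.0])

# Limits from the IQR rule
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the limits
is_outlier = (weights < lower_limit) | (weights > upper_limit)

# Mean of the remaining values (assumption: outliers are excluded from the mean)
inlier_mean = weights[~is_outlier].mean()

# mask() replaces the flagged values and leaves the others untouched
weight_imputed = weights.mask(is_outlier, inlier_mean)

print(weight_imputed.tolist())
```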
All the outliers have been treated in the dataset and the data is now ready to be processed further.
Complete Code with Github Link
    # Importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np

    # Loading the dataset
    df = pd.read_csv("all_seasons.csv")

    # Viewing a snapshot of the dataset
    df.head()

    # Getting the number of rows and columns of the dataframe using the "shape" attribute
    print(df.shape)

    # Getting information about null values and data types using the "info" method
    df.info()

    # Creating box-plots for numeric columns to visually identify outliers
    plt.figure(figsize=(18, 5))
    plt.subplot(1, 3, 1)
    sns.boxplot(y=df["age"]);
    plt.subplot(1, 3, 2)
    sns.boxplot(y=df["player_height"]);
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df["player_weight"]);

    # Creating distribution plots to better understand the spread of the numeric columns
    fig, ax = plt.subplots(1, 3, figsize=(18, 7))
    sns.histplot(df["age"], kde=True, color="red", bins=50, ax=ax[0])
    sns.histplot(df["player_height"], kde=True, color="red", bins=50, ax=ax[1])
    sns.histplot(df["player_weight"], kde=True, color="red", bins=50, ax=ax[2])
    fig.show()

    # Identifying the upper limit for age using IQR
    IQR = df.age.quantile(0.75) - df.age.quantile(0.25)
    upper_limit = df.age.quantile(0.75) + (IQR * 1.5)
    upper_limit

    # Printing total number of outliers for the age feature
    # (np.float was removed from recent NumPy releases; the built-in float does the same job)
    total = float(df.shape[0])
    print('Total Players: {}'.format(df.age.shape[0]))
    print('Players aged more than 39 years: {}'.format(df[df.age > 39].shape[0]))
    print('Percentage of players aged more than 39 years: {}'.format(df[df.age > 39].shape[0] * 100 / total))

    # Deleting the players that are aged more than the upper limit
    df_ageremoved = df[df['age'] <= 39]
    df_ageremoved
    sns.boxplot(y=df_ageremoved["age"]);

    # Identifying upper and lower limit for height
    IQR = df.player_height.quantile(0.75) - df.player_height.quantile(0.25)
    upper_limit = df.player_height.quantile(0.75) + (IQR * 1.5)
    lower_limit = df.player_height.quantile(0.25) - (IQR * 1.5)

    print("Upper height limit:" + str(upper_limit))
    print("Lower height limit:" + str(lower_limit))

    # Let us cap these outliers at the upper and lower limits and print the box-plot after outlier treatment
    df['Height_treated'] = np.where(df['player_height'] > 217.28, 217.28, df['player_height'])
    df['Height_treated'] = np.where(df['Height_treated'] < 186.58, 186.58, df['Height_treated'])
    sns.boxplot(y=df["Height_treated"]);

    # Identifying upper and lower limit for weight
    IQR = df.player_weight.quantile(0.75) - df.player_weight.quantile(0.25)
    upper_limit = df.player_weight.quantile(0.75) + (IQR * 1.5)
    lower_limit = df.player_weight.quantile(0.25) - (IQR * 1.5)

    print("Upper weight limit:" + str(upper_limit))
    print("Lower weight limit:" + str(lower_limit))

    # Let us impute these outliers using the mean and print the box-plot after outlier treatment
    df['Weight_treated'] = np.where(df['player_weight'] > 137.21, df['player_weight'].mean(), df['player_weight'])
    df['Weight_treated'] = np.where(df['Weight_treated'] < 62.82, df['player_weight'].mean(), df['Weight_treated'])
    sns.boxplot(y=df["Weight_treated"]);