Data Cleaning Using Python: Missing Value Treatment

Introduction

What is missing value treatment and why is it necessary?

Missing values treatment is a phase of Data Cleaning in which the data practitioner identifies the missing values in the dataset and uses various techniques to “take care” of the missing values so that the data can be further processed.

Methods to “take care” of missing values that we will be discussing in this article have been briefly highlighted below:

Article Flow

Article Flow

We will now go through each of the above defined phases in a detailed manner.

The dataset that has been used for the examples illustrated in this article is publicly available and can be downloaded from this link.

Ground Work

  1. Before we jump into data cleaning let us first view the data
Head Output

Head Output

  1. Now check how many records does our data have using the “shape” command
Shape Output

Shape Output

We can conclude from the above output that the data consists of 73,861 rows and 23 columns

  1. Lets now get some more generic information about the data-frame using the “info” command and see how many null values are there in the dataset
Info Output

Info Output

By looking at the above output we can comment that the following columns have Null values:

    • Name
    • Style
    • BoilGravity
    • MashThickness
    • PitchRate
    • PrimaryTemp
    • PrimingMethod
    • PrimingAmount
    • UserID

Missing Value Identification

  1. A way to directly print the columns that contain missing values is:

Output:

Null Column Names

Null Column Names

  1. Now let’s check the count of missing values in each column
Null Values Sum

Null Values Sum

  1. Now let’s check out the percentage of missing values for each column so that we can decide that deleting rows with missing values is feasible for which feature
Null Values Percentage

Null Values Percentage

    • By looking at the above output we can conclude that features “Name”, “Style” and ”BoilGravity” contribute to very less number of Null values so it will be logical to delete the records where-ever “Name”, “Style” or ”BoilGravity” are Null
    • More than 90% records in the columns “PrimingMethod” and “PrimingAmour” are blank so it will be appropriate to drop these columns
    • Since “UserId” is an identification column it’s missing values can’t be imputed or treated so we will be dropping this column as well
    • After all the above treatment we will be left with the columns “MeshThickness”, “PitchRate” and “PrimaryTemp”, we will fill the missing values in these columns using various missing value imputation techniques

Missing Value Treatment

  1. Firstly let’s delete the records where-ever “Name”, “Style” or ”BoilGravity” are Null, this can be done by the following command:
Row Deletion Output

Row Deletion Output

  1. Now let’s delete the columns where-ever the percentage of Null Values is very high( we discussed during the missing value identification step that “PrimingMethod” and “PrimingAmount” contain more than 90% missing values):
Column Deletion Output

Column Deletion Output

  1. Now let’s start imputing the missing values for the columns “MashThickness”, “PitchRate” and “PrimaryTemp”. We will impute all three of these separately to illustrate different techniques for mussing value imputation
  2. Imputing “MashThickness” null values using mean of the column “MashThickness”
MashThickness Imputation Output

MashThickness Imputation Output

  1. Imputing “PitchRate” null values using mode of the column “PitchRate”
PitchRate Imputation Output

PitchRate Imputation Output

  1. Imputing PrimaryTemp using KNN Imputer

The process in which a KNN model is used to fill missing values is known as KNN Imputation that is illustrated in the code below:

PrimaryTemp Imputation Output

PrimaryTemp Imputation Output

We have now successfully treated all the Null values in the dataset and the data is now ready to be analyzed further.

Complete Code with Github Link

Github Link

You may also like...