Descriptive and Univariate Analysis using Python

Introduction

What are descriptive and univariate analysis?

Descriptive and univariate analysis are the first step in building an understanding of the data. In this phase the data practitioner applies simple techniques, such as viewing a snapshot of the data, computing the mean and standard deviation of numerical variables, and plotting density plots and histograms, to get a first impression of the data.

Main aim of descriptive and univariate analysis: To get a high-level impression about the data

The major techniques for descriptive and univariate analysis discussed in this article are outlined below:

Article Flow

We will now go through each of these phases in detail.
The dataset used for the examples in this article is publicly available and can be downloaded from this link

Ground Work

  1. Before we dive deeper into the descriptive and univariate statistics, let's first load the data and view a snapshot of it using the “head” command
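As a minimal sketch of this step, assuming pandas is used: the two rows below are made-up stand-ins that only mimic the seven columns described in this article, and "books.csv" is an assumed file name, not the dataset's actual one.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV: two made-up rows that
# only mimic the seven columns described in this article.
sample_csv = io.StringIO(
    "Name,Author,User Rating,Reviews,Price,Year,Genre\n"
    "Book A,Author A,4.7,17350,8,2016,Non Fiction\n"
    "Book B,Author B,4.6,2052,22,2011,Fiction\n"
)

# With the real file, pass its path instead, e.g. pd.read_csv("books.csv")
# ("books.csv" is an assumed file name).
df = pd.read_csv(sample_csv)

# head() returns the first five rows by default (fewer if the frame is shorter).
print(df.head())
```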

Head Output

We can make the following conclusions about the data from the above information:

    • We have three character columns: “Name”, “Author” and “Genre”. Of these, “Genre” is a categorical column
    • We have four numeric columns: “User Rating”, “Reviews”, “Price” and “Year”. Note that “Year” is a time variable rather than an ordinary measurement
  1. Now check how many records our data has using the “shape” command
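A short sketch of the row and column count, assuming a pandas data-frame; the rows below are made-up stand-ins, so the numbers printed here are not those of the real dataset. Note that in pandas “shape” is an attribute rather than a function, so it takes no parentheses.

```python
import pandas as pd

# Made-up stand-in rows; the real frame would come from the article's CSV.
df = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Genre": ["Fiction", "Non Fiction", "Fiction"],
})

# shape is an attribute holding the tuple (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```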

Shape Output

We can conclude from the above output that the data consists of 550 rows and 7 columns

  1. Let's now get some more general information about the data-frame using the “info” command
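The call might look like the sketch below, again on made-up stand-in rows. The explicit missing-value count is an addition of ours that cross-checks the non-null counts printed by “info”.

```python
import pandas as pd

# Made-up stand-in rows with one text, one float and one integer column.
df = pd.DataFrame({
    "Name": ["Book A", "Book B"],
    "User Rating": [4.7, 4.6],
    "Reviews": [17350, 2052],
})

# info() prints column names, non-null counts, dtypes and memory usage.
df.info()

# A per-column count of missing values confirms what info() reports.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```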

Info Output

We can make the following remarks by looking at the above output:

    • We have 7 columns
    • Three columns have the data-type object, three have int64 and one has float64
    • There are no missing values in our dataset
    • Our data has 550 records

Descriptive Statistics

  1. Let's now compute some descriptive statistics, which can easily be done with the “describe” function
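A minimal sketch of this call on made-up stand-in values (the figures printed here are illustrative, not the dataset's):

```python
import pandas as pd

# Made-up stand-in values for three of the numeric columns.
df = pd.DataFrame({
    "User Rating": [4.2, 4.6, 4.7],
    "Price": [5, 10, 30],
    "Year": [2009, 2014, 2019],
})

# describe() summarizes numeric columns by default: count, mean, std,
# min, quartiles and max.
summary = df.describe()
print(summary)

# Individual figures can be read off by row label and column name.
print("Costliest book:", summary.loc["max", "Price"])
```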

Describe Output

Some interesting things to note from the above results are:

    • Year lies between 2009 and 2019
    • The highest rating received by any book is 4.9
    • The lowest rating received by any book is 3.3
    • The costliest book is worth $105
    • The average book price is $13
    • The maximum reviews received on any book is 87,841
  1. Getting the mean and median for variables “User Rating” and “Reviews”
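One way to get both statistics in a single call is sketched below, using made-up stand-in values; the use of “agg” is our assumption about how this could be written.

```python
import pandas as pd

# Made-up stand-in values for the two columns of interest.
df = pd.DataFrame({
    "User Rating": [4.2, 4.6, 4.7],
    "Reviews": [1000, 2000, 6000],
})

# agg() applies each named statistic to each selected column and
# returns a small table with one row per statistic.
center = df[["User Rating", "Reviews"]].agg(["mean", "median"])
print(center)
```

When the mean sits well above the median, as for “Reviews” in this toy example, a long right tail is the likely cause.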

Mean Median Output

  1. Getting the mode (mode is the most frequently occurring value for a variable) for the “Genre” column
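A small sketch of the mode calculation on a made-up stand-in column:

```python
import pandas as pd

# Made-up stand-in for the "Genre" column.
genre = pd.Series(["Non Fiction", "Fiction", "Non Fiction"], name="Genre")

# mode() returns a Series because ties are possible; iloc[0] takes the
# first (here the only) most frequent value.
most_common_genre = genre.mode().iloc[0]
print(most_common_genre)
```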

Mode Output

Therefore the most frequently occurring “Genre” in the dataset is “Non Fiction”

  1. Getting percentage distribution of books across each genre
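The sketch below shows one way to compute and plot these percentages. It assumes matplotlib for the pie chart (the plotting library behind the chart in this article is not shown) and uses a made-up stand-in column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the "Genre" column.
genre = pd.Series(
    ["Non Fiction", "Fiction", "Non Fiction", "Non Fiction"], name="Genre"
)

# normalize=True turns counts into fractions; multiply by 100 for percent.
genre_pct = genre.value_counts(normalize=True) * 100
print(genre_pct.round(1))

# autopct writes the percentage onto each wedge of the pie.
fig, ax = plt.subplots()
ax.pie(genre_pct.to_numpy(), labels=genre_pct.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of books by genre")
```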

Pie Genre Output

  1. Getting percentage distribution of books across each year
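The same idea applied per year, sketched on a made-up stand-in column with matplotlib assumed for the chart. The extra check at the end is ours: it tests numerically whether every year holds the same share.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the "Year" column: two books per year.
year = pd.Series([2009, 2009, 2010, 2010, 2011, 2011], name="Year")

# Percentage of books per year, ordered chronologically.
year_pct = (year.value_counts(normalize=True) * 100).sort_index()
print(year_pct.round(1))

# True when every year has the same share of books.
evenly_spread = year_pct.round(1).nunique() == 1
print("Evenly spread across years:", evenly_spread)

fig, ax = plt.subplots()
ax.pie(year_pct.to_numpy(), labels=year_pct.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of books by year")
```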

Pie Year Output

We can conclude from the above chart that the count of books is evenly distributed across all years at 9.1%

Univariate Analysis

In univariate analysis we examine each feature individually, summarizing it with distribution plots, histograms and box-plots.

  1. Creating distribution plots for numerical features
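A possible sketch of this step, assuming plain matplotlib histograms (the library used for the plots in this article is not shown) and made-up stand-in values. The skewness figures are an addition of ours that puts a number on what the plots show.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in values, chosen so the tails are easy to see.
df = pd.DataFrame({
    "User Rating": [3.3, 4.5, 4.6, 4.7, 4.8],
    "Reviews": [100, 200, 300, 400, 5000],
    "Price": [5, 8, 10, 12, 60],
})
numeric_cols = ["User Rating", "Reviews", "Price"]

# One histogram per numeric column, side by side.
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(12, 3))
for ax, col in zip(axes, numeric_cols):
    ax.hist(df[col].to_numpy(), bins=10)
    ax.set_title(col)
fig.tight_layout()

# skew() below 0 suggests a longer left tail, above 0 a longer right tail.
skewness = df[numeric_cols].skew()
print(skewness)
```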

Distribution Plot Output

Looking at the plots above, we can see that the “User Rating” feature is left-skewed, while the “Reviews” and “Price” variables are right-skewed

  1. Creating box plots to understand the range of variables and locate outliers
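The sketch below draws the box plots with matplotlib (an assumption, as the plotting library is not shown) on made-up stand-in values. The helper `iqr_outliers` is a hypothetical function of ours that applies the 1.5 × IQR rule, which is also matplotlib's default for deciding which points fall beyond the whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in values with one obvious outlier per column.
df = pd.DataFrame({
    "User Rating": [3.3, 4.5, 4.6, 4.7, 4.8],
    "Reviews": [100, 200, 300, 400, 5000],
    "Price": [5, 8, 10, 12, 60],
})


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]


# One box plot per column; points beyond the whiskers are drawn individually.
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    ax.boxplot(df[col].to_numpy())
    ax.set_title(col)
    print(col, "outliers:", iqr_outliers(df[col]).tolist())
fig.tight_layout()
```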

Box Plot Output

From the box-plots above we can infer that:

    • outliers lie at the lower end (3.4 to 4.0) for the “User Rating” variable
    • outliers lie at the upper end (greater than 40k) for the “Reviews” variable
    • outliers lie at the upper end (greater than 40) for the “Price” variable
  1. Creating violin plots to better understand the distribution of the numerical features
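A minimal sketch using matplotlib's own violin plot (an assumption, as the plotting library is not shown) on made-up stand-in values. The widest part of each violin marks where values are most concentrated, and the medians are printed for comparison.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in values for the numeric columns.
df = pd.DataFrame({
    "User Rating": [3.3, 4.5, 4.6, 4.7, 4.8],
    "Reviews": [100, 200, 300, 400, 5000],
    "Price": [5, 8, 10, 12, 60],
})

# One violin per column; showmedians adds a horizontal line at the median.
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    parts = ax.violinplot(df[col].to_numpy(), showmedians=True)
    ax.set_title(col)
    ax.set_xticks([])  # a single violin needs no x-axis ticks
fig.tight_layout()

medians = df.median()
print(medians)
```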

Violin Plot Output

From the violin plots above we can infer that:

    • Most of the values for “User Rating” are concentrated around 4.75
    • Most of the values for “Reviews” are concentrated around 12k
    • Most of the values for “Price” are concentrated around $13

To conclude, some of the key learnings from the descriptive and univariate analysis are:

  • There are no null values in the dataset
  • The average book price is $13
  • Non-fictional books form 54% of the books in the entire dataset
  • Skewness is observed in the numerical variables

Complete Code with Github Link

Github Link
