Data Visualization in Python: Deciding the right graph to represent the data

Introduction

Although choosing the best visualization to represent data is a skill acquired over time, there are certain guidelines to keep in mind to ensure that the data is represented correctly.

We will discuss in detail when to use, and how to create, each of the graphs listed below.

Visualization Flow-Chart

The dataset used to illustrate the examples in this article is publicly available and can be downloaded from this link.

Ground Work

  1. Before we jump into the visualizations, let us first view the data.
Head Output

  2. Now check how many records our data has using the “shape” attribute.
Shape Output

  3. Let's now get some more general information about the data frame using the “info” method and see how many null values there are in the dataset.
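The three steps above can be sketched as follows. This is a minimal illustration and not the article's original code: the small data frame is a hypothetical stand-in for the dataset, and its column names (`name`, `age`, `country`, `height`, `weight`) are assumptions. With the real file you would load the data with `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset; column names are assumptions.
players = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "age": [23, 27, 31, 25, None],          # one missing value on purpose
    "country": ["USA", "USA", "Canada", "USA", "France"],
    "height": [198.1, 205.7, 190.5, 201.9, 195.6],
    "weight": [95.3, 108.9, 88.5, 102.1, 93.0],
})

# 1. View the first few rows of the data
print(players.head())

# 2. shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = players.shape
print(n_rows, n_cols)

# 3. info() prints column dtypes and non-null counts
players.info()

# Explicit count of null values per column
null_counts = players.isnull().sum()
print(null_counts)
```

Note that `shape` is accessed without parentheses, while `head()` and `info()` are method calls.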

Single Variable

Single Numeric Variable

For the first set of visualizations, let us consider the variable age. Age is a single numeric feature, and for such features we can create density plots to better understand how the data is distributed over the feature’s range.

We can also create box plots and violin plots to flag outliers and to get a sense of the median and percentiles of the variable.
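A rough sketch of these three plots with matplotlib is shown below. The ages are synthetic and generated only for illustration, so the figure will not match the article's output. A normalized histogram stands in for the density plot here; pandas' `Series.plot.kde` would give a smooth curve instead, but it requires SciPy.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed ages (an assumption, not the article's data)
rng = np.random.default_rng(0)
ages = pd.Series(
    np.clip(20 + rng.gamma(shape=2.0, scale=3.0, size=500), 18, 45),
    name="age",
)

fig, (ax_hist, ax_box, ax_violin) = plt.subplots(1, 3, figsize=(12, 4))

# Density: histogram normalized so that the bar areas sum to 1
density, bin_edges, _ = ax_hist.hist(ages.to_numpy(), bins=20, density=True)
ax_hist.set_title("Density (normalized histogram)")
ax_hist.set_xlabel("age")

# Box plot: median, quartiles, and points beyond the whiskers (outliers)
ax_box.boxplot(ages.to_numpy())
ax_box.set_title("Box plot")
ax_box.set_ylabel("age")

# Violin plot: the distribution's shape, with the median marked
ax_violin.violinplot(ages.to_numpy(), showmedians=True)
ax_violin.set_title("Violin plot")
ax_violin.set_ylabel("age")

fig.tight_layout()
```

Call `plt.show()` afterwards to display the figure when running the code as a script.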

Single Numeric Variable Plots

From the plots above we can conclude that the age data is right-skewed and that most of the players are between 23 and 30 years old.

Single Categorical Variable

When we need to understand the distribution of a categorical feature in a dataset, we can create a pie chart for that variable. The following is an example of a pie chart created for Country.
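One possible way to build such a pie chart is sketched below. The country values are made up for illustration, and the column name `country` is an assumption. `value_counts()` returns the categories sorted by frequency, so `head(5)` keeps the five most common ones.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up country values (an assumption, not the article's data)
countries = pd.Series(
    ["USA"] * 12 + ["Canada"] * 5 + ["France"] * 4
    + ["Spain"] * 3 + ["Serbia"] * 2 + ["Brazil"] * 1,
    name="country",
)

# Count players per country and keep the five most frequent
top5 = countries.value_counts().head(5)

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    top5.to_numpy(),
    labels=list(top5.index),
    autopct="%1.1f%%",   # print each slice's share of the plotted total
    startangle=90,
)
ax.set_title("Players by country (top 5)")
```

The percentages printed on the slices are shares of the plotted total, that is, of the top 5 countries only and not of the whole dataset.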

Pie Chart

From the chart above we can see that, among the top 5 countries, about 95.4% of the players are from the USA.

Multiple Variables

Numeric-Numeric

When we want to figure out the relationship between two numeric variables, the first thing we can do is draw a scatter plot of one against the other.

Below is an example of a scatter plot for the columns height and weight.
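A minimal sketch of such a scatter plot is given below. The heights and weights are randomly generated with a built-in positive relationship, purely for illustration; the column names `height` and `weight` follow the text, but the values and units are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data with a positive height/weight relationship (an assumption)
rng = np.random.default_rng(1)
height = rng.normal(200, 9, size=200)
weight = 0.9 * height - 80 + rng.normal(0, 6, size=200)
players = pd.DataFrame({"height": height, "weight": weight})

fig, ax = plt.subplots()
ax.scatter(players["height"], players["weight"], alpha=0.6)
ax.set_xlabel("height")
ax.set_ylabel("weight")
ax.set_title("Player height vs. weight")

# Pearson correlation as a numeric summary of what the plot shows
corr = players["height"].corr(players["weight"])
print(round(corr, 2))
```

A cloud of points rising from left to right, together with a correlation well above zero, suggests that taller players tend to be heavier.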

Scatter Plot across player height and weight

Numeric-Categorical

Simple Bar Chart

When we want to analyze one numeric and one categorical variable, the choice of chart depends on the information we want to convey.

If we want to compare values across categories, we can use a simple bar chart, as created below.
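As a rough illustration, a simple bar chart can be built by aggregating the numeric variable per category and plotting one bar per category. The columns `team` and `points` below are hypothetical and are not taken from the article's dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one categorical column and one numeric column
players = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C"],
    "points": [10, 14, 8, 12, 20, 16],
})

# Aggregate the numeric variable per category
mean_points = players.groupby("team")["points"].mean()

fig, ax = plt.subplots()
ax.bar(list(mean_points.index), mean_points.to_numpy())
ax.set_xlabel("team")
ax.set_ylabel("mean points")
ax.set_title("Mean points per team")
```

The aggregation (mean here) is a choice; a sum or a count may suit other questions better.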

Bar Chart Output

Stacked Bar Chart

If we want to show both comparison and composition across categories, we can use a stacked bar chart, as illustrated below:
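One way to sketch a stacked bar chart is to cross-tabulate two categorical columns and let pandas stack the counts. The columns `team` and `position` are hypothetical and used only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two categorical columns
df = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "position": ["Guard", "Forward", "Guard", "Center",
                 "Guard", "Forward", "Forward", "Center"],
})

# Rows are teams, columns are positions, cells are player counts
counts = pd.crosstab(df["team"], df["position"])

# stacked=True piles each team's position counts into a single bar
ax = counts.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("number of players")
ax.set_title("Team size and composition by position")
```

The total bar height allows comparison between teams, while the colored segments show each team's composition.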

Stacked Bar Output

Grouped Bar Chart

Finally, if we want to compare values across several categorical segments, the best chart to portray this information is a grouped bar chart, as illustrated below:
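A grouped bar chart can be sketched by pivoting the data so that one categorical variable forms the groups on the x-axis and the other forms the bars within each group. The columns `season`, `team`, and `points` are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two categorical columns and one numeric column
df = pd.DataFrame({
    "season": ["2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020"],
    "team": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "points": [10, 14, 8, 12, 16, 18, 9, 11],
})

# Rows become the groups on the x-axis, columns become the bars in each group
means = df.pivot_table(index="season", columns="team",
                       values="points", aggfunc="mean")

# Without stacked=True, pandas draws the columns side by side
ax = means.plot(kind="bar", rot=0)
ax.set_ylabel("mean points")
ax.set_title("Mean points per team, grouped by season")
```

Placing the bars side by side makes it easy to compare teams within a season and the same team across seasons.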

Grouped Bar Chart

Numeric-Date

Last but not least, when visualizing time-series data, the best way to represent it is a line chart.
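A minimal line chart for a time series might look like the sketch below. The monthly values are invented for illustration and do not come from the article's dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented monthly series (an assumption, not the article's data)
dates = pd.date_range("2020-01-01", periods=12, freq="MS")  # month starts
values = [102, 98, 110, 115, 120, 118, 125, 131, 128, 135, 140, 138]
ts = pd.Series(values, index=dates, name="value")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(ts.index, ts.to_numpy(), marker="o")
ax.set_xlabel("date")
ax.set_ylabel("value")
ax.set_title("Monthly values over time")

# Rotate the date labels so they do not overlap
fig.autofmt_xdate()
```

Keeping the dates as a `DatetimeIndex` rather than plain strings lets matplotlib space the points correctly along the time axis.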

Time-Series Line Plot

We now have a clear understanding of which chart to choose to represent the data in the best possible manner.

Complete Code with Github Link

Github Link
