Data Visualization in Python: Deciding the right graph to represent the data
Introduction
Although choosing the best visualization for a dataset is a skill acquired over time, there are certain guidelines to keep in mind to ensure that the data is represented correctly.
We will discuss when to use, and how to create, each of the graphs covered in the sections below.
The dataset used to illustrate the examples in this article is publicly available and can be downloaded from this link
Ground Work
- Before we jump into the visualizations, let us first view the data

    # importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # loading the dataset
    df = pd.read_csv("all_seasons.csv")

    # viewing a snapshot of the dataset
    df.head()

- Now check how many records our data has using the “shape” attribute

    # getting the number of rows and columns of the dataframe
    print(df.shape)

- Let's now get some more general information about the data frame using the “info” method, which reports each column's data type and non-null count, so we can see how many null values the dataset contains

    # getting data types and non-null counts for every column
    df.info()

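Since “info” reports non-null counts rather than null counts, it can help to count the missing values directly. The snippet below is a minimal sketch on a small made-up data frame (the rows are invented for illustration; only the column names mirror the ones used in this article):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample, invented for illustration;
# it is not taken from all_seasons.csv.
sample = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "age": [24.0, np.nan, 31.0],
    "college": ["Duke", None, "Kansas"],
})

# Number of missing values in each column
null_counts = sample.isna().sum()
print(null_counts)

# Percentage of missing values in each column
null_share = sample.isna().mean() * 100
print(null_share.round(1))
```

The same two calls should work unchanged on the full data frame by replacing `sample` with `df`.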
Single Variable
Single Numeric Variable
For the first set of visualizations, let us consider the variable age. Age is a single numeric feature. For single numeric features, we can create histograms with density curves to better understand how the data is distributed over the feature's range.
We can also create box plots and violin plots to flag outliers and to get a sense of the median and quartiles of the variable.

    plt.figure(figsize=(18, 5))

    # histogram with a density curve
    plt.subplot(1, 3, 1)
    sns.histplot(df['age'], kde=True, color='red', bins=30)

    # box plot
    plt.subplot(1, 3, 2)
    sns.boxplot(y=df["age"])

    # violin plot
    plt.subplot(1, 3, 3)
    sns.violinplot(y=df["age"])

    plt.show()

By looking at the plots above, we can conclude that the age data is right-skewed and that most of the players are between 23 and 30 years old.
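A visual impression like this can be cross-checked numerically. The sketch below uses a short list of made-up ages (an assumption for illustration, not the real dataset) to compute the skewness, the quartiles, and the 1.5 × IQR fences that box plots use by default to mark outliers:

```python
import pandas as pd

# Hypothetical ages, invented for illustration (not taken from all_seasons.csv)
ages = pd.Series([19, 21, 22, 23, 23, 24, 24, 25, 25, 26,
                  26, 27, 28, 29, 30, 32, 35, 38, 40], name="age")

# Positive skewness indicates a longer right tail (right-skewed data)
skewness = ages.skew()

# Quartiles and the 1.5 * IQR fences used by default for box plot whiskers
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the ones a box plot draws as individual outliers
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"skewness: {skewness:.2f}")
print(f"median: {ages.median()}, middle 50%: {q1} to {q3}")
print("outliers:", outliers.tolist())
```

On the real data, the same steps can be applied to `df['age']` to confirm what the three plots suggest.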
Single Categorical Variable
When we need to understand the distribution of a categorical feature in a dataset, we can create a pie chart for that variable. The following example creates a pie chart of players by country.

    # Getting the count of players for the top 5 countries
    df_counts = df.groupby(['country'])['player_name'].agg('count').reset_index(name="Count_of_players")
    df_counts_top5 = df_counts.sort_values(by=['Count_of_players'], ascending=False)
    df_counts_top5 = df_counts_top5.head(5)

    # Plotting the percentage distribution of players across the top 5 countries
    fig1, ax1 = plt.subplots(figsize=(10, 10))
    ax1.pie(df_counts_top5['Count_of_players'],
            labels=df_counts_top5['country'],
            colors=["pink", "orange", "teal", "yellow", "cyan"],
            autopct='%1.1f%%')
    ax1.set_title('Distribution of players across the top 5 countries', fontsize=16)

From the chart above, we can see that, among the top 5 countries, about 95.4% of the players are from the USA.
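Note that a pie chart of a "top N" subset shows shares within that subset, not within the whole dataset. The sketch below illustrates the difference on a made-up list of countries (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample of player countries, invented for illustration
countries = pd.Series(["USA"] * 7 + ["Canada"] * 2 + ["France"], name="country")

# Share of each country among ALL players
share_overall = countries.value_counts(normalize=True) * 100

# Share of each country among the top 2 countries only,
# which is the kind of percentage a pie chart of a "top N" subset displays
top_counts = countries.value_counts().head(2)
share_within_top = top_counts / top_counts.sum() * 100

print(share_overall.round(1))
print(share_within_top.round(1))
```

Here the USA accounts for 70% of all players but a larger share of the top 2 countries, so it is worth stating which denominator a percentage refers to.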
Multiple Variables
Numeric-Numeric
When we want to explore the relationship between two numeric variables, the first thing we can do is draw a scatter plot of one against the other.
Below is an example of creating a scatter plot for the columns player_height and player_weight.

    # plotting a scatter plot between Player Height and Player Weight
    sns.relplot(x=df['player_height'], y=df['player_weight'], height=4, aspect=2/1)

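A scatter plot shows the shape of a relationship; a correlation coefficient can summarize its strength. The following is a minimal sketch on made-up heights and weights (the numbers and the units of centimetres and kilograms are assumptions for illustration):

```python
import pandas as pd

# Hypothetical heights (cm) and weights (kg), invented for illustration
sample = pd.DataFrame({
    "player_height": [185, 190, 198, 203, 208, 213],
    "player_weight": [82, 88, 95, 102, 110, 118],
})

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
r = sample["player_height"].corr(sample["player_weight"])
print(f"Pearson r = {r:.3f}")
```

The coefficient only captures linear association, so it complements the scatter plot rather than replacing it.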
Numeric-Categorical
Simple Bar Chart
When we want to analyze one numeric and one categorical variable, the choice of chart depends on the information we want to convey.
If we want to compare values across categories, we can use a simple bar chart, as created below.

    # Getting the count of players for the top 10 colleges
    df_counts = df.groupby(['college'])['player_name'].agg('count').reset_index(name="Count_of_players")
    df_counts_top10 = df_counts.sort_values(by=['Count_of_players'], ascending=False)
    df_counts_top10 = df_counts_top10.head(10)

    sns.catplot(x='college', y='Count_of_players', data=df_counts_top10,
                kind='bar', palette='GnBu', height=6, aspect=2)
    plt.title('Player Count across top 10 colleges')
    plt.show()

Stacked Bar Chart
If we want to show both comparison and composition across categories, we can use a stacked bar chart, as illustrated below:

    # Creating Weight_Brackets and Height_Brackets columns
    bins = [20, 40, 60, 80, 100, 120, 140, 160, 180]
    df['Weight_Brackets'] = pd.cut(df['player_weight'], bins)
    bins = [160, 170, 180, 190, 200, 210, 220, 230, 240]
    df['Height_Brackets'] = pd.cut(df['player_height'], bins)

    f = plt.figure(figsize=(15, 10))
    ax = f.add_subplot(1, 1, 1)

    # One bar per team, stacked by height bracket
    sns.histplot(data=df, ax=ax, stat="count", multiple="stack",
                 x="team_abbreviation", kde=False,
                 palette="pastel", hue="Height_Brackets",
                 element="bars", legend=True)
    ax.set_title("Distribution of Height of players across Teams")
    ax.set_xlabel("Team")
    ax.set_ylabel("Player Count")

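To see the numbers behind a stacked bar chart, we can build the count table explicitly. The sketch below is an alternative route, not the method used above: it swaps in `pd.crosstab` and the pandas plotting interface, and it runs on a made-up sample (the teams and heights are invented; only the column names mirror the article):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import pandas as pd

# Hypothetical sample, invented for illustration
sample = pd.DataFrame({
    "team_abbreviation": ["LAL", "LAL", "LAL", "BOS", "BOS", "CHI"],
    "player_height": [185, 201, 205, 195, 211, 188],
})
sample["Height_Brackets"] = pd.cut(sample["player_height"], [180, 190, 200, 210, 220])

# Rows are teams, columns are height brackets, cells are player counts
composition = pd.crosstab(sample["team_abbreviation"], sample["Height_Brackets"])
print(composition)

# Each row becomes one bar; each column becomes one stacked segment
ax = composition.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_xlabel("Team")
ax.set_ylabel("Player Count")
```

Having the table available makes it easy to check a specific segment of the chart against its exact count.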
Grouped Bar Chart
Finally, if we want to compare values across combinations of two categorical variables, a grouped bar chart is a good choice, as illustrated below:

    # Counting players in each height and weight segment
    df_weight_height = df.groupby(['Height_Brackets', 'Weight_Brackets'], as_index=False)['player_name'].count()

    # Plotting height and weight segments as grouped bars
    p = sns.catplot(x='Height_Brackets', y='player_name', hue='Weight_Brackets',
                    data=df_weight_height, kind='bar', palette='bright',
                    height=4, aspect=2)
    plt.ylabel('Total Players')

Numeric-Date
Last but not least, when visualizing time-series data, the best way to represent it is a line chart.

    # Creating dataframe for time series chart
    df_year_players = df.groupby(['draft_year'])['player_name'].agg('count').reset_index(name="Count_of_players")
    # Dropping the "Undrafted" row and renumbering the remaining rows from 0
    df_year_players = df_year_players[df_year_players['draft_year'] != "Undrafted"].reset_index(drop=True)

    # Creating time series line chart
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(20, 7))
    g = sns.lineplot(x="draft_year", y="Count_of_players", data=df_year_players,
                     color='red', linewidth=1.5, marker="o")
    plt.setp(g.get_xticklabels(), rotation=45)
    plt.xlabel('Year', size=20)
    plt.ylabel('Total Players', size=20)
    plt.title("Total players across years (1963-2019)", size=20)

    # Annotating each point with its player count
    for i, count in enumerate(df_year_players['Count_of_players']):
        plt.annotate(str(count), xy=(i, count + 10), ha='center', va='top', size=12)
    plt.show()

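The code above treats draft_year as text, so the years are plotted as evenly spaced categories. If a true numeric time axis is preferred, the column can be converted first. The following is a minimal sketch on made-up values (the rows are invented for illustration; it assumes, as the filter above suggests, that the column holds strings such as "Undrafted"):

```python
import pandas as pd

# Hypothetical rows, invented for illustration
sample = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E", "F", "G"],
    "draft_year": ["1996", "1996", "1997", "Undrafted", "1998", "1998", "1998"],
})

# Non-numeric entries such as "Undrafted" become NaN and are then dropped
years = pd.to_numeric(sample["draft_year"], errors="coerce").dropna().astype(int)

# Players per year, ordered chronologically and ready for a line chart
per_year = years.value_counts().sort_index()
print(per_year)
```

With numeric years, gaps between years are spaced proportionally on the x-axis instead of being collapsed into adjacent categories.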
We now have a clearer understanding of which chart to choose to represent the data in the best possible manner.
Complete Code with Github Link

    # importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # loading the dataset
    df = pd.read_csv("all_seasons.csv")

    # viewing a snapshot of the dataset
    df.head()

    # getting the number of rows and columns of the dataframe
    print(df.shape)

    # getting data types and non-null counts for every column
    df.info()

    # Single numeric variable: histogram, box plot and violin plot of age
    plt.figure(figsize=(18, 5))
    plt.subplot(1, 3, 1)
    sns.histplot(df['age'], kde=True, color='red', bins=30)
    plt.subplot(1, 3, 2)
    sns.boxplot(y=df["age"])
    plt.subplot(1, 3, 3)
    sns.violinplot(y=df["age"])

    # Getting the count of players for the top 5 countries
    df_counts = df.groupby(['country'])['player_name'].agg('count').reset_index(name="Count_of_players")
    df_counts_top5 = df_counts.sort_values(by=['Count_of_players'], ascending=False)
    df_counts_top5 = df_counts_top5.head(5)

    # Plotting the percentage distribution of players across the top 5 countries
    fig1, ax1 = plt.subplots(figsize=(10, 10))
    ax1.pie(df_counts_top5['Count_of_players'],
            labels=df_counts_top5['country'],
            colors=["pink", "orange", "teal", "yellow", "cyan"],
            autopct='%1.1f%%')
    ax1.set_title('Distribution of players across the top 5 countries', fontsize=16)

    # plotting a scatter plot between Player Height and Player Weight
    sns.relplot(x=df['player_height'], y=df['player_weight'], height=4, aspect=2/1)

    # Getting the count of players for the top 10 colleges
    df_counts = df.groupby(['college'])['player_name'].agg('count').reset_index(name="Count_of_players")
    df_counts_top10 = df_counts.sort_values(by=['Count_of_players'], ascending=False)
    df_counts_top10 = df_counts_top10.head(10)
    sns.catplot(x='college', y='Count_of_players', data=df_counts_top10,
                kind='bar', palette='GnBu', height=6, aspect=2)
    plt.title('Player Count across top 10 colleges')
    plt.show()

    # Creating Weight_Brackets and Height_Brackets columns
    bins = [20, 40, 60, 80, 100, 120, 140, 160, 180]
    df['Weight_Brackets'] = pd.cut(df['player_weight'], bins)
    bins = [160, 170, 180, 190, 200, 210, 220, 230, 240]
    df['Height_Brackets'] = pd.cut(df['player_height'], bins)

    # Stacked bar chart: one bar per team, stacked by height bracket
    f = plt.figure(figsize=(15, 10))
    ax = f.add_subplot(1, 1, 1)
    sns.histplot(data=df, ax=ax, stat="count", multiple="stack",
                 x="team_abbreviation", kde=False,
                 palette="pastel", hue="Height_Brackets",
                 element="bars", legend=True)
    ax.set_title("Distribution of Height of players across Teams")
    ax.set_xlabel("Team")
    ax.set_ylabel("Player Count")

    # Grouped bar chart: plotting height and weight segments
    df_weight_height = df.groupby(['Height_Brackets', 'Weight_Brackets'], as_index=False)['player_name'].count()
    p = sns.catplot(x='Height_Brackets', y='player_name', hue='Weight_Brackets',
                    data=df_weight_height, kind='bar', palette='bright',
                    height=4, aspect=2)
    plt.ylabel('Total Players')

    # Creating dataframe for time series chart
    df_year_players = df.groupby(['draft_year'])['player_name'].agg('count').reset_index(name="Count_of_players")
    # Dropping the "Undrafted" row and renumbering the remaining rows from 0
    df_year_players = df_year_players[df_year_players['draft_year'] != "Undrafted"].reset_index(drop=True)

    # Creating time series line chart
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(20, 7))
    g = sns.lineplot(x="draft_year", y="Count_of_players", data=df_year_players,
                     color='red', linewidth=1.5, marker="o")
    plt.setp(g.get_xticklabels(), rotation=45)
    plt.xlabel('Year', size=20)
    plt.ylabel('Total Players', size=20)
    plt.title("Total players across years (1963-2019)", size=20)

    # Annotating each point with its player count
    for i, count in enumerate(df_year_players['Count_of_players']):
        plt.annotate(str(count), xy=(i, count + 10), ha='center', va='top', size=12)
    plt.show()
