Visually Analyzing Multivariate Relationships using Python
Introduction
What are multivariate relationships?
The relationships that exist between two or more variables are known as multivariate relationships.
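The visual representations of multivariate relationships that we will be discussing in this article are listed below:
- Scatter plot
- Bar chart
- Multi-line graph
- Grouped bar chart
- Stacked bar chart
- Multi-grid grouped bar chart
- Combo chart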
We will now go through each of the above defined visualizations in a detailed manner.
The dataset that has been used for the examples illustrated in this article is publicly available and can be downloaded from this link
Ground Work
- Before we jump into the visualizations, let us first view the data

    # importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np

    # loading the dataset
    df = pd.read_csv("Churn_Modelling.csv")

    # viewing a snapshot of the dataset
    df.head()
- Now let us check how many records our data has using the “shape” attribute

    # getting the number of rows and columns of the dataframe as a (rows, columns) tuple
    print(df.shape)
We can conclude from the above output that the data consists of 10,000 rows and 14 columns
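To make the output of “shape” concrete, here is a minimal sketch on a tiny made-up dataframe. The name `sample` and all of its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration
sample = pd.DataFrame({
    "Age": [25, 41, 58],
    "EstimatedSalary": [52000.0, 98000.0, 101000.0],
})

# shape is an attribute (no parentheses) that holds a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple like this is handy when the row count is needed later, for example to turn counts into percentages.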
Numerical Features
- Let’s plot salary against age to identify whether any relationship exists between the two

    # plotting scatter plot between age and salary, coloured by age
    sns.relplot(data=df, x='Age', y='EstimatedSalary', hue='Age', height=6, aspect=2)
- The above scatter plot is slightly tough to interpret, so let’s divide the age feature into bins and find the mean salary in each age bracket

    # Creating age brackets; pd.cut builds right-closed intervals by default,
    # so an age of exactly 20 or below falls outside the first bracket
    bins = [20, 30, 40, 50, 60, 70, 80, 90, 100]
    df['Age_Brackets'] = pd.cut(df['Age'], bins)
    df_age_salary = df.groupby('Age_Brackets', as_index=False, observed=False)['EstimatedSalary'].mean()

    # Plotting mean salary across age brackets
    sns.catplot(x='Age_Brackets', y='EstimatedSalary', data=df_age_salary, kind='bar', palette='GnBu', height=4, aspect=2)
    plt.title('Mean Salary across Age Brackets')
    plt.show()

We can conclude from the above plot that the mean salary of older customers is slightly higher than that of younger customers
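To see how the binning step assigns ages to brackets, here is a small sketch on made-up data. The dataframe `toy` and its values are invented for illustration. It shows that the default intervals are closed on the right, so an age that sits exactly on the lowest bin edge is not assigned to any bracket.

```python
import pandas as pd

# Hypothetical ages and salaries, invented for illustration
toy = pd.DataFrame({
    "Age": [20, 25, 30, 31, 45],
    "EstimatedSalary": [40000.0, 50000.0, 60000.0, 70000.0, 90000.0],
})

bins = [20, 30, 40, 50]
# Default intervals are right-closed: (20, 30], (30, 40], (40, 50]
toy["Age_Brackets"] = pd.cut(toy["Age"], bins)

# Age 20 lies on the open left edge of the first bracket, so it becomes NaN
print(toy)

# Mean salary per bracket; rows with a NaN bracket are left out of the groups
mean_salary = toy.groupby("Age_Brackets", observed=False)["EstimatedSalary"].mean()
print(mean_salary)
```

If the youngest customers should be kept, one option is to lower the first bin edge or to pass `include_lowest=True` to `pd.cut`.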
Numerical and Categorical Features
There are numerous ways to plot numeric-categorical relationships. Let’s see them one by one
- Let us try to understand how people are distributed across age and country using a multi-line graph

    # Plotting line chart to visualize the distribution of people across age and country
    df_geo_age = df.groupby(['Age', 'Geography'], as_index=False).count()
    plt.figure(figsize=(10, 7))
    # keep the Axes returned by lineplot so that the y-axis can be relabelled
    ax = sns.lineplot(x="Age", y="RowNumber", hue="Geography", data=df_geo_age)
    ax.set_ylabel("Count")
Findings from the above plot:
- Most people in the dataset belong to France
- Germany and Spain contain an almost equal number of people
- Most people are aged between 30 and 50 years
- France has a large concentration of people aged between 30 and 50 years
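The counts behind a multi-line graph like this can also be checked as a table. The sketch below is one possible way to do it on made-up data: it uses `size()` to count rows per group directly, instead of calling `count()` and reading the result from an arbitrary column. The dataframe `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical customers, invented for illustration
toy = pd.DataFrame({
    "Age": [30, 30, 30, 45, 45],
    "Geography": ["France", "France", "Spain", "France", "Germany"],
})

# size() counts the rows in each (Age, Geography) group
counts = (
    toy.groupby(["Age", "Geography"])
       .size()
       .reset_index(name="Count")
)
print(counts)

# Country with the most people overall
top_country = toy["Geography"].value_counts().idxmax()
print(top_country)
```

Because the result already has a column named `Count`, it can be passed to a line plot without relabelling the y-axis afterwards.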
- Next, let’s plot the average balance across gender and geography using a grouped bar chart to understand which segment has the highest balance on average

    # mean balance for each gender and geography segment
    df_bal_gender_geo = df.groupby(['Gender', 'Geography'], as_index=False)['Balance'].mean()
    p = sns.catplot(x='Geography', y='Balance', hue='Gender', data=df_bal_gender_geo, kind='bar', palette='bright', height=4, aspect=1.5)
Findings from the above plot:
- On average, German customers hold a higher balance in their accounts than French and Spanish customers
- Males have a higher mean balance in their accounts than females
- German males have the highest mean balance of all the segments
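Findings of this kind can be read off the grouped means directly rather than estimated from bar heights. Here is a minimal sketch on made-up data. The dataframe `toy` and every balance in it are invented for illustration and are not the dataset’s values.

```python
import pandas as pd

# Hypothetical balances, invented for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Male", "Female", "Male", "Female"],
    "Geography": ["France", "France", "France", "Germany", "Germany", "Germany", "Spain", "Spain"],
    "Balance": [50000.0, 70000.0, 58000.0, 120000.0, 122000.0, 119000.0, 63000.0, 59000.0],
})

# Mean balance per (Geography, Gender) segment
means = toy.groupby(["Geography", "Gender"])["Balance"].mean()

# One row per country and one column per gender, mirroring the grouped bars
print(means.unstack("Gender"))

# Segment with the highest mean balance, returned as a (Geography, Gender) tuple
top_segment = means.idxmax()
print(top_segment)
```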
- Let’s now see what the male-female distribution looks like across Tenure by using a stacked bar chart

    # Plotting stacked bar chart to represent the distribution of males and females across Tenure
    f = plt.figure(figsize=(8, 5))
    ax = f.add_subplot(1, 1, 1)
    sns.histplot(data=df, ax=ax, stat="count", multiple="stack", x="Tenure", kde=False, palette="pastel", hue="Gender", element="bars", legend=True)
    ax.set(ylim=(0, 1400))
    ax.set_title("Distribution of males and females across Tenure")
    ax.set_xlabel("Tenure")
    ax.set_ylabel("Count")
Findings from the above plot:
- Most people have a tenure of between 1 and 9 years
- Males and females are almost equally distributed across all tenures
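The numbers that a stacked bar chart draws can be tabulated with a cross-tabulation. Below is a small sketch on made-up data. The dataframe `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical customers, invented for illustration
toy = pd.DataFrame({
    "Tenure": [1, 1, 1, 2, 2, 3],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# Rows are tenures, columns are genders, cells are counts (0 where a pair is absent)
table = pd.crosstab(toy["Tenure"], toy["Gender"])

# The row total corresponds to the full height of each stacked bar
table["Total"] = table.sum(axis=1)
print(table)
```

Each gender column corresponds to one coloured layer of a bar, so the table makes it easy to check whether the layers really are of similar size.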
- Let’s now find out which geography-gender segment has the most credit card holders using a multi-grid grouped bar chart

    # Let us now check out the number of people across Geography and Gender that have a credit card
    df_geo_credit_gender = df.groupby(['Geography', 'HasCrCard', 'Gender'], as_index=False).count()
    grid_plot_credit = sns.catplot(x='Geography', y='RowNumber', hue='Gender', col='HasCrCard', data=df_geo_credit_gender, kind='bar', palette='bright', height=6, aspect=0.7)
    # label the axes through the grid so that both facets are covered
    grid_plot_credit.set_axis_labels('Geography', 'Count')
    # reposition the figure-level legend (move_legend needs seaborn 0.11.2 or later)
    sns.move_legend(grid_plot_credit, 'upper right')
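Since HasCrCard is a 0/1 flag in the code above, its sum within a segment is the number of card holders and its mean is the share of card holders. The sketch below illustrates this on made-up data. The dataframe `toy`, its values and the output column names `Holders` and `Share` are invented for illustration.

```python
import pandas as pd

# Hypothetical customers, invented for illustration; HasCrCard is a 0/1 flag
toy = pd.DataFrame({
    "Geography": ["France", "France", "France", "Germany", "Germany", "Spain"],
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "HasCrCard": [1, 1, 0, 1, 1, 0],
})

# sum of a 0/1 flag = number of holders, mean = share of holders in the segment
holders = toy.groupby(["Geography", "Gender"])["HasCrCard"].agg(
    Holders="sum",
    Share="mean",
)
print(holders)

# Segment with the most card holders, returned as a (Geography, Gender) tuple
top_segment = holders["Holders"].idxmax()
print(top_segment)
```

Looking at the share next to the raw count helps to separate segments that have many card holders from segments that are simply large.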
- Finally let’s try to create a combo-chart that reflects the count of people and estimated salary across tenure in the same plot
    # Creating dataframe to plot the combo chart
    df_tenure_people_count = df.groupby(['Tenure'], as_index=False).count()
    df_tenure_people_count = df_tenure_people_count[['Tenure', 'RowNumber']]
    df_tenure_people_count.columns = ['Tenure', 'Counts']
    df_tenure_salary_mean = df.groupby(['Tenure'], as_index=False)['EstimatedSalary'].mean()
    df_tenure_salary_mean.columns = ['Tenure', 'Mean_Salary']
    merged_table = pd.merge(df_tenure_people_count, df_tenure_salary_mean, on='Tenure')

    # Create combo chart
    fig, ax1 = plt.subplots(figsize=(10, 6))

    # bar plot creation, drawn explicitly on ax1
    sns.barplot(x='Tenure', y='Counts', data=merged_table, palette='summer', ax=ax1)
    ax1.set_title('People count and estimated salary across Tenure', fontsize=16)
    ax1.set_ylabel('People Counts', fontsize=16)
    ax1.tick_params(axis='y')

    # specify we want to share the same x-axis
    ax2 = ax1.twinx()
    color = 'tab:red'

    # line plot creation, drawn explicitly on ax2; the bars sit at positions 0, 1, 2, ...
    # which line up with the Tenure values because Tenure starts at 0 and rises in steps of 1
    sns.lineplot(x='Tenure', y='Mean_Salary', data=merged_table, sort=False, color=color, ax=ax2)
    ax2.set_ylabel('Mean Estimated Salary', fontsize=16)
    ax2.tick_params(axis='y', color=color)

    # show plot
    ax1.legend(["People Counts"], loc="upper left")
    ax2.legend(["Estimated Salary"])
    ax1.set(ylim=(0, 1500))
    ax2.set(ylim=(90000, 120000))
    plt.show()
Findings from the above plot:
- Mean estimated salary is the lowest for people with a Tenure of 3 years
- Mean estimated salary is the highest for people with a Tenure of 10 years
- The fewest people have a Tenure of 0 years
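The table behind a combo chart like this can also be built in a single step with named aggregation, and the findings can then be read off with `idxmin` and `idxmax`. This is only a sketch on made-up data, not the code used for the plot above. The dataframe `toy`, its values and the name `summary` are invented for illustration.

```python
import pandas as pd

# Hypothetical customers, invented for illustration
toy = pd.DataFrame({
    "Tenure": [0, 1, 1, 2, 2, 2],
    "EstimatedSalary": [90000.0, 100000.0, 110000.0, 95000.0, 97000.0, 99000.0],
})

# One pass over the groups gives both the people count and the mean salary
summary = toy.groupby("Tenure", as_index=False).agg(
    Counts=("EstimatedSalary", "count"),
    Mean_Salary=("EstimatedSalary", "mean"),
)
print(summary)

# Tenure values with the lowest and highest mean salary, and with the fewest people
lowest_salary_tenure = summary.loc[summary["Mean_Salary"].idxmin(), "Tenure"]
highest_salary_tenure = summary.loc[summary["Mean_Salary"].idxmax(), "Tenure"]
fewest_people_tenure = summary.loc[summary["Counts"].idxmin(), "Tenure"]
print(lowest_salary_tenure, highest_salary_tenure, fewest_people_tenure)
```

Note that `count` only counts non-missing salaries, so it matches the number of people only when the salary column has no missing values.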
Conclusion
We have now successfully understood and plotted basic multivariate relationships and we can now use this knowledge to derive inter-relationships between various variables to understand the data better.
Complete Code with Github Link
    # importing libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np

    # loading the dataset
    df = pd.read_csv("Churn_Modelling.csv")

    # viewing a snapshot of the dataset
    df.head()

    # getting the number of rows and columns of the dataframe as a (rows, columns) tuple
    print(df.shape)

    # plotting scatter plot between age and salary, coloured by age
    sns.relplot(data=df, x='Age', y='EstimatedSalary', hue='Age', height=6, aspect=2)

    # Creating age brackets; pd.cut builds right-closed intervals by default,
    # so an age of exactly 20 or below falls outside the first bracket
    bins = [20, 30, 40, 50, 60, 70, 80, 90, 100]
    df['Age_Brackets'] = pd.cut(df['Age'], bins)
    df_age_salary = df.groupby('Age_Brackets', as_index=False, observed=False)['EstimatedSalary'].mean()

    # Plotting mean salary across age brackets
    sns.catplot(x='Age_Brackets', y='EstimatedSalary', data=df_age_salary, kind='bar', palette='GnBu', height=4, aspect=2)
    plt.title('Mean Salary across Age Brackets')
    plt.show()

    # mean balance for each gender and geography segment
    df_bal_gender_geo = df.groupby(['Gender', 'Geography'], as_index=False)['Balance'].mean()
    p = sns.catplot(x='Geography', y='Balance', hue='Gender', data=df_bal_gender_geo, kind='bar', palette='bright', height=4, aspect=1.5)

    # Plotting stacked bar chart to represent the distribution of males and females across Tenure
    f = plt.figure(figsize=(8, 5))
    ax = f.add_subplot(1, 1, 1)
    sns.histplot(data=df, ax=ax, stat="count", multiple="stack", x="Tenure", kde=False, palette="pastel", hue="Gender", element="bars", legend=True)
    ax.set(ylim=(0, 1400))
    ax.set_title("Distribution of males and females across Tenure")
    ax.set_xlabel("Tenure")
    ax.set_ylabel("Count")

    # Plotting line chart to visualize the distribution of people across age and country
    df_geo_age = df.groupby(['Age', 'Geography'], as_index=False).count()
    plt.figure(figsize=(10, 7))
    ax = sns.lineplot(x="Age", y="RowNumber", hue="Geography", data=df_geo_age)
    ax.set_ylabel("Count")

    # Let us now check out the number of people across Geography and Gender that have a credit card
    df_geo_credit_gender = df.groupby(['Geography', 'HasCrCard', 'Gender'], as_index=False).count()
    grid_plot_credit = sns.catplot(x='Geography', y='RowNumber', hue='Gender', col='HasCrCard', data=df_geo_credit_gender, kind='bar', palette='bright', height=6, aspect=0.7)
    # label the axes through the grid so that both facets are covered
    grid_plot_credit.set_axis_labels('Geography', 'Count')
    # reposition the figure-level legend (move_legend needs seaborn 0.11.2 or later)
    sns.move_legend(grid_plot_credit, 'upper right')

    # Creating dataframe to plot the combo chart
    df_tenure_people_count = df.groupby(['Tenure'], as_index=False).count()
    df_tenure_people_count = df_tenure_people_count[['Tenure', 'RowNumber']]
    df_tenure_people_count.columns = ['Tenure', 'Counts']
    df_tenure_salary_mean = df.groupby(['Tenure'], as_index=False)['EstimatedSalary'].mean()
    df_tenure_salary_mean.columns = ['Tenure', 'Mean_Salary']
    merged_table = pd.merge(df_tenure_people_count, df_tenure_salary_mean, on='Tenure')

    # Create combo chart
    fig, ax1 = plt.subplots(figsize=(10, 6))

    # bar plot creation, drawn explicitly on ax1
    sns.barplot(x='Tenure', y='Counts', data=merged_table, palette='summer', ax=ax1)
    ax1.set_title('People count and estimated salary across Tenure', fontsize=16)
    ax1.set_ylabel('People Counts', fontsize=16)
    ax1.tick_params(axis='y')

    # specify we want to share the same x-axis
    ax2 = ax1.twinx()
    color = 'tab:red'

    # line plot creation, drawn explicitly on ax2
    sns.lineplot(x='Tenure', y='Mean_Salary', data=merged_table, sort=False, color=color, ax=ax2)
    ax2.set_ylabel('Mean Estimated Salary', fontsize=16)
    ax2.tick_params(axis='y', color=color)

    # show plot
    ax1.legend(["People Counts"], loc="upper left")
    ax2.legend(["Estimated Salary"])
    ax1.set(ylim=(0, 1500))
    ax2.set(ylim=(90000, 120000))
    plt.show()