Feel free to read the code on GitHub
An old Chinese proverb says: a picture is worth a thousand words. One good plot can rescue an entire presentation. One poor picture can sink an otherwise good speech. After plenty of embarrassing meetings and boring presentations, I decided to improve my visualisation tools.
import pandas as pd
df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)
headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD']
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)
df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)
Correcting data
Every time we want to make a plot, we need to check and clean the data first. In particular, we check the unique values in each column and eliminate rare junk categories and NaN cells (missing data).
df.isnull().sum()
df.dtypes
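The checks above can be tried on a small made-up table before running them on the full survey. This is only a sketch: the `toy` DataFrame and its values are invented for illustration and are not taken from the Kaggle files.

```python
import numpy as np
import pandas as pd

# Invented sample rows (an assumption for illustration only)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Prefer not to say"],
    "Age": ["25-29", "30-34", "25-29", np.nan, "22-24"],
})

# Number of missing cells per column
missing = toy.isnull().sum()
print(missing)

# Unique values with their frequencies, NaN included
counts = toy["Gender"].value_counts(dropna=False)
print(counts)
```

Rare values that show up in `counts` are the candidates for merging or removal.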
It is very important to reduce the number of classes, or to merge similar groups, as long as this does not harm the project.
df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')
df.Education.value_counts(dropna = False)
We can assume that if somebody didn't answer, they didn't want to give the information: 'I prefer not to answer'.
import numpy as np
df['Education']=df['Education'].replace(np.nan, 'I prefer not to answer')
df.Education.value_counts(dropna = False)
df.Education.isnull().sum()
df.Major_undergraduate.value_counts(dropna = False)
I assume that NaN and 'Other' appear when somebody does not want to declare their major: 'I never declared a major'
df['Major_undergraduate']=df['Major_undergraduate'].replace(np.nan, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
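The two replacements above can also be chained into one expression. A minimal sketch on an invented Series (the name `major` and its values are assumptions, not survey data), using `fillna` for the missing cells and a dict in `replace` for the merged category:

```python
import numpy as np
import pandas as pd

# Invented sample column (illustration only)
major = pd.Series(["Physics", np.nan, "Other", "Physics", "Mathematics"])

target = "I never declared a major"
# Fill missing cells first, then fold 'Other' into the same category
cleaned = major.fillna(target).replace({"Other": target})
print(cleaned.value_counts(dropna=False))
```

The dict form makes it easy to add further categories to merge later without repeating the assignment.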
import matplotlib.pyplot as plt
df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')
df.Recent_role.value_counts(dropna=False)
df['Recent_role']=df['Recent_role'].replace(np.nan, 'Other')
Z1 = df.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count').sort_values('Male',ascending=False)
Z1
Z1.plot(kind='barh', legend=True, title='Data Scientists by Major undergraduate and Gender (Kaggle 2018)', figsize=(7, 4), color=('b','g','y'))
Z2 = df.pivot_table(index=['Country'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False).nlargest(20,'Male')
Z2
Poland in data
Because I am from Poland, the most interesting data for me is the information from my own country. I separate the Polish records from the original data.
PL= df[df.Country=='Poland']
Z3 = PL.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Z3 = PL.pivot_table(index=['Recent_role'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
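For plain counting, `pd.crosstab` is a possible alternative to `pivot_table` with `aggfunc='count'`. The sketch below runs on invented data (the `toy` frame is an assumption, not the survey). Unlike the pivot tables above, it counts rows rather than non-null `Age` values, and it fills empty combinations with 0 instead of NaN.

```python
import pandas as pd

# Invented sample rows (illustration only)
toy = pd.DataFrame({
    "Major_undergraduate": ["CS", "CS", "CS", "Math", "Math", "Physics"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
})

# Count rows per (major, gender) and add a totals row and column named SUM
ct = pd.crosstab(
    toy["Major_undergraduate"],
    toy["Gender"],
    margins=True,
    margins_name="SUM",
)
print(ct)
```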
Let's make a standard, quick pie plot
The result is a banal, predictable visualization.
Z4 = PL.pivot_table(index=['Recent_role'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z4.plot(kind='pie', subplots=True, legend=False, title="Data Scientists by Recent_role (Kaggle 2018)",figsize=(15,7), autopct='
Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
A better pie plot with interesting colors
To start with, we can change the colors and add better descriptions.
GSuite Text and Background Palette: https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/
import matplotlib.pyplot as plt
## Figure size
plt.figure(figsize=(16,8))
## indicates that this is a composite plot (subplot)
ax1 = plt.subplot(aspect='equal')
## set the colors
colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
## the basic plotting call
Z5.plot(kind='pie',colors =colors , y = 'Age', ax=ax1, autopct='
# labels, titles, etc.
ax1.set_xlabel('Something to write', fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Something to write', fontsize=11, color='grey', alpha=0.8)
ax1.set_title('Major_undergraduate in Data Scientists (Kaggle 2018)', fontsize=18, color='grey', alpha=0.8)
ax1.set_facecolor('#d8dcd6')
The best Pie Plot
I came across this publication and decided to make my pie plots this way.
https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f
To prepare a perfect pie plot, I first need to pull vectors of data from the pivot table.
PPL=Z5.reset_index()
PPL.head(5)
Pulling the vectors of data from the pivot table:
PPL.reset_index()
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()
fig1, ax1 = plt.subplots(figsize=(10,5))
ax1.pie(sizes, labels=labels, autopct='
ax1.axis('equal')
plt.tight_layout()
plt.show()
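The step of pulling label and size vectors out of a pivot table can be sketched on its own. The data here is invented (the `toy` frame is an assumption, not the survey); the column names mirror the ones used above.

```python
import pandas as pd

# Invented sample rows (illustration only)
toy = pd.DataFrame({
    "Major_undergraduate": ["CS", "CS", "CS", "Math", "Math", "Physics"],
    "Age": ["25-29", "30-34", "22-24", "25-29", "30-34", "25-29"],
})

# Count respondents per major, largest group first
pivot = (
    toy.pivot_table(index="Major_undergraduate", values="Age", aggfunc="count")
       .sort_values("Age", ascending=False)
)

# reset_index turns the index into a regular column so both vectors
# can be read from columns
flat = pivot.reset_index()
labels = flat["Major_undergraduate"].to_list()
sizes = flat["Age"].to_list()
print(labels, sizes)
```

The same vectors could also be read directly with `pivot.index.to_list()` and `pivot["Age"].to_list()`, which skips `reset_index`.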
Changing colors
# this line sets up a composite plot (figure and axes); figsize gives the dimensions
fig1, ax1 = plt.subplots(figsize=(10,5))
colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']
ax1.pie(sizes, colors=colors, labels=labels, autopct='
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()
Changing the size and color of all fonts
textprops={'fontsize': 30, 'color': 'green'}
# this line sets up a composite plot (figure and axes); figsize gives the dimensions
fig1, ax1 = plt.subplots(figsize=(18,12))
colors = ['#e06666','#ea9999','#f4cccc','#ff0000','#434343']
ax1.pie(sizes, colors=colors, labels=labels, autopct='
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()
Changing the size and color of individual text elements
for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
fig1, ax1 = plt.subplots(figsize=(15,12))
colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
ax1.axis('equal')
plt.tight_layout()
plt.show()
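The loops above rely on what `ax.pie` returns. A minimal sketch with invented values (the `sizes` and `labels` here are assumptions, and the `'%1.1f%%'` format string is only an example): when `autopct` is given, `pie` returns three lists, namely the wedges, the label texts and the percentage texts, and each text can be styled separately.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt

# Invented values (illustration only)
sizes = [5, 3, 2]
labels = ["A", "B", "C"]

fig, ax = plt.subplots(figsize=(6, 6))
# With autopct set, pie returns wedges, label texts and percentage texts
patches, texts, autotexts = ax.pie(sizes, labels=labels, autopct="%1.1f%%")

for text in texts:          # category labels outside the wedges
    text.set_color("darkred")
for autotext in autotexts:  # percentage labels inside the wedges
    autotext.set_color("grey")

# A single label can be emphasised through its position in the list
texts[0].set_fontsize(24)
ax.axis("equal")
```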
Changing the size and color of selected categories
fig1, ax1 = plt.subplots(figsize=(6,6))
colors = ['#ff9999','#747574','#99ff99','#ffcc99','#f1c232']
patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('grey')
texts[0].set_fontsize(24)
texts[0].set_color('black')
texts[4].set_fontsize(33)
texts[4].set_color('green')
ax1.axis('equal')
plt.tight_layout()
plt.show()
Making a bagel
fig1, ax1 = plt.subplots(figsize=(18,6))
colors = ['#a2c4c9','#b6d7a8','#747574','#99ff99','#ffcc99','#76a5af']
patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()
Making a better bagel
fig1, ax1 = plt.subplots(figsize=(18,8))
colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')
    autotext.set_fontsize(22)
texts[0].set_fontsize(18)
texts[0].set_color('black')
#draw circle
centre_circle = plt.Circle((0,0),0.40,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax1.set_xlabel('Year 2018', fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Data Scientist', fontsize=11, color='grey', alpha=0.8)
ax1.set_title('Data Scientist by profession', fontsize=58, color='#d0e0e3', alpha=0.8)
ax1.set_facecolor('#d8dcd6')
ax1.axis('equal')
plt.tight_layout()
plt.show()
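The bagel above is made by drawing a white circle over the centre of the pie. A different technique, shown here only as a possible alternative and not as the method used in this post, is to give the wedges a `width` through `wedgeprops`, so the hole is truly empty and the background shows through. The values below are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt

# Invented values (illustration only)
sizes = [5, 3, 2]
labels = ["A", "B", "C"]

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    sizes,
    labels=labels,
    autopct="%1.0f%%",
    pctdistance=0.8,              # move percentages onto the ring
    wedgeprops=dict(width=0.4),   # ring thickness; the inner 0.6 of the radius stays empty
)
ax.axis("equal")
```

With this approach the face color of the axes or figure stays visible in the hole, which matters when the background is not white.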
Adding the gender variable
PL.columns
Z6 = PL.pivot_table(index=['Major_undergraduate','Gender'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z6.head(10)
To prepare a perfect pie plot, I first need to pull vectors of data from the pivot table.
PLG=Z6.reset_index()
PLG.head(2)
PLG.reset_index()
labels_gender = PLG['Gender'].to_list()
sizes_gender = PLG['Age'].to_list()
The double bagel
import matplotlib.pyplot as plt
colors_gender = ['#c2c2f0','#ffb3e6']
fig1, ax1 = plt.subplots(figsize=(18,6))
colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']
patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
plt.pie(sizes_gender,colors=colors_gender,radius=0.75,startangle=0)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')
#draw circle
centre_circle = plt.Circle((0,0),0.50,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
ax1.axis('equal')
plt.tight_layout()
plt.show()
The double bagel is "a bridge too far". The plot is beautiful, but the two rings are not related to each other: the outer ring is sorted by major, the inner one by the size of each major and gender pair. To connect them properly, both vectors should come from a single pivot table. At the moment I have no idea how to do it (groupby, query, pivot ...).
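One possible way to link the two rings, offered as a sketch under assumptions rather than a finished solution, is to count once per (major, gender) pair with `groupby` and then derive the outer ring by summing that same table over gender. Both rings then share the same order and the same angles. The `toy` data and the colour mapping are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the PL subset (illustration only)
toy = pd.DataFrame({
    "Major_undergraduate": ["CS", "CS", "CS", "Math", "Math", "Physics"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
})

# Inner ring: one count per (major, gender), sorted by major then gender
inner = toy.groupby(["Major_undergraduate", "Gender"]).size()
# Outer ring: the same table summed over gender, so the order matches
outer = inner.groupby(level="Major_undergraduate").sum()

# Hypothetical colour per gender
gender_colors = {"Male": "#c2c2f0", "Female": "#ffb3e6"}
inner_colors = [gender_colors[g] for g in inner.index.get_level_values("Gender")]

fig, ax = plt.subplots(figsize=(6, 6))
ring = dict(width=0.3, edgecolor="white")
ax.pie(outer.to_list(), labels=outer.index.to_list(), radius=1.0, wedgeprops=ring)
ax.pie(inner.to_list(), colors=inner_colors, radius=0.7, wedgeprops=ring)
ax.axis("equal")
```

Because each group of inner wedges adds up to its outer wedge and both pies start at the same angle, every gender slice sits directly under its major.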
A trigger (function) to create the pie plot
Components needed to create a perfect pie plot: labels, sizes, colors
To prepare a perfect pie plot, I first need to pull vectors of data from the pivot table.
PPL=Z5.reset_index()
PPL.head(5)
PPL.reset_index()
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()
colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
def PPieP(sizes,labels,colors):
    fig1, ax1 = plt.subplots(figsize=(18,8))
    patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='
    for text in texts:
        text.set_color('grey')
    for autotext in autotexts:
        autotext.set_color('black')
        autotext.set_fontsize(22)
    texts[0].set_fontsize(18)
    texts[0].set_color('black')
    #draw circle
    centre_circle = plt.Circle((0,0),0.40,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)
    ax1.set_xlabel('Year 2018', fontsize=15, color='darkred', alpha=1)
    ax1.set_ylabel('Data Scientist', fontsize=11, color='grey', alpha=0.8)
    ax1.set_title('Data Scientist by profession', fontsize=58, color='#d0e0e3', alpha=0.8)
    ax1.set_facecolor('#d8dcd6')
    ax1.axis('equal')
    plt.tight_layout()
    plt.show()
# Variables passed to the trigger (function arguments):
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()
#colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
#colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
#colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']
#colors = ['#cfe2f3','#9fc5e8','#6fa8dc']
colors = ['#d9ead3','#b6d7a8','#93c47d','#6aa84f']
# Trigger (function call):
PPieP(sizes,labels,colors)
Good advice for making a presentation is to prepare all plots to one standard.
As the man who built my house says: messy, but evenly!

