Perfect Plots: Pie Plot

An old Chinese proverb says that one picture is worth more than a thousand words. One good plot can rescue an entire presentation. One poor picture can sink an otherwise good talk. After plenty of embarrassing disappointments and boring presentations, I decided to improve my visualisation tools.

In [1]:
import pandas as pd

df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)
In [2]:
headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD'] 
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)
Out[2]:
Duration (in seconds) Gender Gender2 Age Country Education Major_undergraduate Recent_role Recent_role2 Industry Industry2 Years_of_experience compensation$USD
0 710 Female -1 45-49 United States of America Doctoral degree Other Consultant -1 Other 0 NaN NaN
1 434 Male -1 30-34 Indonesia Bachelor’s degree Engineering (non-computer focused) Other 0 Manufacturing/Fabrication -1 5-10 10-20,000
2 718 Female -1 30-34 United States of America Master’s degree Computer science (software engineering, etc.) Data Scientist -1 I am a student -1 0-1 0-10,000
3 621 Male -1 35-39 United States of America Master’s degree Social sciences (anthropology, psychology, soc… Not employed -1 NaN -1 NaN NaN
In [3]:
df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)

Correcting data

Every time we want to make a plot, we first need to check and clean the data. In particular, we should check the unique values in each column and deal with rare junk categories and NaN cells (missing data).

In [4]:
df.isnull().sum()
Out[4]:
Duration (in seconds)       0
Gender                      0
Age                         0
Country                     0
Education                 421
Major_undergraduate       912
Recent_role               959
Industry                 2174
Years_of_experience      2758
compensation$USD         3674
dtype: int64
In [5]:
df.dtypes
Out[5]:
Duration (in seconds)     int64
Gender                   object
Age                      object
Country                  object
Education                object
Major_undergraduate      object
Recent_role              object
Industry                 object
Years_of_experience      object
compensation$USD         object
dtype: object
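
The two checks above can be gathered into one table. This is only a minimal sketch: the helper name `column_summary` and the toy data are my own assumptions, not part of the Kaggle survey.

```python
import numpy as np
import pandas as pd

def column_summary(frame):
    """Return one row per column: dtype, missing count, number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(dropna=True),
    })

# Hypothetical toy data standing in for the survey table
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Education": ["Master's degree", np.nan, "Doctoral degree", np.nan],
})

summary = column_summary(toy)
print(summary)
```

On the real data the same call would be `column_summary(df)`.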

It is also very important to reduce the number of classes by joining similar groups, as long as this does not harm the project.

In [6]:
df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')
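
Joining groups one by one, as in the Gender cell above, gets tedious when a column has many rare categories. A possible shortcut is to lump every category below a chosen count into one label. The helper `lump_rare`, the threshold and the toy data are assumptions made for illustration only.

```python
import pandas as pd

def lump_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; everything else becomes other_label
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column with two rare roles
roles = pd.Series(
    ["Student"] * 5 + ["Data Scientist"] * 4 + ["Data Journalist", "Salesperson"]
)

lumped = lump_rare(roles, min_count=3)
print(lumped.value_counts())
```

Fewer, larger classes also make a pie plot easier to read, because tiny slices and their labels no longer overlap.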
In [7]:
df.Education.value_counts(dropna = False)
Out[7]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
Professional degree                                                    599
NaN                                                                    421
I prefer not to answer                                                 345
No formal education past high school                                   232
Name: Education, dtype: int64

We can assume that somebody who didn’t answer didn’t want to give this information, so we treat NaN as ‘I prefer not to answer’.

In [8]:
import numpy as np

df['Education']=df['Education'].replace(np.nan, 'I prefer not to answer')
In [9]:
df.Education.value_counts(dropna = False)
Out[9]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
I prefer not to answer                                                 766
Professional degree                                                    599
No formal education past high school                                   232
Name: Education, dtype: int64
In [10]:
df.Education.isnull().sum()
Out[10]:
0
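
The same step can also be written with `fillna`, which needs no NaN constant at all. A minimal sketch on a toy column (the data is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing answers
education = pd.Series(["Master's degree", np.nan, "Doctoral degree", np.nan])

# Every missing value becomes the chosen category
filled = education.fillna("I prefer not to answer")
print(filled.value_counts())
```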
In [11]:
df.Major_undergraduate.value_counts(dropna = False)
Out[11]:
Computer science (software engineering, etc.)                    9430
Engineering (non-computer focused)                               3705
Mathematics or statistics                                        2950
A business discipline (accounting, economics, finance, etc.)     1791
Physics or astronomy                                             1110
Information technology, networking, or system administration     1029
NaN                                                               912
Medical or life sciences (biology, chemistry, medicine, etc.)     871
Other                                                             770
Social sciences (anthropology, psychology, sociology, etc.)       554
Humanities (history, literature, philosophy, etc.)                269
Environmental science or geology                                  253
I never declared a major                                          128
Fine arts or performing arts                                       87
Name: Major_undergraduate, dtype: int64

I assume that NaN and ‘Other’ appear when somebody does not want to declare their specialization, so both are mapped to ‘I never declared a major’.

In [12]:
df['Major_undergraduate']=df['Major_undergraduate'].replace(np.nan, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
In [13]:
import matplotlib.pyplot as plt
df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cea62cc0>
In [14]:
df.Recent_role.value_counts(dropna=False)
Out[14]:
Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
NaN                         959
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Recent_role, dtype: int64
In [15]:
df['Recent_role']=df['Recent_role'].replace(np.nan, 'Other')
In [16]:
Z1 = df.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count').sort_values('Male',ascending=False)
Z1
Out[16]:
Gender Female Male Prefer not to say
Major_undergraduate
Computer science (software engineering, etc.) 1463 7837 130
Engineering (non-computer focused) 432 3223 50
Mathematics or statistics 660 2241 49
I never declared a major 297 1438 75
A business discipline (accounting, economics, finance, etc.) 334 1435 22
Physics or astronomy 119 968 23
Information technology, networking, or system administration 186 832 11
Medical or life sciences (biology, chemistry, medicine, etc.) 203 646 22
Social sciences (anthropology, psychology, sociology, etc.) 160 379 15
Environmental science or geology 57 190 6
Humanities (history, literature, philosophy, etc.) 74 185 10
Fine arts or performing arts 25 56 6
In [17]:
Z1.plot(kind='barh', legend=True, title='Data Scientists by Major undergraduate and Gender (Kaggle 2018)', figsize=(7, 4), color=('b','g','y'))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cef146d8>
In [18]:
Z2 = df.pivot_table(index=['Country'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False).nlargest(20,'Male')
Z2
Out[18]:
Gender Female Male Prefer not to say SUM
Country
SUM 4010.0 19430.0 419.0 23859
India 657.0 3719.0 41.0 4417
United States of America 1082.0 3530.0 104.0 4716
China 267.0 1337.0 40.0 1644
Other 165.0 849.0 22.0 1036
Russia 113.0 750.0 16.0 879
Brazil 65.0 666.0 5.0 736
Germany 103.0 621.0 10.0 734
Japan 34.0 557.0 6.0 597
United Kingdom of Great Britain and Northern Ireland 131.0 554.0 17.0 702
France 104.0 494.0 6.0 604
Canada 123.0 475.0 6.0 604
Spain 75.0 406.0 4.0 485
Italy 47.0 303.0 5.0 355
Australia 51.0 272.0 7.0 330
Turkey 56.0 267.0 4.0 327
I do not wish to disclose my location 83.0 250.0 61.0 394
Poland 54.0 243.0 4.0 301
Netherlands 41.0 225.0 4.0 270
Ukraine 31.0 218.0 3.0 252

Poland in data

Because I am from Poland, the most interesting data for me is the information from my country. I separate the Polish responses from the original data.

In [19]:
PL= df[df.Country=='Poland']
In [20]:
Z3 = PL.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[20]:
Gender Female Male Prefer not to say SUM
Major_undergraduate
SUM 54.0 243.0 4.0 301
Computer science (software engineering, etc.) 15.0 96.0 1.0 112
Mathematics or statistics 12.0 39.0 1.0 52
A business discipline (accounting, economics, finance, etc.) 9.0 24.0 1.0 34
Physics or astronomy 4.0 20.0 NaN 24
Engineering (non-computer focused) 3.0 18.0 1.0 22
I never declared a major 2.0 15.0 NaN 17
Information technology, networking, or system administration 3.0 11.0 NaN 14
Medical or life sciences (biology, chemistry, medicine, etc.) NaN 7.0 NaN 7
Social sciences (anthropology, psychology, sociology, etc.) 5.0 7.0 NaN 12
Humanities (history, literature, philosophy, etc.) NaN 4.0 NaN 4
Environmental science or geology 1.0 2.0 NaN 3
In [21]:
Z3 = PL.pivot_table(index=['Recent_role'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[21]:
Gender Female Male Prefer not to say SUM
Recent_role
SUM 54.0 243.0 4.0 301
Data Scientist 15.0 58.0 NaN 73
Software Engineer 4.0 46.0 NaN 50
Student 5.0 28.0 NaN 33
Other 5.0 21.0 NaN 26
Data Analyst 9.0 19.0 1.0 29
Research Scientist 2.0 15.0 NaN 17
Consultant 1.0 10.0 NaN 11
Business Analyst 2.0 9.0 1.0 12
Manager 1.0 6.0 NaN 7
Research Assistant 2.0 6.0 NaN 8
Data Engineer 2.0 5.0 1.0 8
Not employed 3.0 5.0 1.0 9
Chief Officer 1.0 4.0 NaN 5
Product/Project Manager 1.0 3.0 NaN 4
DBA/Database Engineer NaN 3.0 NaN 3
Statistician NaN 2.0 NaN 2
Data Journalist NaN 1.0 NaN 1
Principal Investigator NaN 1.0 NaN 1
Salesperson 1.0 1.0 NaN 2

Let’s make a standard, quick Pie Plot

The result is a banal, predictable visualization.

In [22]:
Z4 = PL.pivot_table(index=['Recent_role'], values='Age',aggfunc='count').sort_values('Age', ascending=False)

Z4.plot(kind='pie', subplots=True, legend=False, title="Data Scientists by Recent_role (Kaggle 2018)",figsize=(15,7), autopct='%1.1f%%',fontsize=14)
Out[22]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000260CF010F60>],
      dtype=object)
In [23]:
Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
Out[23]:
Age
Major_undergraduate
Computer science (software engineering, etc.) 112
Mathematics or statistics 52
A business discipline (accounting, economics, finance, etc.) 34
Physics or astronomy 24
Engineering (non-computer focused) 22
I never declared a major 17
Information technology, networking, or system administration 14
Social sciences (anthropology, psychology, sociology, etc.) 12
Medical or life sciences (biology, chemistry, medicine, etc.) 7
Humanities (history, literature, philosophy, etc.) 4

Better Pie Plot with interesting colors

To begin with, we can change the colors and add better descriptions.

GSuite Text and Background Palette: https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/

In [24]:
import matplotlib.pyplot as plt
## Figure size
plt.figure(figsize=(16,8))

## Single axes with an equal aspect ratio, so the pie is drawn as a circle
ax1 = plt.subplot(aspect='equal')

## Color palette (cycled when there are more wedges than colors)
colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']

## Main plotting call; labels come from the index of Z5, one per wedge
Z5.plot(kind='pie', colors=colors, y='Age', ax=ax1, autopct='%1.1f%%', startangle=0, shadow=False, labels=Z5.index, legend=False, fontsize=14)

# Axis labels, title, etc.
ax1.set_xlabel('Something to write',  fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Something to write', fontsize=11,  color='grey', alpha=0.8)
ax1.set_title('Major_undergraduate in Data Scientists (Kaggle 2018)',  fontsize=18, color='grey', alpha=0.8)
ax1.set_facecolor('#d8dcd6')
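
The palette above has five colors for eleven wedges, so the colors repeat. One possible alternative is to sample as many shades as needed from a matplotlib colormap. This is only a sketch: the helper `shades`, the 'Blues' colormap and the `low`/`high` limits are my assumptions; the sizes are the Polish counts from the tables above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import to_hex

def shades(cmap_name, n, low=0.25, high=0.85):
    """Return n hex colors sampled evenly from a named matplotlib colormap."""
    cmap = plt.get_cmap(cmap_name)
    return [to_hex(cmap(v)) for v in np.linspace(low, high, n)]

# Counts per major for Poland, taken from the pivot table
sizes = [112, 52, 34, 24, 22, 17, 14, 12, 7, 4, 3]
colors = shades("Blues", len(sizes))

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(sizes, colors=colors, startangle=0)
ax.set_aspect("equal")
```

Staying away from the very light and very dark ends of the colormap keeps the percentage text readable on every wedge.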

The best Pie Plot

I came across this publication and decided to build my Pie Plot the same way.
https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f

To prepare a perfect pie plot, I first need to pull the vectors of data out of the pivot table.

In [25]:
PPL=Z5.reset_index()
PPL.head(5)
Out[25]:
Major_undergraduate Age
0 Computer science (software engineering, etc.) 112
1 Mathematics or statistics 52
2 A business discipline (accounting, economics, … 34
3 Physics or astronomy 24
4 Engineering (non-computer focused) 22

Now we pull the label and size vectors out of the table.

In [26]:
# PPL already has a plain integer index, so the columns can be converted directly
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()

fig1, ax1 = plt.subplots(figsize=(10,5))


ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)

ax1.axis('equal')  
plt.tight_layout()
plt.show()
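
Besides a format string, `autopct` also accepts a function that receives the percentage of each wedge. The sketch below uses this to print the count next to the percentage. The helper name `make_autopct` and the shortened labels are assumptions; 'Other majors' simply lumps the remaining Polish answers (301 in total) for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

def make_autopct(sizes):
    """Build a formatter for ax.pie that prints the percentage and the count."""
    total = sum(sizes)

    def fmt(pct):
        # ax.pie passes the wedge share in percent; convert it back to a count
        count = int(round(pct * total / 100.0))
        return f"{pct:.1f}%\n({count})"

    return fmt

labels = ["Computer science", "Mathematics or statistics", "Other majors"]
sizes = [112, 52, 137]

fig, ax = plt.subplots(figsize=(10, 5))
wedges, texts, autotexts = ax.pie(
    sizes, labels=labels, autopct=make_autopct(sizes), startangle=90
)
ax.axis("equal")
```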

Changing colors

In [27]:
# Create the figure and a single axes; figure size is 10 x 5 inches
fig1, ax1 = plt.subplots(figsize=(10,5))

colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', shadow=True, startangle=0)
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()

Changing the size and color of all fonts

textprops={'fontsize': 30, 'color': 'green'}

In [28]:
# Create the figure and a single axes; figure size is 18 x 12 inches
fig1, ax1 = plt.subplots(figsize=(18,12))

colors = ['#e06666','#ea9999','#f4cccc','#ff0000','#434343']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', shadow=True, startangle=0, textprops={'fontsize': 40, 'color':'#434343'})
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. 
  # Remove the CWD from sys.path while we load stuff.
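
The warning appears because the long labels in a huge font do not fit next to the pie. One possible way around it is to leave the labels off the wedges and put them in a legend beside the plot. This is a sketch under my own assumptions: the shortened labels, the font size and the legend position are illustrative, and 'Other majors' lumps the remaining Polish answers.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

labels = ["Computer science", "Mathematics or statistics",
          "A business discipline", "Physics or astronomy", "Other majors"]
sizes = [112, 52, 34, 24, 79]

fig, ax = plt.subplots(figsize=(10, 5))

# No labels on the wedges: only the percentages are drawn on the pie
wedges, texts, autotexts = ax.pie(
    sizes, autopct='%1.1f%%', startangle=0, textprops={'fontsize': 12}
)

# Category names go into a legend placed to the right of the axes
ax.legend(wedges, labels, title="Major_undergraduate",
          loc="center left", bbox_to_anchor=(1.0, 0.5))
ax.axis('equal')

# When saving, fig.savefig(..., bbox_inches="tight") keeps the legend in the image
```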

Changing the size and color of labels and percentages separately

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
In [29]:
fig1, ax1 = plt.subplots(figsize=(15,12))

colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%',textprops={'fontsize': 40}, shadow=True, startangle=0)

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:13: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. 
  del sys.path[0]
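
The two loops above can also be written with `plt.setp`, which sets properties on a whole list of text objects at once. A minimal sketch with made-up, shortened labels and smaller fonts than in the cell above:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

labels = ["Computer science", "Mathematics or statistics", "Other majors"]
sizes = [112, 52, 137]

fig, ax = plt.subplots(figsize=(8, 8))
patches, texts, autotexts = ax.pie(
    sizes, labels=labels, autopct='%1.1f%%', startangle=0
)

# texts are the category labels, autotexts are the percentages inside the wedges
plt.setp(texts, color='darkred', fontsize=12)
plt.setp(autotexts, color='grey', fontsize=10)

ax.axis('equal')
```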

Changing size and color for the chosen categories

In [30]:
fig1, ax1 = plt.subplots(figsize=(6,6))

colors = ['#ff9999','#747574','#99ff99','#ffcc99','#f1c232']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', shadow=True, startangle=0)


for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('grey')

    
texts[0].set_fontsize(24)
texts[0].set_color('black')
texts[4].set_fontsize(33)
texts[4].set_color('green')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:20: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
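
In the cell above the chosen categories are picked by position (`texts[0]`, `texts[4]`), which breaks as soon as the sort order changes. A possible alternative is to pick the slice by its label and also pull it out of the pie with the `explode` argument. The shortened labels, the highlighted category and the offset of 0.1 are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

labels = ["Computer science", "Mathematics or statistics",
          "A business discipline", "Physics or astronomy", "Other majors"]
sizes = [112, 52, 34, 24, 79]
highlight = "Mathematics or statistics"  # category chosen for emphasis

# Offset only the highlighted wedge from the center of the pie
explode = [0.1 if label == highlight else 0.0 for label in labels]

fig, ax = plt.subplots(figsize=(6, 6))
patches, texts, autotexts = ax.pie(
    sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=0
)

# Find the wedge by name instead of by a hard-coded position
idx = labels.index(highlight)
texts[idx].set_fontsize(16)
texts[idx].set_color('green')

ax.axis('equal')
```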