Perfect Plot: Treemap

An old Chinese proverb says that one picture is worth more than a thousand words.
One good plot can rescue an entire presentation. One poor picture can sink an otherwise good speech. After plenty of embarrassing meetings and boring presentations, I decided to improve my visualisation tools.
In [1]:
import squarify 
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)
In [2]:
headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD'] 
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)
Out[2]:
  | Duration (in seconds) | Gender | Gender2 | Age | Country | Education | Major_undergraduate | Recent_role | Recent_role2 | Industry | Industry2 | Years_of_experience | compensation$USD
0 | 710 | Female | -1 | 45-49 | United States of America | Doctoral degree | Other | Consultant | -1 | Other | 0 | NaN | NaN
1 | 434 | Male | -1 | 30-34 | Indonesia | Bachelor’s degree | Engineering (non-computer focused) | Other | 0 | Manufacturing/Fabrication | -1 | 5-10 | 10-20,000
2 | 718 | Female | -1 | 30-34 | United States of America | Master’s degree | Computer science (software engineering, etc.) | Data Scientist | -1 | I am a student | -1 | 0-1 | 0-10,000
3 | 621 | Male | -1 | 35-39 | United States of America | Master’s degree | Social sciences (anthropology, psychology, soc… | Not employed | -1 | NaN | -1 | NaN | NaN
In [3]:
df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)

Correcting data

Every time we want to make a plot, we first need to check and clean the data. In particular, we check the unique values in each column and eliminate the small share of rubbish entries and NaN cells (missing data).

In [4]:
df.isnull().sum()
Out[4]:
Duration (in seconds)       0
Gender                      0
Age                         0
Country                     0
Education                 421
Major_undergraduate       912
Recent_role               959
Industry                 2174
Years_of_experience      2758
compensation$USD         3674
dtype: int64
In [5]:
df.dtypes
Out[5]:
Duration (in seconds)     int64
Gender                   object
Age                      object
Country                  object
Education                object
Major_undergraduate      object
Recent_role              object
Industry                 object
Years_of_experience      object
compensation$USD         object
dtype: object
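
The checks above (missing cells, data types) can be gathered into one table together with the number of distinct values per column. This is only a minimal sketch: the helper name column_summary and the toy data are assumptions made for illustration, not part of the survey files.

```python
import pandas as pd

def column_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of NaN cells, number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(dropna=True),  # NaN is not counted as a value
    })

# Toy data standing in for the survey (made up for illustration)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", None],
    "Age": ["45-49", "30-34", "30-34", "35-39"],
})

summary = column_summary(toy)
print(summary)
```

On the real data the call would presumably be column_summary(df).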

It is very important to reduce the number of classes, or to merge similar groups, as long as this does not harm the project.

In [6]:
df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')
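
When a column has many small groups, merging them one by one becomes tedious. One possible approach, sketched below, is to lump every category that occurs fewer than a chosen number of times into a single label. The helper lump_rare, the threshold and the toy data are hypothetical; whether such merging is acceptable depends on the project.

```python
import pandas as pd

def lump_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories occurring fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; NaN cells are left untouched
    return series.where(~series.isin(rare), other_label)

# Toy column (made up for illustration)
roles = pd.Series(
    ["Student"] * 5 + ["Data Scientist"] * 4 + ["Data Journalist", "Salesperson"]
)

lumped = lump_rare(roles, min_count=2)
print(lumped.value_counts())
```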
In [7]:
df.Education.value_counts(dropna = False)
Out[7]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
Professional degree                                                    599
NaN                                                                    421
I prefer not to answer                                                 345
No formal education past high school                                   232
Name: Education, dtype: int64

We can assume that if somebody did not answer, they did not want to give this information, so we treat missing values as ‘I prefer not to answer’.

In [8]:
import numpy as np

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df['Education']=df['Education'].replace(np.nan, 'I prefer not to answer')
In [9]:
df.Education.value_counts(dropna = False)
Out[9]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
I prefer not to answer                                                 766
Professional degree                                                    599
No formal education past high school                                   232
Name: Education, dtype: int64
In [10]:
df.Education.isnull().sum()
Out[10]:
0
In [11]:
df.Major_undergraduate.value_counts(dropna = False)
Out[11]:
Computer science (software engineering, etc.)                    9430
Engineering (non-computer focused)                               3705
Mathematics or statistics                                        2950
A business discipline (accounting, economics, finance, etc.)     1791
Physics or astronomy                                             1110
Information technology, networking, or system administration     1029
NaN                                                               912
Medical or life sciences (biology, chemistry, medicine, etc.)     871
Other                                                             770
Social sciences (anthropology, psychology, sociology, etc.)       554
Humanities (history, literature, philosophy, etc.)                269
Environmental science or geology                                  253
I never declared a major                                          128
Fine arts or performing arts                                       87
Name: Major_undergraduate, dtype: int64

I assume that NaN and ‘Other’ appear when somebody does not want to declare their major, so both are treated as ‘I never declared a major’.

In [12]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df['Major_undergraduate']=df['Major_undergraduate'].replace(np.nan, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
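
The same cleaning step can be written as one chained expression: fillna handles the missing cells and a dictionary passed to replace handles the label to be merged. This is a sketch on a toy column; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['Major_undergraduate'] (values are illustrative)
majors = pd.Series(["Physics or astronomy", np.nan, "Other", "I never declared a major"])

target = "I never declared a major"

# 1) fill missing cells, 2) map 'Other' to the same label
cleaned = majors.fillna(target).replace({"Other": target})
print(cleaned.value_counts())
```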
In [13]:
df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x287b9519550>
In [14]:
df.Recent_role.value_counts(dropna=False)
Out[14]:
Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
NaN                         959
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Recent_role, dtype: int64
In [15]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df['Recent_role']=df['Recent_role'].replace(np.nan, 'Other')

Poland in data

Because I am from Poland, the most interesting data for me is the information from my country. I separate the data about Poland from the original data.

In [16]:
PL= df[df.Country=='Poland']
In [17]:
Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
Out[17]:
                                                               Age
Major_undergraduate
Computer science (software engineering, etc.)                  112
Mathematics or statistics                                       52
A business discipline (accounting, economics, finance, etc.)    34
Physics or astronomy                                            24
Engineering (non-computer focused)                              22
I never declared a major                                        17
Information technology, networking, or system administration    14
Social sciences (anthropology, psychology, sociology, etc.)     12
Medical or life sciences (biology, chemistry, medicine, etc.)    7
Humanities (history, literature, philosophy, etc.)               4
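
The pivot table above counts rows per major. For a single grouping column, value_counts gives the same numbers with less code. The sketch below compares both on toy data (made up for illustration). One difference to keep in mind: pivot_table with aggfunc='count' counts non-missing values of 'Age', while value_counts counts non-missing values of the major itself.

```python
import pandas as pd

# Toy data standing in for the survey (made up for illustration)
toy = pd.DataFrame({
    "Country": ["Poland", "Poland", "Poland", "India", "Poland"],
    "Major_undergraduate": ["Computer science", "Mathematics", "Computer science",
                            "Physics", "Computer science"],
    "Age": ["25-29", "30-34", "22-24", "25-29", "35-39"],
})

poland = toy[toy["Country"] == "Poland"]

# Approach used in this notebook
by_pivot = poland.pivot_table(
    index="Major_undergraduate", values="Age", aggfunc="count"
).sort_values("Age", ascending=False)

# Shorter alternative; the result is already sorted in descending order
by_counts = poland["Major_undergraduate"].value_counts()

print(by_pivot)
print(by_counts)
```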

The Treemap

I came across this publication and decided to build the treemap the way it is done there:
https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

To prepare a perfect treemap, I first need to pull vectors of data from the pivot table.

In [18]:
PPL=Z5.reset_index()
PPL.head(5)
Out[18]:
                                 Major_undergraduate  Age
0      Computer science (software engineering, etc.)  112
1                          Mathematics or statistics   52
2    A business discipline (accounting, economics, …   34
3                               Physics or astronomy   24
4                 Engineering (non-computer focused)   22

Trim the descriptions that are too long

In [19]:
PPL['Major_undergraduate']= PPL['Major_undergraduate'].str.split('(').apply(lambda x: x[0])
PPL['Major_undergraduate']
Out[19]:
0                                     Computer science 
1                             Mathematics or statistics
2                                A business discipline 
3                                  Physics or astronomy
4                                          Engineering 
5                              I never declared a major
6     Information technology, networking, or system ...
7                                      Social sciences 
8                             Medical or life sciences 
9                                           Humanities 
10                     Environmental science or geology
Name: Major_undergraduate, dtype: object
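
The output above shows that cutting at '(' leaves a trailing space in some names (for example 'Computer science '). A possible variant, sketched here on a toy Series, stays inside the .str accessor and strips that whitespace as well.

```python
import pandas as pd

# Toy Series with three of the major names from the survey
majors = pd.Series([
    "Computer science (software engineering, etc.)",
    "Mathematics or statistics",
    "A business discipline (accounting, economics, finance, etc.)",
])

# Keep the text before the first '(' and remove surrounding whitespace
short = majors.str.split("(").str[0].str.strip()
print(short.tolist())
```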

Add the number of occurrences to the descriptions

In [20]:
# "\n" puts the count on its own line inside each tile of the treemap
label = PPL.apply(lambda x: str(x['Major_undergraduate']) + "\n (" + str(x['Age']) + ")", axis=1)
label
Out[20]:
0                            Computer science \n (112)
1                     Mathematics or statistics\n (52)
2                        A business discipline \n (34)
3                          Physics or astronomy\n (24)
4                                  Engineering \n (22)
5                      I never declared a major\n (17)
6     Information technology, networking, or system ...
7                              Social sciences \n (12)
8                      Medical or life sciences \n (7)
9                                    Humanities \n (4)
10              Environmental science or geology\n (3)
dtype: object
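
Some names, such as 'Information technology, networking, or system administration', are too long to fit inside a small tile. One option is to wrap the name over several lines before appending the count. The sketch below uses textwrap from the standard library; the helper make_label and the width of 20 characters are assumptions chosen for illustration.

```python
import textwrap

def make_label(name: str, count: int, width: int = 20) -> str:
    """Wrap a long name over several lines and put the count on the last line."""
    wrapped = textwrap.fill(name.strip(), width=width)
    return f"{wrapped}\n({count})"

short_label = make_label("Computer science ", 112)
long_label = make_label("Information technology, networking, or system administration", 14)

print(short_label)
print(long_label)
```

With a DataFrame such as PPL, the helper could presumably be applied row by row, e.g. [make_label(n, c) for n, c in zip(PPL['Major_undergraduate'], PPL['Age'])].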

Pull the vectors of data from the pivot table

In [21]:
# label was built in the previous cell; only the sizes are still needed
sizes = PPL['Age'].to_list()

colors = ['#ff0000','#434343','#666666','#999999','#b7b7b7','#cccccc','#d9d9d9','#efefef','#ffffff','#f3f3f3']
In [22]:
import squarify
import matplotlib.pyplot as plt
In [23]:
# Plot
plt.figure(figsize=(12,8), dpi= 380)
squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.9)

plt.title('Data Scientist society in Poland (2018)',  fontdict={'fontsize': 30, 'fontweight': 'medium', 'color':'#d0e0e3','alpha':0.8, 'y':1.02})
plt.axis('off') # no numbers on the axes
plt.show()
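
The color list above is picked by hand. As an alternative, the colors can be derived from a matplotlib colormap so that larger groups get darker tiles. This is only a sketch: the helper colors_from_sizes, the 'Purples' colormap and the low/high limits are assumptions, not the palette used in this notebook.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, to_hex

def colors_from_sizes(sizes, cmap_name="Purples", low=0.25, high=0.85):
    """Map each size to a hex color; bigger sizes get darker shades."""
    cmap = plt.get_cmap(cmap_name)
    norm = Normalize(vmin=min(sizes), vmax=max(sizes))
    # Restrict to the [low, high] part of the colormap to avoid
    # nearly white and nearly black tiles
    return [to_hex(cmap(low + (high - low) * float(norm(s)))) for s in sizes]

# Counts taken from the pivot table for Poland
example_sizes = [112, 52, 34, 24, 22, 17, 14, 12, 7, 4, 3]
example_colors = colors_from_sizes(example_sizes)
print(example_colors)
```

The resulting list has one color per tile, so it can be passed as the color argument in place of the handpicked list.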

Function to create a Treemap

Components needed to create a perfect treemap: labels, sizes, colors, title.

Before calling the function, I again pull the vectors of data from the pivot table.

In [24]:
# "\n" puts the count on its own line inside each tile of the treemap
label = PPL.apply(lambda x: str(x['Major_undergraduate']) + "\n (" + str(x['Age']) + ")", axis=1)
sizes = PPL['Age'].to_list()
title = 'Data Scientist society in Poland (2018)'

# https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/
#colors = ['#274e13','#6aa84f','#93c47d', '#b6d7a8','#d9ead3','#b7b7b7','#38761d'] #green
#colors = ['#0c343d','#134f5c','#45818e','#76a5af','#a2c4c9','#d0e0e3'] #cyan
#colors = ['#7f6000','#bf9000','#f1c232','#ffd966','#ffe599','#fff2cc'] #yellow
#colors = ['#4c1130','#a64d79','#c27ba0','#d5a6bd','#ead1dc','#741b47',] #magenta
#colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6'] #purple
#colors = ['#cfe2f3','#9fc5e8','#6fa8dc'] #blue
#colors = ['#d9ead3','#b6d7a8','#93c47d','#6aa84f']

#colors = ['#ff0000','#434343','#666666','#999999','#b7b7b7','#cccccc','#d9d9d9','#efefef','#ffffff','#f3f3f3'] #=> German magazine style
In [25]:
import squarify
import matplotlib.pyplot as plt

def Tmap(sizes, labels, colors, title):
    """Draw a treemap from tile sizes, tile labels, tile colors and a title."""
    plt.figure(figsize=(12,8), dpi= 380)
    # use the labels argument, not the global variable label
    squarify.plot(sizes=sizes, label=labels, color=colors, alpha=0.9)

    plt.title(title,  fontdict={'fontsize': 30, 'fontweight': 'medium', 'color':'#d0e0e3','alpha':0.9, 'y':1.02})
    plt.axis('off') # no numbers on the axes
    plt.show()
In [26]:
Tmap(sizes, label, colors, title)
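
Before calling a function like Tmap, it can help to make sure the three vectors fit together. The sketch below is a hypothetical helper (prepare_treemap_inputs is not part of squarify): it assumes that tiles with a zero or negative size should be dropped, and it repeats the color list when it is shorter than the number of tiles, as is the case for the 7 purple colors and 11 majors above.

```python
from itertools import cycle, islice

def prepare_treemap_inputs(sizes, labels, colors):
    """Drop non-positive sizes and repeat colors so all three lists have equal length."""
    if len(sizes) != len(labels):
        raise ValueError("sizes and labels must have the same length")
    if not colors:
        raise ValueError("at least one color is required")
    kept = [(s, l) for s, l in zip(sizes, labels) if s > 0]
    sizes_out = [s for s, _ in kept]
    labels_out = [l for _, l in kept]
    # Cycle through the palette until every tile has a color
    colors_out = list(islice(cycle(colors), len(kept)))
    return sizes_out, labels_out, colors_out

demo_sizes, demo_labels, demo_colors = prepare_treemap_inputs(
    [112, 52, 0, 24],
    ["Computer science", "Mathematics or statistics", "Empty group", "Physics or astronomy"],
    ["#c27ba0", "#d5a6bd"],
)
print(demo_sizes, demo_labels, demo_colors)
```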