An old Chinese proverb says that one picture is worth more than a thousand words.

One good plot can rescue an entire presentation, and one poor picture can sink an otherwise good speech. After plenty of embarrassing meetings and boring presentations, I decided to improve my visualisation tools.

import squarify 
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)

headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD'] 
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)

df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)

Correcting data

Every time we want to make a plot, we need to check and clean the data first. In particular, we check the unique values in each column and eliminate rare rubbish categories and NaN cells (missing data).

df.isnull().sum()

Duration (in seconds)       0
Gender                      0
Age                         0
Country                     0
Education                 421
Major_undergraduate       912
Recent_role               959
Industry                 2174
Years_of_experience      2758
compensation$USD         3674
dtype: int64
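
Absolute counts are easier to judge next to the share of missing cells in each column. A minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the survey), assuming the same `isnull()` approach as above:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the survey frame df
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Industry": ["Other", np.nan, np.nan, "Academics/Education"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing cells per column
missing_share = toy.isnull().mean()
print(missing_share)
```

Columns with a large share of missing cells are the ones where the choice of replacement value matters most.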

df.dtypes

Duration (in seconds)     int64
Gender                   object
Age                      object
Country                  object
Education                object
Major_undergraduate      object
Recent_role              object
Industry                 object
Years_of_experience      object
compensation$USD         object
dtype: object

It is very important to reduce the number of classes, or to join similar groups, as long as this does not harm the project.

df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')

df.Education.value_counts(dropna = False)

Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
Professional degree                                                    599
NaN                                                                    421
I prefer not to answer                                                 345
No formal education past high school                                   232
Name: Education, dtype: int64

We can assume that if somebody did not answer, they did not want to give the information, so we treat the missing value as 'I prefer not to answer'.

import numpy as np

# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
df['Education']=df['Education'].replace(np.nan, 'I prefer not to answer')

df.Education.value_counts(dropna = False)

Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
I prefer not to answer                                                 766
Professional degree                                                    599
No formal education past high school                                   232
Name: Education, dtype: int64

df.Education.isnull().sum()

0
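
The same substitution can also be written with `fillna`, which touches only the missing cells. A minimal sketch on a made-up Series (the values are illustrative, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for df['Education']
education = pd.Series(["Doctoral degree", np.nan, "I prefer not to answer"])

# fillna replaces only NaN cells and leaves every other value alone
education = education.fillna("I prefer not to answer")

print(education.value_counts(dropna=False))
print(education.isnull().sum())
```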

df.Major_undergraduate.value_counts(dropna = False)

Computer science (software engineering, etc.)                    9430
Engineering (non-computer focused)                               3705
Mathematics or statistics                                        2950
A business discipline (accounting, economics, finance, etc.)     1791
Physics or astronomy                                             1110
Information technology, networking, or system administration     1029
NaN                                                               912
Medical or life sciences (biology, chemistry, medicine, etc.)     871
Other                                                             770
Social sciences (anthropology, psychology, sociology, etc.)       554
Humanities (history, literature, philosophy, etc.)                269
Environmental science or geology                                  253
I never declared a major                                          128
Fine arts or performing arts                                       87
Name: Major_undergraduate, dtype: int64

I assume that NaN and 'Other' appear when someone does not want to declare their specialization, so I treat both as 'I never declared a major'.

df['Major_undergraduate']=df['Major_undergraduate'].replace(np.nan, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
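
When several labels are merged into one class, a single mapping keeps the rule in one place. A minimal sketch on made-up data, assuming the same target class as above:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for df['Major_undergraduate']
major = pd.Series([
    "Physics or astronomy",
    "Other",
    np.nan,
    "I never declared a major",
])

target = "I never declared a major"

# The dict lists every label that should be folded into the target class;
# fillna then handles the missing cells
merged = major.replace({"Other": target}).fillna(target)

print(merged.value_counts())
```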

df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x287b9519550>

df.Recent_role.value_counts(dropna=False)

Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
NaN                         959
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Recent_role, dtype: int64

df['Recent_role']=df['Recent_role'].replace(np.nan, 'Other')

Poland in data

Because I am from Poland, the most interesting data for me is the information from my country, so I separate the Polish records from the original data.

PL= df[df.Country=='Poland']

Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
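
A pivot table with `aggfunc='count'` gives the number of non-missing `Age` entries per major. When `Age` has no missing values, as in this data, `value_counts` on the grouping column returns the same numbers. A minimal sketch on made-up data (the short category names are placeholders):

```python
import pandas as pd

# Toy data (hypothetical), standing in for the PL frame
pl = pd.DataFrame({
    "Major_undergraduate": ["CS", "CS", "Math", "CS", "Math", "Physics"],
    "Age": ["25-29", "30-34", "25-29", "22-24", "35-39", "25-29"],
})

# Same call as in the article: count the Age entries per major, largest first
by_pivot = pl.pivot_table(
    index=["Major_undergraduate"], values="Age", aggfunc="count"
).sort_values("Age", ascending=False)

# value_counts counts rows per category and sorts in descending order by default
by_counts = pl["Major_undergraduate"].value_counts()

print(by_pivot["Age"].to_dict())
print(by_counts.to_dict())
```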

The Treemap

I came across this publication and decided to build the treemap the same way:
https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

To prepare a perfect treemap, I first need to pull the data vectors out of the pivot table.

PPL=Z5.reset_index()
PPL.head(5)

Cutting overly long descriptions

PPL['Major_undergraduate']= PPL['Major_undergraduate'].str.split('(').apply(lambda x: x[0])
PPL['Major_undergraduate']

0                                     Computer science 
1                             Mathematics or statistics
2                                A business discipline 
3                                  Physics or astronomy
4                                          Engineering 
5                              I never declared a major
6     Information technology, networking, or system ...
7                                      Social sciences 
8                             Medical or life sciences 
9                                           Humanities 
10                     Environmental science or geology
Name: Major_undergraduate, dtype: object
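
The split leaves a trailing space wherever the text in parentheses was cut off (for example 'Computer science '). A minimal sketch of a variant that also strips the whitespace, using the pandas `.str` accessor on made-up input:

```python
import pandas as pd

# Two sample category names, one with a parenthesised explanation
majors = pd.Series([
    "Computer science (software engineering, etc.)",
    "Physics or astronomy",
])

# Keep the text before the first "(" and drop the surrounding whitespace
short = majors.str.split("(").str[0].str.strip()

print(short.to_list())
```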

Adding the number of occurrences to the descriptions

# Build "name\n (count)" labels; the newline puts the count on its own line in the plot
label = PPL.apply(lambda x: str(x['Major_undergraduate']) + "\n (" + str(x['Age']) + ")", axis=1)
label

0                            Computer science \n (112)
1                     Mathematics or statistics\n (52)
2                        A business discipline \n (34)
3                          Physics or astronomy\n (24)
4                                  Engineering \n (22)
5                      I never declared a major\n (17)
6     Information technology, networking, or system ...
7                              Social sciences \n (12)
8                      Medical or life sciences \n (7)
9                                    Humanities \n (4)
10              Environmental science or geology\n (3)
dtype: object

Pulling vectors of data from the pivot table

# label was built in the previous step; sizes are the counts per major
sizes = PPL['Age'].to_list()

colors = ['#ff0000','#434343','#666666','#999999','#b7b7b7','#cccccc','#d9d9d9','#efefef','#ffffff','#f3f3f3']

import squarify
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(12,8), dpi= 380)
squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.9)

plt.title('Data Scientist society in Poland (2018)',  fontdict={'fontsize': 30, 'fontweight': 'medium', 'color':'#d0e0e3','alpha':0.8, 'y':1.02})
plt.axis('off') # no numbers on the axes
plt.show()

Trigger to create Treemap

Components needed to create a perfect treemap: labels, sizes, colors, title.

As before, I first pull the data vectors out of the pivot table.

# "name\n (count)" labels, one per rectangle
label = PPL.apply(lambda x: str(x['Major_undergraduate']) + "\n (" + str(x['Age']) + ")", axis=1)
sizes = PPL['Age'].to_list()
title = 'Data Scientist society in Poland (2018)'

# https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/
#colors = ['#274e13','#6aa84f','#93c47d', '#b6d7a8','#d9ead3','#b7b7b7','#38761d'] #green
#colors = ['#0c343d','#134f5c','#45818e','#76a5af','#a2c4c9','#d0e0e3'] #cyan
#colors = ['#7f6000','#bf9000','#f1c232','#ffd966','#ffe599','#fff2cc'] #yellow
#colors = ['#4c1130','#a64d79','#c27ba0','#d5a6bd','#ead1dc','#741b47',] #magenta
#colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6'] #purple
#colors = ['#cfe2f3','#9fc5e8','#6fa8dc'] #blue
#colors = ['#d9ead3','#b6d7a8','#93c47d','#6aa84f']

#colors = ['#ff0000','#434343','#666666','#999999','#b7b7b7','#cccccc','#d9d9d9','#efefef','#ffffff','#f3f3f3'] #=> German magazine style

import squarify
import matplotlib.pyplot as plt

def Tmap(sizes, labels, colors, title):
    """Draw a treemap from the given sizes, labels, colors and title."""
    plt.figure(figsize=(12,8), dpi= 380)
    # use the labels argument, not the global variable label
    squarify.plot(sizes=sizes, label=labels, color=colors, alpha=0.9)

    plt.title(title,  fontdict={'fontsize': 30, 'fontweight': 'medium', 'color':'#d0e0e3','alpha':0.9, 'y':1.02})
    plt.axis('off') # no numbers on the axes
    plt.show()

Tmap(sizes, label, colors, title)
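
The hand-picked hex lists above can also be generated from a built-in matplotlib colormap, which gives one color per rectangle. A minimal sketch, assuming 11 categories and the 'Purples' colormap; both are illustrative choices, not part of the original workflow:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

n_rectangles = 11  # assumed: one color per category in the treemap

# Sample the colormap at evenly spaced points, skipping its palest and darkest ends
cmap = plt.get_cmap("Purples")
colors = [to_hex(cmap(v)) for v in np.linspace(0.25, 0.85, n_rectangles)]

print(colors)
```

The resulting list can be passed as the `colors` argument in place of a manually typed palette.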

Output of df.head(4) (first four rows of the raw survey data):

	Duration (in seconds)	Gender	Gender2	Age	Country	Education	Major_undergraduate	Recent_role	Recent_role2	Industry	Industry2	Years_of_experience	compensation$USD
0	710	Female	-1	45-49	United States of America	Doctoral degree	Other	Consultant	-1	Other	0	NaN	NaN
1	434	Male	-1	30-34	Indonesia	Bachelor’s degree	Engineering (non-computer focused)	Other	0	Manufacturing/Fabrication	-1	5-10	10-20,000
2	718	Female	-1	30-34	United States of America	Master’s degree	Computer science (software engineering, etc.)	Data Scientist	-1	I am a student	-1	0-1	0-10,000
3	621	Male	-1	35-39	United States of America	Master’s degree	Social sciences (anthropology, psychology, soc…	Not employed	-1	NaN	-1	NaN	NaN

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

Perfect Plot: Treemap

Output of Z5.head(10):

Major_undergraduate	Age
Computer science (software engineering, etc.)	112
Mathematics or statistics	52
A business discipline (accounting, economics, finance, etc.)	34
Physics or astronomy	24
Engineering (non-computer focused)	22
I never declared a major	17
Information technology, networking, or system administration	14
Social sciences (anthropology, psychology, sociology, etc.)	12
Medical or life sciences (biology, chemistry, medicine, etc.)	7
Humanities (history, literature, philosophy, etc.)	4