Perfect Plots: Pie Plot
https://sigmaquality.pl/data-plots/perfect-plots_-pie-plot/

Feel free to read the code on GitHub

 

An old Chinese proverb says that one picture is worth a thousand words. One good plot can rescue an entire presentation; one poor picture can drown an otherwise good speech. After plenty of embarrassing meetings and boring presentations, I decided to improve my visualisation toolbox.

In [1]:
import pandas as pd

df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)
In [2]:
headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD'] 
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)
Out[2]:
  Duration (in seconds) Gender Gender2 Age Country Education Major_undergraduate Recent_role Recent_role2 Industry Industry2 Years_of_experience compensation$USD
0 710 Female -1 45-49 United States of America Doctoral degree Other Consultant -1 Other 0 NaN NaN
1 434 Male -1 30-34 Indonesia Bachelor’s degree Engineering (non-computer focused) Other 0 Manufacturing/Fabrication -1 5-10 10-20,000
2 718 Female -1 30-34 United States of America Master’s degree Computer science (software engineering, etc.) Data Scientist -1 I am a student -1 0-1 0-10,000
3 621 Male -1 35-39 United States of America Master’s degree Social sciences (anthropology, psychology, soc… Not employed -1 NaN -1 NaN NaN
In [3]:
df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)
 

Correcting data

Whenever we want to draw a plot, we first need to check and clean the data: in particular, inspect the unique values and eliminate the small amount of rubbish and NaN cells (missing data).

In [4]:
df.isnull().sum()
Out[4]:
Duration (in seconds)       0
Gender                      0
Age                         0
Country                     0
Education                 421
Major_undergraduate       912
Recent_role               959
Industry                 2174
Years_of_experience      2758
compensation$USD         3674
dtype: int64
In [5]:
df.dtypes
Out[5]:
Duration (in seconds)     int64
Gender                   object
Age                      object
Country                  object
Education                object
Major_undergraduate      object
Recent_role              object
Industry                 object
Years_of_experience      object
compensation$USD         object
dtype: object
 

It is very important to reduce the number of classes, or to merge similar groups, as long as this does not hurt the project.

In [6]:
df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')
In [7]:
df.Education.value_counts(dropna = False)
Out[7]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
Professional degree                                                    599
NaN                                                                    421
I prefer not to answer                                                 345
No formal education past high school                                   232
Name: Education, dtype: int64
 

We can assume that if somebody didn't answer, they didn't want to give the information: 'I prefer not to answer'.

In [8]:
import numpy as np

df['Education']=df['Education'].replace(np.NaN, 'I prefer not to answer')
In [9]:
df.Education.value_counts(dropna = False)
Out[9]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
I prefer not to answer                                                 766
Professional degree                                                    599
No formal education past high school                                   232
Name: Education, dtype: int64
In [10]:
df.Education.isnull().sum()
Out[10]:
0
In [11]:
df.Major_undergraduate.value_counts(dropna = False)
Out[11]:
Computer science (software engineering, etc.)                    9430
Engineering (non-computer focused)                               3705
Mathematics or statistics                                        2950
A business discipline (accounting, economics, finance, etc.)     1791
Physics or astronomy                                             1110
Information technology, networking, or system administration     1029
NaN                                                               912
Medical or life sciences (biology, chemistry, medicine, etc.)     871
Other                                                             770
Social sciences (anthropology, psychology, sociology, etc.)       554
Humanities (history, literature, philosophy, etc.)                269
Environmental science or geology                                  253
I never declared a major                                          128
Fine arts or performing arts                                       87
Name: Major_undergraduate, dtype: int64
 

I take NaN and 'Other' to mean that someone does not want to declare a specialisation, so I map both to 'I never declared a major'.

In [12]:
df['Major_undergraduate']=df['Major_undergraduate'].replace(np.NaN, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
In [13]:
import matplotlib.pyplot as plt
df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cea62cc0>
In [14]:
df.Recent_role.value_counts(dropna=False)
Out[14]:
Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
NaN                         959
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Recent_role, dtype: int64
In [15]:
df['Recent_role']=df['Recent_role'].replace(np.NaN, 'Other')
In [16]:
Z1 = df.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count').sort_values('Male',ascending=False)
Z1
Out[16]:
Gender Female Male Prefer not to say
Major_undergraduate      
Computer science (software engineering, etc.) 1463 7837 130
Engineering (non-computer focused) 432 3223 50
Mathematics or statistics 660 2241 49
I never declared a major 297 1438 75
A business discipline (accounting, economics, finance, etc.) 334 1435 22
Physics or astronomy 119 968 23
Information technology, networking, or system administration 186 832 11
Medical or life sciences (biology, chemistry, medicine, etc.) 203 646 22
Social sciences (anthropology, psychology, sociology, etc.) 160 379 15
Environmental science or geology 57 190 6
Humanities (history, literature, philosophy, etc.) 74 185 10
Fine arts or performing arts 25 56 6
In [17]:
Z1.plot(kind='barh', legend=True, title='Data Scientists by Major undergraduate and Gender (Kaggle 2018)', figsize=(7, 4), color=('b','g','y'))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cef146d8>
In [18]:
Z2 = df.pivot_table(index=['Country'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False).nlargest(20,'Male')
Z2
Out[18]:
Gender Female Male Prefer not to say SUM
Country        
SUM 4010.0 19430.0 419.0 23859
India 657.0 3719.0 41.0 4417
United States of America 1082.0 3530.0 104.0 4716
China 267.0 1337.0 40.0 1644
Other 165.0 849.0 22.0 1036
Russia 113.0 750.0 16.0 879
Brazil 65.0 666.0 5.0 736
Germany 103.0 621.0 10.0 734
Japan 34.0 557.0 6.0 597
United Kingdom of Great Britain and Northern Ireland 131.0 554.0 17.0 702
France 104.0 494.0 6.0 604
Canada 123.0 475.0 6.0 604
Spain 75.0 406.0 4.0 485
Italy 47.0 303.0 5.0 355
Australia 51.0 272.0 7.0 330
Turkey 56.0 267.0 4.0 327
I do not wish to disclose my location 83.0 250.0 61.0 394
Poland 54.0 243.0 4.0 301
Netherlands 41.0 225.0 4.0 270
Ukraine 31.0 218.0 3.0 252
 

Poland in data

Because I am from Poland, the most interesting data for me is the information about my country, so I separate the Polish records from the original data.

In [19]:
PL= df[df.Country=='Poland']
In [20]:
Z3 = PL.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[20]:
Gender Female Male Prefer not to say SUM
Major_undergraduate        
SUM 54.0 243.0 4.0 301
Computer science (software engineering, etc.) 15.0 96.0 1.0 112
Mathematics or statistics 12.0 39.0 1.0 52
A business discipline (accounting, economics, finance, etc.) 9.0 24.0 1.0 34
Physics or astronomy 4.0 20.0 NaN 24
Engineering (non-computer focused) 3.0 18.0 1.0 22
I never declared a major 2.0 15.0 NaN 17
Information technology, networking, or system administration 3.0 11.0 NaN 14
Medical or life sciences (biology, chemistry, medicine, etc.) NaN 7.0 NaN 7
Social sciences (anthropology, psychology, sociology, etc.) 5.0 7.0 NaN 12
Humanities (history, literature, philosophy, etc.) NaN 4.0 NaN 4
Environmental science or geology 1.0 2.0 NaN 3
In [21]:
Z3 = PL.pivot_table(index=['Recent_role'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[21]:
Gender Female Male Prefer not to say SUM
Recent_role        
SUM 54.0 243.0 4.0 301
Data Scientist 15.0 58.0 NaN 73
Software Engineer 4.0 46.0 NaN 50
Student 5.0 28.0 NaN 33
Other 5.0 21.0 NaN 26
Data Analyst 9.0 19.0 1.0 29
Research Scientist 2.0 15.0 NaN 17
Consultant 1.0 10.0 NaN 11
Business Analyst 2.0 9.0 1.0 12
Manager 1.0 6.0 NaN 7
Research Assistant 2.0 6.0 NaN 8
Data Engineer 2.0 5.0 1.0 8
Not employed 3.0 5.0 1.0 9
Chief Officer 1.0 4.0 NaN 5
Product/Project Manager 1.0 3.0 NaN 4
DBA/Database Engineer NaN 3.0 NaN 3
Statistician NaN 2.0 NaN 2
Data Journalist NaN 1.0 NaN 1
Principal Investigator NaN 1.0 NaN 1
Salesperson 1.0 1.0 NaN 2
 

Let's make a standard, quick pie plot

We get a banal, predictable visualization.

In [22]:
Z4 = PL.pivot_table(index=['Recent_role'], values='Age',aggfunc='count').sort_values('Age', ascending=False)

Z4.plot(kind='pie', subplots=True, legend=False, title="Data Scientists by Recent_role (Kaggle 2018)", figsize=(15,7), autopct='%1.1f%%')
Out[22]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000260CF010F60>],
      dtype=object)
In [23]:
Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
Out[23]:
  Age
Major_undergraduate  
Computer science (software engineering, etc.) 112
Mathematics or statistics 52
A business discipline (accounting, economics, finance, etc.) 34
Physics or astronomy 24
Engineering (non-computer focused) 22
I never declared a major 17
Information technology, networking, or system administration 14
Social sciences (anthropology, psychology, sociology, etc.) 12
Medical or life sciences (biology, chemistry, medicine, etc.) 7
Humanities (history, literature, philosophy, etc.) 4
 

Better Pie Plot with interesting colors

To begin with, we can change the colors and give better descriptions.

GSuite Text and Background Palette: https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/

In [24]:
import matplotlib.pyplot as plt

## figure size
plt.figure(figsize=(16,8))

## declare axes with an equal aspect ratio, so the pie stays circular
ax1 = plt.subplot(aspect='equal')

## set the colours
colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']

## the basic plotting call
Z5.plot(kind='pie', colors=colors, y='Age', ax=ax1, autopct='%1.1f%%')

# labels, titles, etc.
ax1.set_xlabel('Something to write', fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Something to write', fontsize=11, color='grey', alpha=0.8)
ax1.set_title('Major_undergraduate in Data Scientists (Kaggle 2018)', fontsize=18, color='grey', alpha=0.8)
ax1.set_facecolor('#d8dcd6')
 

The best Pie Plot

I came across the publication below and decided to build my pie plots this way.
https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [25]:
PPL=Z5.reset_index()
PPL.head(5)
Out[25]:
  Major_undergraduate Age
0 Computer science (software engineering, etc.) 112
1 Mathematics or statistics 52
2 A business discipline (accounting, economics, … 34
3 Physics or astronomy 24
4 Engineering (non-computer focused) 22
 

Now we pull the vectors of data from the pivot table.

In [26]:
PPL.reset_index()
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()

fig1, ax1 = plt.subplots(figsize=(10,5))


ax1.pie(sizes, labels=labels, autopct='%1.1f%%')

ax1.axis('equal')  
plt.tight_layout()
plt.show()
 

Changing colors

In [27]:
# create the figure and axes (figsize in inches)
fig1, ax1 = plt.subplots(figsize=(10,5))

colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()
 

Changing the size and color of all fonts

textprops={'fontsize': 30, 'color': 'green'}

In [28]:
# create the figure and axes (figsize in inches)
fig1, ax1 = plt.subplots(figsize=(18,12))

colors = ['#e06666','#ea9999','#f4cccc','#ff0000','#434343']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', textprops={'fontsize': 30, 'color': 'green'})
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  # Remove the CWD from sys.path while we load stuff.
 

Changing the size and color of individual fonts

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
In [29]:
fig1, ax1 = plt.subplots(figsize=(15,12))

colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:13: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  del sys.path[0]
 

Changing the size and color of chosen categories

In [30]:
fig1, ax1 = plt.subplots(figsize=(6,6))

colors = ['#ff9999','#747574','#99ff99','#ffcc99','#f1c232']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('grey')

    
texts[0].set_fontsize(24)
texts[0].set_color('black')
texts[4].set_fontsize(33)
texts[4].set_color('green')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:20: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
 

Making a bagel

In [31]:
fig1, ax1 = plt.subplots(figsize=(18,6))

colors = ['#a2c4c9','#b6d7a8','#747574','#99ff99','#ffcc99','#76a5af']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%',
                                    wedgeprops=dict(width=0.5))  # wedgeprops is an assumption: width < 1 hollows the pie into a ring
# Equal aspect ratio ensures that pie is drawn as a circle

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
    
ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

Making a better bagel

In [32]:
fig1, ax1 = plt.subplots(figsize=(18,8))

colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')
    autotext.set_fontsize(22)

texts[0].set_fontsize(18)
texts[0].set_color('black')
    
#draw circle
centre_circle = plt.Circle((0,0),0.40,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)    
    
    
ax1.set_xlabel('Year 2018',  fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Data Scientist', fontsize=11,  color='grey', alpha=0.8)
ax1.set_title('Data Scientist by profession',  fontsize=58, color='#d0e0e3', alpha=0.8)
ax1.set_facecolor('#d8dcd6')


ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

We bring in the gender variable

In [33]:
PL.columns
Out[33]:
Index(['Duration (in seconds)', 'Gender', 'Age', 'Country', 'Education',
       'Major_undergraduate', 'Recent_role', 'Industry', 'Years_of_experience',
       'compensation$USD'],
      dtype='object')
In [34]:
Z6 = PL.pivot_table(index=['Major_undergraduate','Gender'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z6.head(10)
Out[34]:
    Age
Major_undergraduate Gender  
Computer science (software engineering, etc.) Male 96
Mathematics or statistics Male 39
A business discipline (accounting, economics, finance, etc.) Male 24
Physics or astronomy Male 20
Engineering (non-computer focused) Male 18
Computer science (software engineering, etc.) Female 15
I never declared a major Male 15
Mathematics or statistics Female 12
Information technology, networking, or system administration Male 11
A business discipline (accounting, economics, finance, etc.) Female 9
 

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [35]:
PLG=Z6.reset_index()
PLG.head(2)
Out[35]:
  Major_undergraduate Gender Age
0 Computer science (software engineering, etc.) Male 96
1 Mathematics or statistics Male 39
In [36]:
PLG.reset_index()
labels_gender = PLG['Gender'].to_list()
sizes_gender = PLG['Age'].to_list()
 

The double bagel

In [37]:
import matplotlib.pyplot as plt


colors_gender = ['#c2c2f0','#ffb3e6']
 

fig1, ax1 = plt.subplots(figsize=(18,6))

colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%',
                                    pctdistance=0.85)  # pctdistance is an assumption: it keeps the percentages on the visible outer ring

plt.pie(sizes_gender,colors=colors_gender,radius=0.75,startangle=0)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)




for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')

    
#draw circle
centre_circle = plt.Circle((0,0),0.50,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)    
    
    
ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

The double bagel is "one bridge too far". The plot is beautiful, but the two rings are not correlated with each other. To achieve an adequate connection, both vectors should come from one pivot table. At the moment I have no idea how to do it (groupby, query, pivot...); one possible approach is sketched below.
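One possible approach, a minimal sketch assuming the PL frame defined above: build both rings from a single pivot, and sort the inner (gender) vector by the same major order as the outer vector, so every outer wedge spans exactly its own inner wedges.

import matplotlib.pyplot as plt

# Both rings come from ONE pivot table, so the wedge fractions line up.
Z7 = (PL.pivot_table(index=['Major_undergraduate', 'Gender'],
                     values='Age', aggfunc='count')
        .reset_index())

# Outer ring: totals per major, largest first
outer = Z7.groupby('Major_undergraduate')['Age'].sum().sort_values(ascending=False)

# Inner ring: gender counts ordered by the SAME major order
order = {major: i for i, major in enumerate(outer.index)}
inner = Z7.assign(order=Z7['Major_undergraduate'].map(order)).sort_values(['order', 'Gender'])

fig, ax = plt.subplots(figsize=(10, 10))
ax.pie(outer, labels=outer.index, radius=1.0, autopct='%1.1f%%',
       wedgeprops=dict(width=0.3, edgecolor='white'))
ax.pie(inner['Age'], radius=0.7,
       wedgeprops=dict(width=0.3, edgecolor='white'))
ax.axis('equal')
plt.show()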

 

Trigger to create Pie Plot

Components needed to create the perfect pie plot: labels, sizes, colors.

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [38]:
PPL=Z5.reset_index()
PPL.head(5)
PPL.reset_index()

labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()

colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
In [39]:
def PPieP(sizes,labels,colors):
    fig1, ax1 = plt.subplots(figsize=(18,8))

    patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


    for text in texts:
        text.set_color('grey')
    for autotext in autotexts:
        autotext.set_color('black')
        autotext.set_fontsize(22)

    texts[0].set_fontsize(18)
    texts[0].set_color('black')
    
    #draw circle
    centre_circle = plt.Circle((0,0),0.40,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)    
    
    
    ax1.set_xlabel('Year 2018',  fontsize=15, color='darkred', alpha=1)
    ax1.set_ylabel('Data Scientist', fontsize=11,  color='grey', alpha=0.8)
    ax1.set_title('Data Scientist by profession',  fontsize=58, color='#d0e0e3', alpha=0.8)
    ax1.set_facecolor('#d8dcd6')


    ax1.axis('equal')  
    plt.tight_layout()

    plt.show()
In [40]:
# Variables to the trigger:

labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()
#colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
#colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
#colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']
#colors = ['#cfe2f3','#9fc5e8','#6fa8dc']
colors = ['#d9ead3','#b6d7a8','#93c47d','#6aa84f']



# Trigger:

PPieP(sizes,labels,colors)
 

Good advice for presentations: prepare all your plots to one common standard.

As the man who built my house used to say: messy, but consistent!
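One way to enforce such a standard in matplotlib is to set the style once, at the top of the notebook, instead of styling every plot by hand. A minimal sketch; the values are illustrative, not a recommended palette:

import matplotlib.pyplot as plt

# One place to define the house style; every later plot inherits it.
plt.rcParams.update({
    'figure.figsize': (10, 5),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelcolor': 'grey',
    'text.color': 'darkred',
})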

Estimation of the result of the empirical research with machine learning tools (part 1)
https://sigmaquality.pl/uncategorized/estimation-of-the-result-of-the-empirical-research-with-machine-learning-tools/


Part one: preliminary graphical analysis of coefficient dependence

Machine learning tools

Thanks to predictive and classification models from the machine learning toolbox, it is possible to significantly decrease the cost of laboratory verification research.

The costs of empirical verification are counted into the technical cost of production. When producing certain active chemical substances, it is necessary to run an empirical laboratory classification that allocates the product to a separate quality class.

Such research can turn out to be very expensive. In the case of short production runs, the cost of classification can make the whole run unprofitable.

Machine learning tools can come to the rescue here, replacing expensive laboratory investigation with a theoretical judgment.

An effective prediction model can reduce the need for costly empirical research to a reasonable minimum.

Manual classification would then be used only in special situations where the model proves ineffective, or for random spot checks of the process.

Case study: laboratory classification of the active chemical substance Poliaxid

We will now follow the process of building a machine learning model based on classification with the Random Forest method. A chemical plant produces small amounts of an expensive substance named Poliaxid. The substance must meet very rigorous quality requirements: each charge has to pass a special laboratory verification. These empirical trials are expensive and long-lasting, and their cost significantly influences the overall cost of production. The Poliaxid production process is monitored by many gauges; a computer saves eleven variables, such as trace contents of certain chemical substances, acidity, and density. It has been noticed that the levels of some of the collected coefficients are related to the result of the final quality classification. This cause-and-effect relationship leads to the conclusion that it is possible to create a classification model explaining the overall process. In this case study we use a database that can be downloaded from this address: source

The database contains the results of 1,593 trials, with eleven coefficients saved during the process for each trial.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']   # drop the redundant row-number column
df.head(5)

In the last column, "quality class", we find the results of the laboratory classification.

Classes 0 and 1 mean the best quality of the substance; results 2, 3 and 4 mean worse quality.

Before we start building a machine learning model, we ought to look at the data. We do this with matrix plots, which show us which coefficients are good predictors and display the overall dependencies between the exogenous and endogenous variables.

Graphical analysis of coefficient dependence

Construction of the model should be preceded by a graphical overview.

In this way we learn whether a model is feasible at all.

First we divide the results from the "quality class" column into two categories: 'First' and 'Second'.

df['Qual_G'] = df['quality class'].apply(lambda x: 'First' if x < 2 else 'Second')
df.sample(3)

A new column, "Qual_G", appears at the end of the table.

Now we create a vector of correlations between the independent coefficients and the result factor in the 'quality class' column.

CORREL = df.corr().sort_values('quality class')
CORREL['quality class']

The correlation vector points to significant influences of the exogenous factors on the results of the empirical classification.

We choose the most effective predictors among all eleven variables and put them into the correlation matrix plot.

The matrix plot contains two colors; blue dots mark first quality. Thanks to this, all the dependencies are clearly displayed.

import seaborn as sns

sns.pairplot(data=df[['factorB', 'citric catoda','sulfur in nodinol', 'noracid', 'lacapon','Qual_G']], hue='Qual_G', dropna=True)

The matrix clearly displays the patterns of dependency between variables. It is easy to see that some coefficients have a significant impact on classification into the first or second quality class.

A dichotomous division is good for displaying dependencies. Let's see what happens when we use the laboratory's original division into 5 quality classes, taking only the two most effective predictors. Even then the plot is illegible; a sketch of how it could be drawn follows below.
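The 5-class plot itself is not shown in the notebook; a minimal sketch, assuming 'factorB' and 'lacapon' are the two strongest predictors (pick your own pair from the CORREL vector):

import seaborn as sns

# hue on the raw 5-class laboratory labels instead of the dichotomous Qual_G
sns.pairplot(data=df[['factorB', 'lacapon', 'quality class']],
             hue='quality class', dropna=True)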

In the next part of this article we use machine learning tools to make the theoretical classification.

Next part:

Estimation of the result of the empirical research with machine learning tools (part 2)

Why go from Excel to Python?
https://sigmaquality.pl/uncategorized/why-go-from-excel-to-python/


I observe that many analysts and data experts wonder about the question: why go from Excel to Python?

I understand these people very well. They have long experience in Excel: they know how to use its limited memory and calculation space rationally, and how to fill Excel's gaps with external applications and systems.

There are two real barriers to adopting Python in professional work in place of Excel: functionality and time.

 

In Python, we have to spend a long time learning actions that anyone can do in Excel easily, without any training. This is frustrating and feels unjust. Many easy functionalities are less obvious in Python (for example, calculating the mode).
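For the record, the mode is available in pandas too, though less discoverable than Excel's MODE(); a minimal example on made-up data:

import pandas as pd

s = pd.Series([1, 2, 2, 3, 2, 4])
s.mode()           # returns ALL modal values as a Series
s.mode().iloc[0]   # the single most frequent value, like Excel's MODE()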

The second barrier is lack of time. If we have to do something NOW, there is no time for experiments with uncertain solutions. Analysts never have time!

Why go from Excel to Python?

My professional experience with Python

The real problem appears in the face of giant databases: thousands of dimensions and tens of thousands of entities and products that need to be analyzed on a regular basis, when in a short time we have to provide answers to astronomical-scale problems.

We read a lot on the Internet about huge, complex solutions, about machine learning and BI dashboards. From my professional experience: for cosmic problems, a simple solution multiplied on a large scale works best. Some workplaces offer the conditions for scientific work; mine required monitoring a business model on a gigantic scale, and this was realized with simple Data Science tools.

Astronomical-scale problems

For the last four years I worked as a Data Scientist for a big producer in the wood-processing branch. I was involved in monitoring a huge system of raw material and product flows: thousands of raw-material indexes, 8 kinds of transport, our own fleet of 300 lorries, and half a thousand external lorries. Raw materials were accepted by four plants and six wood yards.

The system for sending products to customers was similarly huge and complicated. I tried to carry out all the processes in Excel sheets. There was a very efficient database, and I was able to portion the information out into Excel, but Excel had a huge problem with the efficiency and duration of the analysis. I tried to improve performance by increasing the memory in my laptop.

Many analysts don't want to get to know Python better and use it. An expert who has used Excel throughout his professional life, and knows many advanced tricks and complex applications, feels he has enough tools to solve every problem. When he meets real scale, he understands why to go from Excel to Python.

In the next publication I show the two main Data Science tools I used to control an astronomical-scale logistics model.

Two easy Big Data tools to keep control on the astronomical scale business
https://sigmaquality.pl/uncategorized/two-easy-big-data-tools-to-keep-control-on-the-astronomical-scale-business/


Which Big Data tools can keep control of an astronomical-scale business?

In the last publication I tried to convince you that it is sometimes better to abandon Excel, especially when facing a giant number of dimensions, thousands of variables, and huge scale. I found myself in such circumstances five years ago, when I started working for a big chipboard producer. My duty was monitoring very big logistics processes and detecting anomalies in them. Daily, I monitored more than a thousand operations of raw-material reception and goods dispatch in Poland. In such processes the most important thing is effective anomaly detection; when an anomaly appeared, the full machinery was launched to explain the situation.

I worked as a Data Scientist on other interesting projects, while at the same time the autonomous system I had built was detecting and reporting anomalies.

Big Data tools to keep control on the astronomical scale business

I used two simple algorithms: comparison of real events against standard values, and the probability density of the normal distribution.

These two directions of research, in a very basic form, were replicated across a large number of applications. The systems effectively detected anomalies and fraud. The system did not work in real time but ran whenever I reloaded the data; the reason was frequent corrections to documents made during an operation and shortly thereafter, to which a real-time system would react too quickly. A day later the process was stable and ready for testing.

Comparing real events to standard values

I will show how the first algorithm works on an extremely easy example. The same method can be applied to very complex processes, monitoring a huge number of variables in various configurations.

Please open this example database. Source here!

In the first column we see the date of raw-material reception; the second column is the kind of assortment; in the third we find the transaction price. In my practice I had many thousands of assortments and many more columns with other costs, dimensions, suppliers, categories, and delivery conditions. I created special columns with ratios combining these dimensions and costs. In this example I will show only the simple mechanism for doing it in Pandas.
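Those ratio columns are not shown here; as a one-line sketch, assuming a hypothetical 'Quantity' column alongside the price:

# 'Quantity' is a hypothetical column, for illustration only
df['Cost_per_unit'] = df['Purchase price'] / df['Quantity']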

We open the example database.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/Base1.csv', index_col=0, parse_dates=['Date of transaction'])
df.head()

We generate the maximum prices from the list of transactions.

df.pivot_table(index='Assortment', values='Purchase price', aggfunc={max})

This tells me where the biggest purchase cost appears for each assortment. It is a very easy function in Python, but not so easy in Excel, where you need an array formula.

An Excel array formula applied to a giant database needs really powerful hardware and a long calculation time; Python calculates it faster. Now we add an additional column with the month.

df['Month'] = df['Date of transaction'].dt.month
df.head()

Now we can display extreme values for each assortment in each month.

df.pivot_table(index=['Month','Assortment'], values='Purchase price', aggfunc={max, min, np.mean})

 

 

As usual, the controlling or purchasing department has special agreements with the raw-material suppliers. We get a maximum-price table for each assortment and/or each supplier. It is difficult to check a couple of thousand transactions with Excel lookup functions, which take a lot of time; it is faster in Pandas.

We open the price table for assortments. Source here!

Price_Max_Min = pd.read_excel('c:/2/tablePR.xlsx', index_col=0)
Price_Max_Min.head()

 

Now we add to our main transaction table columns with the max and min price from the control table for each assortment.

# map each assortment to its contractual maximum price
T2 = Price_Max_Min.set_index('Asso')['Max'].to_dict()
df['Max_price'] = df['Assortment'].map(T2)

# ...and to its contractual minimum price
T3 = Price_Max_Min.set_index('Asso')['Min'].to_dict()
df['Min_price'] = df['Assortment'].map(T3)

df.head()

It is ineffective, and frankly ridiculous, to make a scandal when the maximum price is exceeded by just a few percent. First we add columns that calculate the percentage deviation from the indicated values.

# percentage by which the price exceeds the agreed max / falls below the min
# (illustrative column names; the originals began with a '%' sign and were truncated)
df['pct_over_max'] = (df['Purchase price'] - df['Max_price']) / df['Max_price'] * 100
df['pct_under_min'] = (df['Min_price'] - df['Purchase price']) / df['Min_price'] * 100
df.head()

A good manager is a lazy manager: we have the computer find all cases where the max price was exceeded by over 15%.

df['Warning!'] = np.where(df['pct_over_max'] > 15, 'Warning! High exceed! ', '')
df.head()

Now we have to catch all significant price exceedances in the entire monthly transaction report. In my previous work such a report could contain over three hundred thousand operations a month, and each transaction had hundreds of records of information to compare; Excel was unable to work effectively in such an environment. Let's look at 5 random exceedances from our transactions.

df[df['Warning!']=='Warning! High exceed! '].sample(5)

If we want to catch something, we should start from the top. We display the ten biggest exceedances of the year.

kot = df[df['Warning!']=='Warning! High exceed! '].sort_values('pct_over_max', ascending=False)
kot.nlargest(10, 'pct_over_max')

Now we display all exceedances from September.

df[(df['Warning!']=='Warning! High exceed! ')&(df['Month']==9)].sort_values('pct_over_max', ascending=False)

In the next publication I will show the second Big Data tool, which helped me survive as a Data Scientist in a difficult environment.
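As a rough preview of that second tool, a minimal sketch of the density idea, assuming scipy is available and that prices within one assortment are roughly normal (column names as in the tables above):

from scipy import stats

# score each transaction by how improbable its price is under the normal
# distribution fitted to its own assortment's history
grp = df.groupby('Assortment')['Purchase price']
df['z'] = (df['Purchase price'] - grp.transform('mean')) / grp.transform('std')
df['density'] = stats.norm.pdf(df['z'])   # low density = suspicious

suspects = df[df['z'].abs() > 3]          # the classic 3-sigma rule
suspects.head()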
