Perfect Plots: Pie Plot
https://sigmaquality.pl/data-plots/perfect-plots_-pie-plot/

Feel free to read the code on GitHub

 

An old Chinese proverb says that one picture is worth a thousand words. One good plot can rescue an entire presentation; one poor picture can drown an otherwise good speech. After plenty of embarrassing meetings and boring presentations, I decided to improve my visualisation toolbox.

In [1]:
import pandas as pd

df1 = pd.read_csv('c:/11/freeFormResponses.csv', skiprows = 1)
In [2]:
headers = ['Duration (in seconds)', 'Gender', 'Gender2','Age','Country','Education', 'Major_undergraduate','Recent_role', 'Recent_role2', 'Industry','Industry2' ,'Years_of_experience', 'compensation$USD'] 
df = pd.read_csv('c:/11/multipleChoiceResponses.csv', usecols=[0,1,2,3,4,5,6,7,8,9,10,11,12], header=None, names=headers, skiprows=2)
df.head(4)
Out[2]:
  Duration (in seconds) Gender Gender2 Age Country Education Major_undergraduate Recent_role Recent_role2 Industry Industry2 Years_of_experience compensation$USD
0 710 Female -1 45-49 United States of America Doctoral degree Other Consultant -1 Other 0 NaN NaN
1 434 Male -1 30-34 Indonesia Bachelor’s degree Engineering (non-computer focused) Other 0 Manufacturing/Fabrication -1 5-10 10-20,000
2 718 Female -1 30-34 United States of America Master’s degree Computer science (software engineering, etc.) Data Scientist -1 I am a student -1 0-1 0-10,000
3 621 Male -1 35-39 United States of America Master’s degree Social sciences (anthropology, psychology, soc… Not employed -1 NaN -1 NaN NaN
In [3]:
df.drop(['Gender2','Recent_role2','Industry2'], axis=1, inplace=True)
 

Correcting data

Whenever we want to draw a plot, we first need to check and clean the data: in particular, inspect the unique values and eliminate the small amount of rubbish and NaN cells (missing data).

In [4]:
df.isnull().sum()
Out[4]:
Duration (in seconds)       0
Gender                      0
Age                         0
Country                     0
Education                 421
Major_undergraduate       912
Recent_role               959
Industry                 2174
Years_of_experience      2758
compensation$USD         3674
dtype: int64
In [5]:
df.dtypes
Out[5]:
Duration (in seconds)     int64
Gender                   object
Age                      object
Country                  object
Education                object
Major_undergraduate      object
Recent_role              object
Industry                 object
Years_of_experience      object
compensation$USD         object
dtype: object
 

It is very important to reduce the number of classes, or to merge similar groups, as long as this does not hurt the project.

In [6]:
df['Gender']=df['Gender'].replace('Prefer to self-describe', 'Prefer not to say')
In [7]:
df.Education.value_counts(dropna = False)
Out[7]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
Professional degree                                                    599
NaN                                                                    421
I prefer not to answer                                                 345
No formal education past high school                                   232
Name: Education, dtype: int64
 

We can assume that if somebody didn't answer, they didn't want to give the information: 'I prefer not to answer'.

In [8]:
import numpy as np

df['Education']=df['Education'].replace(np.NaN, 'I prefer not to answer')
In [9]:
df.Education.value_counts(dropna = False)
Out[9]:
Master’s degree                                                      10855
Bachelor’s degree                                                     7083
Doctoral degree                                                       3357
Some college/university study without earning a bachelor’s degree      967
I prefer not to answer                                                 766
Professional degree                                                    599
No formal education past high school                                   232
Name: Education, dtype: int64
In [10]:
df.Education.isnull().sum()
Out[10]:
0
In [11]:
df.Major_undergraduate.value_counts(dropna = False)
Out[11]:
Computer science (software engineering, etc.)                    9430
Engineering (non-computer focused)                               3705
Mathematics or statistics                                        2950
A business discipline (accounting, economics, finance, etc.)     1791
Physics or astronomy                                             1110
Information technology, networking, or system administration     1029
NaN                                                               912
Medical or life sciences (biology, chemistry, medicine, etc.)     871
Other                                                             770
Social sciences (anthropology, psychology, sociology, etc.)       554
Humanities (history, literature, philosophy, etc.)                269
Environmental science or geology                                  253
I never declared a major                                          128
Fine arts or performing arts                                       87
Name: Major_undergraduate, dtype: int64
 

I take NaN and 'Other' to mean that someone does not want to declare a specialisation, so I map both to 'I never declared a major'.

In [12]:
df['Major_undergraduate']=df['Major_undergraduate'].replace(np.NaN, 'I never declared a major')
df['Major_undergraduate']=df['Major_undergraduate'].replace('Other', 'I never declared a major')
In [13]:
import matplotlib.pyplot as plt
df.Major_undergraduate.value_counts(dropna = False, normalize=True).plot(kind='barh')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cea62cc0>
In [14]:
df.Recent_role.value_counts(dropna=False)
Out[14]:
Student                    5253
Data Scientist             4137
Software Engineer          3130
Data Analyst               1922
Other                      1322
Research Scientist         1189
NaN                         959
Not employed                842
Consultant                  785
Business Analyst            772
Data Engineer               737
Research Assistant          600
Manager                     590
Product/Project Manager     428
Chief Officer               360
Statistician                237
DBA/Database Engineer       145
Developer Advocate          117
Marketing Analyst           115
Salesperson                 102
Principal Investigator       97
Data Journalist              20
Name: Recent_role, dtype: int64
In [15]:
df['Recent_role']=df['Recent_role'].replace(np.NaN, 'Other')
In [16]:
Z1 = df.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count').sort_values('Male',ascending=False)
Z1
Out[16]:
Gender Female Male Prefer not to say
Major_undergraduate      
Computer science (software engineering, etc.) 1463 7837 130
Engineering (non-computer focused) 432 3223 50
Mathematics or statistics 660 2241 49
I never declared a major 297 1438 75
A business discipline (accounting, economics, finance, etc.) 334 1435 22
Physics or astronomy 119 968 23
Information technology, networking, or system administration 186 832 11
Medical or life sciences (biology, chemistry, medicine, etc.) 203 646 22
Social sciences (anthropology, psychology, sociology, etc.) 160 379 15
Environmental science or geology 57 190 6
Humanities (history, literature, philosophy, etc.) 74 185 10
Fine arts or performing arts 25 56 6
In [17]:
Z1.plot(kind='barh', legend=True, title='Data Scientists by Major undergraduate and Gender (Kaggle 2018)', figsize=(7, 4), color=('b','g','y'))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x260cef146d8>
In [18]:
Z2 = df.pivot_table(index=['Country'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False).nlargest(20,'Male')
Z2
Out[18]:
Gender Female Male Prefer not to say SUM
Country        
SUM 4010.0 19430.0 419.0 23859
India 657.0 3719.0 41.0 4417
United States of America 1082.0 3530.0 104.0 4716
China 267.0 1337.0 40.0 1644
Other 165.0 849.0 22.0 1036
Russia 113.0 750.0 16.0 879
Brazil 65.0 666.0 5.0 736
Germany 103.0 621.0 10.0 734
Japan 34.0 557.0 6.0 597
United Kingdom of Great Britain and Northern Ireland 131.0 554.0 17.0 702
France 104.0 494.0 6.0 604
Canada 123.0 475.0 6.0 604
Spain 75.0 406.0 4.0 485
Italy 47.0 303.0 5.0 355
Australia 51.0 272.0 7.0 330
Turkey 56.0 267.0 4.0 327
I do not wish to disclose my location 83.0 250.0 61.0 394
Poland 54.0 243.0 4.0 301
Netherlands 41.0 225.0 4.0 270
Ukraine 31.0 218.0 3.0 252
 

Poland in data

Because I am from Poland, the most interesting data for me is the information about my country, so I separate the Polish records from the original data.

In [19]:
PL= df[df.Country=='Poland']
In [20]:
Z3 = PL.pivot_table(index=['Major_undergraduate'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[20]:
Gender Female Male Prefer not to say SUM
Major_undergraduate        
SUM 54.0 243.0 4.0 301
Computer science (software engineering, etc.) 15.0 96.0 1.0 112
Mathematics or statistics 12.0 39.0 1.0 52
A business discipline (accounting, economics, finance, etc.) 9.0 24.0 1.0 34
Physics or astronomy 4.0 20.0 NaN 24
Engineering (non-computer focused) 3.0 18.0 1.0 22
I never declared a major 2.0 15.0 NaN 17
Information technology, networking, or system administration 3.0 11.0 NaN 14
Medical or life sciences (biology, chemistry, medicine, etc.) NaN 7.0 NaN 7
Social sciences (anthropology, psychology, sociology, etc.) 5.0 7.0 NaN 12
Humanities (history, literature, philosophy, etc.) NaN 4.0 NaN 4
Environmental science or geology 1.0 2.0 NaN 3
In [21]:
Z3 = PL.pivot_table(index=['Recent_role'], columns = 'Gender', values='Age',aggfunc='count', margins=True, margins_name='SUM').sort_values('Male',ascending=False)
Z3
Out[21]:
Gender Female Male Prefer not to say SUM
Recent_role        
SUM 54.0 243.0 4.0 301
Data Scientist 15.0 58.0 NaN 73
Software Engineer 4.0 46.0 NaN 50
Student 5.0 28.0 NaN 33
Other 5.0 21.0 NaN 26
Data Analyst 9.0 19.0 1.0 29
Research Scientist 2.0 15.0 NaN 17
Consultant 1.0 10.0 NaN 11
Business Analyst 2.0 9.0 1.0 12
Manager 1.0 6.0 NaN 7
Research Assistant 2.0 6.0 NaN 8
Data Engineer 2.0 5.0 1.0 8
Not employed 3.0 5.0 1.0 9
Chief Officer 1.0 4.0 NaN 5
Product/Project Manager 1.0 3.0 NaN 4
DBA/Database Engineer NaN 3.0 NaN 3
Statistician NaN 2.0 NaN 2
Data Journalist NaN 1.0 NaN 1
Principal Investigator NaN 1.0 NaN 1
Salesperson 1.0 1.0 NaN 2
 

Let's make a standard, quick pie plot

We get a banal, predictable visualization.

In [22]:
Z4 = PL.pivot_table(index=['Recent_role'], values='Age',aggfunc='count').sort_values('Age', ascending=False)

Z4.plot(kind='pie', subplots=True, legend=False, title="Data Scientists by Recent_role (Kaggle 2018)", figsize=(15,7), autopct='%1.1f%%')
Out[22]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000260CF010F60>],
      dtype=object)
In [23]:
Z5 = PL.pivot_table(index=['Major_undergraduate'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z5.head(10)
Out[23]:
  Age
Major_undergraduate  
Computer science (software engineering, etc.) 112
Mathematics or statistics 52
A business discipline (accounting, economics, finance, etc.) 34
Physics or astronomy 24
Engineering (non-computer focused) 22
I never declared a major 17
Information technology, networking, or system administration 14
Social sciences (anthropology, psychology, sociology, etc.) 12
Medical or life sciences (biology, chemistry, medicine, etc.) 7
Humanities (history, literature, philosophy, etc.) 4
 

Better Pie Plot with interesting colors

To begin with, we can change the colors and give better descriptions.

GSuite Text and Background Palette: https://yagisanatode.com/2019/08/06/google-apps-script-hexadecimal-color-codes-for-google-docs-sheets-and-slides-standart-palette/

In [24]:
import matplotlib.pyplot as plt

## figure size
plt.figure(figsize=(16,8))

## declare axes with an equal aspect ratio, so the pie stays circular
ax1 = plt.subplot(aspect='equal')

## set the colours
colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']

## the basic plotting call
Z5.plot(kind='pie', colors=colors, y='Age', ax=ax1, autopct='%1.1f%%')

# labels, titles, etc.
ax1.set_xlabel('Something to write', fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Something to write', fontsize=11, color='grey', alpha=0.8)
ax1.set_title('Major_undergraduate in Data Scientists (Kaggle 2018)', fontsize=18, color='grey', alpha=0.8)
ax1.set_facecolor('#d8dcd6')
 

The best Pie Plot

I came across the publication below and decided to build my pie plots this way.
https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [25]:
PPL=Z5.reset_index()
PPL.head(5)
Out[25]:
  Major_undergraduate Age
0 Computer science (software engineering, etc.) 112
1 Mathematics or statistics 52
2 A business discipline (accounting, economics, … 34
3 Physics or astronomy 24
4 Engineering (non-computer focused) 22
 

Now we pull the vectors of data from the pivot table.

In [26]:
PPL.reset_index()
labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()

fig1, ax1 = plt.subplots(figsize=(10,5))


ax1.pie(sizes, labels=labels, autopct='%1.1f%%')

ax1.axis('equal')  
plt.tight_layout()
plt.show()
 

Changing colors

In [27]:
# create the figure and axes (figsize in inches)
fig1, ax1 = plt.subplots(figsize=(10,5))

colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()
 

Changing the size and color of all fonts

textprops={'fontsize': 30, 'color': 'green'}

In [28]:
# create the figure and axes (figsize in inches)
fig1, ax1 = plt.subplots(figsize=(18,12))

colors = ['#e06666','#ea9999','#f4cccc','#ff0000','#434343']

ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', textprops={'fontsize': 30, 'color': 'green'})
# Equal aspect ratio ensures that pie is drawn as a circle

ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  # Remove the CWD from sys.path while we load stuff.
 

Changing the size and color of individual fonts

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
In [29]:
fig1, ax1 = plt.subplots(figsize=(15,12))

colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:13: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  del sys.path[0]
 

Changing the size and color of chosen categories

In [30]:
fig1, ax1 = plt.subplots(figsize=(6,6))

colors = ['#ff9999','#747574','#99ff99','#ffcc99','#f1c232']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('grey')

    
texts[0].set_fontsize(24)
texts[0].set_color('black')
texts[4].set_fontsize(33)
texts[4].set_color('green')
    
ax1.axis('equal')  
plt.tight_layout()
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:20: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
 

Making a bagel

In [31]:
fig1, ax1 = plt.subplots(figsize=(18,6))

colors = ['#a2c4c9','#b6d7a8','#747574','#99ff99','#ffcc99','#76a5af']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%',
                                    wedgeprops=dict(width=0.5))  # wedgeprops is an assumption: width < 1 hollows the pie into a ring
# Equal aspect ratio ensures that pie is drawn as a circle

for text in texts:
    text.set_color('darkred')
for autotext in autotexts:
    autotext.set_color('grey')
    
ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

Making a better bagel

In [32]:
fig1, ax1 = plt.subplots(figsize=(18,8))

colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')
    autotext.set_fontsize(22)

texts[0].set_fontsize(18)
texts[0].set_color('black')
    
#draw circle
centre_circle = plt.Circle((0,0),0.40,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)    
    
    
ax1.set_xlabel('Year 2018',  fontsize=15, color='darkred', alpha=1)
ax1.set_ylabel('Data Scientist', fontsize=11,  color='grey', alpha=0.8)
ax1.set_title('Data Scientist by profession',  fontsize=58, color='#d0e0e3', alpha=0.8)
ax1.set_facecolor('#d8dcd6')


ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

We bring in the gender variable

In [33]:
PL.columns
Out[33]:
Index(['Duration (in seconds)', 'Gender', 'Age', 'Country', 'Education',
       'Major_undergraduate', 'Recent_role', 'Industry', 'Years_of_experience',
       'compensation$USD'],
      dtype='object')
In [34]:
Z6 = PL.pivot_table(index=['Major_undergraduate','Gender'], values='Age',aggfunc='count').sort_values('Age', ascending=False)
Z6.head(10)
Out[34]:
    Age
Major_undergraduate Gender  
Computer science (software engineering, etc.) Male 96
Mathematics or statistics Male 39
A business discipline (accounting, economics, finance, etc.) Male 24
Physics or astronomy Male 20
Engineering (non-computer focused) Male 18
Computer science (software engineering, etc.) Female 15
I never declared a major Male 15
Mathematics or statistics Female 12
Information technology, networking, or system administration Male 11
A business discipline (accounting, economics, finance, etc.) Female 9
 

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [35]:
PLG=Z6.reset_index()
PLG.head(2)
Out[35]:
  Major_undergraduate Gender Age
0 Computer science (software engineering, etc.) Male 96
1 Mathematics or statistics Male 39
In [36]:
PLG.reset_index()
labels_gender = PLG['Gender'].to_list()
sizes_gender = PLG['Age'].to_list()
 

The double bagel

In [37]:
import matplotlib.pyplot as plt


colors_gender = ['#c2c2f0','#ffb3e6']
 

fig1, ax1 = plt.subplots(figsize=(18,6))

colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']

patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%',
                                    pctdistance=0.85)  # pctdistance is an assumption: it keeps the percentages on the visible outer ring

plt.pie(sizes_gender,colors=colors_gender,radius=0.75,startangle=0)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)




for text in texts:
    text.set_color('grey')
for autotext in autotexts:
    autotext.set_color('black')

    
#draw circle
centre_circle = plt.Circle((0,0),0.50,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)    
    
    
ax1.axis('equal')  
plt.tight_layout()

plt.show()
 

The double bagel is "one bridge too far". The plot is beautiful, but the two rings are not correlated with each other. To achieve an adequate connection, both vectors should come from one pivot table. At the moment I have no idea how to do it (groupby, query, pivot...); one possible approach is sketched below.
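One possible approach, a minimal sketch assuming the PL frame defined above: build both rings from a single pivot, and sort the inner (gender) vector by the same major order as the outer vector, so every outer wedge spans exactly its own inner wedges.

import matplotlib.pyplot as plt

# Both rings come from ONE pivot table, so the wedge fractions line up.
Z7 = (PL.pivot_table(index=['Major_undergraduate', 'Gender'],
                     values='Age', aggfunc='count')
        .reset_index())

# Outer ring: totals per major, largest first
outer = Z7.groupby('Major_undergraduate')['Age'].sum().sort_values(ascending=False)

# Inner ring: gender counts ordered by the SAME major order
order = {major: i for i, major in enumerate(outer.index)}
inner = Z7.assign(order=Z7['Major_undergraduate'].map(order)).sort_values(['order', 'Gender'])

fig, ax = plt.subplots(figsize=(10, 10))
ax.pie(outer, labels=outer.index, radius=1.0, autopct='%1.1f%%',
       wedgeprops=dict(width=0.3, edgecolor='white'))
ax.pie(inner['Age'], radius=0.7,
       wedgeprops=dict(width=0.3, edgecolor='white'))
ax.axis('equal')
plt.show()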

 

Trigger to create Pie Plot

Components needed to create the perfect pie plot: labels, sizes, colors.

To prepare the perfect pie plot, I first need to pull the vectors of data from the pivot table.

In [38]:
PPL=Z5.reset_index()
PPL.head(5)
PPL.reset_index()

labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()

colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
In [39]:
def PPieP(sizes,labels,colors):
    fig1, ax1 = plt.subplots(figsize=(18,8))

    patches, texts, autotexts = ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%')


    for text in texts:
        text.set_color('grey')
    for autotext in autotexts:
        autotext.set_color('black')
        autotext.set_fontsize(22)

    texts[0].set_fontsize(18)
    texts[0].set_color('black')
    
    #draw circle
    centre_circle = plt.Circle((0,0),0.40,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)    
    
    
    ax1.set_xlabel('Year 2018',  fontsize=15, color='darkred', alpha=1)
    ax1.set_ylabel('Data Scientist', fontsize=11,  color='grey', alpha=0.8)
    ax1.set_title('Data Scientist by profession',  fontsize=58, color='#d0e0e3', alpha=0.8)
    ax1.set_facecolor('#d8dcd6')


    ax1.axis('equal')  
    plt.tight_layout()

    plt.show()
In [40]:
# Variables to the trigger:

labels = PPL['Major_undergraduate'].to_list()
sizes = PPL['Age'].to_list()
#colors = ['#a2c4c9','#76a5af','#c9daf8','#a4c2f4', '#cfe2f3']
#colors = ['#ff0000','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#e6b8af','#b6d7a8','#e06666','#747574','#ffd966','#ffcc99','#ea9999']
#colors = ['#93c47d','#b6d7a8','#d9ead3','#d0e0e3','#a2c4c9','#76a5af']
#colors = ['#c27ba0','#d5a6bd','#ead1dc','#ffffff','#a64d79','#d9d2e9','#b4a7d6']
#colors = ['#cfe2f3','#9fc5e8','#6fa8dc']
colors = ['#d9ead3','#b6d7a8','#93c47d','#6aa84f']



# Trigger:

PPieP(sizes,labels,colors)
 

Good advice for presentations: prepare all your plots to one common standard.

As the man who built my house used to say: messy, but consistent!
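One way to enforce such a standard in matplotlib is to set the style once, at the top of the notebook, instead of styling every plot by hand. A minimal sketch; the values are illustrative, not a recommended palette:

import matplotlib.pyplot as plt

# One place to define the house style; every later plot inherits it.
plt.rcParams.update({
    'figure.figsize': (10, 5),
    'font.size': 12,
    'axes.titlesize': 16,
    'axes.labelcolor': 'grey',
    'text.color': 'darkred',
})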

Estimation of the result of the empirical research with machine learning tools (part 1)
https://sigmaquality.pl/uncategorized/estimation-of-the-result-of-the-empirical-research-with-machine-learning-tools/


Part one: preliminary graphical analysis of coefficient dependence

Machine learning tools

Thanks to predictive and classification models from the machine learning toolbox, it is possible to significantly decrease the cost of laboratory verification research.

The costs of empirical verification are counted into the technical cost of production. When producing certain active chemical substances, it is necessary to run an empirical laboratory classification that allocates the product to a separate quality class.

Such research can turn out to be very expensive. In the case of short production runs, the cost of classification can make the whole run unprofitable.

Machine learning tools can come to the rescue here, replacing expensive laboratory investigation with a theoretical judgment.

An effective prediction model can reduce the need for costly empirical research to a reasonable minimum.

Manual classification would then be used only in special situations where the model proves ineffective, or for random spot checks of the process.

Case study: laboratory classification of the active chemical substance Poliaxid

We will now follow the process of building a machine learning model based on classification with the Random Forest method. A chemical plant produces small amounts of an expensive substance named Poliaxid. The substance must meet very rigorous quality requirements: each charge has to pass a special laboratory verification. These empirical trials are expensive and long-lasting, and their cost significantly influences the overall cost of production. The Poliaxid production process is monitored by many gauges; a computer saves eleven variables, such as trace contents of certain chemical substances, acidity, and density. It has been noticed that the levels of some of the collected coefficients are related to the result of the final quality classification. This cause-and-effect relationship leads to the conclusion that it is possible to create a classification model explaining the overall process. In this case study we use a database that can be downloaded from this address: source

The database contains the results of 1,593 trials, with eleven coefficients saved during the process for each trial.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']   # drop the redundant row-number column
df.head(5)

In the last column, "quality class", we find the results of the laboratory classification.

Classes 0 and 1 mean the best quality of the substance; results 2, 3 and 4 mean worse quality.

Before we start building a machine learning model, we ought to look at the data. We do this with matrix plots, which show us which coefficients are good predictors and display the overall dependencies between the exogenous and endogenous variables.

Graphical analysis of coefficient dependence

Construction of the model should be preceded by a graphical overview.

In this way we learn whether a model is feasible at all.

First we divide the results from the "quality class" column into two categories: 'First' and 'Second'.

df['Qual_G'] = df['quality class'].apply(lambda x: 'First' if x < 2 else 'Second')
df.sample(3)

A new column, "Qual_G", appears at the end of the table.

Now we create a vector of correlations between the independent coefficients and the result factor in the 'quality class' column.

CORREL = df.corr().sort_values('quality class')
CORREL['quality class']

The correlation vector points to significant influences of the exogenous factors on the results of the empirical classification.

We choose the most effective predictors among all eleven variables and put them into the correlation matrix plot.

The matrix plot contains two colors; blue dots mark first quality. Thanks to this, all the dependencies are clearly displayed.

import seaborn as sns

sns.pairplot(data=df[['factorB', 'citric catoda','sulfur in nodinol', 'noracid', 'lacapon','Qual_G']], hue='Qual_G', dropna=True)

The matrix clearly displays the patterns of dependency between variables. It is easy to see that some coefficients have a significant impact on classification into the first or second quality class.

A dichotomous division is good for displaying dependencies. Let's see what happens when we use the laboratory's original division into 5 quality classes, taking only the two most effective predictors. Even then the plot is illegible; a sketch of how it could be drawn follows below.
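The 5-class plot itself is not shown in the notebook; a minimal sketch, assuming 'factorB' and 'lacapon' are the two strongest predictors (pick your own pair from the CORREL vector):

import seaborn as sns

# hue on the raw 5-class laboratory labels instead of the dichotomous Qual_G
sns.pairplot(data=df[['factorB', 'lacapon', 'quality class']],
             hue='quality class', dropna=True)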

In the next part of this article we use machine learning tools to make the theoretical classification.

Next part:

Estimation of the result of the empirical research with machine learning tools (part 2)

Why go from Excel to Python?
https://sigmaquality.pl/uncategorized/why-go-from-excel-to-python/


I observe that many analysts and data experts wonder about the question: why go from Excel to Python?

I understand these people very well. They have long experience in Excel: they know how to use its limited memory and calculation space rationally, and how to fill Excel's gaps with external applications and systems.

There are two real barriers to adopting Python in professional work in place of Excel: functionality and time.

 

In Python, we have to spend a long time learning actions that anyone can do in Excel easily, without any training. This is frustrating and feels unjust. Many easy functionalities are less obvious in Python (for example, calculating the mode).
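For the record, the mode is available in pandas too, though less discoverable than Excel's MODE(); a minimal example on made-up data:

import pandas as pd

s = pd.Series([1, 2, 2, 3, 2, 4])
s.mode()           # returns ALL modal values as a Series
s.mode().iloc[0]   # the single most frequent value, like Excel's MODE()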

The second barrier is lack of time. If we have to do something NOW, there is no time for experiments with uncertain solutions. Analysts never have time!

Why go from Excel to Python?

My professional experience with Python

The real problem appears in the face of giant databases: thousands of dimensions and tens of thousands of entities and products that need to be analyzed on a regular basis, when in a short time we have to provide answers to astronomical-scale problems.

We read a lot on the Internet about huge, complex solutions, about machine learning and BI dashboards. From my professional experience: for cosmic problems, a simple solution multiplied on a large scale works best. Some workplaces offer the conditions for scientific work; mine required monitoring a business model on a gigantic scale, and this was realized with simple Data Science tools.

Astronomical-scale problems

For the last four years I worked as a Data Scientist for a big producer in the wood-processing branch. I was involved in monitoring a huge system of raw material and product flows: thousands of raw-material indexes, 8 kinds of transport, our own fleet of 300 lorries, and half a thousand external lorries. Raw materials were accepted by four plants and six wood yards.

The system for sending products to customers was similarly huge and complicated. I tried to carry out all the processes in Excel sheets. There was a very efficient database, and I was able to portion the information out into Excel, but Excel had a huge problem with the efficiency and duration of the analysis. I tried to improve performance by increasing the memory in my laptop.

Many analysts don't want to get to know Python better and use it. An expert who has used Excel throughout his professional life, and knows many advanced tricks and complex applications, feels he has enough tools to solve every problem. When he meets real scale, he understands why to go from Excel to Python.

In the next publication I show the two main Data Science tools I used to control an astronomical-scale logistics model.

Two easy Big Data tools to keep control on the astronomical scale business
https://sigmaquality.pl/uncategorized/two-easy-big-data-tools-to-keep-control-on-the-astronomical-scale-business/


Which Big Data tools can keep control of an astronomical-scale business?

In the last publication I tried to convince you that it is sometimes better to abandon Excel, especially when facing a giant number of dimensions, thousands of variables, and huge scale. I found myself in such circumstances five years ago, when I started working for a big chipboard producer. My duty was monitoring very big logistics processes and detecting anomalies in them. Daily, I monitored more than a thousand operations of raw-material reception and goods dispatch in Poland. In such processes the most important thing is effective anomaly detection; when an anomaly appeared, the full machinery was launched to explain the situation.

I worked as a Data Scientist on other interesting projects, while at the same time the autonomous system I had built was detecting and reporting anomalies.

Big Data tools to keep control on the astronomical scale business

I used two simple algorithms: comparison of real events against standard values, and the probability density of the normal distribution.

These two directions of research, in a very basic form, were replicated across a large number of applications. The systems effectively detected anomalies and fraud. The system did not work in real time but ran whenever I reloaded the data; the reason was frequent corrections to documents made during an operation and shortly thereafter, to which a real-time system would react too quickly. A day later the process was stable and ready for testing.

Comparing real events to standard values

I will show how the first algorithm works on an extremely easy example. The same method can be applied to very complex processes, monitoring a huge number of variables in various configurations.

Please open this example database. Source here!

In the first column we see the date of raw-material reception; the second column is the kind of assortment; in the third we find the transaction price. In my practice I had many thousands of assortments and many more columns with other costs, dimensions, suppliers, categories, and delivery conditions. I created special columns with ratios combining these dimensions and costs. In this example I will show only the simple mechanism for doing it in Pandas.
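Those ratio columns are not shown here; as a one-line sketch, assuming a hypothetical 'Quantity' column alongside the price:

# 'Quantity' is a hypothetical column, for illustration only
df['Cost_per_unit'] = df['Purchase price'] / df['Quantity']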

We open the example database.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/Base1.csv', index_col=0, parse_dates=['Date of transaction'])
df.head()

We generate the maximum prices from the list of transactions.

df.pivot_table(index='Assortment', values='Purchase price', aggfunc={max})

This tells me where the biggest purchase cost appears for each assortment. It is a very easy function in Python, but not so easy in Excel, where you need an array formula.

An Excel array formula applied to a giant database needs really powerful hardware and a long calculation time; Python calculates it faster. Now we add an additional column with the month.

df['Month'] = df['Date of transaction'].dt.month
df.head()

Now we can display extreme values for each assortment in each month.

df.pivot_table(index=['Month','Assortment'], values='Purchase price', aggfunc={max, min, np.mean})

 

 

As usual, the controlling or purchasing department has special agreements with the raw-material suppliers. We get a maximum-price table for each assortment and/or each supplier. It is difficult to check a couple of thousand transactions with Excel lookup functions, which take a lot of time; it is faster in Pandas.

We open the price table for assortments. Source here!

Price_Max_Min = pd.read_excel('c:/2/tablePR.xlsx', index_col=0)
Price_Max_Min.head()

 

Now we add to our main transaction table columns with the max and min price from the control table for each assortment.

# map each assortment to its contractual maximum price
T2 = Price_Max_Min.set_index('Asso')['Max'].to_dict()
df['Max_price'] = df['Assortment'].map(T2)

# ...and to its contractual minimum price
T3 = Price_Max_Min.set_index('Asso')['Min'].to_dict()
df['Min_price'] = df['Assortment'].map(T3)

df.head()

It is ineffective, and frankly ridiculous, to make a scandal when the maximum price is exceeded by just a few percent. First we add columns that calculate the percentage deviation from the indicated values.

# percentage by which the price exceeds the agreed max / falls below the min
# (illustrative column names; the originals began with a '%' sign and were truncated)
df['pct_over_max'] = (df['Purchase price'] - df['Max_price']) / df['Max_price'] * 100
df['pct_under_min'] = (df['Min_price'] - df['Purchase price']) / df['Min_price'] * 100
df.head()

A good manager is a lazy manager: we have the computer find all cases where the max price was exceeded by over 15%.

df['Warning!'] = np.where(df['pct_over_max'] > 15, 'Warning! High exceed! ', '')
df.head()

Now we have to catch all significant price exceedances in the entire monthly transaction report. In my previous work such a report could contain over three hundred thousand operations a month, and each transaction had hundreds of records of information to compare; Excel was unable to work effectively in such an environment. Let's look at 5 random exceedances from our transactions.

df[df['Warning!']=='Warning! High exceed! '].sample(5)

If we want to catch something, we should start from the top. We display the ten biggest exceedances of the year.

kot = df[df['Warning!']=='Warning! High exceed! '].sort_values('pct_over_max', ascending=False)
kot.nlargest(10, 'pct_over_max')

Now we display all exceedances from September.

df[(df['Warning!']=='Warning! High exceed! ')&(df['Month']==9)].sort_values('pct_over_max', ascending=False)

In the next publication I will show the second Big Data tool, which helped me survive as a Data Scientist in a difficult environment.
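As a rough preview of that second tool, a minimal sketch of the density idea, assuming scipy is available and that prices within one assortment are roughly normal (column names as in the tables above):

from scipy import stats

# score each transaction by how improbable its price is under the normal
# distribution fitted to its own assortment's history
grp = df.groupby('Assortment')['Purchase price']
df['z'] = (df['Purchase price'] - grp.transform('mean')) / grp.transform('std')
df['density'] = stats.norm.pdf(df['z'])   # low density = suspicious

suspects = df[df['z'].abs() > 3]          # the classic 3-sigma rule
suspects.head()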
