Tutorial: Linear Regression – preliminary data preparation (#1/271120191024)
https://sigmaquality.pl/tensorflow-3/tutorial_-linear-regression-in-tensorflow-part_1/
Wojciech Moszczyński, 27 Nov 2019

Dataset: AirQualityUCI. Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality

 

Part 1. Preliminary data preparation

 
In [1]:
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
Out[1]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
 

Data Set Information:


The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of responses from field-deployed air quality chemical sensor devices. Ground-truth hourly averaged concentrations for CO, Non-Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drift is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.


 

Attribute Information:

Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non-Methanic Hydrocarbons concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity
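
Since Date (DD/MM/YYYY) and Time (HH.MM.SS) arrive as separate text columns, it can be convenient to parse them into a single datetime series for later time-series work. A minimal sketch of that step (not part of the original notebook, assuming df loaded as in cell [1]):

# Parse the two text columns into one datetime Series; the format string
# matches the DD/MM/YYYY and HH.MM.SS patterns listed above.
timestamps = pd.to_datetime(df['Date'].str.cat(df['Time'], sep=' '),
                            format='%d/%m/%Y %H.%M.%S',
                            errors='coerce')
timestamps.head(3)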


Tutorial: Supplementing data for further analysis

 

Step 1. Data completeness check

In [2]:
df.isnull().sum()
Out[2]:
Date              114
Time              114
CO(GT)            114
PT08.S1(CO)       114
NMHC(GT)          114
C6H6(GT)          114
PT08.S2(NMHC)     114
NOx(GT)           114
PT08.S3(NOx)      114
NO2(GT)           114
PT08.S4(NO2)      114
PT08.S5(O3)       114
T                 114
RH                114
AH                114
Unnamed: 15      9471
Unnamed: 16      9471
dtype: int64
 

There are a lot of missing values. In addition, we learned that the value -200 means no data. We’ll deal with this in a moment. We will now check the statistics of variables in the database.

In [3]:
df.agg(['min', 'max', 'mean', 'median'])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\numpy\lib\nanfunctions.py:1112: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]:
  PT08.S1(CO) NMHC(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) Unnamed: 15 Unnamed: 16
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 NaN NaN
max 2040.000000 1189.000000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 NaN NaN
mean 1048.990061 -159.090093 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 NaN NaN
median 1053.000000 -200.000000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 NaN NaN
In [4]:
df.shape
Out[4]:
(9471, 17)
 

We delete two empty columns.

In [5]:
del df['Unnamed: 15']
del df['Unnamed: 16']
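
For reference, the same cleanup can be written as a single call that drops every all-empty column; this is just an equivalent alternative to the two del statements above:

# Drop all columns that contain only NaN values
df = df.dropna(axis=1, how='all')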
 

Step 1: Preliminary analysis of data gaps

One more look at how many NaN cells there are.

In [6]:
df.isnull().sum()
Out[6]:
Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
dtype: int64
 

Now let's take a look at these empty records.

In [7]:
df[df['NMHC(GT)'].isnull()]
Out[7]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9358 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9359 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9360 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9361 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9466 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

114 rows × 15 columns

 

These rows are completely empty. The device was probably cut off from the power supply, so no sensor was recording.

In [8]:
df = df.dropna(how='all')
df.isnull().sum()
Out[8]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

We are looking for values of -200 because they mean there is no data. The -200 values appear in different formats (numeric and text, with dot or comma decimals), so the replacement has to be done in several ways.

In [9]:
import numpy as np

df = df.replace(-200,np.NaN)
df = df.replace('-200',np.NaN)
df = df.replace('-200.0',np.NaN)
df = df.replace('-200,0',np.NaN)
 

The value of -200 has been changed to NaN and we will see how many empty records there are now.

In [10]:
df.isnull().sum()
Out[10]:
Date                0
Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
 

Chart of missing data structure.

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()
 

The NMHC(GT) variable is by far the most incomplete, so we eliminate it from the analysis.
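
One quick way to quantify how incomplete each column is (a small check, not in the original notebook) is to express the missing counts as percentages; NMHC(GT) turns out to be roughly 90% missing:

# Share of missing values per column, in percent
(df.isnull().sum() / len(df) * 100).round(1).sort_values(ascending=False)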

In [12]:
del df['NMHC(GT)']
 

We display the records with missing data using the isna() function.

In [13]:
df1 = df[df.isna().any(axis=1)]
df1
Out[13]:
  Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9 11/03/2004 03.00.00 0,6 1010.0 1,7 561.0 NaN 1705.0 NaN 1235.0 501.0 10,3 60,2 0,7517
10 11/03/2004 04.00.00 NaN 1011.0 1,3 527.0 21.0 1818.0 34.0 1197.0 445.0 10,1 60,5 0,7465
33 12/03/2004 03.00.00 0,8 889.0 1,9 574.0 NaN 1680.0 NaN 1187.0 512.0 7,0 62,3 0,6261
34 12/03/2004 04.00.00 NaN 831.0 1,1 506.0 21.0 1893.0 32.0 1134.0 384.0 6,1 65,9 0,6248
39 12/03/2004 09.00.00 NaN 1545.0 22,1 1353.0 NaN 767.0 NaN 2058.0 1588.0 9,2 56,2 0,6561
9058 23/03/2005 04.00.00 NaN 993.0 2,3 604.0 85.0 848.0 65.0 1160.0 762.0 14,5 66,4 1,0919
9130 26/03/2005 04.00.00 NaN 1122.0 6,0 811.0 181.0 641.0 92.0 1336.0 1122.0 16,2 71,2 1,3013
9202 29/03/2005 04.00.00 NaN 883.0 1,3 530.0 63.0 997.0 46.0 1102.0 617.0 13,7 68,2 1,0611
9274 01/04/2005 04.00.00 NaN 818.0 0,8 473.0 47.0 1257.0 41.0 898.0 323.0 13,7 48,8 0,7606
9346 04/04/2005 04.00.00 NaN 864.0 0,8 478.0 52.0 1116.0 43.0 958.0 489.0 11,8 56,0 0,7743

2416 rows × 14 columns

 

Step 2: Check the level of direct correlation to fill in the data

CO(GT) is missing data every few measurements, so we have to check which variables it correlates with. I first check the data types so that a correlation can be computed.

In [14]:
df.dtypes
Out[14]:
Date              object
Time              object
CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
dtype: object
In [15]:
# df['CO(GT)'].astype(float)
 

ValueError: could not convert string to float: '2,6'

It turns out that converting the text to numeric format is not so straightforward – the problem is the decimal commas. We replace the commas with dots.

In [16]:
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
 

We change the format from object to float

In [17]:
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
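
As a side note, the decimal commas could also have been handled while loading the file, because read_csv accepts a decimal-separator argument. A sketch of that alternative, using the same path and separator as in cell [1]:

# Alternative: let read_csv treat ',' as the decimal separator at load time
df_alt = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';', decimal=',')
df_alt.dtypes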
In [18]:
df.dtypes
Out[18]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
 

We can now check the level of direct correlation.

In [19]:
df.corr()
Out[19]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [20]:
sns.set(style="ticks")

corr = df.corr()

mask = np.zeros_like(corr, dtype=bool)   # plain bool: np.bool is removed in newer NumPy versions
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1,annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8b43390>
 

Step 3. Filling the gaps in each variable based on other variables correlated with it

Filling gaps in the CO(GT) variable.

I check which variable CO(GT) is most strongly correlated with and fill its gaps based on that variable; where that is not possible, I fall back to the previous or next observed value.

In [21]:
df.corr()
Out[21]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [22]:
df.dtypes
Out[22]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
In [23]:
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
missing value in CO(GT):  1683
 

CO (GT) correlation with other variables.

In [24]:
CORREL = df.corr()
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
Out[24]:
  CO(GT)
PT08.S3(NOx) -0.703446
T 0.022109
AH 0.048556
RH 0.048890
PT08.S4(NO2) 0.630703
NO2(GT) 0.683343
NOx(GT) 0.795028
PT08.S5(O3) 0.854182
PT08.S1(CO) 0.879288
PT08.S2(NMHC) 0.915514
C6H6(GT) 0.931078
CO(GT) 1.000000
 

The strongest correlation of CO(GT) is with C6H6(GT), which is fairly complete. Based on this variable, I fill in the gaps in CO(GT).

In [25]:
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [26]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  383
In [27]:
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [28]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  370
 

Now I do a simple forward fill – carrying the last valid value forward.

In [29]:
df['CO(GT)'].fillna(method='ffill', inplace=True)   
In [30]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  0
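
The groupby-based fill used above may look unusual, so here is a tiny toy illustration (invented numbers, not air quality data) of what it does: rows are grouped by the values of the correlated column, and within each group the missing values of the target column are copied from neighbouring rows of the same group. transform() is used here as an equivalent, version-stable way of writing the same ffill/bfill step.

import pandas as pd
import numpy as np

toy = pd.DataFrame({'C6H6': [1.7, 1.7, 9.4, 9.4],
                    'CO':   [np.nan, 0.6, 2.0, np.nan]})
# Within each C6H6 group, missing CO values are taken from rows with the same C6H6 value
toy['CO'] = toy.groupby('C6H6')['CO'].transform(lambda x: x.ffill().bfill())
print(toy)   # the two NaN values become 0.6 and 2.0 respectively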
 

Filling gaps in the variable 'C6H6 (GT)’

In [31]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  366
In [32]:
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].apply(lambda x: x.ffill().bfill())
In [33]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NOx(GT)’

In [34]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  1639
In [35]:
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
Out[35]:
  NOx(GT)
PT08.S3(NOx) -0.655707
T -0.269683
AH -0.149323
RH 0.221032
PT08.S4(NO2) 0.233731
PT08.S2(NMHC) 0.704435
PT08.S1(CO) 0.713654
C6H6(GT) 0.718839
NO2(GT) 0.763111
PT08.S5(O3) 0.787046
CO(GT) 0.795028
NOx(GT) 1.000000
In [36]:
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].apply(lambda x: x.ffill().bfill())
In [37]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NO2(GT)'

In [38]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  1642
In [39]:
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
Out[39]:
  NO2(GT)
PT08.S3(NOx) -0.652083
AH -0.335022
T -0.186533
RH -0.091759
PT08.S4(NO2) 0.157360
C6H6(GT) 0.614474
PT08.S1(CO) 0.641529
PT08.S2(NMHC) 0.646245
CO(GT) 0.683343
PT08.S5(O3) 0.708128
NOx(GT) 0.763111
NO2(GT) 1.000000
In [40]:
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [41]:
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [42]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  0
In [43]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8fea080>
 

Finally, I fill in the records where the entire measuring device was not working.

In the heatmap these gaps appear as solid horizontal lines.

In [44]:
df.shape
Out[44]:
(9357, 14)
In [45]:
df.fillna(method='ffill', inplace=True)
In [46]:
df.shape
Out[46]:
(9357, 14)
In [47]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='Reds')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad95756d8>
In [48]:
df.isnull().sum()
Out[48]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

The data have been completed! In the next parts of this tutorial we will build a linear regression model in TensorFlow.

Let’s save the completed file to disk

df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
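
One small caveat with this round trip: to_csv writes the DataFrame index as an extra column, so df2 comes back with an additional 'Unnamed: 0' column. If that is unwanted, a variant is to skip the index when saving:

# Variant: save without the index so the reloaded file has exactly the same columns
df.to_csv('c:/TF/AirQ_filled.csv', index=False)
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)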


Estimation of the result of the empirical research with machine learning tools (part 1)
https://sigmaquality.pl/uncategorized/estimation-of-the-result-of-the-empirical-research-with-machine-learning-tools/
2 Mar 2019


Part one: preliminary graphical analysis of coefficient dependencies

Machine learning tools

Using predictive and classification models from the machine learning toolbox, it is possible to significantly decrease the cost of laboratory verification research.

The costs of empirical verification count towards the technical cost of production. In the production of some active chemical substances, laboratory classification is necessary to allocate the product to a separate quality class.

This research can turn out to be very expensive. In the case of short production runs, the cost of this classification can make the whole production unprofitable.

Machine learning tools can come to the rescue: they can replace an expensive laboratory investigation with a theoretical judgment.

Applying an effective prediction model can reduce the need for costly empirical research to a reasonable minimum.

Manual classification would then be carried out only in special situations where the model turns out to be ineffective, or when checking the process by random testing.

Case study: laboratory classification of the active chemical substance Poliaxid

We will now follow the process of building a machine learning model based on classification with the Random Forest method. A chemical plant produces small amounts of an expensive chemical substance named Poliaxid. This substance must meet very rigorous quality requirements, and each charge has to pass special laboratory verification. These empirical trials are expensive and long-lasting, and their cost significantly influences the overall cost of production. The Poliaxid production process is monitored by many gauges; the computer saves eleven variables, such as trace contents of certain chemical substances, acidity and density of the substance. It has been noticed that the levels of some of the collected coefficients are related to the result of the final quality classification. This cause-and-effect relationship leads to the conclusion that it is possible to create a classification model explaining the overall process. In this case study we use a dataset that can be downloaded from this address: source

The dataset contains the results of 1593 trials, with eleven coefficients recorded during the process for each trial.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']
df.head(5)

In the last column, named 'quality class', we find the results of the laboratory classification.

Classes 0 and 1 mean the best quality of the substance; results 2, 3 and 4 mean lower quality.

Before we start building the machine learning model, we ought to look at the data. We do this with matrix plots, which show us which coefficients are good predictors and display the overall dependencies between the exogenous and endogenous variables.

Graphical analysis of coefficient dependencies

The construction of the model should be preceded by a graphical overview.

In this way we learn whether a model is feasible at all.

First we divide the results from the 'quality class' column into two categories: 'First' and 'Second'.

df['Qual_G'] = df['quality class'].apply(lambda x: 'First' if x < 2 else 'Second')
df.sample(3)

At the end of the table a new column appears: 'Qual_G'.

Now we create a vector of correlations between the independent coefficients and the result factor in the 'quality class' column.

CORREL = df.corr().sort_values('quality class')
CORREL['quality class']

The correlation vector points to significant influences of the exogenous factors on the results of the empirical classification.

We choose the most effective predictors among all eleven variables and put them into the correlation matrix plot.

The matrix plot uses two colors; blue dots mean first quality. Thanks to this, all the dependencies are clearly displayed.

import seaborn as sns

sns.pairplot(data=df[['factorB', 'citric catoda','sulfur in nodinol', 'noracid', 'lacapon','Qual_G']], hue='Qual_G', dropna=True)

The matrix clearly displays the patterns of dependencies between variables. It is easy to see that some coefficients have a significant impact on classification into the first or second quality class.

The dichotomous division is good for displaying dependencies. Let's see what happens when we use the division into 5 quality classes, the ones assigned by the laboratory, taking only the two most effective predictors. Even so, the plot is illegible.
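
The original 5-class plot is not reproduced here, but a sketch of how it could be drawn, assuming the same column names as above and using two of the predictors from the earlier pairplot ('factorB' and 'citric catoda'), would be:

# Pairplot of two predictors coloured by the original five laboratory classes
sns.pairplot(data=df[['factorB', 'citric catoda', 'quality class']],
             hue='quality class', dropna=True)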

In the next part of this article we use machine learning tools to make the theoretical classification.

Next part:

Estimation of the result of the empirical research with machine learning tools (part 2)

Two easy Big Data tools to keep control on the astronomical scale business
https://sigmaquality.pl/uncategorized/two-easy-big-data-tools-to-keep-control-on-the-astronomical-scale-business/
5 Sep 2018


Which Big Data tools can be used to keep control of a business on an astronomical scale?

In the last publication I tried to convince you that sometimes it is better to abandon Excel, especially when we face a giant number of dimensions, thousands of variables and a huge scale. I found myself in such circumstances five years ago, when I started working for a big producer of chipboard. My duty was monitoring and detecting anomalies in very big logistic processes. Daily I monitored more than a thousand operations of receiving raw materials and dispatching goods across Poland. In such processes the most important thing is effective detection of anomalies: if an anomaly appeared, the whole complex machinery was launched to explain the situation.

I was working as a Data Scientist on very interesting projects. At the same time, an autonomous system I had built was detecting and reporting anomalies.

Big Data tools to keep control on the astronomical scale business

I used two simple algorithms: an algorithm comparing real events to standard values, and an algorithm based on the probability density of the normal distribution.

These two directions of research, in a very basic form, were replicated across a large number of applications. The systems worked effectively, detecting anomalies and fraud. The system did not run in real time, but only when I reloaded the data. The reason was frequent corrections in the documents created during an operation and shortly thereafter – a real-time system would have reacted too quickly. A day later the process was stable and ready for testing.
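
The second idea – scoring observations by the probability density of the normal distribution – is shown in the next publication, but its core fits in a few lines. A rough sketch on invented numbers (not data from the chipboard plant), flagging a price whose density under a fitted normal distribution is far below the peak:

import numpy as np
from scipy.stats import norm

prices = np.array([100.0, 102.0, 99.0, 101.0, 103.0, 180.0])   # invented example data
mu, sigma = prices.mean(), prices.std()
density = norm.pdf(prices, loc=mu, scale=sigma)

# Flag observations whose density is less than 10% of the density at the mean
threshold = 0.1 * norm.pdf(mu, loc=mu, scale=sigma)
print(prices[density < threshold])   # prints [180.]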

Algorithm comparing real events to standard values

I will show how the first algorithm works on an extremely simple example. The method can be applied to very complex processes, monitoring a huge number of variables in various configurations.

Please open this example database. Source here!

In the first column we see the date of reception of raw materials. The second column is the kind of assortment. In the third column we find the price of the transaction. In my practice I had many thousands of assortments and many more columns with other costs, dimensions, providers, categories and delivery conditions. I created special columns with ratios that combine these dimensions and costs. In this example I will show only the simple mechanism of how to do it in Pandas.

We open example database.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/Base1.csv', index_col=0, parse_dates=['Date of transaction'])
df.head()

We generate the maximum prices from the list of transactions.

df.pivot_table(index='Assortment', values='Purchase price', aggfunc={max})

Thanks to this I get information about the biggest purchase cost for each assortment. This is a very easy operation in Python, but not so easy in Excel, because it needs an array formula.

An Excel array formula applied to a giant database needs really powerful hardware and a long calculation time; Python calculates it faster. Now we add an additional column with the month.

df['Month'] = df['Date of transaction'].dt.month
df.head()

Now we can display extreme values for each assortment in each month.

df.pivot_table(index=['Month','Assortment'], values='Purchase price', aggfunc={max, min, np.mean})

 

 

As usual, the controlling or purchasing department has special agreements with the suppliers of raw materials. We get a table of maximum prices for each assortment and/or each supplier. It is difficult and time-consuming to check a couple of thousand transactions using Excel lookup functions; it is faster to do it in Pandas.

We open the price table for assortments. Source here!

Price_Max_Min = pd.read_excel('c:/2/tablePR.xlsx', index_col=0)
Price_Max_Min.head()

 

Now we add to our main transaction table columns with the max and min price from the control table for each assortment.

T2 = Price_Max_Min.set_index('Asso')['Max'].to_dict()
df['Max_price'] = df['Assortment'].map(T2)

T3 = Price_Max_Min.set_index('Asso')['Min'].to_dict()
df['Min_price'] = df['Assortment'].map(T3)

df.head()

It is ineffective and ridiculous to make a scandal when the maximum price is exceeded by just a few percent. First we add columns that calculate by what percentage the price deviates from the indicated limits (the exact column names were lost at the percent signs; 'Exceed_%' and 'Below_%' below are illustrative).

# 'Exceed_%' and 'Below_%' are illustrative column names
df['Exceed_%'] = (df['Purchase price'] - df['Max_price']) / df['Max_price'] * 100
df['Below_%'] = (df['Min_price'] - df['Purchase price']) / df['Min_price'] * 100
df.head()

A good manager is a lazy manager. We let the computer find all cases where the max price was exceeded by more than 15%.

# Flag built from the illustrative 'Exceed_%' column
df['Warning!'] = df['Exceed_%'].apply(lambda x: 'Warning! High exceed! ' if x > 15 else '')
df.head()

Now we have to catch all significant price exceedances in the entire monthly transaction report. In my previous job such a report could contain more than three hundred thousand operations a month, and each transaction had hundreds of records of information to compare; Excel was unable to work effectively in such an environment. Let's look at 5 random exceedances from our transactions.

df[df['Warning!']=='Warning! High exceed! '].sample(5)

If we want to catch something, we have to start from the top. We display the top ten biggest exceedances of the year.

# Sorting by the illustrative 'Exceed_%' column
kot = df[df['Warning!']=='Warning! High exceed! '].sort_values('Exceed_%', ascending=False)
kot.nlargest(10, 'Exceed_%')

Now we display all exceedances from September.

df[(df['Warning!']=='Warning! High exceed! ')&(df['Month']==9)].sort_values('Exceed_%', ascending=False)

In the next publication I will show the second Big Data tool, which helped me survive as a Data Scientist in a difficult environment.

Who had a chance to survive on the Titanic?
https://sigmaquality.pl/uncategorized/who-had-a-chance-to-survive-on-the-titanic/
4 Sep 2017


How to survive on the Titanic?

For a long time I thought that this sample was just another abstract dataset, like some railway statistic or other theoretical base. When I realised that it is a real list of Titanic passengers, any investigation of it became much more exciting.
Let's see once again what we can find in the data about this apocalyptic catastrophe.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/kaggletrain.csv')
df.head(3)

 

As a man, I am particularly interested in what chances of survival men had. This story is quite popular; we all remember the shouts: "Only women and children into the lifeboats!" Was that really true?

How many men were there?

len(df[df['Sex']=='male'])

577

Number of all passengers on the list.

len(df['Sex'])

What percent of passengers were men?

print("Men as the percent of passengers:  ",(len(df[df['Sex']=='male'])/len(df['Sex']))*100)

Did age have an influence on survival on the Titanic? I am thinking about the children. And yes – statistically, younger passengers survived more often!

df.groupby('Survived').mean()[['Age']]

We remember that at that time there were deep differences between social classes, and these differences also existed on board the Titanic. We know from the film that there were three decks, three classes. We remember how difficult the situation of the third-class passengers was during the catastrophe.

Let's check their chances of survival on the Titanic.

pd.crosstab(df.Pclass,df.Survived).plot(kind='bar')
plt.title('How many people survived in their classes ')
plt.xlabel('Class')
plt.ylabel('How many people ')
plt.savefig('Tytanik wg klas')

Is it true that mostly men were killed on the Titanic, while women in half-empty boats mindlessly stared at the sinking ship? How big a chance of survival did a poor male from third class have?

pd.crosstab(df.Sex,df.Survived).plot(kind='bar')
plt.title('Chance for survival according to the sex')
plt.xlabel('Class')
plt.ylabel('How many people ')
plt.savefig('Tytanic-class')

We now know that most victims were from third class, and the other plot showed that most fatalities were men.
So far we have no combined analysis containing information about sex and class at the same time.
Such an analysis will give us a chance to check what the situation of women was in each class.

Is the picture of rich women in half-empty boats against the backdrop of the monstrous ship really true?

pd.crosstab(index = [df.Survived], columns = [df.Sex, df.Pclass])

kot = pd.crosstab(index = [df.Survived], columns = [df.Sex, df.Pclass])

plt.figure(figsize=(10,2))
sns.heatmap(kot, cmap="BuPu", annot=True, cbar=True)

As we can see, only three of the almost one hundred first-class women died. The proportion for women in second class was similar. In third class, only half of the women survived.

pd.crosstab(index = [df.Survived], columns = [df.Sex, df.Pclass], normalize='columns').applymap('{:.2f}'.format)

Did men have a chance to survive on the Titanic?

Men from third class had only about a 14% chance of survival; men from first class had about 37%. Overall, most men from all classes were victims of this catastrophe.
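
These percentages can also be read off directly with a single groupby on the same kaggletrain.csv columns, as a quick cross-check of the crosstab above:

# Survival rate (share of Survived == 1) for each class/sex combination
df.groupby(['Pclass', 'Sex'])['Survived'].mean()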

 
