machine learning - THE DATA SCIENCE LIBRARY

Estimation of the result of the empirical research with machine learning tools (part 1)

admin — Sat, 02 Mar 2019 19:38:00 +0000

Part one: preliminary graphical analysis of the dependencies between coefficients

Machine learning tools

Predictive and classification models from the field of machine learning can significantly reduce the cost of laboratory verification.

The cost of empirical verification counts toward the technical cost of production. In the production of some chemically active substances, an empirical laboratory classification is needed to assign the product to a quality class.

Such testing can turn out to be very expensive. For short production runs, the cost of this classification can make the whole production unprofitable.

Machine learning tools can help here by replacing expensive laboratory investigation with a model-based judgment.

An effective prediction model can reduce the need for costly empirical testing to a reasonable minimum.

Manual classification would then be done only in special situations where the model is ineffective, or as a random check of the process.

Case study: laboratory classification of the active chemical substance Poliaxid

We will now follow the process of building a machine learning model based on classification with the Random Forest method. A chemical plant produces small amounts of an expensive chemical substance named Poliaxid. This substance must meet very rigorous quality requirements, so every batch has to pass a special laboratory verification. These empirical trials are expensive and time-consuming, and their cost significantly affects the overall cost of production. The Poliaxid production process is monitored by many gauges. A computer records eleven variables, such as trace contents of some chemical substances, acidity and density of the substance. It has been observed that the levels of some of the recorded coefficients are related to the result of the final quality classification. This cause-and-effect relationship leads to the conclusion that a classification model explaining the overall process can be built. In this case study we use a data set that can be downloaded from this address: source

The data set contains the results of 1593 trials and, for each trial, the eleven coefficients recorded during the process.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']
df.head(5)

The last column, “quality class”, holds the results of the laboratory classification.

Classes 0 and 1 mean the best quality of the substance. Classes 2, 3 and 4 mean lower quality.

Before we start building the machine learning model we ought to look at the data. We do this with matrix plots. These plots show which coefficients are good predictors and display the overall dependencies between the exogenous variables and the endogenous one.

Graphical analysis of the dependencies between coefficients

A graphical overview should precede the construction of the model.

It tells us whether building a model is feasible at all.

First we divide the results in the “quality class” column into two categories: 'First' and 'Second'.

df['Qual_G'] = df['quality class'].apply(lambda x: 'First' if x < 2 else 'Second')
df.sample(3)

A new column, "Qual_G", appears at the end of the table.
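The binarisation rule can be checked on a few hand-made values. The sketch below uses a hypothetical list of class labels rather than the Poliaxid file, and the helper name `to_group` is an assumption made for illustration only.

```python
# Hypothetical laboratory classes (0-4); these values are not from the Poliaxid file.
quality_classes = [0, 1, 2, 3, 4, 1]


def to_group(quality_class):
    """Map classes 0 and 1 to 'First' and every other class to 'Second'."""
    return 'First' if quality_class < 2 else 'Second'


groups = [to_group(q) for q in quality_classes]
print(groups)
```

The same rule is what the lambda passed to `apply` expresses for every row of the data frame.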

Now we create a vector of correlations between the independent coefficients and the result in the 'quality class' column.

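# Qual_G holds text labels, so the correlation is restricted to numeric columns
CORREL = df.select_dtypes(include='number').corr().sort_values('quality class')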
CORREL['quality class']

The correlation vector shows which exogenous factors have a significant influence on the result of the empirical classification.
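To make the meaning of such a correlation vector concrete, here is a dependency-free sketch of the Pearson coefficient computed by hand and sorted in ascending order. The column names and numbers are hypothetical and serve only as an illustration; they are not the Poliaxid measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predictors and a hypothetical quality class column.
columns = {
    'factor_up': [1.0, 2.0, 3.0, 4.0, 5.0],
    'factor_down': [5.0, 4.0, 3.0, 2.0, 1.0],
    'factor_weak': [2.0, 1.0, 2.0, 1.0, 2.0],
}
target = [0, 1, 2, 3, 4]

correlations = {name: pearson(values, target) for name, values in columns.items()}
# Sort from the most negative to the most positive correlation.
ranked = sorted(correlations.items(), key=lambda item: item[1])
for name, r in ranked:
    print(name, round(r, 3))
```

Predictors at both ends of the ranking (strongly negative or strongly positive) are the interesting ones; those near zero carry little linear information about the target.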

We choose the most effective predictors among the eleven variables and put them into a matrix of pairwise plots.

The matrix plot uses two colors. Blue dots mean first quality. Thanks to this, all the dependencies are clearly displayed.

import seaborn as sns

sns.pairplot(data=df[['factorB', 'citric catoda','sulfur in nodinol', 'noracid', 'lacapon','Qual_G']], hue='Qual_G', dropna=True)

The matrix clearly displays patterns of dependencies between the variables. It is easy to see that some of the coefficients have a significant impact on classification into the first or second quality group.

A dichotomous division is good for displaying dependencies. Let's see what happens when we use the division into 5 quality classes made by the laboratory. We took only the two most effective predictors. Despite this, the plot is illegible.

In the next part of this article we use machine learning tools to make the model-based classification.

Next part:

Estimation of the result of the empirical research with machine learning tools (part 2)

The article Estimation of the result of the empirical research with machine learning tools (part 1) comes from THE DATA SCIENCE LIBRARY.

Analyzing of the incidence of diabetes. Random Forest method

admin — Wed, 05 Sep 2018 19:24:00 +0000

Application of Machine Learning in clinical trials

Our aim is to build a machine learning model that predicts the incidence of diabetes. The data set we use in our investigation, together with its description, can be found here: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

We excluded from the model the column containing the level of glucose in the blood. This exogenous variable was too strongly correlated with the endogenous (result) variable. One very good estimator can dominate the weaker ones.

Moreover, the glucose variable brings nothing to the model. A high level of glucose indicates the presence of diabetes; it is not a factor that causes the illness.

We load the required libraries and then display the first 5 rows of the data.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'Outcome'])
df.head(5)

As we can easily see, the data set has inaccuracies. For example, a patient cannot have skin zero millimeters thick. For now we leave everything as it is and do not correct it.

X = df.drop('Outcome', axis=1) 
y = df['Outcome']

Preparing the data set for modeling

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

test_size

Parameter 'test_size': we assign 80% of the observations to the training set and 20% to the test set.

Parameter 'random_state' makes the split reproducible. It doesn't matter what number we put there.

If this parameter is absent from the code, the algorithm draws a different random split on every run. That is bad for the reproducibility of the model.

stratify

Parameter 'stratify=y' makes the class structure of the training and test samples the same as in the whole data set. Thanks to this parameter, the proportion of healthy and ill patients is the same in each sample as in the population.
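The effect of a fixed seed and of stratification can be seen in a small standard-library sketch. This is not scikit-learn's implementation; the function name `stratified_split` and the 70/30 label list are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical outcome column: 70 healthy (0) and 30 ill (1) patients.
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.20, seed=123)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

The test part holds 14 healthy and 6 ill patients, the same 70/30 proportion as the full list, and calling the function again with the same seed returns exactly the same indices.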

Pipeline as a combination of transforming and estimating

A pipeline is an object that chains data standardization and estimation into one process.

preprocessing.StandardScaler

The transformer 'preprocessing.StandardScaler()' standardizes each variable so that its mean is 0 and its standard deviation is 1. It rescales the data but does not change the shape of the distribution.
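What standardization does to a single column can be sketched without any library. The values below are hypothetical, and the helper name `standardize` is an assumption; the formula (subtract the mean, divide by the population standard deviation) is the usual definition of a z-score.

```python
import math


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


bmi = [20.0, 25.0, 30.0, 45.0]      # hypothetical measurements
z = standardize(bmi)

z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
print([round(v, 3) for v in z], round(z_mean, 6), round(z_std, 6))
```

The ordering and relative spacing of the values stay the same, which is why a skewed variable remains skewed after scaling.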

RandomForestRegressor

The estimator 'RandomForestRegressor(n_estimators=100)' maps the independent variables to the dependent variable.

The number of estimators is set to 100, which means one hundred decision trees. We can use more trees, but this may slow down the computation without significantly improving predictive ability.

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))

Choosing the hyperparameters

Now we define the hyperparameters used to tune the model. In this part of the code we declare which settings the search will try.

hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

max_features

The parameter 'max_features' specifies the number of features considered when looking for the best split.

We try three settings: all features, 'sqrt' and 'log2'. The search automatically chooses the best one. Older versions of scikit-learn accepted 'auto' for the first setting; it has since been removed, and for a regressor the value 1.0 has the same meaning.

max_features: all features (formerly "auto"). No feature subsetting is performed, so the "random forest" is actually a bagged ensemble of ordinary regression trees. Each tree may use any feature at every split; no restrictions are placed on individual trees.

max_features: "sqrt". This option considers the square root of the total number of features at each split. For example, if the total number of variables is 100, only 10 of them are considered at a single split.

max_features: "log2". This option considers the base-2 logarithm of the total number of features at each split.

max_depth sets the maximum depth of a decision tree.
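How many features each setting leaves available at a split can be worked out by hand. The helper below is a sketch that assumes the count is the integer part of the square root or base-2 logarithm, with at least one feature; the function name `features_per_split` is hypothetical and this is not library code.

```python
import math


def features_per_split(setting, n_features):
    """Number of candidate features per split for a given max_features setting."""
    if setting == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if setting == 'log2':
        return max(1, int(math.log2(n_features)))
    # Any other value is treated here as "use all features".
    return n_features


for n in (6, 100):
    print(n, {s: features_per_split(s, n) for s in ('all', 'sqrt', 'log2')})
```

With only six predictors, as in the diabetes data used here, 'sqrt' and 'log2' both give two features per split, so the two settings are expected to behave alike.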

The next lines of code run the tuning of the prediction algorithm. The 'fit' method carries out the search.

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

The code above is responsible for tuning the model. From all the combinations of settings it chooses the best variant.

There are two methods of searching for the best hyperparameters:

Grid search
Random search

GridSearchCV searches the hyperparameter grid systematically rather than randomly.
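"Systematically" means that every combination in the grid is tried. The sketch below only enumerates the combinations with `itertools.product`; it does not train anything. The grid mirrors the one in this article, with 1.0 standing for "all features".

```python
from itertools import product

hyperparameters = {
    'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
}
cv_folds = 10

names = list(hyperparameters)
# Every combination of one value per hyperparameter.
candidates = [dict(zip(names, values)) for values in product(*hyperparameters.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
print(candidates[0])
```

Three values times four values give 12 candidates, and 10-fold cross-validation trains each of them ten times, so the search costs 120 model fits plus one final refit of the best candidate, which GridSearchCV performs by default.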

Let's briefly recap

With the pipeline we declared the data transformation method, StandardScaler(), and the estimation method, RandomForestRegressor().

In the next step we specified the hyperparameters: max_features and max_depth.

The result depends on how well the classification works on the data set.

It is time to check how well our model explains reality!

Our aim was to build a model based on the values of variables from medical examinations.

We wanted to find which variables are good or the best predictors. Our model returns 0 when the patient is healthy and 1 when the patient has the disease.

The code below shows which hyperparameter values the search chose as the best for the Random Forest model.

import pprint
pparam = pprint.PrettyPrinter(indent=2)
pparam.pprint(clf.best_params_)

Our model is a regressor, so it does not return binary answers. We can change that with a short piece of code that rounds the predictions to 0 or 1.

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)
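Rounding is equivalent to applying a threshold of 0.5 to the regression output. The sketch below makes the threshold explicit on hypothetical scores; note that a score of exactly 0.5 ends up in class 0, which matches the round-half-to-even rule used by both Python's `round` and `np.round`.

```python
# Hypothetical regressor outputs between 0 and 1.
scores = [0.12, 0.49, 0.5, 0.51, 0.97]
threshold = 0.5

# Explicit rule: predict class 1 only when the score exceeds the threshold.
labels = [1 if s > threshold else 0 for s in scores]
print(labels)
```

Writing the threshold out makes it easy to move it later, for example to favor recall over precision.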

Model evaluation with the confusion matrix

To evaluate our model we can use the confusion matrix.

The matrix shows how good the model is. It compares the true answers from the test set with the answers predicted by the model.

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

The matrix has two rows and two columns because the model gives binary answers.
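The counting behind a 2x2 confusion matrix can be sketched in a few lines. The labels below are hypothetical; rows stand for the true class and columns for the predicted class, which is the layout scikit-learn's `confusion_matrix` uses.

```python
def confusion_counts(y_true, y_pred):
    """Count (true class, predicted class) pairs for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)
```

The diagonal holds the correct answers (3 and 2 here); the off-diagonal cells hold the two kinds of mistakes.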

How to interpret the confusion matrix?

The matrix shows how many times the model gave the right answer and how many times it made a mistake.

Imagine the following scheme of a confusion matrix, with fields A, B, C and D.

Numbers in the white fields are correct answers; numbers in the black fields are mistakes.

Intuitively, we rate more highly the models that have significantly bigger values in the white fields than in the black ones.

We can use special indicators to understand how good the model is.

Confusion matrix indicators

Accuracy (ACC) measures how often the classification is correct. It is calculated as the sum of the numbers in the white fields (A and D) divided by the sum of all the numbers in the matrix. The higher the value, the better.

ACC = (A + D) / (A + B + C + D)

(87 + 20) / (87 + 20 + 13 + 34) ≈ 0.69, i.e. 69%

Precision, or positive predictive value (PPV), measures how reliable the model's positive predictions are. It is calculated as: PPV = A / (A + B).

87 / (87 + 13) = 0.87, i.e. 87%

Sensitivity, recall, hit rate, or true positive rate (TPR): how many of the patients who actually have the illness were detected by the model as ill? It is calculated as: TPR = A / (A + C).

87 / (87 + 34) ≈ 0.72, i.e. 72%

F-Score

It is difficult to compare a model with low precision and high recall against one with high precision and low recall.

The F-Score combines recall and precision in a single number.

F-Score = (2 * Precision * Recall) / (Precision + Recall) = 2A / (2A + B + C)

2 * 87 / (2 * 87 + 34 + 13) ≈ 0.79, i.e. 79%
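The four indicators can be recomputed directly from the four counts used in the worked examples above. The sketch keeps the article's letters A, B, C and D and needs no library.

```python
# Counts taken from the worked examples in the text.
A, B, C, D = 87, 13, 34, 20

accuracy = (A + D) / (A + B + C + D)
precision = A / (A + B)
recall = A / (A + C)
f_score = 2 * precision * recall / (precision + recall)

print("Accuracy: ", round(accuracy, 2))
print("Precision:", round(precision, 2))
print("Recall:   ", round(recall, 2))
print("F-Score:  ", round(f_score, 2))
```

The F-Score computed from precision and recall agrees with the shortcut 2A / (2A + B + C), which is a quick way to check the arithmetic.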

All these indicators can be obtained with the code below:

print(classification_report(y_test, y_pred))

Finally, we can save the model to disk.

import joblib

joblib.dump(clf, 'c:/1/rf_abc.pkl')

To load the model again, use the following command.

clf2 = joblib.load('c:/1/rf_abc.pkl')

Entire code

import pandas as pd
import numpy as np

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'Outcome'])
df.head(5)

# select the explanatory and the result data
X = df.drop('Outcome', axis=1) #<-- all columns except Outcome are the explanatory variables X
y = df['Outcome']              #<-- Outcome is the explained variable y


#df[['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']] = df[['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']].replace(0,np.nan)
#df = df.dropna(how='any')

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))

# 1.0 means all features; older scikit-learn versions spelled this 'auto'
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

print(classification_report(y_test, y_pred)) 

print("Accuracy: ",np.round(metrics.accuracy_score(y_test, y_pred), decimals=2))
print("Precision:",np.round(metrics.precision_score(y_test, y_pred), decimals=2))
print("Recall:   ",np.round(metrics.recall_score(y_test, y_pred), decimals=2))

The article Analyzing of the incidence of diabetes. Random Forest method comes from THE DATA SCIENCE LIBRARY.

Forecasting the output power of a combined cycle power plant using a linear regression model (part 1)

admin — Wed, 05 Sep 2018 19:24:00 +0000

Today we forecast the output power of a power plant using a linear regression model.

Linear regression

Classical linear regression models can be a very effective tool for optimizing production processes. Simple and multiple linear regression are built on the method of least squares, described in 1805 and popularized by the French mathematician Adrien-Marie Legendre.

The labor involved and the high level of expertise required to model processes were the main reasons this optimization method was not widely used.

The arrival of personal computers, the rapid growth of their performance and the spread of programming in R and Python have increased interest in using complex econometric tools.

In this publication I would like to present an example of a multiple regression model, using a data set prepared by researchers from Namık Kemal University.

(Pınar Tüfekci, Çorlu Faculty of Engineering, Namık Kemal University, TR-59860 Çorlu, Tekirdağ, Turkey)

Data source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Form of the article

This part of the publication and the next one present the process of building linear regression models. The author decided not to present this process as mathematical formulas, considering it more practical to show how to use modern Python libraries. All the tools used here are free and widely available on the Internet. This publication mainly uses the statsmodels library, which is built on top of the Python packages NumPy and SciPy. All operations were carried out in a Jupyter notebook, which is part of the free Anaconda distribution. A reader who enters the code given in the publication will obtain the same results, which may encourage a deeper interest in the tools presented here.

Data overview

The data set contains 9568 measurements collected over 6 years (2006-2011) at a combined cycle power plant. During the measurements the plant was set to work at full load.

The following variables were measured, as hourly averages:

ambient temperature (T),
ambient pressure (AP),
relative humidity (RH),
exhaust vacuum (V).

These independent variables are used to predict the net hourly electrical energy output (EP).

A combined cycle power plant (CCPP) consists of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, electricity is generated by gas and steam turbines that are combined in one cycle, with energy transferred from one turbine to the other. While the measured vacuum affects the steam turbine, the other three ambient variables affect the performance of the gas turbines (GT).

Forecasting with a linear regression model

We load the data source and the required Python libraries. The calculations were carried out in a Jupyter notebook.

import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
#from sklearn.cross_validation import train_test_split
import math

## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')
df.sample(5)

The measurements form a table whose columns represent the independent variables. The last column, labeled PE, represents the dependent (result) variable.

Below we check the size of the data set and the data format.

df.shape

The data consist of 9,568 rows in five columns.

(9568, 5)

df.dtypes

The data are numeric.

For better readability we rename the columns.

df.columns

df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
df.sample(5)

The column names have been changed.

Basic assumptions of the linear regression model

Failing to meet the basic assumptions of regression can produce a model that does not reflect the true relationships between the independent variables and the dependent variable.

The relationship between the predictors and the result variable must be linear.
There must be no strong correlation between the independent variables.
The residuals should be normally distributed. The individual residuals of the model should be similar to one another as a group and follow a normal distribution.
Homogeneity of variance: the variance of the error should be constant.
Independence: the errors associated with one observation must not be correlated with the errors of other observations.

Analysis of linear relationships between the dependent variable and the independent variables

CORREL = df.corr().sort_values('Energy_output')
CORREL['Energy_output']

The correlation test revealed a very strong negative correlation of Temperature and Exhaust Vacuum with the amount of electricity produced.

Ranges of the independent variables:

The list gives the minimum and maximum values of the exogenous variables:

Temperature (T) in the range from 1.81 °C to 37.11 °C,
Ambient Pressure (AP) in the range from 992.89 to 1033.30 millibar,
Relative Humidity (RH) in the range from 25.56%
Exhaust Vacuum (V) in the range from 25.36 to 81.56 cm Hg

The data come from sensors located around the plant, which record the ambient variables every second. The variables have not been normalized (standardized).

The dependent (result) variable is the net hourly electrical energy output (EP), recorded in the range from 420.26 to 495.76 MW.

As mentioned, there is a very strong negative correlation between the result variable, electrical energy output (EP), and both temperature and exhaust vacuum.

Simple linear regression model

lm = smf.ols(formula = 'Energy_output ~ Temperature', data = df).fit()
lm.summary()

The simple regression model based on temperature shows very good predictive properties.
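For a single predictor, ordinary least squares has a closed-form solution, which the sketch below computes by hand together with r². The temperature and output values are hypothetical round numbers, not rows from the power plant file, and the function name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept, slope and r-squared for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical measurements: output falls as temperature rises.
temperature = [5.0, 10.0, 15.0, 20.0, 25.0]
energy_output = [487.0, 477.0, 466.0, 455.0, 445.0]

intercept, slope, r2 = fit_line(temperature, energy_output)
print(round(intercept, 2), round(slope, 2), round(r2, 4))
```

A negative slope with r² close to 1 is the pattern a strong negative correlation between temperature and output would produce; the summary table of a fitted statsmodels model reports the same three quantities.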

plt.figure()
plt.scatter(df.Temperature, df.Energy_output, c = 'grey')
plt.plot(df.Temperature, lm.params['Intercept'] + lm.params['Temperature'] * df.Temperature, c = 'r')
plt.xlabel('Temperature')
plt.ylabel('Energy_output')
plt.title("Linear Regression Plot")

The r² value reveals remarkably good predictive properties of the model. The plot shows how well our model fits the data.

Multiple linear regression model

lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
print (lm.summary())

The multiple linear regression model shows excellent predictive ability. That is exactly why its results raise concern.

Analysis of the data used to build the model

df.describe()

Analysis of the distribution of the result variable

plt.rcParams['figure.figsize'] = (5, 4)
sns.histplot(df['Energy_output'], kde=True, stat='density')  # replaces the deprecated sns.distplot

The distribution plot showed two peaks (a bimodal distribution).

Analysis of the correlation between the independent variables

plt.rcParams['figure.figsize'] = (5, 4)
sns.heatmap (df.corr (), cmap="YlGnBu")

As we remember, one of the conditions for a correct linear regression model is the absence of correlation between the exogenous variables.

It is clearly visible that the explanatory variables "Temperature" and "Exhaust_Vacuum" are very strongly positively correlated.

Below is another presentation of the same correlation matrix.

sns.heatmap (df.corr (), cmap="coolwarm", annot=True, cbar=False)
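One common way to put a number on correlation between predictors is the variance inflation factor (VIF). The article itself does not use it; it is introduced here only as an illustration. For just two predictors it reduces to 1 / (1 - r²), where r is their Pearson correlation. The data below are hypothetical, not values from the power plant file.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical, strongly correlated predictors.
temperature = [5.0, 10.0, 15.0, 20.0, 25.0]
exhaust_vacuum = [40.0, 42.0, 52.0, 58.0, 66.0]

r = pearson(temperature, exhaust_vacuum)
vif = vif_two_predictors(temperature, exhaust_vacuum)
print(round(r, 4), round(vif, 3))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign; it is a convention rather than a strict test.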

We check whether the measurements are complete

df.isnull().sum()

Graphical analysis of the influence of the independent variables on the result variable

To see how all the independent variables affect the dependent variable, I split the values of the result variable into two halves. A column with two states of energy production is added to the data table: 'małą moc' (low power) and 'duża moc' (high power), stored in the column 'moc' (power).

Ewa = ['małą moc', 'duża moc']

df['moc'] = pd.qcut(df['Energy_output'],2, labels=Ewa)
df.sample(2)
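Cutting a variable into two quantile bins amounts to splitting it at the median. The sketch below shows the idea on hypothetical output values with English stand-in labels; values equal to the median go to the lower bin, following the right-closed intervals that quantile binning in pandas uses.

```python
def median_split(values, labels):
    """Assign labels[0] to values at or below the median and labels[1] to the rest."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return [labels[0] if v <= median else labels[1] for v in values]


# Hypothetical energy output readings in MW.
energy_output = [430.0, 480.0, 445.0, 470.0, 455.0, 490.0]
power_state = median_split(energy_output, ['low power', 'high power'])
print(power_state)
```

Because the cut point is the median, both groups hold the same number of observations, which keeps the two colors in the pair plot balanced.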

We create the pair plot.

sns.pairplot(data=df[['Temperature' ,'Exhaust_Vacuum','Ambient_Pressure', 'Relative_Humidity', 'moc']], hue='moc', dropna=True, height=2)

The pair plot once again showed a strong correlation between Temperature (T) and Exhaust Vacuum (V).

sns.jointplot(x='Temperature', y='Exhaust_Vacuum', data=df)

The second part of this article verifies the basic assumptions of multiple linear regression.

The article Forecasting the output power of a combined cycle power plant using a linear regression model (part 1) comes from THE DATA SCIENCE LIBRARY.