SVM - THE DATA SCIENCE LIBRARY

Estimation of the result of the empirical research with machine learning tools (part 2)

admin — Sat, 02 Mar 2019 19:38:00 +0000

Artificial intelligence in process of classification

In first part of this publication described the problem of additional classification of the quality classes. Every charge of Poliaxid have to go through rigoristic, expensive test to classification to the proper class of quality. [Source of data]

In last study we showed that some quantity factors associated with production have significant impact on the final quality of the substance.

Existing correlation lead to the conclusion that it is possible effective model of artificial intelligence is applied

It leads to the two conclusions:

Laborious method of classification could be replaced by theoretical model.
Persons who monitor production process could be informed by the model about probability of final quality of the substance.

Machine learning procedure allows us make try to build such model.

We open the base in Python.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
df.head(5)

We divide set of data in to the independent variables X and dependent variable y, the result of the process.

X = df.drop('quality class', axis=1) 
y = df['quality class']

Now we divide database into the training and test underset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

Pipeline merge standardization and estimation. We took as the estimation method of Random Forest.

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))

Hyperparameters of the random forest regression are declared.

hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

Tune model using cross-validation pipeline.

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

Random forest is the popular method using regression engine to obtain result.

Many scientists think that this is incorrect. Andrew Ng, in Machine Learning course at Coursers, explains why this is a bad idea - see his Lecture 6.1 - Logistic Regression | Classification at YouTubee. https://www.youtube.com/watch?v=-la3q9d7AKQ&t=0s&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=33

I, as an apprentice, will lead this model to the end. Without this rounding Confusion Matrix would be impossible to use because y from the test set has discrete form but predicted y would be in format continuous.

Here we have array with the result of prediction our model. You can see continuous form of result.

y_pred

Empirical result has discrete form.

y.value_counts()

We make rounding continuous data to the discrete form.

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)

Typical regression equation should not be used to the classification, but logistic regression seems to can make classification.

This is occasion to compare linear regression, logistic regression and typical tool used to classification Support Vector Machine.

Now we make evaluation of the model. We use confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

print(classification_report(y_test, y_pred))

Regression Random Forest with a temporary adaptation to discrete results seems to be good!

According to the f1-score ratio, model of artificial intelligence can good classify for these classes which have many occurrences.

For example 0 class has 13 occurrence and model can't judge this class. In opposite to the class 0 is class 1. There are 136 test values and model can properly judge classes in 78

In next part of this investigation we will test models of artificial intelligence intended to the make classification.

Next part:

Estimation of the result of the empirical research with machine learning tools. Classification with SVM Support Vector Machine (part 3)

Artykuł Estimation of the result of the empirical research with machine learning tools (part 2) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

Estimation of the result of the empirical research with machine learning tools. Classification with SVM Support Vector Machine (part 3)

admin — Sat, 02 Mar 2019 19:38:00 +0000

This time we use model which is designed to classification application.

The SVM Support Vector Machine algorithm is included among the learning machine estimators based on the classification and regression analysis processes.

The SVM classifier uses an optimization algorithm based on maximizing the margin of the hyperplan. The SVM algorithm is designed to conduct the best possible classification of results. Vectors separating hyperspace can be linear or (thanks to the SVC function) non-linear.

In our model, we will use the linear SVM Support Vector Machine classifier.

We download all needed libraries and database file.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.externals import joblib

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']
df.head(3)

Next we point the column y with the result of the calculations, and columns with independent variables X.

X=df.drop(['quality class'],axis=1)
Y=df['quality class']

Pipeline put together standardization and classification.

ew = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(ew)

parameteres = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score =

I can see which superparamiters was chosen by the grid as the best.

pparam=pprint.PrettyPrinter(indent=2)
print(grid.best_params_)

y_pred = grid.predict(X_test)

Classificator is evaluation by confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

The SVM Support Vector Machine is not better clasificator than not adapted to such a role Random Forest regression.

Artykuł Estimation of the result of the empirical research with machine learning tools. Classification with SVM Support Vector Machine (part 3) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

Zastosowanie estymatora liniowego Support Vector Machine (SVM) do tworzenia prognozy zapadalność na cukrzycę

admin — Wed, 05 Sep 2018 19:24:00 +0000

Support Vector Machine zaliczany jest do estymatorów machine learning uczenia z nadzorem w oparciu o procesy klasyfikacji i analizy regresji.

Klasyfikator SVM wykorzystuje algorytm optymalizacji w oparciu o maksymalizację marginesu hiperplanu. Algorytm SVM jest przeznaczony do prowadzenia możliwie najlepszej klasyfikacji wyników. Wektory oddzielające hiperprzestrzenie mogą mieć charakter liniowy lub (dzięki funkcji SVC) nieliniowy.

W naszym modelu zastosujemy liniowy klasyfikator SVM.

Ćwiczenie przeprowadzimy na próbce badań laboratoryjnych przeprowadzonych na 768 pacjentów. Dane do ćwiczenia z pełnym opisem można znaleźć tutaj: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

Na początek otworzymy bazę wraz z niezbędnymi bibliotekami.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.externals import joblib


df = pd.read_csv('c:/1/diabetes.csv', usecols=['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'Outcome'])
df.head(3)

Dokonujemy wyboru funkcji celu y jako kolumny 'Outcome’ oraz usuwamy tą kolumnę ze zbioru zmiennych opisujących X.

X=df.drop(['Outcome'],axis=1)
Y=df['Outcome']

Transformacja i klasyfikacja

Tworzymy obiektu pipeline jako połączenie transformatora i klasyfikatora.

Transformator ‘StandardScaler()’ standaryzuje zmienne do populacji opisanej na standaryzowanym rozkładzie normalnym.

Jako estymatora użyliśmy Support Vector Machine (SVM) specjalizującym się w klasyfikacji próbek w kierunku maksymalizacji marginesu pomiędzy grupami danych.

W tym badaniu przyjmiemy klasyfikator liniowy.

ew = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(ew)

Istnieją dwa parametry dla jądra SVM, mianowicie C i gamma. Powinniśmy ustawić siatkę parametrów przy użyciu wielokrotności 10. Deklarujemy przestrzeń hiperparametrów. Najlepsza konfiguracja hiperparametrów zostanie wybrana w drodze dostrajania modelu przez GridSearchCV.

parameteres = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}

Tworzymy zestaw danych do dalszej transformacji i estymacji. Przyjmujemy 20

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=30, stratify=Y)

Parametr stratify=y powoduje odwzorowanie struktury estymatora do struktury populacji. Dzięki temu parametrowi proporcje określonych wartości w próbce testowej będą taki sam, jak proporcja w próbce treningowej.

Dostrajanie modelu

grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)

Powyższy kod odpowiada za dostrajanie modelu.

Istnieją dwa sposoby szukania najlepszych hiperpartametrów do dostrojenia modelu:

szukanie przez siatkę (tzw. ‘Grid’)
szukanie losowe

GridSearchCV – nielosowa metoda dostrajania modelu siatki.

W Pipeline wskazaliśmy metody transformatora: StandardScaler() i estymator: Support Vector Machine.

Następnie w kodzie Hyperparameters – wskazaliśmy parametry: 'SVM__C'oraz 'SVM__gamma'.

Ponieważ używamy GridSearchCV, ta funkcja decyduje o najlepszej wartości w 'SVM__C' i innych zadanych hiperparametrach w zależności od tego, jak dobrze klasyfikator działa na zbiorze danych.

Zobaczmy jakie parametry są najlepsze według GridSearchCV:

pparam=pprint.PrettyPrinter(indent=2)
print(grid.best_params_)

Czas sprawdzić, na ile nasz model jest dobry, na ile trafnie opisuje rzeczywistość.

Przypomnijmy że celem modelu było wskazanie na podstawie wyników badań czy dany pacjent jest chory na cukrzycę czy nie. Czyli odpowiedź modelu w postaci zmiennej zależnej y powinien wynosić 0 lub 1.

y_pred = grid.predict(X_test)
y_pred = np.round(y_pred, decimals=0)

Ocena modelu przez Confusion Matrix

Do oceny naszego modelu tworzącego odpowiedzi binarne, użyjemy Confusion Matrix.

Macierz wskazuje na ile model trafnie typuje odpowiedzi. Zestawia się tu odpowiedzi ze zbioru testowego z odpowiedziami uzyskanymi w drodze predykcji.

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Macierz ma dwa wymiary ponieważ odpowiedzi mają charakter binarny.

Jak interpretować Confusion Matrix?

Macierz wskazuje ile razy model trafnie wytypował odpowiedź a ile razy się pomylił.

Liczby w polach białych oznaczają ilość trafnych typowań, zaś liczby w czarnych polach oznaczają typowania błędne. Intuicyjnie będziemy lepiej oceniali modele, które mają istotną przewagę w polach białych nad polami czarnymi.

Celność modelu (accuracy (ACC)) jest interpretowana jako dokładność klasyfikacji. Jest liczona jako suma liczb z białych pól do sumy liczb ze wszystkich pól: ACC = A+D / A+B+C+D. Czym wyższa wartość procentowa tym lepiej.

Precyzja modelu (precision or positive predictive value (PPV)) czyli poziom sprawdzalności prognoz modelu. Jest liczona jako PPV = A / A+B.

Odwołanie modelu (sensitivity, recall, hit rate, or true positive rate (TPR))– w ilu przypadkach obecni pacjenci z cukrzycą są identyfikowani przez model jako chorzy. Jest liczona jako TPR = A / A+C.

F-Score

Trudno jest porównać model z niskim Precision i wysokim Recall i odwrotnie.

Aby porównać modele należy użyć F-Score, które mierzy jednocześnie Recall i Precision.

F-Score = (2* Precision* Recall)/( Precision+ Recall) = 2A/(2A+B+C)

Porównanie estymatorów

Teraz zestawimy model oparty o estymator Support Vector Machine z modelem z poprzedniego przykładu opartym o estymator Random Forest.

Widzimy, że wyniki estymacji Support Vector Machine versus Random Forest niewiele się różnią.

Praktyczne użycie modelu Machine learning

Model możemy zapisać używając kodu:

joblib.dump(grid, 'c:/1/rf_SVM.pkl')

Możemy odczytać zapisany wcześniej model:

clf2 = joblib.load('c:/1/rf_SVM.pkl')

Wyniki predykcji możemy podstawić do wynikowych danych empirycznych.

lf2.predict(X_test)
X = df.drop('Outcome', axis=1)
WYNIK = clf2.predict(X)
df['MODEL'] = pd.Series(WYNIK)

df['Result'] = df['MODEL'] - df['Outcome']
df[['Outcome','MODEL', 'Result']].sample(10)

cdf['Result'].value_counts()

df['Result'].value_counts(normalize=True)

Artykuł Zastosowanie estymatora liniowego Support Vector Machine (SVM) do tworzenia prognozy zapadalność na cukrzycę pochodzi z serwisu THE DATA SCIENCE LIBRARY.