Estimation of the result of the empirical research with machine learning tools (part 2)

Artificial intelligence in the process of classification

The first part of this publication described the problem of assigning quality classes. Every batch of Poliaxid has to go through a rigorous, expensive test to be assigned to the proper quality class. [Source of data]

In the last study we showed that some quantitative factors associated with production have a significant impact on the final quality of the substance.

The existing correlations lead to the conclusion that an effective artificial intelligence model can be applied.

This leads to two conclusions:

  • The laborious method of classification could be replaced by a theoretical model.
  • The people who monitor the production process could be informed by the model about the probable final quality of the substance.

Machine learning lets us try to build such a model.

We load the dataset in Python.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
df.head(5)

We divide the dataset into the independent variables X and the dependent variable y, the result of the process.

X = df.drop('quality class', axis=1) 
y = df['quality class']    

Now we split the data into training and test subsets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123, stratify=y)
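As a quick sanity check on what `stratify=y` does, the sketch below splits a small, imbalanced label vector and compares the class shares in the two subsets. The data and the names `X_demo`, `y_demo` are made up for illustration and are not the Poliaxid set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: three classes with 10%, 60% and 30% shares.
rng = np.random.default_rng(123)
y_demo = pd.Series(np.repeat([0, 1, 2], [50, 300, 150]), name="quality class")
X_demo = pd.DataFrame(rng.normal(size=(len(y_demo), 4)), columns=list("abcd"))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo
)

# Class shares in each subset; with stratification they should nearly match.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(shares)
```

Without `stratify`, a rare class could end up under-represented in the test subset purely by chance, which would make the later per-class scores harder to trust.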

The pipeline combines standardization and estimation. We chose Random Forest as the estimation method.

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))

The hyperparameters of the random forest regression are declared.

# 1.0 means "use all features"; it replaces 'auto', which was removed in scikit-learn 1.3
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

We tune the model with a cross-validated grid search over the pipeline.

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)
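After fitting, a `GridSearchCV` object exposes the winning combination and its cross-validated score. The sketch below shows this on synthetic data from `make_regression`, with a smaller forest and fewer folds so that it runs quickly. The data, the `demo_` names and the reduced settings are assumptions for illustration, not the configuration used in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Poliaxid set).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=6, noise=5.0, random_state=123
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=123),
)

# make_pipeline names each step after its lowercased class name,
# hence the 'randomforestregressor__' prefix on the parameter keys.
demo_grid = {
    "randomforestregressor__max_features": ["sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 3],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # winning hyperparameter combination
print(demo_search.best_score_)   # mean cross-validated R^2 of that combination
```

Because the estimator is a regressor, the default score reported here is R^2, not classification accuracy.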

Random forest is a popular method; here its regression engine is used to obtain the result.

Many scientists think that this is incorrect. Andrew Ng, in his Machine Learning course on Coursera, explains why this is a bad idea; see his Lecture 6.1 - Logistic Regression | Classification on YouTube. https://www.youtube.com/watch?v=-la3q9d7AKQ&t=0s&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=33

I, as an apprentice, will nevertheless carry this model through to the end. The predictions have to be rounded: without rounding, a confusion matrix could not be used, because y in the test set is discrete while the predicted y is continuous.

Here is the array with the predictions of our model. You can see that the result is continuous.

y_pred = clf.predict(X_test)
y_pred

The empirical result is discrete.

y.value_counts()

We round the continuous predictions to discrete values.

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)
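One thing worth checking after rounding is that every value is still a valid class label, since a regressor's output is not tied to the set of classes. The following sketch uses made-up predictions and an assumed label range of 0 to 3; the clipping and the integer cast are illustrative additions, not steps from the workflow above.

```python
import numpy as np

# Hypothetical continuous outputs of a regressor (not real model output).
raw_pred = np.array([0.2, 1.7, 2.4, 3.6, -0.3])

# Assumed range of valid class labels for this illustration.
valid_min, valid_max = 0, 3

rounded = np.round(raw_pred, decimals=0)          # still floats, e.g. 2.0
clipped = np.clip(rounded, valid_min, valid_max)  # pull out-of-range values back
labels = clipped.astype(int)                      # integer class labels

print(labels)
```

Note that `np.round` rounds halves to the nearest even value, so 0.5 becomes 0 and 1.5 becomes 2.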

A typical regression equation should not be used for classification, but logistic regression appears to be able to classify.

This is an occasion to compare linear regression, logistic regression and a typical classification tool, the Support Vector Machine.

Now we evaluate the model. We use a confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

print(classification_report(y_test, y_pred))
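To make the report easier to read, the sketch below recomputes precision, recall and f1-score per class directly from a confusion matrix. The labels `y_true_demo` and `y_pred_demo` are hypothetical and chosen so that one rare class is never predicted correctly; they are not results from the Poliaxid model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([1, 1, 1, 1, 1, 0, 2, 2, 1, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted

tp = np.diag(cm).astype(float)     # correctly classified cases per class
predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
actual_totals = cm.sum(axis=1)     # row sums: the "support" column of the report

# Guard against division by zero for classes that are never predicted.
precision = np.divide(tp, predicted_totals,
                      out=np.zeros_like(tp), where=predicted_totals > 0)
recall = np.divide(tp, actual_totals,
                   out=np.zeros_like(tp), where=actual_totals > 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom,
               out=np.zeros_like(tp), where=denom > 0)

print(cm)
print(f1)
```

In this toy case the rare class 0 gets an f1-score of 0 while the larger classes score much higher, which is the same pattern the report shows for classes with few occurrences.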

The Random Forest regressor, temporarily adapted to discrete results, seems to perform well!

According to the f1-score, the model classifies well those classes that have many occurrences.

For example, class 0 has 13 occurrences and the model cannot identify it. Class 1 is the opposite case: it has 136 test values and the model judges it correctly in 78% of cases.

In the next part of this investigation we will test artificial intelligence models that are intended for classification.

Next part:

Estimation of the result of the empirical research with machine learning tools. Classification with SVM Support Vector Machine (part 3)