Quality of Poliaxid
Estimating the results of empirical research with machine learning tools
Part one: preliminary graphical analysis of coefficient dependence
By using predictive and classification models from the field of machine learning, it is possible to significantly reduce the cost of laboratory verification research.
The costs of empirical verification are included in the technical cost of production. In the production of some active chemical substances it is necessary to carry out empirical laboratory classification in order to assign the product to a separate quality class.
This research can turn out to be very expensive. In the case of short production runs, the cost of this classification can make the whole production unprofitable.
Machine learning tools can help here, because they can replace expensive laboratory investigation with a theoretical judgment.
Applying an effective prediction model can reduce the need for costly empirical research to a reasonable minimum.
Manual classification would then be performed only in special situations where the model proves ineffective, or when the process is checked by random testing.
Case study: laboratory classification of the active chemical substance Poliaxid
We will now follow the process of building a machine learning model based on classification with the Random Forest method. A chemical plant produces small amounts of an expensive chemical substance named Poliaxid.
This substance must meet very rigorous quality requirements, and each charge has to pass a special laboratory verification. These empirical trials are expensive and time-consuming, and their cost significantly influences the overall cost of production. The Poliaxid production process is monitored by many gauges. A computer records eleven variables, such as trace contents of some chemical substances, and the acidity and density of the substance. It has been noticed that the levels of some of the collected coefficients are related to the result of the final quality classification. This cause-and-effect relationship leads to the conclusion that it is possible to create a classification model explaining the overall process. In this case study we use a data set that can be downloaded from this address: http://sigmaquality.pl/wp-content/uploads/2019/09/poliaxid.csv
The data set contains the results of 1593 trials, with eleven coefficients saved during the process for each trial.
import pandas as pd
import numpy as np
# Load the Poliaxid data set (a local copy of the CSV linked above)
df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
#del df['Unnamed: 0.1']
df.head(5)
In the last column, named "quality class", we can find the results of the laboratory classification.
Classes 0 and 1 mean the best quality of the substance; results 2, 3 and 4 mean worse quality.
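The distribution of the laboratory classes can be checked, for example, with:
df['quality class'].value_counts().sort_index()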
Before we start building a machine learning model we ought to look at the data. We do this with matrix plots. These plots show us which coefficients are good predictors and display the overall dependencies between the exogenous and endogenous variables.
Graphical analysis of coefficient dependence
The step that should precede the construction of the model is a graphical overview.
In this way we learn whether a model is feasible at all.
First we ought to divide the results from the result column "quality class" into two categories: 'First' and 'Second'.
# Classes 0 and 1 become 'First'; classes 2, 3 and 4 become 'Second'
df['Qual_G'] = df['quality class'].apply(lambda x: 'First' if x < 2 else 'Second')
#del df['quality class']
#del df['nr.']
df.sample(3)
At the end of the table a new column appears: "Qual_G".
df.columns
Now we create a vector of correlations between the independent coefficients and the result factor in the column 'quality class'.
CORREL = df.corr().sort_values('quality class')
CORREL['quality class']
The correlation vector points to a significant influence of the exogenous factors on the results of the empirical classification.
We choose the most effective predictors among all eleven variables and put these variables into a correlation matrix plot.
This matrix plot uses two colors; blue dots mean first quality. Thanks to this, all the dependencies are clearly displayed.
The matrix clearly displays the patterns of dependencies between variables. It is easy to see that some of the coefficients have a significant impact on the classification into the first or second quality class.
import seaborn as sns
sns.pairplot(data=df[['factorB', 'citric catoda','sulfur in nodinol', 'noracid', 'lacapon','Qual_G']], hue='Qual_G', dropna=True)
A dichotomous division is good for displaying dependencies. Let's see what happens when we use the division into five quality classes, that is, the classes assigned by the laboratory.
import seaborn as sns
sns.pairplot(data=df[['factorB', 'lacapon','quality class']], hue='quality class', dropna=True)
Random Forest algorithm
We can use two approaches.
First: we want to know exactly which quality class the Poliaxid belongs to.
Second: we want to know whether the Poliaxid is of first-class quality or not.
We start with the first approach: a multicategorical predictor.
Multicategorical prediction (Random Forest)
The existing correlations lead to the conclusion that an effective artificial intelligence model can be applied.
This leads to two conclusions:
• The laborious classification method could be replaced by a theoretical model.
• People who monitor the production process could be informed by the model about the probable final quality of the substance.
The machine learning procedure allows us to try to build such a model.
#X.dtypes
We divide the data set into the independent variables X and the dependent variable y, which is the result of the process.
df.head()
Before we start building the model we ought to check whether all the variables have a numeric format. One variable (Qual_G) has a string format; we will convert it later.
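The formats can be checked, for example, with:
df.dtypes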
X = df.drop(['quality class','Qual_G'], axis=1)
y = df['quality class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
- Now we divide the data set into training and test subsets.
- A pipeline merges standardization and estimation. As the estimation method we use a Random Forest.
- The hyperparameters of the random forest regressor are declared.
- We tune the model using a cross-validated pipeline.
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
'randomforestregressor__max_depth': [None, 5, 3, 1]}
# 7. Tune model using cross-validation pipeline
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)
# 8. Refit on the entire training set
# No additional code needed if clf.refit == True (default is True)
# 9. Checking the classification results using the test set
y_pred = clf.predict(X_test)
#y_pred = np.round(y_pred, decimals=0)
As an apprentice, I will take this model through to the end. Without rounding, the confusion matrix would be impossible to use, because y from the test set has a discrete form while the predicted y is continuous.
Now we check which hyperparameters were used.
print("The Best parameter:",clf.best_params_)
print("The Best estimator:",clf.best_estimator_)
We check how balanced the result variable is.
y.value_counts()
Here we have the array with the prediction results of our model. You can see the continuous form of the results.
y_pred
We round the continuous data to discrete form.
y_pred = np.round(y_pred, decimals=0)
y_pred = y_pred.astype(int)
y_pred
Now we evaluate the model. We use a confusion matrix.
## confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix
# classification_report
print(classification_report(y_test, y_pred))
print("Accuracy: ",np.round(metrics.accuracy_score(y_test, y_pred), decimals=2))
### The target is multiclass, so the default average='binary' cannot be used here
#print("Precision:",np.round(metrics.precision_score(y_test, y_pred), decimals=2))
#print("Recall: ",np.round(metrics.recall_score(y_test, y_pred), decimals=2))
A Random Forest with an ad hoc adaptation to discrete results seems to work well!
According to the f1-score, the artificial intelligence model classifies well those classes which have many occurrences.
For example, class 0 has only 13 occurrences and the model cannot judge this class. In contrast, class 1 has 136 test values and the model judges it properly in 82% of cases.
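These per-class counts can be verified directly on the test split, for example:
y_test.value_counts()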
In the next part of this investigation we will test models of artificial intelligence intended for classification.
Categorical prediction (Random Forest)
In this next part of the investigation we will test a random forest regression model (which produces results in continuous form) converted to categorical classification (the results are rounded from continuous to discrete form). Such a conversion is not entirely correct.
Random forest is a popular method, but using its regression engine to obtain a discrete result is questionable.
Many scientists consider this incorrect. Andrew Ng, in his Machine Learning course on Coursera, explains why this is a bad idea; see his Lecture 6.1 – Logistic Regression | Classification on YouTube: https://www.youtube.com/watch?v=-la3q9d7AKQ&t=0s&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=33
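For comparison, here is a minimal sketch (not part of the original analysis) of essentially the same pipeline built on RandomForestClassifier, which predicts discrete classes directly and needs no rounding; the names clf_pipeline and clf_grid are introduced here only for illustration, and the multiclass target from the previous section is reused:
from sklearn.ensemble import RandomForestClassifier
# A classification forest returns discrete class labels, so no rounding step is needed
clf_pipeline = make_pipeline(preprocessing.StandardScaler(),
                             RandomForestClassifier(n_estimators=100))
clf_params = {'randomforestclassifier__max_features': ['sqrt', 'log2'],
              'randomforestclassifier__max_depth': [None, 5, 3, 1]}
clf_grid = GridSearchCV(clf_pipeline, clf_params, cv=10)
clf_grid.fit(X_train, y_train)
print(classification_report(y_test, clf_grid.predict(X_test)))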
Further on we will also use logistic regression, which is entirely correct in such a situation.
In place of 'quality class' we use the result variable Qual_G. It has a string format, with the two values 'First' and 'Second'. We should convert these values into discrete numeric form.
In logistic regression and other kinds of models it is assumed that 1 marks the class of primary interest. In our investigation this is the first quality class of Poliaxid, so we code the 'First' category (laboratory classes 0 and 1) as 1 and the 'Second' category (all worse classes) as 0.
We change the designations in the Qual_G column.
df.dtypes
# Recode the text labels to integers: 'First' -> 1, 'Second' -> 0
df.Qual_G = df.Qual_G.str.replace('First','1')
df.Qual_G = df.Qual_G.str.replace('Second','0')
df.Qual_G = df.Qual_G.astype(int)
df.Qual_G.dtypes
In discrete categorical regression it is important that the result set is balanced; that is, the number of observations in state 1 and in state 0 should be similar. When there is a significant disproportion in the result set, the model can predict defectively.
df.Qual_G.value_counts()
In this case the dependent variable set seems to be balanced.
X = df.drop(['quality class','Qual_G'], axis=1)
y = df['Qual_G']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
We repeat the same steps as before: we divide the data set into training and test subsets, build a pipeline merging standardization and Random Forest estimation, declare the hyperparameters of the random forest regressor, and tune the model with the cross-validated pipeline, this time with the binary target Qual_G.
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
'randomforestregressor__max_depth': [None, 5, 3, 1]}
# 7. Tune model using cross-validation pipeline
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred
We round the continuous data to discrete form.
y_pred = np.round(y_pred, decimals=0)
y_pred = y_pred.astype(int)
y_pred
## confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix
print(classification_report(y_test, y_pred))
print("Accuracy: ",np.round(metrics.accuracy_score(y_test, y_pred), decimals=2))
print("Precision:",np.round(metrics.precision_score(y_test, y_pred), decimals=2))
print("Recall: ",np.round(metrics.recall_score(y_test, y_pred), decimals=2))
Logistic regression
from sklearn.linear_model import LogisticRegression
X = df.drop(['quality class','Qual_G'], axis=1)
y = df['Qual_G']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = 148)
We configure the settings for the grid search.
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)
Explanation of the code:
Parameteres
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))} gives array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]). This is a typical grid setting: the regularization parameter C is searched over six orders of magnitude.
warm_start
Using warm_start=True causes the settings from the last fit to be reused as the starting point for the next run of the model. Thanks to this, the model finds convergence faster. The warm_start parameter is useful when fitting the same model multiple times with different settings.
scoring = 'roc_auc'
The ROC curve evaluates how well a classification setting separates the classes. The area under the ROC curve is one of the most popular measures of classification quality used by the grid search.
n_jobs = 5
The number of tasks running in parallel.
cv = 2
The number of cross-validation folds. The fitted model takes the form of the standard logistic regression equation:
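P(Qual_G = 1 | x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
where x1 ... xk are the process coefficients and the grid-searched parameter C is the inverse of the regularization strength applied to the coefficients b1 ... bk.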
LR_Grid.fit(Xtrain, ytrain)
ypred = LR_Grid.predict(Xtest)
## confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
print(classification_report(ytest, ypred))
print("Accuracy: ",np.round(metrics.accuracy_score(y_test, y_pred), decimals=2))
print("Precision:",np.round(metrics.precision_score(y_test, y_pred), decimals=2))
print("Recall: ",np.round(metrics.recall_score(y_test, y_pred), decimals=2))
