TensorFlow linear classifier – example 1

EN201220191421

Source of data: poliaxid

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('c:/2/poliaxid.csv', index_col=0)
del df['nr.']
df.head(5)
Out[1]:
factorA factorB citric catoda residual butanol caroton stable nodinol sulfur in nodinol density pH noracid lacapon quality class
0 4.933333 0.466667 0.000000 1.266667 0.050667 7.333333 22.666667 0.665200 2.340000 0.373333 6.266667 1
1 5.200000 0.586667 0.000000 1.733333 0.065333 16.666667 44.666667 0.664533 2.133333 0.453333 6.533333 1
2 5.200000 0.506667 0.026667 1.533333 0.061333 10.000000 36.000000 0.664667 2.173333 0.433333 6.533333 1
3 7.466667 0.186667 0.373333 1.266667 0.050000 11.333333 40.000000 0.665333 2.106667 0.386667 6.533333 2
4 4.933333 0.466667 0.000000 1.266667 0.050667 7.333333 22.666667 0.665200 2.340000 0.373333 6.266667 1

1. We analyze the completeness of the data

In [2]:
df.isnull().sum()
Out[2]:
factorA              0
factorB              0
citric catoda        0
residual butanol     0
caroton              0
stable nodinol       0
sulfur in nodinol    0
density              0
pH                   0
noracid              0
lacapon              0
quality class        0
dtype: int64

The data is complete

2. Result variable analysis – Creating result classes

In [3]:
df['quality class'].value_counts().plot(kind='bar')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x219bca29cf8>

We interviewed the owner of the process, who wants to separate substances in quality classes 0 and 1 from all other quality classes. We therefore create two quality classes.

In [4]:
df['quality class'].dtypes
Out[4]:
dtype('int64')

We map the 5 classes into two: classes 0 and 1 become 1, and all other classes become 0.

In [5]:
df['quality_class2'] = df['quality class'].apply(lambda x: 1 if x <= 1 else 0)
In [6]:
df['quality_class2'].value_counts()
Out[6]:
0    852
1    742
Name: quality_class2, dtype: int64

The two classes are roughly balanced (852 vs. 742 rows), so there is no need to perform class equalization (e.g. oversampling).
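The mapping and the balance check can be reproduced on a small example. This is a minimal sketch: the `quality` series is hypothetical toy data, while the mapping rule and the counts 852 and 742 are taken from the cells above.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['quality class']
quality = pd.Series([0, 1, 1, 2, 3, 4, 2, 1])

# Same rule as in cell In [5]: classes 0 and 1 become 1, all others become 0
binary = quality.apply(lambda x: 1 if x <= 1 else 0)

# Share of each binary class; shares close to 0.5 indicate a balanced target
shares = binary.value_counts(normalize=True)
print(shares)

# Shares computed from the counts reported in Out[6]
total = 852 + 742
share_0 = 852 / total
share_1 = 742 / total
print(round(share_0, 3), round(share_1, 3))
```

Both shares stay within a few percentage points of one half, which is why no resampling is applied here.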

We correct the variable names by replacing spaces with underscores.

In [7]:
df.columns = ['factorA', 'factorB', 'citric_catoda', 'residual_butanol', 'caroton',
       'stable_nodinol', 'sulfur_in_nodinol', 'density', 'pH', 'noracid',
       'lacapon', 'quality_class','quality_class2']

3. We divide variables into independent variables and a dependent variable

In [8]:
X = df.drop('quality_class2', axis=1) 
y = df['quality_class2']  

4. We divide the data into a test and training set

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123, stratify=y)
In [10]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(1275, 12) (1275,) (319, 12) (319,)
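The `stratify=y` argument keeps the class proportions the same in both parts. Below is a small sketch on hypothetical data (100 rows, 40% positive labels) that shows the effect; only the call to `train_test_split` mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with 40 positive and 60 negative labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# Same arguments as in cell In [9]
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=123, stratify=y_toy)

# With stratification both parts keep the 40% share of positive labels
print(len(Xtr), len(Xte))
print(ytr.mean(), yte.mean())
```

In the notebook the same thing is visible later in the evaluation output: the `label/mean` of the test set is close to the share of class 1 in the full data.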

5. Initial data analysis

In [11]:
df.dtypes
Out[11]:
factorA              float64
factorB              float64
citric_catoda        float64
residual_butanol     float64
caroton              float64
stable_nodinol       float64
sulfur_in_nodinol    float64
density              float64
pH                   float64
noracid              float64
lacapon              float64
quality_class          int64
quality_class2         int64
dtype: object
In [12]:
df.head()
Out[12]:
factorA factorB citric_catoda residual_butanol caroton stable_nodinol sulfur_in_nodinol density pH noracid lacapon quality_class quality_class2
0 4.933333 0.466667 0.000000 1.266667 0.050667 7.333333 22.666667 0.665200 2.340000 0.373333 6.266667 1 1
1 5.200000 0.586667 0.000000 1.733333 0.065333 16.666667 44.666667 0.664533 2.133333 0.453333 6.533333 1 1
2 5.200000 0.506667 0.026667 1.533333 0.061333 10.000000 36.000000 0.664667 2.173333 0.433333 6.533333 1 1
3 7.466667 0.186667 0.373333 1.266667 0.050000 11.333333 40.000000 0.665333 2.106667 0.386667 6.533333 2 0
4 4.933333 0.466667 0.000000 1.266667 0.050667 7.333333 22.666667 0.665200 2.340000 0.373333 6.266667 1 1
In [13]:
df.columns
Out[13]:
Index(['factorA', 'factorB', 'citric_catoda', 'residual_butanol', 'caroton',
       'stable_nodinol', 'sulfur_in_nodinol', 'density', 'pH', 'noracid',
       'lacapon', 'quality_class', 'quality_class2'],
      dtype='object')
In [14]:
CORREL = df.corr().sort_values('quality_class2')
CORREL['quality_class2'].to_frame().sort_values('quality_class2')
Out[14]:
quality_class2
quality_class -0.857720
lacapon -0.434428
noracid -0.217694
citric_catoda -0.158729
factorA -0.096015
residual_butanol 0.002878
pH 0.003047
stable_nodinol 0.062147
caroton 0.109112
density 0.159405
sulfur_in_nodinol 0.232313
factorB 0.320739
quality_class2 1.000000
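The strongest correlation by far is with `quality_class`, which is the column the label was computed from. The sketch below uses a hypothetical toy frame (the `noise` column and all values are made up) to show how a ranking like this can flag a column that carries the label.

```python
import pandas as pd

# Hypothetical toy data: one source column and one unrelated column
toy = pd.DataFrame({
    'quality_class': [0, 1, 1, 2, 2, 3, 3, 4],
    'noise':         [5, 1, 4, 2, 6, 3, 7, 0],
})

# Label derived from quality_class with the same rule as in cell In [5]
toy['quality_class2'] = (toy['quality_class'] <= 1).astype(int)

# Correlation of every other column with the label
corr = toy.corr()['quality_class2'].drop('quality_class2')
strongest = corr.abs().idxmax()

print(corr)
print(strongest)
```

A column that correlates this strongly with the label, and that was used to build it, is a candidate for removal from the independent variables.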
In [15]:
plt.figure(figsize=(10,3))
CORREL['quality_class2'].to_frame().sort_values('quality_class2')
CORREL['quality_class2'].plot(kind='barh', color='red')
plt.title('Correlation with the outcome variable', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[15]:
Text(0, 0.5, 'Continuous independent variables')

Correlation matrix, with rows sorted by correlation with the outcome variable

In [16]:
import seaborn as sns

plt.figure(figsize=(8,8))
sns.heatmap(CORREL, cmap="YlGnBu", annot=True, cbar=True)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x219c4fedac8>

Correlation of variables

In [17]:
CORREL = df.corr()
plt.figure(figsize=(8,8))
sns.heatmap(CORREL, cmap="YlGnBu", annot=True, cbar=True)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x219c54034a8>
In [18]:
import tensorflow as tf
feat_column = tf.contrib.layers.real_valued_column('features', dimension=13)
In [19]:
estimator = tf.estimator.LinearClassifier(feature_columns=[feat_column],
                                          n_classes=2,
                                          model_dir = "kernel_e"
                                         )
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'kernel_e', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000219C989F470>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Step 3: Convert the continuous variables to TensorFlow feature columns with the function:

tf.feature_column.numeric_column

A single variable converted to a TensorFlow feature column

In [20]:
age = tf.feature_column.numeric_column('caroton')
age
Out[20]:
_NumericColumn(key='caroton', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

Converting all numeric variables to TensorFlow feature columns

In [21]:
COLUMNS  = ['factorA', 'factorB', 'citric_catoda', 'residual_butanol', 'caroton',
       'stable_nodinol', 'sulfur_in_nodinol', 'density', 'pH', 'noracid',
       'lacapon', 'quality_class','quality_class2']

features = [tf.feature_column.numeric_column(k) for k in COLUMNS]
features
Out[21]:
[_NumericColumn(key='factorA', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='factorB', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='citric_catoda', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='residual_butanol', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='caroton', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='stable_nodinol', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='sulfur_in_nodinol', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='density', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='pH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='noracid', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='lacapon', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='quality_class', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='quality_class2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 5. Creating a linear classification model

5.1 Define the classifier

In [22]:
model = tf.estimator.LinearClassifier(
    n_classes = 2,
    model_dir="ongoing/train5", 
    feature_columns=features)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'ongoing/train5', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000219C989FA58>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

5.2 Create the input function

In [23]:
COLUMNS  = ['factorA', 'factorB', 'citric_catoda', 'residual_butanol', 'caroton',
       'stable_nodinol', 'sulfur_in_nodinol', 'density', 'pH', 'noracid',
       'lacapon', 'quality_class','quality_class2']
LABEL= 'quality_class2'
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
       x=pd.DataFrame({k: data_set[k].values for k in COLUMNS}),
       y = pd.Series(data_set[LABEL].values),
       batch_size=n_batch,   
       num_epochs=num_epochs,
       shuffle=shuffle)
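To make the roles of `batch_size`, `num_epochs` and `shuffle` concrete, here is a simplified stand-in written with NumPy only. The helper `batches` is hypothetical and is not how `pandas_input_fn` is implemented (the real function feeds a queue); it only illustrates the batching behaviour we rely on.

```python
import itertools
import numpy as np

def batches(n_rows, batch_size, num_epochs=None, shuffle=False, seed=0):
    """Yield arrays of row indices batch by batch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    epoch = 0
    # num_epochs=None means: keep cycling over the data indefinitely
    while num_epochs is None or epoch < num_epochs:
        order = rng.permutation(n_rows) if shuffle else np.arange(n_rows)
        for start in range(0, n_rows, batch_size):
            yield order[start:start + batch_size]
        epoch += 1

# One pass over 319 test rows in batches of 128
sizes = [len(b) for b in batches(319, 128, num_epochs=1)]
print(sizes)  # [128, 128, 63]

# With num_epochs=None the generator never ends, so the caller must stop it
first_five = list(itertools.islice(batches(1275, 128), 5))
print(len(first_five))
```

This is why training below passes `num_epochs=None` together with `steps=1000`, while evaluation uses `num_epochs=1` to see every test row exactly once.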

5.3 Train the model

Preparation of the training and test data frames (features together with the label)

In [30]:
df_train = pd.concat([X_train, y_train], axis=1, sort=False) 
df_test = pd.concat([X_test, y_test], axis=1, sort=False) 

Training the model for 1000 steps

In [25]:
model.train(input_fn=get_input_fn(df_train, 
                                      num_epochs=None,
                                      n_batch = 128,
                                      shuffle=False),
                                      steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ongoing/train5model.ckpt.
INFO:tensorflow:loss = 88.72288, step = 1
INFO:tensorflow:global_step/sec: 287.186
INFO:tensorflow:loss = 15.2773905, step = 101 (0.348 sec)
INFO:tensorflow:global_step/sec: 330.966
INFO:tensorflow:loss = 8.641602, step = 201 (0.302 sec)
INFO:tensorflow:global_step/sec: 298.718
INFO:tensorflow:loss = 5.899028, step = 301 (0.335 sec)
INFO:tensorflow:global_step/sec: 289.405
INFO:tensorflow:loss = 4.651561, step = 401 (0.346 sec)
INFO:tensorflow:global_step/sec: 307.797
INFO:tensorflow:loss = 4.288885, step = 501 (0.325 sec)
INFO:tensorflow:global_step/sec: 270.935
INFO:tensorflow:loss = 3.759474, step = 601 (0.369 sec)
INFO:tensorflow:global_step/sec: 309.218
INFO:tensorflow:loss = 3.1157045, step = 701 (0.323 sec)
INFO:tensorflow:global_step/sec: 331.573
INFO:tensorflow:loss = 2.6179585, step = 801 (0.302 sec)
INFO:tensorflow:global_step/sec: 318.118
INFO:tensorflow:loss = 2.471261, step = 901 (0.314 sec)
INFO:tensorflow:Saving checkpoints for 1000 into ongoing/train5model.ckpt.
INFO:tensorflow:Loss for final step: 2.152443.
Out[25]:
<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x219c989f550>
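Because the input function repeats the data indefinitely, the length of training is set by `steps`. The short calculation below converts the 1000 steps into passes over the training set, using only the numbers shown in the cells above.

```python
n_rows = 1275      # training rows, from the shapes printed in cell In [10]
batch_size = 128   # n_batch passed to get_input_fn
steps = 1000       # steps passed to model.train

# Each step consumes one batch
rows_seen = steps * batch_size

# Number of full passes over the training data
epochs = rows_seen / n_rows

print(rows_seen, round(epochs, 1))  # 128000 100.4
```

So the model sees each training row about 100 times.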

5.4 Evaluate the performance of the model

In [31]:
model.evaluate(input_fn=get_input_fn(df_test, 
                                      num_epochs=1,
                                      n_batch = 128,
                                      shuffle=False),
                                      steps=1000)
INFO:tensorflow:Starting evaluation at 2019-12-20-12:57:55
INFO:tensorflow:Restoring parameters from ongoing/train5model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2019-12-20-12:57:56
INFO:tensorflow:Saving dict for global step 1000: accuracy = 1.0, accuracy_baseline = 0.5360502, auc = 1.0, auc_precision_recall = 1.0, average_loss = 0.01771097, global_step = 1000, label/mean = 0.46394983, loss = 1.8832666, prediction/mean = 0.46446657
Out[31]:
{'accuracy': 1.0,
 'accuracy_baseline': 0.5360502,
 'auc': 1.0,
 'auc_precision_recall': 1.0,
 'average_loss': 0.01771097,
 'label/mean': 0.46394983,
 'loss': 1.8832666,
 'prediction/mean': 0.46446657,
 'global_step': 1000}
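Several of these metrics can be checked by hand. The sketch below recomputes them from the reported values; it assumes that `loss` is the loss summed within each batch and then averaged over the batches, which is consistent with the numbers but is an assumption on my side.

```python
import math

label_mean = 0.46394983     # 'label/mean' from Out[31]
average_loss = 0.01771097   # 'average_loss' from Out[31]
n_test = 319                # test rows, from cell In [10]
batch_size = 128            # n_batch passed to get_input_fn

# Accuracy obtained by always predicting the majority class
baseline = max(label_mean, 1 - label_mean)

# Evaluation batches for one epoch: 128 + 128 + 63 rows
n_batches = math.ceil(n_test / batch_size)

# Assumed relation: total loss divided by the number of batches
loss_per_batch = average_loss * n_test / n_batches

# Number of test rows with label 1
positives = round(label_mean * n_test)

print(baseline, n_batches, loss_per_batch, positives)
```

The recomputed baseline and loss agree with `accuracy_baseline` and `loss` in the output, so the only surprising values are the accuracy and AUC of exactly 1.0.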

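The model is perfect, and that worries me: the feature list still contains quality_class and quality_class2, so the label itself is passed to the model as an input.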
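A perfect score is suspicious here because `COLUMNS` contains both `quality_class`, from which the label was derived, and `quality_class2`, which is the label. A minimal sketch of a leak-free feature list follows; the names `LEAKY` and `FEATURES` and the one-row toy frame are hypothetical, and the source does not show what accuracy the model reaches after this change.

```python
import pandas as pd

COLUMNS = ['factorA', 'factorB', 'citric_catoda', 'residual_butanol', 'caroton',
           'stable_nodinol', 'sulfur_in_nodinol', 'density', 'pH', 'noracid',
           'lacapon', 'quality_class', 'quality_class2']
LABEL = 'quality_class2'

# Columns that reveal the label: its source column and the label itself
LEAKY = ['quality_class', LABEL]

# Keep only the 11 measured variables as model inputs
FEATURES = [c for c in COLUMNS if c not in LEAKY]

# Hypothetical one-row frame, used only to show the split into X and y
toy = pd.DataFrame([[0.0] * 11 + [1, 1]], columns=COLUMNS)
X_clean = toy[FEATURES]
y_clean = toy[LABEL]

print(len(FEATURES))
print(list(X_clean.columns))
```

Under this assumption, `FEATURES` would replace `COLUMNS` both when building the feature columns and inside the input function, while `LABEL` stays as it is.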