Exercises on mathematical operations in TensorFlow 1.4
https://sigmaquality.pl/tensorflow-3/exercises-on-mathematical-operations-in-tensorflow-1-4/


EN121220190807

Practice makes perfect

In [1]:
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

Exercise 1: Evaluate the following expression on tensors.

$x = \frac{12}{4}$

In [2]:
R = tf.constant(12, tf.int16, name="twelve") 
print(R)
Tensor("twelve:0", shape=(), dtype=int16)
In [3]:
F = tf.constant(4, tf.int16, name="twelve") 
print(F)
Tensor("twelve_1:0", shape=(), dtype=int16)
In [4]:
KOT = tf.div (R, F)
print(KOT)
Tensor("div:0", shape=(), dtype=int16)
In [5]:
with tf.Session() as sess:    
    result_2 = KOT.eval()
print(result_2) 
3
In [6]:
sess = tf.Session()
print(sess.run(KOT))
3
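
Two evaluation styles appear above: a with-block combined with Tensor.eval(), and an explicit sess.run(). A third option in TensorFlow 1.x, convenient for interactive work, is tf.InteractiveSession, which installs itself as the default session; a minimal sketch:

sess = tf.InteractiveSession()
print(KOT.eval())  # 3, no with-block needed
sess.close()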

Exercise 2: Evaluate the following expression on tensors

$x = 8 \cdot 2$

In [7]:
F = tf.constant(8) 
R = tf.constant(2) 
FFT = tf.multiply (F, R)
In [8]:
sess = tf.Session()
print(sess.run(FFT))
16
In [9]:
with tf.Session() as sess:    
    result_2 = FFT.eval()
print(result_2) 
16

Exercise 3: Mathematical operations on tensors

$x = 5^2$

In [10]:
A = tf.constant(5) 
B = tf.constant(2) 
PKO = tf.pow(A, B)
In [11]:
sess = tf.Session()
print(sess.run(PKO))
25

Exercise 4: Mathematical operations on tensors

You do not need to specify the data type, because TensorFlow infers it from the value passed to the constant.

$x = 15.9 - 2.1$

In [12]:
A = tf.constant(15.9) 
B = tf.constant(2.1) 
SKO = tf.subtract(A, B)
In [13]:
sess = tf.Session()
print(sess.run(SKO))
13.799999

Exercise 5: Evaluate the following expression on tensors

TensorFlow requires all operands of an operation to share the same data type.
We create the mathematical formula in Python:

$x = (1.7 - 2.4) + 15$

In [14]:
A = tf.constant(1.7, tf.float32) 
B = tf.constant(2.4, tf.float32)
C = tf.constant(15, tf.float32)

SZK = tf.add(tf.subtract(A, B), C)
In [15]:
sess = tf.Session()
print(sess.run(SZK))
14.3

Exercise 6: Evaluate the following expression on tensors

TensorFlow requires all operands of an operation to share the same data type.
We create the mathematical formula in Python:

$x = (2.07 - 1.3)^{2.7}$

In [16]:
A = tf.constant(2.07) 
B = tf.constant(1.3)
C = tf.constant(2.7)

SSF = tf.subtract(A, B)
PKO = tf.pow(SSF,C)
In [17]:
with tf.Session() as sess:    
    result_2 = PKO.eval()
print(result_2) 
0.49377024

Exercise 7: Mathematical operations on tensors

We create the mathematical formula in Python:

$x = \sqrt{2.1 \pi}$

We can use the math library resources:
https://docs.python.org/3/library/math.html

In [18]:
import math
a = math.pi
a
Out[18]:
3.141592653589793
In [19]:
C = tf.constant(2.1) 
D = tf.constant(math.pi)

GG = tf.multiply(C, D)
PKK = tf.sqrt(GG)
In [20]:
with tf.Session() as sess:    
    result_2 = PKK.eval()
print("The x value sought is: ",result_2) 
The x value sought is:  2.5685296

Exercise 8: Operations on TensorFlow tensors

Please calculate the value of x in TensorFlow:

$x = \sqrt[3]{15}$

Let's first check the value in plain Python:

In [21]:
math.pow(15,(1/3))
Out[21]:
2.46621207433047

In TensorFlow, be careful about the data types: the constants below are created as float32 so that the division yields a fraction (see the sketch after this exercise).

In [22]:
C = tf.constant(15, tf.float32) 
A = tf.constant(3, tf.float32)
H = tf.constant(1, tf.float32)
G = tf.div(H,A)

PK = tf.pow(C,G)
In [23]:
with tf.Session() as sess:    
    result_2 = G.eval()
print(result_2) 
0.33333334
In [24]:
with tf.Session() as sess:    
    result_2 = PK.eval()
print(result_2) 
2.466212
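
The float32 constants matter here. With integer tensors, tf.div performs integer division, so 1/3 would evaluate to 0 and the whole expression would collapse to 15**0 = 1. A quick sketch of the pitfall (illustrative names):

H_int = tf.constant(1)
A_int = tf.constant(3)
with tf.Session() as sess:
    print(sess.run(tf.div(H_int, A_int)))  # 0 - integer division discards the remainder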

Exercise 9: Operations on TensorFlow tensors

Please calculate the value of x in TensorFlow:

$x = e^{1/4}$

We can use the math library resources: https://docs.python.org/3/library/math.html

In [25]:
math.e
Out[25]:
2.718281828459045
In [26]:
math.pow(math.e,(1/4))
Out[26]:
1.2840254166877414
In [27]:
C = tf.constant(math.e, tf.float32) 
A = tf.constant(4, tf.float32)
H = tf.constant(1, tf.float32)
G = tf.div(H,A)

ZHP = tf.pow(C,G)
In [28]:
with tf.Session() as sess:    
    result_2 = ZHP.eval()
print(result_2) 
1.2840254

Exercise 10. Changing the tensor type from float to int

TensorFlow automatically selects the data type when it is not specified at tensor creation; tf.cast changes the type explicitly.

In [29]:
PKP = tf.constant(3.123456789, tf.float32)
ZNP = tf.cast(PKP, dtype=tf.int32)

print(PKP.dtype)
print(ZNP.dtype)
<dtype: 'float32'>
<dtype: 'int32'>
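
Note that casting float to int does not round; it truncates toward zero. A minimal sketch (values chosen for illustration):

with tf.Session() as sess:
    print(sess.run(tf.cast(tf.constant(2.9), tf.int32)))   # 2, the fraction is discarded
    print(sess.run(tf.cast(tf.constant(-2.9), tf.int32)))  # -2, truncation is toward zero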

Exercises on matrix operations in TensorFlow 1.4
https://sigmaquality.pl/tensorflow-3/exercises-on-matrix-operations-in-tensorflow-1-4/

EN111220192126

Practice makes perfect

In [1]:
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

Exercise 1. Please define a tensor: constant = vector [1, 2, 3] in format int16

image.png

$ begin{bmatrix}
1 \
2 \
3
end{bmatrix}$

In [2]:
vector1 = tf.constant([1, 2, 3], tf.int16)
print(vector1)
Tensor("Const:0", shape=(3,), dtype=int16)

Exercise 2. Please define a tensor: constant = matrix [1, 2, 3, 4] in format int16

$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$

In [3]:
matrix1 = tf.constant ([[1, 2],[3, 4]], tf.int16)
print (matrix1)
Tensor("Const_1:0", shape=(2, 2), dtype=int16)

Exercise 3. Please define a tensor: constant = matrix [1, 2, 3, 4, 5, 6] in format int16

$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$

In [4]:
matrix2 = tf.constant([[1, 2], [3, 4], [5, 6]], tf.int16)
print (matrix2)
Tensor("Const_2:0", shape=(3, 2), dtype=int16)
In [5]:
matrix2.shape
Out[5]:
TensorShape([Dimension(3), Dimension(2)])

Exercise 4. Create a tensor in the form of a vector of a specific shape 5, filled with zeros

$\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$

In [6]:
kot = tf.zeros(5)
print(kot)
kot.shape
Tensor("zeros:0", shape=(5,), dtype=float32)
Out[6]:
TensorShape([Dimension(5)])

Exercise 5. Create a tensor in the form of a matrix with a specific 4×4 shape, filled with only ones

$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}$

In [7]:
fok = tf.ones ([4, 4])
print(fok)
fok.shape
Tensor("ones:0", shape=(4, 4), dtype=float32)
Out[7]:
TensorShape([Dimension(4), Dimension(4)])

Exercise 6. Matrix addition

$\begin{bmatrix} 2 & 5 \\ 4 & 5 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} + \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$

To add two matrices, their dimensions must be equal. That is, the number of rows and the number of columns of the first and second matrices must be equal.

In [8]:
matrix3 = tf.constant([[1, 3], [2, 4]], tf.int16)
matrix4 = tf.constant([[1, 2], [2, 1]], tf.int16)
In [9]:
PZU = tf.add(matrix3, matrix4)
In [10]:
sess = tf.Session()
print(sess.run(PZU))
[[2 5]
 [4 5]]
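
When the shapes do not match, TensorFlow rejects the addition already at graph-construction time; a minimal sketch with hypothetical shapes:

a = tf.ones([2, 2])
b = tf.ones([3, 2])
try:
    bad = tf.add(a, b)
except ValueError as e:
    print(e)  # shapes (2, 2) and (3, 2) are incompatible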

Exercise 7. Matrix subtraction

$\begin{bmatrix} 0 & 1 \\ 0 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$

To subtract two matrices their dimensions must be equal. That is, the number of rows and the number of columns of the first and second matrices must be equal.

In [11]:
PZU = tf.subtract(matrix3, matrix4)
In [12]:
sess = tf.Session()
print(sess.run(PZU))
[[0 1]
 [0 3]]

Exercise 8. Multiply the matrix by a number

$\begin{bmatrix} 5 & 10 \\ 15 & 20 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \cdot 5$

In [13]:
matrix1 = tf.constant ([[1, 2],[3, 4]], tf.int16)
In [14]:
FOK = tf.multiply (matrix1, 5)
In [15]:
sess = tf.Session()
print(sess.run(FOK))
[[ 5 10]
 [15 20]]

Exercise 9. Multiply a matrix by a number in TensorFlow

$\begin{bmatrix} 10 & 35 & 20 \\ 25 & 10 & 5 \\ 30 & 15 & 20 \end{bmatrix} = \begin{bmatrix} 2 & 7 & 4 \\ 5 & 2 & 1 \\ 6 & 3 & 4 \end{bmatrix} \cdot 5$

In [16]:
matrix5 = tf.constant ([[2,7,4],[5,2,1],[6,3,4]], tf.int16)
In [17]:
ZHP = tf.multiply (matrix5, 5)
In [18]:
sess = tf.Session()
print(sess.run(ZHP))
[[10 35 20]
 [25 10  5]
 [30 15 20]]

Exercise 10. Multiply a matrix by a matrix in TensorFlow

Matrix multiplication is not commutative.

$\begin{bmatrix} 33 & 40 \\ 86 & 66 \end{bmatrix} = \begin{bmatrix} 1 & 4 & 6 \\ 8 & 2 & 4 \end{bmatrix} \cdot \begin{bmatrix} 9 & 6 \\ 3 & 7 \\ 2 & 1 \end{bmatrix}$

In [19]:
matrix6 = tf.constant([[1,4,6],[8,2,4]],tf.int32)
matrix7 = tf.constant([[9,6],[3,7],[2,1]],tf.int32)

KPU = tf.matmul(matrix6,matrix7)
In [20]:
sess = tf.Session()
print(sess.run(KPU))
[[33 40]
 [86 66]]

Exercise 11. Multiply a matrix by a matrix in TensorFlow

Matrix multiplication is not commutative: here we compute the product in the reverse order.

In [21]:
KPU = tf.matmul(matrix7,matrix6)
In [22]:
sess = tf.Session()
print(sess.run(KPU))
[[57 48 78]
 [59 26 46]
 [10 10 16]]
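
Even for square matrices, where both products are defined and have the same shape, A·B and B·A generally differ; a small sketch with illustrative 2×2 matrices:

P = tf.constant([[1, 2], [3, 4]])
Q = tf.constant([[0, 1], [1, 0]])
with tf.Session() as sess:
    print(sess.run(tf.matmul(P, Q)))  # [[2 1] [4 3]]
    print(sess.run(tf.matmul(Q, P)))  # [[3 4] [1 2]]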

Exercise 12. Multiplication of C⋅D square matrices

In [23]:
matrix8 = tf.constant([[5,6,1],[8,7,9],[1,5,2]],tf.int32)
matrix9 = tf.constant([[4,6,7],[2,5,1],[0,3,9]],tf.int32)
In [24]:
PKS = tf.matmul(matrix8,matrix9)
In [25]:
sess = tf.Session()
print(sess.run(PKS))
[[ 32  63  50]
 [ 46 110 144]
 [ 14  37  30]]

Exercise 13. Transpose the matrix

$A = \begin{bmatrix} 1 & 3 & 9 \\ 7 & 2 & 5 \end{bmatrix}$

$A^T = \begin{bmatrix} 1 & 7 \\ 3 & 2 \\ 9 & 5 \end{bmatrix}$

In [26]:
x = tf.constant([[1, 3, 9], [7, 2, 5]])
GAP = tf.transpose(x)
In [27]:
sess = tf.Session()
print(sess.run(GAP))
[[1 7]
 [3 2]
 [9 5]]

Exercise 14. Determinant of the matrix

$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$

$|A| = (1 \cdot 4) - (3 \cdot 2) = -2$

In [28]:
matrix1 = tf.constant ([[1, 2],[3, 4]], tf.float32)
In [29]:
PKO = tf.matrix_determinant(matrix1)
In [30]:
sess = tf.Session()
print(sess.run(PKO))
-2.0

Exercise 15. Diagonal matrix

A diagonal matrix is a matrix, usually square, in which all entries outside the main diagonal are zero. In other words, it is simultaneously an upper- and lower-triangular matrix.

In [31]:
PPS = tf.diag([1.2, 1.5, 1.0, 7.1, 2, 8.3])
In [32]:
sess = tf.Session()
print(sess.run(PPS))
[[1.2 0.  0.  0.  0.  0. ]
 [0.  1.5 0.  0.  0.  0. ]
 [0.  0.  1.  0.  0.  0. ]
 [0.  0.  0.  7.1 0.  0. ]
 [0.  0.  0.  0.  2.  0. ]
 [0.  0.  0.  0.  0.  8.3]]
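
For completeness, TensorFlow 1.x also provides the inverse operation, tf.diag_part, which extracts the main diagonal from a matrix; a short sketch using the tensor above:

with tf.Session() as sess:
    print(sess.run(tf.diag_part(PPS)))  # [1.2 1.5 1.  7.1 2.  8.3]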

Exercise 16. Outputs random values from a truncated normal distribution

https://docs.w3cub.com/tensorflow~python/tf/truncated_normal/

The generated values follow a normal distribution with specified mean and standard deviation, except that values whose magnitude is more than 2 standard deviations from the mean are dropped and re-picked.

In [33]:
ABC = tf.truncated_normal([2, 3])
In [34]:
sess = tf.Session()
print(sess.run(ABC))
[[ 0.05371307  1.4564506  -1.7267214 ]
 [-1.7192566   1.5986782   0.91717476]]
In [35]:
print(ABC.shape)
(2, 3)

Task: generate a 3 by 5 matrix of random values with mean 7 and standard deviation 2, in float32 format.

In [39]:
KSU = tf.truncated_normal([3,5],mean=7,stddev=2.0,dtype=tf.float32)
sess = tf.Session()
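
The cell above builds the op but never evaluates it; a sketch of running it (the output is random, so no fixed values are shown):

print(sess.run(KSU))  # 3x5 matrix; values drawn around mean 7, clipped to +/- 2 standard deviations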

Exercise 17. Creates a tensor filled with a scalar value.

This operation creates a tensor of shape dims and fills it with value.
https://docs.w3cub.com/tensorflow~python/tf/fill/

In [40]:
print(sess.run(tf.fill([4,2],4)))
[[4 4]
 [4 4]
 [4 4]
 [4 4]]

A 4×2 matrix filled with the value 4.

Exercise 18. Outputs random values from a uniform distribution.

The generated values follow a uniform distribution in the range [minval, maxval). The lower bound minval is included in the range, while the upper bound maxval is excluded.

https://docs.w3cub.com/tensorflow~python/tf/random_uniform/

In [41]:
GAD = tf.random_uniform([3,7], minval=4, maxval=8, dtype=tf.float32)
In [42]:
sess = tf.Session()
print(sess.run(GAD))
[[5.8656416 4.652042  4.3458786 5.7581544 6.0479655 6.6700726 6.035503 ]
 [5.729369  7.1762366 6.052859  6.8669724 6.278867  5.959975  7.3483486]
 [7.441143  7.3443627 6.4590683 5.6526484 5.656465  4.5448027 5.9519987]]

Exercise 19. Converting a NumPy matrix into a TensorFlow tensor

We have an existing NumPy matrix and want to use it in a TensorFlow model.

In [43]:
import numpy as np
KAT = np.array([[1., 2., 3.],[-3., -7., -1.],[0., 5., -2.]])
KAT
Out[43]:
array([[ 1.,  2.,  3.],
       [-3., -7., -1.],
       [ 0.,  5., -2.]])
In [44]:
DOK = tf.convert_to_tensor(KAT)
In [45]:
sess = tf.Session()
print(sess.run(DOK))
[[ 1.  2.  3.]
 [-3. -7. -1.]
 [ 0.  5. -2.]]

Tensorflow – Calculation of R square for linear regression
https://sigmaquality.pl/tensorflow-3/tensorflow-calculation-of-r-square-for-linear-regression/

Parking Birmingham occupancy

Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

In [35]:
import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Out[35]:
SystemCodeNumber Capacity Occupancy LastUpdated
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42
In [2]:
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
Out[2]:
SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object
In [3]:
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
df['weekday_name'] = df.LastUpdated.dt.weekday_name
df['weekday'] = df.LastUpdated.dt.weekday
In [4]:
df.head(4)
Out[4]:
SystemCodeNumber Capacity Occupancy LastUpdated month hour weekday_name weekday
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42 10 7 Tuesday 1
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42 10 8 Tuesday 1
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42 10 8 Tuesday 1
3 BHMBCCMKT01 577 107 2016-10-04 09:32:46 10 9 Tuesday 1
In [5]:
df = df.loc[df['SystemCodeNumber']=='BHMMBMMBX01'] 
df.shape
Out[5]:
(1312, 8)
In [6]:
import tensorflow as tf
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\tensorflow\python\framework\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

Step 1: Convert Data

We convert numeric variables in the correct Tensorflow format. Tensorflow provides a continuous variable conversion method: tf.feature_column.numeric_column ().

In [7]:
FEATURES = ['month', 'hour', 'weekday'] 
LABEL = 'Occupancy'
In [8]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES] 
PKS
Out[8]:
[_NumericColumn(key='month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='hour', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

TensorFlow will automatically create a directory called "ABC" in your working directory; you use this path to access TensorBoard. The feature columns passed to the estimator describe the independent variables.

In [9]:
estimator = tf.estimator.LinearRegressor( feature_columns=PKS, model_dir="ABC")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'ABC', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000147BB11B940>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To instruct TensorFlow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters: x: the feature data; y: the label data; batch_size: the batch size (default 128); num_epochs: the number of epochs (default 1); shuffle: whether to shuffle the data (default None).

In [10]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True): 
    return tf.estimator.inputs.pandas_input_fn( x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
                                               y = pd.Series(data_set[LABEL].values), batch_size=n_batch, num_epochs=num_epochs, shuffle=shuffle)

Step 3: Model training

  • To feed the model you can use the function created above: get_input_fn.
  • Then instruct the model to iterate 1000 times.
  • Note that you do not specify the number of epochs (num_epochs); it is better to leave it as None and define the number of steps instead.

To test the model, we must divide the data set into a test set and a training set.

In [11]:
df_train=df.sample(frac=0.8,random_state=200) 
df_test=df.drop(df_train.index) 
print(df_train.shape, df_test.shape)
(1050, 8) (262, 8)
In [12]:
estimator.train(input_fn=get_input_fn(df_train, num_epochs=None, n_batch = 128, shuffle=False), steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from ABC\model.ckpt-20000
INFO:tensorflow:Saving checkpoints for 20001 into ABC\model.ckpt.
INFO:tensorflow:loss = 1604473.0, step = 20001
INFO:tensorflow:global_step/sec: 524.813
INFO:tensorflow:loss = 1890832.8, step = 20101 (0.191 sec)
INFO:tensorflow:global_step/sec: 595.828
INFO:tensorflow:loss = 1691072.0, step = 20201 (0.183 sec)
INFO:tensorflow:global_step/sec: 581.214
INFO:tensorflow:loss = 1660972.2, step = 20301 (0.172 sec)
INFO:tensorflow:global_step/sec: 577.628
INFO:tensorflow:loss = 1830299.8, step = 20401 (0.158 sec)
INFO:tensorflow:global_step/sec: 591.553
INFO:tensorflow:loss = 1564311.5, step = 20501 (0.169 sec)
INFO:tensorflow:global_step/sec: 659.048
INFO:tensorflow:loss = 1851407.0, step = 20601 (0.167 sec)
INFO:tensorflow:global_step/sec: 565.153
INFO:tensorflow:loss = 1717692.1, step = 20701 (0.161 sec)
INFO:tensorflow:global_step/sec: 597.055
INFO:tensorflow:loss = 1668234.1, step = 20801 (0.167 sec)
INFO:tensorflow:global_step/sec: 597.223
INFO:tensorflow:loss = 1785292.5, step = 20901 (0.167 sec)
INFO:tensorflow:Saving checkpoints for 21000 into ABC\model.ckpt.
INFO:tensorflow:Loss for final step: 1761262.6.
Out[12]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x147bb11bd30>

Step 4. Model evaluation

To evaluate on the test set, use the following code:

In [13]:
ev = estimator.evaluate( input_fn=get_input_fn(df_test, num_epochs=1, n_batch = 128, shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-12-03-10:35:11
INFO:tensorflow:Restoring parameters from ABC\model.ckpt-21000
INFO:tensorflow:Finished evaluation at 2019-12-03-10:35:11
INFO:tensorflow:Saving dict for global step 21000: average_loss = 12334.496, global_step = 21000, loss = 1077212.6

Step 5. Calculation of R Square

Calculation of R Square parameter using Tensorflow
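
The quantity computed below is the coefficient of determination:

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed values; the residual and total terms in the code below correspond to the numerator and the denominator.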

I make a prediction on a test set

In [14]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [15]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from ABC\model.ckpt-21000
In [16]:
predictions
Out[16]:
[array([319.3249], dtype=float32),
 array([437.01642], dtype=float32),
 array([476.24692], dtype=float32),
 array([495.86215], dtype=float32),

The model returned an iterator of prediction arrays. I now convert this sequence into a single array.
In [17]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[17]:
array([[319.3249 ],
       [437.01642],
       [476.24692],
       [495.86215],
       [326.4933 ],
       [424.56955],
       [444.1848 ],
      
In [18]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[18]:
dtype('float32')

Now I’m creating a list of real y values from the test set.

In [19]:
y = df_test['Occupancy'].values
y = y.astype('float32')
y.dtype
Out[19]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [20]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[20]:
y         float32
y_pred    float32
dtype: object
In [21]:
def R_squared(y, y_pred):
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.div(residual, total))
    return r2
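
The cells below inline the same three steps, but the helper can also be called directly; a minimal sketch, assuming y and kot are the float32 arrays built above:

with tf.Session() as sess:
    print('R Square parameter: ', sess.run(R_squared(y, kot)))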

To use this function, both variables must have the same data type.

In [22]:
y.dtype
Out[22]:
dtype('float32')
In [23]:
kot.dtype
Out[23]:
dtype('float32')
In [24]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [25]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [26]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [27]:
r2
Out[27]:
<tf.Tensor 'Sub_2:0' shape=() dtype=float32>
In [28]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.13424665

Calculation of R Square parameter using Pandas

In [29]:
PZU.head(5)
Out[29]:
y y_pred
0 264.0 319.324890
1 651.0 437.016418
2 572.0 476.246918
3 471.0 495.862152
4 282.0 326.493286
In [30]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[30]:
y y_pred SSE
0 264.0 319.324890 3060.843506
1 651.0 437.016418 45788.972656
2 572.0 476.246918 9168.652344

Point 2. We calculate the average empirical value of y

In [31]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[31]:
y y_pred SSE ave_y
0 264.0 319.324890 3060.843506 463.973297
1 651.0 437.016418 45788.972656 463.973297
2 572.0 476.246918 9168.652344 463.973297

Point 3. We calculate the squared difference between the empirical values of y and their mean

In [32]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[32]:
y y_pred SSE ave_y SST
0 264.0 319.324890 3060.843506 463.973297 39989.320312
1 651.0 437.016418 45788.972656 463.973297 34978.988281
2 572.0 476.246918 9168.652344 463.973297 11669.768555

Point 4. We calculate SSR as the difference between the sum of SST and the sum of SSE

In [33]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3732746.8
Sum_SSE : 3231638.2

Point 5. We calculate the R Square parameter

In [34]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.13424659

Tutorial: Linear Regression – Tensorflow, calculation of R Square (#4/281120191525)
https://sigmaquality.pl/tensorflow-3/linear-regression-4/

We continue to learn how to build multiple linear regression models. This time we will build a model using the Tensorflow library. As before, the data file: AirQ_filled2.csv comes from previous episodes of this cycle.

In [1]:
import tensorflow as tf
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours'])
df.head(3)

Out[1]:
CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19
2 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 3 2 20

Step 1: Convert Data

We convert numeric variables in the correct Tensorflow format. Tensorflow provides a continuous variable conversion method: tf.feature_column.numeric_column ().

Separation of a column into an independent variable and a dependent variable.

In [2]:
df.columns
Out[2]:
Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'Month', 'Weekday', 'Hours'],
      dtype='object')
In [3]:
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
In [4]:
df.dtypes
Out[4]:
CO_GT           float64
PT08.S1_CO      float64
C6H6_GT         float64
PT08.S2_NMHC    float64
NOx_GT          float64
PT08.S3_NOx     float64
NO2_GT          float64
PT08.S4_NO2     float64
PT08.S5_O3      float64
T               float64
RH              float64
AH              float64
Month             int64
Weekday           int64
Hours             int64
dtype: object
In [5]:
FEATURES = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'
In [6]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES]
PKS
Out[6]:
[_NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

TensorFlow will automatically create a directory called "Air" in your working directory; you use this path to access TensorBoard. The feature columns passed to the estimator describe the independent variables.

In [7]:
estimator = tf.estimator.LinearRegressor(    
        feature_columns=PKS,   
        model_dir="Air")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'Air', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000017E850F7CC0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To instruct TensorFlow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters: x: the feature data; y: the label data; batch_size: the batch size (default 128); num_epochs: the number of epochs (default 1); shuffle: whether to shuffle the data (default None).

In [8]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):    
         return tf.estimator.inputs.pandas_input_fn(       
         x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),       
         y = pd.Series(data_set[LABEL].values),       
         batch_size=n_batch,          
         num_epochs=num_epochs,       
         shuffle=shuffle)

Step 3: Model training

- To feed the model you can use the function created above: get_input_fn.
- Then instruct the model to iterate 1000 times.
- Note that you do not specify the number of epochs (num_epochs); it is better to leave it as None and define the number of steps instead.

To test the model, we must divide the data set into a test set and a training set.

In [9]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
(7486, 15) (1871, 15)
In [10]:
estimator.train(input_fn=get_input_fn(df_train,                                       
                                           num_epochs=None,                                      
                                           n_batch = 128,                                      
                                           shuffle=False),                                      
                                           steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from Air\model.ckpt-10000
INFO:tensorflow:Saving checkpoints for 10001 into Air\model.ckpt.
INFO:tensorflow:loss = 27.90989, step = 10001
INFO:tensorflow:global_step/sec: 231.067
INFO:tensorflow:loss = 19.266008, step = 10101 (0.443 sec)
INFO:tensorflow:global_step/sec: 250.047
INFO:tensorflow:loss = 21.174185, step = 10201 (0.389 sec)
INFO:tensorflow:global_step/sec: 244.378
INFO:tensorflow:loss = 26.823406, step = 10301 (0.409 sec)
INFO:tensorflow:global_step/sec: 263.037
INFO:tensorflow:loss = 16.690845, step = 10401 (0.380 sec)
INFO:tensorflow:global_step/sec: 250.698
INFO:tensorflow:loss = 24.08421, step = 10501 (0.399 sec)
INFO:tensorflow:global_step/sec: 254.447
INFO:tensorflow:loss = 16.630123, step = 10601 (0.406 sec)
INFO:tensorflow:global_step/sec: 248.812
INFO:tensorflow:loss = 25.998842, step = 10701 (0.389 sec)
INFO:tensorflow:global_step/sec: 269.371
INFO:tensorflow:loss = 31.432064, step = 10801 (0.387 sec)
INFO:tensorflow:global_step/sec: 255.634
INFO:tensorflow:loss = 22.70269, step = 10901 (0.391 sec)
INFO:tensorflow:Saving checkpoints for 11000 into Air\model.ckpt.
INFO:tensorflow:Loss for final step: 24.21025.
Out[10]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x17e850f7828>

Step 4. Model evaluation

To enter a test set, use the following code:

In [11]:
ev = estimator.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
INFO:tensorflow:Finished evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Saving dict for global step 11000: average_loss = 0.18934268, global_step = 11000, loss = 59.04336

Print the loss using the code below:

In [12]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))	
Loss: 59.043362

Calculation of R Square parameter using Tensorflow

I make a prediction on a test set

In [13]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [14]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
In [15]:
predictions
Out[15]:
[array([2.2904341], dtype=float32),
 array([1.4195127], dtype=float32),
 array([0.9917113], dtype=float32),
 array([1.4134599], dtype=float32),
 array([1.2086823], dtype=float32),
 array([1.4521222], dtype=float32),
 ...]

The model returned an iterator of prediction arrays. I now convert this sequence into a single array.

In [16]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[16]:
array([[2.2904341],
       [1.4195127],
       [0.9917113],
       ...,
       [1.2040666],
       [0.4435346],
       [3.111309 ]], dtype=float32)
In [48]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[48]:
dtype('float32')

Now I’m creating a list of real y values from the test set.

In [50]:
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[50]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [47]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[47]:
y         float64
y_pred    float64
dtype: object
In [63]:
def R_squared(y, y_pred):
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.div(residual, total))
    return r2

To use this function, both variables must have the same data type.

In [51]:
y.dtype
Out[51]:
dtype('float32')
In [52]:
kot.dtype
Out[52]:
dtype('float32')
In [65]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [66]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [67]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [68]:
r2
Out[68]:
<tf.Tensor 'Sub_27:0' shape=() dtype=float32>
In [77]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.90320766

Calculation of R Square parameter using Pandas

In [78]:
PZU.head(5)
Out[78]:
y y_pred
0 2.2 2.290434
1 1.2 1.419513
2 1.0 0.991711
3 1.5 1.413460
4 1.6 1.471673
In [80]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[80]:
y y_pred SSE
0 2.2 2.290434 0.008178
1 1.2 1.419513 0.048186
2 1.0 0.991711 0.000069

Point 2. We calculate the average empirical value of y

In [81]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[81]:
y y_pred SSE ave_y
0 2.2 2.290434 0.008178 2.061304
1 1.2 1.419513 0.048186 2.061304
2 1.0 0.991711 0.000069 2.061304

Point 3. We calculate the squared difference between the empirical values of y and their mean

In [83]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[83]:
y y_pred SSE ave_y SST
0 2.2 2.290434 0.008178 2.061304 0.019237
1 1.2 1.419513 0.048186 2.061304 0.741845
2 1.0 0.991711 0.000069 2.061304 1.126366

Point 4. We calculate SSR as the difference between the sum of SST and the sum of SSE

In [84]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3659.9984179583107
Sum_SSE : 354.26016629427124

Point 5. We calculate the R Square parameter

In [85]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.903207562998923

Tutorial: Linear Regression – preliminary data preparation (#1/271120191024)
https://sigmaquality.pl/tensorflow-3/tutorial_-linear-regression-in-tensorflow-part_1/

Part 1. Preliminary data preparation

AirQualityUCI. Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality
In [1]:
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
Out[1]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
 

Data Set Information:


The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.

Supplementing data for further analysis

 

Attribute Information:

Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity



 

Step 1. Data completeness check

In [2]:
df.isnull().sum()
Out[2]:
Date              114
Time              114
CO(GT)            114
PT08.S1(CO)       114
NMHC(GT)          114
C6H6(GT)          114
PT08.S2(NMHC)     114
NOx(GT)           114
PT08.S3(NOx)      114
NO2(GT)           114
PT08.S4(NO2)      114
PT08.S5(O3)       114
T                 114
RH                114
AH                114
Unnamed: 15      9471
Unnamed: 16      9471
dtype: int64
 

There are a lot of missing values. In addition, we learned that the value -200 means no data. We’ll deal with this in a moment. We will now check the statistics of variables in the database.

In [3]:
df.agg(['min', 'max', 'mean', 'median'])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\numpy\lib\nanfunctions.py:1112: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]:
  PT08.S1(CO) NMHC(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) Unnamed: 15 Unnamed: 16
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 NaN NaN
max 2040.000000 1189.000000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 NaN NaN
mean 1048.990061 -159.090093 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 NaN NaN
median 1053.000000 -200.000000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 NaN NaN
In [4]:
df.shape
Out[4]:
(9471, 17)
 

We delete two empty columns.

In [5]:
del df['Unnamed: 15']
del df['Unnamed: 16']
 

Step 2: Preliminary analysis of data gaps

One more look at how many NaN cells there are.

In [6]:
df.isnull().sum()
Out[6]:
Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
dtype: int64
 

Now let's take a look at these empty rows.

In [7]:
df[df['NMHC(GT)'].isnull()]
Out[7]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9358 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9359 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9360 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9361 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9466 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

114 rows × 15 columns

 

These are completely empty rows of the time series. The device was probably cut off from the power supply, and no sensor was working.

In [8]:
df = df.dropna(how='all')
df.isnull().sum()
Out[8]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

We are looking for cells with the value -200, because this means there is no data. The -200 values are stored in different formats, so the replacement has to be done in several ways.

In [9]:
import numpy as np

df = df.replace(-200,np.NaN)
df = df.replace('-200',np.NaN)
df = df.replace('-200.0',np.NaN)
df = df.replace('-200,0',np.NaN)
 

The value of -200 has been changed to NaN and we will see how many empty records there are now.

In [10]:
df.isnull().sum()
Out[10]:
Date                0
Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
 

Chart of missing data structure.

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()
 

The NMHC(GT) variable is the most incomplete, so we eliminate it from the analysis.

In [12]:
del df['NMHC(GT)']
 

We display the records with missing data using the isna() function.

In [13]:
df1 = df[df.isna().any(axis=1)]
df1
Out[13]:
  Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9 11/03/2004 03.00.00 0,6 1010.0 1,7 561.0 NaN 1705.0 NaN 1235.0 501.0 10,3 60,2 0,7517
10 11/03/2004 04.00.00 NaN 1011.0 1,3 527.0 21.0 1818.0 34.0 1197.0 445.0 10,1 60,5 0,7465
33 12/03/2004 03.00.00 0,8 889.0 1,9 574.0 NaN 1680.0 NaN 1187.0 512.0 7,0 62,3 0,6261
34 12/03/2004 04.00.00 NaN 831.0 1,1 506.0 21.0 1893.0 32.0 1134.0 384.0 6,1 65,9 0,6248
39 12/03/2004 09.00.00 NaN 1545.0 22,1 1353.0 NaN 767.0 NaN 2058.0 1588.0 9,2 56,2 0,6561
9058 23/03/2005 04.00.00 NaN 993.0 2,3 604.0 85.0 848.0 65.0 1160.0 762.0 14,5 66,4 1,0919
9130 26/03/2005 04.00.00 NaN 1122.0 6,0 811.0 181.0 641.0 92.0 1336.0 1122.0 16,2 71,2 1,3013
9202 29/03/2005 04.00.00 NaN 883.0 1,3 530.0 63.0 997.0 46.0 1102.0 617.0 13,7 68,2 1,0611
9274 01/04/2005 04.00.00 NaN 818.0 0,8 473.0 47.0 1257.0 41.0 898.0 323.0 13,7 48,8 0,7606
9346 04/04/2005 04.00.00 NaN 864.0 0,8 478.0 52.0 1116.0 43.0 958.0 489.0 11,8 56,0 0,7743

2416 rows × 14 columns

 

Step 3: Check the level of direct correlation to complete the data

CO(GT) is missing a value every few measurements, so we have to check what this variable correlates with. First I check the data types so that the correlation can be computed.

In [14]:
df.dtypes
Out[14]:
Date              object
Time              object
CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
dtype: object
In [15]:
# df['CO(GT)'].astype(float)
 

ValueError: could not convert string to float: '2,6'

It turns out that it is not so easy to convert text to number format – the problem is in commas. We replace commas with dots.

In [16]:
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
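
As a side note, the string replacement could be avoided entirely: pandas can parse decimal commas at load time via the decimal parameter. A sketch, assuming the same source file:

df_alt = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';', decimal=',')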
 

We change the format from object to float

In [17]:
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
In [18]:
df.dtypes
Out[18]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
 

We can now check the level of direct correlation.

In [19]:
df.corr()
Out[19]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [20]:
sns.set(style="ticks")

corr = df.corr()

mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1,annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8b43390>
 

Step 4. Filling the gaps in variables based on other variables correlated with them

Filling gaps in the CO (GT) variable.

I check which variable it is strongly correlated with and fill the gaps based on that variable; where that fails, I fill in the previous or next value.

In [21]:
df.corr()
Out[21]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [22]:
df.dtypes
Out[22]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
In [23]:
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
missing value in CO(GT):  1683
 

CO (GT) correlation with other variables.

In [24]:
CORREL = df.corr()
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
Out[24]:
  CO(GT)
PT08.S3(NOx) -0.703446
T 0.022109
AH 0.048556
RH 0.048890
PT08.S4(NO2) 0.630703
NO2(GT) 0.683343
NOx(GT) 0.795028
PT08.S5(O3) 0.854182
PT08.S1(CO) 0.879288
PT08.S2(NMHC) 0.915514
C6H6(GT) 0.931078
CO(GT) 1.000000
 

The strongest correlation with CO(GT) is with C6H6(GT), which is fairly complete. Based on this variable, I fill in the gaps in CO(GT).

In [25]:
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [26]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  383
In [27]:
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [28]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  370
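
What this groupby-based fill does can be illustrated on a hypothetical toy frame: within each group of the correlated key, gaps are forward-filled and then back-filled.

import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'val': [1.0, np.nan, np.nan, 2.0]})
# within group 'a' the NaN takes the previous value; within 'b' the next one
toy['val'] = toy.groupby('key')['val'].apply(lambda x: x.ffill().bfill())
print(toy['val'].tolist())  # [1.0, 1.0, 2.0, 2.0]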
 

Now I do a simple refill with the last valid value.

In [29]:
df['CO(GT)'].fillna(method='ffill', inplace=True)   
In [30]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'C6H6 (GT)’

In [31]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  366
In [32]:
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].apply(lambda x: x.ffill().bfill())
In [33]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NOx(GT)’

In [34]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  1639
In [35]:
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
Out[35]:
  NOx(GT)
PT08.S3(NOx) -0.655707
T -0.269683
AH -0.149323
RH 0.221032
PT08.S4(NO2) 0.233731
PT08.S2(NMHC) 0.704435
PT08.S1(CO) 0.713654
C6H6(GT) 0.718839
NO2(GT) 0.763111
PT08.S5(O3) 0.787046
CO(GT) 0.795028
NOx(GT) 1.000000
In [36]:
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].apply(lambda x: x.ffill().bfill())
In [37]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NO2(GT)'

In [38]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  1642
In [39]:
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
Out[39]:
  NO2(GT)
PT08.S3(NOx) -0.652083
AH -0.335022
T -0.186533
RH -0.091759
PT08.S4(NO2) 0.157360
C6H6(GT) 0.614474
PT08.S1(CO) 0.641529
PT08.S2(NMHC) 0.646245
CO(GT) 0.683343
PT08.S5(O3) 0.708128
NOx(GT) 0.763111
NO2(GT) 1.000000
In [40]:
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [41]:
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [42]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  0
In [43]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8fea080>
 

I fill in the records where the entire measuring device was not working.

In the chart these appear as solid lines.

In [44]:
df.shape
Out[44]:
(9357, 14)
In [45]:
df.fillna(method='ffill', inplace=True)
In [46]:
df.shape
Out[46]:
(9357, 14)
In [47]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='Reds')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad95756d8>
In [48]:
df.isnull().sum()
Out[48]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

The data set is now complete! In the next parts of this tutorial, we will build a linear regression model in TensorFlow.

Let’s save the completed file to disk

df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
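One small caveat worth noting: to_csv writes the DataFrame index as an extra column by default, which is why an 'Unnamed: 0' column appears after reloading the file in the next part. Passing index=False avoids it:

# Writing without the index avoids the extra 'Unnamed: 0' column on reload
df.to_csv('c:/TF/AirQ_filled.csv', index=False)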

Tutorial: Supplementing data for further analysis

Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)
https://sigmaquality.pl/uncategorized/linear-regression-2/

Part 2. Simple multifactorial linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, completed measurement data file was created, which we will now open.

We continue preparing the data for further analysis. One of the most important explanatory variables in linear regression is time: most man-made and natural phenomena follow hourly, daily and monthly cycles.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Step 1. Launching the time variable

We check what the date format is

In [2]:
df[['Date','Time']].dtypes
Out[2]:
Date    object
Time    object
dtype: object

The date is not yet in a datetime format in the DataFrame. We concatenate the columns containing the date and the time.

In [3]:
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
Out[3]:
0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We create a new column containing the date and time. Then we convert the object format to the date format.

In [4]:
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y %H.%M.%S')   # format matches e.g. '10/03/2004 18.00.00'
df.dtypes
Out[4]:
Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry, the day of the week is very important, so in such models it is worth adding a column with the number of the day.

In [5]:
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name
df['Hours'] = df['DATE'].dt.hour
In [6]:
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
Out[6]:
DATE Month Weekday Weekday_name Hours
6109 2004-11-20 07:00:00 11 5 Saturday 7
3537 2004-08-05 03:00:00 8 3 Thursday 3
8053 2005-02-09 07:00:00 2 2 Wednesday 7
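A version note: Series.dt.weekday_name used above was removed in pandas 0.25. On a newer pandas, the equivalent is dt.day_name():

# pandas >= 0.23 replacement for the removed .dt.weekday_name
df['Weekday_name'] = df['DATE'].dt.day_name()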

Graphical analysis of pollution according to time variables

In [7]:
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>
In [8]:
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>
In [9]:
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Step 3. Correlation analysis

we set the result variable as:

CO(GT) – actual hourly average CO concentration in mg / m^3 (reference analyzer)

In [10]:
del df['Unnamed: 0']
In [11]:
CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
Out[11]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [12]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[12]:
Text(0, 0.5, 'Continuous independent variables')
<Figure size 720x576 with 0 Axes>

Variables based on time are not well correlated with the dependent variable CO(GT).
The temptation arises to use the better-correlated independent variables in the model. The problem is that these variables may themselves be part of the result: when pollution occurs, all of these substances are in the air at once.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We now check the shift (lag)

for independent variables with low direct correlation.
How does the weather affect CO levels?

Variable RH – Relative humidity (%)

We check a variable with a very low direct correlation with the resulting CO(GT) variable.

In [13]:
def cross_corr(x, y, lag=0):
    # Correlation of x with y shifted by 'lag' periods
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    # R is the number of shifts to be checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # We find the shift with the strongest (absolute) correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()

    return SF
In [14]:
x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts that will be checked
In [15]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
Optimal shift for RH:  12
In [16]:
cross_corr(x, y, lag=SKO)
Out[16]:
0.39204313671898056
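Rather than looking only at the single best lag, it can be instructive to plot the correlation for every shift; a short sketch reusing cross_corr and the x, y, R defined above:

import matplotlib.pyplot as plt

# Correlation of RH with CO(GT) for every lag from 0 to R-1
profile = [cross_corr(x, y, lag=i) for i in range(R)]
plt.plot(range(R), profile, marker='o')
plt.xlabel('lag (hours)')
plt.ylabel('correlation with CO(GT)')
plt.title('Cross-correlation profile for RH')
plt.show()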

Variable AH – Absolute humidity

We check a variable with very low correlation with the resulting CO (GT) variable

In [17]:
x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
Optimal shift for AH:  12
In [18]:
cross_corr(x, y, lag=SKP)
Out[18]:
0.043756364102677595

Absolute humidity (AH) does not correlate with CO(GT) even after shifting, so we eliminate it from the model.

Variable T – Temperature in °C

We check a variable with very low correlation with the resulting CO (GT) variable.

In [19]:
x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
Optimal shift for T:  12
In [20]:
cross_corr(x, y, lag=PKP)
Out[20]:
-0.22446569561762522

We now create a new DataFrame with a 12-hour shift

It turns out that temperature and humidity correlate most strongly with the CO level at a 12-hour shift.
Below is the data-shift creation function.

In [21]:
def df_shif(df, target=None, lag=0):
    # Shift every column by 'lag' periods except 'target',
    # which keeps its original alignment.
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)

Our goal is to create a multiple regression model:

- The independent variables are: temperature (T) and relative humidity (RH, in %)
- The dependent variable is the level of CO(GT)
In [22]:
df2 = df[['DATE', 'CO(GT)','RH', 'T']]

We add a column with the date and time at which the temperature and humidity were recorded.

In [23]:
df2['weather_time'] = df2['DATE']
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [24]:
df2.head(3)
Out[24]:
DATE CO(GT) RH T weather_time
0 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 47.7 13.3 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 54.0 11.9 2004-03-10 20:00:00
In [25]:
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)
Out[25]:
DATE CO(GT) RH T Shift_weather_time
0 NaT NaN NaN NaN 2004-03-10 18:00:00
1 NaT NaN NaN NaN 2004-03-10 19:00:00
2 NaT NaN NaN NaN 2004-03-10 20:00:00
3 NaT NaN NaN NaN 2004-03-10 21:00:00
4 NaT NaN NaN NaN 2004-03-10 22:00:00
5 NaT NaN NaN NaN 2004-03-10 23:00:00
6 NaT NaN NaN NaN 2004-03-11 00:00:00
7 NaT NaN NaN NaN 2004-03-11 01:00:00
8 NaT NaN NaN NaN 2004-03-11 02:00:00
9 NaT NaN NaN NaN 2004-03-11 03:00:00
10 NaT NaN NaN NaN 2004-03-11 04:00:00
11 NaT NaN NaN NaN 2004-03-11 05:00:00
12 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-11 06:00:00
In [26]:
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True) 
In [27]:
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True) 

We drop rows with incomplete data.

In [28]:
df5 = df5.dropna(how ='any')
In [29]:
df5.head()
Out[29]:
DATE CO(GT) Shift_RH Shift_T Shift_weather_time
36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00
39 2004-03-10 21:00:00 2.2 60.6 10.5 2004-03-11 09:00:00
40 2004-03-10 22:00:00 1.6 58.4 10.8 2004-03-11 10:00:00

The table can be read as follows: a specific temperature at 6:00 is paired with a specific carbon monoxide concentration at 18:00.

Graphical analysis of relationships – humidity and temperature versus carbon monoxide

The relationship looks rather weak.

In [30]:
import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()
In [31]:
df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

Step 5. Building a multiple linear regression model in Sklearn

We declare the X and y variables for the model.

In [32]:
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

I split the data into a training set and a test set.

In [33]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [34]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
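To read off the fitted equation, we can inspect the intercept and the coefficients; a quick sketch:

# Fitted equation: CO(GT) ~ intercept + b1*Shift_RH + b2*Shift_T
print('Intercept:   ', regressor.intercept_)
print('Coefficients:', regressor.coef_)   # order follows X: [Shift_RH, Shift_T]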
In [35]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model's predictions with the actual values.

In [36]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[36]:
CO(GT) Actual CO(GT)_Predicted
0 0.5 1.63
1 1.9 1.91
2 3.4 2.40
3 1.2 1.45
4 2.4 2.40
In [37]:
from sklearn import metrics

dfKK.head(50).plot()
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>
In [38]:
from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756
In [39]:
print('R2 score:               ', metrics.r2_score(y_test, y_pred))
R2 score:                0.15437562015505324
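R² measures the share of the variance in CO(GT) that the model explains, so a score of about 0.15 means the two shifted weather variables account for only ~15% of the variance. Computed by hand, R² = 1 - SS_res/SS_tot:

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
print('R2 by hand:', 1 - ss_res / ss_tot)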

With an R² of only 0.15, carbon monoxide pollution cannot be predicted from humidity and temperature alone.
In the next part, we will continue the analysis and preparation of data for linear regression.

Linear regression with TensorFlow
https://sigmaquality.pl/tensorflow-3/tensorflow-linearregression-1/
Part one: Numpy method
In [1]:
import pandas as pd
import tensorflow as tf
import itertools

Source of data: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant

Combined Cycle Power Plant Data Set

Data Set Information:

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.
For comparability with our baseline studies, and to allow 5×2-fold statistical tests to be carried out, we provide the data shuffled five times. For each shuffling, 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
We provide the data both in .ods and in .xlsx formats.

Attribute Information:

Features consist of hourly average ambient variables

  • Temperature (T) in the range 1.81°C to 37.11°C,
  • Ambient Pressure (AP) in the range 992.89-1033.30 millibar,
  • Relative Humidity (RH) in the range 25.56% to 100.16%,
  • Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg,
  • Net hourly electrical energy output (EP) in the range 420.26-495.76 MW
    The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.
 

Step 1: prepare the data

In [ ]:
df = pd.read_csv('c:/1/Folds5x2_pp.csv')
df.sample(3)
In [ ]:
del df['Unnamed: 0']
df.columns
In [ ]:
df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
df.sample(3)
 

Step 2: Convert Data

We convert the numeric variables into the correct TensorFlow format. TensorFlow provides a method for converting continuous variables: tf.feature_column.numeric_column().

We separate the columns into independent variables (FEATURES) and the dependent variable (LABEL).

In [ ]:
FEATURES = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity']
LABEL = 'Energy_output'
In [ ]:
Ewa = [tf.feature_column.numeric_column(k) for k in FEATURES]
Ewa
 

Step 3: Defining the estimator

TensorFlow will automatically create a directory called "train2" in your working directory. You must use this path to access TensorBoard. The estimator is fed with the feature columns of the independent variables.

In [ ]:
estimator = tf.estimator.LinearRegressor(    
        feature_columns=Ewa,   
        model_dir="train2")
 

To instruct TensorFlow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters:

- x: feature data
- y: label data
- batch_size: the batch size, 128 by default
- num_epochs: the number of epochs, 1 by default
- shuffle: whether or not to shuffle the data, None by default

In [ ]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):    
         return tf.estimator.inputs.pandas_input_fn(       
         x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),       
         y = pd.Series(data_set[LABEL].values),       
         batch_size=n_batch,          
         num_epochs=num_epochs,       
         shuffle=shuffle)
 

Step 4: Model training

- To feed the model you can use the function created above: get_input_fn.
- Then you instruct the model to iterate 1000 times.
- Remember that you do not specify the number of epochs (num_epochs).
- It is better to set the number of epochs to none and define the number of iterations.

To test the model, we must divide the data set into a test set and a training set.

In [ ]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
In [ ]:
estimator.train(input_fn=get_input_fn(df_train,                                       
                                           num_epochs=None,                                      
                                           n_batch = 356,                                      
                                           shuffle=False),                                      
                                           steps=1000)
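Since num_epochs=None, the input function cycles through the data indefinitely and training stops only after the 1000 requested steps. With a batch of 356 and roughly 7,654 training rows (80% of 9,568), that is about 356 × 1000 / 7654 ≈ 46 passes over the training set.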
 

We launch TensorBoard from the command console (CMD):

tensorboard --logdir=.\train2

TensorBoard is then available at this URL: http://localhost:6006

It could also be located at the following location.

 

image.png

 

Step 5. Model assessment

To enter a test set, use the following code:

In [ ]:
ev = estimator.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
 

Print the loss using the code below:

In [ ]:
average_loss = ev["average_loss"]
print("average_loss: ",format(average_loss))
In [ ]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))	
 

The model has an average loss of about 26. You can check the summary statistics of the target variable to judge how big this error is.

In [ ]:
df_test['Energy_output'].describe()
In [ ]:
PKP=(average_loss/ df_test['Energy_output'].mean())*100
print('Average error in relation to the average value: ',PKP)
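One caveat about this ratio: average_loss reported by the estimator is a mean squared error, so it is in squared units (MW²), and dividing it by the mean output mixes units. Taking the square root first gives a relative error in the same units as the output; a sketch:

import numpy as np

rmse = np.sqrt(average_loss)   # RMSE, back in MW
print('RMSE [MW]:', rmse)
print('RMSE as % of mean output:', rmse / df_test['Energy_output'].mean() * 100)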
 

Step 6. Making a forecast

 

Making a forecast requires a trained model and a set of independent variables. We substitute the independent variables into the model and obtain the result. We will draw 4 random records and make a forecast for them.

We create a sample of 4 records with the output variable removed.
In [ ]:
import numpy as np

sample4 =df.sample(4)
result = sample4['Energy_output'].copy() ## <= to have a comparison later
sample4['Energy_output']=np.nan
sample4
In [ ]:
y = estimator.predict(    
         input_fn=get_input_fn(sample4,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [ ]:
predictions = list(p["predictions"] for p in itertools.islice(y, 4))
print("Predictions: {}".format(str(predictions)))
In [ ]:
predictions
 

I convert the array to a DataFrame

In [ ]:
conc = np.vstack(predictions)
conc
In [ ]:
newdf = pd.DataFrame(conc)
newdf
In [ ]:
result
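For a side-by-side check of the forecasts against the held-back actual values, one possible sketch:

# Predictions (flattened) next to the actual values kept aside earlier
comparison = pd.DataFrame({'Predicted': conc.ravel(),
                           'Actual': result.values})
comparison['Error'] = comparison['Actual'] - comparison['Predicted']
comparison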
