Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)


Part 2. Simple multiple linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, gap-filled measurement data file was created, which we will now open.

We will now continue preparing the data for further analysis. One of the most important explanatory variables in linear regression is time: most man-made and natural phenomena follow hourly, daily and monthly cycles.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Step 1. Creating the time variable

We check what format the date and time columns have.

In [2]:
df[['Date','Time']].dtypes
Out[2]:
Date    object
Time    object
dtype: object

There is no datetime column in the DataFrame yet, so we concatenate the columns containing the date and the time.

In [3]:
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
Out[3]:
0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We have created a new column containing both the date and the time. Now we convert it from the object format to the datetime format.

In [4]:
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y %H.%M.%S')
df.dtypes
Out[4]:
Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry the day of the week is very important, so in such models it is worth adding a column with the weekday number.

In [5]:
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name   # deprecated in newer pandas; use df['DATE'].dt.day_name() instead
df['Hours'] = df['DATE'].dt.hour
In [6]:
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
Out[6]:
DATE Month Weekday Weekday_name Hours
6109 2004-11-20 07:00:00 11 5 Saturday 7
3537 2004-08-05 03:00:00 8 3 Thursday 3
8053 2005-02-09 07:00:00 2 2 Wednesday 7

Graphical analysis of pollution according to time variables

In [7]:
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>
In [8]:
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>
In [9]:
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Step 3. Correlation analysis

We set the dependent (result) variable as:

CO(GT) – actual hourly averaged CO concentration in mg/m^3 (reference analyzer)

In [10]:
del df['Unnamed: 0']
In [11]:
CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
Out[11]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [12]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))   # note: PKP.plot() below opens its own figure, which is why an empty Figure also appears in the output
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[12]:
Text(0, 0.5, 'Continuous independent variables')
<Figure size 720x576 with 0 Axes>

Variables based on time are not very well correlated with the dependent variable CO(GT).
There is a temptation to use the better-correlated independent variables in the model. The problem is that these variables are themselves part of the outcome: when there is pollution, all of these substances are in the air together.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We now check the shift for independent variables with a low direct correlation

How does the weather affect the CO level?

Variable RH – Relative humidity (%)

We check a variable with a very low direct correlation with the dependent CO(GT) variable.

In [13]:
def cross_corr(x, y, lag=0):
    # correlation of x with y shifted down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts that should be checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # we pick the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()

    return SF
In [14]:
x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts that will be checked
In [15]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
Optimal shift for RH:  12
In [16]:
cross_corr(x, y, lag=SKO)
Out[16]:
0.39204313671898056

Variable AH – Absolute humidity

We check another variable with a very low direct correlation with the dependent CO(GT) variable.

In [17]:
x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
Optimal shift for AH:  12
In [18]:
cross_corr(x, y, lag=SKP)
Out[18]:
0.043756364102677595

Even after the shift, absolute humidity AH barely correlates with CO(GT), so we eliminate it from the model.

Variable T – Temperature in °C

We check another variable with a very low direct correlation with the dependent CO(GT) variable.

In [19]:
x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
Optimal shift for T:  12
In [20]:
cross_corr(x, y, lag=PKP)
Out[20]:
-0.22446569561762522
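Before committing to the 12-hour figure, it can help to look at the whole cross-correlation profile rather than only its maximum. A small optional sketch (not part of the original notebook), reusing cross_corr, y and R defined above:

# Optional: correlation of temperature with CO(GT) as a function of the lag
profile = [cross_corr(df['T'], y, lag=i) for i in range(R)]
pd.Series(profile, index=range(R)).plot(marker='o', title='corr(T, CO(GT) shifted by lag)')
plt.xlabel('Lag in hours')
plt.ylabel('Correlation')
plt.show()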

We now create a new DataFrame with a 12-hour shift

It turns out that temperature and humidity only correlate with CO at a 12-hour offset from the moment the CO contamination changes.
Below is the function that creates the shifted data.

In [21]:
def df_shif(df, target=None, lag=0):
    # returns a copy of df in which every column except `target` is shifted down by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)

Our goal is to create a multiple regression model:

- The independent variables are temperature (T) and relative humidity (RH)
- The dependent variable is the level of CO(GT)
In [22]:
df2 = df[['DATE', 'CO(GT)','RH', 'T']]   # note: adding .copy() here would avoid the SettingWithCopyWarning shown below

We add a column with the date and time at which the temperature and humidity were recorded.

In [23]:
df2['weather_time'] = df2['DATE']
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [24]:
df2.head(3)
Out[24]:
DATE CO(GT) RH T weather_time
0 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 47.7 13.3 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 54.0 11.9 2004-03-10 20:00:00
In [25]:
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)
Out[25]:
DATE CO(GT) RH T Shift_weather_time
0 NaT NaN NaN NaN 2004-03-10 18:00:00
1 NaT NaN NaN NaN 2004-03-10 19:00:00
2 NaT NaN NaN NaN 2004-03-10 20:00:00
3 NaT NaN NaN NaN 2004-03-10 21:00:00
4 NaT NaN NaN NaN 2004-03-10 22:00:00
5 NaT NaN NaN NaN 2004-03-10 23:00:00
6 NaT NaN NaN NaN 2004-03-11 00:00:00
7 NaT NaN NaN NaN 2004-03-11 01:00:00
8 NaT NaN NaN NaN 2004-03-11 02:00:00
9 NaT NaN NaN NaN 2004-03-11 03:00:00
10 NaT NaN NaN NaN 2004-03-11 04:00:00
11 NaT NaN NaN NaN 2004-03-11 05:00:00
12 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-11 06:00:00
In [26]:
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True) 
In [27]:
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True) 

We delete rows with incomplete data.

In [28]:
df5 = df5.dropna(how ='any')
In [29]:
df5.head()
Out[29]:
DATE CO(GT) Shift_RH Shift_T Shift_weather_time
36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00
39 2004-03-10 21:00:00 2.2 60.6 10.5 2004-03-11 09:00:00
40 2004-03-10 22:00:00 1.6 58.4 10.8 2004-03-11 10:00:00

The table can be read as pairing the CO concentration measured at 18:00 with the temperature and humidity recorded twelve hours later, at 6:00 the following morning.
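A quick way to confirm this pairing (a sketch, not part of the original notebook) is to check that the weather timestamp is always exactly 12 hours after the CO timestamp:

# Every remaining row should pair a CO reading with weather recorded 12 hours later
print(((df5['Shift_weather_time'] - df5['DATE']) == pd.Timedelta(hours=12)).all())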

Graphical analysis of the relationships – humidity and temperature versus carbon monoxide

The relationships look rather weak.

In [30]:
import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()
In [31]:
df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

Step 5. Building a multiple linear regression model in Sklearn

We declare the X and y variables for the model.

In [32]:
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

We split the data set into training and test subsets.

In [33]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We build the regression model.

In [34]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [35]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model's predictions with the actual values.

In [36]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[36]:
CO(GT) Actual CO(GT)_Predicted
0 0.5 1.63
1 1.9 1.91
2 3.4 2.40
3 1.2 1.45
4 2.4 2.40
In [37]:
from sklearn import metrics

dfKK.head(50).plot()
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>
In [38]:
from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756
In [39]:
print('R2 score:               ', metrics.r2_score(y_test, y_pred))
R2 score:                0.15437562015505324
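Not shown in the original notebook, but the fitted parameters can also be inspected directly on the scikit-learn estimator:

# Coefficients for Shift_RH and Shift_T, plus the intercept of the fitted model
print('Coefficients:', regressor.coef_)
print('Intercept:   ', regressor.intercept_)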

With an R2 of about 0.15, carbon monoxide contamination cannot be usefully predicted from humidity and temperature alone.
In the next part, we will continue the analysis and the preparation of data for linear regression.

Example of the use of shift for linear regression in Python. How to find optimal correlation shift?


What is the correlation shift?

In supervised machine learning we have two directions: classification and regression. Regression needs continuous data, so from time to time we are forced to transform discrete data into continuous values.

More importantly, we have to find the linear correlation between the independent variables and the dependent variable that represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But to observe it, a simple correlation can be insufficient.

The problem is the shift. Rain contributes to a rise of the water level in a river, but this rise appears only after a couple of hours. Sun makes the water level fall after a couple of days. Frankly speaking, most correlations in any environment come with a longer or shorter delay.
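A minimal synthetic sketch of this effect (made-up data, not part of the tutorial): a series and a delayed response to it look almost uncorrelated until the delay is undone.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cause = pd.Series(rng.normal(size=200))
effect = cause.shift(5) + rng.normal(scale=0.1, size=200)   # responds 5 steps later, plus a little noise

print(cause.corr(effect))            # weak: the delay hides the relationship
print(cause.corr(effect.shift(-5)))  # strong: aligning the series reveals it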

How to find correlation shift?

In [1]:
from scipy import signal, fftpack
import pandas as pd
import numpy

Let’s build this dataframe.

In [2]:
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})

df.head()
Out[2]:
AAA BBB CCC DDD RESULT
0 295 123 124 437 35
1 370 221 154 453 50
2 310 113 130 764 38
3 385 105 160 346 53
4 325 150 136 239 41

The phenomena described in the DataFrame are perfectly correlated, but we do not know it yet. First we use the ordinary method of searching for correlation.

In [3]:
corr = df.corr()
corr
Out[3]:
AAA BBB CCC DDD RESULT
AAA 1.000000 0.072278 0.715892 0.206945 -0.261955
BBB 0.072278 1.000000 0.244349 0.748050 0.383326
CCC 0.715892 0.244349 1.000000 0.389072 -0.169164
DDD 0.206945 0.748050 0.389072 1.000000 0.248511
RESULT -0.261955 0.383326 -0.169164 0.248511 1.000000
In [4]:
corr['RESULT']
Out[4]:
AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is that the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I made a special function to detect the shift value that maximises the linear correlation between the dependent and independent variables.

In [5]:
def cross_corr(x, y, lag=0):
    # correlation of x with y shifted down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts that should be checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # we pick the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()

    return SF

We declare the variables passed to the function.

In [6]:
x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts that will be checked

Shift for variable AAA

We are looking for the optimal correlation shift of variable AAA.

In [13]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
Optimal shift for AAA:  11

We find that at a shift of 11 rows the correlation between the independent variable AAA and the RESULT variable is strongest (in absolute value). What is the level of this correlation?

In [8]:
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996

We create a new DataFrame with the optimal shift.

In [9]:
def df_shif(df, target=None, lag=0):
    # returns a copy of df in which every column except `target` is shifted down by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)
In [10]:
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)
Out[10]:
SHIFTED AAA BBB CCC DDD RESULT
0 295 NaN NaN NaN NaN
1 370 NaN NaN NaN NaN
2 310 NaN NaN NaN NaN
3 385 NaN NaN NaN NaN
4 325 NaN NaN NaN NaN
5 400 NaN NaN NaN NaN
6 340 NaN NaN NaN NaN
7 415 NaN NaN NaN NaN
8 355 NaN NaN NaN NaN
9 430 NaN NaN NaN NaN
10 370 NaN NaN NaN NaN
11 175 123.0 124.0 437.0 35.0
12 250 221.0 154.0 453.0 50.0

Now we repeat this procedure for the rest of the independent variables (a compact loop version is sketched just below).
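The same procedure could also be wrapped in a loop. This is only a sketch reusing shift_Factor from above, not what the notebook actually runs:

# Sketch: align every independent variable with RESULT by its own optimal lag
shifted = pd.DataFrame({'RESULT': df['RESULT']})
for col in ['AAA', 'BBB', 'CCC', 'DDD']:
    lag = shift_Factor(df[col], df['RESULT'], R)
    # shift(-lag) pulls the later predictor values back so that row t holds the value that best matches RESULT at t
    shifted['SHIFTED ' + col] = df[col].shift(-lag)
shifted = shifted.dropna(how='any')
print(shifted.corr()['RESULT'])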

Shift for variable BBB

In [11]:
BBB = df.BBB       # independent variable
In [14]:
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
Optimal shift for BBB:  3
In [16]:
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

In [18]:
CCC = df.CCC
In [26]:
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
Optimal shift for CCC:  9
In [22]:
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

In [23]:
DDD = df.DDD
In [27]:
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
Optimal shift for DDD:  5
In [31]:
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)

Correlation after making the shifts

I drop the rows of the DataFrame where NaN values appear and calculate the correlation.

In [33]:
df5 = df5.dropna(how='any')
df5.head(3)
Out[33]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
28 175.0 105.0 70.0 420.0 35.0
29 250.0 150.0 100.0 600.0 50.0
30 190.0 114.0 76.0 456.0 38.0
In [34]:
corr = df5.corr()
corr
Out[34]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
SHIFTED AAA 1.0 1.0 1.0 1.0 1.0
SHIFTED BBB 1.0 1.0 1.0 1.0 1.0
SHIFTED CCC 1.0 1.0 1.0 1.0 1.0
SHIFTED DDD 1.0 1.0 1.0 1.0 1.0
RESULT 1.0 1.0 1.0 1.0 1.0
In [35]:
corr['RESULT']
Out[35]:
SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This relationship was hidden because of the shifts between the series.
I hope I have convinced you that researchers should adopt the rule of checking shifts when building models.
