Example of the use of shift for linear regression in Python. How to find optimal correlation shift?

What is the correlation shift?

In supervised machine learning we have two directions: classification and regression. Regression needs continuous data, which is why from time to time we are forced to transform discrete data into continuous values.

More important, I have to say, is finding the linear correlation between the independent variables and the dependent variable that represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But to observe it, a simple correlation can be insufficient.

The problem is the shift. Rain raises the water level in a river, but the rise appears only after a couple of hours. Sun lowers the water level only after a couple of days. Frankly speaking, most correlations in any environment come with longer or shorter delays.
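
To make the delay visible, here is a minimal sketch on made-up data. The series `rain` and `level` and the 6-row `delay` are assumptions for illustration only, not part of the dataset used below. Row by row the two series look almost uncorrelated, but once the delay is taken into account they are perfectly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
rain = pd.Series(rng.normal(size=200))    # hypothetical rainfall measurements
delay = 6                                 # assumed delay, counted in rows
level = rain.shift(delay) * 2.0 + 1.0     # water level reacts `delay` rows later

# Correlation of values that sit in the same row
corr_same_row = level.corr(rain)

# Correlation of each level with the rain that caused it
corr_aligned = level.corr(rain.shift(delay))

print('same row:', round(corr_same_row, 3))
print('aligned: ', round(corr_aligned, 3))
```

Series.corr skips the rows where one of the two values is NaN, so the empty rows created by shift do not have to be removed by hand.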

How to find correlation shift?

In [1]:
from scipy import signal, fftpack
import pandas as pd
import numpy

Let’s build the DataFrame.

In [2]:
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})

df.head()
Out[2]:
   AAA  BBB  CCC  DDD  RESULT
0  295  123  124  437      35
1  370  221  154  453      50
2  310  113  130  764      38
3  385  105  160  346      53
4  325  150  136  239      41

The phenomena described in the DataFrame are perfectly correlated, but we don’t know that yet. First we use the ordinary method of looking for correlation.

In [3]:
corr = df.corr()
corr
Out[3]:
             AAA       BBB       CCC       DDD    RESULT
AAA     1.000000  0.072278  0.715892  0.206945 -0.261955
BBB     0.072278  1.000000  0.244349  0.748050  0.383326
CCC     0.715892  0.244349  1.000000  0.389072 -0.169164
DDD     0.206945  0.748050  0.389072  1.000000  0.248511
RESULT -0.261955  0.383326 -0.169164  0.248511  1.000000
In [4]:
corr['RESULT']
Out[4]:
AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I wrote a special function that detects the shift giving the maximal linear correlation between the dependent and independent variables.

In [5]:
def cross_corr(x, y, lag=0):
    # Correlation between x and y moved down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    # R is the number of shifts (0 to R-1) that the function checks
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    # One row per checked shift: the shift number and its correlation
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # We find the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()   # if several shifts tie, the largest one is returned

    return SF
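
Besides the single best shift, it can be useful to look at the correlation for every checked shift. Below is a minimal sketch, assuming the same convention as cross_corr (the second series is the one that gets shifted). The helper `lag_profile` and the toy series are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def lag_profile(x, y, max_lag):
    """Correlation of x with y shifted by 0 .. max_lag-1 rows, indexed by shift."""
    values = [x.corr(y.shift(lag)) for lag in range(max_lag)]
    index = pd.RangeIndex(max_lag, name='shift')
    return pd.Series(values, index=index, name='corr')

# Toy data (assumed for this example): x repeats y four rows later, scaled by 3
rng = np.random.default_rng(1)
y = pd.Series(rng.normal(size=120))
x = y.shift(4) * 3.0

profile = lag_profile(x, y, max_lag=10)
best_shift = int(profile.abs().idxmax())   # first shift with the largest |corr|

print(profile.round(3))
print('best shift:', best_shift)
```

Note that idxmax returns the first shift with the largest absolute correlation, while shift_Factor returns the largest such shift. The two differ only when several shifts tie.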

We declare the variables for the function.

In [6]:
x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts that will be checked

Shift for variable AAA

We are looking for the optimal correlation shift for the variable AAA.

In [13]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
Optimal shift for AAA:  11

The function finds that a shift of 11 rows gives the largest correlation (in absolute value) between the independent variable AAA and the RESULT variable. What is the level of that correlation?

In [8]:
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996

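We create a new DataFrame with the optimal shift. The function below leaves the target column in place and shifts every other column by lag rows.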

In [9]:
def df_shif(df, target=None, lag=0):
    # Keep the target column as it is and shift every other column by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)
In [10]:
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)
Out[10]:
    SHIFTED AAA    BBB    CCC    DDD  RESULT
0           295    NaN    NaN    NaN     NaN
1           370    NaN    NaN    NaN     NaN
2           310    NaN    NaN    NaN     NaN
3           385    NaN    NaN    NaN     NaN
4           325    NaN    NaN    NaN     NaN
5           400    NaN    NaN    NaN     NaN
6           340    NaN    NaN    NaN     NaN
7           415    NaN    NaN    NaN     NaN
8           355    NaN    NaN    NaN     NaN
9           430    NaN    NaN    NaN     NaN
10          370    NaN    NaN    NaN     NaN
11          175  123.0  124.0  437.0    35.0
12          250  221.0  154.0  453.0    50.0

Now we repeat these steps for the remaining independent variables.

Shift for variable BBB

In [11]:
BBB = df.BBB       # independent variable
In [14]:
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
Optimal shift for BBB:  3
In [16]:
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

In [18]:
CCC = df.CCC
In [26]:
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
Optimal shift for CCC:  9
In [22]:
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

In [23]:
DDD = df.DDD
In [27]:
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
Optimal shift for DDD:  5
In [31]:
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)
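
The chained calls above shift the other columns again at every step, so the rows lost at the top add up to the sum of all the shifts. As an alternative sketch (not the method used in this notebook), each independent variable can be moved up by its own shift in a single pass, which loses only as many rows as the largest shift. The helper `align_by_lags` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def align_by_lags(df, lags):
    """Move each column named in `lags` up by its shift; other columns stay put."""
    out = df.copy()
    for col, lag in lags.items():
        out[col] = df[col].shift(-lag)
    # Rows at the bottom now contain NaN and are dropped
    return out.dropna(how='any')

# Toy data (assumed): 'a' and 'b' repeat the target 2 and 4 rows later
rng = np.random.default_rng(2)
target = pd.Series(rng.normal(size=50))
toy = pd.DataFrame({
    'a': target.shift(2) * 5.0,
    'b': target.shift(4) * 3.0,
    'target': target,
})

aligned = align_by_lags(toy, {'a': 2, 'b': 4})
print(len(toy), '->', len(aligned), 'rows')
print(aligned.corr()['target'].round(6))
```

The dictionary passed to the helper uses the same meaning of shift as shift_Factor: the number of rows by which the dependent variable has to be moved down to line up with the independent variable.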

Correlation after the shifts

I drop the rows of the DataFrame that contain NaN values and calculate the correlation.

In [33]:
df5 = df5.dropna(how='any')
df5.head(3)
Out[33]:
    SHIFTED AAA  SHIFTED BBB  SHIFTED CCC  SHIFTED DDD  RESULT
28        175.0        105.0         70.0        420.0    35.0
29        250.0        150.0        100.0        600.0    50.0
30        190.0        114.0         76.0        456.0    38.0
In [34]:
corr = df5.corr()
corr
Out[34]:
             SHIFTED AAA  SHIFTED BBB  SHIFTED CCC  SHIFTED DDD  RESULT
SHIFTED AAA          1.0          1.0          1.0          1.0     1.0
SHIFTED BBB          1.0          1.0          1.0          1.0     1.0
SHIFTED CCC          1.0          1.0          1.0          1.0     1.0
SHIFTED DDD          1.0          1.0          1.0          1.0     1.0
RESULT               1.0          1.0          1.0          1.0     1.0
In [35]:
corr['RESULT']
Out[35]:
SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This was hidden because of the shifts between them.
I hope I have convinced you that researchers should make checking for shifts a rule when building models.
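
As a closing illustration, here is a minimal sketch of what the shift means for the regression itself. The data, the 7-row `lag` and the helper `fit_line` are assumptions made up for this example. The line is fitted with numpy.polyfit, once on the rows as they are and once after the independent variable has been aligned with the result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = pd.Series(rng.uniform(100, 500, size=150))   # hypothetical independent variable
lag = 7                                          # assumed delay, counted in rows
y = 0.2 * x.shift(lag) + 4.0                     # result reacts `lag` rows later

def fit_line(x, y):
    """Least-squares line y = slope * x + intercept on rows where both exist."""
    pair = pd.concat([x, y], axis=1, keys=['x', 'y']).dropna()
    slope, intercept = np.polyfit(pair['x'], pair['y'], deg=1)
    r2 = pair['x'].corr(pair['y']) ** 2          # R^2 of a one-variable linear fit
    return slope, intercept, r2

raw_fit = fit_line(x, y)                  # rows compared as they are
aligned_fit = fit_line(x.shift(lag), y)   # x moved down by `lag` rows first

print('raw:    ', np.round(raw_fit, 4))
print('aligned:', np.round(aligned_fit, 4))
```

On the aligned rows the fit recovers the slope and intercept used to build the toy data, while the fit on the raw rows explains almost nothing.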