What is the correlation shift?

Supervised machine learning has two main tasks: classification and regression. Regression needs continuous values, so from time to time we are forced to transform discrete data into continuous values.

More important, I have to say, is finding the linear correlation between the independent variables and the dependent variable, which represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But a simple correlation can be insufficient to observe it.

The problem is the shift. Rain contributes to a rise of the water level in a river, but this rise appears after a couple of hours. Sun makes the water level fall after a couple of days. Frankly speaking, most correlations in any environment have longer or shorter delays.
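The effect of such a delay can be shown on a tiny synthetic example. This is only a sketch: the series rain and level, the delay of three rows and the factor 2 are assumptions made for illustration, not data from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rain = pd.Series(rng.normal(size=200))   # synthetic stand-in for a rainfall signal
level = 2 * rain.shift(3)                # the level reacts three rows later

same_row = rain.corr(level)              # compares rain[t] with level[t]
delayed = rain.shift(3).corr(level)      # compares rain[t-3] with level[t]

print(round(same_row, 3), round(delayed, 3))
```

Compared row by row the two series look almost unrelated, while after delaying rain by three rows the correlation is perfect.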

How to find correlation shift?

import pandas as pd

Let’s build the DataFrame.

AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB, 'CCC': CCC, 'DDD': DDD, 'RESULT': RESULT})

df.head()

The phenomena described in the DataFrame are perfectly correlated, but we don’t know that yet. First we use the ordinary method of searching for correlation.

corr = df.corr()
corr

corr['RESULT']

AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I wrote a special function that detects the shift giving the maximal linear correlation (in absolute value) between the dependent and independent variables.

def cross_corr(x, y, lag=0):
    # Pearson correlation between x and y delayed by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts checked by the function (lags 0 .. R-1)
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    # One row per shift: its number and the correlation it gives
    Kot = pd.DataFrame(x_corr).reset_index()
    Kot.rename(columns={0: 'Corr', 'index': 'Shift_num'}, inplace=True)

    # We find the shift with the largest absolute correlation;
    # if several shifts tie, the largest one is returned
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs'] == Kot['abs'].max(), 'Shift_num'].max()

    return SF
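Before trusting a single number it can help to look at the whole profile of correlations over the candidate shifts. Below is a minimal sketch of that idea. The series x and y, the lag of 4 rows and the name profile are assumptions made for illustration, and idxmax returns the first shift in case of a tie, while the function above returns the largest one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y = pd.Series(rng.normal(size=100))   # synthetic result variable
x = 3 * y.shift(4)                    # x repeats y four rows later

# Correlation of x with y delayed by each candidate shift 0..9
profile = pd.Series({lag: x.corr(y.shift(lag)) for lag in range(10)})
best_lag = profile.abs().idxmax()     # shift with the largest |correlation|

print(profile.round(3))
print('best shift:', best_lag, 'correlation:', round(profile.loc[best_lag], 3))
```

A single sharp peak in the profile suggests a real delay. A flat profile suggests that the chosen shift may be an accident.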

We declare the variables for the function.

x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts to check (0 to R-1)

Shift for variable AAA

We look for the optimal correlation shift for variable AAA.

SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)

Optimal shift for AAA:  11

The function finds that a shift of 11 rows gives the largest correlation (in absolute value) between the independent variable AAA and the RESULT variable. What is the level of correlation at that shift?

cross_corr(x, y, lag=SKO)

0.9999999999999996

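We create a new DataFrame with the optimal shift applied.

def df_shif(df, target=None, lag=0):
    # Shift every column except `target` down by `lag` rows, so that `target`
    # lines up with the values the other columns had `lag` rows earlier
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]                  # the target column stays in place
        else:
            new[h] = df[h].shift(periods=lag)    # all other columns are delayed
    return pd.DataFrame(data=new)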

df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)

Now we repeat these steps for the remaining independent variables.
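The same repetition can also be written as a loop over the columns. This is only a sketch under stated assumptions: the helper best_shift and the synthetic columns P and Q are hypothetical names that do not come from this article, and ties are resolved to the first shift rather than the largest one.

```python
import numpy as np
import pandas as pd

def best_shift(x, y, max_lag):
    """Return the shift in 0..max_lag-1 that maximises |corr(x, y.shift(shift))|."""
    corrs = pd.Series({lag: x.corr(y.shift(lag)) for lag in range(max_lag)})
    return int(corrs.abs().idxmax())

rng = np.random.default_rng(2)
target = pd.Series(rng.normal(size=150))
demo = pd.DataFrame({
    'P': 5 * target.shift(7),      # follows the result by 7 rows
    'Q': -2 * target.shift(2),     # follows the result by 2 rows, with opposite sign
    'RESULT': target,
})

# One optimal shift per independent variable
shifts = {col: best_shift(demo[col], demo['RESULT'], 20)
          for col in demo.columns.drop('RESULT')}
print(shifts)
```

Below I still go through the variables one by one, so that every step stays visible.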

Shift for variable BBB

BBB = df.BBB       # independent variable

SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)

Optimal shift for BBB:  3

df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

CCC = df.CCC

SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)

Optimal shift for CCC:  9

df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

DDD = df.DDD

PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)

Optimal shift for DDD:  5

df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)
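The four consecutive calls accumulate the shifts, so the result column ends up delayed by the sum of all of them. The same relative alignment can be reached in one pass by moving each independent variable up by its own shift and leaving the result in place. This is a sketch only: align_features and the synthetic columns P and Q are hypothetical names, and the empty rows appear at the bottom of the frame instead of at the top.

```python
import numpy as np
import pandas as pd

def align_features(df, target, shifts):
    """Move each feature up by its own shift so it lines up with the unshifted target."""
    aligned = {col: df[col].shift(-lag) for col, lag in shifts.items()}
    aligned[target] = df[target]
    return pd.DataFrame(aligned).dropna()   # drop rows that lost a value in the shift

rng = np.random.default_rng(3)
target = pd.Series(rng.normal(size=60))
demo = pd.DataFrame({
    'P': 5 * target.shift(7),
    'Q': -2 * target.shift(2),
    'RESULT': target,
})

aligned = align_features(demo, 'RESULT', {'P': 7, 'Q': 2})
print(aligned.corr()['RESULT'].round(6))
```

With this variant the number of lost rows equals the largest single shift, not the sum of all shifts.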

Correlation after the shifts

I drop the rows of the DataFrame that contain NaN values and calculate the correlation.

df5 = df5.dropna(how='any')
df5.head(3)

    SHIFTED AAA  SHIFTED BBB  SHIFTED CCC  SHIFTED DDD  RESULT
28        175.0        105.0         70.0        420.0    35.0
29        250.0        150.0        100.0        600.0    50.0
30        190.0        114.0         76.0        456.0    38.0

corr = df5.corr()
corr

corr['RESULT']

SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This phenomenon was hidden by the existing shifts.
I hope I have convinced you that researchers should adopt the rule of checking for shifts when building a model.

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

Example of the use of shift for linear regression in Python. How to find optimal correlation shift?