What is a correlation shift?
In supervised machine learning there are two main tasks: classification and regression. Regression needs continuous data, which is why we are sometimes forced to transform discrete data into continuous values.
More importantly, we need to find the linear correlation between the independent variables and the dependent variable that represents the result.
How do we find a correlation?
In the natural environment, everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.
But to observe it, a simple correlation can be insufficient.
The problem is the shift. Rain makes the water in a river rise, but the rise appears only after a couple of hours. Sun makes the water level fall only after a couple of days. Frankly speaking, most correlations in any environment come with a longer or shorter delay.
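The effect of such a delay can be seen on a small synthetic example. The sketch below is illustrative only: the series names `rain` and `level`, the 3-step delay and the factor 2.0 are assumptions made up for the demonstration, not data from this article.

```python
import numpy as np
import pandas as pd

# Synthetic "rain" signal: random noise, so neighbouring values are unrelated
rng = np.random.default_rng(0)
rain = pd.Series(rng.normal(size=200))

# Hypothetical "water level" that reacts to rain exactly 3 steps later
delay = 3
level = 2.0 * rain.shift(delay)

# Correlation without any shift: the relationship is hidden
corr_plain = rain.corr(level)

# Correlation after shifting rain by the true delay: the relationship appears
corr_aligned = rain.shift(delay).corr(level)

print(f"no shift: {corr_plain:.3f}, shift of {delay}: {corr_aligned:.3f}")
```

Without the shift the coefficient stays close to zero, while the aligned series are almost perfectly correlated.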
How do we find the correlation shift?
from scipy import signal, fftpack
import pandas as pd
import numpy
Let’s build a sample DataFrame.
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
235, 310, 250, 325, 265, 340, 280, 355]
BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
177, 222, 186, 231, 195, 240, 204, 249]
CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148, 70, 100, 76, 106,
82, 112, 88, 118, 94, 124, 100, 130, 106, 136, 112, 142, 118,
148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
160, 190, 166, 196, 172, 70, 100, 76, 106, 82, 112, 88, 118,
94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
172, 202, 178, 70, 100, 76, 106, 82, 112, 88, 118, 94, 124,
100, 130, 106, 136, 112, 142, 118, 148]
DDD = [ 437, 453, 764, 346, 239, 420, 600, 456, 636, 492, 672,
528, 708, 564, 744, 600, 780, 636, 816, 672, 852, 708,
888, 744, 924, 780, 960, 816, 996, 852, 1032, 888, 1068,
924, 1104, 960, 1140, 996, 1176, 1032, 420, 600, 456, 636,
492, 672, 528, 708, 564, 744, 600, 780, 636, 816, 672,
852, 708, 888, 744, 924, 780, 960, 816, 996, 852, 1032,
888, 1068, 924, 1104, 960, 1140, 996, 1176, 1032, 1212, 1068,
420, 600, 456, 636, 492, 672, 528, 708, 564, 744, 600,
780, 636, 816, 672, 852, 708, 888, 744, 924, 780, 960]
RESULT = [ 35, 50, 38, 53, 41, 56, 44, 59, 47, 62, 50, 65, 53,
68, 56, 71, 59, 74, 62, 77, 65, 80, 68, 83, 71, 86,
74, 89, 77, 92, 80, 95, 83, 98, 86, 35, 50, 38, 53,
41, 56, 44, 59, 47, 62, 50, 65, 53, 68, 56, 71, 59,
74, 62, 77, 65, 80, 68, 83, 71, 86, 74, 89, 77, 92,
80, 95, 83, 98, 86, 101, 89, 35, 50, 38, 53, 41, 56,
44, 59, 47, 62, 50, 65, 53, 68, 56, 71, 59, 74, 62,
77, 65, 80, 68, 83, 71, 86, 74]
df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})
df.head()
The phenomena described in the DataFrame are perfectly correlated, but we do not know that yet. First we use the ordinary method of computing correlation.
corr = df.corr()
corr
corr['RESULT']
Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?
A function to find the optimal correlation shift
I wrote a function that detects the shift giving the largest absolute linear correlation between the dependent variable and an independent variable.
def cross_corr(x, y, lag=0):
    # Pearson correlation between x and y shifted down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts (0 .. R-1) checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0: 'Corr', 'index': 'Shift_num'}, inplace=True)
    # Find the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs'] == Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    # If several shifts tie, the largest one is returned
    SF = p1.Shift_num.max()
    return SF
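For comparison, the same search can be written more compactly. This is only a sketch, not the function above: `best_shift` is a hypothetical name, it also returns the whole lag profile, and on ties it picks the smallest lag whereas `shift_Factor` picks the largest. The data is made up for the check.

```python
import numpy as np
import pandas as pd

def best_shift(x, y, max_lag):
    """Return (lag, profile): the lag in 0..max_lag-1 with the largest
    absolute correlation between x and y.shift(lag), and all correlations."""
    profile = pd.Series(
        {lag: x.corr(y.shift(lag)) for lag in range(max_lag)},
        name="Corr",
    )
    return int(profile.abs().idxmax()), profile

# Toy check on made-up data: x follows y with a delay of 4 rows
rng = np.random.default_rng(1)
y_demo = pd.Series(rng.normal(size=60))
x_demo = 3.0 * y_demo.shift(4)

lag_demo, profile_demo = best_shift(x_demo, y_demo, max_lag=10)
print(lag_demo)
print(profile_demo.round(3))
```

Returning the profile makes it easy to check whether the best lag stands out clearly or is only marginally better than its neighbours.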
We declare the variables for the function.
x = df.AAA  # independent variable
y = df.RESULT  # dependent variable
R = 20  # number of shifts to check
Shift for variable AAA
We look for the optimal correlation shift for variable AAA.
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
The function reports that a shift of 11 rows gives the largest absolute correlation between the independent variable AAA and the RESULT variable. What is the level of that correlation?
cross_corr(x, y, lag=SKO)
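As a cross-check, a lag can also be estimated with `scipy.signal.correlate`, which is imported at the top but not used in this article. This is a different technique (raw cross-correlation of mean-centred signals rather than a Pearson coefficient per shift), so the sketch below is only an assumption about how one might apply it, shown on made-up data with a known delay of 11 samples.

```python
import numpy as np
from scipy import signal

# Made-up data: x is y delayed by 11 samples and scaled by 5
rng = np.random.default_rng(2)
y_sig = rng.normal(size=300)
true_delay = 11
x_sig = 5.0 * np.concatenate([np.zeros(true_delay), y_sig[:-true_delay]])

# Cross-correlate the mean-centred signals over all possible lags
xc = signal.correlate(x_sig - x_sig.mean(), y_sig - y_sig.mean(), mode="full")
lags = signal.correlation_lags(len(x_sig), len(y_sig), mode="full")

# With this argument order, a positive lag means x lags behind y
est_delay = int(lags[np.argmax(np.abs(xc))])
print(est_delay)
```

Unlike the shift loop, this scans negative lags as well, so it can also reveal a case where the supposed cause actually follows the result.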
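We create a new DataFrame with the optimal shift applied. The helper below keeps the target column in place and shifts every other column, RESULT included, down by lag rows, which reproduces the alignment tested by cross_corr.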
def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)
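To see what this helper does, it can be tried on a tiny frame. The sketch below repeats the function so that it runs on its own; the frame `toy` and its columns `X` and `Y` are made up, with `X` equal to 10 times `Y` delayed by 2 rows.

```python
import pandas as pd

def df_shif(df, target=None, lag=0):
    # Same logic as above, repeated so the snippet runs on its own
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)

# Tiny made-up frame: X equals 10 * Y delayed by 2 rows
toy = pd.DataFrame({"X": [0, 0, 10, 20, 30, 40], "Y": [1, 2, 3, 4, 5, 6]})
shifted = df_shif(toy, "X", lag=2)

# X is untouched, Y moves down by 2 rows and gets NaN at the top
print(shifted)
print(shifted["X"].corr(shifted["Y"]))
```

Because the target stays put and everything else moves together, columns aligned in an earlier step keep their alignment with the result column when the helper is applied again for the next variable.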
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True)
df2.head(13)
Now we repeat these steps for the remaining independent variables.
Shift for variable BBB
BBB = df.BBB # independent variable
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)
Shift for variable CCC
CCC = df.CCC
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)
Shift for variable DDD
DDD = df.DDD
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)
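The four repeated steps could also be written as a loop. The sketch below is an alternative under stated assumptions, not the procedure used above: instead of shifting everything except the target, it shifts each independent column by `-lag`, which gives the same relative alignment with the result column. The helper name `align_features` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def align_features(df, target, max_lag):
    """Shift every non-target column so that it lines up with `target`."""
    aligned = {target: df[target]}
    lags = {}
    for col in df.columns.drop(target):
        # Correlation of the column with the target shifted down by each lag
        corrs = pd.Series(
            {lag: df[col].corr(df[target].shift(lag)) for lag in range(max_lag)}
        )
        lags[col] = int(corrs.abs().idxmax())
        # Moving the column up by `lag` is equivalent to moving the target down
        aligned[col] = df[col].shift(-lags[col])
    # Rows that lost a value through shifting are dropped
    return pd.DataFrame(aligned).dropna(), lags

# Made-up data: P follows OUT after 2 rows, Q after 5 rows
rng = np.random.default_rng(3)
out = pd.Series(rng.normal(size=80))
toy = pd.DataFrame({"P": 4 * out.shift(2), "Q": -2 * out.shift(5), "OUT": out})

aligned_toy, found_lags = align_features(toy, "OUT", max_lag=10)
print(found_lags)
print(aligned_toy.corr()["OUT"])
```

A side effect of this variant is that the result column is never shifted, so the number of rows lost depends on the largest lag rather than on the sum of all lags.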
Correlation after applying the shifts
I drop the rows of the DataFrame that contain NaN values and calculate the correlation.
df5 = df5.dropna(how='any')
df5.head(3)
corr = df5.corr()
corr
corr['RESULT']
As we can see, the independent variables are perfectly correlated with the result variable. This was hidden because of the shifts between them.
I hope I have convinced you that researchers should make checking for shifts a rule when building models.