Data source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
The dataset contains 9568 measurements collected between 2006 and 2011, while the power plant was working at full capacity. It records the following process characteristics:
- Temperature (T) in the range 1.81°C to 37.11°C,
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar,
- Relative Humidity (RH) in the range 25.56% to 100.16%,
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg,
- Net hourly electrical energy output (EP) in the range 420.26 to 495.76 MW.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from and has an effect on the steam turbine, the other three ambient variables affect the GT performance.
Forecasting with a linear regression model
We load the required libraries and the dataset.
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math
## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')
df.sample(5)
We check the size of the dataset and the data types of its columns.
df.shape
df.dtypes
To make the code easier to read, we rename the columns.
df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
df.sample(5)
Assumptions of the linear regression model
Violating any of these assumptions leads to wrong or misleading model results.
- The explanatory (independent) variables are not correlated with each other.
- There is a linear relationship between the explanatory variables and the dependent variable.
- The model errors are normally distributed.
- The error variance is homogeneous (homoscedasticity).
- The errors are uncorrelated: the error of one observation is independent of the errors of the other observations.
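Some of the error-related assumptions above can be checked numerically once a model has been fitted. The following is only a minimal sketch using NumPy on synthetic stand-in data (the sample size, coefficients and noise level are made up for illustration, not taken from the power plant dataset). It computes the Durbin-Watson statistic for autocorrelation, skewness and excess kurtosis as rough indicators of normality, and a simple variance ratio as a crude homoscedasticity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): y is linear in x plus Gaussian noise.
n = 500
x = rng.uniform(0.0, 40.0, size=n)
y = 500.0 - 2.0 * x + rng.normal(0.0, 5.0, size=n)

# Ordinary least squares with the design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation.
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Skewness and excess kurtosis: both are close to 0 for normally distributed errors.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

# Crude homoscedasticity check: residual variance for high x versus low x.
order = np.argsort(x)
half = n // 2
variance_ratio = residuals[order[half:]].var() / residuals[order[:half]].var()

print(f"Durbin-Watson:   {durbin_watson:.2f}")
print(f"Skewness:        {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
print(f"Variance ratio:  {variance_ratio:.2f}")
```

A variance ratio far from 1, or a Durbin-Watson value far from 2, would be a reason to look at the residuals more closely before trusting the model summary.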
Analysis of the relationship between dependent and independent variables
CORREL = df.corr().sort_values('Energy_output')
CORREL['Energy_output']
The dependent variable, energy output (EP), is strongly correlated with two explanatory factors: Temperature (T) and Exhaust Vacuum (V).
One-factor regression model
lm = smf.ols(formula = 'Energy_output ~ Temperature', data = df).fit()
lm.summary()
The one-factor regression model has very good predictive properties.
plt.figure()
plt.scatter(df.Temperature, df.Energy_output, c = 'grey')
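# Look up the coefficients by name; positional indexing of a labelled Series is deprecated in pandas
plt.plot(df.Temperature, lm.params['Intercept'] + lm.params['Temperature'] * df.Temperature, c = 'r')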
plt.xlabel('Temperature')
plt.ylabel('Energy_output')
plt.title("Linear Regression Plot")
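To make "very good predictive properties" more concrete, the fit quality of a one-factor model can be expressed as R² and RMSE. The sketch below is an illustration only: it uses NumPy on synthetic data whose coefficients and noise level are invented, so the printed numbers say nothing about the real dataset. It shows how the two measures are computed from the fitted line.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for (Temperature, Energy_output) pairs.
temperature = rng.uniform(2.0, 37.0, size=200)
energy = 497.0 - 2.2 * temperature + rng.normal(0.0, 5.0, size=200)

# Least-squares line; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(temperature, energy, deg=1)
predicted = intercept + slope * temperature

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((energy - predicted) ** 2)
ss_tot = np.sum((energy - energy.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Root mean squared error, in the same units as the dependent variable.
rmse = np.sqrt(ss_res / len(energy))

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}")
print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.2f}")
```

For a model with a single explanatory variable, R² equals the squared Pearson correlation between the two variables, which links this result back to the correlation table.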
Multi-factor regression model
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
print(lm.summary())
The multi-factor regression model describes the data very well. A fit this good raises doubts, so we take a closer look at the variables used in the model.
Analysis of the variables used in the model
df.describe()
Analysis of the distribution of the dependent variable
plt.rcParams['figure.figsize'] = (5, 4)
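# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a similar plot
sns.histplot(df['Energy_output'], kde=True, stat='density')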
Analysis of correlation between independent variables
plt.rcParams['figure.figsize'] = (5, 4)
sns.heatmap(df.corr(), cmap="YlGnBu")
The heatmap shows a very high positive correlation between two independent variables: Temperature (T) and Exhaust Vacuum (V).
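A common way to quantify how strongly correlated explanatory variables affect a regression is the variance inflation factor (VIF): each explanatory variable is regressed on all the others, and VIF = 1 / (1 - R²). The function below is a minimal NumPy sketch of that definition, run on synthetic columns (`a`, `b`, `c` are hypothetical names, with `b` constructed to correlate with `a`). It is not applied to the power plant data here.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(2)
n = 1000
a = rng.normal(size=n)
b = 0.9 * a + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)  # correlation with a is about 0.9
c = rng.normal(size=n)                                     # independent of a and b

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 2))))
```

As a rule of thumb, VIF values above roughly 5 to 10 are often read as a sign of problematic multicollinearity. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` that could be used on the real data frame instead of this hand-written version.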
The same correlation matrix, this time annotated with the coefficient values.
sns.heatmap(df.corr(), cmap="coolwarm", annot=True, cbar=False)
We check the dataset for missing values.
df.isnull().sum()
Graphical presentation of relationship between independent and dependent variables
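We split the dependent variable into two categories, “High power” and “Low power”, to see which explanatory variables separate them.
# qcut assigns labels in ascending bin order, so the label for the lower half must come first
power_labels = ['Low power', 'High power']
df['Power'] = pd.qcut(df['Energy_output'], 2, labels=power_labels)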
df.sample(4)
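When splitting a variable with `pd.qcut`, it is worth remembering that the labels are assigned to the quantile bins in ascending order, so the first label always goes to the lowest values. The small example below illustrates this on six made-up output values (they are not rows from the dataset).

```python
import pandas as pd

# Six hypothetical energy output values, in MW.
values = pd.Series([420.0, 435.0, 450.0, 465.0, 480.0, 495.0])

# Two quantile bins: the first label is attached to the lower half.
labels = ['Low power', 'High power']
groups = pd.qcut(values, 2, labels=labels)

print(pd.DataFrame({'value': values, 'group': groups}))
```

The three smallest values land in 'Low power' and the three largest in 'High power'. Listing the labels in the opposite order would silently swap the two groups.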
We create a pairwise scatter-plot matrix of the explanatory variables.
sns.pairplot(data=df[['Temperature' ,'Exhaust_Vacuum','Ambient_Pressure', 'Relative_Humidity', 'Power']], hue='Power', dropna=True, height=2)
The relationship between Temperature (T) and Exhaust Vacuum (V) once again shows a high level of correlation.
sns.jointplot(x='Temperature', y='Exhaust_Vacuum', data=df)