
Statistics and Correlation
Statistics, contrary to what one might think, is a relatively young science. Just over 110 years ago, the world’s first statistics department was established at University College London by Karl Pearson, the inventor of the linear correlation test and many other groundbreaking statistical tools.
Could it be that the concept of correlation was only “invented” a little over a century ago? Is it possible that throughout 6,000 years of recorded human history, cause-and-effect phenomena were not understood? People surely recognized that some events caused others; they knew, for instance, that rain would raise river levels and that drought would lead to famine.
Correlation of Continuous Phenomena
Historically, statistics relied on empirical experience that was not mathematically formalized. Karl Pearson pioneered the development of quantitative methods, expressing relationships between phenomena mathematically. His measure of linear association, known as the Pearson correlation coefficient, takes values between -1 and 1. Correlation is essentially the observation of a shared linear trend. If, for instance, the air temperature increases and, as a result, more cold drinks are sold, we observe a positive correlation, with values ranging from just above 0 to 1. Conversely, if a rise in coffee prices leads to a drop in consumption, we have a negative correlation, ranging from -1 to just below zero. If the correlation coefficient is zero or near zero, there is no linear relationship between the variables: the increase in coffee price tells us nothing about the level of consumption. Note that correlation measures association rather than causation, though it is often the first clue to a cause-and-effect relationship. Thus, correlation is a straightforward and interpretable measure.
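As a minimal sketch of how this coefficient behaves, assuming invented temperature and sales figures (the numbers below are purely illustrative), NumPy can compute Pearson’s r directly:

```python
import numpy as np

# Hypothetical daily observations: air temperature (°C) and cold drinks sold
temperature = np.array([18, 21, 24, 27, 30, 33])
cold_drinks = np.array([40, 52, 61, 75, 83, 95])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient for the two variables
r = np.corrcoef(temperature, cold_drinks)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # close to 1: a strong positive correlation
```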
Measuring Correlation for Continuous Values
Imagine we run a bakery and want to determine if there’s a correlation between the age of customers and the amount of bread they purchase. Here, both customer age and quantity of bread bought are continuous variables. We might roughly estimate each customer’s age and record the quantity of bread they buy, creating pairs of values for each customer. These pairs can then be plotted on a graph as points. The relationship between these continuous values can look like one of the types of correlation in Figure 1.
It could turn out that there is no correlation between customer age and the quantity of bread purchased, and the data might appear as a scattered cloud of points. If the points instead align along a rising trend, there’s a positive correlation between age and bread quantity, meaning that the older the customer, the more bread they buy. However, in practice, we know this isn’t usually the case; typically, younger people and seniors buy less bread than middle-aged customers. The result might appear unclear or difficult to interpret. To determine whether this relationship is statistically significant, we need to use the Pearson linear correlation test.
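A sketch of that test with SciPy’s `pearsonr`, again with invented age and purchase figures standing in for real observations:

```python
import numpy as np
from scipy import stats

# Hypothetical pairs: customer age and loaves of bread bought
age = np.array([22, 35, 41, 48, 55, 63, 70, 76])
bread = np.array([1, 2, 3, 3, 4, 3, 2, 1])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(age, bread)
print(f"r = {r:.2f}, p-value = {p_value:.3f}")
# A p-value below 0.05 would suggest the linear relationship is significant
```

Note that a middle-aged peak like the one described above is not a linear trend, so Pearson’s r can come out near zero even when a real (nonlinear) relationship exists.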
The Concept of Categorical Values
Not all phenomena can be measured with the simple correlation coefficient because not all values are continuous. Can we measure whether someone is “more” of a woman or a man? Can a car be “more” of a passenger vehicle or a truck? Can May be “more” or “less” of a May? No one can be a little pregnant. Some phenomena are definitive, that is, categorical. This is where the concept of discrete, or categorical, values comes into play. Each categorical object takes exactly one of a set of distinct states: one car is red, another green, and yet another yellow.
Measuring Relationships for Categorical Values
If we categorize customers by gender, whether they wear glasses, or whether they wear hats, could we examine if gender affects cake choice? Perhaps age is also an essential factor. We could classify customers into three age groups: 16-40, 40-60, and over 60. We could further categorize customers by gender and compare these groups to the type of cake they purchase, as shown in Table 1.
Six customer categories | Four cake types
---|---
Woman (over 60) | Meringue cake
Woman (16-40) | Square wedding cake
Woman (40-60) | Creamy Delight
Man (40-60) | Square wedding cake
Woman (over 60) | Gingerbread Marvel
Man (16-40) | Meringue cake
Woman (40-60) | Gingerbread Marvel
Man (over 60) | Meringue cake
Woman (16-40) | Creamy Delight
Cakes are increasingly sold online, and online customers are not anonymous, so we can classify them to develop an algorithm that drives automated cake sales. To build such an algorithm, we need to determine if the age and gender of customers influence the selection of one of the four cake types. We have data for 60 purchases, represented as 60 gender-age pairs matched with the cake type purchased. But what next? Can we plot the pair: Woman (16-40) – Creamy Delight on a graph? Of course not; these pairs cannot be represented on a correlation graph because we’re dealing with categorical values. So, how do we find relationships between categorical variables?
Chi-Squared Test
The chi-squared test, also called the test of association or independence, was proposed by Karl Pearson. It involves counting the occurrences of pairs in a contingency table, a structure similar to the pivot tables familiar from spreadsheet software. For our example of 60 purchases, we create a contingency table.
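As a sketch of that counting step, pandas’ `crosstab` can tally the pairs. Here we use only the nine sample purchases listed in Table 1, so the counts are much smaller than in the full 60-purchase table:

```python
import pandas as pd

# The nine sample purchases from Table 1: (customer category, cake type)
purchases = [
    ("Woman (over 60)", "Meringue cake"),
    ("Woman (16-40)", "Square wedding cake"),
    ("Woman (40-60)", "Creamy Delight"),
    ("Man (40-60)", "Square wedding cake"),
    ("Woman (over 60)", "Gingerbread Marvel"),
    ("Man (16-40)", "Meringue cake"),
    ("Woman (40-60)", "Gingerbread Marvel"),
    ("Man (over 60)", "Meringue cake"),
    ("Woman (16-40)", "Creamy Delight"),
]
df = pd.DataFrame(purchases, columns=["customer_category", "cake_type"])

# Each cell counts how often a customer category bought a given cake type
contingency = pd.crosstab(df["customer_category"], df["cake_type"])
print(contingency)
```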
Formulating the Null Hypothesis
We aim to determine if customer category, defined by gender and age group, influences the selection of specific cake types.
Each statistical test begins with a “null hypothesis.” These tests are usually performed when there is a suspicion, for example, that customers wearing glasses buy more coffee or that the busiest time for the bakery is between 1 p.m. and 3 p.m. (where the time frame is a categorical value). When forming hypotheses, we intuitively lean towards indicating dependencies, like “pregnant women buy more donuts.” However, when formulating the null hypothesis, we must be cautious. The null hypothesis always assumes that the categories studied have no relationship. Therefore, the null hypothesis here would be that a woman’s pregnancy status has no effect on the number of donuts bought.
If we suspect that customers with glasses are more likely to buy coffee, the null hypothesis would be that there is no relationship between customers wearing glasses and coffee purchases. Returning to our cake sales example, we set the following hypotheses:
- Null Hypothesis: The gender and age group of customers do not influence their choice of a specific cake.
- Alternative Hypothesis: The gender and age group of customers do influence their choice of a specific cake.
The alternative hypothesis always negates the null hypothesis, and the rules for constructing null and alternative hypotheses are consistent across all statistical tests.
Performing the Chi-Squared Test
One way to conduct a chi-squared test is by using Python libraries. First, we import NumPy and create a matrix with the values from Table 2, omitting the row and column totals.
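Because Table 2 itself is not reproduced here, the counts below are illustrative placeholders for a 6 × 4 contingency table (six customer categories by four cake types) whose cells sum to the 60 recorded purchases:

```python
import numpy as np

# Observed counts: rows are the six customer categories,
# columns are the four cake types (placeholder values summing to 60)
observed = np.array([
    [3, 2, 4, 1],  # Woman (16-40)
    [2, 3, 3, 2],  # Woman (40-60)
    [4, 1, 2, 3],  # Woman (over 60)
    [2, 3, 2, 3],  # Man (16-40)
    [3, 2, 3, 2],  # Man (40-60)
    [3, 3, 2, 2],  # Man (over 60)
])
```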
Next, we import SciPy’s stats module and perform the chi-squared test.
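Continuing from the matrix above, SciPy’s `chi2_contingency` performs the test:

```python
from scipy import stats

# chi2_contingency returns the test statistic, the p-value, the degrees
# of freedom, and the expected counts under the null hypothesis
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}, dof = {dof}")
```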
Among the values the test returns are the p-value for the chi-squared statistic and the degrees of freedom. Typically, if the p-value is less than 0.05, we have grounds to reject the null hypothesis. The degrees of freedom are calculated as (number of rows - 1) × (number of columns - 1); for our 6 × 4 contingency table, that gives 5 × 3 = 15.
Since the p-value is greater than 0.05, we have no grounds to reject the null hypothesis. The data therefore provide no evidence that the gender and age group of customers influence their choice of cake.
Author Bio
Wojciech Moszczyński – A graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń, specializing in econometrics, data science, and managerial accounting. He focuses on optimizing production and logistics processes and conducts research in AI development and application. He has been engaged in promoting econometrics and data science in business for many years.