
Machine learning, in simple terms, involves creating mathematical simulations of existing, real-world processes. These simulations are commonly referred to as models, and they are built primarily from historical data.
What is Machine Learning?
The term "machine learning" is a simplified way of saying that we are teaching a machine something. Here, the "machine" is just an algorithm. For example, in a bakery, we may install a device to count the number of customers entering. This device will generate a time series showing, for instance, 213 customers on Monday and 319 on Tuesday. We can compare this data with other Mondays and Tuesdays in the past to statistically forecast how many customers will come on a future Monday or Tuesday. This gives us a model based on averages from previous months: for instance, Mondays have averaged 305 customers with a standard deviation of 21 customers. Adding factors such as season or weather conditions may improve the model's predictive accuracy.
Models based on averages from time series belong to the family of autoregressive models. These models can be created in a spreadsheet, and adding factors such as season or weather increases their predictive power. When conditional averages emerge, Bayesian models are often applied. Conducting conditional probability analyses in a spreadsheet quickly becomes tedious: as predictive factors grow more conditional, the model becomes more complex (and potentially more accurate), warranting a shift from spreadsheets to a dedicated econometric model library.
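The averages described above can be sketched in a few lines of code. The customer counts below are made up for illustration, not real bakery data; the conditional average (here, conditioned on the weather) is the spreadsheet-style precursor to the Bayesian models just mentioned.

```python
import statistics

# Illustrative (made-up) counts of customers on past Mondays,
# together with the weather observed that day
mondays = [
    ("sunny", 321), ("sunny", 330), ("sunny", 310),
    ("rainy", 287), ("rainy", 294), ("rainy", 298),
]

# Unconditional average: the simplest possible forecast for next Monday
counts = [c for _, c in mondays]
print(round(statistics.mean(counts)), "customers,",
      "std dev", round(statistics.stdev(counts)))

# Conditional averages: the forecast now depends on an extra factor
for weather in ("sunny", "rainy"):
    subset = [c for w, c in mondays if w == weather]
    print(weather, "->", round(statistics.mean(subset)), "customers")
```

Once several such conditioning factors pile up, maintaining the nested averages by hand is exactly the tedium that motivates moving to a dedicated library.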
Machine learning is essentially about building a model that mirrors reality, designed to reflect current and future events in a process. Given historical records of how a phenomenon's parameters have changed, we can teach a machine statistical inference, including predicting the future.
Let’s assume we want to build a weather model to predict rain in the coming hours. Imagine a technological process that is disrupted by rain, requiring the bakery owner to take protective measures, like covering bread baskets left outside early in the morning. The owner finds standard weather forecasts insufficient.
When building the model, we use historical data such as atmospheric pressure, cloud cover, season, temperature, and wind speed. The goal is for drivers to have an app on their smartphones that says, "Cover the bread baskets with plastic" or "Leave the baskets uncovered."
We observe that when atmospheric pressure dropped and there were significant cloud cover and gusty winds, the probability of rain increased. All this data is incorporated into a model designed to predict whether rain will occur. This type of problem is best addressed with a classification model, which outputs either 0 ("the event will not occur") or 1 ("the event will occur").
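Such a binary classifier can be sketched with a hand-rolled logistic regression. The feature values below are invented and scaled to 0..1 purely for illustration; in practice one would use real measurements and a library implementation rather than this toy gradient-descent loop.

```python
import math

# Made-up training data: [pressure_drop, cloud_cover, wind_gusts] -> rained (1) or not (0)
X = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.7, 0.8],  # rainy patterns
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.1, 0.3, 0.2],  # dry patterns
]
y = [1, 1, 1, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression trained with plain stochastic gradient descent
w = [0.0, 0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(2000):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict(x):
    """Return 1 ('cover the baskets') or 0 ('leave them uncovered')."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

print(predict([0.85, 0.90, 0.75]))  # stormy morning -> 1
print(predict([0.15, 0.10, 0.20]))  # calm morning   -> 0
```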
Supervised Machine Learning
The machine learning process is divided into supervised and unsupervised learning. Supervised learning involves training a model using a specific dataset (the training set) and then applying the trained algorithm to another dataset (the test set) to assess the model's performance. The purpose of using a test set is to evaluate how well the model performs on new, previously unseen data; this evaluation step is known as validation.
Unsupervised learning does not involve model testing. Here, the model is trained on the primary dataset without any subsequent testing. An example of unsupervised learning is clustering, which involves grouping elements in a dataset.
Regularization in Machine Learning Models
Every predictive model based on historical data should ideally be developed using supervised learning.
Dividing the Dataset into Training and Test Sets
In supervised learning, we must split our historical data into a training set and a test set. Suppose we have weather data from 2020 to 2022, which we will use for training, while 2023 data will serve as our test set. The sequence of dates must be preserved because we are working with time series data, which may exhibit patterns like gusts of wind shortly before rain.
Dividing historical data into training and test sets enables model performance evaluation. With a working model trained on data from 2020-2022, we can introduce the 2023 test data to assess model quality. If our trained classification model performs well on the training data, accurately predicting rain 80% of the time, we may expect similar performance on the test data. However, if the model's accuracy drops significantly, say from 80% to 60% on the 2023 data, it may indicate overfitting: the model performs well on known data but poorly on new data.
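A minimal sketch of this chronological split, using made-up records. The accuracy figures at the end are the illustrative 80% and 60% from the text, not computed results; the point is the date-based cut and the train-versus-test comparison.

```python
# Made-up records: (ISO date, rained) pairs, already sorted chronologically
records = [
    ("2020-03-01", 0), ("2021-07-14", 1), ("2022-11-30", 1),
    ("2023-02-10", 0), ("2023-06-05", 1),
]

# Chronological split: 2020-2022 for training, 2023 for testing.
# Never shuffle a time series before splitting, or future observations
# leak into the training set and the evaluation becomes meaningless.
train = [r for r in records if r[0] < "2023-01-01"]
test = [r for r in records if r[0] >= "2023-01-01"]
print(len(train), "training records,", len(test), "test records")

# Comparing the two accuracies is the overfitting check described above
train_acc, test_acc = 0.80, 0.60  # illustrative figures from the text
overfitting_suspected = (train_acc - test_acc) > 0.10
print("overfitting suspected:", overfitting_suspected)
```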
Overfitting
Overfitting occurs when a model performs well on the training data but fails to generalize to new data, indicating that the model has learned patterns specific to the training period that do not recur in the new data. Overfitting is often an issue in complex models, such as entropy-based decision trees or neural networks, while simpler linear or logistic regression models are less prone to it. It is also more common in models with a large number of dimensions, a problem known as the "curse of dimensionality." To detect overfitting, it is essential to statistically compare the model's performance on the training and test datasets.
What is Regularization?
Regularization aims to improve a model's predictive accuracy by reducing its sensitivity, or complexity, so that the model is less likely to pick up on subtle, minor patterns in the training data, a problem referred to as variance error. This process of reducing variance is known as regularization; in it, irrelevant features are either removed or synthesized into simpler forms.
Regularization in Neural Networks
Specific algorithms, such as Ridge Regression (L2) and Lasso Regression (L1), serve as regularization methods in machine learning. In Ridge Regression, the penalty is proportional to the sum of the squared coefficients, which shrinks all of them toward zero; in Lasso Regression, the penalty is proportional to the sum of their absolute values, which can shrink some coefficients exactly to zero and thereby remove irrelevant features. Another method, specific to neural networks, is DropOut, which randomly deactivates a fraction of neurons during training. By making the network simpler on each training pass, DropOut can reduce overfitting, although individual runs are not reproducible unless the random seed is fixed.
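The difference between the two penalties can be shown directly. The coefficient vector and the strength alpha below are arbitrary illustrative values, not fitted results.

```python
# Penalty terms added to the loss for a coefficient vector w:
#   Ridge (L2): alpha * sum of squared coefficients  -> shrinks all coefficients
#   Lasso (L1): alpha * sum of absolute coefficients -> can zero some out entirely
def l2_penalty(w, alpha):
    return alpha * sum(wj ** 2 for wj in w)

def l1_penalty(w, alpha):
    return alpha * sum(abs(wj) for wj in w)

w = [0.5, -2.0, 0.0]
print(round(l2_penalty(w, 0.1), 3))  # 0.1 * (0.25 + 4.0 + 0.0) = 0.425
print(round(l1_penalty(w, 0.1), 3))  # 0.1 * (0.5 + 2.0 + 0.0) = 0.25
```

Note that squaring punishes large coefficients far more heavily than small ones, while the absolute-value penalty treats every unit of coefficient equally, which is what lets Lasso drive weak coefficients all the way to zero.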
In neural networks, regularization is akin to neuron penalization. Instead of removing variables or manually creating new ones, a kind of penalty is imposed on the neural network to limit its complexity. For example, signals such as atmospheric pressure, wind speed, temperature, and season are inputs for neurons in a network. When a signal is reduced below a certain threshold due to penalization, it may not pass through the network, thereby simplifying the model's "thought process."
In essence, regularization is about making the model interpret phenomena in a simpler way, reducing the likelihood that it overfits to noise in the data. This approach, closely related to dimensionality reduction, helps address the "curse of dimensionality" and can be achieved by removing less significant variables or by modifying signal values so that minor patterns do not unduly influence the model.
Wojciech Moszczyński
Graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń, specializing in econometrics, finance, data science, and management accounting. He specializes in optimizing production and logistics processes, and has conducted research in AI development and applications. He is actively engaged in popularizing machine learning and data science in business environments.