Isolation Forest is a machine learning model that can be used to detect anomalies on a food production line, for example a meat packaging line. Such an installation is equipped with a number of sensors that continuously send measurements. Receiving this data on an ordinary computer is relatively simple, and the stream can equally well be directed straight from the sensors to an analytical cloud, such as Azure or AWS, where a ready-made model can be launched. This model will automatically send alerts about anomalies detected in the production process.
Such a setup is easy to build (at least for an average data analyst) and often free.
How the Isolation Forest Model Works
Let’s assume we have two measurement sensors on a meat packaging line:
- Weight sensors, which check whether each piece has the correct weight,
- Temperature sensors, which measure the temperature of the meat, indicating its freshness.
Thanks to these sensors, a pair of measurements is collected for each product: weight and temperature.
The Isolation Forest model learns from many products that have passed through the line. In this way, it learns what a “typical” product looks like. When an unusual product appears, for example one that is too heavy or too warm (which may indicate that the meat is starting to spoil), the Isolation Forest model recognizes it as an anomaly.
Why does this happen? Because a product that differs from the rest (has a different weight or temperature) will be quickly isolated by the model’s decision trees. The model needs only a few splits to notice that this product “does not fit” the standard products normally appearing on the line.
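To make this concrete, here is a minimal sketch in Python using scikit-learn's IsolationForest. The weight and temperature values (roughly 500 g and 4 °C for a normal package) are illustrative assumptions, not real line data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated readings from the packaging line: weight in grams, temperature in °C.
# Assumption: typical packages weigh ~500 g and are chilled to ~4 °C.
normal = np.column_stack([
    rng.normal(500, 5, size=1000),    # weight sensor
    rng.normal(4.0, 0.3, size=1000),  # temperature sensor
])

# Train on products that have already passed through the line.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(normal)

# A new product that is too heavy and too warm.
suspect = np.array([[560.0, 9.5]])
print(model.predict(suspect))        # -1 = anomaly, 1 = normal
print(model.score_samples(suspect))  # lower score = more anomalous
```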
How Isolation Forest Detects Anomalies
The Isolation Forest model is used to detect anomalies in a set of elements: it identifies whatever does not fit the rest. The model builds its trees randomly, and both the feature and the split value chosen at each step are random (e.g., customer gender, race, age, income). The structure of the trees is therefore not predetermined.
From a database of clients, a tree may first randomly divide them by gender (women and men), then by place of residence, age, etc. If something entirely unusual suddenly appears (a horse, a cat), such a data point will be very quickly separated — after one or two splits — forming a single, short branch of the tree. This short branch indicates that the point is an anomaly.
The key lies precisely in the speed of isolating anomalous points (in the above example — animals from humans). Identifying anomalies is based on identifying branches that are very short. The fewer splits required to separate a point from others, the more likely it is an anomaly.
Each tree is created by randomly dividing the data into smaller and smaller groups. Normal observations are similar to one another, so separating them requires many splits and produces long branches. Anomalies, on the other hand, are isolated after only a few splits. The shorter the branch needed to isolate a point, the more anomalous that point is.
The deciding factor for whether an observation is anomalous is therefore the number of branches (splits) needed to isolate it from the rest of the data.
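This path-length idea is what the standard Isolation Forest anomaly score formalizes: the average path length h(x) over all trees is normalized by a constant c(n) and mapped to a score between 0 and 1, with values near 1 indicating anomalies. A small sketch of that scoring formula, using the constants from the original algorithm:

```python
import math

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    among n points; used to normalize path lengths across tree sizes."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) via the Euler-Mascheroni constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score: close to 1 = anomaly, around 0.5 = normal."""
    return 2.0 ** (-avg_path_length / c(n))

# A point isolated after ~2 splits in trees built from 256 samples:
print(anomaly_score(2.0, 256))   # ~0.87 -> likely an anomaly
# A point needing ~10 splits:
print(anomaly_score(10.0, 256))  # ~0.51 -> normal
```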
Isolation Forest is particularly well-suited for analyzing streaming data for several reasons:
- It is very fast: it builds simple trees randomly and does not require complex computations, so it can process data in real time, without delays.
- It does not require storing full historical data or building complex patterns.
- It is well suited to rapidly detecting individual unusual points as they appear.
- It is easy to update: as an unsupervised model, it adapts flexibly to new data, which is essential for changing data streams.
- It requires little memory, because the trees do not store historical data points; they retain only the tree structure itself.
In summary, Isolation Forest is fast, computationally lightweight, and well-suited for detecting anomalies on the fly during real-time data processing. In its basic version, the model analyzes static data (datasets previously collected), but it can also be adapted to handle dynamic (streaming) data.
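As a rough illustration of this speed, the micro-benchmark below scores simulated points one at a time against an already-built forest. The numbers are purely indicative; actual throughput depends on hardware and data size:

```python
import time

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 2))

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# Score 1,000 "streaming" points one at a time and measure throughput.
stream = rng.normal(size=(1_000, 2))
start = time.perf_counter()
for point in stream:
    model.predict(point.reshape(1, -1))
elapsed = time.perf_counter() - start
print(f"{len(stream) / elapsed:.0f} points scored per second")
```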
Supervised and Unsupervised Learning
In data science, analyses are divided into static and dynamic ones.
A static analysis is based on a fixed dataset containing historical data. The model learns from this data and, once trained, can be applied to new, dynamic data. When that historical data includes a known outcome for every example, the approach is called supervised learning: the model learns how defined explanatory variables affect the outcome.
Supervised learning resembles teaching a child to recognize animals by showing pictures and saying, “This is a cat, and this is a dog.” Here, the model knows exactly what result it should achieve for given examples because it has a clearly defined output variable — what we are trying to predict. The model thus learns how features (e.g., size, fur length, sound) affect the result (cat or dog).
Unsupervised learning works differently. It is like showing a child many pictures of animals without saying their names. The child must independently find similarities or differences and create groups or categories. In this approach, there is no output variable (no “cat,” “lion,” or “horse” label). The model tries to understand the data’s structure, divide it into logical groups, or detect differences between them. When something does not fit any group, it is considered an anomaly.
In summary, the key difference is that in supervised learning, the model receives clear guidance on the expected result, whereas in unsupervised learning, the model must find patterns and structure by itself, because no correct answers are predefined.
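The contrast can be shown in a few lines of code. In the hypothetical sketch below, the same data is given to a supervised classifier (which sees the labels) and to Isolation Forest (which does not); the features and labels are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # two features, e.g. size and fur length
y = (X[:, 0] > 0).astype(int)  # toy labels: 0 = "cat", 1 = "dog"

# Supervised: the model sees the answer key (y) during training.
clf = RandomForestClassifier(random_state=1).fit(X, y)
print(clf.predict(X[:3]))      # predicts a label for each example

# Unsupervised: no labels at all; the model only learns the data's structure.
iso = IsolationForest(random_state=1).fit(X)
print(iso.predict(X[:3]))      # 1 = fits the structure, -1 = anomaly
```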
Static vs. Dynamic (Streaming) Analysis
In static (batch) analysis, we work with a complete, closed dataset (e.g., a year’s worth of customer transactions). Decision trees are built once for this dataset.
In dynamic (streaming) analysis, data points arrive in real time — there is no fixed, large dataset, only continuously incoming data, such as online store transactions or sensor data from production systems.
Isolation Forest needs historical data to build its initial forest of random trees. Then, as new points arrive, it checks them against the existing trees to determine whether they are anomalies. The model therefore does not need to store the historical data itself; it keeps only the tree structure, which summarizes the characteristics of previously analyzed data.
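A sketch of this fit-once, score-as-it-arrives pattern, again with illustrative sensor values and a toy stand-in for the incoming stream:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# 1) Build the forest once from historical (batch) data.
history = np.column_stack([
    rng.normal(500, 5, 5000),
    rng.normal(4.0, 0.3, 5000),
])
model = IsolationForest(n_estimators=100, random_state=7).fit(history)

# 2) The raw history can now be discarded; the tree structure is enough.
del history

# 3) Check each new point against the existing trees as it arrives.
def incoming_readings():  # stand-in for a real sensor stream
    yield np.array([501.2, 4.1])   # normal
    yield np.array([498.7, 3.8])   # normal
    yield np.array([560.0, 9.5])   # too heavy and too warm

for reading in incoming_readings():
    label = model.predict(reading.reshape(1, -1))[0]
    print(reading, "ANOMALY" if label == -1 else "ok")
```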
How Isolation Forest Works for Streaming Processes
For streaming data, Isolation Forest is used in the form of the Incremental Isolation Forest.
Based on the initial data, the first trees are built. As new data arrives in the streaming process, the model continuously updates the trees. In this way, it constantly “learns” new patterns and reacts quickly to changes.
Returning to our earlier example: if the customer dataset (in which a horse appeared as an anomaly) is later expanded to include many cats, the model will treat them as a group and begin to identify subcategories such as fur color, gender, and age, dynamically creating new trees.
Thus, in practice, for dynamic data, the Isolation Forest model can continuously adapt to the data stream.
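Scikit-learn's IsolationForest has no built-in incremental update, so one common way to approximate the incremental behavior described above is to periodically refit the forest on a sliding window of recent points. The class name and parameters below are illustrative assumptions, not a standard API:

```python
from collections import deque

import numpy as np
from sklearn.ensemble import IsolationForest

class WindowedIsolationForest:
    """Approximates an incremental Isolation Forest by periodically
    refitting on a sliding window of the most recent points."""

    def __init__(self, window=5_000, refit_every=500):
        self.window = deque(maxlen=window)  # recent history the trees summarize
        self.refit_every = refit_every      # rebuild after this many new points
        self.model = None
        self.seen = 0

    def process(self, point):
        """Score one streaming point, then fold it into the history."""
        label = None
        if self.model is not None:
            # -1 = anomaly, 1 = normal, judged against the current trees
            label = self.model.predict(point.reshape(1, -1))[0]
        self.window.append(point)
        self.seen += 1
        if self.seen % self.refit_every == 0:
            # Refit on the recent window so the forest tracks drift.
            self.model = IsolationForest(n_estimators=100, random_state=0)
            self.model.fit(np.vstack(self.window))
        return label
```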
Summary
To summarize, Isolation Forest models are an effective tool for automatic anomaly detection because they can simultaneously analyze hundreds of different variables and quickly identify deviations from the norm.
The use of such models is not limited to production lines — they can also be applied to:
- analyzing products stored in warehouses,
- checking raw materials received at the entry scales of processing plants,
- monitoring auxiliary processes such as workforce utilization or production support systems.
Importantly, once implemented in cloud infrastructure, Isolation Forest models operate automatically and require very little maintenance.
Once deployed, they can run continuously with low susceptibility to failure or technical errors, which further increases the reliability of the entire quality control and production monitoring process.
The practical implementation of this solution involves connecting the data stream from the production line directly to the analytical cloud via an API.
The streaming data are then automatically processed by an analytical module based on the Incremental Isolation Forest model, which continuously classifies products as “normal” or “anomalous.”
The results are sent as a time series containing a clear classification for each analyzed product. Alerts can be delivered directly to mobile devices, for example the smartphones of the production manager or quality inspection staff, allowing an immediate response to potential quality or technological problems and significantly improving the efficiency of the entire quality control process.
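A minimal sketch of the classification step of such a pipeline is shown below. The record format is an assumption, and the actual delivery mechanism to smartphones depends on the chosen cloud service:

```python
import json
import time

import numpy as np
from sklearn.ensemble import IsolationForest

# Train on historical readings, as in the earlier sketches (illustrative values).
rng = np.random.default_rng(0)
history = np.column_stack([rng.normal(500, 5, 1000), rng.normal(4.0, 0.3, 1000)])
model = IsolationForest(n_estimators=100, random_state=0).fit(history)

def classify(weight_g, temp_c):
    """Turn one sensor reading into a time-series record with a clear label."""
    label = model.predict(np.array([[weight_g, temp_c]]))[0]
    return {
        "timestamp": time.time(),
        "weight_g": weight_g,
        "temp_c": temp_c,
        "status": "anomalous" if label == -1 else "normal",
    }

print(json.dumps(classify(560.0, 9.5)))
# In a real deployment, "anomalous" records would be forwarded to a
# notification service so they reach staff smartphones as alerts.
```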


