A Mini Recommendation System for a Customer Service Centre


Management – Bakery and Confectionery Review, December 2025

The architecture of the recommendation system is deliberately simple, yet highly effective operationally. Its key components are a financial filter aligned with the client's declared limit for the monthly instalment and a popularity ranking of the candidate vehicles.
The aim of this paper is not only to describe the method, but also to argue why a “simple bicycle” is, in a real organisation, often more effective than a “racing car” whose construction drags on for months. The system is intended to be ready for deployment in a short time, measurable and easy to scale; its advantages lie in predictability, decision discipline among consultants and robustness to “noise” in narrow customer segments.
A ready-to-use application, described in this publication and written in Python, is available in a GitHub repository at:
https://github.com/ff-wm/reco-A1

Business context and motivation

Let us imagine the call centre of a company brokering the sale or lease of vehicles. In a classic scenario, a consultant answers a phone call from a bakery owner and hears the question: “Which vehicle will be suitable for me?” If the answer is off target – for example, a very expensive car without financing is proposed in a situation where the client expects leasing with a specified instalment – the conversation in practice ends immediately, even if politeness keeps it going for a short while longer. The first hit is crucial, because it reduces the client’s cognitive effort and signals fit: “These people understand my needs.” The opposite situation triggers an avoidance mechanism: the client does not return, because they have experienced dissonance and feel their time has been wasted.

Organisations often postpone building their own recommendation system, waiting for the “ideal” moment or telling themselves: “We are efficient – why change what works well?” Ultimately, daydreams arise about some future large-scale system that will revolutionise sales. This is a costly illusion. Business needs results here and now, not demonstrations of new technologies or advanced artificial intelligence. The idea of “low-hanging fruit” consists in implementing solutions which – despite their simplicity – deliver immediate effects, improve key indicators such as conversion, average revenue per contact, or shorten the time needed to handle a transaction.

A small recommendation system as a low-hanging fruit leading to a measurable competitive advantage

In B2B sales interactions, the impression formed in the first seconds of the conversation has a strong impact on subsequent decisions. A call centre that, right at the outset, proposes an appropriate configuration of a delivery vehicle shortens the client’s decision path and reduces their uncertainty. In this text I present a conceptual, inexpensive and simple recommendation system for a call centre selling delivery vehicles up to 3.5 tonnes to bakeries.


Problem framing and data assumptions

We consider the sale of commercial vehicles to bakeries. We focus on the segment up to 3.5 tonnes, because for bakeries this range is particularly attractive. It combines the availability of drivers with a category B driving licence with functionality sufficient for daily delivery of baked goods. We assume that the call centre has access to a database of clients who call regarding new vehicles. The database contains information on the maximum monthly budget, that is, the limit of the leasing instalment, as well as basic client data, such as scale of production and region of operation. Clients with a purchase history have stored in the database information about vehicle segments, models, equipment variants and forms of financing. The call centre also has a vehicle catalogue containing information such as segment, model, equipment version (trim), form of financing, and the estimated price together with the resulting estimated loan or leasing instalment.

It is important to separate two roles of the data:

• first – they serve as a decision constraint in the form of the client’s budget, which is then translated into a hard filter on the proposals presented to the client;
• second – they provide a signal of preferences in the form of purchase frequencies across financial classes.

This makes it possible to build recommendations that are both feasible (fit within the instalment limit) and consistent with the purchasing habits of similar clients.

Recommendation mechanism

The method consists of two sequential steps, but both parts form one compact recommendation procedure. First, for a given client we determine their income position relative to the population. In practice, the most convenient approach is to divide the distribution of the declared instalment limit into six equally numerous intervals (sextiles). In other words, we divide clients into six economic classes. One can choose any number of affluence classes; the number itself is not of major importance here. Such a division balances class sizes and facilitates the estimation of stable purchase frequencies.

Next, we restrict the candidates for recommendation to those vehicle configurations whose forecast monthly instalment does not exceed the client’s declared limit. This hard financial filter is a critical element: it removes random and sometimes embarrassing proposals from the call centre that “blow up” the first impression and lead the client to withdraw from the transaction.
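To make these two steps concrete, here is a minimal Python sketch with made-up client and catalogue data (the tables, column names and values below are illustrative assumptions, not the repository code):

import pandas as pd

# Hypothetical client base and vehicle catalogue, mirroring the data assumptions above.
clients = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5, 6],
    "instalment_limit": [900, 1200, 1500, 1800, 2400, 3200],   # declared monthly limit (PLN)
})
catalogue = pd.DataFrame({
    "vehicle": ["A_base", "A_frost", "B_base", "B_long", "C_base"],
    "estimated_instalment": [850, 1100, 1400, 1750, 2600],     # forecast monthly instalment (PLN)
})

# Step 1: divide clients into six equally numerous budget classes (sextiles).
clients["budget_class"] = pd.qcut(clients["instalment_limit"], q=6, labels=False)

# Step 2: hard financial filter – keep only configurations within the client's limit.
def feasible_vehicles(client_id):
    limit = clients.loc[clients["client_id"] == client_id, "instalment_limit"].iloc[0]
    return catalogue[catalogue["estimated_instalment"] <= limit]

print(feasible_vehicles(4))   # candidates for the client with an 1,800 PLN limit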

In the second step, we build an ordered recommendation list for each of the six economic classes, i.e. for each band of the clients’ budget capacity. To do this, we evaluate each vehicle variant in the context of the client’s financial class by its conditional popularity, specifically its purchase frequency in the given interval, as well as by its global popularity, i.e. its frequency in the entire client population. The signal indicating the most frequently purchased configuration within a given financial interval can be unstable when the number of observations is small. A single “exotic” purchase may shift the ranking.

Why do we use additive smoothing rather than, for example, a weighted mean or a median?

Because we aim to maintain stability despite data sparsity in some classes. Fixed weights ignore the fact that classes differ in size. The M-estimate automatically scales the influence of the global component to the sample size, ensuring that classes with few transactions do not accidentally distort the ranking. Suppose the middle class contains only five transactions, and within these transactions one client has bought seven vehicles from the Indian manufacturer Tata Motors. A single rare transaction can destabilise the recommendations in a class with few transactions; therefore, it must be stabilised using the global frequencies.

The goal of the recommendation system is to provide a ranking of the most popular vehicles in specific classes. A kind of top-5 list of vehicles with the highest probability of sale in a given class is created. In other words, a recommendation list is produced, ordered by decreasing probability of a “hit” and, secondarily, by decreasing monthly cost (instalment). As a result, the consultant receives a set of proposals that are both statistically credible and financially acceptable. As mentioned earlier, for extremely rare vehicles that appear in a specific, very small class, we mitigate extremes by blending in the global ranking, which prevents domination by exotic brands that are a kind of cultural anomaly.
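A minimal sketch of this smoothed ranking, continuing the hypothetical data above (the value of the smoothing weight m is an assumption):

import pandas as pd

# Hypothetical purchase history: one row per past transaction.
purchases = pd.DataFrame({
    "budget_class": [2, 2, 2, 2, 2, 3, 3, 1, 1, 0],
    "vehicle":      ["A", "A", "B", "Tata", "A", "B", "C", "A", "B", "A"],
})

m = 5.0  # smoothing weight: how many "virtual" global observations are added

global_freq = purchases["vehicle"].value_counts(normalize=True)

def top5(budget_class):
    in_class = purchases.loc[purchases["budget_class"] == budget_class, "vehicle"]
    n = len(in_class)
    class_freq = in_class.value_counts(normalize=True).reindex(global_freq.index, fill_value=0.0)
    # M-estimate: small classes are pulled towards the global frequencies.
    score = (n * class_freq + m * global_freq) / (n + m)
    return score.sort_values(ascending=False).head(5)

print(top5(2))   # smoothed top-5 for the middle budget class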

Intuition and stability

Clients react not only to price and specification, but also to the impression of fit and time savings. The first sensible proposal reduces uncertainty and shortens the negotiation distance. From a statistical perspective, the method exploits the fact that preference distributions within similar financial constraints resemble one another. The admixture of global popularity acts as a regulator that “calms” the ranking when the local sample is too small.

From a business point of view, the solution is transparent: we can explain to the consultant why a given vehicle ranked high. It was purchased by “people like you”, so it is popular and meets the budget constraints, which fosters trust in the call centre and facilitates the process. Everyone knows that a popular vehicle model means easy access to spare parts and service, as well as a large amount of user information available online.

Operational implementation without code

It is not necessary for the consultant to understand the details of the algorithm. From their perspective, the interface is a field in which the client identifier is entered; a list of five vehicle proposals with a forecast instalment and a brief justification then appears. From the viewpoint of the psychology of sales, providing specific recommendations is critical. A recommendation opens the conversation but does not close it – the consultant asks two or three follow-up questions, for example about body type preferences or the availability of a driver with a category B licence, then confirms the choice and proposes an alternative variant within the same financial class. Such a scheme limits embarrassing proposals, “offers that miss the mark”, and builds an impression of professionalism and self-confidence.

Measuring the effect in production conditions

The organisation should assess the system from two perspectives.

First – through the prism of operational indicators: conversion of contact into sale, revenue per contact, and time to decision per contact.

Second – from the point of view of ranking quality: whether the purchased vehicle appears in the first five proposals, i.e. on the recommendation list, and in which position on that list.

The most pragmatic experiment is a simple division of call-centre consultants into a test group and a control group. Data “before” and “after” implementation should be collected for the same employees testing the new solution, or a parallel effectiveness test should be conducted in the same time windows.

With normal distributions and “before/after” comparisons, we use Student’s t-test for dependent samples; where normality is violated, we use the non-parametric Mann–Whitney or permutation test. We adopt a conservative decision criterion (e.g. significance level 0.05) and, in addition to p-values, we also report effect sizes (Cohen’s d), which facilitates communication of business value. Put more simply: implementing a simple recommendation system also requires testing on a small sample of call-centre employees.
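For illustration, the tests mentioned above can be run in a few lines of Python (the conversion figures below are invented):

import numpy as np
from scipy import stats

# Hypothetical conversion rates per consultant, before and after deployment.
before = np.array([0.11, 0.09, 0.14, 0.10, 0.12, 0.08, 0.13, 0.10])
after  = np.array([0.13, 0.10, 0.16, 0.12, 0.12, 0.11, 0.15, 0.12])

t_stat, p_value = stats.ttest_rel(after, before)    # paired t-test ("before/after", same consultants)
u_stat, p_mw = stats.mannwhitneyu(after, before)    # non-parametric test for an independent test/control split

diff = after - before
cohens_d = diff.mean() / diff.std(ddof=1)           # effect size for the paired comparison
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")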

Risks, ethics and change management

Simplicity does not relieve one of responsibility. Particularly sensitive is the way in which we categorise clients – we should limit ourselves to the financial correlate of the transaction (instalment limit), and not to indirect socio-demographic characteristics that could introduce unjustified bias. The system must not “penalise” the client for being different beyond what is economically justified. In other words, if a citizen of India wants to buy Tata vehicles, there is little point in persuading them that “the computer says no”.

Equally important is preventing excessive exploration of niche configurations. This is safeguarded by the mechanism of smoothing using global popularity. Finally, implementation should be accompanied by a short training path for consultants – with emphasis on understanding the principles of operation and consciously formulating justifications for the client.

What next: modelling purchase sequences

Once a simple recommendation system starts delivering results, its complexity can be increased gradually. Natural directions include incorporating seasonality and regional context or using route logistics. One can move towards modelling sequences of vehicle purchases, i.e. dependence between consecutive vehicles over time.

Modelling purchase sequences means adding to the recommendations information on what clients usually choose as the next vehicle after their previous choice and when they usually do so. In other words: not only what is popular in a given budget class now, but also what, with a certain probability, usually follows a given earlier transaction. For example, after 36 months of leasing an Atom van from Ford, clients most often switch to the refrigerated version AtomFrost+. The “what after what” mapping is based on a simple principle of transition frequencies, e.g. from segment A to B within the same budget class with light smoothing, and their influence grows the more recent and longer the client’s history log is. This method is quite popular among vehicle manufacturers; it is said that cars “grow”¹.
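A transition table of this kind can be estimated with a few lines of Python; the vehicle names below reuse the article’s fictional Atom/AtomFrost+ example, and the smoothing constant is an assumption:

from collections import Counter, defaultdict

# Hypothetical purchase histories: ordered vehicle configurations per client.
histories = [
    ["Atom_van_lease36", "AtomFrost+_lease36"],
    ["Atom_van_lease36", "AtomFrost+_lease48"],
    ["Atom_van_lease36", "Atom_van_lease36"],
]

transitions = defaultdict(Counter)
for history in histories:
    for prev_item, next_item in zip(history, history[1:]):
        transitions[prev_item][next_item] += 1

def next_purchase_probs(prev_item, alpha=1.0):
    """'What after what' transition frequencies with light additive smoothing."""
    counts = transitions[prev_item]
    catalogue = {i for c in transitions.values() for i in c} | set(transitions)
    total = sum(counts.values()) + alpha * len(catalogue)
    return {i: (counts[i] + alpha) / total for i in catalogue}

print(next_purchase_probs("Atom_van_lease36"))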

Further development: Learning-to-Rank

A simple, functioning recommendation model that earns additional revenue can be expanded in an evolutionary way. The key is to maintain measurement rigour: no extension should weaken the simplicity of interpretation during the conversation with the client.

We treat recommendation as a ranking problem: the model learns which vehicles should be higher for a given client. Instead of predicting “whether they will buy”, we predict the relation “Ford before Renault”. In practice, we feed the model with features of the client (budget), vehicle features from history, financing, instalments and labels from logs, i.e. information on the course of the purchasing process².

Example: for a client with a monthly instalment budget of 1,800 PLN, the model learns that “SUV-lease_48” usually outperforms “wagon-loan_60”, so in the top-5 it moves SUVs higher, because historically they more frequently end in a scheduled contact or a purchase.
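A hedged sketch of such a pairwise learning-to-rank model, here with xgboost’s XGBRanker (a recent xgboost version is assumed; the features, labels and group ids are invented):

import numpy as np
import xgboost as xgb

# Each row is one (client, vehicle) candidate pair.
# Features: client budget, vehicle instalment, financing length (months), global popularity.
X = np.array([
    [1800, 1650, 48, 0.30],   # SUV, lease 48
    [1800, 1700, 60, 0.12],   # wagon, loan 60
    [1800, 1500, 36, 0.22],
    [2500, 2300, 48, 0.18],
    [2500, 2100, 36, 0.40],
])
# Labels from the sales logs: 2 = purchase, 1 = scheduled contact, 0 = no interest.
y = np.array([2, 0, 1, 0, 2])
# qid groups the candidates shown to the same client (one ranking task per client).
qid = np.array([0, 0, 0, 1, 1])

ranker = xgb.XGBRanker(objective="rank:pairwise", n_estimators=200, learning_rate=0.1)
ranker.fit(X, y, qid=qid)

# Order the first client's three candidates from best to worst.
print(np.argsort(-ranker.predict(X[:3])))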

Further development: the multi-armed bandit mechanism

This is an evolutionary development approach that boils down to two balancing directions:

• exploitation – we tell the model: show what has worked so far;
• exploration – we tell the model: sometimes show something less familiar to check whether it performs better.

Each arm of the bandit is a variant of the recommendation: a specific configuration or rule for ordering the list. Simple algorithms include ε-greedy (e.g. in 5% of conversations a strategy is drawn at random for exploration, while in the remaining 95% the strategy with the best results so far is used).

For example, suppose we have three list-ordering strategies:

(S1) pure popularity (i.e. what we have described in this article),
(S2) popularity + “what-after-what” sequences,
(S3) popularity + greater emphasis on leasing_36, which happens to be exceptionally profitable for the call centre.

The bandit assigns consultants’ conversations to S1/S2/S3; after each conversation we update the “reward”. Over time, the algorithm increasingly often selects the strategy that yields the highest profit, while still occasionally testing the others to capture any trend shift.

In summary, the multi-armed bandit approach is a simple online learning method that distributes conversations among several strategies (e.g. S1/S2/S3), updates their “reward” (click/scheduled meeting/purchase) after each conversation, and over time increasingly chooses the strategy that delivers the highest profit – while still occasionally testing the others so as not to miss a better option.
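A minimal ε-greedy sketch over the three hypothetical strategies S1–S3 (the reward definition and the value of ε are assumptions):

import random

strategies = ["S1", "S2", "S3"]
counts = {s: 0 for s in strategies}      # conversations assigned to each strategy
rewards = {s: 0.0 for s in strategies}   # accumulated reward (e.g. 1 = sale, 0 = no sale)
epsilon = 0.05                           # share of conversations used for exploration

def choose_strategy():
    untried = [s for s in strategies if counts[s] == 0]
    if untried or random.random() < epsilon:
        return random.choice(untried or strategies)                 # explore
    return max(strategies, key=lambda s: rewards[s] / counts[s])    # exploit

def update(strategy, reward):
    counts[strategy] += 1
    rewards[strategy] += reward

# One simulated conversation: assign a strategy, observe the outcome, update.
s = choose_strategy()
update(s, reward=1.0)   # e.g. the call ended with a purchase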

Conclusions

Even the simplest recommendation system described here – budget matching combined with a popularity ranking and stabilising smoothing – can significantly improve call-centre sales outcomes by increasing the accuracy of the first proposal, and thereby sales conversion.

The advantage of this recommendation system stems from three features:

• it is quick to implement,
• it is transparent to explain,
• it is robust to data sparsity in narrow segments.

Contrary to intuition, “toy-like” does not mean frivolous: in many contexts it is precisely this type of system that constitutes the first step enabling an organisation to begin the journey towards more advanced forms of personalisation and revenue optimisation.

Wojciech Moszczyński

Wojciech Moszczyński – graduate of the Chair of Econometrics and Statistics at Nicolaus Copernicus University in Toruń, specialist in econometrics, finance, data science and management accounting. He specialises in the optimisation of production and logistics processes. He conducts research in the field of development and application of artificial intelligence. For many years he has been involved in popularising machine learning and data science in business environments.

¹ Golf VII (since 2012) moved to the modular MQB platform – the same family on which the Passat B8 (since 2014) was built. In practice this meant a “floorpan” and key chassis elements that were shared or similar, so the Golf clearly “grew” towards the former size of the Passat.
² Example models: LambdaMART, XGBoost ranker, pairwise Logistic Regression. Metrics: NDCG@k, MAP, Hit@k.

Practical notes on applying OCEAN/HEXACO psychological profiling in recommender systems for selling luxury and investment goods (Part 2)


November 2025 | Volume 79

This article refers to my earlier publication, where I presented assumptions for using the Big Five (OCEAN) and HEXACO models to profile customers and tailor sales communication.

I focus on two practical data sources: (1) call-center/voice-to-text transcripts from the advisory channel and (2) application logs from the web channel (clicks, searches, scroll time, navigation paths). I provide a scientific justification that both types of digital traces contain sufficiently rich linguistic and behavioral signals to estimate personality dimensions, as confirmed by research on predicting OCEAN from language and digital behavior, as well as meta-analyses of digital footprints. I state requirements for sample size, the validation scheme, and ethical-legal issues relevant to commercial applications. Prior findings show that matching the message to the personality profile increases the effectiveness of persuasive and sales activities—provided that the models are properly calibrated on appropriate data and respect privacy standards.

Modern computational psychometrics has repeatedly shown that spontaneous language (open vocabulary) and online behaviors predict questionnaire OCEAN scores at a level useful in practice. A classic study on more than 75,000 Facebook users demonstrated clear and stable mappings of the five OCEAN dimensions in linguistic features. Language from transcripts and “mouse” behavior patterns on a website carry information about personality dispositions that can be used to appropriately tailor commercial communication.

Transcripts in advisory channels

We obtain transcripts from ASR (automatic speech recognition) systems. After normalization (removing fillers, lemmatization, segmentation into speaker utterances) we obtain textual material that can be analyzed along two complementary paths:

(a) Interpretable features — input variables that are easy to understand and explain; it is clear what they measure and why they affect the outcome. For example, the number of expressive adverbs (“super,” “mega”) and invitations to make contact “live” indicate higher Extraversion (E).

(b) Semantic embeddings of sentences/segments compared with OCEAN/HEXACO “seed sentences.” The reliability of a text sample increases with text length.

Sample size. The stability of linguistic features increases with the number of words; numerous publications emphasize that short samples are unstable and require interpretive caution. Practical guidelines suggest a few hundred words as the minimum for meaningful conclusions. In our applications we assume 300 words of the customer’s cumulative speech or the sum of several conversations.
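For path (b), a hedged sketch of comparing a customer’s cumulative transcript with trait “seed sentences” via sentence embeddings (the sentence-transformers model name and the seed sentences are assumptions):

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical seed sentences describing the high pole of two dimensions.
seeds = {
    "Extraversion": "I love talking to people and making decisions together, live.",
    "Openness": "I enjoy trying new, unconventional ideas and products.",
}

customer_utterances = [
    "I'd rather just call you and sort it out in one conversation.",
    "Can we meet in person? I decide faster when I talk to someone.",
]  # in production: at least ~300 cumulative words per customer, as argued above

model = SentenceTransformer("all-MiniLM-L6-v2")
client_vec = model.encode(" ".join(customer_utterances))

for trait, seed in seeds.items():
    seed_vec = model.encode(seed)
    cos = float(np.dot(client_vec, seed_vec) /
                (np.linalg.norm(client_vec) * np.linalg.norm(seed_vec)))
    print(trait, round(cos, 3))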


Application logs (web channel): measuring OCEAN/HEXACO traits

The second channel is digital behavior on the site and in the app: order of page views (click-through paths), time on subpages and components (dwell time), intensity of filtering/comparisons, search queries, returns to cart, depth of specification exploration, changes of variants, etc. Research on digital footprints shows that even simple activity patterns are characteristic of personality traits, and meta-analyses confirm significant, repeatable links between digital behaviors and the Big Five dimensions. The strongest evidence concerns mobile data (passive sensing): OCEAN levels can be predicted from smartphone logs (calls, location, app habits) at a practically acceptable quality; reviews particularly confirm good predictability for Extraversion, with smaller but significant effects for the remaining dimensions. Methodologically, these results are also important for website logs because they concern the same class of phenomena: stable patterns of technology use.

Example mappability of signals

  • Increased depth of comparisons, frequent switching of specifications, and use of calculators may indicate higher Conscientiousness (C).

  • Quick “live” decisions, willingness to make contact, and checking social proof—higher Extraversion (E).

  • Preference for “what’s new” sections and beta/limited versions—higher Openness (O).

  • Persistent checking of return policies, risk FAQs, and hidden costs—higher Emotionality/Neuroticism (E/N).

  • Preference for “About the brand,” certifications, and proofs of authenticity—higher Honesty–Humility (H).

Such assignments of simple, recurring behaviors must always be verified empirically—by building models on stable behavioral features and checking their validity against reference traits.
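As a sketch of how such raw web-channel events can be turned into stable behavioural features (the event names and the aggregation below are illustrative assumptions):

import pandas as pd

# Hypothetical click-stream: one row per event in the web channel.
events = pd.DataFrame({
    "customer_id": [7, 7, 7, 7, 9, 9],
    "session_id":  [1, 1, 1, 2, 1, 1],
    "event":       ["compare_spec", "open_calculator", "compare_spec",
                    "open_faq_risk", "view_new_arrivals", "contact_click"],
    "dwell_sec":   [40, 65, 30, 80, 25, 10],
})

# Session-level behavioural features that later feed the trait models.
features = events.groupby(["customer_id", "session_id"]).agg(
    n_events=("event", "size"),
    total_dwell=("dwell_sec", "sum"),
    n_comparisons=("event", lambda e: (e == "compare_spec").sum()),
    n_risk_checks=("event", lambda e: (e == "open_faq_risk").sum()),
).reset_index()
print(features)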

Study design and good (operational) practices

In a project aimed at learning customer traits, we combine both sources (transcripts + logs) into one pipeline. On the transcript side, we use cleaning, speaker segmentation, and extraction of features and embeddings; on the logs side—aggregation of event sequences into episodic and long-term features. We primarily analyze the repeatability of characteristic behaviors per session, creating a typical behavior pattern for each customer. We build models as regressions/classifiers for five (OCEAN) or six (HEXACO) dimensions with validation by person–session to avoid leakage between sessions of the same customer. Therefore, we do not pool logs from many sessions into one dataset, because we study the stability of behaviors. Behaviors across many sessions must repeat. The replicability of behavior patterns will be evidence of high-quality customer observation. For transcript analysis, for clarity I recommend a two-layer architecture: an interpretive layer (traditional counts of words and phrases) and a semantic/sequential layer (embeddings, vector models detecting sentence meaning).
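The person–session validation mentioned above can be expressed, for example, with scikit-learn’s GroupKFold, grouping folds by customer so that sessions of the same customer never appear on both sides of a split (the data below are random placeholders):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))           # behavioural features per session
y = rng.normal(size=300)                 # e.g. questionnaire Conscientiousness score
customer_id = rng.integers(0, 60, 300)   # which customer produced each session

cv = GroupKFold(n_splits=5)
scores = cross_val_score(Ridge(), X, y, cv=cv, groups=customer_id, scoring="r2")
print(scores.mean())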

Sample size is critical: on the text side, at least a few hundred words per customer for cumulative conversations; on the logs side—dozens of events per user and aggregation within a time window. Publications recommend plotting reliability curves (prediction quality vs. text length/number of events), which is a commonly recommended practice in analyses of online language stability. New AI methods provide many metrics to assess sample quality.

Research demonstrates the power of predicting traits from digital traces, which creates an obligation for transparency and proportionality of processing. Models should be designed according to the “benefit to the customer” principle (better matching, lower information friction), with data minimization, anonymization, access controls, and regular risk assessment. Reporting should include not only accuracy but also stability and calibration to avoid over-interpretation. Prior studies provide empirical background confirming the sensitivity of profiling from digital traces.

Both transcripts and application logs are valuable, complementary sources for estimating OCEAN/HEXACO profiles in the context of selling luxury and investment goods. Language captures the content and style of the customer’s thinking, while web-channel behaviors reveal habits of exploring risk, data, and novelty in a real purchasing process. Combining both sources—while meeting sample-length requirements, validation rigor, and ethical rules—enables operationalizing the principle “first learn, then tailor communication.” The results should be shorter decision cycles, higher satisfaction, and a higher probability of closing the transaction, consistent with experimental evidence on the effectiveness of psychological message matching.

Why salespeople should not be given scripts on how to talk to customers

Classical sales approaches emphasize aligning content and style to the customer type, based on situational knowledge and a repertoire of strategies. At the same time, we know that salespeople, like customers, have their own relatively stable OCEAN/HEXACO profiles that reveal themselves after a dozen or so sentences of interaction. Attempts by a salesperson to pretend a certain style over a long time are not durable; therefore, instead of changing the salesperson’s personality, one should optimize the pairing of salesperson and customer. Put plainly, a person is not able to change their style or psychological structure in a short time. They are not able to change their vocabulary so that the words they use in a sales conversation match the customer’s psychological structure. Yes, they may pretend to be someone else for a short time, but in the longer perspective such pretense will be ineffective. Moreover, a customer who realizes the salesperson is pretending may react unpredictably. One can ask whether such pretense or “playing to the customer” on the part of the salesperson is profitable and whether it carries certain risks. It is not simple to match a salesperson to a customer according to the OCEAN/HEXACO methodology. This aspect requires longer empirical research. The problem of proper matching undoubtedly exists. It is not based on the simple assumption that similarities attract and differences repel. The literature points to cases in which complementarity (e.g., high C in the salesperson with low C in the customer) can be beneficial—hence the need for empirically derived rules. Although similarity–attraction theory predicts greater interaction comfort with similar profiles (which may be advantageous in luxury and long-term relationships), in some cases complementarity is more effective. Therefore, matching rules should be learned from data and continuously updated—instead of assuming the universal primacy of similarity.

  • “Operating manuals” for customers and for salespeople can be embedded in the CRM and generated automatically based on OCEAN/HEXACO profiles.

  • Instead of expecting a salesperson to change personality several times a day, match salespeople to customers based on OCEAN/HEXACO profiles.

If only selected salespeople can work with specific customers, should those selected salespeople also write contracts, offers, and general correspondence with those customers? Why should a person not write written communications tailored to the customer’s psychological profile?

Why, from the perspective of using OCEAN/HEXACO, offers for customers should be generated by AI rather than manually by a human

If every offer and every paragraph of communication is to be precisely matched to the customer’s personality profile according to OCEAN/HEXACO, generating them by people is inefficient and error-prone: it scales poorly, does not guarantee consistency or objectivity, and is difficult to test and improve quickly. AI systems solve these limitations: they can reliably infer a profile from language and digital traces, select content for traits (psychological targeting), and then optimize the offer in an experimental cycle (A/B, multi-armed bandits).

A typical salesperson has limited memory and limited time. They cannot maintain linguistic rules and the right keywords and “golden phrases” for a thousand unique profiles and, on the fly, compose offers with different variants of tone, structure, and contractual safeguards. AI, however, can apply consistent rules at the level of words, phrases, and entire sections, taking into account both hard traits (e.g., for Conscientiousness—emphasis on SLAs) and soft traits (e.g., high N/Emotionality—reducing uncertainty in phrasing and the order of arguments). The empirical foundation that matching content to traits changes recipient behavior (clicks, purchases) has been provided by large-scale field experiments.

Humans are easily influenced by mood, fatigue, and the most recent contact. Algorithms, based on data traces, weigh information evenly and—importantly—do not prefer socially desirable profiles. Comparative results show that computer-based personality assessments were sometimes more accurate than those by friends or spouses of the same individuals. This is an argument that language-selection rules should follow from the model rather than an individual’s intuition.

Generative text systems can produce hundreds of offer variants (different tone, paragraph order, degree of formalization, choice of proofs) and test them experimentally (A/B/MAB), updating the generation policy based on real outcomes (conversion, time to decision, response to follow-ups). In the human world such a loop would be too slow and inconsistent.

Why this translates into sales: the “psychological matching” mechanism

In psychological targeting experiments, matching the message to Extraversion or Openness yielded large increases in clicks and purchases versus non-matched versions. This is strong evidence that micro-features of language and persuasive emphases must be retuned to the profile—otherwise the same offer can provoke resistance or boredom. We extend the same principle to the full OCEAN/HEXACO set, including Honesty–Humility. This is very important for luxury and investment goods due to perceptions of ethics, authenticity, and contractual risk.

What exactly AI does that a human will not do well—or at all

  1. Infers the profile from transcripts and application logs.

  2. Selects content and tone at the micro level (words/phrases), meso level (paragraphs: order, saturation with evidence/risk), and macro level (offer architecture).

  3. Controls “risk words” (exclusion lists) and substitutes for given profiles to avoid defensive reactions.

  4. Experimental optimization (A/B/MAB) on live data—rapid detection of “profile → variant → outcome” patterns and updating the generation policy.

  5. Consistency and metadata: every offer has metadata; effective editing of metadata and subtext with respect to OCEAN/HEXACO traits is unachievable for a human at the scale of hundreds of customers.

Integration with sales practice in the CRM

In practice, a recommendation engine that automatically builds offers is a module in the CRM that takes a previously defined customer OCEAN/HEXACO profile and a business configuration (price list, SLAs, policies) and returns a composite document: tone variant, order of sections, selection of proofs, and what to avoid. Per-profile versions can be maintained as dynamic templates. Every output undergoes compliance checks and can be edited ex ante by a human. This human involvement is important and necessary. In data-science model pipelines this is called human-in-the-loop. However, AI generates the first version and controls the optimization experiments.

OCEAN/HEXACO in the context of configuring the subject of individual offers for luxury goods

There are strong scientific and tooling foundations for recommender systems to select specific product attributes (e.g., yacht drive variant, series limitation, finishing materials, financing mode) based on the OCEAN/HEXACO personality profile and on the behaviors and choices of “psychologically similar” customers, using matrix factorization, embeddings, and vector databases.

Empirical foundations

  1. Psychological targeting works. As stated earlier, in experiments matching the message to Extraversion or Openness increased clicks and purchases compared with non-matched communication—showing that personality traits can be operationalized in marketing and sales, and that their impact is quantitatively significant.

  2. Personality-aware recommendations. Literature reviews on personality-aware recommender systems show that incorporating personality features improves cold-start performance and recommendation stability by leveraging psychological similarity, not only behavioral similarity.

  3. Luxury contains symbolic values. HEXACO traits (e.g., Honesty–Humility) are linked to attitudes toward authenticity, exclusivity, and brand ethics—factors of particular importance in luxury and investment goods.

  4. Mechanism 1 (reduction of cognitive cost). OCEAN/HEXACO organize the style of information processing; tailoring tone and types of proof reduces cognitive cost, leading to higher conversion.

  5. Mechanism 2 (transfer of preferences via “psychological near-neighbors”). Collaborative filtering exploits user similarity; incorporating psychological similarity improves the cold start and stabilizes recommendations.

  6. Mechanism 3 (alignment with symbolic/self-expressive motives). Personality traits relate to preferences for limited editions, authenticity certificates, brand history, or risk-mitigating contractual clauses—dimensions that can be configured per profile.

Limitations and ethics

  • Domain validation — effects must be tested on real indicators (win rate, cycle length, CSAT), not only on RMSE/CTR.

  • Privacy and transparency — psychological profiles are sensitive; consent, data minimization, and bias audits are required (especially when the profile affects price/terms).

  • Stability of inference — models must report uncertainty (e.g., short linguistic sample length).

  • Boundaries of influence — personalization should help the decision, not manipulate beyond the client’s interest.

Summary

Combining the OCEAN/HEXACO profile with matrix factorization + embeddings + a vector database creates a coherent, scientifically documented foundation for automatically selecting offer attributes in luxury markets (yachts, residences, cars). We have evidence for the effectiveness of psychological matching at the level of purchasing behaviors, mature techniques for tailoring actions to personality, and a body of literature linking traits with decisions concerning luxury goods and authenticity.


Strategy as the Choice of a Specific Direction of Development DEA-CCR


Strategy is a conscious choice of a specific direction of development that allows a company to focus on defined goals and use available resources effectively. The essence of strategy lies in deciding which actions and areas of activity are priority and which should be limited or eliminated. Thanks to this, a company can concentrate on perfecting the chosen path, minimizing operational chaos and increasing its competitiveness.

In Poland, many bakeries use a mixed model in which part of the products are sold to external stores (wholesale or retail), part goes to supermarkets, and the remaining portion is sold through a network of their own brand stores.

At first glance, such a model seems flexible and opens many distribution channels, but in practice it often leads to operational inefficiency. The reasons include:

Divergent requirements. Each of these sales channels requires a different approach. For example, working with supermarkets demands large volumes at low margins, whereas sales in own stores involve maintaining staff and premises.

Resource dispersion. Managing diverse sales channels requires different resources such as a transport fleet, staff, or marketing strategies.

Increased workload. Combining these activities raises the number of operational processes that must be coordinated, which increases the burden on employees and owners.

Difficulty in optimization. By focusing on many directions at once, the company struggles to achieve maximum efficiency in each of them.

Choosing one coherent strategy is a key element of effective management and long-term success. It allows the company to focus on perfecting specific processes and actions, which improves their quality and efficiency. With a single strategy, the company can optimize the use of resources such as people, finances, or infrastructure, eliminating unnecessary costs. Finally, the most important argument—specializing in one area of activity enables building a competitive advantage, e.g., through better product quality, faster customer service, or lower operating costs. In short, a unified strategy reduces operational complexity, making planning, control, and performance monitoring easier.


Strategic Planning

Let us imagine a situation in which a bakery has grown significantly. From its profits, the owner has built a large plant for the production of bread and pastries. At a certain stage of development, they must decide which strategy to adopt. There are four options.

Strategy 1: Cooperation with supermarkets

This strategy consists of supplying large quantities of bread at very low margin. Supermarkets impose the size and type of production, and settlements with the entrepreneur take place with a substantial delay. In addition, the entrepreneur must invest in a fleet of vehicles to enable long-distance transport. This requires hiring drivers and vehicle maintenance staff. Although the margin in this model is very low, it provides stable income.

Strategy 2: Creating a network of own stores

The owner can open a network of their own shops in nearby towns and villages. These shops would be rented, which entails leasing costs. It is also necessary to create a vehicle fleet and hire drivers and mechanics. Additionally, store staff must be maintained as part of the bakery’s workforce. In this model, planning and the development of sales strategies are required. The margin here is higher than in cooperation with supermarkets, but managing a network of stores is more labor-intensive and involves greater instability (e.g., due to lease agreements or staffing).

Strategy 3: Cooperation with external stores

The bakery supplies bread to external stores whose staff are not bakery employees. These stores accept the goods, and if they do not sell the bread, they return it to the producer. The bakery owner must have a fleet of vehicles and hire drivers and technical personnel to service the fleet. In this model, sales are planned by the external stores, and the bakery merely fulfills their orders. The profitability of this model is higher than cooperation with supermarkets, but lower than in the case of running own stores. However, the model is characterized by high stability.

Strategy 4: Wholesale or sales through intermediaries

The bread producer manufactures goods that are then put up for sale. Buyers come to the receiving hall, where they purchase products based on daily market prices or auctions. A similar model existed at Poznański’s Manufaktura in Łódź, where traders came from all over the country to buy large batches of goods.

In our case, a model of sales through intermediaries can be considered. The bakery signs contracts with intermediaries who collect the bread and distribute it further. In this model, the bakery does not have to employ store staff, drivers, or maintain a vehicle fleet. Production is based on orders from regular customers, which limits the need for sales planning. Profits are smaller, but the model is simpler to manage and less demanding operationally.

Each of these strategies has its advantages and disadvantages, so the choice should be well thought out, taking into account the bakery owner’s capabilities and goals.

People and Their Choices…

Now comes the moment of choosing a strategy. The wholesale vision appeals to me the most. I imagine myself sitting in a leather, tufted armchair, cigar in hand, watching from a high, glass-walled office as intermediaries’ trucks pull up to the loading dock. Everything is under my control: flour in the warehouses, employees under the watchful eye of cameras, and industrial robots precisely performing repetitive tasks. The hall is spotlessly clean, with perfect order and harmony. At any moment I can increase production or introduce some extraordinary innovation. I feel like Izrael Kalmanowicz Poznański in his 19th-century factory in Łódź. Why did I choose the “Wholesale” model? Because I toured that factory and I liked it.

Unfortunately, pragmatism can be merciless. What the heart tells us is not always the most effective solution. In other words, sometimes, by making a bit more effort, we can achieve twice or even three times the profit. And here lies the problem: human nature—with its baggage of emotions, experiences, and fears—can hinder rational decision-making.

In such moments, mathematics comes to the rescue. It allows us to set emotions aside and look at the problem with a cool, analytical eye, enabling the choice of a strategy that will bring maximum benefits. Thanks to this, instead of relying solely on dreams, one can choose a solution that ensures real efficiency and stability in the longer term.

Mathematics, i.e., DEA-CCR

The DEA-CCR method was developed in 1978 as a tool for assessing the efficiency of decision-making units. In short, it serves to compare objects or organizations that seem incomparable. The main evaluation criterion is efficiency in achieving a defined goal. For example, we can compare four different vehicles: a tractor, a bus, a scooter, and a passenger car. At first glance, these means of transport appear completely incomparable because each serves a different function and has different applications. However, thanks to a mathematical approach, we can assess which of these vehicles is the most optimal under given conditions and for accomplishing specific objectives. Such a comparison takes into account key parameters such as fuel consumption, capacity, speed, and operating cost, which allows us to choose the vehicle best suited to the task.

DEA-CCR consists of calculating the relative efficiency of decision-making units by comparing their inputs and outputs. For each unit, the method forms an efficiency frontier (the “frontier”), i.e., a set of units that make the best use of their resources. Units on this frontier are considered efficient, while those outside it are inefficient. Efficiency is expressed as the ratio of a virtual sum of outputs to a virtual sum of inputs, where the weights are chosen to maximize the result for each unit.

What makes DEA-CCR unique is its ability to account for many different input and output variables without the need to simplify or aggregate them.

In the context of choosing a business strategy, DEA-CCR is particularly useful because it allows evaluating the efficiency of different options in a multidimensional way. For example, in analyzing a bakery’s strategies, the method can include various aspects such as operating costs, labor intensity, receivables turnover, or profitability. Each strategy is evaluated against the others, which helps identify which one uses available resources best to maximize results.

DEA-CCR eliminates subjectivity in the strategy selection process because it automatically determines optimal weights for each unit’s inputs and outputs. Thanks to this, it not only points to the most efficient strategies but also indicates what changes should be introduced in less efficient options to improve their own efficiency.

Choosing a Strategy Using DEA-CCR

For the purposes of this example, I prepared a set of decision variables that may play a key role when choosing a strategy. These factors are presented in the table.

For turnover measured in cycles, a higher number of cycles per month (or year) is preferable. This means the company converts receivables into cash more quickly, which translates into better financial liquidity and more efficient working-capital management. Accordingly, in the table “receivables turnover (cycles)” remains in a form where higher values are better. The “level of calm” is an indicator that describes the degree of harmony and stability in the company’s functioning. It is the inverse of the level of stress, where higher values mean less operational tension and greater ease in managing business processes. The level of calm can be understood as a measure of operational comfort that enables the company to function more smoothly without excessive disruption.

Here is the task implemented in Python.

STEP 1: Define inputs and outputs.
• Inputs: operational factors that require resources (labor intensity, number of office employees, costs).
• Outputs: business effects, including both positive (profitability, stability) and negative (stress).

STEP 2: Prepare the data. Negative outputs (e.g., stress) are transformed into negative values so that they are included in the analysis as “undesirable.”

STEP 3: Implement the DEA-CCR model. The dea_efficiency function calculates efficiency for each strategy by comparing ratios of inputs and outputs.

Parameters

Parameter | Supermarkets | Own stores | External stores | Wholesale
Inputs
Labor intensity | 0.9 | 0.8 | 0.6 | 0.1
Number of office employees | 10.0 | 20.0 | 15.0 | 2.0
Fuel costs (PLN) | 15,000 | 12,000 | 10,000 | 5,000
Total costs (PLN) | 500,000 | 700,000 | 450,000 | 300,000
Outputs
Sales profitability (ROS) | 0.05 | 0.10 | 0.15 | 0.08
Level of calm | 0.9 | 0.8 | 0.6 | 0.3
Level of stability | 0.5 | 0.7 | 0.9 | 0.8
Receivables turnover (cycles) | 5.0 | 10.0 | 8.0 | 12.0

STEP 4: Compute efficiency. We call the function and calculate the efficiency of each strategy.
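Since the original dea_efficiency code is not reproduced here, below is a hedged Python sketch of an input-oriented CCR model in the multiplier form, solved with scipy.optimize.linprog on the table above. It illustrates the mechanics only; the efficiency scores quoted in the summary come from the author’s own implementation and need not be reproduced exactly by this sketch.

import numpy as np
from scipy.optimize import linprog

strategies = ["Supermarkets", "Own stores", "External stores", "Wholesale"]
# Inputs: labour intensity, office employees, fuel costs, total costs
X = np.array([
    [0.9, 10.0, 15_000, 500_000],
    [0.8, 20.0, 12_000, 700_000],
    [0.6, 15.0, 10_000, 450_000],
    [0.1,  2.0,  5_000, 300_000],
])
# Outputs: ROS, level of calm, level of stability, receivables turnover
Y = np.array([
    [0.05, 0.9, 0.5,  5.0],
    [0.10, 0.8, 0.7, 10.0],
    [0.15, 0.6, 0.9,  8.0],
    [0.08, 0.3, 0.8, 12.0],
])

def dea_efficiency(X, Y):
    """CCR efficiency (multiplier form) for each decision-making unit."""
    n, m = X.shape            # n units, m inputs
    s = Y.shape[1]            # s outputs
    scores = []
    for o in range(n):
        c = np.concatenate([-Y[o], np.zeros(m)])    # maximise weighted outputs of unit o
        A_ub = np.hstack([Y, -X])                   # weighted outputs <= weighted inputs, all units
        b_ub = np.zeros(n)
        A_eq = np.concatenate([np.zeros(s), X[o]]).reshape(1, -1)   # weighted inputs of unit o = 1
        b_eq = np.array([1.0])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (s + m), method="highs")
        scores.append(-res.fun)
    return scores

for name, eff in zip(strategies, dea_efficiency(X, Y)):
    print(f"{name}: {eff:.2f}")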

Result

Summary

Based on the DEA-CCR analysis results for the bakery strategies, the following conclusions can be drawn for the owner of a plant producing pastries and bread.

1. “Supermarkets” strategy
Efficiency: high (e.g., 0.79 compared to other strategies).
Conclusions:
• This strategy is stable despite the low margin. Cooperation with supermarkets ensures regular orders but requires large investments in a vehicle fleet and the hiring of drivers and technical service.
• The stress associated with this strategy is high, which may affect company management in the long run.
• It is worth considering as a primary option if the priority is stability with limited possibilities for sales planning.

2. “Own stores” strategy
Efficiency: average (e.g., 0.71).
Conclusions:
• The strategy provides higher profits (higher sales profitability) but is more complicated to manage.
• It requires high operating costs such as renting premises, maintaining a vehicle fleet, and store staffing.
• The instability of store leases and labor intensity can be challenging; therefore, this strategy should be chosen by an owner experienced in managing retail networks.

3. “External stores” strategy
Efficiency: relatively low (e.g., 0.56).
Conclusions:
• Although this strategy is more profitable than cooperation with supermarkets, its efficiency is limited by the need to maintain a vehicle fleet.
• This model does not require hiring store staff, which reduces managerial burdens.
• It may be a good solution if the bakery has limited human resources, but in the longer term it may be less competitive.

4. “Wholesale” strategy
Efficiency: lowest (e.g., 0.18).
Conclusions:
• This strategy requires the fewest resources (employees, fuel, and planning), but its efficiency is low due to limited margins and smaller revenues.
• It is an option for an owner who wants to simplify operations and reduce operating costs but must accept lower profits.
• It can be useful as a complement to more complex strategies or in crisis situations when the company wants to minimize risk and costs.

The owner should choose the strategy that best fits their resources (financial and staffing) and long-term business goals.

This is, of course, only an example of applying the algorithm, and the data were selected at random, based on the author’s limited experience. This analysis should not constitute the basis for making actual business decisions. Nevertheless, the algorithm can easily be adapted by substituting real data on variables and decision factors that matter when choosing a strategy.

The strategy model was written in a universal way, so one only needs to enter their own data. It is worth remembering that business decisions should never be made under the influence of emotions.

The analysis showed that my vision of being a great industrial producer, controlling every aspect of operations, turned out to be the least efficient. Had I chosen this strategy under the assumptions formulated in the task, I would have incurred high opportunity costs. In the longer term, this could lead to a loss of the company’s financial liquidity, threatening its stability and future.

This example is a reminder of how important it is to make decisions based on reliable data analysis, not merely on subjective beliefs or visions.

Wojciech Moszczyński

Wojciech Moszczyński—graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń; specialist in econometrics, finance, data science, and management accounting. He specializes in optimizing production and logistics processes. He conducts research in the development and application of artificial intelligence. For years, he has been engaged in popularizing machine learning and data science in business environments.

Factorization Machines – how a baker can build an effective recommendation system based on client log data


Building a Recommendation System (Part 3)

Contemporary online shops—even small ones run by local bakers—can effectively use customer data to personalize offers. One of the most effective approaches to creating product recommendations is to apply the Factorization Machines (FM) method.


Below, I will explain how to build such a system step by step, show the differences between FM and the classic PCA + k-means approach, and present an example of how such personalization can translate into increased sales.

Collecting customer data

In the previous part, I explained how to collect information about your customers. First and foremost, to perform any analysis, we must have a database concerning our customers. We are analyzing a confectionery store that operates online. Each customer, in order to make a purchase, must first log in. In the previous publication, I pointed out that the owner has their own website and direct access to the logs. Logs are simple pieces of information generated by the system that record customers’ actions. The first log records that the customer logged into the store using a password and their own login. Another log records what the customer typed into the search box and which sections of the website they entered. From the logs, we can learn how long the customer took to make a purchase, whether they hesitated, or whether they abandoned the purchase. System logs are a treasure trove of information. In the previous publication, I explained how to place information from logs into a table. This table is a database describing the characteristic features and behavior of each individual customer. In this part, I would like to show how, having 30 characteristic features for around 100,000 customers of our store, to create a recommendation system that will select appropriate recommendations individually for each customer.

In our example, based on the customer’s activity logs, we extracted so-called descriptive features, e.g.:

 
Customer ID | Average time to purchase | … | … | … | …
1001 | short | 5 | 120 PLN | evenings | 2635
1002 | long | 25 | 80 PLN | mornings | 1825
1003 | medium | 10 | 200 PLN | weekends | 3545

In practice, there may be 30 or more such features, including, for example, preferred flavors, purchase frequency, favorite days of the week, etc.


 

Building a recommendation system with Factorization Machines

Factorization Machines is a supervised learning model (regression/classification) that extends ordinary linear regression with the ability to automatically learn interactions between pairs of features, even with very high data dimensionality (30 or more features). FM efficiently models dependencies between features thanks to so-called hidden factors (embeddings) instead of explicitly creating all combinations of visible and invisible features. This allows it to capture patterns that a traditional linear model would miss, while avoiding the explosion in the number of parameters (the curse of dimensionality). FM became popular in recommendation systems and click-through prediction precisely because of this ability to model interactions in high-dimensional data. The FM model thus predicts the probability of clicks and the course of the customer’s purchasing process.
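In formula form, the second-order FM model sketched above can be written as (following Rendle’s standard formulation, with k-dimensional hidden factor vectors v_i):

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j

so every pairwise interaction is parameterised through the dot product of two low-dimensional factor vectors rather than through a separate weight, which is what keeps the number of parameters manageable in high-dimensional, sparse data.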

On the basis of these data, one can train a Factorization Machines model which:
• recognizes hidden patterns in customer preferences,
• does not require manual creation of feature combinations,
• deals effectively with sparse data (e.g., a customer bought a cake only once).

The model learns to predict which products are most likely to be purchased by a specific customer. It performs better than simple rules like “who bought A will buy B” because it accounts for many subtle interactions between features. It can see what is visible and what is not visible.

For example, Customer ID: 1002 has a high propensity for discounts and usually shops in the morning. FM will learn that cream cakes on Monday mornings with a discount code have a higher chance of being purchased by this customer. Based on the model’s prediction of a higher probability of such behavior (customer ID: 1002 – morning + promotion), the system can automatically suggest morning promotions for products, and even prepare a personalized promotion just for them.

A typical recommendation system based on FM can take into account, among others: user features (demographics, behavioral segment, even hundreds of features such as gender, risk tendencies, or various types of phobias), product features (category, price, popularity, etc.), and context (time of day, device, channel, etc.). FM efficiently computes interactions among these features. It discovers hidden interactions that a linear model would never capture without a manually defined interaction. Thanks to such automatic feature combination, FM works well for tasks with very sparse data, such as product recommendations or click-through prediction.

Matrix Factorization

To better understand Factorization Machines, let’s start with its first version, i.e., Matrix Factorization. Let’s learn by doing. We want to create a recommendation system for baked goods (e.g., rolls, cakes, bread) that suggests to each customer what else they might like—based on the preferences of other similar customers.

We have 5 customers and 5 products:

“OK” means the customer likes it, “No-OK” means they don’t like it, while “n/a” is simply missing data.

Instead of guessing “who likes what,” Matrix Factorization:
• learns hidden customer features (e.g., prefers sweet, soft, healthy, poppy-seeded…),
• learns hidden product features (e.g., sweetness, hardness, fiber content…).

As a result, in our example MF allows us to say: “Customer A likes sweet pastries and doesn’t like heavy cakes → they will probably enjoy a croissant, because other customers with a similar profile bought it and praised it.”

For each customer and product, it computes a match score—how many points a given customer would give to a given baked good even if they haven’t tried it yet. It thus looks for similarities in customers’ preferences relative to products. The notion of points does not come out of nowhere.

Matrix Factorization was first applied at scale in the Netflix competition. It wasn’t invented by Netflix, but it was popularized thanks to the Netflix Prize announced in 2006, with a $1 million award for improving the accuracy of their recommendation system by 10%.

Thanks to this approach, the system was able to learn subtle customer preferences and similarities between films. From that moment on, MF became one of the foundations of modern recommendation systems.
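
For readers who want to see the mechanics, here is a minimal sketch of plain matrix factorization trained by stochastic gradient descent on a tiny customer-by-product preference matrix. The ratings, the two hidden dimensions, and the learning settings are invented for the example; a real system would use a dedicated library and far more data.

import numpy as np

# Tiny customer x product preference matrix: 1 = "OK", 0 = "No-OK", np.nan = missing ("n/a")
R = np.array([
    [1, np.nan, 1, 0, np.nan],
    [np.nan, 1, 0, 1, 1],
    [1, 1, np.nan, np.nan, 0],
    [0, np.nan, 1, 1, np.nan],
    [np.nan, 0, np.nan, 1, 1],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(42)
P = rng.normal(scale=0.1, size=(n_users, k))   # hidden customer features
Q = rng.normal(scale=0.1, size=(n_items, k))   # hidden product features

lr, reg = 0.05, 0.01
for _ in range(2000):
    for u in range(n_users):
        for i in range(n_items):
            if np.isnan(R[u, i]):
                continue                       # learn only from observed preferences
            p_u = P[u].copy()
            err = R[u, i] - p_u @ Q[i]
            P[u] += lr * (err * Q[i] - reg * p_u)
            Q[i] += lr * (err * p_u - reg * Q[i])

# Predicted "match scores" for every customer-product pair, including the missing ones
print(np.round(P @ Q.T, 2))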

“Rich data”: how Matrix Factorization differs from Factorization Machines

Subsequent variants of Matrix Factorization, such as Factorization Machines, use so-called rich data. This consists of augmenting classical collaborative filtering with additional information:
• customer data: age, purchase time, city, dietary preferences,
• product data: type of flour, caloric value, ingredients, allergens,
• context data: online vs. offline purchase, time of day, promotions.

We append such features to the matrix, and instead of pure Matrix Factorization (from “Netflix”), we use, for example, Factorization Machines or DeepFM, which learn interactions among these features. This can be achieved using Python libraries: lightFM, xlearn, scikit-surprise, fastFM. With respect to cloud solutions, we can apply: AWS: Amazon Personalize; GCP: Recommendations AI; Azure: Azure ML + LightGBM + AutoML.
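
For orientation, a hedged sketch of how such “rich data” can be passed to LightFM, one of the libraries listed above. The customer and product identifiers and the feature tags are hypothetical; in a real project they would come from the log table described earlier.

from lightfm import LightFM
from lightfm.data import Dataset

# Hypothetical interactions and side features for a small confectionery store
interactions_raw = [("1001", "croissant"), ("1002", "cream_cake"), ("1003", "meringue")]
user_features_raw = [("1001", ["time:evenings", "basket:high"]),
                     ("1002", ["time:mornings", "likes_discounts"]),
                     ("1003", ["time:weekends", "basket:high"])]
item_features_raw = [("croissant", ["sweet", "buttery"]),
                     ("cream_cake", ["sweet", "cream"]),
                     ("meringue", ["sweet", "light"])]

dataset = Dataset()
dataset.fit(users=[u for u, _ in user_features_raw],
            items=[i for i, _ in item_features_raw],
            user_features=[f for _, fs in user_features_raw for f in fs],
            item_features=[f for _, fs in item_features_raw for f in fs])

(interactions, _) = dataset.build_interactions(interactions_raw)
user_features = dataset.build_user_features(user_features_raw)
item_features = dataset.build_item_features(item_features_raw)

model = LightFM(no_components=10, loss="warp")   # 10 latent factors
model.fit(interactions, user_features=user_features,
          item_features=item_features, epochs=20)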

It is worth remembering that a Factorization Machines system learns on its own from customers’ purchases—the more data, the better the recommendations.

What does the recommendation system look like for each customer?

We can use a trained FM model to recommend products. In offline (batch) mode, this is usually done periodically.

For each user (or for active users) we generate predictions for specific products that may interest them. Because exhaustive checking of all user–product combinations can be costly with a very large assortment, in practice it is limited to popular products, products from a given category, new items, etc. The FM model computes the probability for each individual customer for this smaller set of products. Of course, for a small cake shop, the assortment is small enough that one can compute scores for all products. We sort products by predicted probability of purchase (or preference score) and select the top-10 for each user. These generated recommendation lists can be saved, for example, in a database. For instance, each week we update the list of recommended cakes for each customer based on the latest data.
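
A small sketch of that batch step, assuming a model trained as in the LightFM example above: score a limited candidate set for each user, keep the top N, and store the result. The user and item indices in the commented call are placeholders.

import numpy as np

def top_n_for_users(model, user_ids, candidate_items, user_features, item_features, n=10):
    """Return, for each user index, the candidate item indices with the highest predicted scores."""
    item_array = np.asarray(candidate_items, dtype=np.int32)
    recommendations = {}
    for u in user_ids:
        scores = model.predict(u, item_array,
                               user_features=user_features, item_features=item_features)
        best = np.argsort(-scores)[:n]           # positions of the n largest scores
        recommendations[u] = item_array[best].tolist()
    return recommendations

# Example weekly batch run (indices as used internally by the LightFM Dataset mapping):
# weekly_lists = top_n_for_users(model, user_ids=[0, 1, 2], candidate_items=[0, 1, 2],
#                                user_features=user_features, item_features=item_features, n=2)

The resulting dictionary can then be written to a database table and refreshed, for example, once a week.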

Using the recommendations: in the application (e.g., the online store), after the user logs in, we can display their personalized list of recommended products from this prepared list. In the batch system, this list remains fixed until the next update (e.g., until the next day).

Comparing Factorization Machines with PCA + k-means

Some time ago, I wrote at length about the excellence of the method of assigning customers to purchasing clusters; I wrote about the PCA + k-means method.

This method consisted of collecting information on customers’ behavior in the online store: how often they visited the store, how much they spent on purchases, how long they took to make decisions. Then I added statistical data such as age, place of residence, type of business. Running k-means directly on, say, 40 features is hard to interpret, and the distances it relies on become less informative, so in practice the segmentation is built and explained on two, at most three derived dimensions. Therefore, I ran PCA—a method that compresses the customer features into typically 2–3 bundles, so-called principal components, which summarize the original features of each customer. This recommendation method is by now outdated. I mention it because many of my publications were devoted to it.

The classical segmentation approach PCA + k-means consists of:
• reducing the dimensionality of features (PCA),
• clustering customers (e.g., into 5 groups),
• assigning an offer, i.e., a recommendation, to each of the 5 groups.

It was a limited recommendation method because everyone in a segment received the same thing, and after all, even very similar people differ from one another. The method did not account for changing preferences. There was also no learning based on outcomes, no feedback information on whether the customer actually bought.
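
For comparison, a minimal scikit-learn sketch of the classical pipeline described above (standardize, reduce to two principal components, cluster into five segments). The feature matrix is generated synthetically here; in practice it would hold the roughly 40 behavioral and demographic features per customer.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a customer table with ~40 behavioral and demographic features
X = np.random.default_rng(0).normal(size=(1000, 40))

pipeline = make_pipeline(
    StandardScaler(),                                  # put all features on a comparable scale
    PCA(n_components=2),                               # compress 40 features into 2 principal components
    KMeans(n_clusters=5, n_init=10, random_state=0),   # divide customers into 5 segments
)
segments = pipeline.fit_predict(X)                     # one segment label per customer
print(np.bincount(segments))                           # how many customers fell into each segment

Every customer in the same segment would then receive the same offer, which is exactly the limitation discussed above.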

Why is Factorization Machines better than PCA + k-means?

There are three reasons:
• FM learns directly on purchase transaction data, so it reacts faster,
• it delivers personalized, not group, recommendations—after all, even similar people differ,
• FM handles new products and customers better.

Factorization Machines is not a clustering system

Is the Factorization Machines method (and its extension DeepFM) a special way of clustering customers based on numerous features? And if it is not a clustering system, how does it differ from, for example, segmentation using PCA + k-means? It must be emphasized that FM is a predictive model, not a clustering method. PCA (principal component analysis) is used for dimensionality reduction/structure discovery in an unsupervised way, and k-means clustering groups similar objects without information about any target variable. By contrast, FM is a supervised algorithm—we train it to predict a specific target (e.g., whether a user will buy a product or not). It is thus a predictive system.

Unlike PCA, which creates linear combinations of features, FM learns nonlinear patterns thanks to hidden factors. This means that FM does not so much assign customers to static clusters as it is a learning model that allows one to forecast behavior (e.g., propensity to buy a given product) based on combinations of features (the “rich” features I mentioned earlier). Of course, during training FM learns hidden vectors for customers—these can be treated as embeddings in a space of hidden factors (the latent factors mentioned earlier). Customers with similar vectors (which bundle together various features) will behave similarly, which resembles customer clusters, but these representations are learned for a specific criterion (e.g., maximizing recommendation accuracy), not a purely statistical one as in PCA.

For marketing segmentation one could use PCA + k-means clustering on 30 features, but such a division does not necessarily maximize conversion. FM, on the other hand, will learn hidden dependencies (latent factors) that group customers in a way that is optimal from the perspective of recommendations (e.g., it will reveal which customers respond to discount codes and which to delivery speed), which has greater value for creating a personalized offer generated by the recommendation system.

Hidden factors

In the Factorization Machines method, hidden factors (latent factors) are hidden features (i.e., not directly visible in the data) that represent the properties of objects and features, allowing the model to capture interactions among them even when the data are sparse.

FM extends the classic linear regression model by adding second-order interactions between variables, which are modeled using hidden vectors (latent vectors). Each variable is assigned such a vector—i.e., the latent factors.
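
Written out (this is the standard second-order FM formulation introduced by Rendle, stated here for reference), the prediction combines a bias, a linear part, and pairwise interactions expressed through dot products of the latent vectors:

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j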

If we have user A and the product “Meringue Cake,” then instead of estimating a separate linear-model weight for the interaction “user A – product ‘Meringue Cake,’” FM assigns latent vectors to the user and the product (e.g., of dimension 10) that encode, for example, the “user’s taste” and the “product’s characteristics”—that is, a vector of, in this case, 10 hidden features describing the user or the product. These 10 hidden dimensions are the latent factors (the “rich data” mentioned earlier are something different: additional input variables fed into the model alongside the basic ones). The model learns how well these 10-dimensional vectors fit one another (through their dot product), which constitutes the prediction of the interaction (e.g., product rating, click, purchase).

Latent factors make it possible to reconstruct hidden links among elements—for example, that users who like “caramel” often also like “meringues” (even if there is no direct data connection for this). Products with similar latent factors are perceived similarly by users.

Finally, an important note: the dot product can be used for different purposes, and the concept of latent vectors can differ across contexts.

FM: User A has a latent vector; Product B has another latent vector. The model computes the dot product of these vectors and predicts, for example, the product rating or the probability of purchase.

PCA: From a set of features (e.g., age, gender, number of purchases) PCA creates new variables, PC1, PC2, …, which are combinations of the original features and explain as much variance as possible—used e.g. for visualization or data simplification.

Latent vectors in FM are not the same as principal components in PCA, although both concepts relate to the hidden structure of data and can lead to dimensionality reduction.

Latent vectors are learned and optimized for prediction. PCA creates a statistical projection without knowledge of the target.

Summary

A craftsman running an online store, equipped with customer data, can build an effective recommendation system using FM. This approach is more precise than classic customer grouping, and at the same time relatively simple to implement, especially in the batch version. Thanks to this, even a local confectionery can use tools that until recently were reserved for the largest e-commerce platforms.

Wojciech Moszczyński

Wojciech Moszczyński—graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń; specialist in econometrics, finance, data science, and management accounting. He specializes in optimizing production and logistics processes. He conducts research in the field of the development and application of artificial intelligence. For years, he has been engaged in popularizing machine learning and data science in business environments.


Artykuł Factorization Machines – how a baker can build an effective recommendation system based on client log data pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Creating a Grain Purchase Price Forecasting Model, or How the Machine Showed Us Who We Really Are https://sigmaquality.pl/moje-publikacje/creating-a-grain-purchase-price-forecasting-model-or-how-the-machine-showed-us-who-we-really-are/ Thu, 16 Oct 2025 10:52:39 +0000 https://sigmaquality.pl/?p=8771 The digital revolution is already underway, and we have to accept the way it is unfolding. It consists in the fact that key information concerning [...]

Artykuł Creating a Grain Purchase Price Forecasting Model, or How the Machine Showed Us Who We Really Are pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
The digital revolution is already underway, and we have to accept the way it is unfolding: key information concerning a business is increasingly collected and processed in digital form. But what does that actually mean? After all, it’s obvious that accounting records and processes all information digitally. The cash register also processes everything and communicates with the integrated accounting system. Warehouses are run digitally.

Low Digital Integration

In most cases our businesses are well saturated with digital systems. Phones contain call histories, integrated accounting systems communicate with warehouses. Those in turn communicate with logistics systems that issue or receive goods. Generally speaking, there is some integration of systems, but it is neither coherent nor uniform. For example, our mobile phone is not integrated with the ERP system, though it could be. Cars do not have onboard computers integrated with the main computer responsible for the enterprise’s logistics, though such communication between systems could exist. Grain-milling devices or loading scales are usually independent of warehouse systems.

To connect all this we need a uniform nervous system—i.e., a network of digital connections enveloping the entire organization. And that will not exist for some time yet, because it would require close cooperation among various vendors, manufacturers of machines and measuring devices, and software providers—according to some agreed standard. And there is no sign of that for now. I mean a standard like what we have at Google, where a mobile phone, a computer, and a TV can operate under one identity and all of these devices can save history, data, and all kinds of information to a shared database in Google’s cloud. That, however, is a consumer-grade solution. It is hard to find similar solutions in the professional world. The solutions supplied for warehouses, transport, machines, and systems are not integrated with one another—and won’t be for a long time.

[Illustration: W-Moszczynski (4)]

Internet Access

Systems will be able to be integrated freely with each other when all these devices have internet access. Why is that so important? Because internet APIs make it possible to integrate individual elements of a system with clouds. Clouds are not only for data storage and simple calculations. Practically all analytical tools and methods can be found in the clouds. Their use will soon be necessary so that our business does not lose the competitive race.

In the grain-milling industry, a very important element is price speculation at the point of purchase and sale of grain and flour. In other words, the commodities exchange, macroeconomic indicators, the agricultural market, historical weather data, and unforeseen weather phenomena must all be carefully analyzed. Creating more advanced forecasting models requires hiring a very well-educated data analyst who should additionally be a practitioner. Such an expert is quite costly and certainly will not be physically available nearby, so they will work remotely. Since we have no way to supervise them and we are unable to evaluate their work substantively, it would be advisable to cooperate with a company specializing in forecasts—and here the difficult process begins again: finding such a company and establishing cooperation with it.

In the end, we would have to pay more than for a single data analyst, and we still would not be able to control their work.

The Cloud

The cloud differs from a data analyst in that it is predictable and it’s rather hard to expect that it will cheat us on working hours or go on sick leave. The cloud also has a kind of professional experience, because its systems are based on large language models like ChatGPT. In other words, cooperation with the cloud reduces most of the problems associated with cooperation with a human. First: using the cloud costs a fraction of an analyst’s price. Second: if we start treating the cloud like a very advanced calculator, we ourselves can become experts in forecasts and building forecasting models. In the cloud we can optimize warehouse size, analyze the movement of the flour-truck fleet, or check the efficiency of human resource utilization in production; we can compare everything, draw conclusions, and make decisions based on well-processed data.

Artificial Intelligence Will Destroy the Human Race, for Sure!

There has long been a discussion about what digitization actually is. What is this famous artificial intelligence that is supposed to enter our lives and turn us into not-very-bright people for whom machines do the thinking? Is there anything to fear? Usually after such an introduction, some laconic deliberations begin—more or less successful predictions of the future, scaremongering or raising hopes. Whenever something new and groundbreaking appears, there is no shortage of panic-mongers and critics. From today’s perspective, people’s statements from 100 years ago seem laughable.

So as not to speak without examples, I will cite a few instances of how people reacted to the emergence of the railways.

“Traveling at such a speed is impossible and will certainly cause illness.” — Dr. Dionysius Lardner, professor at the University of London, 1830. Back then, “high speed” was considered 30–40 km/h.
“The railway will be the end of civilization. It will destroy family, religion, and morals.” — The Quarterly Review (a conservative British periodical), ca. 1825.
“Traveling at such a speed can cause madness. People are not made for such a pace.” — an anonymous physician quoted in British newspapers in the 1830s. Many doctors warned that the rapid movement of the landscape could lead to disorientation or even insanity. They spoke of dizzying speeds on the order of 40 km/h.
“When people travel at 15 miles per hour, the blood will boil in their veins and they will die.” — Thomas Tredgold, engineer and early railway author, 1825.
“Horses will be useless. Many professions will collapse. This is a threat to the social order.” — a statement by a representative of the London cab-drivers’ guild, 1830s–1840s.

In other words, the fear—then as now—of job losses.

We have plenty of similar statements regarding the introduction of airplanes, electricity, and telephones. How could telephones destroy the human race?

“Telephone conversations will be demoralizing—women will speak with men they do not know!” — letter to the editor of the New York Times, 1880.
“This device disturbs the peace of the home and destroys family bonds.” — a pastor from Massachusetts, 1881.
“The telephone is too complicated to be useful to ordinary people.” — Western Union internal memo, 1876.
“Who would want to talk to someone they cannot see?” — comment attributed to a London aristocrat, late 19th century.

And electricity? How frightening it was!

“If we allow the widespread use of electricity, people will die by the hundreds. Alternating current is murder.” — ca. 1889.
“Electric lamps will blind people and ruin their eyesight. Gas light is gentler and safer.” — an anonymous physician quoted in The Times, London, ca. 1882.

Let’s imagine that we are a person who has never seen a railway. We have read various Cassandra-like statements spread by experts. How can we be convinced there’s nothing to fear? A newspaper in which other, better experts praise the invention will not do the job. You simply have to take a ride on that “terrible” railway. I am sure that everyone who rode the railway racing at the dizzying speed of 40 km/h understood how nice it is to get somewhere far away without effort and at breathtaking speed.

Now let us take a ride on the “railway of our time.” In the end, we will understand what our weakness is and why this truth can set us free.

A Grain Purchase Price Forecast Made by Artificial Intelligence

Today everyone knows artificial intelligence that you can chat with—ask who invented the flush toilet or have it write an article for us or draw a picture. It’s like an expanded Internet, only you can do everything faster and better. Along the way, professions such as graphic designer, editor, or language teacher disappear from the market. However, ChatGPT in this version is no great breakthrough.

Real business needs very concrete things done massively at large scale. For example: analysis of surveillance video, processing gigantic amounts of data and creating forecasting systems that predict events, autonomous vehicle control, or organizing huge numbers of scanned documents. So let’s do one such example—not to learn how to do it, but to “take a ride on the railway.”

To build (in 15 minutes) a grain purchase price forecasting model, you need to click around a bit in the cloud.

It Was Not Always This Easy

In the 1990s, spreadsheets appeared with an option to create trends. Back then, building a forecast took a moment; unfortunately, they were unreliable and weak. In the second decade of the 21st century, the possibility of programming in Python and R appeared. Libraries emerged that enabled the creation of forecasting models. They were simply ready-made instructions to execute. You didn’t have to write the entire model algorithm from scratch.

In practice, when someone wanted a forecasting model for their sizeable company (e.g., a grain elevator) placed in the cloud, they hired a team of experts—data analysts (a “project team”)—who would spend months creating such a model. This is still how most such undertakings are carried out. I think companies implementing these projects deliberately do it the old way, because it’s profitable. Sometimes it’s not worth enlightening the client.

Welcome Aboard the Railway

Below are a few slides showing how this process looks in practice. I included the slides not so that everyone learns how to do it, but to make it clear that building a forecasting model is a matter of a dozen or so clicks in the cloud. For the experiment I chose GCP (Google Cloud Platform), one of the most popular on the market. Using this cloud is free; anyone can log in. After exceeding certain limits, the cloud starts charging fees—very low ones anyway.

Step 1. Log in to the cloud, create a project. In the slide this is the project “Wojtek1.” Choose the “BigQuery” tab.

Step 2. Upload data. The cloud needs them to create a model. Pasting the data is another few clicks.

Our data table contains 48 rows and looks like this:

We see a column with years and months; we have three grain elevator locations: Miłkowo, Połoć, and Stare Wolice. We added columns with temperature and precipitation levels. We have to paste this table into the cloud.

Step 3. Prepare the query.
This step may seem difficult if someone cannot write SQL. Currently the cloud helps us with artificial intelligence. I will not describe how to talk to it and ask for help. Artificial intelligence writes the query for us.
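
Purely for orientation, a hedged sketch of what such a generated BigQuery ML query might look like, wrapped here in the Python client so it can also be run outside the console. The project, dataset, table, and column names (wojtek1, grain, purchases, purchase_price) are made up for the example.

from google.cloud import bigquery

client = bigquery.Client(project="wojtek1")   # hypothetical project name

# BigQuery ML: a single CREATE MODEL statement trains a linear regression on the pasted table
sql = """
CREATE OR REPLACE MODEL `wojtek1.grain.price_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['purchase_price']) AS
SELECT year, month, location, avg_temperature, precipitation, purchase_price
FROM `wojtek1.grain.purchases`
"""
client.query(sql).result()   # pasting the SQL into the console and pressing Enter does the same job

# Model quality metrics (R-squared appears in the output as r2_score)
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `wojtek1.grain.price_model`)").result():
    print(row)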

We press Enter—and we have a model.

We get a beautiful, hopeless model that is useless.

The Role of the Human in the Process

With this example, I wanted to show that using artificial intelligence relieves us of tedious work while simultaneously forcing us to make an intellectual effort. And that is the new quality. We no longer have to laboriously program, create queries, search through libraries, validate, and generally toil. New times require us to think, and we cannot excuse ourselves by saying we are not mathematicians or professional data analysts. Now everyone will have to think!

So what model did we get?

R-squared (R²) = 0.0979 — a very low value → it means the model explains only ~9.8% of the variability in the grain purchase price.

So we ask artificial intelligence what we can improve:

  1. Add important features (feature engineering):
    • e.g., baseline_monthly_trend, _in_harvest_peak, rainy_days
    • encode the elevator location (“base”) as a categorical feature (one-hot encoding)

  2. Use a more advanced model (e.g., tree-based):
    • the model trained above in BigQuery ML is simple linear regression;
    • In Vertex AI you can use XGBoost/AutoML Tables, which handle non-linearities better.

  3. More data:
    • one year is too little—we need at least 5–10 years of history;
    • data on inventories, acreage, previous yields, etc.

  4. Check data quality:
    • whether there are random errors;
    • whether variable scales are not dominating the model.

What have we learned so far?

First, the machine did all the work for us—and the result turned out to be hopeless. If we had done it “by hand,” it would have been just as bad, only it would have taken longer. The machine told us what to change.

What conclusions follow?

First: anyone can make (click together) a model.
Second: the machine will not clean up our company for us.
Third: from this example, we learned what is important and what is not in the model—that is, we learned something. This last conclusion is quite significant, because as passive observers of the classic path of model creation, lacking experience, we truly would not have learned anything.

What does the machine advise? “Point 1: Add important features (feature engineering).” The idea is that we, experienced millers or brokers on the purchase price market, must devise the features that we should place in our database. This could be the inflation rate or fuel prices, or a consumer optimism index. We can introduce any variable. The variables we introduced in the form of precipitation and temperature turned out to be irrelevant.

I would rather ignore point two proposed by the machine. It doesn’t matter which model you have—if the error is that large in scale, the data are to blame, not the model.

In points three and four, the machine tells us directly that we have a mess in the data. The data may be untrue, falsified, or simply entered incorrectly.

What, specifically, might be messy?

For example, transaction entry dates may be delayed; there may appear a phenomenon typical of high warehouse turnover—“debt shifting.” This consists in the warehouse manipulating volume over time or between warehouses, e.g., to avoid paying VAT. There are many other, non-tax factors that will render our data useless. In our example, the database contained a column with average temperature and average precipitation. These data were recorded at the moment of grain intake. Yet temperature and precipitation primarily concern the period in which the grain grew and was harvested, not the moment when the purchase took place. This is a typical error made by data analysts. The phenomenon is called a shift—i.e., a temporal displacement between a feature and the event it is supposed to explain.
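
A small pandas illustration of that shift: instead of attaching the weather recorded on the day of intake, we look back to an assumed harvest window. The column names, the numbers, and the three-month offset are invented for the example.

import pandas as pd

# Hypothetical monthly table: one row per location and month
df = pd.DataFrame({
    "location": ["Miłkowo"] * 6,
    "month": pd.period_range("2024-05", periods=6, freq="M"),
    "avg_temperature": [14.2, 18.1, 20.5, 19.8, 15.0, 10.3],
    "purchase_price": [890, 905, 870, 860, 880, 900],
})

# Wrong: weather measured at the moment of intake says little about the grain itself.
# Better (still simplified): use the weather from the assumed harvest window, e.g. 3 months back.
df["temperature_at_harvest"] = df.groupby("location")["avg_temperature"].shift(3)
print(df)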

A Forceful Conclusion

Did the arrival of the car make people worse and weaker, did they lose the ability to walk? No—people became more technologically advanced. They also acquired many new, astonishing capabilities in carrying heavy things and moving through space. A typical car user, wanting to take advantage of these new superpowers, had to learn to drive a car, had to understand what the oil level is, what coolant is, or how to change a tire. If the car had not appeared, people would still be walking—significantly more slowly.

Today something has appeared again—some astonishing technology. And again we must take a step toward our development. The lack of IT education or the lack of programming skills will not excuse our lack of thinking this time. We must prepare the data; we must understand the business; we must simply start thinking. The example I cited today, relatively simple and intuitive, showed how the machine exposes our weaknesses, our laziness, and our lack of understanding of the business. The data we provided and which seemed decent turned out to be useless.

Artykuł Creating a Grain Purchase Price Forecasting Model, or How the Machine Showed Us Who We Really Are pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Application of the Isolation k-means Method in Quality Control of Food Production https://sigmaquality.pl/moje-publikacje/application-of-the-isolation-k-means-method-in-quality-control-of-food-production/ Thu, 16 Oct 2025 10:45:05 +0000 https://sigmaquality.pl/?p=8765 Modern systems monitoring production processes, especially in the food industry, generate enormous amounts of real-time data. Efficient detection of anomalies — deviations from the norm [...]

Artykuł Application of the Isolation k-means Method in Quality Control of Food Production pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Modern systems monitoring production processes, especially in the food industry, generate enormous amounts of real-time data. Efficient detection of anomalies — deviations from the norm that may indicate machine failures, operator errors, contamination, or other quality threats — is one of the key challenges in ensuring production safety and continuity. In this context, data analysis methods capable of identifying non-standard patterns without prior labeling of data as correct or incorrect become particularly important.


One Promising Method: Isolation k-means

One promising method is Isolation k-means — an approach that combines the advantages of classical k-means clustering with the isolation mechanism characteristic of anomaly detection techniques such as Isolation Forest.

Isolation k-means is a hybrid data analysis method based on two fundamental assumptions.
First, it uses the k-means algorithm to group data into clusters — identifying natural structures and recurring patterns within a dataset.
Second, it analyzes the distance between data points and the centers (centroids) of their assigned clusters — assuming that data points lying far from a cluster center are potential anomalies.

Moreover, the algorithm can apply additional isolation criteria to separate points that are not only distant but also rarely co-occur with other observations.

This method becomes particularly valuable in the food industry, where detecting irregularities in real time can prevent serious problems such as product contamination, improper temperature conditions, or incorrect ingredient proportions. Thanks to its flexibility and ability to operate on unlabeled data, Isolation k-means can be effectively used to monitor production parameters, identify deviations from standard process profiles, and support decision-making by operators and quality engineers.

The following sections describe how the Isolation k-means method can be adapted to time-series data in food production and what benefits its application may bring to real-world quality control systems.


Unsupervised Learning

In data science and machine learning, we distinguish between supervised and unsupervised learning.
In supervised learning, the machine is guided by a human, whereas in unsupervised learning it must learn everything on its own.

Unsupervised learning creates the ability to work with unlabeled data. This means that the algorithm does not need previously prepared labels — information indicating which data are correct (normal) and which are faulty (anomalies).

For example, in classical supervised learning, the data must be labeled beforehand: a temperature of 60 °C is proper, while exceeding 95 °C is too high.

In practice — especially in food production — we often do not have such critical thresholds. It is difficult to pre-define which cases are anomalies, particularly when the production line frequently changes its assortment, leading to very short production batches.

That is precisely why algorithms operating on unlabeled data, that is, according to the principle of unsupervised learning, are flexible and well suited to this type of production environment. These algorithms learn patterns directly from the data, independently identifying what is “typical” and what deviates significantly from it.


[Illustration: W-MOSZCZYNSKI ps 5-25]

The k-means Clustering Method

k-means clustering is one of the most popular exploratory data analysis methods in machine learning.
It is the most frequently used clustering technique — grouping objects based on features — for example, in customer behavior analysis, image segmentation, genetic data exploration, or anomaly detection.

It is a myth that clustering can only be done in two-dimensional (2D) or three-dimensional (3D) space. In practice, data often have a dozen, dozens, or even hundreds of features (e.g., sensor data from a production line), and k-means still works — just without visualization, since analysis takes place in multidimensional space.

Why mention this?
Because research is often conducted for a client who usually needs to see the division — which leads to generating cluster plots. When the studied phenomenon has, for example, 14 features, techniques such as PCA (Principal Component Analysis) or nonlinear methods like t-SNE (t-distributed Stochastic Neighbor Embedding) are used to reduce those 14 features to 2–3 key ones.

In production practice, using k-means clustering might look as follows:
We have a sample of meat with slightly lower temperature and 23 other measured features. These features place it in Cluster A. Another sample, with different values of those 24 features, falls into Cluster B. We also have Clusters C, D, and E.

In practical terms, this clustering does not necessarily mean anything — it simply reflects differences between products on the production line. Products can be assigned to clusters due to minimal variations in feature magnitudes. The process may be conducted for purely technical reasons.

The k-means method belongs to the group of unsupervised learning algorithms, meaning it does not require pre-labeled data. The algorithm independently finds structures (clusters) within the dataset.

The main goal of k-means is to divide a dataset into k non-overlapping groups (clusters) so that points within the same cluster are as similar as possible to each other and as different as possible from points in other clusters (e.g., groups of meat samples).

The quality of clustering is measured using the Sum of Squared Errors (SSE), also known as Within-Cluster Sum of Squares (WCSS) — the sum of squared distances between each point and its cluster centroid. The smaller the SSE, the tighter the clusters — meaning the points are closer to their centers.
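
Written as a formula (with \mu_j denoting the centroid of cluster C_j), this is:

\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2}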

In practice, this means that meat samples from one production batch can be significantly distinguished from each other. This can be useful for analyzing events related to, for example, bacterial contamination or harmful impurities in products.


The Isolation k-means Anomaly Detection Method

Isolation k-means combines two approaches:

  • clustering (k-means) for identifying typical data patterns, and

  • isolation of anomalies similar to methods such as Isolation Forest, focusing on identifying points that significantly deviate from “normal” groups (clusters).

Initially, data are grouped using the classical k-means algorithm. Centroids — cluster centers representing the most frequent patterns — are created.

For each data point, the distance from its assigned centroid is calculated. The greater the distance, the less the point fits its cluster.

For example, a quality-monitoring system on a packaging line may detect five different clusters of sausages. It’s still the same product (one SKU), but the system has found subgroups. Isolation k-means evaluates how isolated each point is from others. Points very distant from any subgroup are treated as potential anomalies.

Although this is an unsupervised method, it is possible to adjust anomaly detection sensitivity, for instance by setting a 95th percentile distance threshold, above which points are considered anomalies, or by using a normalized distance scale such as Z-score.
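
There is no off-the-shelf “Isolation k-means” estimator, so the distance-to-centroid step has to be written out by hand. A minimal scikit-learn sketch, using synthetic data in place of real sensor readings and the 95th-percentile rule mentioned above, could look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for sensor readings (in practice: temperature, weight, pH, humidity, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Distance of every point to its nearest centroid; a large distance means a poor fit to any cluster
distances = kmeans.transform(X_scaled).min(axis=1)
threshold = np.percentile(distances, 95)         # sensitivity knob: 95th percentile
anomalies = np.where(distances > threshold)[0]   # indices of suspicious observations
print(f"{len(anomalies)} potential anomalies out of {len(X)} observations")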


How to Prepare Data for Isolation k-means

Because the method is based on k-means, data must be properly prepared — just as in classical clustering.

Feature standardization (scaling): all features should have similar scales (e.g., mean = 0, standard deviation = 1). Otherwise, features with larger numerical values will dominate the distance metric. Alternatively, normalization can be used to transform feature values into a defined range (e.g., −1 to 1).

Mathematical models usually interpret numerically large values as more significant than smaller ones. For instance, in the case of meat, a temperature of 4 °C would be considered less important than a humidity of 62%, simply because 62 is the larger number.

Feature selection: choose variables that meaningfully represent process “normality.”
In food production, these may include temperature, production time, concentration levels, pressure, and humidity.
In industrial environments, many parameters (e.g., motor power or water usage) may have no relation to product quality and can unnecessarily clutter analysis.

Use of historical data:
The dataset should contain as much data as possible so that k-means can learn typical clusters.
In streaming (real-time) measurements, sample windows should include at least 100 products.

Food-production sensor data usually take the form of time series. They can be transformed into feature vectors using rolling windows, moving averages, standard deviations, etc.
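
For example, in pandas a raw stream of temperature readings can be turned into rolling-window features like this (the sensor name, sampling rate, and 60-second window are illustrative):

import numpy as np
import pandas as pd

# Hypothetical sensor stream: one temperature reading per second
readings = pd.Series(20 + np.random.default_rng(1).normal(scale=0.5, size=600),
                     index=pd.date_range("2025-01-01 08:00", periods=600, freq="s"),
                     name="temperature")

features = pd.DataFrame({
    "temp_mean_60s": readings.rolling("60s").mean(),
    "temp_std_60s": readings.rolling("60s").std(),
    "temp_range_60s": readings.rolling("60s").max() - readings.rolling("60s").min(),
}).dropna()
print(features.head())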

Unlike classical k-means clustering, which assigns every point (product) to some cluster, Isolation k-means allows that certain points may not belong to any cluster — they are then interpreted as anomalies.


Practical Example: Isolation k-means in a Dairy

We have a dairy production line manufacturing yogurt. The process is strictly controlled and includes several key technological parameters that must remain within defined ranges to ensure safety and final product quality.

We measure six key features:

  1. Milk pasteurization temperature (°C)

  2. Fermentation time (minutes)

  3. Fat content (%)

  4. Protein content (%)

  5. Final product pH

  6. Number of live bacterial cultures (log CFU/ml)

Data from sensors and analyzers are collected in real time in transactional form. Each batch of 40 yogurts has a unique production number.
Each batch can thus be represented as a point in a six-dimensional feature space.
We do not analyze each feature separately, but treat them as one coherent observation — a complete process profile for that batch.

The k-means algorithm groups many earlier correct batches into clusters representing typical parameter configurations. For each cluster, a centroid is created.

When a new batch is produced, its profile (six features) is compared to the centroids.
If it lies significantly far from the nearest cluster (e.g., very short fermentation time combined with high pH and low bacterial culture count), it is treated as a potential anomaly.
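
Reusing the scaler, kmeans, and threshold objects from the earlier sketch (here assumed to have been fitted on historical six-feature batch profiles rather than synthetic data), checking one new batch could look like this; the six numbers are invented:

import numpy as np

# [pasteurization_temp, fermentation_min, fat_pct, protein_pct, ph, log_cfu] of a hypothetical batch
new_batch = np.array([[91.5, 180.0, 2.0, 3.4, 4.6, 7.1]])

distance = kmeans.transform(scaler.transform(new_batch)).min(axis=1)[0]
if distance > threshold:
    print("Potential anomaly: hold the batch for laboratory checks")
else:
    print("Batch profile matches a known cluster")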

In streaming mode, data are analyzed continuously.
Thanks to the ability to quickly compute distances to centroids, Isolation k-means can immediately identify batches with unusual profiles — before the product reaches packaging or distribution.
When combined with an alert system, operators can react instantly — stopping the line or conducting additional laboratory tests.


Advantages of the Isolation k-means Method

  • Considers the full feature profile, not just individual limits.
    Many issues arise from combinations of parameters that individually fall within limits but together form an unnatural case.

  • No need for predefined thresholds or quality standards, which is especially useful for short production series. Production anomalies are rare and difficult to define in advance.

  • Simplicity and speed.
    Isolation k-means is relatively simple and efficient, allowing real-time use in quality monitoring systems.

  • Robustness to scale differences.
    After appropriate feature standardization or normalization, the algorithm operates stably even with varying measurement units.


Summary

This article demonstrated how the Isolation k-means method can be successfully applied to anomaly detection in food production processes.
Combining classical clustering with isolation analysis allows for a more effective quality assessment approach than traditional threshold-based methods.

By analyzing complete process feature profiles rather than individual values, the method can identify subtle yet potentially critical deviations.
Its flexibility, lack of dependence on predefined limits, and real-time applicability make it particularly attractive for production environments where reliability and rapid response are crucial.

In the era of digital transformation in the food industry, methods like Isolation k-means may become the foundation of intelligent quality control and predictive maintenance systems — supporting consumer safety and process efficiency.

Artykuł Application of the Isolation k-means Method in Quality Control of Food Production pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
A Cyberattack on My Pastry Shop? That’s Absurd https://sigmaquality.pl/moje-publikacje/a-cyberattack-on-my-pastry-shop-thats-absurd/ Thu, 16 Oct 2025 10:38:53 +0000 https://sigmaquality.pl/?p=8756 Cyberattack? What does that even mean? It doesn’t concern me.I’m a calm pastry chef in a small town where everyone knows me. Cyberattacks are about [...]

Artykuł A Cyberattack on My Pastry Shop? That’s Absurd pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

Cyberattack? What does that even mean? It doesn’t concern me.
I’m a calm pastry chef in a small town where everyone knows me. Cyberattacks are about banks, advertising agencies, or corporations in big cities. Surely no one would attack an ordinary craftsman who honestly works for his community.


Digital Aquarium

We are now deeply immersed in the digital world.
We have bank accounts, social media accounts, YouTube channels, accounting systems, and websites where we sell our products.
An integrated accounting system controls sales systems in our stores, opens and settles cash registers, and configures production systems that carry out daily production plans.
Recipe and warehouse systems are also integrated with oven control systems.

Everything is connected within a defined structure that can be paralyzed by taking over access codes.
Most people try to work efficiently by saving time and simplifying processes.
We usually have one password for everything — sometimes we modify it by changing one or two letters.
All passwords are stored in Google’s password manager.

Our professional life and the safety of our money depend on the systems we have on our smartphones — because that’s the most intuitive and simple approach.
Bank codes, system codes, BLIK codes, and access to our online store — everything in Google Password Manager.

But what if someone took control of everything? Changed all the codes?
We assume nobody will — so we don’t worry.
The situation is like leaving bicycles unlocked in front of a shop.
When a country is safe, and stealing bikes is unprofitable, nothing happens.
We leave our cars unlocked in front of our houses, expensive garden tools on open properties.

But maybe it’s worth taking an interest in cybersecurity?
The situation can change drastically if we fall victim to a hacker who, for fun or profit, paralyzes our lives.


[Illustration: W-MOSZCZYNSKI ppic 6-25]

Why Would Someone Hack Me for Fun?

It’s not exactly about fun.
Currently, there’s a significant downturn in the IT services market.
Many highly qualified professionals in information technology have lost their jobs due to the implementation of artificial intelligence — which turned out to be much more efficient at coding than humans.

At the same time, because of the war in Ukraine and the growing importance of artificial intelligence, there is now huge demand for cybersecurity specialists.
The problem is that you can’t learn cybersecurity at university.
There, you only acquire basic knowledge.

The best cybersecurity experts come from hacker circles — they usually call themselves ethical hackers.
They claim to have hacking experience but say they’ve never done anything bad (of course, we believe that).
To become a hacker, you must start hacking — and to gain practice, you must act.
You can’t learn how to break into systems in artificial, virtual environments.
It’s best to hack into existing secured systems.


A Hooded Figure in a Dark Room with a Monitor?

Let’s imagine this situation.
I go to a café.
I sit down with my phone, maybe I want to check my account balance or send an email to a supplier.
I connect to Wi-Fi called “Café_Free” — sounds fine, right?
Who would think that behind it hides someone who just connected to my life?

This type of attack is called “man-in-the-middle” — the hacker sits in the middle of the communication, between your phone and the internet.

Someone creates a fake Wi-Fi network — identical to the one you usually see at cafés, airports, hotels, or on a Pendolino train.
They give it a familiar name, like Starbucks_WiFi or LOT_Free_WiFi — and you’re trapped.

The hacker has just grabbed your bank password, access to your online store, or even your browsing history and secret confectionery recipes.

The hacker, like a spider, sets up a fake network and waits for someone naive enough to connect.
You don’t have to click any suspicious link or download any strange file — just connect.
From that moment, all your data goes through their computer before reaching the real internet.

It’s like writing a letter to your bank — but first handing it to a stranger who reads it, copies it, and only then mails it.

Some people say: “But I’m just an ordinary person — I have nothing interesting.”
And I say: for a hacker, everything is interesting — even access to your email inbox.
They can use it for further attacks.
Your contact list can easily be sold to an advertising agency.

And since, as I mentioned, we usually have one password for everything — it becomes a problem.
Such an attack can happen anywhere — someone can create a fake network called free_Warsaw_WiFi, and people will immediately connect.

And to make it even more cinematic — some wireless keyboards and mice, the “modern” ones, can be taken over via Bluetooth.
A hacker with the right device can control your computer while sitting in a car outside your office.
You wouldn’t even know who that person is.
They can type your password, send emails, or simply log you out.

Sometimes it’s worth going back to something old and reliable — a wired keyboard and mouse.


The Extra Technician

If we have a company that employs more people than can fit on a city bus, we usually deal with a multitude of processes, services, and suppliers.
Many different specialists come through our business — plumbers, electricians, camera technicians, or accounting system operators.
We can’t monitor them all.

Now imagine this scene:
A technician comes in to “check the camera signal.”
He unplugs a cable and plugs something back in.
It looks like a regular Ethernet adapter — an inconspicuous little box.
We have plenty of such unknown devices in our companies.

But inside that little box sits a computer.
Once you plug it in, it automatically connects to the hacker’s server.
From that moment, they have access to your world.

Would you even notice?
Nobody checks what’s behind the router or under the desk.
Everything works, the internet is up, the store terminal functions — why bother?


The Hacker’s Credit Card Trick

Now one more quick trick — the credit card.
The chip, once a symbol of innovation and convenience, has become a weak point for cyberattacks.
Someone can walk right past you with a small reader in their pocket — no contact required.
You walk through a shopping mall and… boom — your card number is read.
And you have no idea.

Fortunately, there’s still the CVV number (Card Verification Value).
The card number alone won’t help the hacker unless they also gain access to your phone.
In general, attacks on bank accounts require higher expertise and are better protected by law enforcement.
That’s why hackers often target something easier — your home server.


A Beginner Hacker’s Exercise: How to Break Into a Home Server

Picture someone sitting in a car outside your house.
That’s how a hacker attack on your home Wi-Fi network might look.

Remember the café example? The hacker who accessed a phone through free Wi-Fi can now see the password to your home server through Google Password Manager.

Most people use default IP addresses like 192.168.0.1 or 192.168.1.1 — and never change them because “it works.”
The router login panel is standard, accessible through a browser.

But — not only you know that address.
A hacker who once connected to your Wi-Fi, or that technician who had temporary access, can scan your entire local network in seconds.
They’ll see your home server at 192.168.0.105, your router at 192.168.0.1, and your laptop at 192.168.0.103.

If the hacker doesn’t already have your password, they try default ones — using a list of the 150 most common passwords worldwide.
Their software checks all of them in less than a minute.
Many people log into their server panels with passwords like:
admin/admin, admin/1234, user/password, qwerty, or MyCompany2023.
Because who would hack a pastry chef’s home server?

Once the hacker gets in, they use a password dictionary — an automatic program that tries every common combination.
A minute or two, and they’re inside.
They can see your saved passwords (stored in a .txt file on your desktop), install malware, or open a backdoor for permanent access.

They can redirect your web traffic so that everyone visiting your online store or blog receives fake files, ads, or even malware — and you won’t even notice.
Everything looks normal.

All this happens simply because your server had a default address and default password.
Because you “didn’t bother to change it.”

We’re pastry chefs — honest people.
We work with flour, not with computers.


The Cherry on the Cake

The modern car key — a wireless module that opens your vehicle.
It transmits a coded signal — and the hacker can intercept it, record it, and wait.
Then, at night, they approach your car and click — it’s open.
No noise. No broken windows.

It sounds like science fiction, but it’s not.
It’s not the future — it’s happening now.
And it can happen to you.
All because someone left a strange little box in the cabinet next to your modem.


Why Updates Matter

We can now decide — either we fall into permanent cyber-paranoia, or we classify the author of this article among the “cyber-freaks,” people who can turn even the brightest day into a dark night.

The choice about your mental comfort is yours — but remember one thing: keep everything updated.

That’s the simplest solution.
Update everything — routers, phones, computers, refrigerators, robot vacuums, even your toaster if it has Wi-Fi.
Each update doesn’t just add icons — it fixes vulnerabilities that someone could exploit.

Most hacking attacks are done remotely and rely on system vulnerabilities.
These are gradually patched by software providers.
If you don’t update, you’re leaving the door open with a note saying “Welcome, hackers.”

Also, change your passwords, memorize them, or write them on paper — not in .txt files.
And don’t let Google Password Manager remember them for you.

Because the truth is:
Today’s hacker doesn’t have to be a genius.
All it takes is you forgetting to click “Install update,” connecting to free airport Wi-Fi, or leaving your router’s default password unchanged.

And suddenly your digital infrastructure becomes a testing ground for some beginner hacker learning the trade.

Artykuł A Cyberattack on My Pastry Shop? That’s Absurd pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Application of the Incremental Isolation Forest Model for Improving the Quality of Food Production https://sigmaquality.pl/moje-publikacje/application-of-the-incremental-isolation-forest-model-for-improving-the-quality-of-food-production/ Thu, 16 Oct 2025 10:30:16 +0000 https://sigmaquality.pl/?p=8742 Isolation Forest is a machine learning model that can be used to detect anomalies on a food production line, for example, a meat packaging line. [...]

Artykuł Application of the Incremental Isolation Forest Model for Improving the Quality of Food Production pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

Isolation Forest is a machine learning model that can be used to detect anomalies on a food production line, for example, a meat packaging line. Such an installation is equipped with a number of sensors that continuously send information. Receiving this data on a personal computer is relatively simple. The data stream can be directed straight from the sensors to an analytical cloud, such as Azure or AWS, where a ready-made model can be launched. This model will automatically send alerts related to anomalies detected in the production process.


How the Isolation Forest Model Works

This solution is easy, simple (for an average data analyst), and often free.
Let’s assume we have two measurement sensors on a meat packaging line:

  • Weight sensors, which check whether each piece has the correct weight,

  • Temperature sensors, which measure the temperature of the meat, indicating its freshness.

Thanks to these sensors, for each product a set of data is collected — weight and temperature.

The Isolation Forest model learns from many products that have passed through the line. In this way, it learns what a “typical” product looks like. Suddenly, an unusual product may appear — for example, one that is too heavy or too warm, which may indicate that the meat is starting to spoil. Such a product will then be recognized by the Isolation Forest model as an anomaly.

Why does this happen? Because a product that differs from the rest (has a different weight or temperature) will be quickly isolated by the model’s decision trees. The model needs only a few splits to notice that this product “does not fit” the standard products normally appearing on the line.
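
A minimal scikit-learn sketch of this weight-and-temperature scenario; the readings are generated synthetically, and the assumed 1% contamination rate is a tuning choice, not a measured defect rate.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Typical products: weight around 500 g, meat temperature around 4 °C
normal = np.column_stack([rng.normal(500, 5, 1000), rng.normal(4.0, 0.3, 1000)])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(normal)

# New products coming off the line: the last one is too heavy and too warm
new_products = np.array([[501.2, 4.1], [498.7, 3.9], [535.0, 9.5]])
print(model.predict(new_products))            # 1 = normal, -1 = anomaly
print(model.decision_function(new_products))  # lower score = more anomalous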


[Illustration: W-MOSZCZYNSKI ps 4-25 w2]

How Isolation Forest Detects Anomalies

The Isolation Forest model is used to detect anomalies in a set of elements — it identifies something that doesn’t fit the rest. The model builds trees randomly; the selection of features is random (e.g., customer gender, race, age, income). Thus, the structure of trees is not predetermined.

From a database of clients, a tree may first randomly divide them by gender (women and men), then by place of residence, age, etc. If something entirely unusual suddenly appears (a horse, a cat), such a data point will be very quickly separated — after one or two splits — forming a single, short branch of the tree. This short branch indicates that the point is an anomaly.

The key lies precisely in the speed of isolating anomalous points (in the above example — animals from humans). Identifying anomalies is based on identifying branches that are very short. The fewer splits required to separate a point from others, the more likely it is an anomaly.

Each tree is created by randomly dividing data into increasingly smaller groups. Normal observations are similar to one another, so separating them requires many splits — many branches are created. Anomalies, on the other hand, are isolated after only a few splits. The shorter and more singular the branch, the greater the anomaly.

The deciding factor for whether an observation is anomalous is therefore the number of branches (splits) needed to isolate it from the rest of the data.

Isolation Forest is particularly well-suited for analyzing streaming data for several reasons:

  • It is very fast, building simple trees randomly and not requiring complex computations.
    Thanks to this, it processes data in real time, without delays.

  • It does not require storing full historical data or creating complex patterns.

  • It is ideal for rapid detection of individual unusual points as they appear.

  • It can be easily updated — being an unsupervised model, it flexibly adapts to new data, which is essential for changing data streams.

  • It requires little memory, as the model does not store the historical data themselves — it only retains the structure of the trees built from them.

In summary, Isolation Forest is fast, computationally lightweight, and well-suited for detecting anomalies on the fly during real-time data processing. In its basic version, the model analyzes static data (datasets previously collected), but it can also be adapted to handle dynamic (streaming) data.


Supervised and Unsupervised Learning

In data science, analyses are divided into static and dynamic ones.
A static analysis is based on a fixed dataset containing historical data. The model learns from this data and, once trained, can be applied to new, incoming data. When the historical data include a known output (target) variable, this approach is called supervised learning, because the model learns how the explanatory variables affect a known outcome.

Supervised learning resembles teaching a child to recognize animals by showing pictures and saying, “This is a cat, and this is a dog.” Here, the model knows exactly what result it should achieve for given examples because it has a clearly defined output variable — what we are trying to predict. The model thus learns how features (e.g., size, fur length, sound) affect the result (cat or dog).

Unsupervised learning works differently. It is like showing a child many pictures of animals without saying their names. The child must independently find similarities or differences and create groups or categories. In this approach, there is no output variable (no “cat,” “lion,” or “horse” label). The model tries to understand the data’s structure, divide it into logical groups, or detect differences between them. When something does not fit any group, it is considered an anomaly.

In summary, the key difference is that in supervised learning, the model receives clear guidance on the expected result, whereas in unsupervised learning, the model must find patterns and structure by itself, because no correct answers are predefined.
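
The contrast can be shown in a few lines of Python. In the supervised case the model receives both the features and the correct answers; in the unsupervised case it receives only the features and must group them itself. The data below are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))               # features only
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # known "correct answers" for each row

# Supervised: the model is given X together with the labels y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: the model sees only X and has to find groups on its own.
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)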


Static vs. Dynamic (Streaming) Analysis

In static (batch) analysis, we work with a complete, closed dataset (e.g., a year’s worth of customer transactions). Decision trees are built once for this dataset.

In dynamic (streaming) analysis, data points arrive in real time — there is no fixed, large dataset, only continuously incoming data, such as online store transactions or sensor data from production systems.

Isolation Forest needs historical data to build its initial forest of random decision trees. Then, as new data arrives, it checks these points against the existing trees, determining whether they are anomalies. Thus, the model does not need to store historical data — it only keeps the tree structure, which summarizes the characteristics of previously analyzed data.
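
A sketch of this streaming behavior: the forest is fitted once on historical data, and each new observation is scored as it arrives, without storing the past observations themselves. The generator below is a hypothetical stand-in for a live sensor feed.

import numpy as np
from sklearn.ensemble import IsolationForest

history = np.random.default_rng(0).normal(loc=[500.0, 4.0], scale=[5.0, 0.3], size=(1000, 2))
model = IsolationForest(random_state=0).fit(history)   # built once from historical data

def incoming_stream():
    # Hypothetical stand-in for live sensor readings (weight, temperature).
    yield [499.0, 4.1]
    yield [531.0, 9.8]   # an unusual reading

for reading in incoming_stream():
    label = model.predict([reading])[0]   # 1 = normal, -1 = anomaly
    if label == -1:
        print("alert:", reading)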


How Isolation Forest Works for Streaming Processes

For streaming data, Isolation Forest is used in the form of the Incremental Isolation Forest.
Based on the initial data, an initial forest of trees is built. As new data arrive in the streaming process, the model continuously updates the trees.

In this way, it constantly “learns” new patterns and reacts quickly to changes.

Returning to our earlier example: if the customer dataset (where a horse appeared as an anomaly) is later expanded to include many cats, the model will treat them as a group and begin to identify subcategories such as fur color, sex, and age. It dynamically creates new trees.

Thus, in practice, for dynamic data, the Isolation Forest model can continuously adapt to the data stream.
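
Plain scikit-learn does not offer a built-in incremental update for IsolationForest, so the sketch below only approximates the incremental behavior described above: recent observations are kept in a sliding window and the forest is periodically rebuilt on that window. Dedicated streaming libraries provide true online variants; the window sizes here are arbitrary assumptions.

from collections import deque
import numpy as np
from sklearn.ensemble import IsolationForest

WINDOW = 2000        # how many recent observations to keep
REFIT_EVERY = 500    # rebuild the trees after this many new points
MIN_TRAIN = 200      # wait for this many points before the first fit

window = deque(maxlen=WINDOW)
model = None
seen = 0

def update(point):
    # Score one incoming observation and refresh the forest when due.
    global model, seen
    window.append(point)
    seen += 1
    if seen < MIN_TRAIN:
        return 1                          # too little data yet; treat as normal
    if model is None or seen % REFIT_EVERY == 0:
        model = IsolationForest(random_state=0).fit(np.array(window))
    return model.predict([point])[0]      # 1 = normal, -1 = anomaly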


Summary

To summarize, Isolation Forest models are an effective tool for automatic anomaly detection because they can simultaneously analyze hundreds of different variables and quickly identify deviations from the norm.

The use of such models is not limited to production lines — they can also be applied to:

  • analyzing products stored in warehouses,

  • raw materials received at entry scales in processing plants,

  • monitoring auxiliary processes such as workforce utilization or production support systems.

Importantly, once implemented in cloud infrastructure, Isolation Forest models operate fully automatically and practically maintenance-free.
This means that, once deployed, they can function indefinitely, with minimal susceptibility to failure or technical errors — further increasing the reliability of the entire quality control and production monitoring process.

The practical implementation of this solution involves connecting the data stream from the production line directly to the analytical cloud via an API interface.
The streaming data are then automatically processed by an analytical module based on the Incremental Isolation Forest model, which continuously classifies products as “normal” or “anomalous.”

The results are sent as time series, containing a clear classification for each analyzed product. These alerts can be delivered directly to mobile devices — for example, the production manager’s or quality inspection staff’s smartphones — allowing immediate response to potential quality or technological problems, significantly improving the efficiency of the entire quality control process.
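
As an illustration of this last step, the sketch below forwards a single classification result as an alert to a monitoring endpoint. The URL and payload fields are hypothetical placeholders, not part of any particular platform.

import json
from datetime import datetime, timezone
import requests

def send_alert(product_id, weight, temperature, api_url="https://example.com/api/alerts"):
    # Package one anomalous product as an alert and send it to the (hypothetical) API.
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "product_id": product_id,
        "weight_g": weight,
        "temperature_c": temperature,
        "status": "anomaly",
    }
    # In a real deployment this call would be authenticated and retried on failure.
    requests.post(api_url, data=json.dumps(payload),
                  headers={"Content-Type": "application/json"}, timeout=5)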

The article Application of the Incremental Isolation Forest Model for Improving the Quality of Food Production comes from THE DATA SCIENCE LIBRARY.

Review of the Most Important Machine Learning Models for Detecting and Eliminating Hacker Attacks https://sigmaquality.pl/moje-publikacje/review-of-the-most-important-machine-learning-models-for-detecting-and-eliminating-hacker-attacks/ Thu, 16 Oct 2025 10:22:47 +0000 https://sigmaquality.pl/?p=8738 A hacker attack is a deliberate action aimed at paralyzing a network server or, in its active form, taking control over it. In the past, [...]


A hacker attack is a deliberate action aimed at paralyzing a network server or, in its active form, taking control over it. In the past, this was done by introducing a virus; today, antivirus systems are so advanced that viruses have become a secondary threat.

A network server can be compared to the human body. Human skin has an extremely high ability to block all pathogens, germs, and chemical substances. It is like a wall through which nothing can pass. Infections occur only when the skin is damaged, has deep cracks, or is penetrated by a sting or another type of attack that destroys its structure. The human body has two main gateways through which germs, viruses, and pathogens enter: the alimentary canal and the respiratory tract. As we know, these two gates cannot be closed, because that would deprive us of life.

An identical mechanism exists in servers; these devices have entry gates through which vast amounts of content flow, often containing harmful software. Closing the inflow of information to the server would mean it could not be used — it would cease to perform its function.

Therefore, both servers and living organisms cannot completely shut themselves off from the external environment. They must build systems within themselves that can identify and eliminate threats. This function must be performed in a streaming system, that is, in real time. Threats must be eliminated immediately — here and now.



Operation of Machine Learning Models

In general, data science models can be divided into supervised, unsupervised, and autoregressive models.

Supervised Models

The learning process in supervised learning involves randomly dividing a dataset into a training set (used to train the model) and a test set (used to check its efficiency and accuracy), usually in an 80/20 proportion. The trained model's task is to detect information that is anomalous, that is, deviating from the standard behavior of data on the server. The model itself takes the form of an algorithm or mathematical function; it is usually small in size, though, depending on the technology used, it may still consume significant server resources.
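
A minimal sketch of that workflow, using synthetic, already labeled log features (1 = attack, 0 = normal traffic); the feature values and the labeling rule are invented for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 5))             # numeric features derived from server logs
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # invented rule standing in for real labels

# 80/20 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))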

The problem of consuming large analytical resources is best illustrated by the example of military drones. At present, drones are mostly operated by humans. Machine Learning models could, however, identify and eliminate enemy armored vehicles on their own. The problem is that most drones carry no onboard computer capable of running such models. Single-board computers such as the Raspberry Pi can be installed in them, but they are relatively expensive for this purpose, and their computing power is insufficient to run such artificial intelligence models smoothly.

The moment when widely available microcomputers become both cheap and powerful will be the moment when we witness the mythical war of robots against humans. To put it less dramatically: we will observe completely autonomous robots killing people. For now, this phenomenon is rare on the battlefield for technical reasons. As we can see, Asimov's Laws of Robotics (created by Isaac Asimov, the American science-fiction writer and science popularizer) can be thrown into the trash: even before such robots have arrived for good, these laws have lost all significance.

Thus, supervised learning models work effectively but do not learn during operation. In other words, if server administrators do not ensure that the model is systematically retrained, hackers — or some form of artificial intelligence — will find new forms of cyberattack that the model will not recognize, because it did not have the chance to learn them.


Unsupervised Models

Unsupervised learning models for anomaly detection learn by identifying normal patterns in data and marking unusual observations as anomalies. These models learn without supervision, meaning they have no information about which data points are anomalies and which are normal — this is called learning from unlabeled data.

Such models mainly analyze the structure of the data and create a pattern of normality. New data entering the server are evaluated by comparison, identifying differences relative to the learned pattern.

Typical unsupervised algorithms for anomaly detection include:

  • Clustering methods,

  • Statistical methods (e.g., PCA, KDE) identifying outliers in a distribution,

  • Autoencoders (based on neural network technology),

  • Isolation-based models, e.g., Isolation Forest, which isolates anomalies very quickly.


The Role of Logs

The human body communicates with the brain using protocols generated by the nervous system. If an internal organ is attacked by pathogens, we feel pain in a specific area. Pain is a protocol sent directly to the brain, whose purpose is to inform the decision-making center about a problem.

At the same time, there are procedures independent of our will. If dust enters the respiratory tract, we start sneezing and coughing.

Server logs work in a similar way, except they are generated and sent continuously — whether there is a problem or not. Logs are continuous information about server activity, sent to an analytical center.

At the current stage of development, this analytical center, the "brain", is still a human. However, the evolution of AI systems introduces an intermediate problem-identification layer between logs and humans, so that the human is not flooded with endless streams of information but instead receives concise summaries of the system's state.

The key is to make the system intelligent enough to identify and immediately eliminate threats. This is exactly what artificial intelligence models are designed for.

Thus, the log register on a server is one of the key data sources for analyzing cyber threats.

We can distinguish five main types of logs:

  1. Access logs,

  2. Error logs,

  3. Authentication and authorization logs,

  4. Firewall logs,

  5. Application logs.

When identifying anomalies that often turn out to be malicious software or a veiled form of cyberattack, application logs are the most useful, as they record user behavior on the server.


Application Logs

Application logs play a key role in both recommendation systems and cyberattack defense systems because they provide valuable information about user behavior.

They show:

  • clicks on products or content,

  • time spent on a product page or article,

  • scrolling behavior,

  • cart actions (add/remove product, purchase),

  • ratings and reviews.

Hacker attacks are usually carried out by bots programmed to imitate human behavior. They can scroll pages, move to subpages, and perform actions similar to human users. However, bots are created according to specific templates and replicated, so they behave identically. Their uniformity makes them detectable and removable.

Moreover, bots do not respond to graphic or contextual cues as humans do. Of course, advanced AI can detect nuances in images, but bots are simple — they must be. If each bot contained a neural network for image recognition, the hacker’s computer controlling thousands of such bots would be heavily overloaded and inefficient. This would make mass attacks impossible — which, as we know, are one of the most common forms of cyberattacks (DDoS attacks).
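
One simple way to exploit the uniformity mentioned above is to look at the timing of events within each session: scripted sessions tend to show almost no variation between consecutive actions. The column names and timestamps below are hypothetical stand-ins for real application-log fields.

import pandas as pd

logs = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b", "b", "b"],
    "timestamp": pd.to_datetime([
        "2025-01-01 10:00:00", "2025-01-01 10:00:02", "2025-01-01 10:00:04",  # exactly 2 s apart
        "2025-01-01 10:00:00", "2025-01-01 10:00:07", "2025-01-01 10:00:19",
        "2025-01-01 10:00:45",                                                # irregular, human-like
    ]),
})

gaps = (logs.sort_values("timestamp")
            .groupby("session_id")["timestamp"]
            .diff().dt.total_seconds())
spread = gaps.groupby(logs["session_id"]).std()
print(spread[spread < 0.5])   # near-zero spread between clicks suggests scripted behavior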


DDoS-Type Cyberattacks

Bots are primarily designed to attack the application layer (Layer 7). The attack consists in running the most resource-intensive queries or functions of a website.

All DDoS (Distributed Denial of Service) attacks aim to paralyze the server by maximally consuming its resources. Other forms include volume-based attacks, where massive amounts of "empty" queries are sent to the server, and protocol attacks (e.g., SYN Flood), in which connection handshakes are deliberately started but never completed, so the server keeps resources reserved for them.


Intrusive Attacks

DDoS attacks aim to overload a server artificially, but logs can also be used to fight intrusive attacks such as SQL Injection, XSS, or RCE. These are far more dangerous because they are active and aim for specific goals — such as data theft or taking control of the server.

  • SQL Injection involves injecting malicious code into database queries, allowing the attacker to read, modify, or delete data, and sometimes gain full control over the database.

  • Cross-Site Scripting (XSS) injects malicious JavaScript code into an application, which can lead to session hijacking, fake content display, or account takeover.

  • Remote Code Execution (RCE) allows arbitrary code to be executed on the server, potentially taking control of the system. Related techniques include LFI (Local File Inclusion) and RFI (Remote File Inclusion), which can be used to achieve such code execution.


How Machine Learning Uses Application Logs to Detect Attacks

The best method of defense against cyberattacks is detecting anomalies using unsupervised Machine Learning algorithms. Here are the key models:

Isolation Forest

Like the well-known Random Forest, it builds many decision trees (e.g., 100–200). Each tree divides a random subset of data to check how quickly a given point can be “isolated.” If a point is isolated very quickly, it is an anomaly.

Each data point (e.g., a request described by numeric features derived from the username, IP address, and server query) passes through multiple trees. For each point, the model computes the average number of splits required to isolate it. Normal points need many splits; anomalies require few.
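
Raw log fields such as usernames, IP addresses or query strings are categorical, so in practice they are first converted into numeric features, for example requests per minute, error rate and the number of distinct endpoints per IP. The sketch below uses invented values for such features.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
normal_traffic = np.column_stack([
    rng.normal(30, 5, 500),       # requests per minute
    rng.normal(0.02, 0.01, 500),  # share of failed requests
    rng.normal(8, 2, 500),        # distinct endpoints visited
])

iso = IsolationForest(n_estimators=150, random_state=0).fit(normal_traffic)

suspect = np.array([[400.0, 0.6, 2.0]])   # very high rate, many errors, few endpoints
print(iso.predict(suspect))               # -1: the point is isolated after very few splits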

Autoencoders

An unsupervised model that can also be retrained continuously as new data arrive.
It works in two steps:

  1. Encoding – compressing large datasets (e.g., user logs) into a smaller representation.

  2. Decoding – reconstructing the original data from the compressed form.

The model learns to reconstruct “normal” transactions. If a suspicious transaction appears (e.g., strange activity, too fast or too repetitive), the reconstruction error grows, and the model flags it as anomalous.
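
A minimal dense autoencoder sketch in Python (TensorFlow/Keras) on synthetic numeric features; the architecture, number of epochs and the cutoff on the reconstruction error are illustrative assumptions rather than recommended settings.

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(5)
X_normal = rng.normal(size=(2000, 10)).astype("float32")   # "normal" transactions

autoencoder = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(6, activation="relu"),    # encoding (compression)
    keras.layers.Dense(3, activation="relu"),    # bottleneck
    keras.layers.Dense(6, activation="relu"),    # decoding
    keras.layers.Dense(10, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_normal, X_normal, epochs=10, batch_size=64, verbose=0)

# The cutoff is taken from the reconstruction errors of normal data (illustrative choice).
train_errors = np.mean((X_normal - autoencoder.predict(X_normal, verbose=0)) ** 2, axis=1)
threshold = np.quantile(train_errors, 0.99)

X_new = rng.normal(size=(5, 10)).astype("float32")
X_new[0] += 8.0                                            # one clearly unusual record
errors = np.mean((X_new - autoencoder.predict(X_new, verbose=0)) ** 2, axis=1)
print(errors > threshold)                                  # True marks a suspected anomaly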

The intuition can be illustrated simply: we expect an elderly woman on a tram to behave calmly. Suddenly, a person dressed as a dragon enters, shaking uncontrollably and unpredictably; our brain cannot anticipate what will happen next. This is anomalous behavior. Autoencoders are typically built from feedforward neural networks; when the input is sequential (for example, a stream of log events), recurrent architectures (RNNs) can be used to capture this kind of context.

LOF (Local Outlier Factor)

LOF compares user activity with that of its neighbors by measuring local density. If a point lies in a dense cluster, it is typical; if it lies alone, it is suspicious. The idea is related to density-based statistical outlier detection, although LOF itself is non-parametric and relies on distances to the nearest neighbors rather than on an assumed normal distribution.
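
A sketch of LOF used in scikit-learn's novelty mode: the model is fitted on features of typical sessions and then scores new points. The feature values here are synthetic.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(9)
normal_activity = rng.normal(size=(500, 3))        # features of typical user sessions

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(normal_activity)

new_points = np.array([
    [0.1, -0.2, 0.0],   # inside the dense cluster
    [6.0, 6.0, 6.0],    # isolated, low local density
])
print(lof.predict(new_points))   # 1 = typical, -1 = suspicious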


Why Use Autoregressive Models for Anomaly Detection?

Besides supervised and unsupervised models, autoregressive models are very important in anomaly detection because they can capture what others cannot — temporal changes.

Isolation Forest, LOF, and Autoencoders detect anomalies in static data, but they do not analyze time-based dynamics. Autoregressive models detect anomalies in trends and sequences.

Models such as ARIMA (Autoregressive Integrated Moving Average) and ARMA (Autoregressive Moving Average) belong to the family of statistical time-series forecasting methods. They model dependencies between observations over time.

Autoregressive models detect anomalies in temporal dynamics — for example, they can see that someone performs the same action every hour or that there are strange pauses in server activity.

If a server usually handles 500 requests per minute and the autoregressive model predicts 480–520, but suddenly it receives 700, the model marks this as an anomaly. Such models are ideal for detecting DDoS attacks, unusual logins, and sudden changes in user behavior.
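
A sketch of this idea with statsmodels: an ARIMA model is fitted to a synthetic requests-per-minute series, and the next observation is flagged when it falls outside the forecast interval. The series and the chosen order (1, 0, 1) are illustrative.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(11)
history = 500 + rng.normal(0, 10, 300)        # about 500 requests per minute

fitted = ARIMA(history, order=(1, 0, 1)).fit()
low, high = fitted.get_forecast(steps=1).conf_int(alpha=0.05)[0]   # 95% interval

observed = 700                                # the next minute's actual count
if not (low <= observed <= high):
    print(f"anomaly: {observed} requests/min outside [{low:.0f}, {high:.0f}]")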


Summary

Network security is a key condition for the development of the digital economy. Over millions of years of evolution, living organisms have developed systems that eliminate invisible biological threats such as microbes, viruses, and bacteria. The same is now happening in digital evolution.

Today, cybersecurity no longer relies solely on firewalls or simple associative rules. Increasingly sophisticated forms of cyberattacks require equally sophisticated and effective countermeasures.


The article Review of the Most Important Machine Learning Models for Detecting and Eliminating Hacker Attacks comes from THE DATA SCIENCE LIBRARY.
