A hacker attack is a deliberate action aimed at paralyzing a network server or, in its active form, taking control of it. In the past, this was done by introducing a virus; today, antivirus systems are so advanced that viruses have become a secondary threat.
A network server can be compared to the human body. Human skin is extremely effective at blocking pathogens, germs, and chemical substances; it is like a wall through which nothing can pass. Infections occur only when the skin is damaged, deeply cracked, or pierced by a sting or another kind of attack that destroys its structure. The human body has two main gateways through which germs, viruses, and pathogens enter: the alimentary canal and the respiratory tract. As we know, these two gates cannot be closed, because that would deprive us of life.
An identical mechanism exists in servers; these devices have entry gates through which vast amounts of content flow, often containing harmful software. Closing the inflow of information to the server would mean it could not be used — it would cease to perform its function.
Therefore, both servers and living organisms cannot completely shut themselves off from the external environment. They must build systems within themselves that can identify and eliminate threats. This function must be performed in a streaming system, that is, in real time. Threats must be eliminated immediately — here and now.
W-Moszczynski (5)
Operation of Machine Learning Models
In general, data science models can be divided into supervised, unsupervised, and autoregressive models.
Supervised Models
The learning process in supervised learning involves randomly dividing a dataset into a training set (used to train the model) and a test set (used to check its efficiency and accuracy), usually in an 80/20 proportion. The trained model's task is to detect anomalous information, that is, data deviating from the standard behavior observed on the server. The resulting model takes the form of an algorithm or mathematical equation; it is usually small in size, though depending on the underlying technology, it may still consume significant server resources.
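The split-and-train workflow described above can be sketched as follows. The choice of scikit-learn, the logistic-regression classifier, and the synthetic labeled data are illustrative assumptions, not details from the text:

```python
# Sketch of supervised anomaly detection with the 80/20 train/test split
# described in the text. Library choice and data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Normal traffic: features near 0; anomalies: clearly shifted feature values.
X_normal = rng.normal(0.0, 1.0, size=(400, 3))
X_anomal = rng.normal(4.0, 1.0, size=(40, 3))
X = np.vstack([X_normal, X_anomal])
y = np.array([0] * 400 + [1] * 40)  # labels are known: this is supervised

# 80/20 split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # efficiency check on held-out data
```

Note that the test set is never shown to the model during training, which is exactly what makes the accuracy estimate honest.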
The problem of consuming large analytical resources is best illustrated by the example of military drones. At present, drones are mostly operated by humans, even though Machine Learning models could identify and eliminate enemy armored vehicles on their own. The problem is that typical drones carry no onboard computer. Single-board computers such as the Raspberry Pi can be installed in them, but this adds cost, and their computational capabilities are insufficient to run such artificial intelligence models smoothly.
The moment when widely available microcomputers become both cheap and powerful will be the moment when we witness the mythical war of robots against humans. To put it less dramatically — we will observe completely autonomous robots killing people. For now, this phenomenon is rare on the battlefield for technical reasons. As we can see, Asimov’s Laws of Robotics (created by Isaac Asimov, American science-fiction writer and science popularizer) can be thrown into the trash — before robots even appeared for good, these laws lost all significance.
Thus, supervised learning models work effectively but do not learn during operation. In other words, if server administrators do not ensure that the model is systematically retrained, hackers — or some form of artificial intelligence — will find new forms of cyberattack that the model will not recognize, because it did not have the chance to learn them.
Unsupervised Models
Unsupervised learning models for anomaly detection learn by identifying normal patterns in data and marking unusual observations as anomalies. These models learn without supervision, meaning they have no information about which data points are anomalies and which are normal — this is called learning from unlabeled data.
Such models mainly analyze the structure of the data and create a pattern of normality. New data entering the server are evaluated by comparison, identifying differences relative to the learned pattern.
Typical unsupervised algorithms for anomaly detection include:
- Clustering methods,
- Statistical methods (e.g., PCA, KDE), which identify outliers in a distribution,
- Autoencoders (based on neural networks),
- Isolation-based models, e.g., Isolation Forest, which isolates anomalies quickly.
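One of the statistical methods above, PCA, can be sketched briefly: project the data onto a few principal components and treat points with a large reconstruction error as outliers. The scikit-learn usage and toy data are assumptions for illustration:

```python
# PCA reconstruction error as an unsupervised outlier score (illustrative sketch).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Normal points lie near a 1-D line in 2-D space; one point sits far off it.
t = rng.normal(size=200)
X_normal = np.column_stack([t, 2 * t + rng.normal(0, 0.1, size=200)])
outlier = np.array([[0.0, 10.0]])
X = np.vstack([X_normal, outlier])

pca = PCA(n_components=1).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_rec, axis=1)  # reconstruction error per point

most_suspicious = int(np.argmax(errors))  # the planted outlier at index 200
```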
The Role of Logs
The human body communicates with the brain using protocols generated by the nervous system. If an internal organ is attacked by pathogens, we feel pain in a specific area. Pain is a protocol sent directly to the brain, whose purpose is to inform the decision-making center about a problem.
At the same time, there are procedures independent of our will. If dust enters the respiratory tract, we start sneezing and coughing.
Server logs work in a similar way, except they are generated and sent continuously — whether there is a problem or not. Logs are continuous information about server activity, sent to an analytical center.
At the current stage of development, this analytical center — the “brain” — is still a human. However, the evolution of AI systems introduces an intermediate problem identification layer between logs and humans, so that the human is not flooded with endless streams of information but instead receives synthetic summaries about the system’s state.
The key is to make the system intelligent enough to identify and immediately eliminate threats. This is exactly what artificial intelligence models are designed for.
Thus, the log register on a server is one of the key data sources for analyzing cyber threats.
We can distinguish five main types of logs:
- Access logs,
- Error logs,
- Authentication and authorization logs,
- Firewall logs,
- Application logs.
When identifying anomalies, which often turn out to be malicious software or a veiled form of cyberattack, application logs are the most useful, as they record user behavior on the server.
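Before any model can use logs, the raw lines must be parsed into structured fields. A minimal sketch using Python's standard library; the access-log format and the sample line are assumed examples, not taken from the text:

```python
# Minimal parser for a simplified access-log line (the format is an assumed
# example resembling common web-server logs).
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = '198.51.100.7 - - [10/Oct/2025:13:55:36 +0000] "GET /login HTTP/1.1" 401 512'
m = LOG_PATTERN.match(line)
record = m.groupdict() if m else None
# record now holds structured fields a model can consume,
# e.g. the IP address, HTTP method, requested path, and status code.
```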
Application Logs
Application logs play a key role in both recommendation systems and cyberattack defense systems because they provide valuable information about user behavior.
They show:
- clicks on products or content,
- time spent on a product page or article,
- scrolling behavior,
- cart actions (add/remove product, purchase),
- ratings and reviews.
Hacker attacks are usually carried out by bots programmed to imitate human behavior. They can scroll pages, move to subpages, and perform actions similar to human users. However, bots are created according to specific templates and replicated, so they behave identically. Their uniformity makes them detectable and removable.
Moreover, bots do not respond to graphic or contextual cues as humans do. Of course, advanced AI can detect nuances in images, but bots are simple — they must be. If each bot contained a neural network for image recognition, the hacker’s computer controlling thousands of such bots would be heavily overloaded and inefficient. This would make mass attacks impossible — which, as we know, are one of the most common forms of cyberattacks (DDoS attacks).
DDoS-Type Cyberattacks
Bots are primarily designed to attack the application layer (Layer 7). The attack consists of invoking the most resource-intensive queries or functions of a website.
All DDoS (Distributed Denial of Service) attacks aim to paralyze the server by consuming as much of its resources as possible. Other forms include volume-based attacks, where massive amounts of “empty” queries are sent to the server, and protocol attacks (e.g., SYN Flood), which exploit interrupted or incomplete connection handshakes to exhaust server resources.
Intrusive Attacks
DDoS attacks aim to overload a server artificially, but logs can also be used to fight intrusive attacks such as SQL Injection, XSS, or RCE. These are far more dangerous because they are active and aim for specific goals — such as data theft or taking control of the server.
- SQL Injection involves injecting malicious code into database queries, allowing the attacker to read, modify, or delete data, and sometimes gain full control over the database.
- Cross-Site Scripting (XSS) injects malicious JavaScript code into an application, which can lead to session hijacking, fake content display, or account takeover.
- Remote Code Execution (RCE) allows arbitrary code to be executed on the server, potentially taking control of the system. Related techniques include LFI (Local File Inclusion) and RFI (Remote File Inclusion).
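Even before machine learning is applied, log lines can be screened for crude injection signatures. A minimal sketch; the patterns below are simplified illustrations, not a production rule set:

```python
# Naive signature screening of request paths for injection attempts.
# The patterns are deliberately simplified illustrations.
import re

SIGNATURES = {
    "sqli": re.compile(r"('|%27)\s*(or|union|--)|union\s+select", re.IGNORECASE),
    "xss": re.compile(r"<script|%3Cscript", re.IGNORECASE),
    "rce_lfi": re.compile(r"\.\./|/etc/passwd", re.IGNORECASE),
}

def classify(path: str):
    """Return the names of all signatures matching a request path."""
    return [name for name, pat in SIGNATURES.items() if pat.search(path)]

hits = classify("/search?q=' OR 1=1 --")   # classic SQL injection probe
clean = classify("/products?page=2")       # ordinary request
```

Signature rules like these catch only known attack templates; the anomaly-detection models discussed next are what catch the novel variants.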
How Machine Learning Uses Application Logs to Detect Attacks
The best method of defense against cyberattacks is detecting anomalies using unsupervised Machine Learning algorithms. Here are the key models:
Isolation Forest
Like the well-known Random Forest, it builds many decision trees (e.g., 100–200). Each tree divides a random subset of data to check how quickly a given point can be “isolated.” If a point is isolated very quickly, it is an anomaly.
Each data point (e.g., username, IP, server query) passes through multiple trees. For each, the model computes the average number of splits required for isolation. Normal points need many splits; anomalies require few.
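The isolation mechanism described above can be sketched with scikit-learn's IsolationForest; the library choice and the toy feature values (requests per minute, bytes sent) are assumptions:

```python
# Isolation Forest: points that can be isolated in few random splits
# receive low scores and are flagged as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Toy features per client: (requests/min, bytes sent).
X_normal = rng.normal([50, 500], [5, 50], size=(300, 2))
X_attack = np.array([[700.0, 20000.0]])  # one extreme burst of traffic
X = np.vstack([X_normal, X_attack])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = model.predict(X)  # +1 = normal, -1 = anomaly
attack_flagged = bool(labels[-1] == -1)
```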
Autoencoders
An unsupervised neural-network model; with periodic retraining, it can keep adapting during operation.
It works in two steps:
- Encoding – compressing large datasets (e.g., user logs) into a smaller representation.
- Decoding – reconstructing the original data from the compressed form.
The model learns to reconstruct “normal” transactions. If a suspicious transaction appears (e.g., strange activity, too fast or too repetitive), the reconstruction error grows, and the model flags it as anomalous.
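A compressed sketch of this encode/decode idea, using scikit-learn's MLPRegressor trained to reproduce its own input through a narrow hidden layer as a stand-in for a full deep-learning autoencoder; the library choice and synthetic data are assumptions:

```python
# Tiny autoencoder sketch: a neural net with a 2-unit bottleneck trained to
# reconstruct its 4-feature input. Off-pattern inputs reconstruct poorly.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
# "Normal" transactions: all four features move together (a 1-D pattern).
t = rng.uniform(-1, 1, size=(500, 1))
X_train = np.hstack([t, t, t, t]) + rng.normal(0, 0.02, size=(500, 4))

ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000, random_state=0)
ae.fit(X_train, X_train)  # encode to 2 units, decode back to 4 features

def reconstruction_error(x):
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return float(np.linalg.norm(ae.predict(x) - x))

normal_err = reconstruction_error([0.5, 0.5, 0.5, 0.5])    # follows the pattern
anomaly_err = reconstruction_error([0.5, -0.9, 0.5, 0.2])  # breaks the pattern
```

Flagging then reduces to thresholding the reconstruction error: the off-pattern input reconstructs worse than the normal one.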
This can be illustrated simply: we expect an elderly woman on a tram to behave calmly. Suddenly, a person dressed as a dragon enters, shaking uncontrollably and behaving unpredictably. Our brain cannot anticipate what happens next; this is stochastic, anomalous behavior. For such sequential, contextual data, autoencoders can also be built on recurrent neural networks (RNNs).
LOF (Local Outlier Factor)
LOF compares a user's activity with that of its neighbors by measuring local density. If a point lies in a dense cluster, it is typical; if it lies alone, it is suspicious. Conceptually, the approach resembles statistical tests based on distribution density.
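The density comparison above can be sketched with scikit-learn's LocalOutlierFactor; the library choice and toy data are assumptions:

```python
# Local Outlier Factor: compares each point's local density with that of
# its neighbors; isolated points in low-density regions are flagged.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X_cluster = rng.normal(0, 0.5, size=(100, 2))  # dense cluster of typical users
X_lonely = np.array([[5.0, 5.0]])              # one isolated, suspicious point
X = np.vstack([X_cluster, X_lonely])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # +1 = inlier, -1 = outlier
lonely_flagged = bool(labels[-1] == -1)
```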
Why Use Autoregressive Models for Anomaly Detection?
Besides supervised and unsupervised models, autoregressive models are very important in anomaly detection because they can capture what others cannot — temporal changes.
Isolation Forest, LOF, and Autoencoders detect anomalies in static data, but they do not analyze time-based dynamics. Autoregressive models detect anomalies in trends and sequences.
Models such as ARIMA (Autoregressive Integrated Moving Average) and ARMA (Autoregressive Moving Average) belong to statistical time-series forecasting. They model dependencies between observations over time.
Autoregressive models detect anomalies in temporal dynamics — for example, they can see that someone performs the same action every hour or that there are strange pauses in server activity.
If a server usually handles 500 requests per minute and the autoregressive model predicts 480–520, but suddenly it receives 700, the model marks this as an anomaly. Such models are ideal for detecting DDoS attacks, unusual logins, and sudden changes in user behavior.
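The request-rate example above can be sketched with a simple rolling forecast band; this numpy-only stand-in replaces a full ARIMA fit, and the traffic numbers are the text's illustrative values:

```python
# Rolling-mean forecast band as a lightweight stand-in for an autoregressive
# model: flag observations that fall outside the band predicted from history.
import numpy as np

rng = np.random.default_rng(5)
# Requests per minute: ~500 normally, with one sudden spike to 700.
series = rng.normal(500, 7, size=60)
series[45] = 700.0

window, k = 10, 3.0  # look-back window and tolerance in standard deviations
anomalies = []
for i in range(window, len(series)):
    history = series[i - window:i]
    mu, sigma = history.mean(), history.std()
    if abs(series[i] - mu) > k * sigma:  # outside the predicted band
        anomalies.append(i)
```

A fitted ARIMA model would produce a tighter, trend-aware band, but the flagging logic is the same: compare each new observation against the model's prediction interval.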
Summary
Network security is a key condition for the development of the digital economy. Over millions of years of evolution, living organisms have developed systems that eliminate invisible biological threats such as microbes, viruses, and bacteria. The same is now happening in digital evolution.
Today, cybersecurity no longer relies solely on firewalls or simple associative rules. Increasingly sophisticated forms of cyberattacks require equally sophisticated and effective countermeasures.


