
Data Science in the Milling Industry
An artificial neural network is loosely modeled on the biological neural network of the brain. In specific tasks it can reach a level of performance unattainable for the average person. However, a neural network is not an intelligence of the kind we are accustomed to. It is an information system that, like a production machine, perfects itself within a narrow specialization, achieving very high efficiency.
An artificial neural network is built from many layers of neurons that communicate with one another. Neural networks learn iteratively, that is, by repeatedly performing the same simple calculations, each time slightly improving the accuracy of their estimates.
How Does a Neural Network Work?
Natural and artificial neurons function in very similar ways. Each acts as a kind of relay: information flows into the neuron, and whether the neuron passes that information on or holds it back depends on the intensity of the incoming signal. That intensity is determined by the weights assigned to the information. In biology, the weights assigned to stimuli are the intensities of electrical charges.
The functioning of a single neuron can be compared to the reactions of a sleeping cat. The cat may be sleeping on the carpet in a room. Various sounds reach it: the television, people's conversations, the noise of a dishwasher. Yet a gentle scratching is enough for the cat to open its eyes wide and perk up its ears. This is how a single neuron operates: like the cat, it reacts only to the stimuli that matter to it. Training a neural network, in turn, is like teaching a cat to catch mice, a cat that for some reason did not inherit this skill from its ancestors.
Different signals are supplied to the neural network simultaneously: large numbers and small fractions, as well as zero-one values, often arrive at the same time. Before the information is delivered to the neurons, the numbers must be standardized. Standardization, in its simplest form, transforms the numbers so that their distribution has a mean of zero and a standard deviation of one. The network accepts all the standardized signals and initially assigns them random weights.
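As a minimal illustration, here is what this simplest form of standardization looks like in Python with NumPy (the sample values are invented):

    import numpy as np

    def standardize(x):
        # Shift to mean 0 and scale to standard deviation 1 (z-score).
        return (x - x.mean()) / x.std()

    signals = np.array([120.0, 0.5, 3000.0, 0.0, 1.0])  # mixed magnitudes
    z = standardize(signals)
    print(z.mean(), z.std())  # approximately 0.0 and 1.0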
The learning of an artificial neural network consists of gradually changing the weights assigned to individual pieces of information. Gradually, the most important information is sharpened, while the least important information is dulled.
A sample neuron receives three signals: x1, x2, x3. Each of these signals is assigned a weight: w1, w2, and w3. In the diagram, I placed the Greek letter Σ, which represents a sum. The neuron sums all the signals x1, x2, x3, strengthened or weakened by the weights w1, w2, w3, and emits a value Z at the output. This phenomenon can be described by the simple formula:

Z = x1·w1 + x2·w2 + x3·w3
The total excitation signal of the neuron, Z, that is, the weighted sum of the signals, travels to the activation function. The value Z is called the postsynaptic potential (PSP). The activation function can be any simple mathematical function of the single argument Z, though it is assumed that it should not be a linear function.
In the diagram, this function is written as f(Z).
Whether the activation function fires depends on the intensity of the total postsynaptic signal Z. Just as the cat reacts to the sound of a scratching mouse, the neural network learns to distinguish significant information from noise.
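To make the mechanics concrete, here is a minimal sketch in Python of the single neuron described above: three inputs, three weights, a summation, and an activation function. ReLU is used here as an example, and the input and weight values are invented for illustration.

    import numpy as np

    def relu(z):
        # The neuron fires only if the postsynaptic potential is positive.
        return max(0.0, z)

    x = np.array([0.8, -1.2, 0.4])   # three standardized input signals
    w = np.array([0.5, 0.1, -0.3])   # weights, random at the start of training

    Z = np.dot(w, x)  # postsynaptic potential: the weighted sum of the inputs
    y = relu(Z)       # output of the neuron after the activation function
    print(Z, y)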
How Does a Neural Network Learn?
The network starts with random weights. Using them, it processes the information and checks the result against the target value encoded in the loss function. Based on the results of the calculations, the algorithm adjusts its settings and recalculates everything anew. An artificial neural network repeats this same simple computational process hundreds of times, each time changing the weights of the individual input variables. The network adjusts the weights based on successive values of the loss function. Each neuron receives a set of input variables x1 through xn and calculates an output value y.
Training a neural network is classified as supervised learning. Let’s assume we want to create a model for forecasting grain prices on a commodity exchange. We have information from the previous year about rainfall levels, temperature, direct payment amounts, corn prices, and a range of other data. Our output variable in the model is the price of the grain.
All the mentioned information flows into each neuron, and each neuron calculates the output value in the form of price y. Each input variable has its own set of weights for each layer. Supervised learning means that we train the model on historical data. We input historical input data into the model, and the model performs calculations of the output value and then checks whether the calculated theoretical value is close to the historical empirical value.
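A minimal sketch of such a supervised setup in Python, using scikit-learn's MLPRegressor; the records and their values below are invented placeholders, not real market data:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Hypothetical historical records: rainfall, temperature,
    # direct payments, and corn price as input variables.
    X = np.array([
        [550.0, 8.1, 120.0, 165.0],
        [480.0, 9.3, 118.0, 172.0],
        [610.0, 7.6, 121.0, 158.0],
        [500.0, 8.8, 119.0, 170.0],
    ])
    y = np.array([182.0, 190.0, 175.0, 188.0])  # historical grain prices

    model = MLPRegressor(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
    model.fit(X, y)              # iteratively adjusts the weights
    print(model.predict(X[:1]))  # theoretical price for the first record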
The network repeats this computation loop hundreds of times. Each neuron in a layer performs similar calculations, described by the following equation:

y = a

where a is the activation of the neuron, calculated by:

a = f(Z) = f(w1·x1 + w2·x2 + … + wn·xn)

Each time, the results are confronted with the activation function, and the calculations are conducted as matrix operations.
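As a sketch of what those matrix operations look like in practice, here is one layer of five neurons computed in a single step in Python (the sizes and values are arbitrary illustrations):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    x = np.array([0.8, -1.2, 0.4])    # three input signals
    W = np.random.randn(5, 3) * 0.1   # weight matrix: 5 neurons x 3 inputs
    b = np.zeros(5)                   # one bias per neuron

    Z = W @ x + b  # postsynaptic potentials of all five neurons at once
    a = relu(Z)    # activations of the whole layer
    print(a)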
What Role Does the Activation Function Play in Learning?
The activation function is a very important element in the learning process of the neural network. It must be simple, because its simplicity greatly affects the speed of learning. Currently, in deep learning, the ReLU (Rectified Linear Unit) function is the most commonly used; sigmoid and tanh functions are used somewhat less frequently.
In the illustration below, we see the ReLU function. The neuron activates only when the postsynaptic potential exceeds the value n on the X-axis; below that threshold, the output is zero. As can be seen, the value n is added artificially, shifting the standard ReLU, which normally activates at zero.
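For reference, here is a sketch of the three activation functions mentioned above, written in Python with NumPy:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)        # zero below the threshold, linear above

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # squashes Z into the range (0, 1)

    def tanh(z):
        return np.tanh(z)                # squashes Z into the range (-1, 1)

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(z), sigmoid(z), tanh(z))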
Loss Function
This is the primary source of feedback about progress in learning. With each iteration of the neural network, a calculation result is produced. Since we are conducting supervised learning, we know what the results y1 … yn should be for successive records of the input variables x1, x2, x3, … xn.
The neural network calculates theoretical results ŷ and then compares them with the historical values y. The loss function is most often the sum of squared differences between the theoretical values ŷ and the empirical values y.
The purpose of the loss function is to indicate how much the theoretical results differ from the empirical results.
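A minimal sketch of such a loss function in Python; the values are invented, and the sum of squared differences is used exactly as described above:

    import numpy as np

    y = np.array([182.0, 190.0, 175.0])      # empirical (historical) values
    y_hat = np.array([180.5, 191.2, 173.9])  # theoretical values from the network

    loss = np.sum((y - y_hat) ** 2)  # sum of squared differences
    print(loss)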
After each of the hundreds of iterations of the neural network, an assessment appears in the form of the value of the loss function. Based on this, the network adjusts the weights, striving to minimize the next value of the loss function. The network performs as many iterations as its programmer specifies. The greatest progress in learning occurs at the beginning of training.
It is like a musician practicing a sonata on the violin thousands of times, each time performing it better. In the end, they polish the piece to the point where further progress is no longer noticeable to the untrained ear.
Gradient Descent Principle
Finally, it should be mentioned how the neural network learns based on the loss function. Weights are adjusted using the gradient descent principle, which is based on differential calculus and the derivative of the function.
Let's imagine a vast green valley. From one of the peaks surrounding the valley, we release a ball. The ball rolls down, bouncing off hills and irregularities, and stops in some depression without reaching the bottom of the valley. We nudge the ball out of this local minimum, and it continues downward. Our goal is for the ball to roll to the bottom of the valley, but this does not always succeed.
This is a way to picture error minimization with the gradient descent method. At each iteration of the neural network, the partial derivative of the loss function is calculated with respect to each parameter of the network. The derivative tells us whether the function is increasing or decreasing at that point. Thanks to this, the dozens of balls symbolizing the weights of our neural network model know where the bottom of their valley lies. In this way, the network knows in which direction to move to minimize the deviations. After each iteration, the gradient indicates the direction of optimization for the individual weights.
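A minimal sketch of the idea in Python, on a toy loss function L(w) = (w - 3)² whose minimum, the bottom of the valley, sits at w = 3 (the function and the starting point are arbitrary illustrations):

    def loss(w):
        return (w - 3.0) ** 2  # toy loss with its minimum at w = 3

    def grad(w):
        return 2.0 * (w - 3.0)  # derivative of the loss with respect to w

    w = 0.0  # starting weight
    learning_rate = 0.1
    for step in range(50):
        w -= learning_rate * grad(w)  # move against the gradient
    print(w, loss(w))  # w ends up close to 3, the bottom of the valley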
Learning Rate
The learning rate of the network is defined for all neurons as the length of the step they can take at each iteration. If the steps are small, learning may take a very long time; worse still, the optimization may get stuck in a local minimum, lacking the momentum to escape it.
Referring to our example of the green valley: with too small a step, the ball may fall into a hole and never reach the bottom of the valley, while kicks that are too strong may bounce it back and forth over the bottom without letting it settle. The learning process then becomes somewhat chaotic.
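Continuing the toy example above, this sketch shows how the step length changes the behavior; the three rates are chosen only to illustrate the two failure modes:

    def grad(w):
        return 2.0 * (w - 3.0)  # same toy loss as before, minimum at w = 3

    for lr in (0.001, 0.1, 1.05):  # too small, reasonable, too large
        w = 0.0
        for _ in range(50):
            w -= lr * grad(w)
        print(lr, w)  # 0.001 barely moves, 0.1 converges to 3, 1.05 diverges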
The General Form of a Multilayer Neural Network
The diagram below shows the theoretical appearance of a multilayer neural network. Four independent variables flow into the network from the left side, creating the input layer of neurons. Information flows into subsequent internal layers, each adjusting the importance of the information through the level of weights assigned to these pieces of information. The information reaches the output layer, where the theoretical results are verified against empirical (historical) values. The process then returns to the starting point. Utilizing the gradient descent method, the network adjusts weights in the subsequent layers to reduce the sum of squares of the model’s errors.
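A sketch of such a network's forward pass in Python, with four inputs, two hidden layers, and one output; the layer sizes and random weights are illustrative only:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    rng = np.random.default_rng(0)
    sizes = [4, 6, 6, 1]  # input layer, two hidden layers, output layer
    weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(m) for m in sizes[1:]]

    a = np.array([0.2, -0.7, 1.1, 0.4])  # four standardized input variables
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)               # each hidden layer re-weights the signals
    y_hat = weights[-1] @ a + biases[-1]  # linear output layer
    print(y_hat)                          # theoretical output of the network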
Source of the diagram: Vinicius Araujo, Augusto Guimarães, Paulo Campos Souza, Thiago Rezende, and Vanessa Araujo, "Using Resistin, Glucose, Age and BMI and Pruning Fuzzy Neural Network for the Construction of Expert Systems in the Prediction of Breast Cancer", Machine Learning and Knowledge Extraction (2019).
Application of Neural Networks
The best example of the application of neural networks is image recognition, a task that other machine learning models handle poorly.
A network can learn to recognize bicycles in photos even though it has no knowledge of bicycles: it does not know their purpose or how they are constructed. The network receives several hundred images of bicycles on streets, along with photos of streets without bicycles. The photos with bicycles are labeled one, and the photos without bicycles zero.
Each photo consists of thousands of pixels; the network assigns a weight to each cluster of pixels, which it then adjusts. The network recognizes the shape of the wheel, the saddle, and the handlebars. It also identifies the characteristic positioning of people’s bodies on bicycles. By repeating the review of photos hundreds of times, it finds patterns and dependencies. Similarly, networks recognize customer behaviors in a store, finding patterns of behavior, identifying through movements whether a customer is indecisive or convinced.
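A minimal sketch of that labeling setup in Python with scikit-learn; real photographs and feature extraction are omitted, and random pixel arrays stand in for the images:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    # Stand-ins for photos: each "image" flattened into a vector of pixels.
    X = rng.random((200, 64 * 64))
    y = rng.integers(0, 2, 200)  # 1 = bicycle in the photo, 0 = no bicycle

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
    clf.fit(X, y)              # the network adjusts the pixel-cluster weights
    print(clf.predict(X[:5]))  # predicted labels for the first five "photos"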
The primary goal of creating artificial neural networks was to solve problems at the level of general abstraction, as the human brain would do. However, it turned out that it is more effective to use networks for specialized tasks. In such cases, neural networks significantly surpass human perception.
Training a high-class specialist takes a great deal of resources and time, and human work is expensive and error-prone. Models based on neural networks can diagnose diseases and faults better than the human mind, drawing on enormous information resources from around the world. Such systems make very few mistakes, their operating costs are minimal, they can work 24 hours a day, and they can be replicated.
Two hundred years ago, machines began to replace humans in tedious jobs, working faster, better, and more efficiently. Now we are witnessing machines starting to replace us in intellectual activities.
Wojciech Moszczyński