 
Building a Recommendation System (Part 3)
Factorization Machines – how a baker can build an effective recommendation system based on client log data
Contemporary online shops, even small ones run by local bakers, can effectively use customer data to personalize offers. One of the most effective approaches to creating product recommendations is the Factorization Machines (FM) method.
Below, I will explain how to build such a system step by step, show the differences between FM and the classic PCA + k-means approach, and present an example of how such personalization can translate into increased sales.
Collecting customer data
In the previous part, I explained how to collect information about your customers. First and foremost, to perform any analysis we need a database describing them. We are analyzing a confectionery store that operates online: each customer must log in before making a purchase. As I pointed out in the previous publication, the owner has their own website and direct access to the logs.

Logs are simple records generated by the system that capture customers' actions. The first log records that the customer signed in to the store with their login and password. Subsequent logs record what the customer typed into the search box and which sections of the website they visited. From the logs, we can learn how long the customer took to make a purchase, whether they hesitated, or whether they abandoned the purchase. System logs are a treasure trove of information.

In the previous publication, I explained how to place information from the logs into a table. This table is a database describing the characteristic features and behavior of each individual customer. In this part, I will show how, with 30 characteristic features for around 100,000 customers of our store, we can create a recommendation system that selects appropriate recommendations individually for each customer.
In our example, based on the customer's activity logs, we extracted so-called descriptive features, e.g.:
Customer ID | Average time to purchase | Number of purchases | Average basket value | Preferred shopping time | Age group
1001        | short                    | 5                   | 120 PLN              | evenings                | 26–35
1002        | long                     | 25                  | 80 PLN               | mornings                | 18–25
1003        | medium                   | 10                  | 200 PLN              | weekends                | 35–45
In practice, there may be 30 or more such features, including, for example, preferred flavors, purchase frequency, favorite days of the week, etc.
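Before a model can consume such a table, the categorical columns must be turned into numeric indicator (one-hot) columns, which is also the sparse input format FM expects. Below is a minimal sketch with pandas, using the illustrative rows from the table above; the column names are my own shorthand, not fixed terminology:

```python
import pandas as pd

# Illustrative customer-feature table (same rows as above).
customers = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "avg_time_to_purchase": ["short", "long", "medium"],
    "n_purchases": [5, 25, 10],
    "avg_basket_pln": [120, 80, 200],
    "preferred_time": ["evenings", "mornings", "weekends"],
    "age_group": ["26-35", "18-25", "35-45"],
})

# One-hot encode the categorical columns; numeric columns stay as they are.
encoded = pd.get_dummies(
    customers,
    columns=["avg_time_to_purchase", "preferred_time", "age_group"],
)
print(encoded.head())
```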
Building a recommendation system with Factorization Machines
Factorization Machines is a supervised learning model (regression/classification) that extends ordinary linear regression with the ability to automatically learn interactions between pairs of features, even with very high data dimensionality (30 or more features). FM models dependencies between features efficiently through so-called latent factors (embeddings), instead of explicitly enumerating all feature combinations. This allows it to capture patterns that a traditional linear model would miss, while avoiding an explosion in the number of parameters (the curse of dimensionality). FM became popular in recommendation systems and click-through prediction precisely because of this ability to model interactions in high-dimensional data. In our setting, the FM model predicts, for example, the probability that a customer will click on or buy a given product.
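In the standard formulation (Rendle, 2010), the second-order FM prediction for a feature vector $\mathbf{x}$ is:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

where $w_0$ is a global bias, the $w_i$ are ordinary linear weights, and each feature $i$ additionally receives a $k$-dimensional latent vector $\mathbf{v}_i$; $\langle \cdot , \cdot \rangle$ denotes the dot product. The last term is what lets FM model every pairwise interaction while learning only $n \cdot k$ extra parameters.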
On the basis of these data, one can train a Factorization Machines model which:
• recognizes hidden patterns in customer preferences,
• does not require manual creation of feature combinations,
• deals effectively with sparse data (e.g., a customer bought a cake only once).
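To make this concrete, here is a minimal NumPy sketch of the FM prediction for a single feature vector, using Rendle's well-known reformulation of the pairwise term. The parameter values are random placeholders, not a trained model:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for a single feature vector x.

    x  : (n,)   feature vector (typically sparse / one-hot)
    w0 : ()     global bias
    w  : (n,)   linear weights
    V  : (n, k) one k-dimensional latent vector per feature
    """
    linear = w0 + w @ x
    # Pairwise term via the O(n*k) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    xv = x @ V                      # (k,)
    x2v2 = (x ** 2) @ (V ** 2)      # (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 12, 4                                    # 12 features, 4 latent dimensions
x = rng.integers(0, 2, size=n).astype(float)    # toy one-hot-style input
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))
print(fm_predict(x, w0, w, V))                  # raw score; apply a sigmoid for a probability
```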
The model learns to predict which products a specific customer is most likely to purchase. It performs better than simple rules like "whoever bought A will buy B" because it accounts for many subtle interactions between features, capturing both the explicit signals and the hidden structure in the data.
For example, customer ID 1002 has a high propensity for discounts and usually shops in the morning. FM will learn that cream cakes on Monday mornings with a discount code have a higher chance of being purchased by this customer. Based on the model's prediction of a higher probability of such behavior (customer 1002: morning + promotion), the system can automatically suggest morning product promotions to them, and even prepare a personalized promotion just for this customer.
A typical recommendation system based on FM can take into account, among others: user features (demographics, behavioral segment, even hundreds of features such as gender, risk tendencies, or various types of phobias), product features (category, price, popularity, etc.), and context (time of day, device, channel, etc.). FM efficiently computes interactions among these features. It discovers hidden interactions that a linear model would never capture without a manually defined interaction. Thanks to such automatic feature combination, FM works well for tasks with very sparse data, such as product recommendations or click-through prediction.
Matrix Factorization
To better understand Factorization Machines, let's start with its predecessor, Matrix Factorization (MF), and learn by doing. We want to create a recommendation system for baked goods (e.g., rolls, cakes, bread) that suggests to each customer what else they might like, based on the preferences of other, similar customers.
We have 5 customers and 5 products arranged in a preference matrix, where "OK" means the customer likes a product, "No-OK" means they don't like it, and "n/a" is simply missing data.
Instead of guessing “who likes what,” Matrix Factorization:
• learns hidden customer features (e.g., prefers sweet, soft, healthy, poppy-seeded…),
• learns hidden product features (e.g., sweetness, hardness, fiber content…).
As a result, in our example MF allows us to say: “Customer A likes sweet pastries and doesn’t like heavy cakes → they will probably enjoy a croissant, because other customers with a similar profile bought it and praised it.”
For each customer and product, MF computes a match score: how many points a given customer would give a given baked good, even if they have not tried it yet. It thus looks for similarities in customers' preferences across products, as shown in the sketch below.
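Here is a minimal NumPy sketch of this idea, using gradient descent on a small preference matrix. The 0/1/NaN values stand in for "No-OK"/"OK"/"n/a" and are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 5x5 preference matrix (rows: customers, cols: products).
# 1 = likes ("OK"), 0 = dislikes ("No-OK"), np.nan = no data ("n/a").
R = np.array([
    [1,      0,      np.nan, 1,      np.nan],
    [np.nan, 1,      1,      np.nan, 0     ],
    [1,      np.nan, 0,      1,      1     ],
    [0,      1,      np.nan, np.nan, 1     ],
    [np.nan, 0,      1,      0,      np.nan],
])
known = ~np.isnan(R)

k = 2                                    # number of latent factors
U = rng.normal(scale=0.1, size=(5, k))   # latent customer features
P = rng.normal(scale=0.1, size=(5, k))   # latent product features

lr, reg = 0.05, 0.01
for _ in range(2000):                    # gradient descent on the known cells only
    E = np.where(known, np.nan_to_num(R) - U @ P.T, 0.0)
    U += lr * (E @ P - reg * U)
    P += lr * (E.T @ U - reg * P)

print(np.round(U @ P.T, 2))              # filled-in match scores, incl. the "n/a" cells
```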
The notion of a match score did not come out of nowhere. Matrix Factorization was first applied at scale in the Netflix competition. It was not invented by Netflix, but it was popularized thanks to the Netflix Prize, announced in 2006, with a $1 million award for improving the accuracy of their recommendation system by 10%.
Thanks to this approach, the system was able to learn subtle customer preferences and similarities between films. From that moment on, MF became one of the foundations of modern recommendation systems.
“Rich data”: how Matrix Factorization differs from Factorization Machines
Subsequent variants of Matrix Factorization, such as Factorization Machines, use so-called rich data. This consists of augmenting classical collaborative filtering with additional information:
• customer data: age, purchase time, city, dietary preferences,
• product data: type of flour, caloric value, ingredients, allergens,
• context data: online vs. offline purchase, time of day, promotions.
We append such features to the matrix, and instead of pure Matrix Factorization (from “Netflix”), we use, for example, Factorization Machines or DeepFM, which learn interactions among these features. This can be achieved using Python libraries: lightFM, xlearn, scikit-surprise, fastFM. With respect to cloud solutions, we can apply: AWS: Amazon Personalize; GCP: Recommendations AI; Azure: Azure ML + LightGBM + AutoML.
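As an illustration, here is a minimal sketch with the LightFM library mentioned above; the customer IDs, product names, and feature names are invented for the example:

```python
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Hypothetical interaction data: (customer, product) purchase pairs.
purchases = [(1001, "croissant"), (1001, "meringue_cake"),
             (1002, "rye_bread"), (1003, "croissant")]

dataset = Dataset()
dataset.fit(
    users=[1001, 1002, 1003],
    items=["croissant", "meringue_cake", "rye_bread"],
    user_features=["morning_shopper", "evening_shopper", "discount_prone"],
    item_features=["sweet", "savory"],
)
interactions, _ = dataset.build_interactions(purchases)
user_features = dataset.build_user_features([
    (1001, ["evening_shopper"]),
    (1002, ["morning_shopper", "discount_prone"]),
    (1003, ["evening_shopper"]),
])
item_features = dataset.build_item_features([
    ("croissant", ["sweet"]),
    ("meringue_cake", ["sweet"]),
    ("rye_bread", ["savory"]),
])

model = LightFM(no_components=10, loss="warp")   # 10 latent dimensions
model.fit(interactions, user_features=user_features,
          item_features=item_features, epochs=20)

# Score all three products for customer 1002 (internal index 1).
scores = model.predict(1, np.arange(3),
                       user_features=user_features, item_features=item_features)
print(scores)
```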
It is worth remembering that a Factorization Machines system learns on its own from customers’ purchases—the more data, the better the recommendations.
What does the recommendation system look like for each customer?
We can use a trained FM model to recommend products. In offline (batch) mode, this is usually done periodically.
For each user (or at least for each active user), we generate predictions for specific products that may interest them. Because exhaustively scoring all user–product combinations can be costly with a very large assortment, in practice the candidate set is limited to popular products, products from a given category, new items, etc. The FM model then computes a purchase probability for each individual customer over this smaller set of products. For a small cake shop, of course, the assortment is small enough that one can score all products. We sort products by predicted purchase probability (or preference score) and select the top 10 for each user. The generated recommendation lists can be saved, for example, in a database; for instance, each week we update the list of recommended cakes for each customer based on the latest data. A minimal sketch of this batch step is shown below.
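In the sketch, the scoring function is a random stand-in for a trained FM model, and all names are illustrative:

```python
import numpy as np

def batch_top_n(score_fn, user_ids, product_ids, n=10):
    """Build a {user: [top-n products]} table from any trained scoring model.

    score_fn(user, products) -> array of predicted purchase probabilities;
    plug the FM model's prediction function in here.
    """
    recommendations = {}
    for user in user_ids:
        scores = score_fn(user, product_ids)
        top = np.argsort(scores)[::-1][:n]       # highest scores first
        recommendations[user] = [product_ids[i] for i in top]
    return recommendations

# Toy stand-in for a trained FM model (random scores, illustration only).
rng = np.random.default_rng(7)
fake_fm = lambda user, products: rng.random(len(products))

products = ["croissant", "meringue_cake", "rye_bread", "poppy_roll", "eclair"]
print(batch_top_n(fake_fm, [1001, 1002, 1003], products, n=3))
```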
Using the recommendations: in the application (e.g., the online store), after the user logs in, we can display their personalized list of recommended products from this prepared list. In the batch system, this list remains fixed until the next update (e.g., until the next day).
Comparing Factorization Machines with PCA + k-means
Some time ago, I wrote at length, and enthusiastically, about assigning customers to purchasing clusters using the PCA + k-means method.
This method consisted of collecting information on customers' behavior in the online store: how often they visited the store, how much they spent on purchases, how long they took to make decisions. To that I added statistical data such as age, place of residence, and type of business. You cannot meaningfully run k-means on, say, 40 raw features: distances become uninformative in high-dimensional spaces, so in practice clustering is done on two, at most three, derived dimensions. Therefore, I first ran PCA, a method that compresses the customer features into typically 2–3 so-called principal components, which summarize the features assigned to individual customers. This recommendation method is unfortunately already outdated. I mention it because many of my publications were devoted to it.
The classical segmentation approach PCA + k-means consists of:
• reducing the dimensionality of features (PCA),
• clustering customers (e.g., into 5 groups),
• assigning an offer, i.e., a recommendation, to each of the 5 groups (a minimal sketch of these three steps follows below).
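For comparison, here is a minimal scikit-learn sketch of that classic pipeline; the data matrix and the per-segment offers are invented placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))          # hypothetical: 1000 customers, 40 features

# 1. Reduce dimensionality (here to 2 principal components).
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# 2. Cluster customers into 5 groups.
segments = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_2d)

# 3. One offer per segment: everyone in a group gets the same recommendation.
offers = {0: "discount code", 1: "new cakes", 2: "bread subscription",
          3: "weekend promo", 4: "loyalty points"}
print([offers[s] for s in segments[:10]])
```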
It was a limited recommendation method because everyone in a segment received the same thing, even though very similar people still differ from one another. The method did not account for changing preferences. There was also no learning from outcomes: no feedback on whether the customer actually bought.
Why is Factorization Machines better than PCA + k-means?
There are three reasons:
• FM learns directly on purchase transaction data, so it reacts faster,
• it delivers personalized, not group, recommendations—after all, even similar people differ,
• FM handles new products and customers better.
Factorization Machines is not a clustering system
Is the Factorization Machines method (and its extension DeepFM) a special way of clustering customers based on numerous features? And if it is not a clustering system, how does it differ from, for example, segmentation using PCA + k-means? It must be emphasized that FM is a predictive model, not a clustering method. PCA (principal component analysis) is used for dimensionality reduction/structure discovery in an unsupervised way, and k-means clustering groups similar objects without information about any target variable. By contrast, FM is a supervised algorithm—we train it to predict a specific target (e.g., whether a user will buy a product or not). It is thus a predictive system.
Unlike PCA, which creates linear combinations of features, FM learns nonlinear patterns thanks to latent factors. This means that FM does not so much assign customers to static clusters as forecast behavior (e.g., propensity to buy a given product) based on combinations of features (the "rich" features I mentioned earlier).

Of course, during training FM learns hidden vectors for customers; these can be treated as embeddings in a space of latent factors. Customers with similar vectors (combining various features) will behave similarly, which resembles customer clusters, but these representations are learned for a specific criterion (e.g., maximizing recommendation accuracy), not purely statistical ones as in PCA.

For marketing segmentation, one could run PCA + k-means clustering on 30 features, but such a division does not necessarily maximize conversion. FM, on the other hand, learns latent dependencies that group customers in a way that is optimal from the perspective of recommendations (e.g., it will reveal which customers respond to discount codes and which to delivery speed), which has greater value for creating a personalized offer generated by the recommendation system.
Hidden factors
In the Factorization Machines method, latent factors are features that are not directly visible in the data; they represent properties of users and products and allow the model to capture interactions among them even when the data are sparse.
FM extends the classic linear regression model by adding second-order interactions between variables, which are modeled using hidden vectors (latent vectors). Each variable is assigned such a vector—i.e., the latent factors.
If we have user A and the product "Meringue Cake," then instead of assigning a single linear-model weight to the interaction "user A – product 'Meringue Cake,'" FM assigns latent vectors to the user and the product (e.g., of dimension 10) that encode the "user's taste" and the "product's characteristics." These 10 numbers per vector are the learned latent factors, not hand-crafted features. The model learns how well these 10-dimensional vectors fit one another (e.g., through their dot product), and this fit constitutes the prediction of the interaction (e.g., product rating, click, purchase).
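A tiny numeric illustration of that dot product; the two vectors below are invented, whereas a real model would learn them during training:

```python
import numpy as np

# Hypothetical learned latent vectors (dimension 10).
user_A = np.array([0.9, 0.1, 0.4, -0.2, 0.7, 0.0, 0.3, -0.1, 0.5, 0.2])        # "taste"
meringue_cake = np.array([0.8, 0.0, 0.5, -0.1, 0.6, 0.1, 0.2, 0.0, 0.4, 0.3])  # "character"

score = user_A @ meringue_cake       # how well the two vectors "fit"
prob = 1 / (1 + np.exp(-score))      # squash the raw score to a purchase probability
print(f"match score: {score:.2f}, purchase probability: {prob:.2f}")
```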
Latent factors make it possible to reconstruct hidden links among elements—for example, that users who like “caramel” often also like “meringues” (even if there is no direct data connection for this). Products with similar latent factors are perceived similarly by users.
Finally, an important note: the dot product can be used for different purposes, and the concept of latent vectors can differ across contexts.
FM: User A has a latent vector; Product B has another latent vector. The model computes the dot product of these vectors and predicts, for example, the product rating or the probability of purchase.
PCA: From a set of features (e.g., age, gender, number of purchases) PCA creates new variables, PC1, PC2, …, which are combinations of the original features and explain as much variance as possible—used e.g. for visualization or data simplification.
Latent vectors in FM are not the same as principal components in PCA, although both concepts relate to the hidden structure of data and can lead to dimensionality reduction.
Latent vectors are learned and optimized for prediction. PCA creates a statistical projection without knowledge of the target.
Summary
A craftsman running an online store, equipped with customer data, can build an effective recommendation system using FM. This approach is more precise than classic customer grouping, and at the same time relatively simple to implement, especially in the batch version. Thanks to this, even a local confectionery can use tools that until recently were reserved for the largest e-commerce platforms.
Wojciech Moszczyński
Wojciech Moszczyński—graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń; specialist in econometrics, finance, data science, and management accounting. He specializes in optimizing production and logistics processes. He conducts research in the field of the development and application of artificial intelligence. For years, he has been engaged in popularizing machine learning and data science in business environments.
