Profiling the Customer Based on the Logs of Your Own Online Store

Building a Recommendation System

In application logs and other logs you can find every phrase the customer typed into the store’s search engine, information on what they bought, how long it took them to make a decision, how they reacted to advertising banners, and whether they abandoned their purchase at a certain stage of the process. Application logs are plain text and are easy to read without any special tools. Unfortunately, there are far too many of them to read and draw conclusions from by hand. There may be up to 50 people in the store at the same time, performing various transactions and actions. To extract information from application logs effectively, you need to process them with simple Python tools.

Where to get logs?

Most websites are built with WordPress templates. Building such a site is child’s play. After building the site you can install an online store plugin. Plugins on WordPress sites work much like the apps installed on mobile phones; it is more or less the same concept. Building an online store on the basis of such a plugin, e.g., one selling cakes, is a more complex task. Both the website and the store have to be published somewhere on the internet, so you need to buy cloud space. This is called hosting. Once we have a hosted website running an online store with confectionery products, we can start looking for logs.

As the owner of a store based on WordPress you have access to many different system, application, and server logs, but where and how to get them depends on several things: hosting, WordPress configuration, and additional plugins.

Server logs – system logs (e.g., logins, errors, traffic) are located in the hosting panel (e.g., cPanel, DirectAdmin, Plesk, etc., depending on the hosting provider). There we may encounter the following files:
• access.log – a list of all HTTP requests (who entered, from which IP, when, which URL),
• error.log – server errors (e.g., PHP errors, missing files, 404 errors),
• auth.log (sometimes) – logs of login attempts to FTP, SSH, etc.
To find them you need to log in to the hosting panel and look for a tab such as “Logs”, “Statistics”, or “File Manager” > /logs/. Alternatively, you can download these log files via FTP (e.g., the logs/ directory, logs/error_log, etc.). System log files usually have the .log or .txt extension, sometimes in the Apache or Nginx log format (readable text), and can be opened in an ordinary text editor.

Our goal is to examine how the customer behaves in the store. The most important for this are application logs, not the system logs mentioned above. WordPress does not, by default, keep user event logs (e.g., who logged in, who viewed what), but you can enable them or add plugins. You should install one of the following:
• WP Activity Log – full logs of what users do,
• Simple History – shows logins, edits, admin activity,
• User Activity Log – a list of who logged in and what they did.
Logs are then saved in the WordPress panel, sometimes also as JSON files or databases. If someone runs an online store on a WordPress-type site, they are probably using WooCommerce, i.e., the WordPress store module.

Application logs can then be found at:
WordPress Admin > WooCommerce > Status > Logs tab

To observe customers, collect information about them, and create a database, you simply need to get into the logs.
Stores can be run on ready-made modules.

Building a recommendation system (part 2)

To build a recommendation system we must have a customer database. Each customer should be described automatically on the basis of application logs. It is therefore essential to gain access to them.

There are many ready-made online store services; there are also large websites like Allegro where you can buy space for yourself. I cannot describe all the possible ways of obtaining application logs. I can only say that access to them is essential for creating a recommendation system that will stimulate our sales. If the provider of the online store service does not grant access to application logs, you can move to another store. After all, what matters is development, not loyalty to the service provider. The technical support of the hosting provider or the online store platform should help you obtain access to application logs. This is very important. It is a lesson that has to be completed in order to move on.


Types of logs and available information

A typical e-commerce site generates various server logs that we can use to analyze customer behavior. The primary source is WWW access logs (e.g., Apache/Nginx), which record every HTTP request. Each log entry contains, among other things, the visitor’s IP address, a timestamp, the method and address of the request (URL along with any query parameters), the response code, and the browser client identifier (User-Agent). An example access log entry may look as follows:
81.174.152.222 - - [30/Jun/2020:23:38:03 +0000] "GET /search?query=gluten-free+cake HTTP/1.1" 200 6763 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) Firefox/77.0"

The above entry (here adapted to a query for a gluten-free cake) shows the key elements: the IP address, the date and time, the GET method and the request path (in this case the /search endpoint with the parameter query=gluten-free+cake), the 200 response code, and the browser user-agent string.
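Such a line can be pulled apart with a regular expression. Below is a minimal sketch, assuming the combined Apache/Nginx format shown in the example; the field names (ip, time, path, and so on) are our own choice, not part of any store API:

import re

# A minimal sketch of parsing one access-log line in the combined format shown above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('81.174.152.222 - - [30/Jun/2020:23:38:03 +0000] '
        '"GET /search?query=gluten-free+cake HTTP/1.1" 200 6763 '
        '"-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64) Firefox/77.0"')

match = LOG_PATTERN.match(line)
if match:
    entry = match.groupdict()
    print(entry["ip"], entry["time"], entry["path"], entry["status"])

Running this prints the IP address, the timestamp, the requested path with the search query, and the response code.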

In addition to WWW server logs, there are the application logs mentioned above that describe user actions in the store. If the web application logs events (e.g., in a separate file or database), we can obtain more detailed entries of user actions, e.g., searching for a product, viewing a specific category, adding a product to the cart, cart abandonment, or purchase completion. Such user-action logs often contain the user identifier (because the customer must log in), a timestamp, and the name and parameters of the action. For example, entries in an action log may look like this:
2025-07-26 10:15:32 INFO user_id=123 ACTION=SEARCH query="gluten-free cake"
2025-07-26 10:16:05 INFO user_id=123 ACTION=VIEW_CATEGORY name="Gluten-free cakes"
2025-07-26 10:17:10 INFO user_id=123 ACTION=ADD_TO_CART product_id=456 name="Gluten-free chocolate cake"
2025-07-26 10:18:45 INFO user_id=123 ACTION=PURCHASE order_id=789 value=120.00 PLN

Such an application log shows the behavior path of a specific customer, user_id=123: they searched for the phrase gluten-free cake, viewed the category Gluten-free cakes, added the product with ID 456 to the cart, and ultimately made a purchase. In the literature such data are often called clickstream data or user tracking logs, where each entry represents a user’s interaction with store elements (a click, adding to the cart, liking a product, a purchase, etc.) along with the user identifier, the object of the action (e.g., the product ID), and the time. Because customers are not anonymous (they log into the store as users), we can link all these events to a specific person and build their behavioral profile based on the history of actions. Let’s remember that our goal is to build a database about each customer of the store.
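Assuming the illustrative action-log format used above (it is our own example, not a WooCommerce standard), a single line can be split into a timestamp and key=value fields with a short regular expression:

import re

# A hypothetical line in the illustrative action-log format used above.
line = ('2025-07-26 10:17:10 INFO user_id=123 ACTION=ADD_TO_CART '
        'product_id=456 name="Gluten-free chocolate cake"')

# The first two whitespace-separated tokens form the timestamp.
timestamp = " ".join(line.split()[:2])

# All key=value pairs; values may be quoted (containing spaces) or unquoted.
fields = {}
for key, quoted, plain in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line):
    fields[key] = quoted or plain

print(timestamp, fields["user_id"], fields["ACTION"], fields["name"])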

Extracting data from logs using Python

Having access to the logs, we can use Python to automate the extraction of the necessary information. Logs are text files that need to be transformed into an organized data structure (a DataFrame table). There are libraries that facilitate this task, for example lars (a library for Apache/Nginx log processing), which makes it easy to parse web server logs and load them into Python; the apache-log-parser library works similarly. The simplest and most effective approach, however, is to use the functions of Pandas. Python makes it possible to read a log line by line, split each line into fields (e.g., with the split method or re.findall, according to the log format), and assemble a list of dictionaries representing the entries. It is then easy to convert this list into a Pandas DataFrame.
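A minimal sketch of that flow, assuming the action-log entries from the earlier example are stored in a file named actions.log (the file name and format are assumptions, not a fixed standard):

import re
import pandas as pd

records = []
with open("actions.log", encoding="utf-8") as f:   # assumed file name
    for line in f:
        line = line.strip()
        if not line:
            continue
        # The first two tokens form the timestamp, the rest are key=value fields.
        entry = {"timestamp": " ".join(line.split()[:2])}
        for key, quoted, plain in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line):
            entry[key] = quoted or plain
        records.append(entry)

df = pd.DataFrame(records)
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df.head())

Each row of df now represents one logged event, with columns such as user_id, ACTION, and product_id.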

The resulting DataFrame table will have columns corresponding to the log fields (request, time, price, etc.), and each row will correspond to a single log entry (one event). Thanks to this we can filter and aggregate the data. I am not able to describe step by step how this is done; anyone who knows Pandas and can aggregate data and extract values from long strings will handle it with ease.

In our context, the most important thing is that, after loading the logs, we link events to specific customers. We identify the customer by ID: each customer logs into the store, so a good solution is to record the user_id with every logged action. Thanks to this, the DataFrame can have a user_id column, and we can easily group the data by user.

Key metadata worth extracting from the logs:
• Search phrases – based on query parameters (e.g., query= in the URL) from HTTP logs or the query field in action logs.
• Identifiers and names of viewed products/categories – from the URLs of visited pages (e.g., /category/for-children/ in the URL path) or from action logs (VIEW_CATEGORY, VIEW_PRODUCT).
• Cart and purchase actions – entries for adding to the cart (ADD_TO_CART), removing from the cart, proceeding to checkout, and finalizing the transaction (PURCHASE), along with timestamps.
• Interactions with promotions – e.g., clicks on promotional banners (if logged), use of discount codes (which may appear in the purchase log), visits to the “promotions” page.
• Time aspects of visits – dates and times of actions that allow us to determine how much time the customer spends on the site, at what times they are active, how long it takes from the first visit to purchase, etc.
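As an illustration of the first item on the list above, query parameters can be pulled out of the logged URL with the standard urllib.parse module; the parameter name query is taken from our earlier example:

from urllib.parse import urlparse, parse_qs

# Request path copied from the access-log example shown earlier.
path = "/search?query=gluten-free+cake"

parsed = urlparse(path)
params = parse_qs(parsed.query)              # {'query': ['gluten-free cake']}
search_phrase = params.get("query", [None])[0]
print(parsed.path, "->", search_phrase)      # /search -> gluten-free cake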


Behavior analysis and mapping customer features from logs

When the data from the logs have already been collected in tabular form, the next step is to extract useful business information, i.e., to understand what these data tell us about the preferences and profile of the customer. We should define certain features (attributes) of customers that we want to recognize. Based on the behaviors visible in the logs we can automatically profile customers according to such features. Here are example features and ways to determine them.
• Dietary preferences (e.g., gluten-free, vegan) – if in the search or browsing logs we see frequent queries like “gluten-free”, “veggie”, “keto”, etc., we can note that the customer has specific dietary requirements. For example, a customer who repeatedly looks for low-gluten or sugar-free cakes will receive the flag Dietary=YES in their profile (a column in the DataFrame). Over time we can even distinguish the type of diet (e.g., gluten-free person, vegan, etc.) if such categories appear in the data.
• Having small children – we infer this feature by analyzing the products and categories viewed or purchased by the customer. If the logs show that the user often visits the children’s section (e.g., cakes with cartoon characters, numeral-shaped candles, products for infants) or searches phrases like “for a child”, then they are probably a parent of small children. Purchases such as diapers, toys, first-birthday cakes, etc., also strongly indicate this group. In that case we mark in the customer’s profile e.g., Small_children=YES. The more signals (e.g., regularly buying cakes with children’s themes), the more reliable the categorization.
• Estimated age of the customer – without direct demographic data we must estimate age indirectly. Sales logs can help: the types of products purchased and the occasions can suggest an age range. For example, a customer ordering a cake for an 18th birthday (the inscription “18 years”) or a 50th anniversary, or choosing retro products versus very modern flavors: such information allows us to guess. If a customer mainly buys children’s items, we can assume they are, say, 25–40 (typically parents of small children). Conversely, frequent purchases of health-oriented or orthopedic products may suggest a higher age bracket. Although this is less precise, purchase patterns and occasions (e.g., buying cakes for a “30th birthday” or “50th birthday”) allow us to approximately estimate age or life stage. This information can be recorded, for example, as Estimated_age=30–40. (Remember that this is guessing based on data: useful in marketing segmentation, though not 100% accurate.)
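A minimal sketch of this kind of keyword-based flagging for the first two features above, using a tiny made-up sample in place of the parsed log DataFrame (the column names and keyword lists are assumptions):

import pandas as pd

# Tiny made-up sample in the shape of the parsed action-log DataFrame.
df = pd.DataFrame({
    "user_id": ["123", "123", "456"],
    "query":   ["gluten-free cake", None, "cake with cartoon characters"],
})

DIET_KEYWORDS  = ["gluten-free", "vegan", "keto", "sugar-free"]
CHILD_KEYWORDS = ["for a child", "cartoon", "first birthday"]

def flag_users(frame, keywords, column="query"):
    # 1 if any of the user's logged queries contains one of the keywords, else 0.
    text = frame[column].fillna("").str.lower()
    hits = text.apply(lambda q: any(k in q for k in keywords))
    return hits.groupby(frame["user_id"]).max().astype(int)

profiles = pd.DataFrame({
    "Dietary":        flag_users(df, DIET_KEYWORDS),
    "Small_children": flag_users(df, CHILD_KEYWORDS),
})
print(profiles)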

In practice, when creating a customer profile, we count or flag such events in columns. Using Python and Pandas, we can group entries in the DataFrame by user_id and aggregate information, for example:
• how many times the user searched for phrases from a given category (dietary, for children, etc.),
• which product categories they most often view (e.g., build a ranking of categories for the user),
• the average time from first visit to purchase,
• the share of transactions using a discount code,
• the maximum cart value, purchase frequency (monthly, yearly),
• days/hours of greatest activity, etc.
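A sketch of such a per-customer aggregation, again on a tiny made-up sample with the assumed column names from the earlier examples:

import pandas as pd

# Tiny made-up sample in the shape of the parsed action log.
df = pd.DataFrame({
    "user_id":   ["123", "123", "123", "456"],
    "timestamp": pd.to_datetime([
        "2025-07-26 10:15:32", "2025-07-26 10:17:10",
        "2025-07-26 10:18:45", "2025-07-27 09:00:00"]),
    "ACTION":    ["SEARCH", "ADD_TO_CART", "PURCHASE", "SEARCH"],
    "value":     [None, None, 120.00, None],
})

per_user = df.groupby("user_id").agg(
    first_seen=("timestamp", "min"),
    last_seen=("timestamp", "max"),
    searches=("ACTION", lambda a: (a == "SEARCH").sum()),
    purchases=("ACTION", lambda a: (a == "PURCHASE").sum()),
    max_order_value=("value", "max"),
)
per_user["activity_span"] = per_user["last_seen"] - per_user["first_seen"]
print(per_user)

Each aggregate becomes one column of the customer profile; further metrics from the list above can be added in the same way.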
We append all these data to the customer profiles table. As a result we obtain, for example, a DataFrame in which each row is one customer and the columns are the profile features.
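An illustrative fragment of such a profile table, with made-up values and the coded columns explained in the legend at the end of this article, could be built like this:

import pandas as pd

# Made-up profile rows; the column codes follow the legend at the end of the article.
profiles = pd.DataFrame(
    {
        "Dietary":                [1, 0],   # 1 = YES, 0 = NO
        "Small_children":         [0, 1],
        "Estimated_age_group":    [3, 4],   # 3 = 26-35, 4 = 36-50
        "Promotion_seeking":      [2, 0],   # 2 = often uses discounts
        "Decision_time":          [0, 2],   # 0 = short, 2 = long deliberation
        "Purchase_occasions":     [1, 2],   # 1 = birthdays, 2 = holidays
        "Most_frequent_category": [4, 1],   # 4 = gluten-free baked goods, 1 = chocolate cakes
    },
    index=pd.Index(["123", "456"], name="user_id"),
)
print(profiles)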

Such a profile allows us to segment customers by grouping those with similar profiles into clusters (e.g., “young bargain hunters”, “parents on a diet”, “loyal traditionalists”, etc.). The literature emphasizes that properly analyzed server log data can be used to build a user profile and personalize the offer. Companies use such segmentation to better target marketing, e.g., to direct family promotions to those who have children, or to advertise gluten-free products to customers with a dietary flag. This is the very essence of customer segmentation: identifying different groups and tailoring marketing actions to them. It should be added, however, that segmentation is not our goal. The goal is to create a customer database based on logs. Besides, customer segmentation is unnecessary for us, as we will find out in the next article.

Summary

To effectively profile customers without ready-made analytics tools (such as Google Analytics), one should focus on making the most of the logs of one’s own store. Python scripts should be used to automatically process the logs, preferably with the Pandas library. Next, define a set of customer features that are relevant from a business point of view. The results should be saved in a customer database, preferably as a DataFrame, so that each new batch of activity logs can update the profiles (this process can be automated).

Over time, data from logs will allow the offer and communication to be personalized just as effectively as with ready-made analytics platforms. The analytical table of the store’s customer features will be the basis for applying a powerful recommendation tool known as Factorization Machines. We will learn how to use this tool in the next article.

Wojciech Moszczyński

Wojciech Moszczyński – a graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń, a specialist in econometrics, finance, data science, and management accounting. He specializes in the optimization of production and logistics processes. He conducts research in the area of development and application of artificial intelligence. For years he has been involved in popularizing machine learning and data science in business environments.

Legend for the table:
Dietary, Small children: 1 = YES 0 = NO
Estimated age (groups): 1 = < 18 years | 2 = 18–25 | 3 = 26–35 | 4 = 36–50 | 5 = > 50
Promotion-seeking: 0 = does not use discounts | 1 = moderate | 2 = often
Decisions (deliberation time): 0 = short | 1 = medium | 2 = long
Purchase occasions: 1 = birthdays | 2 = holidays | 3 = Valentine’s Day | 4 = anniversaries | 5 = other
Most frequent category (example code): 1 = chocolate cakes | 2 = fruit cakes | 3 = cheesecakes | 4 = gluten-free baked goods