It All Started with Google
Hadoop – An Elephant Created for Big Tasks
The Google search engine was a pioneer in indexing the internet, marking the beginning of the internet revolution. The amount of online data began to grow at an alarming rate, a challenge no one had faced before. Processing such large volumes of data required a powerful infrastructure, and at the time no conventional database technology was up to the task. Hadoop was born at the beginning of this century, when engineers Doug Cutting and Mike Cafarella were working on an open-source web search project called "Nutch." They were inspired by Google's data processing model, which was able to index billions of web pages using distributed processing technology. Google's implementation, however, was proprietary, so Cutting and Cafarella built their own open-source equivalent. Thus Hadoop was created, named after Doug Cutting's son's favorite stuffed elephant. It was designed to enable distributed storage and processing of large datasets on clusters of ordinary computers. The project quickly gained popularity and attracted the attention of the Apache Software Foundation, which adopted it and supported its further development as open-source software.
Today, Hadoop serves as the backbone of many Big Data systems at the world's largest companies, and it remains a free, open-source tool.
My Adventure with Hadoop
A few years ago, I worked with databases of timber delivery transactions for a company producing chipboards. Initially, the data wasn't very complex: I had Excel spreadsheets with simple information such as arrival and departure times, specifications, and load measurements from the scale at the plant's entrance. In addition, there was more static data: vehicle capacities, supplier information, addresses, and technical details from documents.
The task assigned to me involved identifying various types of fraud, fictitious loads, and incidents of selling substandard timber. It quickly became clear that this simple data was insufficient for detecting fraud. Additional variables needed to be added, such as behavioral patterns of individual suppliers, weighted average masses of specific raw materials, and other complex data. The database began to grow.
The company also expected that the initially small scope of raw material deliveries would eventually evolve into a full-scale analysis and control system for all possible timber sourcing channels. Initially, I managed this with Excel, creating increasingly complex formulas, but after a few months, it was clear that Excel wouldn’t handle it. That’s when I first turned to Python and Pandas. It was revolutionary—I could suddenly process larger volumes of data much, much faster. However, even Python struggled to cope with the rapid increase in data.
As the number of suppliers and vehicles grew, there was a need for a relational database. Python alone was no longer enough, so implementing PostgreSQL was the natural next step. Pandas works with data as large, flat, in-memory tables, while a relational database organizes information into smaller, normalized tables that can be joined as needed; this model is better suited to organizing and querying vast datasets. With the PostgreSQL relational database, I could manage data more systematically, perform complex SQL queries, and integrate various information sources. I built a database that tracked every aspect of deliveries, from detailed static data on vehicle capacities and supplier information to dynamic data from wood measurements on the scale.
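To make this concrete, here is a minimal sketch of the kind of query this setup made possible, using Python with pandas and SQLAlchemy against PostgreSQL. The table and column names (deliveries, suppliers, load_mass_kg, and so on) are illustrative assumptions, not the actual production schema.

    # Hypothetical sketch: join dynamic scale measurements with static supplier data.
    # Table and column names are illustrative, not the real production schema.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost:5432/timber")

    query = """
        SELECT d.delivery_id,
               d.arrival_time,
               d.load_mass_kg,
               s.supplier_name,
               s.vehicle_capacity_kg
        FROM deliveries AS d
        JOIN suppliers  AS s ON s.supplier_id = d.supplier_id
        WHERE d.arrival_time >= NOW() - INTERVAL '30 days';
    """

    recent = pd.read_sql(query, engine)

    # Flag deliveries whose declared mass exceeds the vehicle's rated capacity
    suspicious = recent[recent["load_mass_kg"] > recent["vehicle_capacity_kg"]]
    print(suspicious.head())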
However, while PostgreSQL is powerful, it also began to reach its limits. Each day the number of transactions increased, with each transaction containing hundreds of attributes; we averaged over 10,000 transactions daily. Each delivery had its specifications, load mass data, travel times, and corporate data, generating massive amounts of information. Managing this data in a classic relational database became increasingly inefficient: SQL query times grew longer, and analytical processes consumed too many system resources. At that point, the decision was made to find something more scalable.
The solution turned out to be Hadoop. This tool allows for the processing of vast data sets in a distributed way, which was a perfect fit for our needs. Hadoop was the best choice, mainly for two reasons: scale and flexibility. It allowed us to store and process enormous amounts of data, which was crucial given the volume of transactions. Every day, tens of thousands of new transactions, rich in detail, were added to the system. Hadoop handled this without issue and enabled us to distribute computations across multiple servers, significantly speeding up processing time.
In combination with tools like Hive and Spark, Hadoop enabled easy data transformation and real-time analysis. With our dynamic data from delivery measurements, we could analyze anomalies immediately; deliveries that stood out were then closely examined for potential irregularities and abuses.
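As an illustration, the PySpark sketch below shows how such an anomaly check might look, assuming the delivery data is registered as a Hive table. The table name, column names, and the 25% threshold are hypothetical, not the actual pipeline.

    # Hypothetical PySpark sketch of the anomaly check described above.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = (SparkSession.builder
             .appName("delivery-anomalies")
             .enableHiveSupport()
             .getOrCreate())

    deliveries = spark.table("deliveries")  # assumed Hive table stored on HDFS

    # Compare each load against the supplier's average mass
    by_supplier = Window.partitionBy("supplier_id")
    flagged = (deliveries
               .withColumn("avg_mass", F.avg("load_mass_kg").over(by_supplier))
               .withColumn("deviation",
                           F.abs(F.col("load_mass_kg") - F.col("avg_mass")) / F.col("avg_mass"))
               .filter(F.col("deviation") > 0.25))  # illustrative 25% threshold

    flagged.select("delivery_id", "supplier_id", "load_mass_kg", "avg_mass").show()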
In summary, moving to Hadoop was a natural step. Excel and PostgreSQL were excellent in the early stages, but with such data volume, we needed a more advanced tool. Hadoop not only solved the scale problem but also opened up avenues for more sophisticated analyses and data management in ways previously impossible.
Why Is Hadoop Effective?
Hadoop is a system that allows processing of very large data volumes. It’s a tool that helps companies and organizations manage vast amounts of information that traditional computer systems cannot handle efficiently. Hadoop becomes indispensable when dealing with data at a scale that a single standard computer cannot process quickly or efficiently. This system is effective because it can divide data into smaller pieces and process them simultaneously on many computers.
Imagine we have an enormous book that needs to be read and understood. Instead of assigning it to a single employee, it can be divided into chapters and given to multiple employees—that’s how Hadoop works. Data is processed faster because the task is distributed across many computers (we call this distributed processing).
When Is Hadoop Essential?
Hadoop becomes essential when:
There’s an extremely large volume of data, such as hundreds of millions of records with numerous features that need to be quickly analyzed.
Data comes from various sources and needs to be collected and processed in one place. Receiving a timber delivery, for instance, leaves traces in multiple control and oversight systems.
High data processing speeds are required, meaning results are needed quickly despite large data volumes. In my example, I couldn’t wait days for the computer to process transactions; I needed to act fast to detect fraud. Within a few days, defective raw materials could already have been processed into chipboard.
How Does Hadoop Work Technically?
Data Division
Hadoop divides huge datasets into smaller parts, called blocks. Each data block is stored on a different computer in the network.
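The toy Python sketch below mimics only the splitting step, on a local file, to make the idea concrete; real HDFS does this transparently, with a default block size of 128 MB and each block replicated (typically three times) across different machines. The file name is illustrative.

    # Toy illustration of block splitting. HDFS uses a 128 MB default block size
    # and distributes (and replicates) blocks across machines; here we just cut
    # a local file into fixed-size chunks to show the idea.
    BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

    def split_into_blocks(path, block_size=BLOCK_SIZE):
        """Yield (block_number, bytes) pairs for a local file."""
        with open(path, "rb") as f:
            block_no = 0
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                yield block_no, chunk
                block_no += 1

    # "deliveries.csv" is a made-up file name for illustration.
    for block_no, chunk in split_into_blocks("deliveries.csv"):
        print(f"block {block_no}: {len(chunk)} bytes")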
Data Processing
Hadoop processes these blocks using the MapReduce model, which involves two stages:
Mapping (Map): Data is processed in small chunks by different computers in the network.
Aggregation (Reduce): The partial results from the individual computers are collected and combined into the final analysis result.
Through this, Hadoop can process enormous data volumes quickly and efficiently.
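As a rough illustration, the plain-Python sketch below imitates the two stages on a single machine, using a process pool in place of cluster nodes and the classic word-count task in place of real delivery data. It is a conceptual sketch, not how one would actually submit a Hadoop job.

    # Conceptual sketch of Map and Reduce, imitated locally with a process pool.
    from collections import Counter
    from multiprocessing import Pool

    def map_phase(chunk_of_lines):
        """Map: count words in one chunk of the data."""
        counts = Counter()
        for line in chunk_of_lines:
            counts.update(line.split())
        return counts

    def reduce_phase(partial_counts):
        """Reduce: merge the partial results into the final answer."""
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    if __name__ == "__main__":
        lines = ["big data needs big tools"] * 1000
        # Split the "dataset" into four chunks, one per worker
        chunks = [lines[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            partial = pool.map(map_phase, chunks)    # Map runs in parallel
        print(reduce_phase(partial).most_common(3))  # Reduce combines results

On a real cluster, Hadoop runs the Map tasks on the machines that already hold the relevant data blocks, so the data does not have to travel across the network before being processed.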
The Story of an Indian Banking Analyst
Let me share another story I found online, regarding Hadoop’s implementation at an Indian bank. This story was told by a Hadoop instructor running an online course on this tool.
In 2011, when this man worked at ICICI Bank in India, they encountered a problem with the increasing data volume. The bank wanted to migrate to a Big Data platform because their relational database system struggled to process large amounts of information. Although the database could be scaled by adding storage, performance was limited by the data volume.
Main Issues with Relational Databases:
Performance: When data volume became very large, SQL queries slowed significantly.
Data Partitioning: Though data could be partitioned (e.g., by cities), it still led to uneven loads, delaying query processing.
Unstructured Data: The relational database structure handled only structured data (data that can be represented in table form) and couldn't efficiently process unstructured data (like images or audio files).
ICICI Bank decided to switch to a Big Data platform, specifically Hadoop, because traditional systems based on relational databases had scalability issues with massive data volumes. The bank needed higher performance in processing large datasets.
Hadoop enables storage and processing of structured and unstructured data on a large scale.
The instructor also spoke about his experiences with e-commerce companies like Flipkart, where customer data, tracked on websites, mobile apps, and social media, was analyzed to create product recommendations. In such cases, relational databases were insufficient, and NoSQL technologies (like MongoDB or Cassandra) and Big Data systems like Hadoop were essential.
Key Benefits of Hadoop
Hadoop excels in managing enormous data volumes that are challenging to process using traditional methods. Additionally, it can work with unstructured data. Different types of data, such as photos, videos, and audio files, can be stored and processed. The system allows for easy scalability as data grows. This last term, “scalability,” requires further explanation.
What is Scalability?
In Hadoop, scalability refers to the system’s ability to expand its computational and storage resources to handle growing data and load. Key scalability features in Hadoop include:
Horizontal Scaling (Scaling Out)
Hadoop is designed to scale out, meaning more computers (nodes) can be added to the cluster, rather than upgrading a single server. Each new computer adds more computational power and data storage, enabling larger data processing.
Parallel Data and Computation Scaling
Hadoop splits data into smaller fragments (blocks) and processes them simultaneously across many cluster nodes. Each node (computer) performs computations on its assigned data portion, speeding up the entire process. Adding new nodes allows for more data processing without sacrificing performance.
Automatic Resource Management
Hadoop, especially with its YARN component (Yet Another Resource Negotiator), manages cluster resources by balancing load across nodes and optimizing memory and CPU usage. This allows the system to dynamically adapt to changing computational needs.
Cost-Effective Scaling
Scaling traditional databases vertically (adding more power to one server) is costly and limited. In contrast, Hadoop allows scaling out using cheaper, standard servers, making it more cost-effective.
Why Is Scalability Important?
As organizations generate more data, increasing system capacity without losing performance is crucial. Hadoop ensures efficient management of massive amounts of complex data without requiring a complete infrastructure overhaul.
Wojciech Moszczyński
Wojciech Moszczyński – a graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń; a specialist in econometrics, finance, data science, and management accounting. He specializes in optimizing production and logistics processes and has been involved for years in promoting machine learning and data science in business environments.
