
Hadoop and Spark: Key Tools in Big Data Analysis
Hadoop and Spark are two essential tools in the world of big data analysis. Each has its own history and design goals, which shape how it is used.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005 as part of the Apache project. The inspiration behind Hadoop was to build an open-source system capable of handling massive amounts of data, based on Google’s published work on its distributed file system (the Google File System) and on MapReduce, a data-processing model. Hadoop quickly became a standard in big data processing thanks to HDFS (Hadoop Distributed File System) and the MapReduce model.
Apache Spark, on the other hand, was created by Matei Zaharia in 2009 at the University of California, Berkeley, as a response to some of Hadoop’s limitations, particularly its performance. Zaharia and his team wanted a tool that could process data in memory rather than only on disk, which would significantly speed up analyses: reading data from a hard disk, especially at that time, was very slow, so the natural path was to keep large working datasets in the system’s memory. Initially, Spark was developed as a research project at AMPLab and later became part of the Apache ecosystem.
What Are Hadoop and Spark Used For?
Hadoop is a distributed system for storing and processing massive datasets. Its key components are HDFS, a distributed file system, and MapReduce, a programming model that divides computing tasks into smaller parts, processes them in parallel across different nodes (computers), and then combines the results. Because Hadoop processes data stored on disk, it is better suited to batch processing than to real-time processing. In other words, Hadoop works well with files, while Spark is more efficient at processing real-time data streams, and stream processing is likely to play an increasingly significant role in analytical work.
Spark, a tool for in-memory data processing, is faster than Hadoop, especially for analyses that require iterative operations on the same data. It supports both batch and streaming processing, making it more versatile. Spark also ships with libraries for machine learning (MLlib), graph processing (GraphX), and SQL processing (Spark SQL), which makes it a more comprehensive tool for data analysis.
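As a brief illustration of Spark SQL, the sketch below registers a small PySpark DataFrame as a temporary view and queries it with ordinary SQL; the data and column names are made up, and on a real cluster the session configuration would differ.

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session; on a cluster the master URL would differ.
    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # A tiny illustrative DataFrame; in practice this would come from HDFS, S3, a database, etc.
    sales = spark.createDataFrame(
        [("2024-01-01", "books", 120.0), ("2024-01-01", "music", 80.0), ("2024-01-02", "books", 95.0)],
        ["date", "category", "amount"],
    )

    # Register the DataFrame as a temporary view so it can be queried with plain SQL.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()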
Why Not Use Pandas for Big Data Analysis?
Pandas is a popular open-source Python library used for data analysis and manipulation. It allows easy operations on data structures like DataFrames (spreadsheet-like tables) and Series (one-dimensional data vectors). Pandas is especially useful for data cleaning, transformation, filtering, and advanced analytical operations. Thanks to its simple, readable API, Pandas is widely used in data science, data analysis, and machine learning. It has become a standard tool among analysts and data scientists for its efficiency and versatility with small to medium-sized datasets.
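For a sense of what everyday Pandas work looks like, the short snippet below cleans, filters, and aggregates a tiny, made-up DataFrame; the column names and values are purely illustrative.

    import pandas as pd

    # A small, in-memory DataFrame; real analyses would usually load a CSV file or database table.
    df = pd.DataFrame({
        "customer": ["A", "B", "C", "D"],
        "age": [34, None, 45, 29],
        "spend": [120.5, 80.0, 310.2, 55.9],
    })

    df["age"] = df["age"].fillna(df["age"].median())   # simple cleaning: fill missing ages
    adults_over_30 = df[df["age"] > 30]                # filtering rows
    avg_spend = adults_over_30["spend"].mean()         # a basic aggregation
    print(avg_spend)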
However, Pandas is not designed for processing the massive volumes of data typical of big data. It operates entirely on in-memory data, so any dataset exceeding the available RAM leads to serious performance problems. All data must be loaded into operating memory (RAM) at once, which quickly exhausts memory, especially on machines with limited resources; processing then slows to a crawl or, in the worst case, the application crashes and often hours-long computations end without visible results and have to be restarted from scratch. Big data datasets can contain billions of records, so Pandas is not recommended for them and is mainly used for analyzing small to medium-sized datasets.
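One practical way to see this limit is to measure how much RAM a DataFrame actually occupies before scaling an analysis up; the snippet below is only a rough, illustrative check.

    import numpy as np
    import pandas as pd

    # One million rows of a single float column already take roughly 8 MB;
    # billions of rows with many columns quickly exceed the RAM of a typical workstation.
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    print(df.memory_usage(deep=True).sum() / 1024 / 1024, "MB")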
Relational Databases in Big Data Analysis
Relational databases, such as PostgreSQL, can be effective in big data analysis because they allow data to be divided into smaller parts that can be processed without risking memory exhaustion. In the big data context, however, this approach is often time-consuming and inefficient: writing the SQL queries is labor-intensive, and modeling the data for massive datasets is often complex. Moreover, in many cases, especially when the data is already organized in relational databases, Hadoop and Spark may not be necessary. When one deals with unstructured data or dispersed data resources, on the other hand, Spark and Hadoop are more efficient, as they do not require building large relational databases first.
Relational databases (RDBMS) are database management systems where data is stored in tables organized according to relationships between them. Each table consists of rows (records) and columns (attributes), and data is joined based on keys that ensure the integrity and consistency of the information.
Imagine a data analyst receives a massive dataset for analysis. Instead of processing all the data at once, they decide to retrieve fragments using SQL and organize them in a relational database. This approach makes it easier to manipulate and analyze smaller, selectively retrieved portions of the data. To get there, however, the analyst first has to build an infrastructure that allows such a database to be managed efficiently.
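A hedged sketch of that workflow in Python, using SQLite as a stand-in for a production database such as PostgreSQL; the table name, columns, and filter are purely illustrative. Only the fragment selected by the query is loaded into memory.

    import sqlite3
    import pandas as pd

    # SQLite stands in here for a production database such as PostgreSQL;
    # the file name, table, and filter are hypothetical.
    conn = sqlite3.connect("warehouse.db")
    query = """
        SELECT customer_id, order_date, amount
        FROM orders
        WHERE order_date >= '2024-01-01'
    """
    recent_orders = pd.read_sql_query(query, conn)   # only this fragment is pulled into RAM
    conn.close()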
What is a Relational Database?
A relational database is a system for storing data, where information is organized in tables linked together by keys. These keys can be primary keys that uniquely identify each record in a table or foreign keys that create relationships between tables by linking records from one table to corresponding records in another. This system allows data to be processed and linked precisely and efficiently.
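As a minimal illustration of how such keys work, the sketch below creates two linked tables with Python’s built-in SQLite driver; the schema is hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,      -- primary key: uniquely identifies each customer
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            amount      REAL,
            -- foreign key: links each order to its customer record
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
        );
    """)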
Cost of Building a Relational Database Architecture
Building an efficient relational database requires significant work and planning. The structure of the data must first be designed, typically by creating entity-relationship diagrams (ERDs) to illustrate the relationships between the different data elements. The next step is to identify keys (both primary and foreign) to ensure data integrity and consistency. Keys are essential because they create logical connections between data, enabling efficient SQL queries and quick access to information.
This process is labor-intensive and complex, especially when the number of variables is large. The labor involved depends not on the size of the database but on the number of variables in it. Redundancy, query optimization, and possible future changes to the data structure must all be taken into account. In our example, the analyst has to build a solid database infrastructure, which requires considerable work and resources.
Distributed Data Storage in Hadoop
Hadoop stores data across multiple nodes using HDFS, which automatically replicates data to ensure resilience to failures. Hadoop maintains its own directories and files, which are not directly visible through the host operating system, allowing large amounts of data to be stored efficiently and securely. It can hold both structured and unstructured data, making it suitable for a wide range of applications.
The HDFS storage infrastructure differs from the traditional file systems found on personal computers. When data is stored in Hadoop, it is automatically divided into smaller blocks and distributed across different nodes in the cluster by HDFS. HDFS also keeps additional copies of the data (replicas) to protect against data loss in the event of a node failure.
However, this file and directory infrastructure cannot be accessed directly through typical operating system tools, such as Windows File Explorer or terminal commands on Linux systems. To access the files, dedicated Hadoop tools are needed to interact with HDFS, such as the hadoop fs commands or an API. This isolation of the file system is one reason Hadoop manages large datasets so efficiently: the computer’s operating system does not have to handle the large files directly.
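For illustration, one common API-based route from Python is to let Spark read data directly from an HDFS path; the namenode address and path below are placeholders rather than real values.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

    # The hdfs:// URI is resolved by HDFS itself, not by the local operating system;
    # host, port, and path are placeholders for a real cluster.
    logs = spark.read.text("hdfs://namenode:9000/user/analyst/logs/2024/*.log")
    print(logs.count())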
Distributed Data Processing in Hadoop
One of Hadoop’s key features is its ability to process data in a distributed manner. When working with very large datasets that do not fit on a single computer, Hadoop divides the data into smaller pieces and distributes them across multiple machines in a cluster. Part of the computation is performed on each of these machines, and the results are returned to the computer that initiated the task. This is accomplished through MapReduce, a computational model that splits the work into two stages: mapping (each node processes its own portion of the data independently) and reducing (the partial results are combined into the final answer).
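To make the two stages concrete, here is a toy, single-machine sketch in plain Python of the word-count task usually used to introduce MapReduce; in a real Hadoop job the mapped pairs would be produced and reduced on many nodes in parallel over data stored in HDFS.

    from collections import defaultdict

    documents = ["big data tools", "big data analysis", "spark and hadoop"]

    # Map stage: each document is turned independently into (word, 1) pairs.
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle: pairs are grouped by key (in Hadoop this happens between the nodes).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce stage: the partial counts for each word are combined into the final result.
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)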
Example of Large Data Analysis
Suppose a data analyst receives a massive database containing billions of records, but their computer lacks sufficient memory and processing power to analyze all the data at once: processing slows dramatically or fails outright when memory runs out. In such cases, distributed processing is necessary. Hadoop, with HDFS and the MapReduce model, solves this problem by dividing the data and the computations across many computers, allowing the analysis of even the largest datasets without overloading a single machine.
Spark and Its Similarity to Pandas
Spark has its own API and supports data-manipulation operations similar to those of Pandas, except that Spark can process data in a distributed manner and in memory. Spark’s DataFrame resembles a Pandas DataFrame but can work on large, distributed datasets. Spark scales across large clusters and avoids the memory-exhaustion issues characteristic of Pandas, so it is suitable for big data analysis where Pandas, because of its RAM limitations, is not.
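The hedged comparison below runs the same made-up filter-and-aggregate step in both libraries; in Pandas the whole table must fit in local RAM, while the Spark DataFrame could be partitioned across a cluster.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    rows = [("books", 120.0), ("music", 80.0), ("books", 95.0)]

    # Pandas: everything lives in local RAM.
    pdf = pd.DataFrame(rows, columns=["category", "amount"])
    print(pdf[pdf["amount"] > 90].groupby("category")["amount"].sum())

    # Spark: the same logic, but the DataFrame can be distributed across a cluster.
    spark = SparkSession.builder.appName("pandas-vs-spark").getOrCreate()
    sdf = spark.createDataFrame(rows, ["category", "amount"])
    sdf.filter(F.col("amount") > 90).groupBy("category").agg(F.sum("amount")).show()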
Stream Processing in Spark
Apache Spark excels at stream processing, enabling real-time data analysis. Spark Streaming and its newer version, Structured Streaming, are modules built into Spark that allow continuous processing of incoming data in a distributed manner.
In Spark Streaming, data is processed as mini-batches: small portions of data collected at short intervals (e.g., every few seconds) and then processed much like an ordinary batch job. In Structured Streaming, the model is more advanced: incoming data is treated as an endless DataFrame or Dataset, and Spark automatically manages how new data flows through the subsequent operations.
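A minimal Structured Streaming sketch, assuming a text stream arriving on a local socket (a common demonstration setup rather than a production source): each incoming line is appended to an unbounded table of rows, and the running word counts are recomputed as new data arrives.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

    # Each new line arriving on the socket becomes a new row of the unbounded "lines" table.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # The query keeps running, re-emitting updated counts to the console as the stream grows.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()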