As we dive headfirst into the digital age, don’t we all stop once in a while and wonder what makes it possible to compete the way we do? How are we managing exponentially growing data without compromising efficiency? Of late, one name has been making the rounds in the Big Data world, and it is here to stay. Apache Hadoop, in its most basic form, originated in 2003. Like most great inventions, the idea began as a solution to a problem plaguing the digital world: the ever-increasing mound of data, how to handle it, and what to do with it.
The basic idea for the solution came from Google. They published a research paper on the Google File System, and this led to a second paper on MapReduce – the programming model that is the heart and soul of Apache Hadoop. Thus, the revolution began. The major question at this point, given the burgeoning data, was: how do we process petabytes of data efficiently? The answer seems quite simple – distributed processing. This was when Hadoop was born.
Apache Hadoop Architecture
In the first generation of Apache Hadoop, the MapReduce component handled job scheduling, resource management, and job processing. This led to limited scalability and resource utilization. To solve this problem, the second generation of Hadoop was introduced. This is the version in use today.
The components of the Hadoop ecosystem are:
Hadoop Distributed File System (HDFS)
HDFS, as the name suggests, is Hadoop's distributed file system. It provides distributed storage by spreading large mounds of data across several machines. It is highly fault tolerant and stores data redundantly, ensuring there is no data loss in case of a failure.
How does HDFS work?
The primary components of HDFS are the NameNode and the DataNodes. HDFS follows a master-slave architecture: for a single cluster, one NameNode serves as the master. The NameNode is responsible for storing the file system's metadata. It tracks where each block is stored, the replication factor of the blocks, the number of copies created, and so on. On each machine in the cluster, a DataNode serves as a slave. The DataNode is responsible for storing the data blocks, reading from them, writing to them, and servicing requests for the data.
To service a read request – the DataNodes ping the NameNode with a short heartbeat message informing it that they are alive and free to service requests. If the NameNode does not receive a heartbeat from a DataNode, it assumes that the node has failed. Therefore, when a request arrives from the client, it directs the request to another DataNode holding a replica of the blocks. If a DataNode fails halfway through servicing a request, the partial read is discarded and the data is served again from a replica. The NameNode has a helper as well, the Secondary NameNode, which periodically checkpoints the file system metadata. When the client sends a request to the NameNode, the NameNode returns the addresses of the DataNodes that hold the data to the client's file system layer.
The client then approaches the corresponding DataNodes directly, and the DataNodes serve the blocks in parallel. Once retrieval is complete, the blocks are assembled and handed over to the client. The default HDFS block size is 128 MB. A write request is serviced along similar lines, again using parallel processing.
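To make the read path concrete, here is a minimal sketch using the HDFS Java client API. It assumes a running cluster whose configuration files are on the classpath; the file path /user/demo/input.txt is purely illustrative. The client asks the NameNode (via the FileSystem handle) where the blocks live, then streams the bytes from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client-side handle; metadata requests go to the NameNode

        Path file = new Path("/user/demo/input.txt");  // hypothetical file path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());   // typically 128 MB
        System.out.println("Replication: " + status.getReplication()); // typically 3 copies

        // The NameNode returns block locations; the actual bytes stream from the DataNodes.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```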
MapReduce

MapReduce is the brain behind Hadoop and coordinates all the distributed processing. Hadoop is written in Java, and writing MapReduce programs usually means coding in Java, although other languages such as Python can be used as well. In the first generation of Hadoop, MapReduce had two main components: the JobTracker and the TaskTracker. The JobTracker services requests between you and the Hadoop ecosystem, keeping track of the jobs processed, failures, and so on, while, to put it simply, a TaskTracker is to the JobTracker what a DataNode is to the NameNode. In the Hadoop ecosystem, the NameNode's role on the storage side parallels the JobTracker's role on the processing side.
How does MapReduce work?
The basic working of MapReduce is to split a job into sub-tasks and process those sub-tasks in parallel (the map phase). Once processing is complete, the intermediate results are combined and written out (the reduce phase).
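A minimal WordCount sketch in the Hadoop MapReduce Java API illustrates this split-and-combine flow: each input split is mapped in parallel to (word, 1) pairs, and the reducer sums the counts per word. The input and output paths passed on the command line are assumptions for the example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word across all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```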
Yet Another Resource Negotiator (YARN)
The second generation of Hadoop introduced YARN as a cluster resource manager. In the first generation, the MapReduce component handled job scheduling, resource management, and job processing, which led to limited scalability and resource utilization. To solve this problem, YARN was introduced between HDFS and the MapReduce component as a cluster resource manager. It takes care of resource management and job scheduling, leaving MapReduce to focus on job processing and allowing better resource utilization. Thus, the resource management functions and the job scheduling functions run separately.
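As a small sketch of that separation, the YarnClient API below asks YARN's ResourceManager (not MapReduce) which applications it is managing. It assumes a reachable ResourceManager configured in yarn-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnApplicationList {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());   // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();

        // Ask the ResourceManager for the applications it is currently managing.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getApplicationType() + "  "
                    + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
```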
Hadoop Common

Hadoop Common provides the shared utilities and libraries that all other Hadoop modules use. It is an essential part of the Hadoop ecosystem.
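One of those shared pieces is the Configuration class, which HDFS, YARN, and MapReduce all build on. The sketch below simply prints two well-known properties; the values depend on your core-site.xml and hdfs-site.xml, and the fallback default shown is an assumption.

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();  // loads core-default.xml and core-site.xml

        // fs.defaultFS points at the cluster's default file system (e.g. an HDFS NameNode).
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));

        // dfs.replication controls how many copies HDFS keeps of each block (commonly 3).
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
    }
}
```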
Since Hadoop is an open-source platform, many vendors have put in the effort to develop their own versions of Hadoop, adding functionality they believed would make the best solution. In 2008, the first Hadoop distributor, Cloudera, was founded, and it still remains the most widely used. Other vendors soon followed: MapR (founded in 2009) and Hortonworks (spun out of Yahoo! in 2011). These three distributors still dominate the Hadoop world. At the heart of it, all three offer the same core elements: the Hadoop Distributed File System (HDFS), based on the Google File System (GFS); the MapReduce engine; and Hadoop Common, a set of shared libraries. Around this core, the distributors have built their own implementations to try to give you the best product.
The Hadoop story is far from over and is still being written every day. A revolution is taking the world by storm, and it will continue to do so for the foreseeable future. Hadoop keeps evolving with new additions, making it one of the most coveted tools of the 21st century.
Apart from the use cases mentioned above, can you think of any others where Hadoop would be immensely helpful?