Apache Hadoop: Celebrating 10 Years
Introduction
As we dive headfirst into the digital age, we occasionally stop to wonder what lets us compete the way we do. How are we managing exponentially growing data without compromising on efficiency? Of late, one name has been making the rounds in the Big Data world, and it is here to stay. Hadoop originated as the solution to a problem: the ever-increasing mound of data, how to handle it, and what to do with it. The seeds of the solution came from Google, which published a research paper on its distributed file system and followed it with a paper on MapReduce, the programming model that is the heart and soul of Hadoop. Thus, the revolution began.
The major question posed by this burgeoning data was: how do we process petabytes of data efficiently? The answer seemed quite simple, distributed processing, and this was when Hadoop was born. Hadoop is an open source project of the Apache Software Foundation, written in Java and built for Big Data: a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
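To give a feel for that "simple programming model", here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is only a sketch: the input and output HDFS paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework takes care of splitting the input across the cluster, running the mappers near the data, shuffling the intermediate pairs, and running the reducers; the programmer only writes the two small functions above.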
Did You Know? Doug Cutting, who gave Hadoop its name, named it after his son’s toy elephant.
Evolution of Hadoop
Core Hadoop was developed in 2006 and 2007, and was shortly followed by the integration of HBase, ZooKeeper, and Mahout in 2008. HBase is a non-relational, distributed database written in Java, offering an alternative to classical SQL-based database systems for very large tables. ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. Apache Mahout provides a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop using the MapReduce paradigm.
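To give a flavour of what ZooKeeper does, the sketch below uses its Java client to store and read back a shared configuration value. The connection string, znode path, and value are assumptions made purely for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedConfigSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (the host and port are placeholders).
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

    // Store a small piece of shared configuration under a znode.
    String path = "/batch-size";
    if (zk.exists(path, false) == null) {
      zk.create(path, "128".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any machine in the cluster can read the same value back.
    byte[] data = zk.getData(path, false, null);
    System.out.println("batch-size = " + new String(data));

    zk.close();
  }
}

Because every node sees the same znode tree, this is how distributed components agree on configuration, elect leaders, and coordinate with one another.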
Hadoop became more flexible with the introduction of Pig and Hive in 2009. Apache Pig is a platform for analyzing large data sets; it consists of a high-level language for expressing data analysis programs, coupled with the infrastructure for evaluating those programs. Hive provides data summarization, query, and analysis through an SQL-like interface.
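As an example of how Hive puts a familiar face on Hadoop, here is a small sketch that runs a Hive query from Java over JDBC. The HiveServer2 address, credentials, and the access_logs table are assumptions made for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver; host, port, and credentials are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // The query is ordinary SQL; Hive compiles it into jobs that run on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}

Analysts who already know SQL can therefore work with petabyte-scale data without writing MapReduce code by hand.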
In 2010 Flume, Avro, Whirr, and Sqoop joined the existing components, which carried Hadoop into more fields of industry. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Avro provides rich data structures and a compact, fast data serialization format. Whirr made it possible to run Hadoop services in the cloud. Sqoop arrived to transfer data between Hadoop and relational databases or mainframes.
At present many more tools and techniques work alongside Hadoop, such as HCatalog, MRUnit, Bigtop, and Oozie, which have increased its power and widened the areas in which it is applied.
Additions
The most significant additions to the Hadoop framework came with the release of Hadoop 2.0 in 2013. The major difference between Hadoop 1.0 and 2.0 is the computation platform they use: Hadoop 1.0 uses MRv1, whereas Hadoop 2.0 uses YARN (MRv2).
MRv1: the master is the JobTracker, the slaves are the TaskTrackers
YARN: the master is the ResourceManager, the slaves are the NodeManagers, plus an application-specific ApplicationMaster per application
The transformation from 1.0 to 2.0 was a thorough improvement from the architectural point of view. The JobTracker functionality of 1.0 was split into two components:
1. an application-specific ApplicationMaster
2. a global ResourceManager
YARN introduced the concept of a ‘container’. A container is a bundle of resources such as ‘x amount of memory, y number of cores’. The ResourceManager allocates these containers to different tasks, and the ApplicationMaster actually launches the tasks inside the allocated containers. As a result, there are no longer dedicated map and reduce slots on each TaskTracker; instead, a container is allocated for each task in an application. Apache Hadoop 2.0 scales better and is more general than Hadoop 1.0. The comparison between them is shown in Figure 2. Hadoop 2.0 also brought improvements at the HDFS level, such as NameNode high availability and HDFS federation.
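To make the container idea concrete, here is a minimal sketch, not a complete ApplicationMaster, of how the YARN Java client API can be used to ask the ResourceManager for a container. The memory and core figures are arbitrary placeholders, and in practice this code would run inside a registered ApplicationMaster.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // The ApplicationMaster registers itself with the ResourceManager.
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new YarnConfiguration());
    amClient.start();
    amClient.registerApplicationMaster("", 0, "");

    // A 'container' is simply a bundle of resources: here 1024 MB of memory and 2 vcores.
    Resource capability = Resource.newInstance(1024, 2);
    Priority priority = Priority.newInstance(0);

    // Ask the ResourceManager for one such container anywhere in the cluster.
    amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Allocated containers arrive asynchronously through allocate() heartbeats,
    // and the ApplicationMaster then launches its tasks inside them.
  }
}

Because any application can request containers this way, YARN lets engines other than MapReduce share the same cluster, which is exactly what makes Hadoop 2.0 more general than 1.0.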
Key Fact: How big is Big Data?
A study has shown that the digital universe will keep on growing: from 3.2 zettabytes today to 40 zettabytes in a period of only six years (one zettabyte is around a billion terabytes). For example, according to Stephen Brobst, the CTO of Teradata, a Boeing jet generates 10 terabytes of data per engine for every 30 minutes of flight. So for a single six-hour, cross-country flight from Los Angeles to New York on the most frequently used type of twin-engine jet, that works out to 2 engines × 12 half-hour intervals × 10 terabytes: an enormous 240 terabytes of data!
Conclusion
The Hadoop story is far from over and is still being written every day. A revolution is taking the world by storm, and it will continue to do so in the near future. New additions keep arriving, and the framework is constantly evolving, becoming one of the most coveted tools of the 21st century.
So, how far do you think the story of Hadoop will go?