Hadoop is an open-source framework for storing data and running applications on clusters of commodity hardware. It is a cheap, scalable way to store and process data, which is why it has been adopted by so many companies facing big data problems. So, here’s a list of 10 big names that use Hadoop:
1. Adobe
In 2008, Adobe started using Hadoop to solve its big data problems such as weblog processing, Flash analytics, image analytics, and recommendations for its media player users. It currently has about 30 nodes running Hadoop, with clusters of 5 to 14 nodes running Apache HBase for both production and development. It runs MapReduce jobs to process data from Apache HBase and then stores the results back in Apache HBase or in external systems.
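To make that HBase-in, HBase-out pattern concrete, here is a minimal sketch (not Adobe’s actual code) of a MapReduce job that scans one HBase table, aggregates event counts per user, and writes the totals back into a second HBase table. The table, column family, and qualifier names are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseAggregateJob {

  // Emits (user, 1) for every event row found in the hypothetical "events" table.
  static class EventMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      byte[] user = value.getValue(Bytes.toBytes("e"), Bytes.toBytes("user"));
      if (user != null) {
        ctx.write(new Text(Bytes.toString(user)), new LongWritable(1L));
      }
    }
  }

  // Sums the counts per user and writes one Put per user into the target table.
  static class CountReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text user, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      Put put = new Put(Bytes.toBytes(user.toString()));
      put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("events"), Bytes.toBytes(total));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-event-counts");
    job.setJarByClass(HBaseAggregateJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner batches suit full-table MapReduce scans
    scan.setCacheBlocks(false);  // don't pollute the region server block cache

    TableMapReduceUtil.initTableMapperJob(
        "events", scan, EventMapper.class, Text.class, LongWritable.class, job);
    TableMapReduceUtil.initTableReducerJob("event_counts", CountReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```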
2. Alibaba
The Chinese e-commerce giant uses Hadoop to recommend products to its customers, analyze cash flows, optimize stock, analyze credit risk, and more. It uses a 15-node cluster for large-scale data processing; each node is octa-core with 16 GB RAM and 1.4 TB of storage. (Now that’s what we call Big Data!) Alibaba also has a cloud platform called Apsara for storage and data processing. A single Apsara cluster can have up to 5,000 servers with 100 petabytes of storage capacity and 100,000 CPU cores.
3. eBay
eBay, another very popular e-commerce company, uses Hadoop to analyze clickstream data, recommend products to its users, and improve the customer experience. It has over 4,000 nodes in its clusters, each with octa-core processors, and a total storage capacity of over 50 petabytes. eBay uses MapReduce, Apache Pig, Apache Hive, and Apache HBase. It also uses Hadoop to index web pages with low latency for Cassini, eBay’s search engine. Additionally, Cassini uses Apache HBase for random access to product information.
4. Facebook
The social media platform has over 1.65 billion monthly active users worldwide. The sheer amount of data its users generate is what pushed the world’s biggest social networking website towards Hadoop. It started using Apache Hadoop to store internal logs and as a source for analytics. It currently has 2 major clusters, one with 1,100 nodes and the other with 300 nodes; each node has 8 cores and 12 TB of storage. Facebook also built Apache Hive, whose SQL-like query language, HiveQL, lets programmers and analysts who don’t write Java use Hadoop.
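To give a feel for HiveQL, here is a minimal sketch of running a query from Java over JDBC. The HiveServer2 address, credentials, and the page_views table are assumptions made for illustration, not details from the article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL, but Hive compiles it into jobs
      // that run on the Hadoop cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT country, COUNT(*) AS views "
        + "FROM page_views "
        + "WHERE view_date = '2016-01-01' "
        + "GROUP BY country "
        + "ORDER BY views DESC "
        + "LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
      }
    }
  }
}
```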
5. LinkedIn
The world’s largest professional networking website helps its users connect with colleagues and potential recruiters. LinkedIn recommends people its users may know and also suggests jobs based on their profiles and skill sets.
It has 3 major clusters: one with 800 nodes (each node with 8 cores, 24 GB RAM, and 12 TB of storage), one with 1,900 nodes (12 cores, 24 GB RAM, 12 TB per node), and one with 1,400 nodes (12 cores, 32 GB RAM, 12 TB per node). It uses Apache Pig, Apache Hive, Apache Avro, and Apache Kafka.
6. The New York Times
The American newspaper has an online digital platform on which it provides PDF versions of its articles. Because it needed to generate millions of PDFs from its archive, it turned to Hadoop. It uses Amazon S3 to store the articles and Amazon EC2 to run Hadoop on large virtual clusters. Using 100 Amazon EC2 instances, it was able to produce PDFs of 11 million articles in less than 24 hours and store them in Amazon S3.
7. Rackspace
By providing and managing cloud services, Rackspace spares many businesses the need to hire their own cloud computing professionals. This helps those companies save thousands of dollars, as their CAPEX and OPEX drop dramatically! It not only provides businesses with storage space but also lets them build applications on its infrastructure.
This Texas-based cloud provider has a 30-node cluster, each node with a dual-core processor, 8 GB RAM, and 1.5 TB of storage. It also hosts email, which generates hundreds of gigabytes of log data every day. Rackspace uses Hadoop to parse and index these logs to speed up search results.
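As an illustration of that kind of log processing, here is a minimal MapReduce sketch (not Rackspace’s actual code) that parses mail log lines and counts delivery events per recipient domain. The log format, field layout, and paths are assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MailLogIndexer {

  // Extracts the recipient domain from lines such as:
  //   2016-05-01T10:00:00 delivered to=<alice@example.com> ...
  static class DomainMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text domain = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String line = value.toString();
      int at = line.indexOf('@');
      int end = line.indexOf('>', at);
      if (at >= 0 && end > at) {
        domain.set(line.substring(at + 1, end));
        ctx.write(domain, ONE);
      }
    }
  }

  // Sums the per-domain counts emitted by the mappers.
  static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text domain, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      ctx.write(domain, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mail-log-domains");
    job.setJarByClass(MailLogIndexer.class);
    job.setMapperClass(DomainMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/mail/2016-05-01
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /reports/domains
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```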
8. Spotify
The popular online music streaming company uses Apache Hadoop for content generation, statistical analysis of log data, and music recommendations for millions of listeners every day. It uses Apache Crunch and MapReduce for data processing; a small Crunch sketch follows below.
It has a 1,650-node cluster with a total of 43,000 virtualized cores, a cumulative 70 TB of RAM, and 65 PB of storage.
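Here is a minimal, purely illustrative Apache Crunch sketch that counts plays per track from tab-separated text logs; the input format, field positions, and paths are made-up assumptions, not details from the article.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class PlayCounts {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(PlayCounts.class, new Configuration());

    // Each input line is assumed to look like: "<user>\t<track_id>\t<timestamp>"
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    PCollection<String> trackIds = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
              emitter.emit(fields[1]); // the track id
            }
          }
        },
        Writables.strings());

    // count() groups by value; Crunch compiles the pipeline into MapReduce jobs.
    PTable<String, Long> playsPerTrack = trackIds.count();

    pipeline.writeTextFile(playsPerTrack, args[1]);
    pipeline.done();
  }
}
```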
9. Twitter
The online social media behemoth uses Apache Hadoop to store and process tweets, log files, and the other forms of data generated every day by its 310 million users worldwide.
It writes its MapReduce code in Scala and Java and is also a heavy user of Apache Pig for both scheduled and ad-hoc queries. It has thousands of nodes running Hadoop across multiple clusters.
10. Yahoo
Yahoo is the world’s biggest user of and contributor to Apache Hadoop. Doug Cutting and Mike Cafarella created Hadoop, and Cutting continued developing it at Yahoo, where the distributed computing framework from their Nutch project was incorporated into Yahoo’s web crawler for better indexing and searching of web pages. However, many Yahoo employees had no prior Java programming experience, so the company developed Apache Pig to make it simpler for them to use Hadoop.
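To show how Pig lowers the bar, here is a minimal sketch of running a Pig Latin script from Java via PigServer. The log path, field layout, and output path are invented for illustration; the point is that the query logic needs no hand-written MapReduce code.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TopQueries {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load tab-separated search logs as (user, query) pairs.
    pig.registerQuery(
        "logs = LOAD '/data/search_logs' USING PigStorage('\\t') "
      + "AS (user:chararray, query:chararray);");

    // Count how often each query occurs.
    pig.registerQuery("grouped = GROUP logs BY query;");
    pig.registerQuery(
        "counts = FOREACH grouped GENERATE group AS query, COUNT(logs) AS n;");

    // Keep the most frequent queries; Pig compiles all of this into MapReduce jobs.
    pig.registerQuery("ordered = ORDER counts BY n DESC;");
    pig.registerQuery("top = LIMIT ordered 100;");

    pig.store("top", "/reports/top_queries");
  }
}
```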
Today, Yahoo has over 40,000 nodes running Hadoop across 4 data centers, with the processing power of 100,000 CPUs. Its biggest cluster has 4,500 nodes, each with octa-core processors, 4 TB of storage, and 16 GB of RAM. It uses Hadoop to manage the metrics for ranking web pages on its search engine. Apart from this, it also uses Hadoop for content optimization, email spam filtering, and its ad systems.
From social networking sites to e-commerce platforms, companies across many sectors use Hadoop to deal with big data. And with big data on the rise, the number of companies using Hadoop will only keep growing!