A Hadoop cluster is a reservoir of heterogeneous data, both structured and unstructured, coming from a variety of sources. Apache Hive is a data warehouse tool that can easily crunch petabytes of data and works well for interactive SQL queries. Industry giants use it widely to generate actionable business insights, such as determining customer churn, tracking price changes in e-commerce, and identifying buying patterns in retail.
What is Apache Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. It offers an SQL-like language, Hive Query Language (HiveQL), for querying data stored in the various databases and file systems that integrate with Hadoop, so anyone with a decent working knowledge of SQL can pick it up quickly. Hive is not an OLTP (Online Transaction Processing) tool: it does not provide row-level updates and inserts, which is its biggest disadvantage. It is, instead, closer to OLAP (Online Analytical Processing).
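To see how familiar HiveQL feels, here is a minimal sketch; the table and column names are hypothetical, chosen only for illustration:

```sql
-- Define a table over data in HDFS (hypothetical schema for illustration).
CREATE TABLE IF NOT EXISTS page_views (
  user_id    BIGINT,
  page_url   STRING,
  view_time  TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A familiar SQL-style aggregation; Hive compiles it into MapReduce jobs.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```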
How does it work?
Hive turns queries into MapReduce jobs, which run on the Hadoop cluster. This conversion introduces start-up overhead, so Hive queries have higher latency; even queries over small datasets can take noticeable time. Unlike a traditional database, which uses "Schema on Write", Hive uses "Schema on Read". With Schema on Write, a table's schema is enforced at data load time: if the data being loaded does not conform to the schema, it is rejected, and this check slows down the loading process. Schema on Read, by contrast, does not verify the data when it is loaded, but only when a query is issued against it.
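A short sketch of Schema on Read in practice, reusing the hypothetical page_views table from above:

```sql
-- Schema on Read: LOAD DATA simply moves the file into the table's HDFS
-- directory; nothing is validated against the schema at load time.
LOAD DATA INPATH '/data/raw/page_views.tsv' INTO TABLE page_views;

-- Verification effectively happens here: fields that cannot be parsed as
-- the declared column types surface as NULLs when the query reads the files.
SELECT user_id, view_time FROM page_views LIMIT 5;
```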
Did You Know? Apache Hive ships with Apache Derby as its default, embedded metastore database. Derby supports only a single user at a time; replacing it with MySQL (or another RDBMS) improves performance and allows concurrent access.
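For reference, switching the metastore to MySQL is done in hive-site.xml; a sketch of the relevant properties, where the host, database name, and credentials are placeholders:

```xml
<!-- hive-site.xml: point the metastore at MySQL instead of embedded Derby.
     Host, database name, and credentials below are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```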
Understanding Hive Architecture
Hive UI – The user interface through which queries are submitted; the Driver then executes them. Queries can also be submitted from the command line (Hive interactive shell mode).
Thrift Server – Connects external clients to the Hive Driver; it comes into play when users connect through JDBC/ODBC programs.
Driver – Responsible for creating a session for the query and providing APIs for executing statements and fetching results.
Compiler – Generates the execution plan for the query, consulting the metastore to collect the metadata it needs. The plan is a directed acyclic graph (DAG) of stages, where each stage is a MapReduce job, a metadata operation, or an operation on HDFS; the EXPLAIN example after this list shows such a plan.
Executor – Executes the plan created by the compiler, managing the dependencies between the stages and running each stage on the appropriate system component (JobTracker, NameNode).
Metastore – Stores the metadata of tables (schemas, locations, partitions) whose data resides in HDFS. Keeping this metadata centrally also reduces the time needed to perform semantic checks during query compilation.
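To inspect the plan the compiler produces, one can run EXPLAIN on a query. A sketch, with the output abbreviated (the exact stage layout varies by Hive version):

```sql
-- EXPLAIN shows the DAG of stages the compiler produced for a query.
EXPLAIN
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;

-- Typical (abbreviated) output lists the stage dependencies, e.g.:
--   STAGE DEPENDENCIES:
--     Stage-1 is a root stage              (a MapReduce job)
--     Stage-0 depends on stages: Stage-1   (a fetch operation)
```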
Facebook started Apache Hive as a project in 2008 with two goals:
- An SQL-based declarative language that allowed engineers to plug in their own scripts and programs when SQL did not suffice.
- A centralised store of metadata about all the (Hadoop-based) datasets of the organisation, which was indispensable for creating a data-driven organisation.
The Hive community keeps rolling out new versions; Hive 2.1 was released recently. Hive handles petabytes of data with speed, security, and SQL support, which makes it a must-learn tool for every aspiring data scientist.
Apache Hive has performance issues when working with small datasets. Is this always true?
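Not necessarily. The overhead comes from launching MapReduce jobs, and Hive provides settings to sidestep that for small inputs. A sketch using real Hive configuration properties (exact defaults vary by version):

```sql
-- Run simple SELECTs as a plain fetch task, skipping MapReduce entirely
-- (allowed values: none | minimal | more).
SET hive.fetch.task.conversion=more;

-- Let Hive automatically run small jobs in local mode instead of
-- submitting them to the cluster.
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- ~128 MB threshold
```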