To analyze big data, we first need to bring it into the Hadoop Distributed File System (HDFS). This data can come from application logs, sensor and machine data, geo-location data, and social media. Apache Flume is the component responsible for collecting and moving large amounts of such data into HDFS.
What is Apache Flume?
It is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving high-volume streaming data into HDFS. Primarily, Flume works as a logging system that gathers log files from different machines and stores them in HDFS. It is robust, with built-in reliability, failover, and recovery mechanisms.
Did You Know – A flume is a channel generally used to transport goods between a source and a destination. Likewise, Apache Flume lets you transfer data reliably from a source to Hadoop storage.
Why do we need a separate component to transfer the data?
Let’s explore other available options for transferring data to HDFS.
- Using the HDFS put command, we can transfer only one file at a time, while data generators produce data at a much higher rate (contrast this with the manual command shown after this list). Since analysis of older data is less accurate, we need a solution that transfers data in near real time.
- The put command also cannot keep up with web servers that generate data continuously.
- We need a fault-tolerant system that ensures no loss of data during transit.
- When the rate of incoming data exceeds the rate at which it can be written to the destination, we need a buffer between the two; Flume provides a steady flow of data between producers and HDFS.
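For contrast, here is what the manual approach looks like with the HDFS put command; the file name and target directory below are placeholders for illustration:

```
# Copies a single local file into HDFS -- one file per invocation,
# with no buffering, no retries, and no continuous collection.
hadoop fs -put access.log /user/hadoop/logs/
```

Flume replaces this one-shot, manual step with a continuously running agent.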
Components of Apache Flume
- Event – The single unit of data that Flume transports. An event has headers and a byte-array payload.
- Client – An entity that generates events and sends them to one or more agents, e.g., a web server.
- Agent – A container holding the Source, Channel, Sink, and other components that transport events from one place to another. It runs as an independent daemon in a JVM.
- Source – Receives events from clients and writes them to one or more channels. There are different types of sources, such as the Syslog, Netcat, Avro, and Twitter sources.
- Channel – Acts as a buffer that holds incoming events until a sink consumes them. Channels are fully transactional and work with any number of sources and sinks, e.g., the file channel and the memory channel.
- Sink – Takes events from a channel and transmits them to data stores such as HDFS or HBase. Terminal sinks deposit events at their final destination, while other sinks forward them onward; for example, the Avro sink can transmit events to another agent running an Avro source. A sink requires at least one channel to function. The sketch after this list shows how these components are named and wired together.
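As a minimal sketch of how these components fit together in a configuration file, the snippet below declares one source, one channel, and one sink inside an agent and binds them; the agent name a1 and the component names r1, c1, and k1 are arbitrary placeholders:

```
# Declare the components of agent "a1"
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Bind them: the source writes to the channel, the sink reads from it
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```

Note that a source can feed several channels (its property is the plural channels), while a sink reads from exactly one channel.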
Configuring Flume Agent
A Flume configuration file consists of the following:
- Names of the sources, channels, and sinks – There are many built-in components for common applications.
- Configure the source – Describe the source properties, such as its type and any type-specific settings (keys, tokens, and so on).
- Configure the sink – Configure the sink in the same way as the source, adding the properties specific to the sink used in the agent.
- Configure the channel – Choose the right channel to act as the buffer. The channel capacity has to be sized with worst-case downstream downtime in mind; this ensures no loss of data.
- Start the Flume agent – Put the command that starts the Flume agent in a shell script to save yourself from typing it repeatedly (a complete example follows this list).
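Putting these steps together, here is a minimal example configuration; the agent name a1, the Netcat port, and the HDFS path are assumptions chosen for illustration:

```
# example.conf: a single-node Flume agent (source -> channel -> sink)
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Configure the source: a Netcat source listening on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Configure the sink: an HDFS sink writing under a placeholder path
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/user/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Configure the channel: a memory channel as the buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Starting the agent is then a single command, which you can drop into a shell script:

```
flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
```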
Reliability and fault tolerance
A Flume source uses transactions to write data to a channel, and a sink likewise uses transactions to remove data from it. Just as a transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability (the ACID properties) to ensure data integrity, a Flume transaction commits only after its batch of events has been completely handed off to the next stage. That ensures no data loss in the process.
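The size of each transaction is configurable per channel. As an illustrative sketch (the numbers are placeholders), the memory channel's transactionCapacity bounds how many events a source or sink may put or take in one transaction, and it must not exceed the channel's total capacity:

```
# Total number of events the channel can buffer
a1.channels.c1.capacity = 10000
# Max events per transaction; a batch commits or rolls back as a unit
a1.channels.c1.transactionCapacity = 1000
```

If a sink fails mid-batch, its transaction rolls back and the events remain in the channel to be retried.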
Flume is used by internet companies such as Goibibo, to move logs from production systems into HDFS; Mozilla, for its BuildBot project; and Asana, in building its team collaboration software. We can also combine Flume with Kafka and Apache Solr to build systems capable of capturing data and analyzing it in real time.
Where else do you think Apache Flume is used?