Hadoop is one of the most popular sets of big data processing technologies/frameworks in use today. From Adobe and eBay to Facebook and Hulu, Hadoop is used by a wide range of companies for everything from machine learning and natural language processing to data analysis and ad optimization. It is now a common definition of Hadoop that it caters to “the 3 V’s” – volume, velocity, and variety. With use cases in analytics, search, log processing, analysis, and recommendation engines, Hadoop is commonly used in applications that are data-intensive.
The Hadoop Distributed File System (HDFS) is a reliable Hadoop shared storage. It scales horizontally, which means we can add as many nodes/servers without shutting down the system, which with commodity hardware can be very cost effective. It can be used to process both structured as well as unstructured data. It is highly optimized for very large sets of data.
According to Forbes, 53% of companies in one recent study were employing big data analytics, which was “defined as systems that enable end-user access to and analysis of data contained and managed within the Hadoop ecosystem.” For a list of companies that use Hadoop and short descriptions of how they use it, see this Powered By page on the Hadoop Wiki.
Prior to Hadoop 2.0, i.e. Hadoop 1.0, there was a single point of failure (SPOF) in NameNode. By this, we mean if the NameNode fails, then whole Hadoop system/Cluster goes down. Then it can only be recovered manually with the help of a secondary NameNode which results in overall downtime of the Hadoop Cluster, i.e. the Hadoop cluster would not be available unless the Hadoop administrator restarts the NameNode.
The shortcoming in Hadoop 1.0 was overcome in Hadoop 2.0 by providing support for two NameNodes. It introduces the Hadoop 2.0 High Availability feature that brings in an extra NameNode (Passive Standby NameNode) to the Hadoop Architecture, which is configured for automatic failover. Now Hadoop has its built-in fault tolerance. Data is replicated across multiple nodes by means of replication factor, which can be configured at the admin end and if a node (Active NameNode) goes down, the required data can be read from another node (Standby NameNode), which has the copy of that data by courtesy of Journal nodes.
If you are running Hadoop and want to ensure stability and reliability of your cluster, one can overcome it by facilitating Hadoop’s high availability and automatic failover options. The problem this addresses is that without using high availability and automatic failover, any time the NameNode goes down, the entire system goes down. You can think of high availability and automatic failover as your insurance policy against a single point of failure.
As per Hortonworks:
Besides a small downtime using HA, the other important point to be noted is that it is important for an HA cluster that only one of the NameNodes is active at a time. Otherwise, the namespace state would be divided between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario”.
In this case, multiple master nodes think they’re in charge of the cluster, is a “time skew” on the order of perhaps 10 or more seconds between the current master node and the other master. This is however taken care by JournalNodes, they will only allow a single NameNode to be a writer at a time. During a failover, the Standby NameNode which is to become active will play the role of writing to the JournalNodes, and prevent the other NameNode from continuing in the Active state, allowing the new Active NameNode to successfully take the failover
This video tutorial was created for Hadoop administrators or database administrators responsible for administering their company’s Hadoop clusters. It assumes intermediate knowledge of Hadoop and provides a step-by-step walkthrough of how to enable high availability and automatic failover using Zookeeper.
The following changes are made in the files after implementing HA in Hadoop cluster:
Automatic Failover in Hadoop starts automatically in case of NameNode crashdown. Automatic failover adds ZooKeeper quorum and ZKFailoverController Process (ZKFC) components to an HDFS deployment. Work is going on to make Automatic failover a better process, like without starting the HDFS service and to reduce the HA time.
As per Cloudera, work was recently started to allow Hadoop configuration changes without a restart.