Deep dive into Big Data with Hadoop (Part 2): Hadoop Architecture

by Pronay Ghosh and Hiren Rupchandani

  • In the previous article, we covered an overview of Hadoop and why it came into the picture in the first place.
  • In this article, we will learn about the architecture of Hadoop and its various key components.
  • Hadoop, as we all know, is a Java-based framework for storing and managing enormous amounts of data on a large cluster of commodity hardware.
  • Hadoop is based on the MapReduce programming model, which was originally introduced by Google.
  • Many large companies, such as Facebook, Yahoo, Netflix, and eBay, have embraced Hadoop in their organizations to cope with big data.

The Hadoop architecture is made up of four main components: MapReduce, HDFS, YARN, and Hadoop Common (also known as the Common Utilities of Hadoop).

  • In this article, we will cover a high-level overview of the Hadoop architecture.

1. MapReduce

  • MapReduce is a programming model and processing engine that runs on top of the YARN framework.
  • MapReduce’s main characteristic is that it performs distributed, parallel processing across a Hadoop cluster.
  • This is what makes Hadoop so fast; serial processing is no longer practical when dealing with Big Data.
  • A MapReduce job consists mainly of two tasks, each corresponding to a phase: a map function is applied in the first phase, and a reduce function in the second.
  • The input is passed to the Map() function, whose result is passed to the Reduce() function, which ultimately returns our final output.
  • Let’s have a look at what Map() and Reduce() actually do.
  • Since we are dealing with Big Data, the input is a collection of data split into blocks.
  • The Map() function converts these data blocks into tuples, which are simply key-value pairs.
  • These key-value pairs are then passed to the Reduce() function.
  • The Reduce() function groups the tuples by their key, performs operations such as sorting and summing, and produces a smaller set of tuples that is sent to the final output.
  • The exact processing done in the reducer depends on the business requirements of the use case.
  • This is how Map() and Reduce() are used one after the other; a minimal sketch of the whole flow follows this list.
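To make the flow concrete, here is a minimal, cluster-free sketch in plain Java that imitates the three steps for a word count: map every record into (key, value) pairs, group the pairs by key (the shuffle), and reduce each group to a final value. It is only an illustration of the idea, not Hadoop API code; the class and variable names are ours.

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> input = List.of("big data", "big cluster", "data data");

        // Map: turn every record into (key, value) pairs -> (word, 1)
        List<Map.Entry<String, Integer>> mapped = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Shuffle: group all values that share the same key
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce: aggregate the grouped values per key -> (word, count)
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(reduced);   // prints {big=2, cluster=1, data=3}
    }
}
```

The Hadoop-specific versions of the map and reduce steps are sketched in the subsections below.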

The key components of the Map phase:

I. RecordReader:

  • The RecordReader’s job is to break the input split into individual records.
  • It is responsible for delivering these records to the Map() function as key-value pairs.
  • The key carries positional information (with the default TextInputFormat, the byte offset of the line in the file), and the value is the data itself (the contents of the line), as the sketch below shows.
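As a small illustration of what the RecordReader hands to the map function, here is a minimal Hadoop mapper that simply echoes its input. It assumes the default TextInputFormat (whose LineRecordReader is used unless another input format is configured); the class name is ours.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the LineRecordReader feeds map() one record per line:
// key = byte offset of the line within the file, value = the line's contents.
public class RecordEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // e.g. offset = 0 for the first line, offset = 42 for a line starting at byte 42
        context.write(offset, line);
    }
}
```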

II. Map:

  • A map is simply a user-defined function that processes the records delivered by the RecordReader.
  • For each input record, the Map() function may emit zero, one, or many intermediate key-value pairs; the classic word-count mapper below is a typical example.
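Here is a sketch of such a user-defined map function: the word-count mapper, which emits a (word, 1) pair for every token in the input line. It assumes the default text input (byte-offset key, line value) as in the previous sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token in the line
        }
    }
}
```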

III. Combiner:

  • In the Map phase, the combiner is used to pre-aggregate the data.
  • It works like a local reducer: it combines the intermediate key-value pairs produced by the Map on each mapper node before they are sent over the network.
  • Using a combiner is optional; it is an optimization rather than a requirement (a sketch follows below).
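As a sketch, a combiner is written exactly like a reducer; the hypothetical class below sums the partial counts produced by a single mapper before they cross the network. For word count it is common to simply register the final reducer class itself as the combiner (as the driver sketch later in this article does), because summing is associative and commutative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer that Hadoop runs on the map side, on one mapper's
// output at a time, to shrink the data before the shuffle. Class name is ours.
public class LocalSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> partialCounts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : partialCounts) {
            sum += count.get();
        }
        // The combiner's output types must match the reducer's input types.
        context.write(word, new IntWritable(sum));
    }
}
```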

IV. Partitioner:

  • The partitioner is in charge of routing the key-value pairs created during the Map phase.
  • It decides which reducer each pair is sent to, producing one shard (partition) per reducer.
  • The default partitioner takes the hash code of each key and computes its modulus with the number of reducers: key.hashCode() % (number of reducers), as sketched below.
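A sketch of that behaviour, mirroring what Hadoop's default HashPartitioner does (the class name here is ours): hash the key, mask off the sign bit, and take the modulus with the number of reduce tasks, so that all pairs sharing a key land on the same reducer.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask with Integer.MAX_VALUE to keep the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With three reducers, for example, a key whose hash code is 14 goes to partition 14 % 3 = 2. A custom partitioner like this would be registered with job.setPartitionerClass(WordPartitioner.class).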

The key components of the Reduce phase:

I. Shuffle and Sort:

  • The reducer’s work begins with this phase: shuffling is the process by which the intermediate key-value pairs produced by the mappers are transferred to the reducer tasks.
  • Sorting then arranges the data by key, so that all values belonging to the same key are grouped together for the reducer.
  • Shuffling can begin as soon as some of the map tasks are finished, which speeds up the job because it overlaps with the map phase instead of waiting for every mapper to complete.

II. Reduce:

  • Reduce’s main task is to gather the tuples produced by Map and then perform sorting and aggregation on those key-value pairs, grouped by their key, as in the reducer sketched below.
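A sketch of the word-count reducer: after the shuffle and sort, it receives each word together with all of its counts from every mapper and emits the total. It has the same shape as the combiner sketched earlier, because a combiner is simply a reducer applied locally.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {   // all counts for this word, grouped by the shuffle
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);      // emit (word, total count)
    }
}
```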

III. OutputFormat:

  • Once all of the operations have been completed, the key-value pairs are written into the output file by a record writer; with the default TextOutputFormat, each record starts on a new line with the key and value separated by a tab character. A driver sketch that wires all of these pieces together follows.
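A sketch of a driver that wires the pieces above together: mapper, optional combiner, reducer, and TextOutputFormat, whose record writer produces one key-tab-value line per output pair. Class names other than the Hadoop ones are ours, and the input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextOutputFormat's record writer emits one "key<TAB>value" line per pair.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```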

2. HDFS (Hadoop Distributed File System)

  • In a Hadoop cluster, HDFS (Hadoop Distributed File System) is the storage layer.
  • It is primarily intended to run on commodity hardware (low-cost machines) with a distributed file system design.
  • HDFS is built in such a way that it favors storing data in a few large blocks (128 MB by default in recent versions) over storing many small blocks.
  • HDFS provides fault tolerance and high availability at the storage layer by replicating each block across several machines in the cluster (three replicas by default).
  • These machines are known as HDFS data storage nodes, or DataNodes; the sketch below shows a file being copied into HDFS and its block size and replication factor being read back.
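As a small sketch of HDFS from a client's point of view (the paths and class name are hypothetical, and a reachable cluster configuration is assumed to be on the classpath), the snippet below copies a local file into HDFS and reads back the block size and replication factor recorded for it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml on the classpath point at your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("data/input.txt");         // hypothetical local file
        Path remote = new Path("/user/demo/input.txt");  // hypothetical HDFS path
        fs.copyFromLocalFile(local, remote);

        // HDFS splits the file into large blocks and replicates each block across DataNodes.
        FileStatus status = fs.getFileStatus(remote);
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());
    }
}
```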

3. YARN (Yet Another Resource Negotiator)

  • MapReduce runs on top of the YARN framework.
  • Job scheduling and resource management are the two main tasks that YARN performs.
  • The goal of job scheduling is to break a large job into smaller tasks so that each one can be assigned to different worker nodes in the Hadoop cluster, keeping processing optimized.
  • The job scheduler also keeps track of which jobs are more important, which have higher priority, job dependencies, and other information such as expected job time.
  • The resource manager is used to manage all of the resources made available for running the Hadoop cluster; a minimal configuration sketch follows.
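A minimal sketch of the MapReduce-on-YARN relationship from the client side. These properties are normally set cluster-wide in mapred-site.xml and yarn-site.xml rather than in code; the hostname here is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnJobConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");          // run MapReduce jobs on YARN
        conf.set("yarn.resourcemanager.hostname", "rm-host");  // hypothetical ResourceManager host
        Job job = Job.getInstance(conf, "word count on yarn");
        System.out.println("Submitting " + job.getJobName() + " to YARN");
        // ... mapper/reducer/paths as in the driver sketch above ...
    }
}
```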

4. Hadoop Common or Common Utilities

  • Hadoop Common, or the Common Utilities, is essentially the set of shared Java libraries and files.
  • In other words, these are the Java resources required by all of the other Hadoop cluster components.
  • HDFS, YARN, and MapReduce all rely on these utilities to run the cluster.
  • Hadoop Common works on the assumption that hardware failures are common in a Hadoop cluster, so the Hadoop framework must detect and handle them automatically in software.

Conclusion:

  • In this article, we covered a high-level overview of Hadoop’s architecture.
  • We learned about the key components of the MapReduce framework, as well as HDFS, YARN, and Hadoop Common.
  • In the next article, we will learn about the architecture of HDFS and its various key components.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to cover these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, as it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a clap 👏 if you find this article useful, as your encouragement helps us create more content like this.
