FAQ1: What is Hadoop?
Hadoop emerged as a solution to "Big Data" problems. It is an Apache project sponsored by the Apache Software Foundation (ASF). Hadoop is an open-source software framework for the distributed storage and distributed processing of extremely large data sets. Open source means it is freely available, and its source code can even be modified to suit our requirements.
Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure.
Apache Hadoop provides a storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).
The core of Hadoop consists of two main components, HDFS (Hadoop Distributed File System) and MapReduce, running on clusters built from commodity hardware.
1) HDFS - the distributed storage layer, where each file is split into blocks (128 MB by default, 64 MB in older releases) and spread across the cluster.
2) MapReduce - the distributed parallel processing layer, where the stored data is processed in parallel across the cluster.
Here, the code travels to the nodes where the data resides and is executed there, which is known as data locality. This is why the processing is fast.
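To make the split between the two components concrete, below is a minimal word-count job along the lines of the standard Hadoop MapReduce example: the mapper runs wherever the input blocks are stored (data locality) and emits (word, 1) pairs, and the reducer sums the counts per word. The input and output paths supplied on the command line are placeholders for this sketch.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the input splits and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: after the shuffle, sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}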
Hadoop can scale from a single node to thousands of nodes in a cluster.
The framework also includes related projects such as Hive, Sqoop, Flume, and HBase.
FAQ2: Why Hadoop?
Hadoop is not only a storage system but a platform for both data storage and data processing. It is scalable (more nodes can be added on the fly), fault tolerant (even if a node goes down, its data can be processed by another node), and open source (the source code can be modified if required).
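As a small illustration of the fault-tolerance point, the sketch below uses the HDFS FileSystem API to print the replication factor and block size of a file; the path /data/sample.txt is a hypothetical example, and it assumes a reachable HDFS cluster whose configuration files are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and other settings from core-site.xml / hdfs-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical HDFS path used only for illustration.
    Path file = new Path("/data/sample.txt");
    FileStatus status = fs.getFileStatus(file);

    // Each block of the file is stored on this many DataNodes, which is
    // what allows processing to continue when a node fails.
    System.out.println("Replication factor: " + status.getReplication());
    System.out.println("Block size (bytes): " + status.getBlockSize());
  }
}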
The following characteristics of Hadoop make it a unique platform: