FAQ1: What is Hadoop?
Hadoop emerged as a solution to the "Big Data" problem. It is an Apache project sponsored by the Apache Software Foundation (ASF). Hadoop is an open-source software framework for distributed storage and distributed processing of extremely large data sets. Open source means it is freely available, and its source code can even be modified to suit our requirements.
Hadoop makes it possible to run applications on systems with thousands of commodity-hardware nodes and to handle thousands of terabytes of data. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating when a node fails.
Apache Hadoop provides a storage layer (HDFS), a batch-processing engine (MapReduce), and a resource-management layer (YARN).
The core of Hadoop consists of two main components: HDFS (Hadoop Distributed File System) and MapReduce.
1) HDFS - The distributed storage part: each file is split into blocks of 64 MB or 128 MB (the defaults in Hadoop 1.x and 2.x respectively) and spread across the cluster.
2) MapReduce - The distributed parallel processing part: the stored data is processed in parallel across the cluster.
Here the code travels to the nodes where the data resides and is executed there; this principle is known as data locality, and it is what makes the processing fast.
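The MapReduce model described above can be sketched in plain Python (this is an illustration of the programming model, not the actual Hadoop API): a map step emits (word, 1) pairs, a shuffle step groups the pairs by key, and a reduce step sums the counts for each word.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two illustrative input "splits"; on a real cluster each split would be
# processed by a map task running on the node that stores it.
lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In Hadoop itself, the map and reduce functions run as distributed tasks and the shuffle is handled by the framework; the flow of data through the three steps is the same.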
Hadoop can be scalable from a single node to thousands of nodes in a cluster.
The broader Hadoop ecosystem also includes projects such as Hive, Sqoop, Flume, and HBase.
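The HDFS block splitting mentioned above comes down to simple arithmetic. The sketch below uses the Hadoop 2.x default block size of 128 MB and a hypothetical 300 MB file (both numbers are illustrative):

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size in Hadoop 2.x
file_size_mb = 300    # a hypothetical 300 MB file

# HDFS stores the file as full-size blocks plus one final partial block.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks, last_block_mb)  # 3 44 -> two full 128 MB blocks plus a 44 MB block
```

Note that the last block occupies only as much space as it needs (44 MB here), and each block is replicated across the cluster for fault tolerance.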
FAQ2: Why Hadoop?
Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (more nodes can be added on the fly), fault tolerant (even if a node goes down, its data can be processed by another node), and open source (the source code can be modified if required).
The following characteristics of Hadoop make it a unique platform: