FAQ1: What is Spark?
Spark is an open-source big data framework. Its expressive APIs let big data professionals run both streaming and batch workloads efficiently. It provides a faster and more general-purpose data processing engine, designed primarily for fast computation. It was developed at UC Berkeley in 2009 and is now an Apache project, also known by the tagline "lightning-fast cluster computing".
It distributes data stored in a file system across the cluster and processes that data in parallel. It covers a wide range of workloads, including batch applications, iterative algorithms, interactive queries and streaming. Applications can be written in Java, Python or Scala.
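The idea of splitting data into pieces and processing the pieces in parallel can be illustrated without Spark at all. The following is a toy sketch in plain Python, not Spark code: the helper names `split_into_partitions`, `count_words` and `parallel_word_count` are hypothetical, and local threads stand in for cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Count words within a single partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def parallel_word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # Each partition gets its own worker; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Merge the per-partition results into one final answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["spark is fast", "spark is general", "hadoop is batch"]
word_counts = parallel_word_count(lines)
print(word_counts["spark"], word_counts["is"])  # 2 3
```

Each worker sees only its own partition, and the partial results are merged at the end. A real cluster follows the same split, process and merge pattern, but across machines rather than threads.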
It was developed to overcome the limitations of the MapReduce cluster computing paradigm. Spark keeps intermediate data in memory, whereas MapReduce writes it to disk and reads it back between steps. Spark can also cache data in memory, which benefits iterative algorithms such as those used in machine learning, because they read the same data many times.
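Why caching helps iterative algorithms can be shown with a small plain-Python sketch. This is an illustration only, not Spark code: `SlowSource` is a hypothetical stand-in for a dataset on disk, and `iterative_mean` is a made-up example of an algorithm that passes over the same data in every iteration.

```python
class SlowSource:
    """Pretends to be a dataset on disk and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1  # every call models one full pass over the disk
        return list(self._values)


def iterative_mean(source, iterations, cache):
    """Estimate the mean by gradient descent, using the data in every iteration."""
    cached = source.load() if cache else None
    estimate = 0.0
    for _ in range(iterations):
        # Without caching, the data is reloaded from the source each time.
        data = cached if cache else source.load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


uncached = SlowSource([1.0, 2.0, 3.0, 4.0])
cached = SlowSource([1.0, 2.0, 3.0, 4.0])
result_uncached = iterative_mean(uncached, iterations=20, cache=False)
result_cached = iterative_mean(cached, iterations=20, cache=True)
print(uncached.reads, cached.reads)  # 20 1
```

Both runs compute the same estimate, but the cached run touches the slow source once instead of once per iteration.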
Spark is easier to develop for because its API offers high-level operations on data. It supports SQL queries, streaming data and graph data processing. Spark doesn’t need Hadoop to run; it can run on its own and read from and write to other storage systems such as Cassandra and S3. In terms of speed, Spark is claimed to run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.
FAQ2: Why Apache Spark?
Before Spark, there were already many cluster computing tools, for example Hadoop MapReduce, Apache Storm, Apache Impala, Apache Giraph and many more. But each one is limited in what it can do:
1. Hadoop MapReduce supports only batch processing.
2. For stream processing, we need Apache Storm or S4.
3. For interactive processing, we need Apache Impala or Apache Tez.
4. For graph processing, we need Neo4j or Apache Giraph.
Therefore, no single engine could perform all of these tasks together. Hence there was a big demand for a powerful engine that can process data in real time (streaming) as well as in batch mode, respond within sub-second latency, and perform in-memory processing.
This is where Apache Spark comes into the picture. It is a powerful open-source engine that offers interactive processing, real-time stream processing, graph processing, in-memory processing and batch processing, combining high speed, ease of use and a standard interface.
Spark Core: Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. It also provides many APIs for building and manipulating these RDDs.
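The style of programming that RDDs encourage, building new datasets from existing ones through operations such as map and filter, can be mimicked in a few lines of plain Python. This is a toy model, not Spark's API: `ToyRDD` and its methods are hypothetical names chosen for illustration, and real RDDs are partitioned across a cluster rather than held in a single list.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's real class)."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input data
        self._lineage = lineage  # recorded transformations, applied on demand

    def map(self, fn):
        # Return a new dataset description; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        """Replay the recorded transformations over the source data."""
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [4, 16]
```

Because each `ToyRDD` only records how it was derived from its source, the result can be recomputed at any time by replaying those steps. This loosely mirrors how a record of transformations can support fault recovery.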
Spark SQL: Spark SQL provides an interface for working with structured data. It allows querying in SQL as well as in the Apache Hive variant of SQL (HQL), and it supports many data sources.
Spark Streaming: This is the Spark component that enables processing of live streams of data.
MLlib: Spark comes with a library of common machine learning functionality called MLlib.
GraphX: GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.