What is Spark?
Spark is a cluster computing framework designed to be fast and general purpose. It is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming.
Spark is also designed to be highly accessible, offering simple APIs in Scala, Python, Java, and SQL, along with rich built-in libraries.
Spark can run on Hadoop clusters and is capable of accessing diverse data sources including HDFS, HBase, Cassandra, MongoDB, and others.
Explain key features of Spark.
- Spark allows integration with Hadoop and with files stored in HDFS.
- Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter.
- Spark consists of RDD’s (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytics tools that are used for interactive query analysis, real-time analysis and graph processing.
Difference between MapReduce and Spark.
Properties | MapReduce | Spark
Data storage (caching) | Hard disk | In-memory
Processing speed | Good | Excellent (up to 100x faster)
Interactive jobs performance | Average | Excellent
Hadoop independence | No | Yes
Machine learning applications | Average | Excellent
Usage | Batch processing | Real-time processing
Written in | Java | Scala
What is Spark Core?
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
(Figure: the Spark stack)
Cluster Managers in Spark.
Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver. The Spark framework supports three major types of cluster managers:
• Standalone Scheduler: a basic cluster manager included with Spark for setting up a cluster
• Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other applications
• YARN: responsible for resource management in Hadoop
Core Spark Concepts
Every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application’s main function and defines distributed datasets on the cluster, then applies operations to them.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster. To run the operations, the driver program typically manages a number of worker processes called executors, and Spark connects to the cluster to analyse data in parallel.
What does a Spark Engine do?
Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.
What is SparkContext?
A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
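For illustration, a minimal Scala sketch of creating a SparkContext (the application name and master URL below are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
// Placeholder app name and master URL; in a real deployment the master would
// point at a standalone, Mesos, or YARN cluster instead of local mode.
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)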
What is RDD (Resilient distributed dataset)?
An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
How to create RDDs?
Spark provides two ways to create RDDs:
· Loading an external dataset
· Parallelizing a collection in a driver program
One way to create RDDs is to load data from external storage.
Eg. val lines = sc.textFile("/path/to/README.md")
Another way to create RDDs is to take an existing collection in a program and pass it to SparkContext’s parallelize() method.
Eg. val lines = sc.parallelize(List("Spark", "It is very fast"))
What are RDD operations?
RDDs support two types of operations:
· Transformations: construct a new RDD from a previous one.
· Actions: compute a result based on an RDD.
What are transformation operators?
Transformations are operations on RDDs that return a new RDD. Transformations are lazily evaluated, which means Spark will not begin to execute them until it sees an action. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces a new RDD, while filter() returns a new RDD containing only the elements that pass the given condition.
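For illustration, a small sketch of map() and filter() (the input data below is made up):
val inputRDD = sc.parallelize(List("error: disk full", "info: ok", "error: timeout"))
val lengths = inputRDD.map(line => line.length)              // transformation: new RDD of line lengths
val errors = inputRDD.filter(line => line.contains("error")) // transformation: new RDD of matching lines
// Nothing has executed yet; transformations stay lazy until an action is called.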
What are action operators?
Actions are the operators that return a final value to the driver program or write data to an external storage system. Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.
reduce() is an action that repeatedly applies the function passed to it until only one value is left. The take(n) action returns n elements from the RDD to the local node.
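For illustration, a small sketch of reduce() and take() (the numbers below are made up):
val nums = sc.parallelize(List(1, 2, 3, 4))
val sum = nums.reduce((a, b) => a + b) // action: combines elements pairwise and returns 10 to the driver
val firstTwo = nums.take(2)            // action: returns the first 2 elements to the driver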
What is RDD Lineage graph?
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. Spark does not support data replication in memory, so if any data is lost it is rebuilt using the RDD lineage: lost partitions are recomputed from the datasets they were derived from. The key point is that an RDD always remembers how it was built from other datasets.
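For illustration, Spark can print the lineage it tracks for an RDD via toDebugString (the file path below is a placeholder):
val lines = sc.textFile("/path/to/README.md")   // placeholder path
val errors = lines.filter(_.contains("error"))
val pairs = errors.map(line => (line, 1))
println(pairs.toDebugString)   // prints the chain of dependencies, i.e. the lineage graph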
Define Partitions.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data so that processing can be sped up. Every RDD in Spark is partitioned.
Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key.
Eg. rdd.partitionBy(100)
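Note that in the Scala API, partitionBy() takes a Partitioner object rather than a plain number. A minimal sketch with a made-up pair RDD:
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(100)).persist() // 100 hash partitions, cached for reuse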
What is Spark Driver?
“Spark Driver” is the program that runs on the master node of the machine and declares transformations and actions on RDDs of data. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.
What is Spark Executor?
When a SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from the SparkContext are transferred to executors for execution.
What is worker node?
A worker node refers to any node that can run application code in the cluster.
What is Hive on Spark?
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine:
set hive.execution.engine=spark;
The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plan from the semantic analyser is translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed on the Spark cluster.
What are Spark’s Ecosystems?
• Spark SQL for working with structured data
• Spark Streaming for processing live data streams
• GraphX for generating and computing graphs
• MLlib (machine learning algorithms)
• SparkR to promote R programming in the Spark engine
What is Spark SQL?
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as via the Hive variant of SQL, called Hive Query Language (HQL). It supports many sources of data, including Hive tables, Parquet, and JSON.
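For illustration, a minimal sketch assuming a SparkSession named spark (older versions expose a SQLContext/HiveContext instead) and a placeholder JSON path:
val people = spark.read.json("/path/to/people.json")   // placeholder path
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()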
What is Spark Streaming?
Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include log files generated by production web servers and queues of messages containing status updates posted by users of a web service.
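For illustration, a minimal word-count sketch over a socket stream (the host and port below are placeholders):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))       // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder host and port
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()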
What is MLlib?
Spark comes with a
library containing common machine learning (ML) functionality, called MLlib.
MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering and collaborative filtering, as well as
supporting functionality such as model evaluation and data import.
What is GraphX?
GraphX is a library for manipulating graphs (e.g. a social network’s friend graph) and performing graph-parallel computations. GraphX also provides various operators for manipulating graphs (e.g. subgraph and mapVertices) and a library of common graph algorithms (e.g. PageRank and triangle counting).
What is PageRank?
PageRank is an iterative algorithm that can be used to rank web pages and involves many joins. It is one of the graph algorithms built into GraphX: PageRank measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.
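For illustration, a minimal GraphX sketch (the edge-list file path below is a placeholder):
import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "/path/to/followers.txt") // placeholder path
val ranks = graph.pageRank(0.0001).vertices                        // iterate until ranks converge within the tolerance
ranks.take(5).foreach(println)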
What is Yarn?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
· Hitting the web service several times by using multiple clusters.
· Running everything on the local node instead of distributing it.
Developers need to be careful with these points, as Spark makes use of memory for processing.
What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
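For illustration, a minimal sketch of the difference (the data is made up):
import org.apache.spark.storage.StorageLevel
val a = sc.parallelize(1 to 1000)
a.cache()                               // equivalent to persist(StorageLevel.MEMORY_ONLY)
val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel.MEMORY_AND_DISK) // user-chosen storage level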
What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.
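For illustration, a minimal sketch assuming a SparkSession named spark and placeholder paths:
val df = spark.read.json("/path/to/people.json")              // placeholder input
df.write.parquet("/path/to/people.parquet")                   // write in columnar Parquet format
val parquetDF = spark.read.parquet("/path/to/people.parquet")
parquetDF.select("name").show()                               // only the required column is read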
What is the advantage of a Parquet file?
Parquet is a columnar format file that helps to:
· Limit I/O operations
· Consume less space
· Fetch only the required columns
What are the various data sources available in Spark SQL?
· Parquet files
· JSON datasets
· Hive tables
How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger clean-ups by setting the parameter ‘spark.cleaner.ttl’, or by dividing long-running jobs into different batches and writing the intermediary results to disk.
What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster?
Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon
What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
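For illustration, a small sketch of reduceByKey() and join() on made-up pair RDDs:
val sales = sc.parallelize(List(("apple", 2), ("banana", 1), ("apple", 3)))
val prices = sc.parallelize(List(("apple", 0.5), ("banana", 0.25)))
val totals = sales.reduceByKey(_ + _)  // combine values that share a key: ("apple", 5), ("banana", 1)
val joined = totals.join(prices)       // (key, (total, price)) for keys present in both RDDs
joined.collect().foreach(println)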
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
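For illustration, a small sketch with made-up data:
val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(List(("b", 99)))
val result = rdd1.subtractByKey(rdd2)  // keeps only ("a", 1) and ("c", 3)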
What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
· MEMORY_ONLY
· MEMORY_ONLY_SER
· MEMORY_AND_DISK
· MEMORY_AND_DISK_SER
· DISK_ONLY
· OFF_HEAP
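For illustration, a minimal sketch of choosing a storage level (the file path below is a placeholder):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("/path/to/data.txt")     // placeholder path
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk when needed
rdd.count()                                    // first action materializes and caches the RDD
rdd.unpersist()                                // release the cached partitions when done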
How does Spark handle monitoring and logging in standalone mode?
Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
Does Apache Spark provide checkpointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
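For illustration, a minimal checkpointing sketch (the checkpoint directory is a placeholder and would normally sit on a reliable store such as HDFS):
sc.setCheckpointDir("/path/to/checkpoints")   // placeholder directory
val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()                              // truncates the lineage once the RDD is materialized
rdd.count()                                   // action that triggers the checkpoint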
How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.
How does Spark use Akka?
Spark uses Akka mainly for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.
How can you achieve high availability in Spark?
· Implementing single-node recovery with the local file system
· Using standby masters with Apache ZooKeeper
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage: an RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
What do you understand by SchemaRDD?
A SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
What are Shared Variables?
When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
What is an “Accumulator”?
“Accumulators” provide a simple syntax for aggregating values from worker nodes back to the driver program. Accumulators are Spark’s offline debuggers: similar to “Hadoop Counters”, accumulators provide the number of “events” in a program.
Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections. aggregateByKey() and combineByKey() use accumulators.
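For illustration, a minimal sketch counting empty lines with a numeric accumulator (newer Spark versions expose sc.longAccumulator; older ones use sc.accumulator(0)):
val emptyLines = sc.longAccumulator("emptyLines")
val lines = sc.parallelize(List("ok", "", "ok", ""))
lines.foreach(line => if (line.isEmpty) emptyLines.add(1)) // workers add to the accumulator
println(emptyLines.value)                                  // the driver reads the aggregated value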
What are “Broadcast variables”?
“Broadcast variables” allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
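For illustration, a minimal sketch shipping a small lookup table to every executor once (the data is made up):
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
val codes = sc.parallelize(List("IN", "US", "IN"))
val names = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
names.collect().foreach(println)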
What is sbt (simple build tool) in Spark?
sbt is a newer build tool most often used for Scala projects. sbt assumes a project layout similar to Maven’s. sbt build files are written in a configuration language where we assign values to specific keys in order to define the build for our project.
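For illustration, a minimal build.sbt sketch for a Spark project (the project name and version numbers are illustrative):
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"
// "provided" keeps Spark itself out of the packaged jar, since the cluster supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"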