What is Spark?
Spark is a cluster computing framework designed to be fast and general purpose. It is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming.
Spark is also designed to be highly accessible, offering simple APIs in Scala, Python, Java, and SQL, along with rich built-in libraries.
Spark can run on Hadoop clusters and is capable of accessing diverse data sources including HDFS, HBase, Cassandra, MongoDB, and others.
Explain key features of Spark.
- Spark allows integration with Hadoop and with files stored in HDFS.
- Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter.
- Spark consists of RDD’s (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytics tools that are used for interactive query analysis, real-time analysis and graph processing.
Difference between MapReduce and Spark.
Properties | MapReduce | Spark
Data storage (caching) | Hard disk | In-memory
Processing speed | Good | Excellent (up to 100x faster)
Interactive jobs performance | Average | Excellent
Hadoop independence | No | Yes
Machine learning applications | Average | Excellent
Usage | Batch processing | Real-time processing
Written in | Java | Scala
What is Spark Core?
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
(Figure: the Spark stack)
Cluster Managers in Spark.
Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver. The Spark framework supports three major types of cluster managers:
• Standalone Scheduler: a basic cluster manager included with Spark for setting up a cluster
• Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other applications
• YARN: responsible for resource management in Hadoop
Core Spark Concepts
Every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application’s main function and defines distributed datasets on the cluster, then applies operations to them.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster. To run the operations, the driver program typically manages a number of worker processes called executors, and Spark connects to the cluster to analyse data in parallel.
What does a Spark Engine do?
Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.
What is SparkContext?
A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
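For illustration, a minimal Scala sketch of creating a SparkContext (the application name and master URL below are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
// Placeholder app name and master URL; in a real deployment the master would
// point at a standalone, Mesos, or YARN cluster instead of local mode.
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)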
What is RDD (Resilient distributed dataset)?
An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
How to create RDDs?
Spark provides two ways to create RDDs:
· Loading an external dataset
· Parallelizing a collection in a driver program
One way to create RDDs is to load data from external storage.
Eg. val lines = sc.textFile("/path/to/README.md")
Another way to create RDDs is to take an existing collection in a program and pass it to SparkContext’s parallelize() method.
Eg. val lines = sc.parallelize(List("Spark", "It is very fast"))
What are RDD operations?
RDDs support two types of operations:
· Transformations: construct a new RDD from a previous one.
· Actions: compute a result based on an RDD.
What are transformation operators?
Transformations are operations on RDDs that return a new RDD. Transformations are lazily evaluated, which means Spark will not begin to execute them until it sees an action. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces a new RDD, while filter() returns a new RDD containing only the elements that pass the given condition.
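For illustration, a small sketch of map() and filter() (the input data below is made up):
val inputRDD = sc.parallelize(List("error: disk full", "info: ok", "error: timeout"))
val lengths = inputRDD.map(line => line.length)              // transformation: new RDD of line lengths
val errors = inputRDD.filter(line => line.contains("error")) // transformation: new RDD of matching lines
// Nothing has executed yet; transformations stay lazy until an action is called.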
What are action operators?
Actions are the operators that return a final value to the driver program or write data to an external storage system. Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.
reduce() is an action that repeatedly applies the function passed to it until only one value is left. The take(n) action returns n elements from the RDD to the local node.
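For illustration, a small sketch of reduce() and take() (the numbers below are made up):
val nums = sc.parallelize(List(1, 2, 3, 4))
val sum = nums.reduce((a, b) => a + b) // action: combines elements pairwise and returns 10 to the driver
val firstTwo = nums.take(2)            // action: returns the first 2 elements to the driver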
What is RDD Lineage graph?
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. Spark does not support data replication in memory, so if any data is lost it is rebuilt using the RDD lineage: lost partitions are recomputed from the datasets they were derived from. The key point is that an RDD always remembers how it was built from other datasets.
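For illustration, Spark can print the lineage it tracks for an RDD via toDebugString (the file path below is a placeholder):
val lines = sc.textFile("/path/to/README.md")   // placeholder path
val errors = lines.filter(_.contains("error"))
val pairs = errors.map(line => (line, 1))
println(pairs.toDebugString)   // prints the chain of dependencies, i.e. the lineage graph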
Define Partitions.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data so that processing can be sped up. Every RDD in Spark is partitioned.
Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key.
Eg. rdd.partitionBy(100)
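Note that in the Scala API, partitionBy() takes a Partitioner object rather than a plain number. A minimal sketch with a made-up pair RDD:
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(100)).persist() // 100 hash partitions, cached for reuse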
What is Spark Driver?
“Spark Driver” is the program that runs on the master node of the machine and declares transformations and actions on RDDs of data. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.
What is Spark Executor?
When a SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from the SparkContext are transferred to executors for execution.
What is worker node?
A worker node refers to any node that can run application code in the cluster.
What is Hive on Spark?
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine:
set hive.execution.engine=spark;
The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plan from the semantic analyser is translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed on the Spark cluster.
What are Spark’s Ecosystems?
• Spark SQL for working with structured data
• Spark Streaming for processing live data streams
• GraphX for generating and computing graphs
• MLlib (machine learning algorithms)
• SparkR to promote R programming in the Spark engine
What is Spark SQL?
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as via the Hive variant of SQL, called Hive Query Language (HQL). It supports many sources of data, including Hive tables, Parquet, and JSON.
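For illustration, a minimal sketch assuming a SparkSession named spark (older versions expose a SQLContext/HiveContext instead) and a placeholder JSON path:
val people = spark.read.json("/path/to/people.json")   // placeholder path
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()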
What is Spark Streaming?
Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include log files generated by production web servers and queues of messages containing status updates posted by users of a web service.
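For illustration, a minimal word-count sketch over a socket stream (the host and port below are placeholders):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))       // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder host and port
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()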
What is MLlib?
Spark comes with a
library containing common machine learning (ML) functionality, called MLlib.
MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering and collaborative filtering, as well as
supporting functionality such as model evaluation and data import.
What is GraphX?
GraphX is a library for manipulating graphs (e.g. a social network’s friend graph) and performing graph-parallel computations. GraphX also provides various operators for manipulating graphs (e.g. subgraph and mapVertices) and a library of common graph algorithms (e.g. PageRank and triangle counting).
What is PageRank?
PageRank is an iterative algorithm that can be used to rank web pages and involves many joins. It is one of the graph algorithms built into GraphX: PageRank measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.
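For illustration, a minimal GraphX sketch (the edge-list file path below is a placeholder):
import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "/path/to/followers.txt") // placeholder path
val ranks = graph.pageRank(0.0001).vertices                        // iterate until ranks converge within the tolerance
ranks.take(5).foreach(println)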
What is Yarn?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
· Hitting the web service several times by using multiple clusters.
· Running everything on the local node instead of distributing it.
Developers need to be careful with these points, as Spark makes use of memory for processing.
What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
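For illustration, a minimal sketch of the difference (the data is made up):
import org.apache.spark.storage.StorageLevel
val a = sc.parallelize(1 to 1000)
a.cache()                               // equivalent to persist(StorageLevel.MEMORY_ONLY)
val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel.MEMORY_AND_DISK) // user-chosen storage level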
What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.
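For illustration, a minimal sketch assuming a SparkSession named spark and placeholder paths:
val df = spark.read.json("/path/to/people.json")              // placeholder input
df.write.parquet("/path/to/people.parquet")                   // write in columnar Parquet format
val parquetDF = spark.read.parquet("/path/to/people.parquet")
parquetDF.select("name").show()                               // only the required column is read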
What is the advantage of a Parquet file?
Parquet is a columnar format file that helps to:
· Limit I/O operations
· Consume less space
· Fetch only the required columns
What are the various data sources available in Spark SQL?
· Parquet files
· JSON datasets
· Hive tables
How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger clean-ups by setting the parameter ‘spark.cleaner.ttl’, or by dividing long-running jobs into different batches and writing the intermediary results to disk.
What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
When running Spark applications, is it necessary to install Spark on all the nodes of a YARN cluster?
Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon
What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
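For illustration, a small sketch of reduceByKey() and join() on made-up pair RDDs:
val sales = sc.parallelize(List(("apple", 2), ("banana", 1), ("apple", 3)))
val prices = sc.parallelize(List(("apple", 0.5), ("banana", 0.25)))
val totals = sales.reduceByKey(_ + _)  // combine values that share a key: ("apple", 5), ("banana", 1)
val joined = totals.join(prices)       // (key, (total, price)) for keys present in both RDDs
joined.collect().foreach(println)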
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
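For illustration, a small sketch with made-up data:
val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(List(("b", 99)))
val result = rdd1.subtractByKey(rdd2)  // keeps only ("a", 1) and ("c", 3)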
What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
· MEMORY_ONLY
· MEMORY_ONLY_SER
· MEMORY_AND_DISK
· MEMORY_AND_DISK_SER
· DISK_ONLY
· OFF_HEAP
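For illustration, a minimal sketch of choosing a storage level (the file path below is a placeholder):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("/path/to/data.txt")     // placeholder path
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk when needed
rdd.count()                                    // first action materializes and caches the RDD
rdd.unpersist()                                // release the cached partitions when done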
How does Spark handle monitoring and logging in standalone mode?
Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
Does Apache Spark provide checkpointing?
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
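For illustration, a minimal checkpointing sketch (the checkpoint directory is a placeholder and would normally sit on a reliable store such as HDFS):
sc.setCheckpointDir("/path/to/checkpoints")   // placeholder directory
val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()                              // truncates the lineage once the RDD is materialized
rdd.count()                                   // action that triggers the checkpoint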
How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.
How does Spark use Akka?
Spark uses Akka mainly for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the master.
How can you achieve high availability in Spark?
· Implementing single-node recovery with the local file system
· Using standby masters with Apache ZooKeeper
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage: an RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
What do you understand by SchemaRDD?
A SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
What are Shared Variables?
When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
What is an “Accumulator”?
“Accumulators” provide a simple syntax for aggregating values from worker nodes back to the driver program. Accumulators are Spark’s offline debuggers: similar to “Hadoop Counters”, accumulators provide the number of “events” in a program.
Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections. aggregateByKey() and combineByKey() use accumulators.
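For illustration, a minimal sketch counting empty lines with a numeric accumulator (newer Spark versions expose sc.longAccumulator; older ones use sc.accumulator(0)):
val emptyLines = sc.longAccumulator("emptyLines")
val lines = sc.parallelize(List("ok", "", "ok", ""))
lines.foreach(line => if (line.isEmpty) emptyLines.add(1)) // workers add to the accumulator
println(emptyLines.value)                                  // the driver reads the aggregated value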
What are “Broadcast variables”?
“Broadcast variables” allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
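For illustration, a minimal sketch shipping a small lookup table to every executor once (the data is made up):
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
val codes = sc.parallelize(List("IN", "US", "IN"))
val names = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
names.collect().foreach(println)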
What is sbt (simple build tool) in Spark?
sbt is a newer build tool most often used for Scala projects. sbt assumes a project layout similar to Maven’s. sbt build files are written in a configuration language where we assign values to specific keys in order to define the build for our project.
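For illustration, a minimal build.sbt sketch for a Spark project (the project name and version numbers are illustrative):
name := "my-spark-app"
version := "0.1.0"
scalaVersion := "2.12.18"
// "provided" keeps Spark itself out of the packaged jar, since the cluster supplies it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"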