Friday, July 31, 2020

Hive Partitioning Concept and Data Storage in Hadoop

This article explains what Hive partitioning is, why it is needed, and how it improves performance. Partitioning is an optimization technique in Hive that significantly improves query efficiency. Apache Hive is a data warehouse built on top of Hadoop that enables ad-hoc analysis over structured and semi-structured data. In this article we will go into depth about partitioning in Apache Hive: what Hive partitions are, how partitioning works, and the types of partitioning it supports. But first, let's think about how data is stored in Hadoop.
Data storage in Hadoop Distributed file system
In Hive, a table's data is stored as files inside a Hadoop Distributed File System (HDFS) directory. Hive is a natural choice for running queries over large datasets, especially queries that require full table scans. In many cases, however, users only need to filter the data on specific columns. For these cases Hive offers partitioning, which prunes the data that is not needed during the query and can greatly reduce query times.
• Hive users can use the partitioning feature to identify the columns by which the data is subdivided and organized.
• With partitioning, work is carried out only on the relevant subset of the data, which substantially improves the performance of Hive queries.
You'll read more about the partitioning feature in the sections below. The diagram below shows data storage in a single Hadoop Distributed File System (HDFS) directory.
[Figure: Data storage in a single Hadoop Distributed File System (HDFS) directory]
What is Partition in Hive?
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, and department. Each table in Hive can have one or more partition keys that identify a particular partition. Using partitions, it is quick to run queries on slices of the data in Hadoop.
Importance of Hive Partitioning in Hadoop 
Today an enormous amount of data, in the range of petabytes, is stored in HDFS, and it is very hard for Hadoop users to query such a massive volume of data efficiently.
Hive was introduced to lower this data-querying burden. Apache Hive converts SQL queries into MapReduce jobs and then submits them to the Hadoop cluster. When we submit a SQL query, Hive reads the entire data set, so running MapReduce jobs over a very wide table is inefficient. Creating partitions in tables solves this. Apache Hive makes partitioning very simple: you declare the partitioning scheme at table-creation time, and Hive creates the partitions automatically as data is loaded.
With partitioning, all of the table data is divided into multiple partitions. Each partition corresponds to a particular value (or values) of the partition column(s) and is stored as a sub-directory inside the table's directory in HDFS. When the table is queried for a particular partition value, only the matching partition is read rather than the whole table. This reduces the I/O time needed for the query and therefore increases query speed.
Create Partitions in Hive
Now let's understand data partitioning in Hive with an example. Consider a table named Tab1 that contains client details such as id, name, department (dept), and year of joining (yoj). Suppose we need to retrieve the details of all clients who joined in 2012. Without partitioning, the query scans the whole table for the required information. But if we partition the client data by year and store it in a separate file, the query processing time is reduced. The example below shows how a file and its data are partitioned. The file named file1 contains the client data table:
Tab1/clientdata/file1

id, name, dept, yoj
1, balajee, SC, 2009
2, prashanth, HR, 2009
3, narayana, SC, 2010
Only the data of the specified partition is read when we retrieve data from a partitioned table. Creating a partitioned table looks like this:
CREATE TABLE tab1 (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);
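As a quick illustration that is not part of the original example, the same table can also be created from Spark by running the HiveQL through a Hive-enabled SparkSession; the application name is arbitrary and the DDL matches the statement above. A minimal sketch:

import org.apache.spark.sql.SparkSession

object HivePartitionExample {
  def main(args: Array[String]): Unit = {
    // A SparkSession with Hive support can run HiveQL against the Hive metastore.
    val spark = SparkSession.builder()
      .appName("HivePartitionExample")
      .enableHiveSupport()
      .getOrCreate()

    // Same DDL as above: regular data columns plus a separate partition column "year".
    spark.sql(
      """CREATE TABLE IF NOT EXISTS tab1 (id INT, name STRING, dept STRING, yoj INT)
        |PARTITIONED BY (year STRING)""".stripMargin)

    spark.stop()
  }
}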
Types of Hive Partitions
Until now we have discussed how to create Hive partitions. Now let's look at the types of data partitioning in Hive. Apache Hive supports two types of partitioning:
• Static Partitioning 
• Dynamic Partitioning
Let's address these types of Hive partitioning one by one.
Hive Static Partitioning
  • In static partitioning, you insert data files into a partitioned table individually, one partition at a time.
  • Static partitioning is typically preferred when loading big files (whole directories) into Hive tables.
  • Static partitioning saves loading time compared with dynamic partitioning.
  • You add a partition to the table "statically" and move the file into that partition of the table.
  • Partitions can be altered (added, dropped, or modified) when using static partitioning.
  • You can take the partition column value from the file name, the date, etc., without reading the whole big file.
  • If you want to use static partitioning in Hive, set the property hive.mapred.mode = strict (for example, in hive-site.xml); static partitioning is the default in strict mode.
  • Use a WHERE clause when you need to limit the data loaded into a static partition.
  • You can perform static partitioning on both Hive managed tables and external tables (see the sketch after this list).
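A minimal sketch of a static partition load, assuming the Hive-enabled spark session from the earlier sketch; the HDFS path and the partition values here are hypothetical:

// Static partitioning: the partition value is named explicitly in each statement.
spark.sql(
  """LOAD DATA INPATH '/path/to/clientdata/file1'
    |INTO TABLE tab1 PARTITION (year = '2012')""".stripMargin)

// Partitions can also be added or dropped directly.
spark.sql("ALTER TABLE tab1 ADD IF NOT EXISTS PARTITION (year = '2013')")
spark.sql("ALTER TABLE tab1 DROP IF EXISTS PARTITION (year = '2013')")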
Hive Dynamic Partitioning
• In dynamic partitioning, the partitions are created with a single INSERT into the partitioned table.
• Dynamic partitioning typically loads data from an unpartitioned (staging) table.
• Dynamic partitioning takes more time to load data than static partitioning.
• When a large amount of data is stored in a table, dynamic partitioning is suitable.
• Dynamic partitioning is also suitable when you want to partition by one or more columns whose values are not known in advance.
• In dynamic partitioning there is no need for a WHERE clause to limit the data being loaded.
• You cannot perform ALTER on a dynamic partition.
• Dynamic partitioning can be performed on both Hive managed tables and external tables.
• To use dynamic partitioning in Hive, the partition mode must be set to non-strict (see the sketch after this list).
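A minimal sketch of a dynamic partition load, again assuming the Hive-enabled spark session from the earlier sketch; the staging table tab1_staging is hypothetical:

// Dynamic partitioning: Hive derives the partition value for each row from the
// last column of the SELECT, so the partition mode must be non-strict.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql(
  """INSERT INTO TABLE tab1 PARTITION (year)
    |SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
    |FROM tab1_staging""".stripMargin)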
Hive Partitioning: Advantages and Disadvantages
Let's look at some advantages and disadvantages of Apache Hive partitioning.
Hive Partitioning Advantages
• Hive partitioning distributes the execution load horizontally.
• Queries over a partition with a low data volume execute faster. For instance, searching for the population of Vatican City returns very quickly compared with searching the population of the entire world.
• There is no need to scan the entire table to find a single record; only the relevant partition is searched.
Hive Partitioning Drawbacks
• It is possible to create too many small partitions, which means too many directories in HDFS.
• Partitioning is effective when the data volume per partition is low, but some queries, such as a GROUP BY over a large data volume, still take a long time to execute. For example, grouping the population of China takes much longer than grouping the population of Vatican City.
So, that was all about Hive partitions. I hope you found the article useful.
Conclusion 
I hope this article helps you learn what Hive partitioning is, including Hive static partitioning and Hive dynamic partitioning. I have also discussed the main advantages and disadvantages of partitioning in Hive. For more information on Hive partitioning and data storage, you can refer to a Big Data and Hadoop online training course.


Wednesday, July 29, 2020

Spark Interview Questions and Answers

What is Spark?
Spark is a cluster computing framework designed to be fast and general purpose.
Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming.
Spark is designed to be highly accessible, offering simple APIs in Scala, Python, Java, and SQL, and it has rich built-in libraries.
Spark can run on Hadoop clusters and is capable of accessing diverse data sources including HDFS, HBase, Cassandra, MongoDB, and others.

Explain key features of Spark.
  • Spark allows integration with Hadoop and files included in HDFS.
  • Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter.
  • Spark consists of RDD’s (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
  • Spark supports multiple analytics tools that are used for interactive query analysis, real-time analysis and graph processing.


Difference between MapReduce and Spark.
Properties                    | MapReduce        | Spark
Data Storage (Caching)        | Hard disk        | In-memory
Processing Speeds             | Good             | Excellent (up to 100x faster)
Interactive jobs performance  | Average          | Excellent
Hadoop Independency           | No               | Yes
Machine learning applications | Average          | Excellent
Usage                         | Batch processing | Real-time processing
Written in                    | Java             | Scala
What is Spark Core?
Spark Core contains the basic functionality of Spark, which includes components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
Spark Core acts as home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
[Figure: The Spark stack]



Cluster Managers in Spark.
Spark depends on a cluster manager to launch executors and in certain cases, to launch the driver.
The Spark framework supports three major types of Cluster Managers:
     • Standalone Scheduler: a basic cluster manager to set up a cluster
     • Mesos: It is a generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications
     • Yarn: It is responsible for resource management in Hadoop
Core Spark Concepts
Every Spark application consists of a driver program that launches various parallel operations on a cluster.
The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations to them.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
To run the operations, the driver program typically manages a number of worker processes called executors.
Spark connects to the cluster to analyse data in parallel.
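A minimal, self-contained sketch of these concepts (the application name and the data are only illustrative): the driver creates a SparkContext, defines a distributed dataset, and applies parallel operations that the executors carry out.

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The driver builds a SparkContext, the connection to the computing cluster.
    val conf = new SparkConf().setAppName("DriverExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The driver defines a distributed dataset and applies operations to it;
    // the actual computation runs in executors on the worker nodes.
    val numbers = sc.parallelize(1 to 1000)
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}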
What does a Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
What is SparkContext?
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
What is RDD (Resilient distributed dataset)?
An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.  
How to create RDDs?
Spark provides two ways to create RDDs:
     ·        Loading an external dataset
     ·        Parallelizing a collection in a driver program
One way to create RDDs is to load data from external storage.
             Eg. val lines = sc.textFile("/path/to/README.md")
Another way to create RDDs is to take an existing collection in a program and pass it to SparkContext’s parallelize() method.
            Eg. val lines = sc.parallelize(List("Spark", "It is very fast"))
What are RDD operations?
RDDs support two types of operations:
     ·        Transformations: construct a new RDD from a previous one.
     ·        Actions: compute a result based on an RDD.
What are Transformations operators?
Transformations are operations on RDDs that return a new RDD. Transformations are lazily evaluated, which means Spark will not begin to execute them until it sees an action. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces a new RDD, while filter() produces a new RDD containing only the elements that satisfy the given predicate.
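A small sketch of transformations, assuming a SparkContext named sc such as the one created in the driver sketch earlier; the data is illustrative:

// Transformations are lazy: nothing is computed until an action is called.
val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
val squares = numbers.map(n => n * n)         // transformation: builds a new RDD
val evens   = squares.filter(n => n % 2 == 0) // transformation: another new RDD, still not evaluated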
What are Actions operators?
Actions are the operators that return a final value to the driver program or write data to an external storage system. Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.
reduce() is an action that applies the function passed to it repeatedly until only one value is left. take(n) is an action that fetches n elements from the RDD to the local (driver) node.
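Continuing the small sketch above, reduce() and take() are actions that force evaluation and return values to the driver:

// Actions trigger evaluation of the pending transformations.
val sumOfEvenSquares = evens.reduce((a, b) => a + b) // 4 + 16 = 20
val firstTwo         = evens.take(2)                 // Array(4, 16), returned to the driver
println(s"sum = $sumOfEvenSquares, first two = ${firstTwo.mkString(", ")}")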
What is RDD Lineage graph?
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. Spark does not replicate data in memory, so if any data is lost it is rebuilt using the RDD lineage, which records how lost data partitions can be reconstructed. The best part is that an RDD always remembers how it was built from other datasets.
Define Partitions.
As the name suggests, a partition is a smaller and logical division of data, similar to a 'split' in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up processing. Everything in Spark is a partitioned RDD.
Spark’s partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key.
Eg. rdd.partitionBy(100)
What is Spark Driver?
“Spark Driver” is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, which connects to a given Spark master.
The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.
What is Spark Executor?
When the SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from the SparkContext are transferred to the executors for execution.
What is worker node?
Worker node refers to any node that can run the application code in a cluster.
What is Hive on Spark?

Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine.
set hive.execution.engine=spark;
The main task in implementing the Spark execution engine for Hive lies in query planning, where Hive operator plans from the semantic analyser are translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed in the Spark cluster.
What are Spark’s Ecosystems?
• Spark SQL for working with structured data
• Spark Streaming for processing live data streams
• GraphX for generating and computing graphs
• MLlib (Machine Learning Algorithms)
• SparkR for using R programming on the Spark engine.
What is Spark SQL?
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Hive variant of SQL, called Hive Query Language (HQL). It supports many sources of data, including Hive tables, Parquet, and JSON.
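A minimal Spark SQL sketch; the file people.json and its name and age columns are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()

// Load a JSON data source into a DataFrame and query it with SQL.
val people = spark.read.json("/path/to/people.json") // hypothetical file
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()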
What is Spark Streaming?
Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
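A minimal word-count sketch over a socket stream; the host, port, and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

// Count words arriving on a TCP socket (e.g. one fed by `nc -lk 9999`).
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()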
 What is MLlib?
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
What is GraphX?
GraphX is a library for manipulating graphs (e.g. a social network’s friend graph) and performing graph-parallel computations. GraphX also provides various operators for manipulating graphs (e.g. subgraph and mapVertices) and a library of common graph algorithms (e.g. PageRank and triangle counting).
What is PageRank?
PageRank is an iterative algorithm, available in GraphX, that can be used to rank web pages; it performs many joins. PageRank measures the importance of each vertex in a graph: an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank highly on the platform.
What is Yarn?
Similar to Hadoop, Yarn is one of the key features in Spark, providing a central resource-management platform to deliver scalable operations across the cluster. Running Spark on Yarn requires a binary distribution of Spark built with Yarn support.
What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
     ·         hitting the web service several times by using multiple clusters;
     ·         running everything on the local node instead of distributing it;
     ·         not being careful with memory usage, since Spark makes heavy use of memory for processing.
What is the difference between persist() and cache()?
persist() allows the user to specify the storage level whereas cache() uses the default storage level.
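A small sketch of the difference, assuming an existing SparkContext sc:

import org.apache.spark.storage.StorageLevel

val rddA = sc.parallelize(1 to 1000000)
val rddB = sc.parallelize(1 to 1000000)

rddA.cache()                               // shorthand for persist(StorageLevel.MEMORY_ONLY)
rddB.persist(StorageLevel.MEMORY_AND_DISK) // persist() lets you choose the storage level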
What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it to be one of the best big data analytics formats so far.
What is the advantage of a Parquet file?
Parquet is a columnar format file that helps to:
     ·         limit I/O operations
     ·         consume less storage space
     ·         fetch only the required columns.
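A small sketch of writing and reading Parquet, reusing the spark session and people.json input from the Spark SQL sketch above; the paths are hypothetical:

// Write a DataFrame as Parquet and read back only the columns that are needed.
val people = spark.read.json("/path/to/people.json") // hypothetical input
people.write.mode("overwrite").parquet("/path/to/people.parquet")

val names = spark.read.parquet("/path/to/people.parquet").select("name") // only this column is read
names.show()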
What are the various data sources available in SparkSQL?
     ·         Parquet file
     ·         JSON Datasets
     ·         Hive tables
How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark does not need to be installed on all nodes when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any changes to the cluster.
Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon
What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together based on elements having the same key.
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
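A small sketch of these pair RDD operations, assuming an existing SparkContext sc; the data is illustrative:

// Pair RDDs are RDDs of (key, value) tuples.
val sales   = sc.parallelize(List(("apples", 3), ("oranges", 2), ("apples", 4)))
val regions = sc.parallelize(List(("apples", "north"), ("oranges", "south")))
val banned  = sc.parallelize(List(("oranges", 1)))

val totals  = sales.reduceByKey(_ + _)     // ("apples", 7), ("oranges", 2)
val joined  = totals.join(regions)         // ("apples", (7, "north")), ("oranges", (2, "south"))
val allowed = totals.subtractByKey(banned) // removes keys that are present in `banned`

allowed.collect().foreach(println)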
What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are -
·     MEMORY_ONLY
·     MEMORY_ONLY_SER
·     MEMORY_AND_DISK
·     MEMORY_AND_DISK_SER, DISK_ONLY
·     OFF_HEAP
How Spark handles monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.
Does Apache Spark provide check pointing?
Lineage graphs are always useful for recovering RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without requiring any admin rights.
How Spark uses Akka?
Spark uses Akka basically for scheduling. After registering, all the workers request a task from the master, and the master simply assigns the task. Spark uses Akka for this messaging between the workers and the master.
How can you achieve high availability in Spark?
     ·         Implementing single node recovery with local file system
     ·         Using StandBy Masters with Apache ZooKeeper.
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs achieve fault tolerance through lineage: an RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, the lineage is used to rebuild only that particular lost partition.
What do you understand by SchemaRDD?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
What are Shared Variables?
When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
What is an “Accumulator”?
“Accumulators” provide a simple syntax for aggregating values from worker nodes back to the driver program. Accumulators are one of Spark’s offline debugging aids: similar to “Hadoop Counters”, they can count the number of “events” in a program.
Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections.
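A small sketch of counting events with an accumulator, assuming an existing SparkContext sc; the data is illustrative:

// Count blank lines seen by the tasks, aggregated back to the driver.
val lines      = sc.parallelize(List("spark", "", "hive", "", ""))
val blankLines = sc.longAccumulator("blankLines")

lines.foreach { line =>
  if (line.isEmpty) blankLines.add(1L) // each task adds to its own copy
}
println(s"Blank lines: ${blankLines.value}") // read on the driver: 3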
What are “Broadcast variables”?
“Broadcast variables” allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
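A small sketch of a broadcast variable, assuming an existing SparkContext sc; the lookup table is illustrative:

// Ship a read-only lookup table to every executor once, instead of with every task.
val countryNames   = Map("IN" -> "India", "US" -> "United States", "JP" -> "Japan")
val broadcastNames = sc.broadcast(countryNames)

val users     = sc.parallelize(List(("alice", "IN"), ("bob", "US")))
val withNames = users.map { case (user, code) =>
  (user, broadcastNames.value.getOrElse(code, "unknown"))
}
withNames.collect().foreach(println)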
What is sbt (simple build tool) in Spark?

sbt is a newer build tool most often used for Scala projects. sbt assumes a similar project layout to Maven. sbt build files are written in a configuration language where we assign values to specific keys in order to define the build for our project. 
