Tuesday, June 30, 2020

Hadoop 3: Comparison with Hadoop 2 and Spark

The release of Hadoop 3 in December 2017 marked the beginning of a new era for data science. The Hadoop framework is at the core of the entire Hadoop ecosystem, and various other libraries strongly depend on it.

In this article, we will discuss the major changes in Hadoop 3 when compared to Hadoop 2. We will also explain the differences between Hadoop and Apache Spark, and advise how to choose the best tool for your particular task.

To learn complete tutorials visit:online courses for big data and hadoop.

General information

Hadoop 2 and Hadoop 3 are data processing engines developed in Java and released in 2013 and 2017 respectively. Hadoop was created with the primary goal to maintain the data analysis from a disk, known as batch processing. Therefore, native Hadoop does not support the real-time analytics and interactivity.

Spark 2.X is a processing and analytics engine developed in Scala and released in 2016. The real-time analysis of the information was becoming crucial, as many giant internet services strongly relied on the ability to process data immediately. Consequently, Apache Spark was built for live data processing and is now popular because it can efficiently deal with live streams of information and process data in an interactive mode.

Both Hadoop and Spark are open source, Apache 2 licensed.

Level of abstraction and difficulty to learn and use

One of the major differences between these frameworks is the level of abstraction which is low for Hadoop and high for Spark. Therefore, Hadoop is more challenging to learn and use, as the developers must know how to code a lot of basic operations. Hadoop is only the core engine, so using advanced functionality requires plug-in of other components, which makes the system more complicated.

Unlike Hadoop, Apache Spark is a complete tool for data analytics. It has many useful built-in high-level functions that operate with the Resilient Distributed Dataset (RDD) — the core concept in Spark. This framework has many helpful libraries included in the cluster. For example, MLliballows using machine learning, Spark SQL can be used to perform SQL queries, etc.

Hardware and cost

Hadoop works with a disk, so it does not need a lot of RAM to operate. This can be cheaper than having large RAM. Hadoop 3 requires less disk space than Hadoop 2 due to changes in fault-tolerance providing system.

Spark needs a lot of RAM to operate in the in-memory mode so that the total cost can be more expensive than Hadoop.

Support of programming languages

Both versions of Hadoop support several programming languages using Hadoop Streaming, but the primary one is Java. Spark 2.X supports Scala, Java, Python, and R.

Speed

Generally, Hadoop is slower than Spark, as it works with a disk. Hadoop cannot cache the data in memory. Hadoop 3 can work up to 30% faster than Hadoop 2 due to the addition of native Java implementation of the map output collector to the MapReduce.

Spark can process the information in memory 100 times faster than Hadoop. If working with a disk, Spark is 10 times faster than Hadoop.

Security

Hadoop is considered to be more secure than Spark due to the usage of the Kerberos (computer network authentication protocol) and the support of the Access Control Lists (ACL). Spark in its turn provides authentication only with a shared secret password.

Fault tolerance

The fault tolerance in Hadoop 2 is provided by the replication technique where each block of information is copied to create 2 replicas. This means that instead of storing 1 piece of information, Hadoop 2 stores three times more. This raises the problem of wasting the disk space.

In Hadoop 3 the fault tolerance is provided by the erasure coding. This method allows recovering a block of information using the other block and the parity block. Hadoop 3 creates one parity block on every two blocks of data. This requires only 1,5 times more disk space compared with 3 times more with the replications in Hadoop 2. The level of fault tolerance in Hadoop 3 remains the same, but less disk space is required for its operations.

Spark can recover information by the recomputation of the DAG (Directed Acyclic Graph). DAG is formed by vertices and edges. Vertices represent RDDs, and edges represent the operations on the RDDs. In the situation, where some part of the data was lost, Spark can recover it by applying the sequence of operations to the RDDs. Note, that each time you will need to recompute RDD, you will need to wait until Spark performs all the necessary calculations. Spark also creates checkpoints to protect against failures.

YARN version

Hadoop 2 uses YARN version 1. YARN (Yet Another Resource Negotiator) is the resource manager. It manages the available resources (CPU, memory, disk). Besides, YARN performs Jobs Scheduling.

YARN was updated to version 2 in Hadoop 3. There are several significant changes improving usability and scalability. YARN 2 supports the flows — logical groups of YARN application and provides aggregating metrics at the level of flows. The separation between the collection processes (writing data) and the serving processes (reading data) improves the scalability. Also, YARN 2 uses Apache HBase as the primary backing storage.

Spark can operate independently, on a cluster with YARN, or with Mesos.

Number of NameNodes

Hadoop 2 supports single active NameNode and single standby NameNode for the entire Namespace while Hadoop 3 works with multiple standby NameNodes.

Spark runs Driver Program on the master node which manages SparkContext.

File systems

The main Hadoop 2 file system is HDFS — Hadoop Distributed File System. The framework is also compatible with several other file systems, Blob stores like Amazon S3 and Azure storage, as well as alternatively distributed file systems.

Hadoop 3 supports all the file systems, as Hadoop 2. In addition, Hadoop 3 is compatible with Microsoft Azure Data Lake and Aliyun Object Storage System.

Spark supports local file systems, Amazon S3 and HDFS.

For your convenience, we created a table that summarises all of the above information and presents a brief comparison of the key parameters of the two versions of Hadoop and Spark 2.X.

Conclusion

The major difference between Hadoop 3 and 2 is that the new version provides better optimization and usability, as well as certain architectural improvements.

Spark and Hadoop differ mainly in the level of abstraction. Hadoop was created as the engine for processing large amounts of existing data. It has a low level of abstraction that allows performing complex manipulations but can cause learning and managing difficulties. Spark is easier and faster, with a lot of convenient high-level tools and functions that can simplify your work. Spark operates on top of Hadoop and has many good libraries like Spark SQL or machine learning library MLlib. To summarize, if your work does not require special features, Spark can be the most reasonable choice.

To learn complete big data course visit:big data and hadoop course.

What is MapReduce in Big Data ?

MapReduce is pivotal to big data as it has allowed the processing of massive datasets, which the earlier preferred format for data storage and analysis, RBDMS was not capable of doing.

In terms of big data analytics, the open-source framework of Hadoop has been a game changer. It has enabled storage of large datasets going up to petabytes in a cluster format, and faster processing of this distributed data. An essential feature of Hadoop is its ability for parallel processing which is executed via MapReduce.

MapReduce is defined as the distributed data processing and querying programming engine that effectively splits and spreads around the necessary computation activities on a dataset across a wide range of servers which are known as data clusters. A query that needs to run through a mega data set may take hours if situated in one computer server. This is however cut down to minutes when done in parallel over a spread of servers.

To more information visit :big data online course Blog.

The term MapReduce refers to two critical tasks it handles on the Hadoop Distributed File System (HDFS) – the Map Job and the Reduce Job. The Map function takes the different input data elements available and processes them into an output data element, creating key value pairs. The Reduce function aggregates outputs created under the key value pairs, put them back together quickly and reliably in order to produce the required end-result.

Structurally, MapReduce has a single master Job Tracker and several slaves Task Tracker, one each per cluster. The master distributes and schedules the tasks to these slaves and keeps track of the assigned jobs, redoing any that fail. The slave tracker ensures that the assigned task is executed and communicates with the master

There are number of benefits of MapReduce that has made it an important element of Hadoop.

It allows developers to use any language like Java or C++ to write the applications, although Java is most preferred.
MapReduce can handle all forms of data whether structured or unstructured.
Another core feature of MapReduce is that it can easily run through petabytes of data stored in a data center due to its construct.
The MapReduce framework is also highly flexible in case of failures. If one dataset fails but is available in another machine, it can index and use the alternate location.

Today, there are several additional data processing engines like Pig or Hive that can be used to extract data from a Hadoop framework. These eliminate some of the complexity of MapReduce and make it easier to generate insights.

To learn big data hadoop course visit the following blog,big data and hadoop training.

Monday, June 29, 2020

Bacics of Apache spark in Big data Hadoop

In this lesson, we will discuss Bacics of Apache spark

Basics of Apache Spark Tutorial

Welcome to the tenth lesson ‘Basics of Apache Spark’ which is a part of ‘big data and hadoop online training’ offered by Online IT Guru.

In this lesson, you will learn about the basics of Spark, which is a component of the Hadoop ecosystem. Let us look at the objectives of this Spark Tutorial for beginners.

Objectives

After completing this lesson, you will be able to:

Describe the limitations of MapReduce in Hadoop
Compare batch and real-time analytics
Explain Spark, its architecture, and its advantages
Understand RRD Operations
Compare the Spark Hadoop ecosystem with the MapReduce ecosystem
Understand the functional programming in Spark

Let us now look at the history of Spark, and its needs, components, and advantages.

History of Spark

Spark as a data processing framework was developed at UC Berkeley’s AMPLab by Matei Zaharia in 2009.

In 2010, it became an open-source project under a Berkeley Software Distribution license. In the year 2013, the project was donated to the Apache Software Foundation, and the license was changed to Apache 2.0.

In February 2014, Spark became an Apache Top-Level Project. By November 2014, Spark was used by the engineering team at Databricks, a company founded by the creators of Apache Spark to set a world record in large-scale sorting.

Now Databricks provides commercial support and certification for taking the Spark programming test. At present, Spark exists as a next-generation real-time and batch processing framework.

You can check the Course Preview of big data hadoop course

Let’s try to understand what batch and real-time processing mean.

Batch vs. real-time processing

Batch processing and real-time processing differences are explained in the table given below.

Batch processing

Real-time processing

A large amount of data or transactions are processed in a single run over a period.
The associated jobs generally run entirely without any manual intervention.
The entire data is pre-selected and fed using command-line parameters and scripts.
In typical cases, batch processing is used to execute multiple operations, handle heavy data load, generate reports, and manage data workflow which is offline.
An example is to create daily or hourly reports for decision making.

It occurs instantaneously on data entry or command receipt.
It needs to execute within stringent response time constraints.
An example is fraud detection.

Let us look at the Limitations of MapReduce in Hadoop in the next section of this Apache Spark Tutorial.

Limitations of MapReduce in Hadoop

The need for Spark was created by the limitations of MapReduce, which is another data processing framework in Hadoop. Let’s see what these limitations are.

Unsuitable for real-time processing

MapReduce is suitable for batch processing, where data is processed as a periodic job. Thus, it takes time to process data and provide results if the data is high.

Depending on the amount of data and the number of nodes in the cluster to complete a job, it just takes minutes to process the data. However, it is not a good choice for real-time processing.

Unsuitable for trivial operations

MapReduce is also not suitable for writing trivial operations such as Filter and Join. To write such operations, you might need to rewrite the jobs using the MapReduce framework, which becomes complex because of the key-value pattern.

This pattern is required to be followed in reducer and mapper codes.

Unfit for large data on the network

MapReduce doesn’t work well with large data on a network. The reason is that it takes a lot of time to copy the data, which may cause bandwidth issues. It works on the data locality principle and hence works well on the node where the data resides.

Unsuitable with OLTP

MapReduce is also unsuitable for Online Transaction Processing or OLTP that includes a large number of short transactions. Since it works on the batch-oriented framework, it lacks latency of seconds or sub-seconds.

Unfit for processing graphs

MapReduce is unfit for processing graphs. Graphs represent the structures to explore relationships between various points, for example, finding common friends in Social Media like Facebook.

Hadoop has the Apache Giraph library for such cases; it runs on the top of MapReduce and adds to the complexity.

Unfit for iterative execution

Another important limitation is its unsuitability for iterative program execution. Some use cases, like K-means, need such execution where data needs to be processed again and again to refine results. MapReduce runs from the start every time as it is a state-less executor.

Introduction to Apache Spark

Spark is an open-source cluster-computing framework which addresses all the limitations of MapReduce.

It is suitable for real-time processing, trivial operations, and processing larger data on the network. It is also suitable for OLTP, graphs, and iterative execution.

As compared to the disk-based, two-stage MapReduce of Hadoop, Spark provides up to 100 times faster performance for a few applications with in-memory primitives.

Fast performance makes it suitable for machine learning algorithms as it allows programs to load data into the memory of a cluster and query the data constantly.

Let’s discuss the components of Spark.

Components of a Spark Project

A Spark project comprises various components such as:

Spark Core and Resilient Distributed Datasets or RDDs
Spark SQL
Spark Streaming
Machine Learning Library or MLlib
GraphX

Let us now talk about each component in detail.

Spark Core and RDDs

Spark Core and RDDs are the foundation of the entire Spark project. They provide basic Input/Output functionalities, distributed task dispatching, and scheduling.

RDDs are the basic programming abstraction and are a collection of data that is partitioned across machines logically. RDDs can be created by applying coarse-grained transformations on the existing RDDs or by referencing external datasets.

The examples of these transformations are reduced, join, filter, and map.

The abstraction of RDDs is exposed similarly as in-process and local collections through a language-integrated Application Programming Interface or API in Python, Java, and Scala.

As a result, the RDD abstraction, the complexity of programming is simplified, as the manner in which applications change RDDs is similar to changing local data collections.

Spark SQL

Spark SQL resides at the top of Spark Core. It introduces SchemaRDD, which is a new data abstraction and supports semi-structured and structured data.

SchemaRDD can be manipulated in any of the provided domain-specific such as Java, Scala, and Python by the Spark SQL. Spark SQL also supports SQL with Open Database Connectivity or Java Database Connectivity, commonly known as ODBC or JDBC server and command-line interfaces.

Spark Streaming

Spark Streaming leverages the fast scheduling capability of Spark Core for streaming analytics, ingesting data in small batches, and performing RDD transformations on them.

With this design, the same application code set written for batch analytics can be used on a single engine for streaming analytics.

Machine Learning Library

Machine Learning Library, also known as MLlib lies on top of Spark and is a distributed machine learning framework.

MLlib applies various common statistical and machine learning algorithms. With its memory-based architecture, it is nine times faster than the Apache Mahout Hadoop disk-based version.

Also, the library performs even better than Vowpal Wabbit or VW. The VW project is a fast out-of-core learning system sponsored by Microsoft.

GraphX

GraphX also lies on the top of Spark and is a distributed graph processing framework. For the computation of graphs, it provides an API and an optimized runtime for the Pregel abstraction.

Pregel is a system for large-scale graph processing. The API can also model the Pregel abstraction. We discussed earlier that Spark provides up to 100 times faster performance for a few applications with in-memory primitives.

In the next section, let’s discuss the application of in-memory processing using column-centric databases.

Application of In-Memory Processing

In column-centric databases, similar information can be stored together and hence data can be stored with more compression and efficiency.

The working of in-memory processing can be explained as follows:

It permits to store large data amounts in the same space, thereby reducing the amount of memory required for performing a query. It also increased the speed of processing.

In an in-memory database, the entire information is loaded into memory, eliminating the need for indexes, aggregates, optimized databases, star schemas, and cubes.
With the use of in-memory tools, compression algorithms can be implemented that decrease the in-memory size, even beyond what is required for hard disks. Users querying data loaded in memory is different from caching.
In-memory processing also helps to avoid performance bottlenecks and slow database access. Caching is a popular method for speeding up the performance of a query, where caches are subsets of very particular organized data, which are already defined.
With in-memory tools, data analysis can be flexible in size and can be accessed within seconds by concurrent users with an excellent analytics potential. This is possible as data lies completely in-memory. In theoretical terms, this leads to data access improvement that is 10,000 to 1,000,000 times fast, when compared to a disk.

It also reduces the performance tuning needed by IT professionals and therefore provides faster data access for end users.
With in-memory processing, it is also possible to access visually rich dashboards and existing data sources. This ability is provided by several vendors.
In-memory processing allows end users and business analytics to create customized queries and reports without any need of extensive expertise or training.

In the next section, let us understand what Language Flexibility in Spark is.

Language Flexibility in Spark

We have already discussed that Spark provides performance, which in turn provides developers a rewarding experience.

Spark is chosen over MapReduce, mainly for its performance advantages and versatility. Apart from this, another critical advantage is its development experience along with language flexibility.

Spark provides support to various development languages like Java, Scala, and Python and will likely support R as well. Also, Spark has the capability to define functions inline.

With the temporary exception of Java, a common element in these languages is that they provide methods to express operations using lambda functions and closures.

Using lambda closures, you can use the application core logic to define the functions inline, which helps to create easy-to-comprehend codes and preserve application flow.

Let’s look at MapReduce in a Hadoop ecosystem.

Hadoop ecosystem vs. Spark

Let us compare the main features of MapReduce in a Hadoop ecosystem and Apache Spark

Hadoop ecosystem

Apache Spark

Uses MapReduce for batch analytics
In Hadoop, third-party support is also available
- Example - By using ETL Talend tools, various batch-oriented workflows can be designed

Supports Pig and Hive queries that enables non-Java developers to use and prepare batch workflows using SQL scripts.

Supports both real-time and batch processing

You can perform every type of data processing using Spark that you execute in Hadoop.

For batch processing, Spark batch can be used over Hadoop MapReduce.
For Structured Data Analysis, Spark SQL can be used using SQL.
For Machine Learning Analysis, the Machine Learning Library can be used for clustering, recommendation, and classification.
For Interactive SQL Analysis, Spark SQL can be used instead of Impala.

Also, for real-time Streaming Data Analysis, Spark streaming can be used in place of a specialized library like Storm.

Spark - Advantages

Spark has three main advantages which are:

Provide Speed capability
Combines the various processing types
Supports Hadoop.

Speed

The feature of speed is critical to process large datasets, as this implies the difference between waiting for hours or minutes and exploring the data interactively.

Spark has extended the MapReduce model to support computations like stream processing and interactive queries, supporting run computations in memory concerning speed.

Also, its related system is more effective when compared to MapReduce to run complex applications on a disk. This adds to the speed capability of Spark.

Combination

Spark covers various workloads that require different distributed systems such as streaming, iterative algorithms, and batch applications. As these workloads are supported on the same engine, combining different processing types is easy.

Spark is normally required in the production of data analysis pipelines. The combination feature also allows easy management of separate tools.

Hadoop Support

Spark is capable of creating distributed datasets from any file that is stored in the Hadoop Distributed File System or any other supported storage systems. You must note that Spark does not need Hadoop.

It supports the storage systems that implement the APIs of Hadoop, and Sequence Files, Parquet, Avro, text files, and all other Input/Output formats of Hadoop. Unification not only provides developers with the advantage of learning only one platform but also allows users to take their apps everywhere.

The graph shown below shows the apps and systems that can be combined with Spark.

Other advantages of Spark are:

A Spark project includes various closely-integrated components for distributing, scheduling, and monitoring applications with many computational tasks across a computing cluster or various worker machines.
The Spark’s core engine is general purpose and fast. As a result, it empowers various higher-level components that are specialized for different workloads like machine learning or SQL. These components can interoperate closely.
It integrates tightly, allowing you to create applications that easily combine different processing models. An example is the ability to write an application using machine learning to categorize data in real time as it is ingested from sources of streaming.
It allows analysts to query the data which results through SQL. Moreover, data scientists and engineers can access the same data through the Python shell for ad-hoc analysis and in standalone batch applications. For all this, the IT team needs to maintain one system only.

Let’s now understand the components of the Spark architecture.

Spark - Architecture

The components of the Spark execution architecture are:

Spark-submit script
Spark applications
Cluster managers
Spark’s EC2 launch scripts

The components of the Spark execution architecture are explained below:

Spark-submit script

Spark architecture comprises a Spark-submit script that is used to launch applications on a Spark cluster.

The Spark-submit script can use all cluster managers supported by Spark using an even interface. As a result, you need not configure your application for each one specifically.

Spark applications

Spark applications run independently as a set of processes on a Spark cluster. These are coordinated by the SparkContext object in the driver program, which is your main program.

Cluster managers

SparkContext can connect to different cluster managers, which are of three types: ‘Standalone, Apache Mesos, and Hadoop YARN.

A standalone cluster manager is a simple one that makes setting up a cluster easy.
Apache Mesos is a general cluster manager that is also capable of running service applications and MapReduce.
Hadoop YARN is the resource manager in Hadoop

Spark’s EC2 launch scripts

The Spark’s EC2 launch scripts ease the launch of a standalone cluster on Amazon EC2.

The interaction of these components is shown in the diagram.

Let us now look at the Spark Execution Architecture.

Spark Execution Architecture

The SparkContext connects to the cluster managers. We discussed earlier that the three types of Cluster managers are Standalone, Apache Mesos, and Hadoop YARN. The cluster manager allocates resources across applications.

The diagram given below shows the process of Spark execution.

Once the SparkContext is connected to the cluster manager, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

Next, it sends your application code to the executors. Finally, SparkContext sends tasks to the executors to run.

Let’s now talk about the automatic parallelization of complex flows in Spark.

Spark Execution - Automatic Parallelization

Spark execution is explained through the points below.

It is your task to make the sequence of MapReduce jobs parallel in case of a complex pipeline. Here, a scheduler tool like Oozie is generally required for constructing this sequence carefully.
Using Spark, the series of individual tasks is expressed regarding a program flow. To give a complete picture of the execution graph to the system, this flow is lazily evaluated.
In lazy evaluation, the operations are only executed in the end. Using this approach, the core scheduler can map the dependencies lying between various application stages correctly.
This allows parallelizing the operators flow automatically without any intervention.
With automatic parallelizing capability, you can also achieve a few optimizations to the engine with less burden.

How automatic parallelization works can be explained in the given diagram.

Running Spark in Different Modes

The different deployment modes of Spark are listed below.

Spark as Standalone:

The standalone mode is a simple one that can be launched manually, by using launch scripts or by starting a master and workers. This mode is usually used for development and testing.

Spark on Mesos:

Spark can also be run on hardware clusters that are managed by Mesos. Running Spark in this mode has advantages like scalable partitioning among different Spark instances and dynamic partitioning between Spark and other frameworks.

Spark on YARN:

Running Spark on YARN runs parallel processing and avails all benefits of the Hadoop cluster.

Spark on EC2:

By running Spark on EC2, you will receive the key-value pair benefits of Amazon.

Spark - Use Cases

Companies like NTT DATA, Yahoo, GROUPON, NASA, Nokia, and others are using Spark for creating applications for different use cases.

These use cases are given to the network machine data which performs Big Data analytics for subscriber personalization in the telecommunications domain.

It executes the Big Data Content platform, which is a B2B content asset management service that provides an aggregated and searchable source of public domain media, live news feeds, and archives of content.

A few more use cases are building data intelligence and e-Commerce solutions in the retail industry, and analyzing and visualizing patterns in large-scale recordings of brain activities.

Let us discuss some of the important concepts of Spark, such as Spark-Shell and SparkContext in the next section.

Spark-Shell

The Spark-Shell provides interactive data exploration. It helps in prototyping an operation quickly instead of developing a full program. It is available in Python and Scala.

You can access the Python Spark-Shell using pyspark and Scala Spark-Shell using spark-shell.

SparkContext

SparkContext is the main entry point of Spark API. Every Spark application requires a SparkContext. Spark-Shell provides a preconfigured SparkContext called “sc.”

You can view, through spark-shell and also through pyspark as shown in the diagrams given below.

In the next section, we will learn more about Resilient Distributed Datasets or RDDs, which is a Spark component. We will also learn how an RDD is created from a file and get introduced to RDD operations, Lazy evaluation, RDD lineage, and pipelining.

Resilient Distributed Datasets(RDDs) and its Operation

Most Spark programming consists of performing operations on RDDs, where ‚ means data in memory can be recreated if it is lost.

“Distributed”‚ means processed across the cluster while ‚“dataset”‚ is initial data that can come from a file or can be created programmatically.

RDDs are the fundamental unit of data in Spark. Most Spark programming consists of performing operations on RDDs.

There are three ways to create an RDD which are:

From a file or set of files
From data in memory
From another RDD.

Let’s discuss how to create an RDD from a file.

How to Create an RDD from a File

The example below shows how the file, simplilearn.txt is distributed across multiple worker nodes.

The first line of code creates an RDD named ‚“mydata.

scala> val mydata =

sc.textFile("simplilearn.txt")

The second line of code counts the number of lines present in,“mydata”‚ RDD using the count function. The output is four.

scala> mydata.count

16/08/10 06:10:41 INFO

scheduler.DAGScheduler: Job 0 finished:

count at <console>:24, took 1.010566 s

res0: Long = 4

Let us look at the types of RDD operations in the next section.

RDD Operations

RDDs provide support to two different types of operations, which are ‚“actions”‚ and “transformations.” These operations are shown in the diagram given below.

As explained in the diagram above, Actions allow returning a value to the driver program once a computation on the dataset is run. Transformations allow the creation of another dataset from an existing one.

Let’s look at each operation type closely.

RDD Operations - Action

On the table in this section, some common actions available in Scala and Python are listed.

Common Actions
count()	Returns the number of elements
take()	Returns an array of the first N elements
collect()	Returns an array of all elements
saveAsTextFile(file)	Saves RDD to text file

The image shown below shows RDD Operations - Action using Scala.

The image shown below shows DD Operations - Action using Python.

RDD Operations - Transformation

RDDs are immutable. Data in an RDD never changes; they transform in sequence to modify the data as needed. You don’t need to make changes to an RDD; you can simply create a new RDD using a transformation. Spark programs are mostly made up of a series of transformations.

There are two common transformations, Map and filter functions.

Map(function): A map creates a new RDD by performing a function on each record in the base RDD
Filter (function): A filter creates a new RDD by including or excluding each record in the base RDD.

Example - Map and Filter Transformation

The example below shows you how each code returns the same result in two successive transformations in Python and Scala.

Using Python:

Using Scala:

The result of this map and filter transformation in python and scala is shown below.

The first code maps each original line by converting the string to upper case and the second code alters outlines which do not start with the letter “I.”

Let us now understand the Lazy evaluation in Spark and its steps.

Lazy Evaluation Steps

Lazy evaluation improves the use of disk and memory in Spark. Until the execution of the Action operation, the file is not even read. Let’s understand Lazy Evaluation works with the example given below.

Let us now understand the steps for the same example.

Step 01

The first step starts with a text file. The example shown above is simplilearn.txt. The only step that happens at this stage is that lazy evaluation creates the variable for the RDD and links it to the file, but it doesn’t read the file.

Step 02

It transforms the base RDD to upper case. Here, lazy evaluation creates an empty RDD and links it to its parent; however, it still doesn't process it. The file still hasn’t been read.

Step 03

It alters the upper case RDD for lines beginning with “I.” The lineage continues to grow.

Step 04

The fourth step is called as the action step. “Count” is an action, which means that it processes the data in the RDD and returns a value to the calling program.

At the time when the operation is performed on the last RDD, Spark recursively processes each RDD and its parent. The important point to note here is that nothing happens until that final call or action is taken.

Since the data is only processed at the end during the action step, it is termed Lazy evaluation. Therefore, each command line is considered separate from the one above it.

The RDDs stay populated after the call, which isn’t the case in this example. They get evaluated only when the script reaches the action. You would have noticed in the previous example that the output of each transformation was assigned to a new variable.

Let us now look at the chaining transformation

Chaining Transformations

Chains the calls together without any reference to the intermediate datasets. This has no impact on the result, as it is a syntactical difference.

On this section you will notice that by chaining transformations, you get the same results like as in the preceding code.

val mydata = sc.textFile(“simplilearn.txt”)

Val mydata_uc = mydata.map(line=> line.toUpperCase())

Val mydata_filter = mydata_uc.filter(line=> line.startsWith (“T”))

Mydata_filter.count()

This Is exactly equivalent to -

Sc.textFile(“simplilearn.txt”).map(line=> line.toUpperCase())

.filetr(line=> line.startswith (“T”)).count()

You can’t chain a transformation after an action because the output of the actions are values, not RDDs.

The last operation in this example is an action that will trigger the lazy evaluation, but you can’t extend the chain further because the output of count is an integer.

Please note that an RDD is still created for each of these transformations, even if there are no variables pointing to the RDD. Spark maintains a big table of all RDDs.

You saw this example in the Scala version. In the Python version, it appears as below.

val mydata = sc.textFile(“simplilearn.txt”)

Val mydata_uc = mydata.map(lambda s: s.upper())

Val mydata_filter = mydata_uc.filter(lambda s:s.startswith (“T”))

Mydata_filter.count()

This Is exactly equivalent to -

Sc.textFile(“simplilearn.txt”).map(lambda s: s.upper())

.filetr(lambda s:s.startswith (“T”)).count()

RDD Lineage from the Stack - Scala

Spark maintains each RDD’s lineage, that is, the previous RDDs on which it depends. You can use the toDebugString command to view the lineage of an RDD. You can also see on the section that the RDD is created by typing val maydata_debug = sc.textFile. This is called the Scala version.

RDD Lineage from the Stack - Python

The Python version of the same file is shown in the image below.

“toDebugString()‚always returns a single string containing multiple lines, delineated by new-line characters. The Scala shell will automatically display these on separate lines.

On the other hand, the Python shell will display newlines as “n, and the Python print function will convert them to actual new lines.

You may notice that the Scala and Python output for “toDebugString() isn't identical, even after improvisation. This reflects underlying differences between the implementation of Spark in Python and Scala.

Note that the Python lineage shows three RDDs in the lineage and the Scala lineage shows four, including two rather than one MappedRDD. This is because of how sc.textFile is implemented in Scala.

In Scala, the output of text File is a MappedRDD which depends on a HadoopRDD, whereas in Python the output is HadoopRDD.

Pipelining

Spark performs the sequence of transformations row-wise since it does not store data. This concept is called pipelining. Let’s understand it through an example.

You can see that the file simplilearn.txt is brought into spark memory using sc.textFile.

On the next the section, you can see that the whole file is not being processed; instead, it starts with only the first row.

Now, let’s perform the “toUppercase transformation on the first row of the simplilearn.txt file. You can see that the first row is now in upper case. After the “toUpperCase” transformation, let’s perform the other transformation “startsWith”.

You can see in the image that the transformation has been executed in the first row. Now, the action (take) will print this line.

Similarly, all the above steps will be executed for the second, third, and the fourth row, one after the other. Since the second and third rows are being filtered in mydata_filtrdd, the take(2) action will print the second and third lines. Both the lines start with the letter“I.”

Now, there will be no further processing required on the fourth row because two rows are now printed as required by the take(2) action.

Let us now look at the functional programming in spark in the next section of this tutorial.

Functional Programming in Spark

Spark depends heavily on the concepts of functional programming. Functions are the fundamental units of programming. They contain input and output only without any state or side effects.

Key Concepts in Functional Programming

The key concepts in the functional programming of Spark include passing functions as input to other functions. This is done using Anonymous functions.

The lambda operator or lambda function is a way to create small anonymous functions, that is, functions without a name. These functions are specific functions, that is, they are needed only where they have been created.

Lambda functions are mainly used in combination with the functions such as filter (), map (), and reduce ().

RDD Map Operation

The section below shows you how a typical Spark function, in this case, map, is implemented.

Here, a function as a parameter is being executed sequentially over a series of data.

Passing Named Functions

The section below shows how named functions are passed using python and scala.

In this example, we define a function which takes a single parameter. You can use a shortcut to pass the function to the “map” transformation and just pass the name of the function since it is taking a single parameter.

Then Spark will call the function for each item in the RDD, passing the item to the function. This explains passing named function. Both these examples shown on section give the same result.

Here, you are defining your function, which returns a string converted to upper case. You would have also observed that you can define your own named function and pass that instead of passing s.upper().

Anonymous Functions

Spark doesn’t necessarily need anonymous functions. You can define and name each function separately, but that would make the code cumbersome. That’s why Anonymous functions are used extensively in Spark, even though they are not specically a Spark concept.

Java 7 does not support Anonymous functions, and there is no Spark-Shell for Java. This is why the course is taught in Python or Scala.

The newly released Java 8 includes anonymous functions using the greater than the operator. Java 8 support was added to Spark in 1.0.

Note that the lambda syntax, used to create anonymous functions in Python is beyond the scope of this course. However, if you are new to the concept, you are encouraged to read about it in the Python documentation or closure syntax in Scala.

Passing Anonymous Functions

The code on the below section does the same as the toUpper() example in the previous section, but instead of defning the function separately and giving it a name, we define the function inline.

It doesn’t have a name and is, therefore, called an anonymous function.

These are also called lambda functions or closures. Scala provides a shortcut for referencing the value passed to an anonymous function. The use of the underscore is frequently used in the Spark code but not used here as it makes the code less understandable.

The use can also be confusing as you can’t reference the same variable twice. For example, a second underscore refers to a second parameter and so on. Also, you can only use the underscore in the outermost scope.

There is a big difference between the syntax for Scala closures and lambda functions in Python. In Scala, you can have arbitrarily complex functions that span multiple lines. Whereas, in Python, lambda functions can confine in a single line.

Scala closures can defne locally-scoped variables, as well as have side effects. This is not possible in Python. It becomes apparent in Spark when you look at all the example codes, which often includes long, complex anonymous functions.

Looking to learn more about Big Data Hadoop, why not enroll for big data online training

Summary

Now let’s summarize what we learned in this spark tutorial for beginners.

Apache Spark is a fast, general engine for large-scale data processing.
RDDs are the fundamental units of data in Spark.
Parallelize an existing collection or reference an external dataset to create RDDs.
Actions and transformation are two types of RDD operations. You can either pass name functions or make them anonymous.

Conclusion

This concludes the lesson on Basics of Apache Spark.