Monday, November 2, 2020

Explain the process of distributed data processing using Spark

Distributed data processing refers to spreading data and computation across interconnected computer systems in different locations that share data. Apache Spark is an open-source, general-purpose distributed data processing engine with the capacity to handle large volumes of data.

Moreover, it supports different cluster managers, such as Standalone, Kubernetes, Apache Mesos, and Apache Hadoop YARN (Yet Another Resource Negotiator).

It includes an extensive set of libraries and APIs and supports different programming languages such as Java, Scala, Python, and R. Its flexibility makes it suitable for a wide range of use cases.

Apache Spark also works with distributed data stores such as MapR XD and Hadoop's HDFS, with popular NoSQL databases such as MapR Database, Apache HBase, and MongoDB, and with distributed messaging systems such as MapR Event Store and Apache Kafka.


To learn the big data course, visit OnlineITGuru's big data and hadoop online training blog.


Spark distributed computing example

Spark is built around the concept of distributed datasets, which contain Python or Java objects. Several APIs are built on top of Spark core and serve as its building blocks: the RDD API, the DataFrame API, and the Machine Learning (MLlib) API.

These APIs also provide the means to carry out distributed data processing operations in Spark.

RDD (Resilient Distributed Dataset) is the main abstraction Spark provides for handling distributed data processing. Typical RDD API examples include word count and Pi estimation; they are useful for computing large tasks and for building datasets of different kinds.
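As an illustration, here is a minimal PySpark word-count sketch using the RDD API; the input file name and application name are assumptions for the example, not part of the original article.

from pyspark.sql import SparkSession

# Minimal RDD word-count sketch; "input.txt" is a hypothetical local file.
spark = SparkSession.builder.appName("rdd-word-count").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")                 # read lines as an RDD
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word

print(counts.take(10))                             # show a few (word, count) pairs
spark.stop()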

A DataFrame is a distributed collection of data organized into named columns. Users employ the DataFrame API to perform different relational operations.

It works with both external data sources and Spark's built-in distributed data collections, without requiring low-level instructions for data processing. Typical examples include text search and simple data operations.
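For instance, a small text-search sketch with the DataFrame API might look like the following; the log file name and the search term "error" are assumptions made for the example.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("df-text-search").getOrCreate()

# Read a hypothetical log file; each line becomes a row with a single "value" column.
logs = spark.read.text("app.log")

# Relational-style operation: keep only rows containing "error" and count them.
errors = logs.filter(col("value").contains("error"))
print(errors.count())

spark.stop()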

Spark's MLlib, the Machine Learning library, provides different types of distributed ML algorithms, such as feature extraction, classification, regression, and clustering. It also provides tools such as ML Pipelines for building workflows.
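A short sketch of such a pipeline, combining feature extraction with a classifier, is shown below; the toy training data and column names are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline").getOrCreate()

# Tiny hypothetical training set: a text column and a binary label.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0)],
    ["text", "label"])

# Feature extraction (tokenize + hash to term frequencies) feeding a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()

spark.stop()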

Distributed data processing with Spark

Since Apache Spark is a distributed data processing engine, let us look at how data processing is actually done with it. To speed up data processing we use the Spark processing framework, which first has to be installed and running.

Resource management

Most Spark jobs run on a shared processing cluster that distributes the available resources among all running jobs based on a set of parameters. In particular, we need to allocate memory to each executor, for which there are two relevant settings:

- The size of memory available for the Spark Java process: --executor-memory 1G
- The amount of overhead memory required for Python or R processes: --conf spark.yarn.executor.memoryOverhead=2048
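Assuming the job runs on YARN, the same settings can also be passed programmatically when building the session; this is only a sketch, and the application name is a placeholder.

from pyspark.sql import SparkSession

# Equivalent of --executor-memory 1G and
# --conf spark.yarn.executor.memoryOverhead=2048 on spark-submit.
spark = (SparkSession.builder
         .appName("resource-demo")                               # hypothetical app name
         .config("spark.executor.memory", "1g")                  # JVM heap per executor
         .config("spark.yarn.executor.memoryOverhead", "2048")   # off-heap memory for Python/R
         .getOrCreate())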

Number of parallel jobs

The number of executors working in parallel can be determined dynamically by Spark. To enable this, we use the following parameters:

- --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true

Optionally, we can also set upper and lower bounds:

- --conf spark.dynamicAllocation.maxExecutors=30 --conf spark.dynamicAllocation.minExecutors=10
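These flags can likewise be set from code; the following is a minimal sketch with a placeholder application name, and it assumes the cluster's external shuffle service is available.

from pyspark.sql import SparkSession

# Dynamic allocation requires the external shuffle service on the cluster.
spark = (SparkSession.builder
         .appName("dynamic-allocation-demo")                  # hypothetical app name
         .config("spark.shuffle.service.enabled", "true")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "10")
         .config("spark.dynamicAllocation.maxExecutors", "30")
         .getOrCreate())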

Managing dependencies

Many commonly used Python dependencies need to be preinstalled on the cluster, but in some cases we can provide our own. To do this, we need to obtain a package containing our dependency; PySpark supports only zip, egg, or whl packages. Using pip, we can get such a package easily:

 pip download Flask==1.0.2

Pip will download the package along with all of its dependencies. Pip prefers to download a wheel if one is available, but it may also return a ".tar.gz" file, which we will need to repackage as a zip or wheel.

To repackage a tar.gz as a wheel, we run:

 tar xzvf package.tar.gz

 cd package

 python setup.py bdist_wheel

Note that the resulting packages depend on the Python version we use, so they must be built with a Python version matching the one installed on the cluster. In this way, we have seen how distributed data processing is actually set up with Spark.
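Once built, the wheel can be shipped to the executors, either with --py-files on spark-submit or from code. A small sketch follows; the wheel file name shown is an assumed example of what pip produces for Flask 1.0.2.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deps-demo").getOrCreate()

# Distribute the (hypothetical) wheel so that worker-side Python tasks can import it.
spark.sparkContext.addPyFile("Flask-1.0.2-py2.py3-none-any.whl")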

Spark architecture

Spark has a well-defined, layered architecture composed of loosely coupled components and layers, and it integrates various extensions and libraries. The Apache Spark architecture is based on the following abstractions:

 Resilient Distributed Datasets (RDD)

 Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

A Resilient Distributed Dataset (RDD) is a collection of data split into partitions, which are stored in memory on the worker nodes of the Apache Spark cluster. In terms of datasets, Spark supports two types of RDDs: (a) Hadoop datasets, created from files stored on HDFS, and (b) parallelized collections, based on existing Scala collections. RDDs support two different kinds of operations:

- Transformations
- Actions
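A brief sketch of the two kinds of operations is given below; the sample numbers are an assumption for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])        # a parallelized collection

# Transformations are lazy: they only describe a new RDD.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation on the cluster.
print(evens.collect())   # [4, 16]
print(squares.count())   # 5

spark.stop()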

DAG (Directed Acyclic Graph)

A DAG, or Directed Acyclic Graph, is a sequence of computations performed on the data. Each node in the graph is an RDD partition and each edge is a transformation applied on top of the data.
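For instance, chaining transformations builds up such a graph lazily, and an action triggers its execution; the lineage can be inspected with toDebugString. The data and chain of operations below are assumptions used only to illustrate this.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

# Each transformation adds an edge to the lineage graph; nothing runs yet.
rdd = (sc.parallelize(range(100))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

print(rdd.toDebugString())   # shows the chain of RDDs that forms the graph
print(rdd.count())           # the action makes Spark execute the graph

spark.stop()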

Advantages of Distributed Data Processing

Distributed data processing allocates data among computer networks in different locations so that processing capability can be shared. It has many advantages, such as:

Reliable

In single-server processing, hardware issues and software crashes can cause malfunctions and failures, and may even result in a complete system breakdown. Distributed data processing is much more reliable because control is spread across different systems: a disruption in any one system does not bring down the network, since another system takes over its processing load. This makes a distributed data processing system more reliable and robust.

Lower Cost

Large companies would otherwise need to invest in expensive mainframes and supercomputers to act as centralized servers for business operations. Each mainframe machine costs several hundred thousand dollars, compared with several thousand dollars for a few minicomputers. Distributed data processing therefore lowers the cost of data sharing and networking across the organization, since it comprises several minicomputer systems that together cost less than a mainframe machine.

More Flexible

The individual computers that make up a distributed network are located in different places. For example, an organization with a distributed network might have three computer systems, one in each branch, interconnected through the Internet and processing data in parallel from different locations. This is what makes distributed data-processing networks flexible to use.

The system is also flexible in terms of scaling processing capability up or down: adding more nodes or computers to the network increases its processing power, while removing computer systems reduces it.

Performance improvement and Reduced Processing Time

A single computer system is limited in its performance and efficiency, but adding another computer to the network increases the available processing power, and adding one more system increases it further. Distributed data processing works on this principle: a task gets done faster when different machines work on it in parallel.

For instance, a complex statistical problem can be broken into modules and allocated to different machines, where the modules are processed in parallel. This significantly reduces processing time and improves performance.

Conclusion

In conclusion, this article has explained the process of distributed data processing using Spark. Learn more from big data hadoop training.

