Showing posts with label big_data. Show all posts
Showing posts with label big_data. Show all posts

Monday, November 2, 2020

Explain the process of distributed data using Spark

 Distributed data processing refers to the distribution of computer networks

across different locations where computer systems interconnected & share data.

Apache Spark is an open-source general distributed data processing engine with

capacity to handle heavy volumes of data.

Moreover, it supports different types of resources or cluster managers. Such as

Standalone, Kubernetes, Apache Mesos & Apache Hadoop YARN (Yet Another

Resource Negotiator).

It includes an extensive set of libraries and APIs and supports different

programming languages like Java, Scala, Python, R, etc. Moreover, its flexibility

makes it suitable for a wide range of use cases.

Apache Spark is also useful with distributed data stores like MapR XD, Hadoop’s

HDFS, etc. And with popular NoSQL databases like MapR Database, Apache

HBase, and MongoDB. And it also used with distributed messaging stores like

MapR Event Store and Apache Kafka.


To learn big data course visit OnlineITGuru's big data and hadoop online training Blog

An Introduction to Big Data: Distributed Data Processing | by ...



Spark distributed computing example

It includes the concept of distributed datasets that contains the objects Python or

Java. There are different types of Spark APIs available on the top of the Spark that

are the building blocks of Spark API. These are RDD API, DataFrame API & the

Machine Learning API.

Besides, these APIs also provides the way to conduct DDP operations under Spark.

RDD (Resilient Distributed Dataset) is the major abstraction that Spark provides to

handle distributed data processing. The examples of RDD API include Word count

and Pi Estimation. These are useful in computing extensive tasks and helps in

building datasets of different kinds.

A DataFrame is a collection of distributed data organized into different name

columns. DataFrame API is used by the users to perform different relational

operations.

It is useful for both external data sources and its inbuilt distributed data

collections. This is used without giving specific instructions for data- processing. It

includes examples like text search & simple data operations.

Sparks MLib or Machine Learning library provides different types of distributed

ML algorithms. Such as extraction, classification, regression, clustering, etc.

Moreover, it also provides tools like ML pipelines to building workflows.

Distributed data processing with Spark

As we know that Apache Spark is a distributed data processing engine, we will

discuss the way of the data processing done using Spark. To speed up the data

processing, we need to use the Apache Spark processing framework. At first, we

need to install and run it.

Resource management

Most of the jobs in Spark are run on a shared processing cluster that will

distribute available resources. It allocates among all running jobs based on some

parameters.

Moreover, we need to allocate memory to the executor where there are two

relevant settings:

 The size of memory available for the Spark 'Java' process is --executor-

memory 1G

 The amount of memory required for Python or R script is: --conf

spark.yarn.executor.memory Overhead=2048

Number of parallel jobs

The number of jobs that are parallel processed identified dynamically with spark.

Therefore, we should use the following parameters:

 --conf spark.shuffle.service.enabled =true –conf

spark.dynamicAllocation.enabled=true

There is an option that we can set upper or lower bounds also:

 --conf spark.dynamicAllocation.maxExecutors =30 --conf

spark.dynamicAllocation. minExecutors=10

Number of dependencies

There are a lot of commonly used Python dependencies require to preinstall on

the cluster. But in some cases, we can provide our own.

To do this, we need to get a package containing our dependency. Besides, the

PySpark supports zip, egg, or whl packages only. Using pip, we can get such a

package easily.

 pip download Flask==1.0.2

Now it will download the package including all of its dependencies. Pip prefers to

download a wheel if it is available, but it may also return a ".tar.gz" file, which we

will need to repackage it as zip or wheel.

To repackage a tar.gz as wheel we need to do:

 tar xzvf package.tar.gz

 cd package

 python setup.py bdist_wheel

But, the above files depend on the version of Python we use. So, we need to

install a higher version of Python for this. In this way, we came to know how

distributed data processing is actually done.

Spark architecture

It includes a well-defined layered architecture that comprises of loosely coupled

components and layers. It also integrates various extensions and libraries. The

Apache Spark Architecture is based on the following abstractions-

 Resilient Distributed Datasets (RDD)

 Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets or RDD is a collection of data sets split into

partitions. These are stored in memory on worker nodes of the Apache spark

cluster. In terms of datasets, it supports two types of RDD’s – (a) Hadoop

Datasets, created from the files stored on HDFS. (b) Parallelized collections, that

are based on existing Scala collections. Moreover, the RDD’s support two

different kinds of operations –

-Transformations

- Actions

DAG (Directed Acyclic Graph)

DAG or Directed Acyclic Graph is a series of computations that perform on data.

In this, each node is a partition RDD & the edge is a transformation on top of data.

Advantages of Distributed Data Processing

The distributed data processing helps in the allocation of data among different

computer- networks in different locations for sharing data processing capability.

There are many advantages of distributed data processing. Such as;

Reliable

In a single-server or system processing, there may be some hardware issues and

software crashes that cause malfunction and failure. Moreover, it also results in a

complete system breakdown. But, the distributed data processing is much reliable

due to different control centers spread across different systems. A disruption in

any one system does not impact the network since another system takes over its

processing capability. Furthermore, it makes the distributed data processing

system more reliable and powerful.

Lower Cost

In large companies it needs to invest expensive mainframe and supercomputers

to function as centralized servers in business operations. Each mainframe

machine costs several hundred thousand $ in comparison to several thousand $

for a few mini computers. Again distributed data processing helps to lower the

cost of data sharing and networking across the organization. It comprises several

minicomputer systems that cost less than a mainframe machine.

More Flexible

Many individual computers that comprise a distributed network present at

different places. For example, an organization with a distributed network system

comprises 3 computer systems having each machine in a different branch. These

three machines are interconnected through the Internet. And they can process

data in parallel but from different locations. This makes us understand distributed

data-processing networks more flexible to use.

Moreover, this system is flexible also in terms of enhancing or minimizing

processing capability. For example, by adding more nodes or computers to the

network enhances its processing power. But reducing computer systems from the

network minimizes its processing power.

Performance improvement and Reduced Processing Time

A single computer system is limited in its performance and efficiency but adding

another computer to a network enhances power processing. By adding one more

system will further enhance performance. So, distributed data processing works

on this principle and makes that a task gets done faster if different machines are

working in parallel.

For instance, complex statistical problems are broken into different modules and

allocated to different machines where they are processed in parallel. This

significantly minimizes processing time and improves the performance of the

computer.

Conclusion

Thus, we reach a conclusion in the above article that explains the process of

distributed data processing using Spark. Learn more things from big data hadoop training.


Friday, September 25, 2020

Data Validation Framework in Apache Spark for Big Data Migration Workloads

 In Big Data, testing and assuring quality is the key area.

However, data quality problems may destroy the success of many Data Lake, Big Data, and ETL projects. Whether it’s a big data or small, the need for the quality data doesn’t change. Moreover, high-quality data is the perfect driver to get insights from it. The data quality is measured based on the business satisfaction by deriving the necessary insights,More info go through big data online course

Steps included in the Big Data validation.

  • Row and Column count

  • Checking Column names

  • Checking Subset Data without Hashing

  • Statistics Comparison- Min, Max, Mean, Median, 25th, 50th, 75th percentile

  • SHA256 Hash Validation on entire data

Debugging

When there is a mismatch between source and sink, then we should know how to find out particular corrupt data within the entire data which may include 3000+ columns and Millions of records.

Let’s discuss the same by looking into the columns that come in the way.

Context

Under Big Data, we have transferred the data from MySQL to Data Lake. Moreover, the quality of the data has to be verified before it is ingested by downstream applications.

For example purpose, I have gone with a sample customer data (having 1000 records) within Spark Dataframe. However, the demo is with a small amount of data, this solution can be scaled to the enormous data volume.

Plot-1

Thus, the same data exists within two Dataframes, so our Data validation framework will be a green signal.

I have intentionally changed the data in the previous records in the 2nd data frame. So, we can see how this hash validation framework helps.

Let’s see the steps in the data validation process as this is the core part of this validation in Big Data.

Step-01: Row and Column count

This check will happen in a typical data migration pipeline.

spark 01.png

Step-02: Checking Column Names 

The following check will make sure that we don’t have corrupt or additional columns in this Big Data validation.

spark 01.png

Step-03: Checking Subset Data without Hashing

This type of checking is like fruit to fruit comparison. Moreover, it means that this will validate the actual data (Big Data) without applying the hash function. But, this checking has a limitation up to a few records only, as this may consume more resources if we do it using Big data or huge data.

Step-04: Statistics Comparison using— (Min, Max, Mean, Median, 25th, 50th, 75th percentile)

In rare cases, there can be a collision or attack in hash validation. This may lead to data corruption. Moreover, this can be evaded by calculating statistics on each column in the data.

Step-05: SHA256 Hash Validation on entire data

For this example, I have chosen SHA256 but there are some other hashing algorithms also available as well such as MD5.

Hash Function

Hashing algorithms or functions are useful to generate a fixed-length result (the hash, or hash value) from the given input. Moreover, this hash value is an abstract of the actual data.

EX:

  • Hash function                                                            

Good boy        

  • Hash Value

2debae171bc5220f601ce6fea18f9672a5b8ad927e685ef902504736f9a8fffa

The above example explains the function hash and its related hash value under the SHA256 algorithm. A small change in the character can change the total hash value.

This kind of technique is extremely used in digital signatures, authentication, indexing Big Data in hash tables, detecting duplicates, etc. Moreover, this technique is useful to detect whether a sent file didn’t suffer any accidental or intentional data corruption.

Verifying a Hash

In Big Data, data can be compared to a hash value to ascertain its integrity. Typically, data is hashed at a precise time and the corresponding hash value is secured in some way. At a later time, the data can be divided again and compared to the secured value. In case, the hash value matches, the data has not been modified. If the value does not match, it means the data has been corrupted or faulted. For this system to work, the secured hash or division must be encrypted or kept secret from all suspicious parties.

Hash Collision

A collision or attack occurs when two different keys include the same hashCode. This may happen as two unlike objects within Java can have a similar hashCode. Moreover, a Hash Collision is an attempt to determine two input strings of a function hash that gives the same hash result. Because these functions have immense input length and a predefined output length. Thus, there is automatically going to be the possibility of two distinct inputs that give the same output hash as a result.

When we have millions of records and 3000+ columns, it becomes hard to compare the source and destiny system for data mismatch in Big Data. For doing this, we need a big memory and calculation power engines. To address this, we are using Hashing to link all the 30k+ columns together in a chain into one single hash value column. This just includes 64 characters in length. This amount is unimportant while comparing the 30k+ column length and size.

Why Data migration is important in Big Data?

Nonetheless of the actual purpose of data migration in Big Data, the goal is generally to enhance performance and competitiveness.

Anyhow, we have to get it right!

When the migration is less successful, it results in incorrect data that includes dismissals and unknowns. This may happen even when the source data is completely usable and appropriate. Furthermore, any issues that do exist in the source data can be turned up while bringing it into a new, more advanced system.

A complete Big Data migration strategy averts a sub-parallel experience that ends up developing more issues than it resolves. Apart from missing deadlines and exceptional budgets, incomplete plans can cause migration projects to fail completely. In planning and strategizing the work, teams need to give complete attention towards migrations, instead of making them secondary to another project with a big scope.

A strategic data migration plan should include the reflection of the following important factors:

Knowing the data

Before performing the migration, all source data needs a complete audit check. Unexpected issues may occur in case this step is ignored.

Cleanup:

Once the user recognizes any problem with the source data, they must be resolved in less time. Moreover, this work may require additional software tools and third-party resources due to the large scale work.

Maintenance and security:

Data undergoes humiliation after some time, making it untrustworthy. This means there must be the maintenance of data quality by placing controls.

Governance:

It becomes necessary to track and report on data quality because it allows a better understanding of data honesty. Furthermore, the processes and tools used to generate this information should be highly useful and do automated functions where possible.

Final Thought

Thus, we have discussed how a framework in Spark for Big Data migration validates data. However, poor data quality will put a burden on the working team's quality time in fixing them. I hope, this article will help to address the data quality problems after migration from source to destination using Spark. Get more insights from big data online training

Thursday, August 13, 2020

Hadoop: Data Processing and Modelling

You need a general purpose frame for your cluster. It is because all other types of systems can address a particular use case (e.g., graph processing, machine learning, etc.) and are not adequate by themselves to manage the variety of computing needs that are likely to occur in the organization. In comparison, many of the other frameworks depend on common-purpose frameworks. Also the special-purpose frameworks that don't build on general-purpose frameworks also depend on their bits and pieces.

To learn complete hadoop course visit:hadoop admin online course


Hadoop Frameworks


MapReduce, Spark, and Tez are the traditional frameworks in this category — and newer frameworks, such as Apache Flink, are emerging. Usually MapReduce is still built on clusters as of today. Certain general purpose systems, including input / output formats, rely on bits and pieces from the MapReduce stack. Nonetheless, other frameworks such as Tez or Spark can still be used without having MapReduce built on your cluster. 


MapReduce is the most advanced, but it is the slowest, arguably. Both Spark and Tez are DAG systems and don't have the overhead of running a Map often accompanied by a Reduce job; both are more versatile than MapReduce. Spark is one of the Hadoop ecosystem's most successful ventures, and has a lot of traction. It's considered by many to be MapReduce 's successor — I advise you to use Spark over MapReduce whenever possible.


Notably, MapReduce and Spark have different APIs. That means you'll have to rewrite your jobs in Spark because you're using an abstraction system, if you're switching from MapReduce to Spark. It's also worth noting that while Spark is a general-purpose engine built on it with other abstraction systems, it also offers high-level processing APIs. Spark API can therefore also be seen as an abstraction system itself in this way. The amount of time and code needed to write a Spark job is therefore typically much less than writing an equivalent MapReduce job.

At this level, Tez is better suited to building abstraction frameworks as a framework, rather than developing applications using its API.


The important thing to remember is that just because you have a general purpose processing system  built on your cluster doesn't mean you need to write all of your processing jobs using the API of that system. In general, it is recommended that abstraction frameworks (e.g., Pig, Crunch, Cascading) or SQL frameworks (e.g., Hive and Impala) be used whenever possible for writing processing jobs (there are two exceptions to this rule, as discussed in the next section).


Hadoop Abstraction and SQL frameworks:


Abstraction frameworks (e.g., Pig, Crunch, and Cascading) and SQL frameworks (e.g., Hive and Impala) minimize the amount of time spent explicitly writing jobs for general-purpose frames in Hadoop.


Abstraction frameworks: 


Pig is an abstraction system which can run on MapReduce, Spark, or Tez as seen in the diagram above. Apache Crunch offers a higher level API for performing MapReduce or Spark jobs. Cascading is another abstraction system based on the API, which can run on either MapReduce or Tez.


SQL frameworks:


Hive can run on top of MapReduce or Tez as far as SQL engines go, and work is under way to make Hive run on Spark. There are several SQL engines specially designed for faster SQL, including Impala, Presto and Apache Drill

.

Main points about the benefits of using an Hadoop abstraction or SQL framework:


You can save a lot of time by not needing to use the low-level APIs of general purpose systems to implement common processing tasks.

You may change the underlying frameworks (as required and applicable) for general purpose processing. Coding directly on the frame means that if you decided to change systems, you would have to rewrite your jobs. Using an abstraction or SQL framework which builds on abstracts that away from a generic framework.


Running a job on an abstraction or SQL system needs just a small percentage of the overhead needed for an equivalent job written directly within the framework of general purpose. Also, running a query on a special purpose processing system ( e.g., Impala, or Presto for SQL) is much faster than running an equivalent MapReduce task, as they use a completely different execution model, designed to run fast SQL queries.


Hadoop Two examples, where a general purpose system can be used:


If you have other data (i.e. metadata) information that can not be expressed and exploited in an abstraction or SQL system. For example, if you construct a logical data set in an abstraction or SQL system, let 's assume that your data set is partitioned or ordered in a specific way that you can not express. Using such partitioning / sorting metadata in your job can also speed up processing. In such a scenario it makes sense to program directly inside a general-purpose processing framework's low-level API. In such situations, the time savings of running a job over and over again more than the extra time for growth pays off.


If a general purpose design is better suited to your use case. Generally there is a small percentage of use cases where the analysis is very complex and can not easily be represented in a DSL such as SQL or Pig Latin. Crunch and Cascading should be considered in these situations, but sometimes you can only need to program directly using a general purpose processing system.

If you have chosen to use an abstraction or SQL framework, which specific framework you typically use depends on the in-house knowledge and experience

.

Chart, machine learning, and real-time frameworks/streaming


Generally there is no need to ask users to follow graphs, machine learning and real-time / streaming systems. If a particular case of use is important to you, you will probably need to use a system that will solve the case.


Hadoop Frames in maps


The popular graph processing frameworks include Giraph, GraphX, and GraphLab.

Apache Giraph is a library running on MapReduce.

GraphX is a graph processing library running on Spark.

Graph Lab was a stand-alone, special purpose graph processing system now capable of handling tabular data as well.


Hadoop Frameworks for machine learning


Mahout, MLlib, Oryx, and H2O are widely used as frameworks for machine learning.

Mahout is a library on top of MapReduce, though plans are being made to get Mahout running on Spark.

MLlib is Spark's machine-learning library.

Oryx and H2O are machine learning engines which are stand-alone, special purpose.


Framework for real-time / streaming


Spark Streaming and Storm + Trident are widely used mechanisms for quasi-real-time data processing.

Spark Streaming is a micro-batch streaming research library which is built on top of Spark.

Apache Storm is a special purpose, distributed, real-time computing engine with Trident being used on top of that as an abstraction engine.


Hadoop Partitions:


Partition means dividing a table into coarse grained parts based on a column value like 'information.' It makes it easier to do queries on data slices


Hive Data Samples


What is Partition 's function, then? 


Determines how data is stored by the partition keys. Each single value of the Partition key here determines a table partition. For simplicity the Partitions are numbered after the dates. This is close to HDFS's 'Block Splitting.'


Buckets:


Buckets offer the data extra structure which can be used for efficient queries. A combination of two tables bucketed on the same columns can be enforced as a Map-Side Combination, including the join column. Bucketing by the used ID means we can test a user-based query easily by running it on a randomized sample of the total user collection.


Conclusion


The Hadoop ecosystem has evolved to the point where using MapReduce isn't the only way to test Hadoop data anymore. With the variety of options available now, selecting which system to use to process the Hadoop data can be difficult. You can learn more through hadoop admin online training