Monday, August 31, 2020

Pig Latin data model and data types

 Pig Latin is the language used by Apache Pig to analyze data in Hadoop. In this chapter we will discuss the basics of Pig Latin, such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.

Pig Latin Data Model 

As discussed in previous chapters, Pig's data model is fully nested. The outermost structure of the Pig Latin data model is a Relation, where:

  • A bag is a collection of tuples.

  • A tuple is an ordered set of fields.

  • A field is a single piece of data.
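
For a concrete picture of how these pieces nest (the values are invented for illustration, not taken from the original text), a bag containing two tuples of three fields each can be written in Pig Latin notation as:

{('Diego','Gomez',6), ('Maria','Lopez',7)}

Here 'Diego' is a field, ('Diego','Gomez',6) is a tuple, and the whole brace-enclosed collection is a bag.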

Pig Latin Statements

Statements are the basic constructs when processing data using Pig Latin.

  • These statements work with relations. They include expressions and schemas.

  • Each statement ends with a semicolon (;).

  • You perform various operations via statements, using the operators provided by Pig Latin.

  • Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output.

  • As soon as you enter a LOAD statement in the Grunt shell, its semantic checking is carried out. To see the contents of the relation, you need to use the DUMP operator. The MapReduce job that loads the data will be carried out only after the dump operation is performed. For more tutorials, visit: big data online course.

Example of Pig Latin

A Pig Latin statement that loads data into Apache Pig is given below.

grunt> student = LOAD 'student_data.txt' USING PigStorage(',') AS
(id:int, firstname:chararray, lastname:chararray, city:chararray);
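
To inspect what was loaded, a typical follow-up (a minimal sketch, assuming the student alias defined above) is to dump the relation and print its schema:

grunt> DUMP student;
grunt> DESCRIBE student;

DUMP triggers execution and writes the tuples of the relation to the screen, while DESCRIBE prints the schema declared in the AS clause.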

Pig Data Types

Pig's data types make up the data model for how Pig thinks about the structure of the data it processes. With Pig, the data model is specified when the data is loaded. Any data that you load from disk into Pig has a particular schema and structure. Pig needs to understand that structure, so when you do the loading, the data automatically goes through a mapping.

  • Fortunately, the Pig data model is rich enough to handle most of what's thrown its way, including table-like structures and nested hierarchical data structures. In general terms, though, Pig data types can be divided into two groups: scalar types and complex types. Scalar types contain a single value, while complex types contain other values, such as the Tuple, Bag, and Map types below.

In its data model, Pig Latin has these four types:

  • Atom: 

An atom is any single value, such as a string or a number, for example 'Diego'. Pig's atomic values are scalar types that appear in most programming languages: int, long, float, double, chararray, and bytearray.

  • Tuple: 

A tuple is a record formed by an ordered set of fields. Each field can be of any type, for example ('Diego', 'Gomez', 6). Think of a tuple as a row in a table.

  • Bag: 

A bag is a collection of tuples, which need not be unique. The bag's schema is flexible: each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.

  • Map: 

A map is a collection of key-value pairs. The key must be unique and must be a chararray; the value can be of any type.
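
As a quick side-by-side illustration (the values are invented for this example), the four types can be written in Pig Latin notation as follows:

Atom:  'Diego'  or  6
Tuple: ('Diego', 'Gomez', 6)
Bag:   {('Diego', 'Gomez', 6), ('Maria', 'Lopez', 7)}
Map:   ['city'#'Bangalore', 'pin'#560001]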

[Figure: Pig Latin data types]

The figure offers some examples of the Tuple, Bag, and Map data types.

The value of any of these types can also be null. The semantics of null are similar to those used in SQL: in Pig, null means the value is unknown. Nulls can show up in the data when values are unreadable or unrecognizable, for example if you used the wrong data type in the LOAD statement.

Null

Null can be used as a placeholder before data has been added, or as an optional value for a field.
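
As a small sketch of handling nulls (reusing the student relation loaded earlier, and assuming some city values failed to parse), records with a null city can be filtered out like this:

grunt> clean = FILTER student BY city IS NOT NULL;

IS NULL and IS NOT NULL are the Pig Latin tests for null values.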

Pig Latin has a simple syntax with powerful semantics that you will use to perform two primary operations:

  • Data access

  • Transform

Accessing data in a Hadoop context means allowing developers to load, store, and stream data, while transforming data means taking advantage of Pig's ability to group, join, combine, split, filter, and sort data. The next section provides an overview of the relevant operators for each operation.

Pig Latin Operators

The operators below are grouped by the operation they serve (data access or transformation):

  • LOAD / STORE (data access): Read data from and write data to the file system.

  • DUMP (data access): Write output to standard output (stdout).

  • STREAM (data access): Send all records through an external binary.

  • FOREACH (transformation): Apply an expression to each record and output one or more records.

  • FILTER (transformation): Apply a predicate and remove records that do not meet the condition.

  • GROUP / COGROUP (transformation): Collect records with the same key from one or more inputs.

  • JOIN (transformation): Join two or more records based on a condition.

  • CROSS (transformation): Compute the Cartesian product of two or more inputs.

  • ORDER (transformation): Sort records based on a key.

  • DISTINCT (transformation): Remove duplicate records.

  • UNION (transformation): Merge two data sets.

  • SPLIT (transformation): Divide the data into two or more bags based on a predicate.

  • LIMIT (transformation): Limit the number of records.
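
To give a feel for how these operators combine, here is a hedged sketch that reuses the student relation loaded earlier (the grouping and ordering choices are purely illustrative):

grunt> by_city = GROUP student BY city;
grunt> counts  = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
grunt> ordered = ORDER counts BY total DESC;
grunt> STORE ordered INTO 'city_counts' USING PigStorage(',');

Each statement produces a new relation from the previous one; only the final STORE causes anything to run.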

Pig also offers a few operators to support debugging and troubleshooting, as shown below:

  • DESCRIBE: Return the schema of a relation.

  • DUMP: Dump the contents of a relation to the screen.

  • EXPLAIN: Display the MapReduce execution plans.
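
A brief usage sketch (assuming the counts relation from the earlier example):

grunt> DESCRIBE counts;
grunt> EXPLAIN counts;
grunt> DUMP counts;

DESCRIBE prints the schema, EXPLAIN shows the logical, physical, and MapReduce execution plans, and DUMP runs the pipeline and prints the result to the screen.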

Part of Hadoop's paradigm shift is that you apply your schema at read time rather than at load time. Under the old way of doing things, the RDBMS way, you must load the data into a well-defined set of tables when you load it into your database system. Hadoop lets you store all the raw data up front and apply a schema at read time.

With Pig, you do this during data loading, with the help of the LOAD operator.

  • The optional USING clause defines how to map the data structure inside the file to the Pig data model; in this case it is PigStorage(), which parses delimited text files. (This part of the USING clause is often referred to as a LoadFunc and acts like a custom deserializer.)

  • The optional AS clause defines a schema for the data being mapped. If you do not use an AS clause, you are basically telling the default LoadFunc to expect a plain, tab-delimited text file. Without a schema, the fields must be referenced by position, since no names are specified (see the sketch after this list).

  • Using an AS clause gives you a read-time schema for your text files, which lets users get started quickly and offers agile schema modeling and flexibility, so you can add more data to your analytics.
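
As a hedged sketch of loading without a schema (reusing the same file name as the earlier example), fields are then referenced by position:

grunt> raw    = LOAD 'student_data.txt' USING PigStorage(',');
grunt> cities = FOREACH raw GENERATE $3;

$0 refers to the first field, so $3 picks out the fourth column, which is the city in the earlier schema.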

Load operator

The LOAD operator works on the principle of lazy evaluation, also referred to as call-by-need. Lazy may not sound particularly praiseworthy, but all it does is postpone evaluating an expression until you really need it.

In the context of the Pig example, that means no data moves after the LOAD statement is executed (nothing gets shunted around) until a statement that writes data is encountered. You can have a Pig script that is a page full of complex transformations, but nothing will be executed until a DUMP or STORE statement is reached.
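
A minimal sketch of this behaviour (relations and file names assumed from the earlier examples): the first three statements only build up a logical plan, and the MapReduce job is launched only when the DUMP is reached.

grunt> student  = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);
grunt> filtered = FILTER student BY city IS NOT NULL;
grunt> by_city  = GROUP filtered BY city;
grunt> DUMP by_city;   -- execution actually starts here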

Conclusion

I hope this has given you a clear understanding of Pig Latin data types. You can learn more through big data hadoop training.

Saturday, August 29, 2020

TOP 6 BIG DATA TRENDS IN THE NEAR FUTURE

Big Data is a buzzword we are all familiar with now. But behind the buzz, there have been rapid developments which have changed business models and brought big data to the strategic foreground. 2016 was a pretty eventful year for Big Data, and the future looks promising. Let's take a look at the top trends that will follow in the upcoming year:


1) CUSTOMER DIGITAL ASSISTANTS

One of the surprising trends we saw this year was the growing interest in digital assistants. The logic has been simple: if we can gather and process data to generate meaningful results, why do we need humans to convey them to customers? The most devoted users are perhaps gamers, who have fully embraced this technology on the Xbox One and Sony PS4. With advanced NLP and audio recognition, mobile digital assistants like Cortana, Siri and Google Now are almost must-haves today, and all signs indicate that digital assistants will play an even more important role in the upcoming year.

To learn the big data hadoop course, visit: big data online training.

2) SIMPLER DATA ANALYSIS

Like in many past years, data saw unprecedented growth in volume and veracity. At this rate, current data analysis techniques would soon be obsolete. However, the upcoming trend in 2017 might focus on simplifying the data analysis process, to the extent where even non-coders can easily analyze huge datasets. Giants like Microsoft and Salesforce are working on it, while complementary tools to SQL, like Spark, will continue to make storage and access of data easier.

3) MACHINE LEARNING IS THE FUTURE

Not long ago, machine learning was considered purely a research field. For the benefit of all, this perception soon changed, and today machine learning has dedicated departments in numerous companies. For business purposes, the idea of machine learning is to serve as an extension to predictive analytics, thereby minimizing the work and maximizing the profits. This trend will continue to be one of the top business strategies in the future.

4) DATA-AS-A-SERVICE

Although it took a long, long time, companies today realize the importance of their data. This, in turn, is giving rise to an entirely new business model of data-as-a-service (DaaS). With IBM's acquisition of The Weather Channel, more tech giants may realize that their data can, in fact, be converted into a profitable service.

5) THE TRANSITION OF BIG DATA TO “ACTIONABLE DATA”

Big data will continue to face its existing challenges, the most prominent being the manpower required to handle the ever-increasing volume. Privacy concerns will also continue to haunt the general perception of the increased use of big data. Amidst all that is a new question: why worry about big data when most companies only use a fraction of it anyway? The answer to this question is giving rise to a new trend of "actionable data", data that is relevant to the business. It is entirely possible that big data may be replaced by actionable data in the upcoming years.


6) INTERNET-OF-THINGS


One of the most revolutionary digital concepts of this century, IoT still fascinates the masses, even if its application continues to face hurdles. But the rise and success of IoT is inevitable. With the rapid rate at which devices are becoming integral parts of our lives, IoT can offer us unmeasured potential. While the initial cost of turning every device into a node in a vast digital world is pretty high, it is estimated that IoT will grow by 30% in the next 5 years, creating an economic value of $4-11 trillion by 2025.


For more information about new courses, visit OnlineITGuru's big data and hadoop online training.


Friday, August 28, 2020

Spark Algorithm Tutorial in Big data hadoop

Welcome to the lesson ‘Spark Algorithm’ of the Big Data Hadoop Tutorial, which is part of the 'big data and hadoop online training' offered by OnlineItGuru.

In this lesson, you will learn about the kinds of processing and analysis that Spark supports. You will also learn about Spark machine learning and its applications, and how GraphX and MLlib work with Spark.

Let us look at the objectives of this lesson in the next section.

Objectives

After completing this lesson, you will be able to:

  • Describe the kinds of processing and analysis that Spark supports

  • Describe the applications of Spark machine learning

  • Explain how GraphX and MLlib work with Spark.

When does Spark work best?

Spark is an open-source cluster-computing framework that, for some applications, provides up to 100 times faster performance than Hadoop's disk-based, two-stage MapReduce, thanks to its in-memory primitives.

This fast performance makes Spark well suited to machine learning algorithms, as it allows programs to load data into a cluster's memory and query it repeatedly.

Spark works best in the following use cases:

  • There is a large amount of data that requires distributed storage

  • There is intensive computation that requires distributed computing

  • There are instances where iterative algorithms are present that require in-memory processing and pipelining.

Check out the big data hadoop course.

Uses of Apache Spark

Here are some examples where Spark is beneficial:

  • Spark helps you answer the question for risk analysis, “How likely is it that this borrower will pay back a loan?”

  • Spark can answer questions on recommendation such as, “Which products will this customer enjoy?”

  • It can help predict events by answering questions such as, “How can we prevent service outages instead of simply reacting to them?”

  • Spark helps to classify by answering the question, “How can we tell which mail is spam and which is legitimate?”

With these examples in mind, let’s delve more into the world of Spark. In the following sections, you will learn about how an iterative algorithm runs in Spark.

Spark: Iterative algorithm

Spark is designed for systems that are required to implement an iterative algorithm.

Let’s look at PageRank Algorithm, which is an iterative algorithm.

PageRank Algorithm: FEATURES

PageRank is an example of an iterative algorithm. It is one of the methods used to determine the relevance or importance of a webpage. It gives web pages a ranking score based on links from other pages.

A higher rank is given when the links are from many pages, and when the links are from high ranked pages. The algorithm outputs a probability distribution used to represent the likelihood that a person clicking on the links will arrive at a particular page.

PageRank is important because it is a classic example of big data analysis, like WordCount. As there is a lot of data, an algorithm is required that is distributable and scalable.

PageRank is iterative, which means the more iterations there are, the better the answer.

Let’s look at how the PageRank algorithm works.

PageRank Algorithm: WORKING

Start each page with a rank of 1.0. On each iteration, a page contributes to its neighbors its own rank divided by the number of its neighbors.

Here is the function:

contrib(p) = rank(p) / neighbors(p)

You can then set each page's new rank based on the sum of its neighbors' contributions. The function is given here:

new_rank = 0.15 + 0.85 * Σ contribs
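
As a small worked example (the numbers are invented for illustration): suppose page B receives a contribution of 1.0 / 2 = 0.5 from a page with rank 1.0 and two neighbors, and a contribution of 0.25 from another page. Its new rank becomes 0.15 + 0.85 * (0.5 + 0.25) = 0.7875.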

Each iteration incrementally improves the page ranking as shown in the below diagram.

[Figure: working of the PageRank algorithm]

In the next section, you will learn about graph-parallel system.

Graph Parallel System

Today, big graphs exist in various important applications, be it the web, advertising, or social networks. A few examples of such graphs are shown below.

[Figures: web graphs and user-item graphs in graph-parallel systems]

These graphs allow the users to perform tasks such as target advertising, identify communities, and decipher the meaning of documents. This is possible by modeling the relations between products, users, and ideas.

As the size and significance of graph data is growing, various new large-scale distributed graph-parallel frameworks, such as GraphLab, Giraph, and PowerGraph, have been developed. With each framework, a new programming abstraction is available. These abstractions allow the users to explain graph algorithms in a compact manner.

They also explain the related runtime engine that can execute these algorithms efficiently on distributed and multicore systems. Additionally, these frameworks abstract the issues of the large-scale distributed system design.

Therefore, they are capable of simplifying the design, application, and implementation of the new sophisticated graph algorithms to large-scale real-world graph problems.

Let’s look at a few limitations of the Graph-parallel system.

Limitations of Graph-Parallel System

Firstly, although the current graph-parallel frameworks have various common properties, each of them presents a different model of graph computation, custom-made for a specific family of graph applications and algorithms or for its original domain.

Secondly, all these frameworks depend on a different runtime, so it is tricky to compose these programming abstractions.

And finally, these frameworks cannot resolve the Extract, Transform, Load (ETL) issues, or the issues related to interpreting and applying the results of the computation.

The new graph-parallel system frameworks, however, have built-in support available for interactive graph computation.

Next, you will learn about Spark GraphX.

What is Spark GraphX?

Spark GraphX is a graph computation system that runs within the framework of the data-parallel system, which focuses on distributing data across different nodes that operate on it in parallel.

  • It addresses the limitations posed by the graph parallel system.

  • Spark GraphX is more of a real-time processing framework for the data that can be represented in a graph form.

  • Spark GraphX extends the Resilient Distributed Dataset or RDD abstraction and hence, introduces a new feature called Resilient Distributed Graph or RDG.

  • Spark GraphX simplifies the graph ETL and analysis process substantially by providing new operations for viewing, filtering, and transforming graphs.

Features of Spark GraphX

Spark GraphX has many features like:

  • Spark GraphX combines the benefits of graph parallel and data parallel systems as it efficiently expresses graph computations within the framework of the data parallel system.

  • Spark GraphX distributes graphs efficiently as tabular data structures by leveraging new ideas in their representations.

  • It uses in-memory computation and fault tolerance by leveraging the improvements of the data flow systems.

  • Spark GraphX also simplifies the graph construction and transformation process by providing powerful new operations.

With the use of these features, you can see that Spark is well suited for graph-parallel algorithm.

In the next topic, let's look at machine learning, which explores the study and construction of algorithms that can learn from and make predictions on data, its applications, and standard machine learning clustering algorithms like the k-means algorithm.

What is Machine Learning?

It is a subfield of artificial intelligence that has empowered various smart applications. It deals with the construction and study of systems that can learn from data.

For instance, machine learning can be used in medical diagnosis to answer a question such as "Is this cancer?"

It can learn from data and help diagnose a patient as a sufferer or not. Another example is fraud detection where machine learning can learn from data and provide an answer to a question such as "Is this credit card transaction fraudulent?"

Therefore, the objective of machine learning is to let a computer predict something. An obvious scenario is to predict an event in future. Apart from this, it can also predict unknown things or events.

This means something that has not been programmed or input into it. In other words, computers act without being explicitly programmed. Machine learning can be seen as a building block for making computers behave more intelligently.

History of Machine Learning

In 1959, Arthur Samuel defined machine learning as, "A field of study that gives computers the ability to learn without being explicitly programmed." Later, in 1997 Tom Mitchell gave another definition that proved more useful for engineering purposes.

"A machine learning computer program is said to learn from experience E concerning some class of tasks T and performance measure P. If its performance at tasks in T as measured by P improves with experience E."

As data plays a big part in machine learning, it's important to understand and use the right terminology when talking about data. This will help you to understand machine learning algorithms in general.

Common Terminologies in Machine Learning

The common terminologies in Machine Learning are :

  • Feature Vector

  • Samples

  • Feature Space

  • Labeled Data

Let's begin with the feature vector.

Feature Vector

A feature vector is an n-dimensional vector of numerical features that represents some object. A typical setting is a collection of objects or data points.

Each item in this collection is described by a number of features, such as categorical and continuous features.

Samples

Samples are the items to process. Examples include a row in a database, a picture, or a document.

Feature Space

Feature space refers to the collection of features that are used to characterize your data.

In other words, feature space refers to the n dimensions where your variables live.

If a feature vector has length D, each data point can be considered as being mapped to a D-dimensional vector space called the feature space.

Labeled Data

Labeled data is the data with known classification results. Once a labeled data set is obtained, you can apply machine learning models to the data so that the new unlabeled data can be presented to the model.

A likely label can be guessed or predicted for that piece of unlabeled data.

Features Space: Example

Here is an example of features of two apples: one is red, and the other is green.

In machine learning, an object is used. In this example, the object is an apple. The features of the object include color, type, and shape.

In the first instance, the color is red, the type is fruit, and the shape is round.

In the second instance, there is a change in the color feature of the apple, which is now green.

Applications of Machine Learning

As machine learning is related to data mining, it is a way to fine-tune a system with tunable parameters. It can also identify patterns that humans tend to overlook or are unable to find quickly among large amounts of data.

As machine learning is transforming a wide variety of industries, it is helping companies make discoveries and identify and remediate issues faster. Here are some interesting real-world applications of machine learning.

Speech recognition

Machine learning has improved speech recognition systems. Machine learning uses automatic speech recognition or ASR as a large-scale realistic application to rigorously test the effectiveness of a given technique.

Effective web search

Machine learning techniques such as Naive Bayes extract categories for a broad range of problems from a user-entered query in order to enhance the quality of results. The model is trained on query logs.

Recommendation systems

According to Wikipedia, recommendation systems are a subclass of information filtering system that seeks to predict the rating or preference that a user would give to an item.

Recommendation systems have been using machine learning algorithms to provide users with product or service recommendations.

Here are some more applications of machine learning.

Computer vision: Computer vision, which is an extension of AI and cognitive neuroscience, is the field of building computer algorithms to automatically understand the contents of images.

By collecting a training data set of images and hand labeling each image appropriately, you can use ML algorithm to work out which patterns of pixels are relevant to your recognition tasks and which are nuisance factors.

Information retrieval: Information retrieval systems provide access to millions of documents from which users can recover any one document by providing an appropriate description.

Algorithms that mine documents are based on machine learning. These learning algorithms use examples, attributes, and values which information retrieval systems can supply in abundance.

Fraud detection: Machine learning aids financial leaders to understand their customer transactions better and to rapidly detect fraud.

It helps in extracting and analyzing a variety of different data sources to identify anomalies in real time to stop fraudulent activities as they happen.


Thursday, August 27, 2020

What are the steps for MapReduce in big data?

 

What is MapReduce?

MapReduce is a data processing tool used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.

MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input. The reducer runs only after the mapper has finished. The reducer also takes input in key-value format, and the output of the reducer is the final output. For more info, visit: big data online training

Steps in Map Reduce

  • The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
  • Using the output of the map, the Hadoop architecture applies sort and shuffle. Sort and shuffle act on these lists of <key, value> pairs and send out each unique key together with the list of values associated with it, <key, list(values)>.
  • The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output <key, value> is stored or displayed, as illustrated in the example below.
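
As a small worked illustration (borrowing the word count example that appears later in this post), if the input contains the two lines "big data" and "big hadoop", the flow looks roughly like this:

Map output: (big, 1), (data, 1), (big, 1), (hadoop, 1)
Sort and shuffle: (big, [1, 1]), (data, [1]), (hadoop, [1])
Reduce output: (big, 2), (data, 1), (hadoop, 1)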


Sort and Shuffle

The sort and shuffle step occurs on the output of the Mapper and before the reducer. When the Mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper, <k2, v2>, we collect all the values for each unique key k2. This output from the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.

Usage of MapReduce

  • It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
  • It can be used for distributed pattern-based searching.
  • We can also use MapReduce in machine learning.
  • It was used by Google to regenerate Google's index of the World Wide Web.
  • It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environment.

Prerequisite

Before learning MapReduce, you must have the basic knowledge of Big Data.

Audience

Our MapReduce tutorial is designed to help beginners and professionals.

Problem

We assure you that you will not find any problems in this MapReduce tutorial. But if there is any mistake, please report the problem via the contact form.

Data Flow In MapReduce

MapReduce is used to process huge amounts of data. To handle the data in a parallel and distributed form, the data has to flow through various phases.


Phases of MapReduce data flow

Input reader

The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.

Once the input reader reads the data, it generates the corresponding key-value pairs. The input files reside in HDFS.

Map function

The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may be different from each other.

Partition function

The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value, and returns the index of the reducer.

Shuffling and Sorting

The data is shuffled between and within nodes so that it moves out of the map phase and gets ready to be processed by the reduce function. Sometimes, shuffling the data can take a lot of computation time.

The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted form.

Reduce function

The Reduce function is invoked for each unique key. The keys are already arranged in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.

Output writer

Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.

MapReduce API

In this section, we focus on MapReduce APIs. Here, we learn about the classes and methods used in MapReduce programming.

MapReduce Mapper Class

In MapReduce, the role of the Mapper class is to map the input key-value pairs to a set of intermediate key-value pairs. It transforms the input records into intermediate records.

These intermediate records, associated with a given output key, are passed to the Reducer for the final output.

 

MapReduce Word Count Example

In the MapReduce word count example, we find out the frequency of each word. Here, the role of the Mapper is to map the keys to the existing values, and the role of the Reducer is to aggregate the keys of common values. So, everything is represented in the form of a key-value pair. If you are interested, please visit: big data hadoop training

Pre-requisite

  • Java Installation - Check whether the Java is installed or not using the following command.
  • java -version
  • Hadoop Installation - Check whether the Hadoop is installed or not using the following command.
  • hadoop version

Steps to execute MapReduce word count example

  • Create a text file in your local machine and write some text into it.
  • $ nano data.txt


  • Check the text written in the data.txt file.
  • $ cat data.txt


In this example, we find out the frequency of each word that exists in this text file.

  • Create a directory in HDFS where the text file will be kept.
  • $ hdfs dfs -mkdir /test
  • Upload the data.txt file on HDFS in the specific directory.
  • $ hdfs dfs -put /home/codegyani/data.txt /test


  • Write the MapReduce program using Eclipse.

File: WC_Mapper.java

 

 

package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper: emits (word, 1) for every token in each input line.
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

File: WC_Reducer.java

 

 

package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer: sums the counts for each word and emits (word, total).
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

File: WC_Runner.java

 

 

package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Driver: configures and submits the word count job.
public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}


  • Create the jar file of this program and name it wordcountdemo.jar.
  • Run the jar file:
  • hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
  • The output is stored in /r_output/part-00000


  • Now execute the command to see the output.
  • hdfs dfs -cat /r_output/part-00000


For more info, visit: big data hadoop course