Read and Write Operations for Hadoop HDFS Data
The storage layer of Hadoop is HDFS, the Hadoop Distributed File System, and it is one of the most reliable storage systems available. HDFS works in a master-slave fashion: the name node is the master daemon that runs on the master node, and the data nodes are the slave daemons that run on the slave nodes.
Before you start using HDFS, you need to install Hadoop, so I advise you to do that first.
Here we are going to cover the read and write operations of HDFS. Let's first talk about the HDFS file write operation, followed by the HDFS file read operation.
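For instance, a minimal client-side sketch (the name node host and port below are placeholders, not values from this article) that connects to the name node through the Hadoop FileSystem API might look like this:
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnect {
    public static void main(String[] args) throws IOException {
        // Point the client at the name node (master); host and port are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The client contacts the name node only for metadata; the actual file
        // data is written to and read from the data nodes (slaves) directly.
        FileSystem fileSystem = FileSystem.get(conf);
        System.out.println("Connected to " + fileSystem.getUri());
        fileSystem.close();
    }
}[/php]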
Hadoop HDFS Data Write Operation
To write a file in HDFS, a client needs to interact with the name node (master). The name node provides the addresses of the data nodes (slaves) on which the client will write the data. The client writes the data directly to the data nodes, and the data nodes build a pipeline for the data write.
The first data node copies the block to the second data node, which internally copies it to the third data node. Once the replicas of the block are created, an acknowledgment is sent back.
a. HDFS Data Write Pipeline Workflow in Hadoop
Let's now walk through the complete HDFS data write pipeline end to end.
(i) The HDFS client sends a create request through the Distributed File System APIs.
(ii) Distributed File System makes a name node RPC call to create a new file in the namespace of the file system.
The name node performs several checks to ensure that the file does not already exist and that the client has permission to create it. Only when these checks pass does the name node make a record of the new file; otherwise, file creation fails and an IOException is thrown to the client. Read in-depth about Hadoop HDFS Architecture, too.
(iii) The Distributed File System returns an FSDataOutputStream for the client to start writing data to. As the client writes data, DFSOutputStream splits it into packets and writes them to an internal queue called the data queue.
(iv) Hadoop forms a pipeline out of a list of data nodes; here we assume the replication factor is three, so there are three data nodes in the pipeline. The packets are streamed to the first data node in the pipeline, which stores each packet and forwards it to the second data node; similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline. Read in-depth about HDFS Data Blocks.
(v) A packet is removed from the ack queue only after it has been acknowledged by the data nodes in the pipeline. A data node sends the acknowledgment once the required replicas are created (3 by default; a short sketch after these steps shows how the replication factor can be set per file). Similarly, all the blocks are stored and replicated on the various data nodes, and the data blocks are copied in parallel.
(vi) When the client has finished writing data, it calls close() on the stream.
(vii) This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
From the following diagram, we can summarise the HDFS data writing operation.
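As noted in steps (iv) and (v) above, the pipeline length equals the file's replication factor (3 by default, taken from the dfs.replication property). The following is a minimal sketch, not part of the original article, showing how a client can choose a replication factor per file; the path, buffer size, and block size are example values:
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        Path path = new Path("/path/to/file.ext"); // example path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // a replication factor of 3 means a write pipeline of three data nodes.
        FSDataOutputStream out =
                fileSystem.create(path, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("sample data");
        out.close();

        // The replication factor of an existing file can also be changed later.
        fileSystem.setReplication(path, (short) 2);
        fileSystem.close();
    }
}[/php]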
b. How to Write a File in Hadoop HDFS - Java Program
The following is a sample Java program to write a file in HDFS (follow this HDFS Commands Part 1 to interact with HDFS and perform various operations):
[php]import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);

        String source = "/local/path/to/file.ext"; // local source file (example path)
        String dest = "/path/to/file.ext";         // destination in HDFS (example path)

        // Check if the file already exists
        Path path = new Path(dest);
        if (fileSystem.exists(path)) {
            System.out.println("File " + dest + " already exists");
            return;
        }

        // Create a new file and write data to it
        FSDataOutputStream out = fileSystem.create(path);
        InputStream in = new BufferedInputStream(new FileInputStream(new File(source)));

        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            out.write(b, 0, numBytes);
        }

        // Close all the file descriptors
        in.close();
        out.close();
        fileSystem.close();
    }
}[/php]
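The manual copy loop above can also be replaced by Hadoop's IOUtils helper. This is just an alternative sketch of the same write, using the same example source and destination paths as the program above:
[php]import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileWriterWithIOUtils {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);

        InputStream in = new BufferedInputStream(
                new FileInputStream("/local/path/to/file.ext")); // example local path
        FSDataOutputStream out = fileSystem.create(new Path("/path/to/file.ext"));

        // Copy the local file into HDFS in 4 KB chunks and close both streams.
        IOUtils.copyBytes(in, out, 4096, true);
        fileSystem.close();
    }
}[/php]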
Hadoop HDFS Data Read Operation
To read a file from HDFS, a client needs to interact with the name node (master), because the name node is the centerpiece of the Hadoop cluster (it stores all the metadata, i.e. data about the data). The name node checks for the required privileges, and if the client has sufficient privileges, it provides the addresses of the slaves where the file is stored. The client can then interact directly with the respective data nodes to read the data blocks.
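To illustrate the metadata the name node hands back, here is a small sketch (with an example path, not from the original article) that asks which data nodes store each block of a file:
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        FileStatus status = fileSystem.getFileStatus(new Path("/path/to/file.ext")); // example path

        // For every block, the name node reports the data nodes (slaves) holding a replica.
        BlockLocation[] blocks = fileSystem.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fileSystem.close();
    }
}[/php]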
a. HDFS File Read Workflow in Hadoop
Let's now understand the complete HDFS data read operation end to end. The read process is distributed: the client reads data from the data nodes in parallel. The data read cycle is explained below, step by step.
(i) The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of Distributed File System. See the HDFS Data Read Process as well.
(ii) The Distributed File System uses an RPC call to the name node to determine the locations of the first few blocks in the file.
(iii) The Distributed File System returns an FSDataInputStream to the client, from which it can read data. FSDataInputStream wraps a DFSInputStream, which manages the data node and name node I/O. The client calls read() on the stream. The DFSInputStream, which has stored the data node addresses, then connects to the closest data node for the first block in the file.
(iv) Data is streamed from the data node back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. Learn about the HDFS data write operation as well.
(v) If DFSInputStream encounters an error while communicating with a data node, it tries the next closest data node for that block. It also remembers data nodes that have failed so that it does not needlessly retry them for later blocks. DFSInputStream also verifies checksums for the data transferred to it from the data node. If it finds a corrupt block, it reports this to the name node before attempting to read a replica of the block from another data node.
(vi) When the client has finished reading the data, it calls close() on the stream.
From the following diagram, we can summarise the HDFS data reading operation.
b. How to Read a File from Hadoop HDFS - Java Program
The following is a sample Java program to read a file from HDFS (follow this HDFS Commands Part 3 to perform HDFS read and write operations):
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);

        Path path = new Path("/path/to/file.ext");
        if (!fileSystem.exists(path)) {
            System.out.println("File does not exist");
            return;
        }

        FSDataInputStream in = fileSystem.open(path);
        byte[] b = new byte[1024];
        int numBytes = 0;
        while ((numBytes = in.read(b)) > 0) {
            System.out.print(new String(b, 0, numBytes)); // process the data that was read (here we just print it)
        }

        // Close all the file descriptors
        in.close();
        fileSystem.close();
    }
}[/php]
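Because DFSInputStream keeps the block locations returned by the name node, FSDataInputStream also supports random access, not just the sequential loop above. A minimal sketch with the same example path (it assumes the file is at least a couple of kilobytes long):
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRandomRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        FSDataInputStream in = fileSystem.open(new Path("/path/to/file.ext")); // example path

        // Seek to an absolute offset and read from there.
        in.seek(0);
        int firstByte = in.read();
        System.out.println("First byte: " + firstByte);

        // Positioned read: fill the buffer from byte offset 1024
        // without moving the stream's current position.
        byte[] buffer = new byte[512];
        in.readFully(1024, buffer);

        in.close();
        fileSystem.close();
    }
}[/php]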
HDFS Fault Tolerance in Hadoop
What happens if a data node in the pipeline fails while data is being written to it? Hadoop has an advanced feature to handle this situation (HDFS is fault-tolerant). If a data node fails during a write, the following steps are taken, and they are transparent to the client writing the data (a short sketch after these steps shows how to check a file's replication from the client side).
The current block on the good data nodes is given a new identity, which is communicated to the name node so that the partial block on the failed data node is deleted if that data node recovers later. Read about High Availability in the HDFS Name Node as well.
The failed data node is removed from the pipeline, and the rest of the block's data is written to the two good data nodes in the pipeline.
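As a small client-side illustration (not from the original article; the path is an example), you can compare the replication factor recorded for a file with the replicas the name node currently reports; after a data node failure, the name node notices under-replicated blocks and re-replicates them in the background:
[php]import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationCheck {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        FileStatus status = fileSystem.getFileStatus(new Path("/path/to/file.ext")); // example path
        short target = status.getReplication(); // replication factor requested for this file

        BlockLocation[] blocks = fileSystem.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            int live = block.getHosts().length; // data nodes currently reported for this block
            if (live < target) {
                System.out.println("Block at offset " + block.getOffset()
                        + " is under-replicated: " + live + "/" + target);
            }
        }
        fileSystem.close();
    }
}[/php]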
Conclusion
In conclusion, this design allows HDFS to scale to a large number of concurrent clients, because data traffic is spread across all the data nodes in the cluster. HDFS also offers high availability, rack awareness, erasure coding, and more; as a result, it empowers Hadoop.
If you like this post or have any queries about HDFS data read and write operations, please leave a comment, and we will get them resolved. You can learn more through Big Data and Hadoop Training.