Thursday, July 2, 2020

Migrating HDFS Data to Google Cloud Storage

As enterprises shift to the cloud, one of the key problems they face is migrating the existing data out of their Hadoop clusters.
There are many ways to migrate data from HDFS to different cloud storage services, such as Amazon S3 on AWS, ADLS and WASB on Azure, and GCS on Google Cloud.
In this article, I will discuss different approaches to migrating data to Google Cloud.
There are three ways to move data from an HDFS cluster to GCS:
  1. Using a Dataproc cluster
  2. Using gsutil
  3. Using the Cloud Storage connector
In the first approach, we spin up a Dataproc cluster and then move the data between our Hadoop cluster and Dataproc using DistCp.
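As a rough sketch (the host names, ports, and paths below are placeholders, not from any real setup), DistCp can be run on the Dataproc cluster, which ships with the Cloud Storage connector preinstalled, to pull data from the on-premises NameNode straight into a bucket:
hadoop distcp \
  hdfs://onprem-namenode:8020/Data_You_Want_To_Copy \
  gs://your_bucket_name/Data_You_Want_To_Copy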
In the second approach, we first copy the data to local storage on a node and then move it to GCS using the gsutil command. This becomes a problem when we need to move a large amount of data.
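A minimal sketch of this approach, again with placeholder paths and bucket names:
hadoop fs -get hdfs://<NAME_NODE>:<PORT>/Data_You_Want_To_Copy /tmp/Data_You_Want_To_Copy
gsutil -m cp -r /tmp/Data_You_Want_To_Copy gs://your_bucket_name/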
I will cover the third approach in detail: migrating data from HDFS to GCS using the Cloud Storage connector.


Google Cloud Storage Connector¹

The Cloud Storage Connector is an open-source Java client library that runs in Hadoop JVMs (like data nodes, mappers, reducers, Spark executors, and more) and allows your workloads to access Cloud Storage. The connector lets your big data open-source software [such as Hadoop and Spark jobs, or the Hadoop Compatible File System (HCFS) CLI] read/write data directly to Cloud Storage.¹

Cloud Storage Connector Architecture¹


Cloud Storage Connector is an open-source Apache 2.0 implementation of an HCFS interface for Cloud Storage. Architecturally, it is composed of four major components:
  • gcs — implementation of the Hadoop Distributed File System and input/output channels
  • util-hadoop — common (authentication, authorization) Hadoop-related functionality shared with other Hadoop connectors
  • gcsio — the high-level abstraction of Cloud Storage JSON API¹
  • util — utility functions (error handling, HTTP transport configuration, etc.) used by gcs and gcsio components

Configuring Access to Google Cloud Storage

  1. Download the Cloud Storage connector from Google’s official site.
  2. Copy the jar file to your $HADOOP_COMMON_LIB_JARS_DIR directory.
  3. Create a service account and download its P12 key for authenticating against the GCS API (a gcloud sketch follows the configuration below).
  4. Copy the P12 key to your node.
  5. Add the following entries to the core-site.xml file in your Hadoop config directory.
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your_project_id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>your_bucket_name</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>user@your_gcp_service_acct.iam.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>p12_key_location</value>
</property>
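
For step 3 above, the service account and its P12 key can be created with the gcloud CLI. This is only a hedged sketch: the account name is a placeholder, your_project_id and your_bucket_name are the same placeholders used in the configuration, and the role you grant should match your own security policy.
gcloud iam service-accounts create hdfs-migration \
  --project=your_project_id \
  --display-name="HDFS to GCS migration"
gcloud iam service-accounts keys create p12_key_location \
  --iam-account=hdfs-migration@your_project_id.iam.gserviceaccount.com \
  --key-file-type=p12
gsutil iam ch \
  serviceAccount:hdfs-migration@your_project_id.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://your_bucket_name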

6. The fs.gs.system.bucket property will be used by MapReduce for storing temporary files.
7. Now you should be able to access GCS from your Hadoop shell:
hadoop fs -ls gs://<Bucket_You_Want_To_List>/dir/
8. To move data, you can simply run the copy command from your Hadoop shell:
hadoop fs -cp hdfs://<HOST_NAME>:<PORT>/Data_You_Want_To_Copy gs://<Bucket_name>
9. You can speed up the copy process by using DistCp; first sync lib/gcs-connector.jar and conf/core-site.xml to all your Hadoop nodes (a sketch of the sync follows the command below), then run:
hadoop distcp hdfs://<NAME_NODE>/Data_You_Want_To_Copy gs://<Bucket_name>
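One possible way to push the jar and the config to every node; the host names and remote paths are placeholders, and any tool such as scp, rsync, or pdsh would work equally well:
for host in worker1 worker2 worker3; do
  scp lib/gcs-connector.jar "$host":/path/to/hadoop/share/hadoop/common/lib/
  scp conf/core-site.xml "$host":/path/to/hadoop/etc/hadoop/
done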

10. You can also run all your MapReduce and Spark jobs directly against GCS.
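For example, the stock Hadoop wordcount example can read its input from and write its output to the bucket (the examples jar path and the bucket paths below are placeholders):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
  gs://your_bucket_name/wordcount/input gs://your_bucket_name/wordcount/output
Spark jobs can address gs:// paths the same way, anywhere they would normally take an hdfs:// path.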

