Friday, July 31, 2020

Explain Hive concept and Data storage in Hadoop

This article explains what Hive partitioning is, why partitioning is needed, and how it improves performance. Partitioning is an optimization technique in Hive that can significantly improve query efficiency. Apache Hive is a data warehouse built on top of Hadoop that enables ad-hoc analysis over structured and semi-structured data. We will go into depth on Hive partitions, how to create them, and the types of Hive partitioning. But first, let's look at data storage in Hadoop.
Data storage in Hadoop Distributed file system
Hive is considered a tool of choice for performing queries on large datasets stored in the Hadoop Distributed File System (HDFS), especially queries that require full table scans. Hive also has advanced partitioning features. Partitioning Hive data files is very useful for pruning data during a query, which reduces query times. There are many cases where users need to filter the data on specific column values.
• Using Hive's partitioning feature, which subdivides the data, users can identify the columns that should be used to organize the data.
• With partitioning, work is carried out only on the relevant subset of the data, which results in substantially better performance for Hive queries.
You'll read more about the partitioning feature in the sections below. The diagram below shows data storage in a single Hadoop Distributed File System (HDFS) directory.
[Figure: data storage in a single Hadoop Distributed File System directory]
What is Partition in Hive?
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns, such as date, city, and department. Each table in Hive can have one or more partition keys that identify a particular partition. Partitions make it easy to run queries on slices of the data in Hadoop.
Importance of Hive Partitioning in Hadoop 
We know that enormous amounts of data, in the range of petabytes, are now being stored in HDFS. That makes it very hard for Hadoop users to query this massive volume of data.
Hive was introduced to lower this data querying burden. Apache Hive converts SQL queries into MapReduce jobs and then submits them to the Hadoop cluster. When we submit a SQL query against an unpartitioned table, Hive reads the entire data set, so running MapReduce jobs over a large table is inefficient. Creating partitions in tables solves this. Apache Hive makes implementing partitions simple: the partitioning scheme is declared when the table is created, and Hive manages the partitions from there.
With partitioning, all of the table data is divided into multiple partitions. Each partition corresponds to specific value(s) of the partition column(s) and is kept as a sub-directory inside the table's directory in HDFS. Thus, when a table is queried, only the partition that contains the queried value is read. This reduces the I/O time the query needs, and hence the query runs faster.
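To make the sub-directory idea concrete, here is a minimal HiveQL sketch. The table name `sales` and its columns are hypothetical and chosen only for illustration, and the exact warehouse path depends on your installation.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS sales (order_id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- Register two partitions. Each one gets its own sub-directory under the
-- table's directory, named <partition column>=<value>, for example:
--   .../sales/country=IN/
--   .../sales/country=US/
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (country = 'IN');
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (country = 'US');

-- List the partitions Hive knows about for this table.
SHOW PARTITIONS sales;

-- Show the metadata of one partition, including its HDFS location.
DESCRIBE FORMATTED sales PARTITION (country = 'IN');
```

Note that the partition column is not stored inside the data files; its value comes from the directory name.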
Create Partitions in Hive
Now let's understand data partitioning in Hive with an example. Consider a table called Tab1. The table contains client information such as id, name, department, and year of joining (yoj). Suppose we have to retrieve the data of all clients who joined in 2012. The query will then scan the entire table for the required information. But if we partition the client data by year and store each year in a separate file, the query processing time will be reduced. The example below shows a file and its data before partitioning. Here, file1 contains the client data table.
Tab1/clientdata/file1
id, name, dept, yoj
balajee, SC, 2009
prashanth, HR, 2009
narayana, SC, 2010
Once the data is partitioned, only the data of the specified partition is queried when we retrieve data from the table. A partitioned table can be created like this:
CREATE TABLE tab1 (id INT, name STRING, dept STRING, yoj INT)
PARTITIONED BY (year STRING);
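As a rough sketch of how the partition is then used, the query from the example (all clients who joined in 2012) can filter on the partition column. This assumes the tab1 table above already has a year=2012 partition loaded.

```sql
-- Clients who joined in 2012. The filter is on the partition column `year`,
-- so Hive can restrict the read to the year=2012 partition.
SELECT id, name, dept
FROM tab1
WHERE year = '2012';

-- EXPLAIN DEPENDENCY lists the tables and partitions a query would read,
-- which is one way to check that only the expected partition is touched.
EXPLAIN DEPENDENCY
SELECT id, name, dept
FROM tab1
WHERE year = '2012';
```

A filter on an ordinary column such as dept would not narrow the set of partitions, so every partition would still be read.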
Types of hive partitions
So far we have discussed how to create Hive partitions. Now we will look at the types of data partitioning in Hive. Apache Hive has two types of partitioning:
• Static Partitioning
• Dynamic Partitioning
Let's discuss these types of Hive partitioning one by one.
Hive Static Partitioning
• Input data files are inserted into a partitioned table individually.
• Static partitioning is usually preferred when loading large files into Hive tables.
• Static partitioning saves loading time compared to dynamic partitioning.
• You "statically" add a partition to the table and move the file into that partition.
• Static partitions can be altered.
• You can get the partition column value from the file name, a date, and so on, without reading the whole large file.
• Static partitioning works in strict mode (hive.exec.dynamic.partition.mode = strict), which is the default in many Hive releases. The related property hive.mapred.mode = strict, which can be set in hive-site.xml, makes Hive reject queries on a partitioned table that do not filter on a partition column.
• With strict mode on, use a WHERE clause on the partition column when you use a LIMIT clause on a statically partitioned table.
• Static partitioning can be performed on Hive managed tables or external tables.
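The points above can be sketched in HiveQL roughly as follows, using the tab1 table from the earlier example. The HDFS path and the `clients_staging` table are hypothetical and stand in for wherever your source data lives; the loaded file is assumed to match the table's storage format.

```sql
-- Load one file into an explicitly named partition. With an HDFS path,
-- LOAD DATA moves the file into the partition's directory.
LOAD DATA INPATH '/user/hive/staging/clients_2009.txt'  -- hypothetical path
INTO TABLE tab1 PARTITION (year = '2009');

-- Alternatively, insert from a staging table. The partition value is
-- fixed in the statement itself, which is what makes it "static".
INSERT INTO TABLE tab1 PARTITION (year = '2010')
SELECT id, name, dept, yoj
FROM clients_staging          -- hypothetical table with the same four columns
WHERE yoj = 2010;

-- Static partitions can be added and dropped by hand.
ALTER TABLE tab1 ADD IF NOT EXISTS PARTITION (year = '2011');
ALTER TABLE tab1 DROP IF EXISTS PARTITION (year = '2011');
```

Because the partition value is written in the statement, Hive does not have to inspect the rows to decide where they go.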
Hive Dynamic Partitioning
• Dynamic partitioning populates a partitioned table with a single insert.
• Dynamic partitioning usually loads data from a non-partitioned table.
• Dynamic partitioning takes longer to load data than static partitioning.
• If a large amount of data is already stored in a table, dynamic partitioning is suitable.
• If you want to partition on one or more columns but do not know in advance how many partitions there will be, dynamic partitioning is also suitable.
• In non-strict mode, a WHERE clause is not required in order to use a LIMIT clause.
• Dynamic partitions are not added by hand with ALTER TABLE; Hive creates them during the insert.
• Dynamic partitioning can be performed on Hive external tables and managed tables.
• To use fully dynamic partitioning, Hive must be in non-strict mode (hive.exec.dynamic.partition.mode = nonstrict); strict mode requires at least one static partition column.
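A minimal sketch of a dynamic partition insert into the tab1 table from the earlier example is shown below. The unpartitioned source table `clients_staging` is hypothetical and is assumed to hold the columns id, name, dept, and yoj.

```sql
-- Enable dynamic partitioning and allow every partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- A single insert: no partition value is given in the PARTITION clause.
-- Hive takes it from the last column of the SELECT and creates one
-- partition per distinct value it finds.
INSERT INTO TABLE tab1 PARTITION (year)
SELECT id, name, dept, yoj, CAST(yoj AS STRING) AS year
FROM clients_staging;         -- hypothetical unpartitioned source table
```

The dynamic partition column has to come last in the SELECT list, because Hive matches it by position rather than by name.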
Hive Partitioning-Advantages and Disadvantages 
Let's address some advantages and weaknesses of Apache Hive Partitioning.
Hive Partitioning Advantages
• Hive partitioning distributes the execution load horizontally.
• Queries execute faster when a partition holds a low volume of data. For instance, looking up the population of Vatican City returns very quickly compared with searching the population of the entire world.
• There is no need to scan the entire table to find a single record.
Hive Partitioning Drawbacks
• It is possible to create too many small partitions, which means too many directories.
• Partitioning is effective when each partition holds a low volume of data. But some queries, such as a GROUP BY over a large volume of data, still take a long time to execute. For example, grouping the population of China takes much longer than grouping the population of Vatican City.
So, that was all about Hive partitions. Hope you liked the article.
Conclusion
I hope this article helps you understand what Hive partitioning is, what Hive static partitioning is, and what Hive dynamic partitioning is. We have also discussed the advantages and disadvantages of Hive partitioning. For more information on Hive partitioning and data storage, you can go to big data and Hadoop Online Training.

