Posts

Essential AWS S3 Commands for Data Engineers

As a Data Engineer working with AWS S3, there are several essential commands you should be familiar with to manage and interact with your data stored in S3. Here are some commonly used AWS S3 commands:

Creating a Bucket: aws s3 mb s3://bucket-name
Creates a new S3 bucket with the specified bucket name.

Uploading Files: aws s3 cp file.txt s3://bucket-name/path/file.txt
Uploads a local file to the specified S3 bucket and path.

Downloading Files: aws s3 cp s3://bucket-name/path/file.txt file.txt
Downloads a file from the specified S3 bucket and path to the local machine.

Listing Buckets: aws s3 ls
Lists all the S3 buckets in your AWS account.

Listing Objects in a Bucket: aws s3 ls s3://bucket-name
Lists all the objects (files) in the specified S3 bucket.

Moving Files (Renaming): aws s3 mv s3://bucket-name/source-file.txt s3://bucket-name/destination-file.txt
Renames or moves a file within the same S3 bucket.

Del...
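The individual commands above can be chained into a small end-to-end workflow. A minimal sketch, assuming the AWS CLI is installed and configured with credentials; the bucket name and file paths are placeholders (S3 bucket names must be globally unique):

```shell
# Create a bucket (placeholder name; must be globally unique)
aws s3 mb s3://my-example-bucket

# Upload a local file to a key (path) in the bucket
aws s3 cp report.csv s3://my-example-bucket/raw/report.csv

# List the objects under that prefix
aws s3 ls s3://my-example-bucket/raw/

# Rename (move) the object within the same bucket
aws s3 mv s3://my-example-bucket/raw/report.csv s3://my-example-bucket/archive/report.csv

# Download it back to the local machine
aws s3 cp s3://my-example-bucket/archive/report.csv report.csv
```

Note that `aws s3 mv` and `aws s3 cp` also accept `--dryrun`, which prints what would happen without touching any objects.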

Big Data - Hive design level optimization - Bucketing

We use Hive to query and analyze data sets available on HDFS or AWS S3. When reading a specific subset out of terabytes or petabytes of data, the read should be efficient, and partitioning and bucketing are the best ways to make reads efficient and gain significant performance in Hive. Bucketing is one of the major optimization techniques to consider at table design time, along with partitioning. Bucketing is done on columns with high cardinality; that is, we can go for bucketing when a column has a large number of distinct values. Bucketing splits the data into manageable parts by applying a hash function to the bucketing column. We specify the number of buckets while creating the bucketed table. Each bucket is stored as a file within the table's directory or the p...
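As a sketch of the idea, a bucketed table can be declared with CLUSTERED BY ... INTO n BUCKETS. The table and column names below are hypothetical; note that Hive releases before 2.0 also required hive.enforce.bucketing to be set before populating the table:

```sql
-- Hypothetical orders table, bucketed on the high-cardinality user_id column
CREATE TABLE orders_bucketed (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DOUBLE
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Needed on Hive versions before 2.0 so inserts honor the bucket count
SET hive.enforce.bucketing = true;

INSERT INTO TABLE orders_bucketed
SELECT order_id, user_id, amount FROM orders_raw;
```

Each of the 32 buckets becomes a separate file, and rows land in a bucket based on the hash of user_id, which is what enables efficient sampling and bucketed map joins.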

Big Data - Hive design level optimization - Partitioning (Static & Dynamic )

We use Hive to query and analyze data sets available on HDFS or AWS S3. When reading a specific subset out of terabytes or petabytes of data, the read should be efficient, and partitioning is the best way to make reads efficient and gain significant performance in Hive. Partitioning is one of the major optimization techniques to consider at table design time. Partitioning is done on columns with low cardinality; that is, we can go for partitioning when a column has a small number of distinct values. Partitioning works by dividing the data into smaller logical segments, separating records into manageable parts based on column values, so we finally scan less data w...
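The static and dynamic flavors can be sketched as follows (table and column names are hypothetical; dynamic partitioning requires the two SET properties shown, and the partition column must come last in the SELECT):

```sql
-- Hypothetical sales table, partitioned on the low-cardinality country column
CREATE TABLE sales (
  sale_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (country STRING);

-- Static partitioning: the partition value is stated explicitly in the INSERT
INSERT INTO TABLE sales PARTITION (country = 'IN')
SELECT sale_id, amount FROM sales_staging WHERE country = 'IN';

-- Dynamic partitioning: Hive derives the partition value from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (country)
SELECT sale_id, amount, country FROM sales_staging;
```

Each country value becomes its own subdirectory under the table's location, so a query filtered on country reads only the matching directory instead of the whole table.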

Big Data - Hive Optimization

Hive runs on top of Hadoop and is widely used in industry to run queries and do analysis on petabytes of data (big data). There is definitely a need to consider performance when we do any sort of development on Hive, which ultimately transforms and processes a huge amount of data for analytics purposes. There are several Hive optimization techniques we can use to improve Hive performance. Hive optimization can broadly be classified into three major categories:

1) Table Design Level (Structure Level)
2) Query Level
3) Execution Level

1) Table Design Level (Structure Level): Design-level optimization is what we need to consider while designing and defining the structure of tables. A few very important optimization concepts at this level are Partitioning, Bucketing, Specialized File Format, and various Compression ...
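One representative knob from each of the three categories, as a hedged sketch (the table name is hypothetical, and exact property names and defaults vary by Hive version):

```sql
-- 1) Table design level: a splittable, compressed columnar file format
CREATE TABLE events (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- 2) Query level: gather statistics so the cost-based optimizer can use them
SET hive.cbo.enable = true;
ANALYZE TABLE events COMPUTE STATISTICS;

-- 3) Execution level: vectorized execution and parallel independent stages
SET hive.vectorized.execution.enabled = true;
SET hive.exec.parallel = true;
```

The point of the split is that design-level choices are made once per table, while query- and execution-level settings can be tuned per workload without rewriting data.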