Posts

Essential AWS S3 Commands for Data Engineers

As a Data Engineer working with AWS S3, there are several essential commands you should be familiar with to manage and interact with your data stored in S3. Here are some commonly used AWS S3 commands:

Creating a Bucket: aws s3 mb s3://bucket-name
Creates a new S3 bucket with the specified bucket name.

Uploading Files: aws s3 cp file.txt s3://bucket-name/path/file.txt
Uploads a local file to the specified S3 bucket and path.

Downloading Files: aws s3 cp s3://bucket-name/path/file.txt file.txt
Downloads a file from the specified S3 bucket and path to the local machine.

Listing Buckets: aws s3 ls
Lists all the S3 buckets in your AWS account.

Listing Objects in a Bucket: aws s3 ls s3://bucket-name
Lists all the objects (files) in the specified S3 bucket.

Moving Files (Renaming): aws s3 mv s3://bucket-name/source-file.txt s3://bucket-name/destination-file.txt
Renames or moves a file within the same S3 bucket.

Del...
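The individual commands above can be chained into a small end-to-end workflow. A minimal sketch, assuming the AWS CLI is installed and configured with credentials; the bucket name and file paths are placeholders (S3 bucket names must be globally unique):

```shell
# Create a bucket (placeholder name; must be globally unique)
aws s3 mb s3://my-example-bucket

# Upload a local file to a key (path) in the bucket
aws s3 cp report.csv s3://my-example-bucket/raw/report.csv

# List the objects under that prefix
aws s3 ls s3://my-example-bucket/raw/

# Rename (move) the object within the same bucket
aws s3 mv s3://my-example-bucket/raw/report.csv s3://my-example-bucket/archive/report.csv

# Download it back to the local machine
aws s3 cp s3://my-example-bucket/archive/report.csv report.csv
```

Note that `aws s3 mv` and `aws s3 cp` also accept `--dryrun`, which prints what would happen without touching any objects.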

Big Data - Hive design level optimization - Bucketing

We use Hive to query and analyze data sets available on HDFS or AWS S3. When reading a specific subset out of terabytes or petabytes of data, the read should be efficient, and partitioning and bucketing are the best ways to make reads efficient and gain significant performance in Hive. Bucketing is one of the major optimization techniques to consider at table design time, along with partitioning. Bucketing is done on columns with high cardinality; that is, we can go for bucketing when a column has a large number of distinct values. Bucketing splits the data into manageable parts by applying a hash function to the bucketing column. We specify the number of buckets while creating the bucketed table. Each bucket is stored as a file within the table's directory or the p...
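As a sketch of the idea, a bucketed table can be declared with CLUSTERED BY ... INTO n BUCKETS. The table and column names below are hypothetical; note that Hive releases before 2.0 also required hive.enforce.bucketing to be set before populating the table:

```sql
-- Hypothetical orders table, bucketed on the high-cardinality user_id column
CREATE TABLE orders_bucketed (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DOUBLE
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Needed on Hive versions before 2.0 so inserts honor the bucket count
SET hive.enforce.bucketing = true;

INSERT INTO TABLE orders_bucketed
SELECT order_id, user_id, amount FROM orders_raw;
```

Each of the 32 buckets becomes a separate file, and rows land in a bucket based on the hash of user_id, which is what enables efficient sampling and bucketed map joins.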

Big Data - Hive design level optimization - Partitioning (Static & Dynamic )

We use Hive to query and analyze data sets available on HDFS or AWS S3. When reading a specific subset out of terabytes or petabytes of data, the read should be efficient, and partitioning is the best way to make reads efficient and gain significant performance in Hive. Partitioning is one of the major optimization techniques to consider at table design time. Partitioning is done on columns with low cardinality; that is, we can go for partitioning when a column has a small number of distinct values. Partitioning works by dividing the data into smaller logical segments, separating records into manageable parts based on column values, so we finally scan less data w...
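The static and dynamic flavors can be sketched as follows (table and column names are hypothetical; dynamic partitioning requires the two SET properties shown, and the partition column must come last in the SELECT):

```sql
-- Hypothetical sales table, partitioned on the low-cardinality country column
CREATE TABLE sales (
  sale_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (country STRING);

-- Static partitioning: the partition value is stated explicitly in the INSERT
INSERT INTO TABLE sales PARTITION (country = 'IN')
SELECT sale_id, amount FROM sales_staging WHERE country = 'IN';

-- Dynamic partitioning: Hive derives the partition value from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (country)
SELECT sale_id, amount, country FROM sales_staging;
```

Each country value becomes its own subdirectory under the table's location, so a query filtered on country reads only the matching directory instead of the whole table.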

Big Data - Hive Optimization

Hive runs on top of Hadoop and is widely used in industry to run queries and do analysis on petabytes of data (big data). There is definitely a need to consider performance when we do any sort of development on Hive, which ultimately transforms and processes a huge amount of data for analytics purposes. There are several Hive optimization techniques we can use to improve Hive performance. Hive optimization can broadly be classified into three major categories:

1) Table Design Level (Structure Level)
2) Query Level
3) Execution Level

1) Table Design Level (Structure Level): Design-level optimization is what we need to consider while designing and defining the structure of tables. A few very important optimization concepts at this level are Partitioning, Bucketing, Specialized File Format, and various Compression ...
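One representative knob from each of the three categories, as a hedged sketch (the table name is hypothetical, and exact property names and defaults vary by Hive version):

```sql
-- 1) Table design level: a splittable, compressed columnar file format
CREATE TABLE events (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- 2) Query level: gather statistics so the cost-based optimizer can use them
SET hive.cbo.enable = true;
ANALYZE TABLE events COMPUTE STATISTICS;

-- 3) Execution level: vectorized execution and parallel independent stages
SET hive.vectorized.execution.enabled = true;
SET hive.exec.parallel = true;
```

The point of the split is that design-level choices are made once per table, while query- and execution-level settings can be tuned per workload without rewriting data.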