BigData-Assignment4-CSP 554

Uploaded by

emile.mondon.r

CSP 554 – Big Data Technologies

Exercise 1

This article explores optimization techniques for improving query performance in Hive-based Big Data
Warehouses (BDWs). The authors focus on two main data organization strategies, partitioning and
bucketing, and test them to assess their impact on performance. Both strategies divide large datasets
into smaller, more manageable parts.

The article rests on three key concepts:

1. Big Data Warehousing (BDW): BDWs differ from traditional data warehouses in that they offer
higher scalability, performance, and flexibility. They can be built with tools such as Hive on
Hadoop, which provide storage on distributed systems (like HDFS) as well as querying
capabilities.

2. Partitioning in Hive: Partitioning splits a table according to attributes that appear often in
queries (for example, year or region). This technique is beneficial because a query that filters
on the partitioning attribute only has to read the matching partitions, which reduces the amount
of data scanned and, in turn, the processing time. The article reports reductions in processing
time of up to 40-50% in some cases.

3. Combining Partitioning and Bucketing: Partitioning is very useful on its own, but it can also be
combined with bucketing to optimize performance further. The results in the article indicate
that partitioning by frequently queried attributes provides the most benefit, while bucketing is
beneficial only in specific cases, such as joins.
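
The effect of partitioning described above can be illustrated with a minimal Python sketch. It is not Hive code and not taken from the article: the sample rows, the `partition_by` helper, and the `total_for_year` helper are all hypothetical. It only shows the idea that a query filtering on the partitioning attribute reads the matching partition instead of the whole table.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, region, amount). Not data from the article.
ROWS = [
    (2021, "EU", 10), (2021, "US", 20),
    (2022, "EU", 30), (2022, "US", 40),
    (2023, "EU", 50), (2023, "US", 60),
]

def partition_by(rows, key_index):
    """Group rows by one attribute, similar to one storage directory per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)

def total_for_year(partitions, year):
    """Sum the amounts for one year, reading only the matching partition."""
    scanned = partitions.get(year, [])
    total = sum(amount for _, _, amount in scanned)
    return total, len(scanned)

by_year = partition_by(ROWS, key_index=0)
total, rows_scanned = total_for_year(by_year, 2022)
print(total, rows_scanned, len(ROWS))  # prints: 70 2 6
```

In this toy case the query touches 2 of the 6 rows. The saving only appears because the filter (year) matches the partitioning attribute; a filter on region would still have to visit every partition.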

The results show that partitioning brings a large benefit, especially when the chosen attributes match the
query filters. For example, when a query filters on date ranges or regions and the table is partitioned by
these attributes, query times can be reduced by over 40%.

Bucketing behaves differently: although it is beneficial in join operations (for example, when tables are
bucketed on the join keys), it does not provide the same advantages as partitioning. The bucketing
attributes have to be aligned with the query patterns to maximize performance.
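
To make the join case concrete, here is a small Python sketch of a bucket-wise join. It is only an illustration under stated assumptions: the bucket count, the CRC32-based `bucket_of` function, and the sample tables are hypothetical, and Hive uses its own hash function. The point is that when both tables are bucketed on the join key with the same number of buckets, bucket i of one table only needs to be compared with bucket i of the other.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count; it must be the same for both tables

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket with a stable hash (illustrative, not Hive's hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, key_index):
    """Distribute rows into NUM_BUCKETS lists according to the key's bucket."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets):
    """Join on the first field, pairing bucket i of the left with bucket i of the right."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        index = {}
        for row in right:
            index.setdefault(row[0], []).append(row)
        for row in left:
            for match in index.get(row[0], []):
                joined.append(row + match[1:])
    return joined

# Hypothetical tables: (customer_id, dish) and (customer_id, name).
orders = [(1, "pizza"), (2, "sushi"), (3, "tacos"), (1, "pasta")]
customers = [(1, "Ana"), (2, "Ben"), (4, "Cy")]

result = sorted(bucketed_join(bucketize(orders, 0), bucketize(customers, 0)))
print(result)
```

Rows with the same key always land in the same bucket index on both sides, so no matching pair is missed. If the two tables used different bucket counts or different keys, this shortcut would no longer be valid, which is why bucketing only pays off when it is aligned with the join.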

It is still possible to combine both techniques, but this has a significant drawback: the study shows that
some combinations reduce processing times, while others introduce overhead that can offset the benefits.

To conclude, the study presents best practices for organizing data in Hive-based BDWs and emphasizes
that data organization strategies must be adapted to the query patterns and workloads to achieve
optimal performance. Partitioning can be very effective when it is used well (that is, when it is aligned
with the query filters), and bucketing can also be effective, but only under specific conditions involving
joins.

Exercise 2

Exercise 3
Exercise 4

Exercise 5

Exercise 6

1. Partitioning by critic’s name improves query performance when filtering by critic, and because
there are not many critics, there will not be too many partitions.
2. Partitioning by place ID could be a good choice in some cases. In our case, however, there are
too many restaurants (and therefore place IDs), which would lead to too many partitions and
could hurt performance instead of improving it.
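
The trade-off between the two partition keys can be checked with a quick count. The sketch below uses hypothetical review rows (not the assignment's dataset) and a hypothetical `partition_stats` helper; it compares how many partitions each candidate key would create and how many rows each partition would hold on average.

```python
# Hypothetical review rows: (critic, place_id, rating). Not the assignment's data.
reviews = [
    ("alice", 101, 4), ("alice", 102, 5), ("alice", 103, 3),
    ("bob", 104, 2), ("bob", 105, 4), ("bob", 106, 5),
]

def partition_stats(rows, key_index):
    """Return (number of partitions, average rows per partition) for a partition key."""
    keys = {row[key_index] for row in rows}
    return len(keys), len(rows) / len(keys)

critic_stats = partition_stats(reviews, 0)
place_stats = partition_stats(reviews, 1)
print(critic_stats, place_stats)  # prints: (2, 3.0) (6, 1.0)
```

A key with few distinct values gives a small number of reasonably sized partitions, while a key with as many values as rows gives one tiny partition per value.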

Exercise 7
Exercise 8

1. We choose a row format when we frequently need to access complete records (most of the
columns). We choose a column format when we want to perform analytical queries that only
require a subset of the columns of a very large dataset.
2. Splittability, for a column file format, is the ability to break large data files down into smaller,
independent chunks that can be processed in parallel (for example, on different nodes of a
cluster). It is important because it improves processing speed and efficiency, especially for
large datasets.
3. Files with repetitive data (like dates and times or categorical data) can achieve better
compression when stored in columnar format than when stored in row format.
4. The “Parquet” column file format is the best choice for analytical queries on subsets of the
columns of a large dataset when the workload is read-heavy. It is also well integrated with the
Hadoop ecosystem.
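
The compression point in answer 3 can be illustrated with run-length encoding, one simple compression technique. This is only a sketch on hypothetical records, not a description of how Parquet encodes data: it counts how many runs of identical consecutive values appear when the same records are laid out row by row versus column by column.

```python
from itertools import groupby

# Hypothetical records: (date, category, amount).
records = [
    ("2024-01-01", "food", 12),
    ("2024-01-01", "food", 7),
    ("2024-01-01", "food", 30),
    ("2024-01-02", "food", 5),
    ("2024-01-02", "bar", 18),
    ("2024-01-02", "bar", 9),
]

def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Row layout: the values of all columns are interleaved record by record.
row_layout = [value for record in records for value in record]
# Column layout: all values of one column are stored next to each other.
column_layout = [list(column) for column in zip(*records)]

row_runs = len(run_length_encode(row_layout))
column_runs = sum(len(run_length_encode(column)) for column in column_layout)
print(row_runs, column_runs)  # prints: 18 10
```

In the row layout no two neighboring values are equal, so nothing collapses. In the column layout the repeated dates and categories sit next to each other and collapse into a few runs, while the amount column, which has no repetition, gains nothing.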
