BigData-Assignment4-CSP 554
Exercise 1
This article explores optimization techniques for improving query performance in Hive-based Big Data
Warehouses (BDWs). The authors focus on two main data organization strategies, partitioning and
bucketing, and test each one to assess its impact on performance. Both strategies divide large
datasets into smaller, more manageable parts.
1. Big Data Warehousing (BDW): BDWs differ from traditional data warehouses in that they offer
higher scalability, performance, and flexibility. They can be built with tools such as Hive and
Hadoop, which provide storage on distributed file systems (like HDFS) as well as querying
capabilities.
2. Partitioning in Hive: Partitioning splits a table according to attributes that appear often in
queries (for example, year or region). This technique is beneficial because a query that filters
on a partitioning attribute only has to read the matching partitions, which reduces the amount of
data scanned and therefore the processing time. The article reports reductions in processing time
of up to 40-50% in some cases.
3. Combining Partitioning and Bucketing: Partitioning is useful on its own, but it can also be
combined with bucketing to optimize performance further. The article's results indicate that
partitioning by frequently queried attributes provides the most benefit, while bucketing is
beneficial only in specific cases, such as joins.
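The effect of partitioning described in item 2 can be illustrated with a small Python sketch. This is only a toy model, not Hive itself: the rows, the column order (year, region, amount), and the helper names are all hypothetical. It compares how many rows a full scan reads with how many are read when only the matching partition is opened.

```python
from collections import defaultdict

# Hypothetical toy rows: (year, region, amount). Not data from the article.
ROWS = [
    (2015, "EU", 10.0),
    (2015, "US", 20.0),
    (2016, "EU", 30.0),
    (2016, "US", 40.0),
    (2017, "EU", 50.0),
]


def partition_by(rows, key_index):
    """Group rows by one attribute, similar to one directory per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)


def full_scan_total(rows, year):
    """Sum amounts for a year by reading every row; returns (total, rows_read)."""
    rows_read = 0
    total = 0.0
    for row in rows:
        rows_read += 1
        if row[0] == year:
            total += row[2]
    return total, rows_read


def pruned_total(partitions, year):
    """Sum amounts for a year by reading only the matching partition."""
    rows = partitions.get(year, [])
    return sum(row[2] for row in rows), len(rows)


partitions = partition_by(ROWS, 0)
full_result = full_scan_total(ROWS, 2016)
pruned_result = pruned_total(partitions, 2016)
print("full scan:", full_result)
print("pruned scan:", pruned_result)
```

Both approaches return the same total, but the pruned version reads only the rows of the 2016 partition. The saving depends entirely on the query filtering on the partitioning attribute.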
The results show that partitioning brings a large benefit, especially when the chosen attributes match
the query filters. For example, when a query filters on date ranges or regions, partitioning the table by
these attributes can reduce query times by over 40%.
Bucketing behaves differently: although it helps in join operations (for example, when tables are
bucketed on the join keys), it does not provide the same advantages as partitioning. To maximize
performance, the bucketing attributes have to be aligned with the query patterns.
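To show why bucketing on the join key can help a join, here is a minimal Python sketch under simplifying assumptions. The bucket count, the modulo bucketing function, and the sample tables are hypothetical and only stand in for the general idea: when both tables are bucketed the same way on the join key, matching rows always land in the same bucket number, so each pair of buckets can be joined independently.

```python
NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket (simple modulo, an assumption of this sketch)."""
    return key % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Distribute (key, value) rows into buckets according to their key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


# Hypothetical tables keyed by customer id.
ORDERS = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (6, "order-d")]
CUSTOMERS = [(1, "alice"), (2, "bob"), (5, "carol")]

joined = bucketed_join(bucketize(ORDERS), bucketize(CUSTOMERS))
print(sorted(joined))
```

The join never compares rows from different buckets, which is where the saving comes from. If a query does not join or filter on the bucketing key, this layout brings no such benefit.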
Both techniques can still be combined, but this has a significant drawback: the study shows that some
combinations reduce processing times, while others introduce overhead that can cancel out the benefits.
To conclude, the study presents best practices for organizing data in Hive-based BDWs and emphasizes
that data organization strategies must be adapted to the query patterns and workloads to achieve
optimal performance. Partitioning can be very effective when it is aligned with the query filters, and
bucketing can also be effective, but only under specific conditions, mainly when joins are involved.
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Exercise 6
1. Partitioning by critic's name improves query performance when filtering by critic, and because
there are only a few critics, there will not be too many partitions.
2. Partitioning by place ID could be a good choice in some cases. In our case, however, there are
too many restaurants (and therefore too many place IDs), which would lead to too many partitions
and could hurt performance instead of improving it.
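The reasoning above comes down to comparing the number of distinct values in each candidate column, since each distinct value becomes one partition. A minimal Python sketch, using made-up review rows (the critic names and place IDs are hypothetical, not the assignment's data):

```python
# Hypothetical (critic, place_id) review rows, not the assignment's dataset.
REVIEWS = [
    ("critic_a", 101),
    ("critic_a", 102),
    ("critic_b", 103),
    ("critic_b", 104),
    ("critic_a", 105),
    ("critic_b", 106),
]


def partition_count(rows, column_index):
    """Number of partitions a column would produce: one per distinct value."""
    return len({row[column_index] for row in rows})


critic_partitions = partition_count(REVIEWS, 0)
place_partitions = partition_count(REVIEWS, 1)
print("partitions by critic:", critic_partitions)
print("partitions by place ID:", place_partitions)
```

Here the critic column yields few partitions with several rows each, while the place ID column yields one tiny partition per row, which is the situation to avoid.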
Exercise 7
Exercise 8
1. We choose a row format when we frequently need to access complete records (most of the
columns). We choose a column format when we want to perform analytical queries that only
require a subset of columns from a very large dataset.
2. Splittability, for a column file format, is the ability to break down large data files into smaller,
independent chunks, so that they can be processed in parallel (for example, on different nodes of
a cluster). It is important because it improves processing speed and efficiency, especially for
large datasets.
3. Files with repetitive data (like dates and times or categorical data) can achieve better
compression when stored in columnar format than when stored in row format.
4. The Parquet column file format is the best choice when we run analytical queries on subsets of
a large dataset, when the workload is read-heavy, and when good integration with the Hadoop
ecosystem is needed.
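The compression argument in answer 3 can be illustrated with run-length encoding, one of the encodings that columnar formats can apply to repetitive columns. The Python sketch below is a simplified illustration with hypothetical sample rows, not a model of any real file format: it counts how many runs are needed when values are stored row by row and when they are stored column by column.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical (date, status) rows with repetitive values.
ROWS = [
    ("2024-01-01", "open"),
    ("2024-01-01", "open"),
    ("2024-01-01", "closed"),
    ("2024-01-02", "closed"),
    ("2024-01-02", "closed"),
    ("2024-01-02", "closed"),
]

# Row layout: values of different columns alternate, so equal values are rarely adjacent.
row_layout = [value for row in ROWS for value in row]

# Column layout: each column is stored contiguously, so equal values sit next to each other.
date_column = [row[0] for row in ROWS]
status_column = [row[1] for row in ROWS]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(date_column) + run_length_encode(status_column)

print("runs in row layout:", len(row_runs))
print("runs in column layout:", len(column_runs))
```

In the row layout every value starts a new run, while in the column layout the twelve values collapse into four runs. The benefit depends on the data being repetitive and reasonably well ordered within each column.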