Lab-2_asgmt-2 (1)

The document outlines an assignment for a Big Data Engineering course focused on database loading and querying using the TPC-H dataset. Students are required to create tables, load data, execute benchmark queries, and document their results, including SQL queries, execution times, and database metrics. The TPC-H dataset consists of 8 relational tables and includes a set of 22 business-oriented benchmark queries.


Lab – 2 Assignment_2 – Database Loading and Query Processing

29-Jan-2025

DS614 Big Data Engineering, Winter'2025; Instructor: minal_bhise@daiict

Objectives: 1) Create tables and load data into them.
2) Run benchmark queries.

Submission: Each student needs to upload a single .pdf file containing the following for all the queries listed:

1) English query and SQL query in the given sequence.
2) Screenshot of the results.
3) Count of tuples in each table.
4) Size of the database on disk.
5) Query execution time of each query.
6) A list and definition of the parameters recorded by the PostgreSQL tool, with their values for each query.

1. Benchmark TPC-H Dataset


TPC-H is a decision support benchmark consisting of a set of business-oriented queries. The TPC-H dataset is relational and comprises 8 tables: customer, lineitem, nation, orders, part, partsupp, region, and supplier.

1.1. Benchmark TPC-H Dataset Schema


Benchmark TPC-H dataset schema consists of 8 tables as follows:

CREATE TABLE NATION (
    N_NATIONKEY INTEGER NOT NULL,
    N_NAME CHAR(25) NOT NULL,
    N_REGIONKEY INTEGER NOT NULL,
    N_COMMENT VARCHAR(152)
);

CREATE TABLE REGION (
    R_REGIONKEY INTEGER NOT NULL,
    R_NAME CHAR(25) NOT NULL,
    R_COMMENT VARCHAR(152)
);

CREATE TABLE PART (
    P_PARTKEY INTEGER NOT NULL,
    P_NAME VARCHAR(55) NOT NULL,
    P_MFGR CHAR(25) NOT NULL,
    P_BRAND CHAR(10) NOT NULL,
    P_TYPE VARCHAR(25) NOT NULL,
    P_SIZE INTEGER NOT NULL,
    P_CONTAINER CHAR(10) NOT NULL,
    P_RETAILPRICE DECIMAL(15,2) NOT NULL,
    P_COMMENT VARCHAR(23) NOT NULL
);

CREATE TABLE SUPPLIER (
    S_SUPPKEY INTEGER NOT NULL,
    S_NAME CHAR(25) NOT NULL,
    S_ADDRESS VARCHAR(40) NOT NULL,
    S_NATIONKEY INTEGER NOT NULL,
    S_PHONE CHAR(15) NOT NULL,
    S_ACCTBAL DECIMAL(15,2) NOT NULL,
    S_COMMENT VARCHAR(101) NOT NULL
);

CREATE TABLE PARTSUPP (
    PS_PARTKEY INTEGER NOT NULL,
    PS_SUPPKEY INTEGER NOT NULL,
    PS_AVAILQTY INTEGER NOT NULL,
    PS_SUPPLYCOST DECIMAL(15,2) NOT NULL,
    PS_COMMENT VARCHAR(199) NOT NULL
);

CREATE TABLE CUSTOMER (
    C_CUSTKEY INTEGER NOT NULL,
    C_NAME VARCHAR(25) NOT NULL,
    C_ADDRESS VARCHAR(40) NOT NULL,
    C_NATIONKEY INTEGER NOT NULL,
    C_PHONE CHAR(15) NOT NULL,
    C_ACCTBAL DECIMAL(15,2) NOT NULL,
    C_MKTSEGMENT CHAR(10) NOT NULL,
    C_COMMENT VARCHAR(117) NOT NULL
);

CREATE TABLE ORDERS (
    O_ORDERKEY INTEGER NOT NULL,
    O_CUSTKEY INTEGER NOT NULL,
    O_ORDERSTATUS CHAR(1) NOT NULL,
    O_TOTALPRICE DECIMAL(15,2) NOT NULL,
    O_ORDERDATE DATE NOT NULL,
    O_ORDERPRIORITY CHAR(15) NOT NULL,
    O_CLERK CHAR(15) NOT NULL,
    O_SHIPPRIORITY INTEGER NOT NULL,
    O_COMMENT VARCHAR(79) NOT NULL
);

CREATE TABLE LINEITEM (
    L_ORDERKEY INTEGER NOT NULL,
    L_PARTKEY INTEGER NOT NULL,
    L_SUPPKEY INTEGER NOT NULL,
    L_LINENUMBER INTEGER NOT NULL,
    L_QUANTITY DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT DECIMAL(15,2) NOT NULL,
    L_TAX DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG CHAR(1) NOT NULL,
    L_LINESTATUS CHAR(1) NOT NULL,
    L_SHIPDATE DATE NOT NULL,
    L_COMMITDATE DATE NOT NULL,
    L_RECEIPTDATE DATE NOT NULL,
    L_SHIPINSTRUCT CHAR(25) NOT NULL,
    L_SHIPMODE CHAR(10) NOT NULL,
    L_COMMENT VARCHAR(44) NOT NULL
);

2. Populate Data in the Database

Download the dataset (tpc-h) provided through mail.

Before running the query below, follow the steps provided in the image. The column names should be the same.
COPY nation
FROM 'Path\\tpc-h\\nation.tbl'
WITH (
FORMAT CSV,
DELIMITER '|',
HEADER TRUE
);
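
Once the tables are loaded, the tuple counts and the database size asked for in the submission list can be gathered with queries along the following lines. This is a minimal sketch, assuming the eight tables from Section 1.1 exist in the currently connected database. The file name region.tbl is an assumption that follows the nation.tbl pattern, and the Path placeholder is kept as given. count(*), pg_database_size, pg_size_pretty, and current_database are built-in PostgreSQL features; the column aliases are illustrative.

```sql
-- Load another table the same way as nation (file name is an assumption).
COPY region
FROM 'Path\\tpc-h\\region.tbl'
WITH (
    FORMAT CSV,
    DELIMITER '|',
    HEADER TRUE
);

-- Count of tuples in each table, one result row per table.
SELECT 'nation' AS table_name, count(*) AS tuples FROM nation
UNION ALL SELECT 'region',   count(*) FROM region
UNION ALL SELECT 'part',     count(*) FROM part
UNION ALL SELECT 'supplier', count(*) FROM supplier
UNION ALL SELECT 'partsupp', count(*) FROM partsupp
UNION ALL SELECT 'customer', count(*) FROM customer
UNION ALL SELECT 'orders',   count(*) FROM orders
UNION ALL SELECT 'lineitem', count(*) FROM lineitem;

-- Size of the current database on disk, in human-readable units.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;
```

Because the CREATE TABLE statements use unquoted names, PostgreSQL folds them to lowercase, so nation and NATION refer to the same table.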

3. Benchmark TPC-H Queryset


Benchmark TPC-H queryset consists of 22 queries as follows:

Query Plain Text SQL

Q1
Plain text: select pricing summary report from lineitem where l_shipdate <= date '1998-12-01' - interval '108' day
SQL:
select l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from LINEITEM
where l_shipdate <= date '1998-12-01' - interval '108' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Q2
Plain text: select 100 minimum cost supplier from part, partsupp, supplier, nation, region where r_name = 'ASIA'
SQL:
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from PART, SUPPLIER, PARTSUPP, NATION, REGION
where p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 30
    and p_type like '%STEEL'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and ps_supplycost = (
        select min(ps_supplycost)
        from PARTSUPP, SUPPLIER, NATION, REGION
        where p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = 'ASIA')
order by s_acctbal desc, n_name, s_name, p_partkey
limit 100;

Q3
Plain text: select 10 shipping priority from customer, orders, lineitem where o_orderdate < date '1995-03-13' and l_shipdate > date '1995-03-13'
SQL:
select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-13'
    and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;

Q4
Plain text: select order priority from orders where o_orderdate >= date '1995-01-01' and o_orderdate < date '1995-01-01' + interval '3' month
SQL:
select o_orderpriority, count(*) as order_count
from ORDERS
where o_orderdate >= date '1995-01-01'
    and o_orderdate < date '1995-01-01' + interval '3' month
    and exists (
        select * from LINEITEM
        where l_orderkey = o_orderkey
            and l_commitdate < l_receiptdate)
group by o_orderpriority
order by o_orderpriority;

Q5
Plain text: select local supplier volume in descending order from customer, orders, lineitem, supplier, nation, region where r_name = 'MIDDLE EAST' and o_orderdate >= date '1994-01-01' and o_orderdate < date '1994-01-01' + interval '1' year
SQL:
select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
where c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and l_suppkey = s_suppkey
    and c_nationkey = s_nationkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'MIDDLE EAST'
    and o_orderdate >= date '1994-01-01'
    and o_orderdate < date '1994-01-01' + interval '1' year
group by n_name
order by revenue desc;

Q6
Plain text: select revenue change from lineitem where l_shipdate >= date '1994-01-01' and l_shipdate < date '1994-01-01' + interval '1' year and l_discount between 0.06 - 0.01 and 0.06 + 0.01 and l_quantity < 24
SQL:
select sum(l_extendedprice * l_discount) as revenue
from LINEITEM
where l_shipdate >= date '1994-01-01'
    and l_shipdate < date '1994-01-01' + interval '1' year
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 24;

Q7
Plain text: select volume shipping from supplier, lineitem, orders, customer, nation n1, nation n2 where ((n1.n_name = 'JAPAN' and n2.n_name = 'INDIA') or (n1.n_name = 'INDIA' and n2.n_name = 'JAPAN')) and l_shipdate between date '1995-01-01' and date '1996-12-31'
SQL:
select supp_nation, cust_nation, l_year, sum(volume) as revenue
from (
    select n1.n_name as supp_nation,
        n2.n_name as cust_nation,
        extract(year from l_shipdate) as l_year,
        l_extendedprice * (1 - l_discount) as volume
    from SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1, NATION n2
    where s_suppkey = l_suppkey
        and o_orderkey = l_orderkey
        and c_custkey = o_custkey
        and s_nationkey = n1.n_nationkey
        and c_nationkey = n2.n_nationkey
        and ((n1.n_name = 'JAPAN' and n2.n_name = 'INDIA')
            or (n1.n_name = 'INDIA' and n2.n_name = 'JAPAN'))
        and l_shipdate between date '1995-01-01' and date '1996-12-31') as shipping
group by supp_nation, cust_nation, l_year
order by supp_nation, cust_nation, l_year;

Q8
Plain text: select national market share from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'SMALL PLATED COPPER'
SQL:
select o_year,
    sum(case when nation = 'INDIA' then volume else 0 end) / sum(volume) as mkt_share
from (
    select extract(year from o_orderdate) as o_year,
        l_extendedprice * (1 - l_discount) as volume,
        n2.n_name as nation
    from PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION n1, NATION n2, REGION
    where p_partkey = l_partkey
        and s_suppkey = l_suppkey
        and l_orderkey = o_orderkey
        and o_custkey = c_custkey
        and c_nationkey = n1.n_nationkey
        and n1.n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and s_nationkey = n2.n_nationkey
        and o_orderdate between date '1995-01-01' and date '1996-12-31'
        and p_type = 'SMALL PLATED COPPER') as all_nations
group by o_year
order by o_year;

Q9
Plain text: select product type profit measure from part, supplier, lineitem, partsupp, orders, nation where p_name like '%dim%'
SQL:
select nation, o_year, sum(amount) as sum_profit
from (
    select n_name as nation,
        extract(year from o_orderdate) as o_year,
        l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
    from PART, SUPPLIER, LINEITEM, PARTSUPP, ORDERS, NATION
    where s_suppkey = l_suppkey
        and ps_suppkey = l_suppkey
        and ps_partkey = l_partkey
        and p_partkey = l_partkey
        and o_orderkey = l_orderkey
        and s_nationkey = n_nationkey
        and p_name like '%dim%') as profit
group by nation, o_year
order by nation, o_year desc;

Q10
Plain text: select 20 returned item reporting from customer, orders, lineitem, nation where o_orderdate >= date '1993-08-01' and o_orderdate < date '1993-08-01' + interval '3' month
SQL:
select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment
from CUSTOMER, ORDERS, LINEITEM, NATION
where c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1993-08-01'
    and o_orderdate < date '1993-08-01' + interval '3' month
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
order by revenue desc
limit 20;

Q11
Plain text: select important stock identification in descending order from PARTSUPP, SUPPLIER, NATION where n_name = 'MOZAMBIQUE'
SQL:
select ps_partkey, sum(ps_supplycost * ps_availqty) as value
from PARTSUPP, SUPPLIER, NATION
where ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'MOZAMBIQUE'
group by ps_partkey
having sum(ps_supplycost * ps_availqty) > (
    select sum(ps_supplycost * ps_availqty) * 0.0001000000
    from PARTSUPP, SUPPLIER, NATION
    where ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'MOZAMBIQUE')
order by value desc;

Q12
Plain text: select shipping modes and order priority from orders, lineitem where l_shipmode in ('RAIL', 'FOB') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= date '1997-01-01' and l_receiptdate < date '1997-01-01' + interval '1' year
SQL:
select l_shipmode,
    sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count,
    sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1 else 0 end) as low_line_count
from ORDERS, LINEITEM
where o_orderkey = l_orderkey
    and l_shipmode in ('RAIL', 'FOB')
    and l_commitdate < l_receiptdate
    and l_shipdate < l_commitdate
    and l_receiptdate >= date '1997-01-01'
    and l_receiptdate < date '1997-01-01' + interval '1' year
group by l_shipmode
order by l_shipmode;

Q13
Plain text: select customer distribution in descending order from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%'
SQL:
select c_count, count(*) as custdist
from (
    select c_custkey, count(o_orderkey) as c_count
    from CUSTOMER left outer join ORDERS
        on c_custkey = o_custkey
        and o_comment not like '%pending%deposits%'
    group by c_custkey) c_orders
group by c_count
order by custdist desc, c_count desc;

Q14
Plain text: select promotion effect from lineitem, part where l_shipdate >= date '1996-12-01' and l_shipdate < date '1996-12-01' + interval '1' month
SQL:
select 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end)
    / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from LINEITEM, PART
where l_partkey = p_partkey
    and l_shipdate >= date '1996-12-01'
    and l_shipdate < date '1996-12-01' + interval '1' month;

Q15
Plain text: create view of top supplier from lineitem where l_shipdate >= date '1997-07-01' and l_shipdate < date '1997-07-01' + interval '3' month
SQL:
create view REVENUE0 (supplier_no, total_revenue) as
    select l_suppkey, sum(l_extendedprice * (1 - l_discount))
    from LINEITEM
    where l_shipdate >= date '1997-07-01'
        and l_shipdate < date '1997-07-01' + interval '3' month
    group by l_suppkey;
select s_suppkey, s_name, s_address, s_phone, total_revenue
from SUPPLIER, REVENUE0
where s_suppkey = supplier_no
    and total_revenue = (select max(total_revenue) from REVENUE0)
order by s_suppkey;
drop view REVENUE0;

Q16
Plain text: select parts/supplier relationship from partsupp, part where p_brand <> 'Brand#34' and p_type not like 'LARGE BRUSHED%' and p_size in (48, 19, 12, 4, 41, 7, 21, 39) and ps_suppkey not in (select s_suppkey from supplier where s_comment like '%Customer%Complaints%')
SQL:
select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt
from PARTSUPP, PART
where p_partkey = ps_partkey
    and p_brand <> 'Brand#34'
    and p_type not like 'LARGE BRUSHED%'
    and p_size in (48, 19, 12, 4, 41, 7, 21, 39)
    and ps_suppkey not in (
        select s_suppkey from SUPPLIER
        where s_comment like '%Customer%Complaints%')
group by p_brand, p_type, p_size
order by supplier_cnt desc, p_brand, p_type, p_size;

Q17
Plain text: select small-quantity-order revenue from lineitem, part where p_brand = 'Brand#44' and p_container = 'WRAP PKG' and l_quantity < (select 0.2 * avg(l_quantity))
SQL:
select sum(l_extendedprice) / 7.0 as avg_yearly
from LINEITEM, PART
where p_partkey = l_partkey
    and p_brand = 'Brand#44'
    and p_container = 'WRAP PKG'
    and l_quantity < (
        select 0.2 * avg(l_quantity)
        from LINEITEM
        where l_partkey = p_partkey);

Q18
Plain text: select 100 large volume customer from CUSTOMER, ORDERS, LINEITEM where sum(l_quantity) > 314
SQL:
select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from CUSTOMER, ORDERS, LINEITEM
where o_orderkey in (
        select l_orderkey from LINEITEM
        group by l_orderkey
        having sum(l_quantity) > 314)
    and c_custkey = o_custkey
    and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
limit 100;

Q19
Plain text: select discounted revenue from lineitem, part where (p_brand = 'Brand#52' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 4 and l_quantity <= 4 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON') or (p_brand = 'Brand#11' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 18 and l_quantity <= 18 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON') or (p_brand = 'Brand#51' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 29 and l_quantity <= 29 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON')
SQL:
select sum(l_extendedprice * (1 - l_discount)) as revenue
from LINEITEM, PART
where (p_partkey = l_partkey
        and p_brand = 'Brand#52'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 4 and l_quantity <= 4 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON')
    or (p_partkey = l_partkey
        and p_brand = 'Brand#11'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 18 and l_quantity <= 18 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON')
    or (p_partkey = l_partkey
        and p_brand = 'Brand#51'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 29 and l_quantity <= 29 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON');

Q20
Plain text: select potential part promotion from supplier, nation where s_suppkey in (select ps_suppkey from partsupp where ps_partkey in (select p_partkey from part where p_name like 'green%') and ps_availqty > (select 0.5 * sum(l_quantity) from lineitem where l_shipdate >= date '1993-01-01' and l_shipdate < date '1993-01-01' + interval '1' year)) and n_name = 'ALGERIA'
SQL:
select s_name, s_address
from SUPPLIER, NATION
where s_suppkey in (
        select ps_suppkey
        from PARTSUPP
        where ps_partkey in (select p_partkey from PART where p_name like 'green%')
            and ps_availqty > (
                select 0.5 * sum(l_quantity)
                from LINEITEM
                where l_partkey = ps_partkey
                    and l_suppkey = ps_suppkey
                    and l_shipdate >= date '1993-01-01'
                    and l_shipdate < date '1993-01-01' + interval '1' year))
    and s_nationkey = n_nationkey
    and n_name = 'ALGERIA'
order by s_name;

Q21
Plain text: select 100 suppliers who kept orders waiting from supplier, lineitem l1, orders, nation where o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists (select * from lineitem) and not exists (select * from lineitem) and n_name = 'EGYPT'
SQL:
select s_name, count(*) as numwait
from SUPPLIER, LINEITEM l1, ORDERS, NATION
where s_suppkey = l1.l_suppkey
    and o_orderkey = l1.l_orderkey
    and o_orderstatus = 'F'
    and l1.l_receiptdate > l1.l_commitdate
    and exists (
        select * from LINEITEM l2
        where l2.l_orderkey = l1.l_orderkey
            and l2.l_suppkey <> l1.l_suppkey)
    and not exists (
        select * from LINEITEM l3
        where l3.l_orderkey = l1.l_orderkey
            and l3.l_suppkey <> l1.l_suppkey
            and l3.l_receiptdate > l3.l_commitdate)
    and s_nationkey = n_nationkey
    and n_name = 'EGYPT'
group by s_name
order by numwait desc, s_name
limit 100;

Q22
Plain text: select global sales opportunity from (select substring(c_phone from 1 for 2) as cntrycode, c_acctbal from customer where substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21') and c_acctbal > (select avg(c_acctbal) from customer where c_acctbal > 0.00 and substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21')) and not exists (select * from ORDERS where o_custkey = c_custkey)) as custsale group by cntrycode order by cntrycode;
SQL:
select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal
from (
    select substring(c_phone from 1 for 2) as cntrycode, c_acctbal
    from CUSTOMER
    where substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21')
        and c_acctbal > (
            select avg(c_acctbal)
            from CUSTOMER
            where c_acctbal > 0.00
                and substring(c_phone from 1 for 2) in ('20', '40', '22', '30', '39', '42', '21'))
        and not exists (
            select * from ORDERS
            where o_custkey = c_custkey)) as custsale
group by cntrycode
order by cntrycode;
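
For the execution time and the per-query parameters requested in the submission list, one common approach in PostgreSQL is to prefix a query with EXPLAIN and the ANALYZE option. The sketch below uses Q6 as the example. It assumes that the "parameters recorded by the PostgreSQL tool" are the plan statistics PostgreSQL itself reports; the assignment does not name the tool, so treat this as one possible reading rather than the required method.

```sql
-- EXPLAIN with ANALYZE actually runs the query and reports, for each plan
-- node, the estimated cost and row count alongside the actual time, rows,
-- and loops, followed by "Planning Time" and "Execution Time" in ms.
-- BUFFERS adds counts of shared buffers hit and read.
EXPLAIN (ANALYZE, BUFFERS)
select sum(l_extendedprice * l_discount) as revenue
from LINEITEM
where l_shipdate >= date '1994-01-01'
    and l_shipdate < date '1994-01-01' + interval '1' year
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 24;
```

When working in psql, the meta-command \timing on additionally prints the elapsed time of every statement. For Q15, only the middle select statement can be prefixed with EXPLAIN; the create view and drop view statements have to be run as they are.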
