BDM1043 - Lab 5 - Covid Data Insights Using Hive Queries

This document discusses using Hive queries to analyze Ontario Covid data stored in a Hive table called covidcases_dyn3. It provides the Hive commands used to create and populate the table from a CSV file. It then lists 5 questions to answer from the data and shows the Hive queries used to answer each question, including queries to find deaths by year, days with the most deaths, regions with the most deaths, days with the fewest resolved cases, and days with the highest ratio of resolved to total cases.

Uploaded by

Makis Quilop

BDM1043 Lab 5 - Covid data insights using Hive queries

BDM 1043 - Big Data Fundamentals 02 (DSMM Group 2)


Maricris Q. Resma
Lambton College Mississauga
Jagmohan Dutta
Note: in Lab 4 we created the Hive dynamic table covidcases_dyn3 with a reformatted
timestamp column:

add jar /LDZ/csv-serde-1.1.2.jar;

CREATE EXTERNAL TABLE covidcases_ext3(
FILE_DATE timestamp,
PHU_NAME string,
PHU_NUM int,
ACTIVE_CASES int,
RESOLVED_CASES int,
DEATHS int,
ARCHIVED_RESOLVED_CASES int,
ARCHIVED_DEATHS int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
stored as textfile
TBLPROPERTIES ("skip.header.line.count"="1");

load data local inpath '/LDZ/data/cases_by_status_and_phu.csv' overwrite into table covidcases_ext3;


select * from covidcases_ext3 LIMIT 20;
select * from covidcases_ext3 LIMIT 1;

CREATE TABLE covidcases_dyn3(
FILE_DATE timestamp,
PHU_NAME string,
PHU_NUM int,
ACTIVE_CASES int,
RESOLVED_CASES int,
DEATHS int,
ARCHIVED_RESOLVED_CASES int,
ARCHIVED_DEATHS int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
stored as textfile;

insert into table covidcases_dyn3
select (from_unixtime(unix_timestamp(translate(FILE_DATE, '/', '-') , 'dd-MM-yyyy'))), PHU_NAME,
cast(PHU_NUM as int), ACTIVE_CASES, RESOLVED_CASES , DEATHS , ARCHIVED_RESOLVED_CASES ,
ARCHIVED_DEATHS from covidcases_ext3;
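
To check the date conversion used in the insert above before running it on the whole table, the same expression can be tried on a single literal. This is only a sketch: the sample value '25/12/2021' is made up for illustration and assumes the CSV stores dates as dd/MM/yyyy, as the pattern in the insert implies.

```sql
-- Hypothetical sample value; the real FILE_DATE values come from the CSV.
-- translate() turns '25/12/2021' into '25-12-2021',
-- unix_timestamp() parses it with the pattern 'dd-MM-yyyy',
-- from_unixtime() formats it back as 'yyyy-MM-dd HH:mm:ss'.
SELECT from_unixtime(unix_timestamp(translate('25/12/2021', '/', '-'), 'dd-MM-yyyy'));
```

If the pattern matches, this should return 2021-12-25 00:00:00; a value that does not match the pattern comes back as NULL.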

describe covidcases_dyn3;
select * from covidcases_dyn3 LIMIT 20;
select * from covidcases_dyn3 LIMIT 1;

Notes:
*** The CSV SerDe jar was needed so that commas inside PHU_NAME values are
treated as characters instead of as extra separators in the csv file
*** The disadvantage of the SerDe is that it converts all column data types to string
*** The timestamp format is produced using the following functions:
(from_unixtime(unix_timestamp(translate(FILE_DATE, '/', '-') , 'dd-MM-yyyy')))
*** int and timestamp casting is then done on the fly during select queries
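
The on-the-fly casting mentioned in the notes could look like the sketch below. The column aliases (FILE_TS, DEATHS_INT, ACTIVE_INT) are hypothetical names chosen for illustration; the table and column names are the ones from this lab.

```sql
-- All columns are stored as strings by the CSV SerDe,
-- so cast them to the intended types when selecting.
SELECT cast(FILE_DATE AS timestamp) AS FILE_TS,
       PHU_NAME,
       cast(DEATHS AS int)          AS DEATHS_INT,
       cast(ACTIVE_CASES AS int)    AS ACTIVE_INT
FROM covidcases_dyn3
LIMIT 5;
```

Casting matters most when sorting or comparing, because strings compare character by character ('9' sorts after '100').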
Current Table:

Now we will answer the lab question: “Using the Hive table created in Lab4 which will have latest
Ontario Covid data, come up with 5 questions you want answered from this data and execute
queries to answer them.”

1. How many deaths were there by year?

SELECT YEAR(FILE_DATE), sum(DEATHS) from covidcases_dyn3 GROUP by YEAR(FILE_DATE);


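2. What are the top 30 days in all the years that had the most deaths?

-- DEATHS is read as a string through the CSV SerDe, so it is cast to int for a numeric sort;
-- ORDER BY gives one global ordering, while SORT BY only orders rows within each reducer
SELECT FILE_DATE, PHU_NAME, PHU_NUM, DEATHS from covidcases_dyn3 ORDER BY cast(DEATHS as int) DESC LIMIT 30;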
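
The query above ranks individual health-unit rows, so the same day can appear several times. If "days" is meant across all health units, a possible variant (an assumption about the intent, not part of the original lab) sums the deaths per day first:

```sql
-- Sketch: total deaths per day over all PHUs, then the 30 largest totals.
SELECT FILE_DATE,
       sum(cast(DEATHS AS int)) AS DEATHS_TOTAL
FROM covidcases_dyn3
GROUP BY FILE_DATE
ORDER BY DEATHS_TOTAL DESC
LIMIT 30;
```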

3. What are the top 5 places that had the most covid deaths?

SELECT PHU_NAME, sum(DEATHS) as DEATHS_TOTAL from covidcases_dyn3 GROUP BY PHU_NAME
ORDER BY DEATHS_TOTAL DESC LIMIT 5;

4. What are the 30 days with the fewest total resolved cases?

SELECT FILE_DATE, sum(RESOLVED_CASES) AS RESOLVED_TOTAL from covidcases_dyn3 GROUP BY FILE_DATE
ORDER BY RESOLVED_TOTAL ASC LIMIT 30;

5. What are the top 10 days that had the highest ratio of resolved cases to total cases in a day?

-- the inner query totals every day; the outer query ranks the days by ratio
SELECT FILE_DATE, 100*(RESOLVED_TOTAL/(RESOLVED_TOTAL+DEATHS_TOTAL+ACTIVE_TOTAL)) AS RATIO_OUTPUT
FROM (SELECT FILE_DATE, sum(RESOLVED_CASES) AS RESOLVED_TOTAL, sum(DEATHS) AS DEATHS_TOTAL,
sum(ACTIVE_CASES) AS ACTIVE_TOTAL from covidcases_dyn3 GROUP BY FILE_DATE) t_1
ORDER BY RATIO_OUTPUT DESC LIMIT 10;
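
As a sketch of how the ratio query could be made easier to read, the variant below uses explicit int casts, rounds the percentage to two decimals, and skips days whose total is zero. The alias RATIO_PCT and the zero-total filter are illustrative additions, not part of the original lab.

```sql
-- Sketch: resolved cases as a percentage of all cases per day.
SELECT FILE_DATE,
       round(100 * RESOLVED_TOTAL / (RESOLVED_TOTAL + DEATHS_TOTAL + ACTIVE_TOTAL), 2) AS RATIO_PCT
FROM (
  SELECT FILE_DATE,
         sum(cast(RESOLVED_CASES AS int)) AS RESOLVED_TOTAL,
         sum(cast(DEATHS AS int))         AS DEATHS_TOTAL,
         sum(cast(ACTIVE_CASES AS int))   AS ACTIVE_TOTAL
  FROM covidcases_dyn3
  GROUP BY FILE_DATE
) t_1
-- avoid a NULL ratio on days where every count is zero
WHERE (RESOLVED_TOTAL + DEATHS_TOTAL + ACTIVE_TOTAL) > 0
ORDER BY RATIO_PCT DESC
LIMIT 10;
```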
