BDM1043 - Lab 5 - Covid Data Insights Using Hive Queries

This document discusses using Hive queries to analyze Ontario Covid data stored in a Hive table called covidcases_dyn3. It provides the Hive commands used to create and populate the table from a CSV file. It then lists 5 questions to answer from the data and shows the Hive queries used to answer each question, including queries to find deaths by year, days with the most deaths, regions with the most deaths, days with the fewest resolved cases, and days with the highest ratio of resolved to total cases.

Uploaded by

Makis Quilop


BDM1043 Lab 5 - Covid data insights using Hive queries

BDM 1043 - Big Data Fundamentals 02 (DSMM Group 2)

Maricris Q. Resma
Lambton College Mississauga
Jagmohan Dutta
Note: in Lab 4 we previously created the Hive dynamic table covidcases_dyn3 with a converted
timestamp format:

add jar /LDZ/csv-serde-1.1.2.jar;

CREATE EXTERNAL TABLE covidcases_ext3(

FILE_DATE timestamp,
PHU_NAME string,
PHU_NUM int,
ACTIVE_CASES int,
RESOLVED_CASES int,
DEATHS int,
ARCHIVED_RESOLVED_CASES int,
ARCHIVED_DEATHS int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
stored as textfile
TBLPROPERTIES ("skip.header.line.count"="1");

load data local inpath '/LDZ/data/cases_by_status_and_phu.csv' overwrite into table covidcases_ext3;

select * from covidcases_ext3 LIMIT 20;
select * from covidcases_ext3 LIMIT 1;

CREATE TABLE covidcases_dyn3(

insert into table covidcases_dyn3

select (from_unixtime(unix_timestamp(translate(FILE_DATE, '/', '-') , 'dd-MM-yyyy'))), PHU_NAME,
cast(PHU_NUM as int), ACTIVE_CASES, RESOLVED_CASES , DEATHS , ARCHIVED_RESOLVED_CASES ,
ARCHIVED_DEATHS from covidcases_ext3;

describe covidcases_dyn3;
select * from covidcases_dyn3 LIMIT 20;
select * from covidcases_dyn3 LIMIT 1;

Notes:
*** The CSV SerDe jar was needed so that commas inside PHU_NAME values are
treated as characters instead of extra separators in the CSV file.
*** The disadvantage of this SerDe is that it converts all data types to string.
*** The timestamp format is produced using the following functions:
(from_unixtime(unix_timestamp(translate(FILE_DATE, '/', '-') , 'dd-MM-yyyy')))
*** The int and timestamp casting is then done on the fly during select queries.
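The notes above can be checked directly in the Hive shell. The following is a minimal sketch, not part of the original lab: the date literal '25/12/2021' is a made-up sample value in the dd/MM/yyyy layout that the conversion pattern implies, and the FROM-less SELECT assumes Hive 0.13 or later.

```sql
-- With OpenCSVSerde, every column of the external table should be reported as string,
-- regardless of the types declared in CREATE EXTERNAL TABLE.
DESCRIBE covidcases_ext3;

-- Same conversion chain as in the INSERT, applied to a hypothetical sample date:
-- translate() swaps '/' for '-', unix_timestamp() parses it with the 'dd-MM-yyyy'
-- pattern, and from_unixtime() formats it as 'yyyy-MM-dd HH:mm:ss'.
SELECT from_unixtime(unix_timestamp(translate('25/12/2021', '/', '-'), 'dd-MM-yyyy'));
```

For the sample value, the second statement should return 2021-12-25 00:00:00. That is a string, and Hive can cast it to timestamp when it is inserted into a timestamp column.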
Current Table:

Now we will answer the question: “Using the Hive table created in Lab 4, which will have the latest
Ontario Covid data, come up with 5 questions you want answered from this data and execute
queries to answer them.”

1. How many deaths were there by year?

SELECT YEAR(FILE_DATE), sum(DEATHS) from covidcases_dyn3 GROUP by YEAR(FILE_DATE);

2. What are the top 30 days across all years that had the most deaths?

SELECT FILE_DATE, PHU_NAME, PHU_NUM, DEATHS from covidcases_dyn3 ORDER BY DEATHS DESC LIMIT 30;
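The query for question 2 ranks individual rows, that is, one health unit on one day. If the question is read as ranking whole days across all health units, a sketch along the following lines sums the units per day first. DEATHS_TOTAL is an alias introduced here for illustration. ORDER BY is used because it guarantees a total order of the output, whereas SORT BY only guarantees order within each reducer.

```sql
-- Sum deaths over all public health units for each day,
-- then keep the 30 days with the largest totals.
SELECT FILE_DATE, sum(DEATHS) AS DEATHS_TOTAL
FROM covidcases_dyn3
GROUP BY FILE_DATE
ORDER BY DEATHS_TOTAL DESC
LIMIT 30;
```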

3. What are the top 5 places that had the most covid deaths?

SELECT PHU_NAME, sum(DEATHS) as DEATHS_TOTAL from covidcases_dyn3 GROUP BY PHU_NAME
ORDER BY DEATHS_TOTAL DESC LIMIT 5;

4. What are the 30 days with the least total resolved cases?

SELECT FILE_DATE, sum(RESOLVED_CASES) AS RESOLVED_TOTAL from covidcases_dyn3 GROUP BY FILE_DATE
ORDER BY RESOLVED_TOTAL ASC LIMIT 30;

5. What are the top 10 days that had the highest ratio of resolved cases over total cases in a day?

SELECT FILE_DATE, 100*(RESOLVED_TOTAL/(RESOLVED_TOTAL+DEATHS_TOTAL+ACTIVE_TOTAL)) AS RATIO_OUTPUT
FROM (SELECT FILE_DATE, sum(RESOLVED_CASES) AS RESOLVED_TOTAL, sum(DEATHS) AS DEATHS_TOTAL,
sum(ACTIVE_CASES) AS ACTIVE_TOTAL from covidcases_dyn3 GROUP BY FILE_DATE) t_1
ORDER BY RATIO_OUTPUT DESC LIMIT 10;
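As a quick check of the ratio formula used in question 5, the expression can be evaluated on made-up numbers before running it against the table. The totals below (90 resolved, 4 deaths, 6 active) are hypothetical and not taken from the data, and the FROM-less SELECT assumes Hive 0.13 or later.

```sql
-- Hypothetical totals for a single day: 90 resolved, 4 deaths, 6 active.
-- In Hive the "/" operator returns a double even for integer operands,
-- so the ratio is not truncated to 0 and no explicit cast is needed.
SELECT 100*(90/(90+4+6)) AS RATIO_OUTPUT;
```

With these inputs the result should be 90, meaning 90 percent of that day's cases are resolved.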

