Big Data Project
ACKNOWLEDGEMENT
I would like to express my deepest appreciation to all those who provided me the
possibility to complete this report. Special gratitude goes to our final year
project manager, Mr. Hemant Rai, whose stimulating suggestions and encouragement
helped me to coordinate my project, especially in writing this report.
Furthermore, I would like to acknowledge with much appreciation the crucial role
of the staff of the Department of Information Technology, who gave permission to use
all the required equipment and material needed to complete the task. Special
thanks go to the guide of the project, Mr. Akhilesh Singh, who invested his full
effort in guiding me towards achieving the goal. I also appreciate the guidance
given by the other supervisor as well as the panels, especially during our project
presentation; their comments and advice have improved our presentation skills.
Aakash Juneja
Abstract
This project is based on the Annual Health Survey (Combined Household Information)
dataset, which contains data of all three rounds of the AHS. The survey is conducted
in all EAG States, namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha,
Rajasthan, Uttarakhand and Uttar Pradesh, as well as Assam. Despite being restricted
to nine states, the AHS is the largest demographic survey in the world and covers two
and a half times the population of the Sample Registration System. This project
analyses the data and generates reports from it.
Table of Contents
1.0 Introduction
2.0 System Requirement
2.1 Use of Hadoop
2.2 Use of Pig
2.3 Use of Hive
2.4 Use of R
2.5 Use of R Studio
3.0 Procedure
4.0 References
5.0 Bibliography
1.0 Introduction
The dataset contains data of all three rounds of the AHS Survey, i.e. Baseline, First
Updating Round and Second Updating Round. The survey is conducted in all EAG States,
namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand
and Uttar Pradesh, as well as Assam. During the Baseline Survey in 2010-11, a total of
20.1 million population and 4.14 million households were covered, and during the First
Updating Survey in 2011-12, 20.61 million population and 4.28 million households were
actually covered. The Second Updating Survey (third and final round) covered a total of
20.94 million population and 4.32 million households in 2012-13. Despite being restricted
to nine states, the AHS is the largest demographic survey in the world and covers two and
a half times the population of the Sample Registration System.
The data includes various indicators such as: whether a usual resident, date/month/year
of birth, age, religion, social group, marital status, date/month/year of first marriage,
attending school, not attending school, highest educational qualification attained,
occupation/activity status during the last 365 days, whether having any form of
disability, type of treatment for injury, type of illness, source of treatment, symptoms
pertaining to illness persisting for more than one month, sought medical care, various
diagnoses, source of diagnosis, getting regular treatment, whether the person
chews/smokes/consumes alcohol, status of house, type of structure of the house, ownership
status of the house, source of drinking water, whether the household treats the water in
any way to make it safer to drink, toilet facility, household with electricity, main
source of lighting, main source of fuel used for cooking, number of dwelling rooms,
availability of a kitchen, possession of radio/transistor, television, computer/laptop,
telephone, mobile phone, washing machine, refrigerator, sewing machine, bicycle,
motorcycle/scooter/moped, car/jeep/van, tractor, water pump, tube well or cart, land
possessed, residential status, whether covered by any health scheme or health insurance,
and status of household.
Analysis and visualization require extracting, cleaning and mining the data, and finally
presenting it in report form using a reporting tool such as R or Tableau.
To contribute towards Digital India, I would like to analyse the AHS data from different
aspects.
2.0 System Requirement
JDK 1.8.91
Hadoop 2.7.1
R/ RStudio
AHS dataset
2.1 Use of Hadoop
The core of Apache Hadoop consists of the following modules. Hadoop Common contains the
libraries and utilities needed by the other Hadoop modules. The Hadoop Distributed File
System (HDFS) is a distributed file system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster. Hadoop YARN manages the
cluster's computing resources, and Hadoop MapReduce is a programming model for
large-scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the
ecosystem, or collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix,
Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache
Oozie, Apache Storm.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command line utilities written as shell scripts. Though MapReduce Java
code is common, any programming language can be used with "Hadoop Streaming" to
implement the "map" and "reduce" parts of the user's program. Other projects in the Hadoop
ecosystem expose richer user interfaces.
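Since Hadoop Streaming accepts any executable that reads records from standard input and
writes key-value pairs to standard output, even an R script can serve as a mapper. The
following is only an illustrative sketch, not part of this project's procedure; it assumes
comma-separated input with the age group in a hypothetical fifth field.

#!/usr/bin/env Rscript
# Minimal Hadoop Streaming mapper sketch in R (illustrative only).
# Assumption: records are comma-separated and the age group is in the
# fifth field; this is a hypothetical layout, not the real AHS schema.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) >= 5) {
    cat(fields[5], "\t1\n", sep = "")   # emit key<TAB>1
  }
}
close(con)

A wordcount-style reducer would then sum the emitted 1s per age group; in this project the
equivalent aggregation is done with Hive queries instead.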
2.2 Use of Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce,
Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar
to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions
(UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call
directly from the language.
2.3 Use of Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used
and developed by other companies such as Netflix and the Financial Industry Regulatory
Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in
Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema on read and transparently converts queries to MapReduce, Apache Tez and Spark
jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides
indexes, including bitmap indexes. Other features of Hive include:
Indexing to provide acceleration; index types include compaction and bitmap index as of 0.10, and more index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Operating on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
By default, Hive stores metadata in an embedded Apache Derby database, and other
client/server databases like MySQL can optionally be used.
Four file formats are supported in Hive, which are TEXTFILE, SEQUENCEFILE, ORC and
RCFILE. Apache Parquet can be read via plugin in versions later than 0.10 and natively
starting at 0.13. Additional Hive plugins support querying of the Bitcoin Blockchain.
2.4 Use of R
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but
much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full
control.
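As a small illustration of this (not taken from the AHS analysis), the base plot() function
accepts plotmath expressions, so axis labels and titles can contain mathematical notation:

x <- seq(0, 2 * pi, length.out = 200)
plot(x, sin(x), type = "l",
     xlab = expression(theta),                        # Greek symbol on the x axis
     ylab = expression(sin(theta)),
     main = expression(paste("Plot of ", sin(theta))))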
R is available as Free Software under the terms of the Free Software Foundation's GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term environment is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the
case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.
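For example, a new function can be defined and then used like any built-in one. The helper
below is purely hypothetical and only illustrates the idea of extending R with user-defined
functions:

# Hypothetical helper: proportion of ill people, guarding against division by zero
prop_ill <- function(ill, total) {
  ifelse(total > 0, ill / total, NA_real_)
}
prop_ill(c(10, 0, 25), c(100, 0, 50))   # returns 0.10, NA, 0.50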
Many users think of R as a statistics system. We prefer to think of it as an environment within
which statistical techniques are implemented. R can be extended (easily) via packages. There
are about eight packages supplied with the R distribution and many more are available
through the CRAN family of Internet sites covering a very wide range of modern statistics.
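A typical way to extend R is to install a package from CRAN and load it into the session;
ggplot2 is used below purely as an illustration, since the plots in this project use base
graphics:

install.packages("ggplot2")                    # one-time download from a CRAN mirror
library(ggplot2)                               # attach the package for this session
ggplot(mtcars, aes(mpg, wt)) + geom_point()    # quick scatter plot from a built-in dataset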
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
2.5 Use of R Studio
RStudio is an integrated development environment (IDE) for R. It provides a console, a
syntax-highlighting editor that supports direct code execution, and tools for plotting,
history, debugging and workspace management. In this project, RStudio is used to load the
files produced by the Hive queries and to generate the plots.
3.0 Procedure
8. Now execute a query to perform the required analysis and store the result into the
final_dataset table in Hive.
Command:
hive> insert into table final_dataset
select M.agegroup, M.count, T.totalsum, P.totalill, Q.commondisease
from temp M
LEFT OUTER JOIN (select agegroup, sum(count) totalsum from final
                 group by agegroup) T
  on (M.agegroup = T.agegroup)
LEFT OUTER JOIN (select agegroup, sum(count) totalill from final
                 where symptoms != 'No Symptoms of chronic diseases'
                   and symptoms != 'Asymptomatic'
                 group by agegroup) P
  on (M.agegroup = P.agegroup)
LEFT OUTER JOIN (select R.agegroup, S.symptoms commondisease from temp R
                 LEFT OUTER JOIN final S on R.count = S.count) Q
  on (M.agegroup = Q.agegroup);
R Command
10. Now load the file into RStudio
analysis1 <- read.csv("/home/aakash/Desktop/one.csv", header = T)
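Optionally, the loaded data frame can be inspected before plotting; the column names
assumed here (agegroup, count, totalsum, totalill, commondisease) come from the Hive
insert query above:

str(analysis1)       # structure and column types
head(analysis1)      # first few rows
summary(analysis1)   # basic summary of the numeric columns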
11. Represent this in the form of a bar chart.
Converting the data frame into a numeric matrix:
temp <- analysis1
analysis1 <- analysis1[, -5]    # drop the commondisease column
analysis1 <- analysis1[, -1]    # drop the agegroup column
m <- as.matrix(t(analysis1))    # rows: count, totalsum, totalill
Avoiding scientific notation on the axis:
getOption("scipen")
opt <- options("scipen" = 20)
Plotting the bar plot and adding a legend:
barplot(m, names.arg = temp$agegroup, beside = TRUE,
        col = c('red', 'blue', 'green'))
legend("topright", c("Common Disease Count", "Total Population", "Total ill"),
       cex = 0.75, fill = c('red', 'blue', 'green'))
AGE GROUP    COMMON SYMPTOMS
0-10         ENT problems/diseases
11-20        ENT problems/diseases
21-30        ENT problems/diseases
31-40        ENT problems/diseases
41-50        ENT problems/diseases
51-60        ENT problems/diseases
61-70
71-80
81-90
91-100
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'ELECTRICITY'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'REFRIGERATOR'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'TELEVISION'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'SEWING_M'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'SCOOTER'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'BICYCLE'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'TRACTOR'
) tmp group by type;
11. Calculating total count of Rural and Urban
hive>select count(*), rural from project.allcities where rural='Rural' or rural= 'Urban'
group by rural;
          TV         CAR        BIKE       Electricity   Cooking Fuel
Urban     100.00%    70.00%     40.00%     100.00%       100.00%
Rural     50.00%     10.00%     90.00%     100.00%       90.00%
R Command
14. Now load the file into RStudio
analysis3 <- read.csv("/home/aakash/Desktop/Final_Dataset.csv",header = T)
temp<- analysis3
15. Represent this in the form of a bar chart.
Removing columns and converting to a matrix:
analysis3 <- analysis3[, -1]
View(analysis3)
analysis3 <- analysis3[, -1]
analysis3 <- analysis3[, -1]
m <- as.matrix(t(analysis3))    # conversion mirrors step 11 of the first analysis; needed before barplot()
Plotting the bar plot and adding a legend:
barplot(m, names.arg = temp$TYPE, beside = TRUE, col = c('red', 'blue'))
legend("topright", c("Rural", "Urban"), cex = 0.75, fill = c('red', 'blue'))
8. Now execute the queries to perform the required analysis and store the results into
tables in Hive.
Inserting Required Data into a3
hive>insert into table a3 select psu_id , symptoms_pertaining_illness,
drinking_water_source from default.allcities;
Creating table a3_grouped
hive>create table a3_grouped (total int,symptoms_pertaining_illness
string,drinking_water_source string);
Inserting Data into table a3_grouped
hive>insert into table a3_grouped select count(*), symptoms_pertaining_illness ,
drinking_water_source from a3 where drinking_water_source!='NA' and
drinking_water_source!='drinking_water_source' group by
drinking_water_source, symptoms_pertaining_illness;
Calculating the maximum symptom count for each source of drinking water
hive>select max(total), drinking_water_source from a3_grouped where
symptoms_pertaining_illness!='Asymptomatic' and symptoms_pertaining_illness!
='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
hive>create table maxcount(source string, maxx int);
hive>insert into table maxcount select drinking_water_source,max(total) from
a3_grouped where symptoms_pertaining_illness!='Asymptomatic' and
symptoms_pertaining_illness!='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
Calculating the total count of people who use each water source
hive>select sum(total), drinking_water_source from a3_grouped group by
drinking_water_source;
Finding the symptom with the maximum count for each water source
hive>select maxcount.source,A.symptoms_pertaining_illness
from maxcount
LEFT OUTER JOIN a3_grouped A on (maxcount.maxx = A.total and
maxcount.source = A.drinking_water_source);
hive>create table temp(source string, symptoms string);
R Command
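The R commands for this analysis are not included in the source; a minimal sketch
mirroring the earlier analyses, assuming the Hive results were exported to a hypothetical
CSV with columns source and total, could be:

# Hypothetical file exported from the Hive tables above
analysis4 <- read.csv("/home/aakash/Desktop/water_source.csv", header = TRUE)
barplot(analysis4$total, names.arg = analysis4$source,
        col = "steelblue", las = 2, cex.names = 0.7,
        main = "People per drinking-water source")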
4.0 References
https://en.wikipedia.org/
http://www.apache.org/
5.0 Bibliography
http://www.data.gov.in/