Big Data Project
ACKNOWLEDGEMENT
I would like to express my deepest appreciation to all those who provided me the
possibility to complete this report. Special gratitude goes to our final year
project manager, Mr. Hemant Rai, whose stimulating suggestions and encouragement
helped me to coordinate my project, especially in writing this report.
Furthermore, I would like to acknowledge with much appreciation the crucial role
of the staff of the Department of Information Technology, who gave permission to use
all the required equipment and material needed to complete the task. Special
thanks go to the guide of the project, Mr. Akhilesh Singh, who invested his full
effort in guiding me towards achieving the goal. I also appreciate the guidance
given by the other supervisor as well as the panels, especially during our project
presentation; their comments and advice have improved our presentation skills.
Aakash Juneja
Abstract
This project is based on the Annual Health Survey (Combined Household Information)
dataset, which contains data of all three rounds of the AHS. The survey is conducted
in all EAG States, namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha,
Rajasthan, Uttarakhand and Uttar Pradesh, as well as Assam. Despite being restricted
to nine states, the AHS is the largest demographic survey in the world and covers two
and a half times the population of the Sample Registration System. This project
analyses the data and generates reports from it.
Table of Contents
1.0 Introduction
2.0 System Requirement
2.1 Use of Hadoop
2.2 Use of Pig
2.3 Use of Hive
2.4 Use of R
2.5 Use of R Studio
3.0 Procedure
4.0 References
5.0 Bibliography
1.0 Introduction
The dataset contains data of all three rounds of the AHS Survey, i.e. Baseline, First
Updating Round and Second Updating Round. The survey is conducted in all EAG States,
namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand
and Uttar Pradesh, as well as Assam. During the Baseline Survey in 2010-11, a total of
20.1 million population and 4.14 million households were covered, and during the First
Updating Survey in 2011-12, 20.61 million population and 4.28 million households were
actually covered. The Second Updating Survey (third and final round) covered a total of
20.94 million population and 4.32 million households in 2012-13. Despite being restricted
to nine states, the AHS is the largest demographic survey in the world and covers two and
a half times the population of the Sample Registration System.
The data includes various indicators such as: whether a usual resident, date/month/year
of birth, age, religion, social group, marital status, date/month/year of first marriage,
attending school, not attending school, highest educational qualification attained,
occupation/activity status during the last 365 days, whether having any form of
disability, type of treatment for injury, type of illness, source of treatment, symptoms
pertaining to illness persisting for more than one month, sought medical care, various
diagnoses, source of diagnosis, getting regular treatment, whether the person
chews/smokes/consumes alcohol, status of house, type of structure of the house, ownership
status of the house, source of drinking water, whether the household treats the water in
any way to make it safer to drink, toilet facility, household with electricity, main
source of lighting, main source of fuel used for cooking, number of dwelling rooms,
availability of a kitchen, possession of radio/transistor, television, computer/laptop,
telephone, mobile phone, washing machine, refrigerator, sewing machine, bicycle,
motorcycle/scooter/moped, car/jeep/van, tractor, water pump, tube well or cart, land
possessed, residential status, whether covered by any health scheme or health insurance,
and status of household.
Analysis and visualization require extracting, cleaning and mining the data, and finally
presenting it in report form using a reporting tool such as R or Tableau.
To contribute towards Digital India, I would like to analyse the AHS data from different
aspects.
2.0 System Requirement
JDK 1.8.91
Hadoop 2.7.1
R/ RStudio
AHS dataset
2.1 Use of Hadoop
The core of Apache Hadoop consists of the following modules. Hadoop Common contains the
libraries and utilities needed by the other Hadoop modules. The Hadoop Distributed File
System (HDFS) is a distributed file system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster. Hadoop YARN manages the
cluster's computing resources, and Hadoop MapReduce is a programming model for
large-scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the
ecosystem, or collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix,
Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache
Oozie, Apache Storm.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command line utilities written as shell scripts. Though MapReduce Java
code is common, any programming language can be used with "Hadoop Streaming" to
implement the "map" and "reduce" parts of the user's program. Other projects in the Hadoop
ecosystem expose richer user interfaces.
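Since Hadoop Streaming accepts any executable that reads records from standard input and
writes key-value pairs to standard output, even an R script can serve as a mapper. The
following is only an illustrative sketch, not part of this project's procedure; it assumes
comma-separated input with the age group in a hypothetical fifth field.

#!/usr/bin/env Rscript
# Minimal Hadoop Streaming mapper sketch in R (illustrative only).
# Assumption: records are comma-separated and the age group is in the
# fifth field; this is a hypothetical layout, not the real AHS schema.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) >= 5) {
    cat(fields[5], "\t1\n", sep = "")   # emit key<TAB>1
  }
}
close(con)

A wordcount-style reducer would then sum the emitted 1s per age group; in this project the
equivalent aggregation is done with Hive queries instead.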
2.2 Use of Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce,
Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar
to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions
(UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call
directly from the language.
2.3 Use of Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used
and developed by other companies such as Netflix and the Financial Industry Regulatory
Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in
Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema on read and transparently converts queries to MapReduce, Apache Tez and Spark
jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides
indexes, including bitmap indexes. Other features of Hive include:
Indexing to provide acceleration; index types include compaction and bitmap index as of 0.10, and more index types are planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Operating on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.
Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
By default, Hive stores metadata in an embedded Apache Derby database, and other
client/server databases like MySQL can optionally be used.
Four file formats are supported in Hive, which are TEXTFILE, SEQUENCEFILE, ORC and
RCFILE. Apache Parquet can be read via plugin in versions later than 0.10 and natively
starting at 0.13. Additional Hive plugins support querying of the Bitcoin Blockchain.
2.4 Use of R
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but
much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full
control.
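As a small illustration of this (not taken from the AHS analysis), the base plot() function
accepts plotmath expressions, so axis labels and titles can contain mathematical notation:

x <- seq(0, 2 * pi, length.out = 200)
plot(x, sin(x), type = "l",
     xlab = expression(theta),                        # Greek symbol on the x axis
     ylab = expression(sin(theta)),
     main = expression(paste("Plot of ", sin(theta))))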
R is available as Free Software under the terms of the Free Software Foundation's GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term environment is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the
case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.
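For example, a new function can be defined and then used like any built-in one. The helper
below is purely hypothetical and only illustrates the idea of extending R with user-defined
functions:

# Hypothetical helper: proportion of ill people, guarding against division by zero
prop_ill <- function(ill, total) {
  ifelse(total > 0, ill / total, NA_real_)
}
prop_ill(c(10, 0, 25), c(100, 0, 50))   # returns 0.10, NA, 0.50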
Many users think of R as a statistics system. We prefer to think of it as an environment within
which statistical techniques are implemented. R can be extended (easily) via packages. There
are about eight packages supplied with the R distribution and many more are available
through the CRAN family of Internet sites covering a very wide range of modern statistics.
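A typical way to extend R is to install a package from CRAN and load it into the session;
ggplot2 is used below purely as an illustration, since the plots in this project use base
graphics:

install.packages("ggplot2")                    # one-time download from a CRAN mirror
library(ggplot2)                               # attach the package for this session
ggplot(mtcars, aes(mpg, wt)) + geom_point()    # quick scatter plot from a built-in dataset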
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
2.5 Use of R Studio
RStudio is an integrated development environment (IDE) for R. It provides a console, a
syntax-highlighting editor that supports direct code execution, and tools for plotting,
history, debugging and workspace management. In this project, RStudio is used to load the
files produced by the Hive queries and to generate the plots.
3.0 Procedure
8. Now execute a query to perform the required analysis and store the result into the
final_dataset table in Hive.
Command:
hive> insert into table final_dataset
select M.agegroup, M.count, T.totalsum, P.totalill, Q.commondisease
from temp M
LEFT OUTER JOIN (select agegroup, sum(count) totalsum from final
                 group by agegroup) T
  on (M.agegroup = T.agegroup)
LEFT OUTER JOIN (select agegroup, sum(count) totalill from final
                 where symptoms != 'No Symptoms of chronic diseases'
                   and symptoms != 'Asymptomatic'
                 group by agegroup) P
  on (M.agegroup = P.agegroup)
LEFT OUTER JOIN (select R.agegroup, S.symptoms commondisease from temp R
                 LEFT OUTER JOIN final S on R.count = S.count) Q
  on (M.agegroup = Q.agegroup);
R Command
10. Now load the file into RStudio
analysis1 <- read.csv("/home/aakash/Desktop/one.csv", header = T)
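Optionally, the loaded data frame can be inspected before plotting; the column names
assumed here (agegroup, count, totalsum, totalill, commondisease) come from the Hive
insert query above:

str(analysis1)       # structure and column types
head(analysis1)      # first few rows
summary(analysis1)   # basic summary of the numeric columns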
11. Represent this in the form of a bar chart.
Converting the data frame into a numeric matrix:
temp <- analysis1
analysis1 <- analysis1[, -5]    # drop the commondisease column
analysis1 <- analysis1[, -1]    # drop the agegroup column
m <- as.matrix(t(analysis1))    # rows: count, totalsum, totalill
Avoiding scientific notation on the axis:
getOption("scipen")
opt <- options("scipen" = 20)
Plotting the bar plot and adding a legend:
barplot(m, names.arg = temp$agegroup, beside = TRUE,
        col = c('red', 'blue', 'green'))
legend("topright", c("Common Disease Count", "Total Population", "Total ill"),
       cex = 0.75, fill = c('red', 'blue', 'green'))
AGE GROUP    COMMON SYMPTOMS
0-10         ENT problems/diseases
11-20        ENT problems/diseases
21-30        ENT problems/diseases
31-40        ENT problems/diseases
41-50        ENT problems/diseases
51-60        ENT problems/diseases
61-70
71-80
81-90
91-100
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'ELECTRICITY'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'REFRIGERATOR'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'TELEVISION'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'SEWING_M'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'SCOOTER'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'BICYCLE'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes' and type = 'TRACTOR'
) tmp group by type;
11. Calculating total count of Rural and Urban
hive>select count(*), rural from project.allcities where rural='Rural' or rural= 'Urban'
group by rural;
          TV         CAR        BIKE       Electricity   Cooking Fuel
Urban     100.00%    70.00%     40.00%     100.00%       100.00%
Rural     50.00%     10.00%     90.00%     100.00%       90.00%
R Command
14. Now load the file into RStudio
analysis3 <- read.csv("/home/aakash/Desktop/Final_Dataset.csv",header = T)
temp<- analysis3
15. Represent this in the form of a bar chart.
Removing columns and converting to a matrix:
analysis3 <- analysis3[, -1]
View(analysis3)
analysis3 <- analysis3[, -1]
analysis3 <- analysis3[, -1]
m <- as.matrix(t(analysis3))    # conversion mirrors step 11 of the first analysis; needed before barplot()
Plotting the bar plot and adding a legend:
barplot(m, names.arg = temp$TYPE, beside = TRUE, col = c('red', 'blue'))
legend("topright", c("Rural", "Urban"), cex = 0.75, fill = c('red', 'blue'))
8. Now execute the queries to perform the required analysis and store the results into
tables in Hive.
Inserting Required Data into a3
hive>insert into table a3 select psu_id , symptoms_pertaining_illness,
drinking_water_source from default.allcities;
Creating table a3_grouped
hive>create table a3_grouped (total int,symptoms_pertaining_illness
string,drinking_water_source string);
Inserting Data into table a3_grouped
hive>insert into table a3_grouped select count(*), symptoms_pertaining_illness ,
drinking_water_source from a3 where drinking_water_source!='NA' and
drinking_water_source!='drinking_water_source' group by
drinking_water_source, symptoms_pertaining_illness;
Calculating the maximum symptom count for each source of drinking water
hive>select max(total), drinking_water_source from a3_grouped where
symptoms_pertaining_illness!='Asymptomatic' and symptoms_pertaining_illness!
='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
hive>create table maxcount(source string, maxx int);
hive>insert into table maxcount select drinking_water_source,max(total) from
a3_grouped where symptoms_pertaining_illness!='Asymptomatic' and
symptoms_pertaining_illness!='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
Calculating the total count of people who use each water source
hive>select sum(total), drinking_water_source from a3_grouped group by
drinking_water_source;
Finding the symptom with the maximum count for each water source
hive>select maxcount.source,A.symptoms_pertaining_illness
from maxcount
LEFT OUTER JOIN a3_grouped A on (maxcount.maxx = A.total and
maxcount.source = A.drinking_water_source);
hive>create table temp(source string, symptoms string);
R Command
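The R commands for this analysis are not included in the source; a minimal sketch
mirroring the earlier analyses, assuming the Hive results were exported to a hypothetical
CSV with columns source and total, could be:

# Hypothetical file exported from the Hive tables above
analysis4 <- read.csv("/home/aakash/Desktop/water_source.csv", header = TRUE)
barplot(analysis4$total, names.arg = analysis4$source,
        col = "steelblue", las = 2, cex.names = 0.7,
        main = "People per drinking-water source")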
4.0 References
https://en.wikipedia.org/
http://www.apache.org/
5.0 Bibliography
http://www.data.gov.in/