DSBDA Practical Final
LAB MANUAL
OF
T E (IT) 2019
COURSE
Information Technology
Vision and Mission of Institute and Department
INSTITUTE VISION
“To Satisfy the Aspirations of Youth Force, Who Wants to Lead Nation towards Prosperity through
Techno-economic Development"
DEPARTMENT VISION
“To develop competent IT professionals for e-development of emerging societal needs.”
INSTITUTE MISSION
“To Provide, Nurture and Maintain an Environment of high Academic Excellence, Research and
Entrepreneurship for all aspiring Students, which will prepare them to face Global Challenges
maintaining high Ethical And Moral Standards”
DEPARTMENT MISSION
M1. "Educating aspirants to fulfill technological and social needs through effective teaching
learning process".
M2. "Imparting IT skills to develop innovative solutions catering needs of multidisciplinary
domain".
Program Outcomes: -
PO 1. Engineering knowledge: An ability to apply knowledge of mathematics, including
discrete mathematics, statistics, science, computer science and engineering
fundamentals to model the software application.
PO 2. Problem analysis: An ability to design and conduct an experiment as well as interpret
data, analyze complex algorithms, to produce meaningful conclusions and
recommendations.
PO 3. Design/development of solutions: An ability to design and develop a software
system, component, or process to meet desired needs within realistic constraints such
as economic, environmental, social, political, health and safety, manufacturability, and
sustainability constraints.
PO 4. Conduct investigations of complex problems: An ability to use research-based
knowledge, including analysis, design, and development of algorithms, for the solution
of complex problems, interpretation of data, and synthesis of information to provide
valid conclusions.
PO 5. Modern tool usage: An ability to adapt to current technologies and use modern IT tools
to design, formulate, implement, and evaluate computer-based systems and processes,
considering the computing needs, limits, and constraints.
PO 6. The engineer and society: An ability to reason about contextual knowledge of
societal, health, safety, legal, and cultural issues, and the consequent responsibilities
relevant to IT practice.
PO 7. Environment and sustainability: An ability to understand the impact of engineering
solutions in a societal context and demonstrate knowledge of and the need for
sustainable development.
PO 8. Ethics: An ability to understand and commit to professional ethics and responsibilities
and norms of IT practice.
PO 9. Individual and team work: An ability to apply managerial skills by working
effectively as an individual, as a member of a team, or as a leader of a team in
multidisciplinary projects.
PO 10. Communication: An ability to communicate technical information effectively in
speech, presentation, and written form.
PO 11. Project management and finance: An ability to apply the knowledge of Information
Technology and management principles and techniques to estimate the time and resources
needed to complete an engineering project.
PO 12. Life-long learning: An ability to recognize the need for, and have the ability to
engage in independent and life-long learning.
Experiment Learning Outcomes
314457: DS & BDA Lab
Practical Session Plan
Assignment No.1
Title: To perform Single node/Multiple node Hadoop Installation.
Objective: To study,
1. Configure Hadoop on open source software
ELO1: Able to install Hadoop on Single Node Cluster
-------------------------------------------------------------------------------------------------------------------
Theory:
Hadoop
Hadoop is an open source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of
individual machines or racks of machines) are common and thus should be automatically handled in
software by the framework.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data will be
stored in an RDBMS like Oracle Database, MS SQL Server or DB2 and sophisticated software can
be written to interact with the database, process the required data and present it to the users for
analysis purpose.
Limitation
This approach works well where we have less volume of data that can be accommodated by
standard database servers, or up to the limit of the processor which is processing the data. But when
it comes to dealing with huge amounts of data, it is really a tedious task to process such data
through a traditional database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task
into small parts and assigns those parts to many computers connected over the network, and collects
the results to form the final result dataset.
The above diagram shows various commodity hardware nodes, which could be single-CPU machines or
servers with higher capacity.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an Open
Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant. Now
Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel
on different CPU nodes. In short, Hadoop framework is capable enough to develop applications
capable of running on clusters of computers and they could perform complete statistical analysis for
huge amounts of data.
Hadoop Architecture
Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These
libraries provide filesystem and OS-level abstractions and contain the necessary Java files and
scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-
throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs
perform:
The Map Task: This is the first task, which takes input data and converts it into a set of
data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
Typically both the input and the output are stored in a file system. The framework takes care of
scheduling tasks, monitoring them, and re-executing the failed tasks.
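The map/shuffle/reduce idea can be traced outside Hadoop with a small, self-contained Python sketch. The function names and sample data below are illustrative only; real Hadoop jobs use the MapReduce/YARN components described above.
# A minimal, plain-Python illustration of the Map and Reduce tasks
from collections import defaultdict

def map_task(line):
    # break a line of input into (key, value) tuples
    return [(word, 1) for word in line.split()]

def reduce_task(key, values):
    # combine all values that share a key into a smaller result
    return (key, sum(values))

lines = ["big data big ideas", "data drives ideas"]
mapped = [pair for line in lines for pair in map_task(line)]   # map stage
grouped = defaultdict(list)                                    # shuffle stage: group values by key
for key, value in mapped:
    grouped[key].append(value)
result = [reduce_task(k, v) for k, v in grouped.items()]       # reduce stage
print(sorted(result))   # [('big', 2), ('data', 2), ('drives', 1), ('ideas', 2)]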
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster-node. The master is responsible for resource management, tracking resource
consumption/availability and scheduling the jobs component tasks on the slaves, monitoring them
and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the
master and provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the
JobTracker goes down, all running jobs are halted.
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS,
S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File
System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on large clusters (thousands of computers)
of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages
the file system metadata and one or more slave DataNodes that store the actual data.
Installation steps (single node):
1. Install OpenJDK (Java) on Ubuntu.
2. Set Up a Non-Root User for Hadoop Environment. Install OpenSSH on Ubuntu.
3. Download and Install Hadoop on Ubuntu.
4. Single Node Hadoop Deployment (Pseudo-Distributed Mode) ...
5. Format HDFS NameNode.
6. Start Hadoop Cluster.
7. Access Hadoop UI from Browser.
How Does Hadoop Work?
Stage 1
A user/application can submit a job to Hadoop (via the Hadoop job client) for the required processing by
specifying the following items:
1. The location of the input and output files in the distributed file system.
2. The Java classes, in the form of a JAR file, containing the implementation of the map and reduce
functions.
3. The job configuration by setting different parameters specific to the job.
Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the JobTracker
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation and output
of the reduce function is stored into the output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA);
rather, the Hadoop library itself has been designed to detect and handle failures at the application
layer.
a) Single Node:
Steps for Compilation & Execution
sudo apt-get update
sudo apt-get install openjdk-8-jre-headless
sudo apt-get install openjdk-8-jdk
sudo apt-get install ssh
sudo apt-get install rsync
# Download hadoop from: http://www.eu.apache.org/dist/hadoop/common/stable/hadoop-2.7.1.tar.gz
# copy and extract hadoop-2.7.1.tar.gz in home folder
# rename the name of the extracted folder from hadoop-2.7.1 to hadoop
readlink -f /usr/bin/javac
gedit ~/hadoop/etc/hadoop/hadoop-env.sh
# add following line in it
# for 32 bit ubuntu
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
# for 64 bit ubuntu
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# save and exit the file
# to display the usage documentation for the hadoop script try next command
~/hadoop/bin/hadoop
# Pseudo-Distributed mode
# get your user name
whoami
# remember your user name, we'll use it in the next step
gedit ~/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:1234</value>
</property>
</configuration>
gedit ~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/your_user_name/hadoop/name_dir</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/your_user_name/hadoop/data_dir</value>
</property>
</configuration>
#Setup passphraseless/passwordless ssh
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
export HADOOP_PREFIX=/home/your_user_name/hadoop
ssh localhost
# type exit in the terminal to close the ssh connection (very important)
exit
# The following instructions are to run a MapReduce job locally.
Format the filesystem (do it only once):
~/hadoop/bin/hdfs namenode -format
Start NameNode daemon and DataNode daemon:
~/hadoop/sbin/start-dfs.sh
Browse the web interface for the NameNode; by default it is available at:
http://localhost:50070/
Make the HDFS directories required to execute MapReduce jobs:
~/hadoop/bin/hdfs dfs -mkdir /user
~/hadoop/bin/hdfs dfs -mkdir /user/your_user_name
Copy the sample files (from ~/hadoop/etc/hadoop) into the distributed filesystem folder (input):
~/hadoop/bin/hdfs dfs -put ~/hadoop/etc/hadoop input
Run the example map-reduce job:
~/hadoop/bin/hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'us[a-z.]+'
View the output files on the distributed filesystem:
~/hadoop/bin/hdfs dfs -cat output/*
Copy the output files from the distributed filesystem to the local filesystem and examine them:
~/hadoop/bin/hdfs dfs -get output output
Remove the local output folder:
rm -r output
Remove the distributed folders (input & output):
~/hadoop/bin/hdfs dfs -rm -r input output
When you’re done, stop the daemons with
~/hadoop/sbin/stop-dfs.sh
Flow Chart
Reference :
https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster
https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/SingleCluster.html
Software Requirements:
1. Ubuntu 18.04 / 18.10
2. Hadoop 2.7.1
Conclusion: In this way, Hadoop was installed and configured on Ubuntu for Big Data processing.
Questions:
Q1) What are the various daemons in Hadoop and their roles in a Hadoop cluster?
Q2) What does the jps command do?
Q3) What is the difference between RDBMS and Hadoop?
Q4) What is YARN? Explain its components.
Q5) Explain HDFS and its components.
EXPERIMENT NO.2
Title:
Design a distributed application using MapReduce (using Java) which processes a log file of a
system. List out the users who have logged in for the maximum period on the system. Use a simple log
file from the Internet and process it using pseudo-distributed mode on the Hadoop platform.
Objectives: To learn the concept of Mapper and Reducer and implement it for log file processing
Aim: To implement a MapReduce program that will process a log file of a system.
Theory
-------------------------------------------------------------------------------------------------------------------
Introduction
MapReduce is a framework using which we can write applications to process huge amounts of data,
in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce is a
processing technique and a program model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set
of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs).
Secondly, reduce task, which takes the output from a map as an input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task
is always performed after the map job.
Under the MapReduce model, the data processing primitives are called mappers and reducers.
Once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
Algorithm
MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Input : file or directory
o Output : Sorted file<key, value>
1. Map stage:
o The map or mapper's job is to process the input data.
o Generally the input data is in the form of file or directory and is stored in the
Hadoop file system (HDFS).
o The input file is passed to the mapper function line by line.
o The mapper processes the data and creates several small chunks of data.
2. Shuffle stage:
o This phase consumes the output of mapping phase.
o Its task is to consolidate the relevant records from Mapping phase output
3. Reduce stage :
o This stage is the combination of the Shuffle stage and the Reduce stage.
o The Reducer's job is to process the data that comes from the mapper.
o After processing, it produces a new set of output, which will be stored in the
HDFS.
Inserting Data into HDFS:
• The MapReduce framework operates on <key, value> pairs, that is, the framework views the input
to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the
job, conceivably of different types.
• The key and value classes should be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
• Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce ->
<k3, v3> (Output).
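This <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> flow can be traced in a few lines of plain Python before turning to the Java classes below. The sample log-line format (client IP as the first whitespace-separated field) is an assumption based on the access_log_short.csv file used later in this experiment; the IPs are taken from the output listing.
# Plain-Python trace of the MapReduce flow for the log-analysis job
# (illustrative only; the actual experiment uses the Java classes below)
from collections import defaultdict

log_lines = [
    '10.111.71.20 - - [16/Dec/2019:10:10:00] "GET /home HTTP/1.1" 200',
    '10.117.76.22 - - [16/Dec/2019:10:11:00] "GET /login HTTP/1.1" 200',
    '10.111.71.20 - - [16/Dec/2019:10:12:00] "GET /data HTTP/1.1" 200',
]

# map: (offset, line) -> (ip, 1)
mapped = [(line.split()[0], 1) for line in log_lines]

# shuffle + reduce: (ip, [1, 1, ...]) -> (ip, count)
counts = defaultdict(int)
for ip, one in mapped:
    counts[ip] += one

print(dict(counts))                        # {'10.111.71.20': 2, '10.117.76.22': 1}
print("most active:", max(counts, key=counts.get))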
Flow chart
Program Code
#Mapper Class
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
#Reducer Class
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
#Driver Class
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Start HADOOP
#start-dfs.sh
#start-yarn.sh
#jps
cd
cd analyzelogs
ls
pwd
ls
#ls -ltr
#ls -al
#sudo chmod +r *.*
pwd
#export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.9.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.0.jar:~/analyzelogs/SalesCountry/*:$HADOOP_HOME/lib/*"
#sudo mkdir ~/input2000
ls
pwd
#sudo cp access_log_short.csv ~/input2000/
# $HADOOP_HOME/bin/hdfs dfs -put ~/input2000 /
# $HADOOP_HOME/bin/hadoop jar analyzelogs.jar /input2000 /output2000
# $HADOOP_HOME/bin/hdfs dfs -cat /output2000/part-00000
# stop-all.sh
# jps
Output:
10.104.73.51 1
10.105.160.183 1
10.108.91.151 1
10.109.21.76 1
10.11.131.40 1
10.111.71.20 8
10.112.227.184 6
10.114.74.30 1
10.115.118.78 1
10.117.224.230 1
10.117.76.22 12
10.118.19.97 1
10.118.250.30 7
10.119.117.132 23
10.119.33.245 1
10.119.74.120 1
10.12.113.198 2
10.12.219.30 1
10.120.165.113 1
10.120.207.127 4
10.123.124.47 1
10.123.35.235 1
10.124.148.99 1
10.124.155.234 1
10.126.161.13 7
10.127.162.239 1
10.128.11.75 10
10.13.42.232 1
10.130.195.163 8
10.130.70.80 1
10.131.163.73 1
10.131.209.116 5
10.132.19.125 2
10.133.222.184 12
10.134.110.196 13
10.134.242.87 1
10.136.84.60 5
10.14.2.86 8
10.14.4.151 2
hduser@com17-Veriton-M200-A780:~/analyzelog$
Conclusion: Thus we have learnt how to design a distributed application using MapReduce and
process a log file of a system.
EXPERIMENT NO. 3
Part A: Assignments based on Hadoop HBase via Hive
Title:
Write an application using HiveQL for a flight information system which will include:
a. Creating, dropping, and altering database tables.
b. Creating an external Hive table.
c. Loading the table with data, inserting new values and fields in the table, and joining tables with Hive.
d. Creating an index on the Flight Information Table.
e. Finding the average departure delay per day in 2008.
Objectives: 1) To describe the basics of Hive
2) To explain the components of the Hadoop ecosystem
Aim: To execute HiveQL queries that perform CRUD operations on the Flight table
Theory
--------------------------------------------------------------------------------------------------------------------
Hive – Introduction
Hive is defined as a data warehouse system for Hadoop that facilitates ad-hoc queries and the
analysis of large datasets stored in Hadoop.
Following are the facts related to Hive:
It provides a SQL-like language called HiveQL (HQL). Due to its SQL-like interface, Hive
is a popular choice for Hadoop analytics.
It provides massive scale-out and fault tolerance capabilities for data storage and
processing on commodity hardware.
Relying on MapReduce for execution, Hive is batch-oriented and has high latency for query
execution.
Hive – Characteristics
Hive is a system for managing and querying unstructured data using a structured format.
It uses the concept of MapReduce for the execution of its scripts and the Hadoop Distributed
File System or HDFS for storage and retrieval of data.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight.
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results.
HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage techniques used to store data into the file system.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Hive Execution Parameters -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/biadmin/Hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
</configuration>
hbase(main):014:0> disable 'tb1'
2) Load table with data, insert new values and fields in the table, join tables with Hive
hbase(main):002:0> put 'flight',1,'finfo:dest','mumbai'
Updating all regions with the new schema...
Done.
hive> SHOW INDEX ON FLIGHT;
OK
Time taken: 1.841 seconds
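Since later assignments drive everything from Python, the same HiveQL tasks (a-e) can also be issued from a small Python sketch using the PyHive library. The HiveServer2 connection details, table layout, column names, file paths, and database name below are illustrative assumptions, not the manual's reference schema.
# Minimal PyHive sketch for the flight information tasks (a-e)
from pyhive import hive

conn = hive.Connection(host='localhost', port=10000, username='hduser')  # assumed HiveServer2 endpoint
cur = conn.cursor()

# a. create / drop / alter
cur.execute("CREATE DATABASE IF NOT EXISTS flight_db")
cur.execute("CREATE TABLE IF NOT EXISTS flight_db.flight_info "
            "(fno INT, fdate STRING, dep_delay INT) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
cur.execute("ALTER TABLE flight_db.flight_info ADD COLUMNS (dest STRING)")

# b. external table pointing at data already in HDFS (path is an assumption)
cur.execute("CREATE EXTERNAL TABLE IF NOT EXISTS flight_db.flight_ext "
            "(fno INT, fdate STRING, dep_delay INT) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
            "LOCATION '/user/hduser/flight_data'")

# c. load data, insert a new row, and join the two tables
cur.execute("LOAD DATA LOCAL INPATH '/home/hduser/flight.csv' "
            "OVERWRITE INTO TABLE flight_db.flight_info")
cur.execute("INSERT INTO TABLE flight_db.flight_info VALUES (101, '2008-01-01', 15, 'mumbai')")
cur.execute("SELECT i.fno, i.dep_delay, e.fdate FROM flight_db.flight_info i "
            "JOIN flight_db.flight_ext e ON (i.fno = e.fno)")

# d. index on the flight information table (Hive versions before 3.0)
cur.execute("CREATE INDEX flight_index ON TABLE flight_db.flight_info (fno) "
            "AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' "
            "WITH DEFERRED REBUILD")

# e. average departure delay per day in 2008
cur.execute("SELECT fdate, AVG(dep_delay) FROM flight_db.flight_info "
            "WHERE fdate LIKE '2008%' GROUP BY fdate")
for row in cur.fetchall():
    print(row)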
Part-B
Assignments based on Data Analytics using Python
EXPERIMENT NO.4
1. Perform the following operations using Python on the Facebook metrics data sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
Objectives:
1. To understand and apply the analytical concepts of Big Data using R/Python.
2. To study the detailed concepts of R/Python.
Aim: To perform basic analytical operations on the given dataset
Theory:
Python is an object-oriented programming language created by Guido van Rossum in
1989. It is ideally designed for rapid prototyping of complex applications. It has interfaces to many
OS system calls and libraries and is extensible to C or C++. Many large companies use the Python
programming language, including NASA, Google, YouTube, BitTorrent, etc.
Features of Python: it is a dynamic, high-level, free, open-source, and interpreted
programming language. It supports object-oriented programming as well as procedure-oriented
programming.
1. Easy to code:
Python is a high-level programming language. Python is very easy to learn compared to
other languages like C, C#, JavaScript, Java, etc. It is very easy to code in Python,
and anybody can learn Python basics in a few hours or days. It is also a developer-
friendly language.
2. Free and Open Source:
Python language is freely available at the official website, and you can download it from the given
download link.
import pandas as pd
# Load the Facebook metrics dataset (file name and separator are assumptions;
# adjust them to the local copy of the dataset)
df = pd.read_csv('dataset_Facebook.csv', sep=';')
df.head()
df.info()
df.isnull()
df.dropna(how='any', axis=0)
# Create data subsets (the row/column choices below are assumptions for illustration)
df1 = df.iloc[:, 0:4]       # subset: first four columns
df2 = df.iloc[0:10, :]      # subset: first ten rows
df1
df2
# Merge 2 datasets/subsets (row-wise concatenation)
df_row = pd.concat([df1, df2])
df_row
# Shape and reshape data
df.shape
df.melt()
# Transposing data
df.transpose()
df1.transpose()
df2.transpose()
# Sorting data
df.sort_values(by='Category')
df.sort_index()
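A further reshape sketch with pivot_table is shown below; the 'Type' and 'like' column names are taken from the public Facebook metrics dataset and are assumptions that may need adjusting to the local file.
# Reshape: average likes per post Type and Category
pivot = df.pivot_table(values='like', index='Type', columns='Category', aggfunc='mean')
print(pivot)
# melt() turns the wide pivot back into long form
long_form = pivot.reset_index().melt(id_vars='Type', value_name='avg_like')
print(long_form.head())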
CONCLUSION: Thus we have learnt how to perform the different reshape operations using Python.
Part-B
EXPERIMENT NO. 5
Title:
Perform the following operations using Python on the Air quality and Heart Diseases data sets
1) Data cleaning 2) Data integration
3) Data transformation 4) Error correcting
5) Data model building
Objectives:
1. To understand and apply the analytical concepts of Big Data using Python.
2. To study the detailed concepts of Python.
THEORY:
Data cleaning or data preparation is an essential part of statistical analysis. In fact, in practice it is
often more time-consuming than the statistical analysis itself.
1) Data cleaning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random as rd
ds = pd.read_excel("AirQuality.xlsx")
ds_heart = pd.read_csv("heart.csv")
ds.head()
ds.info()
ds.isnull().sum()
ds.dropna()
2) Data integration
ds1 = ds.loc[111:999, ['Date', 'Time', 'C6H6(GT)', 'RH']]
ds2 = ds.iloc[[1,3,5,2,4,22,43,54,67,7,8,9,50,10,11]]
ds_integration = pd.concat([ds1,ds2])
ds_integration
3) Data transformation
ds_integration.transpose()
ds.drop(columns = "NOx(GT)")
ds2.drop(1)
ds.melt()
ds_merged = pd.concat([ds,ds_heart])
ds_merged
4) Error correcting:
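This step (and step 5, data model building) is left to the student; a minimal sketch is shown below. It continues from the ds and ds_heart DataFrames loaded in step 1, and assumes that the Air Quality data uses -200 as a missing-value sentinel (as in the UCI version) and that heart.csv has a 'target' label column (as in the common Kaggle version).
# 4) Error correcting (sketch): treat the sentinel value -200 as missing,
#    then fill missing numeric readings with the column mean
ds_corrected = ds.replace(-200, np.nan)
numeric_cols = ds_corrected.select_dtypes(include=np.number).columns
ds_corrected[numeric_cols] = ds_corrected[numeric_cols].fillna(ds_corrected[numeric_cols].mean())

# 5) Data model building (sketch): a simple classifier on the heart data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = ds_heart.drop(columns='target')   # 'target' column name is an assumption
y = ds_heart['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))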
SOFTWARE REQUIREMENTS:
1. Ubuntu 14.04 / 14.10
2. GNU C Compiler
3. Hadoop
4. Java
CONCLUSION: Thus we have learnt how to perform the different data cleaning and data
modeling operations using Python.
EXPERIMENT 6
TITLE: Integrate Python and Hadoop and perform the following operations on forest fire
dataset
OBJECTIVE:
1. To understand and apply the analytical concepts of Big Data using Python.
2. To study the detailed concepts of Hadoop.
SOFTWARE REQUIREMENTS:
1. Ubuntu 14.04 / 14.10
2. GNU C Compiler
3. Hadoop
4. Java
PROBLEM STATEMENT: Integrate Python and Hadoop and perform the following operations
on forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop
b. Data mining in Hive
THEORY:
Write the theory on your own for the following (a Python MapReduce sketch for part (a) of the problem statement follows this list):
1. Text mining
2. PyHadoop
3. Hive
4. MapReduce
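A minimal sketch of part (a) is shown below, assuming the UCI forest fires CSV layout (X,Y,month,day,...,area, with the month in the third field and the burned area in the last field) and a standard Hadoop Streaming setup; file names and HDFS paths are illustrative.
#!/usr/bin/env python
# ff_mapper.py - Hadoop Streaming mapper: emit (month, burned_area)
import sys

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith('X,'):   # skip blank lines and the CSV header
        continue
    fields = line.split(',')
    print('%s\t%s' % (fields[2], fields[-1]))

#!/usr/bin/env python
# ff_reducer.py - Hadoop Streaming reducer: total burned area per month
import sys

current_month, total_area = None, 0.0
for line in sys.stdin:
    month, area = line.strip().split('\t')
    if month != current_month:
        if current_month is not None:
            print('%s\t%.2f' % (current_month, total_area))
        current_month, total_area = month, 0.0
    total_area += float(area)
if current_month is not None:
    print('%s\t%.2f' % (current_month, total_area))

# submit with Hadoop Streaming (the streaming jar path depends on the installation):
# $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -input /forestfires -output /forestfires_out \
#   -mapper ff_mapper.py -reducer ff_reducer.py -file ff_mapper.py -file ff_reducer.py
Part (b), data mining in Hive, can be carried out with an equivalent GROUP BY query over an external Hive table created on the same CSV.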
EXPERIMENT 7
TITLE
Visualize the data using Python libraries matplotlib, seaborn by plotting the graphs for
assignment no. 2 and 3 (Group B)
OBJECTIVE:
1. To understand and apply the analytical concepts of Big Data using Python.
2. To study the detailed concepts of Python.
SOFTWARE REQUIREMENTS:
1. Ubuntu 14.04 / 14.10
2. GNU C Compiler
3. Hadoop
4. Java
PROBLEM STATEMENT: Visualize the data using Python libraries matplotlib, seaborn by
plotting the graphs for assignment no. 2 and 3 (Group B)
THEORY:
Data Visualisation in Python using Matplotlib and Seaborn
Data visualization is an easier way of presenting the data, however complex it is, to analyze trends
and relationships amongst variables with the help of pictorial representation.
The following are the advantages of Data Visualization
Easier representation of complex data
Highlights good and bad performing areas
Explores relationship between data points
Identifies data patterns even for larger data points
While building visualization, it is always a good practice to keep some below mentioned points in
mind
Ensure appropriate usage of shapes, colors, and size while building visualization
Plots/graphs using a co-ordinate system are more pronounced
Knowledge of suitable plot with respect to the data types brings more clarity to the information
Usage of labels, titles, legends and pointers passes seamless information to the wider audience
Python Libraries
There are a lot of python libraries which could be used to build visualization like matplotlib, vispy,
bokeh, seaborn, pygal, folium, plotly, cufflinks, and networkx. Of the
many, matplotlib and seaborn seem to be very widely used for basic to intermediate levels of
visualizations.
Matplotlib
It is an amazing visualization library in Python for 2D plots of arrays. It is a multi-platform data
visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It
was introduced by John Hunter in the year 2002. Let's try to understand some of the benefits and
features of matplotlib:
It's fast and efficient as it is based on NumPy and is also easier to build
Has undergone a lot of improvements from the open source community since inception and
hence a better library having advanced features as well
Well maintained visualization output with high quality graphics draws a lot of users to it
Basic as well as advanced charts could be very easily built
From the users/developers point of view, since it has a large community support, resolving
issues and debugging becomes much easier
Seaborn
Conceptualized and built originally at the Stanford University, this library sits on top of matplotlib.
In a sense, it has some flavors of matplotlib, while from the visualization point of view it is much better
than matplotlib and has added features as well. Below are its advantages
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples
Nature of Visualization
Depending on the number of variables used for plotting the visualization and the type of variables,
there could be different types of charts which we could use to understand the relationship. Based on
the count of variables, we could have
Univariate plot (involves only one variable)
Bivariate plot (more than one variable is required)
A Univariate plot could be for a continuous variable to understand the spread and distribution of the
variable while for a discrete variable it could tell us the count
Similarly, a Bivariate plot for continuous variable could display essential statistic like correlation,
for a continuous versus discrete variable could lead us to very important conclusions like
understanding data distribution across different levels of a categorical variable. A bivariate plot
between two discrete variables could also be developed.
Box plot
A boxplot, also known as a box and whisker plot, clearly displays the box and the whiskers, as shown in
the image below. It is a very good visual representation when it comes to measuring the data
distribution. It clearly plots the median values, outliers, and the quartiles. Understanding data
distribution is another important factor which leads to better model building. If data has outliers, a
box plot is a recommended way to identify them and take necessary actions.
Syntax: seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75, width=0.8,
dodge=True, fliersize=5, linewidth=None, whis=1.5, ax=None, **kwargs)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
color: Color for all of the elements.
Returns: It returns the Axes object with the plot drawn onto it.
The box and whiskers chart shows how data is spread out. Five pieces of information are generally
included in the chart
1. The minimum is shown at the far left of the chart, at the end of the left 'whisker'
2. The first quartile, Q1, is at the far left of the box
3. The median is shown as a line in the center of the box
4. The third quartile, Q3, is at the far right of the box
5. The maximum is shown at the far right of the chart, at the end of the right 'whisker'
As could be seen in the below representations and charts, a box plot could be plotted for one or
more than one variable providing very good insights to our data.
Representation of box plot.
Box plot representing multi-variate categorical variables
Python3
# import required modules
import matplotlib.pyplot as plt
import seaborn as sns
Output for Box Plot and Violin Plot
Python3
# Box plot for all the numerical variables (assumes the 'diabetes' DataFrame
# has been loaded, e.g. from the Pima diabetes CSV used later in this assignment)
sns.set(rc={'figure.figsize': (16, 5)})
sns.boxplot(data=diabetes.select_dtypes(include='number'))
Parameters:
x, y: Input data variables that should be numeric.
data: Dataframe where each column is a variable and each row is an observation.
size: Grouping variable that will produce points with different sizes.
style: Grouping variable that will produce points with different markers.
palette: Set of colors for mapping the hue variable.
markers: Object determining how to draw the markers for different levels.
alpha: Proportional opacity of the points.
Returns: This method returns the Axes object with the plot drawn onto it.
Python3
# import module
import matplotlib.pyplot as plt
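The scatter plot listing itself is not reproduced in this copy of the manual; a minimal sketch is shown below, using the two diabetes columns referenced later ('BloodPressure', 'SkinThickness'). The file name and the 'Outcome' hue column are assumptions based on the common Pima diabetes CSV.
# Scatter plot sketch for the diabetes data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

diabetes = pd.read_csv('diabetes.csv')   # file name assumed (Pima diabetes CSV)
sns.scatterplot(x='BloodPressure', y='SkinThickness', data=diabetes, hue='Outcome')
plt.show()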
Python3
# import required modules
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# create a 3D set of axes (the x and y values below are illustrative;
# the original listing only shows z)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 6, 2, 3, 13, 4, 1, 2, 4, 8]
z = [2, 3, 3, 3, 5, 7, 9, 11, 9, 10]
ax.scatter(x, y, z, color='green')
# assign labels
ax.set_xlabel('X Label'), ax.set_ylabel('Y Label'), ax.set_zlabel('Z Label')
# display illustration
plt.show()
Python3
# illustrate histogram
features = ['BloodPressure', 'SkinThickness']
diabetes[features].hist(figsize=(10, 4))
Output Histogram
Pie Chart
A pie chart is a univariate analysis and is typically used to show percentage or proportional data.
The percentage distribution of each class in a variable is provided next to the corresponding slice of
the pie. The Python libraries which could be used to build a pie chart are matplotlib and seaborn.
Syntax: matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None, autopct=None,
shadow=False)
Parameters:
data represents the array of data values to be plotted, the fractional area of each slice is
represented by data/sum(data). If sum(data)<1, then the data values returns the fractional area
directly, thus resulting pie will have empty wedge of size 1-sum(data).
labels is a list of sequence of strings which sets the label of each wedge.
color attribute is used to provide color to the wedges.
autopct is a string used to label the wedge with their numerical value.
shadow is used to create shadow of wedge.
Below are the advantages of a pie chart
Easier visual summarization of large data points
Effect and size of different classes can be easily understood
Percentage points are used to represent the classes in the data points
Python3
# import required module
import matplotlib.pyplot as plt
# Creating dataset
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR',
'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
# Creating plot
fig = plt.figure(figsize=(10, 7))
plt.pie(data, labels=cars)
# Show plot
plt.show()
Python3
# Import required module
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
# Wedge properties
wp = {'linewidth': 1, 'edgecolor': "green"}
# Explode offsets and wedge colors (values below are illustrative)
explode = (0.1, 0.0, 0.2, 0.3, 0.0, 0.0)
colors = ("orange", "cyan", "brown", "grey", "indigo", "beige")
# Label each wedge with its percentage and absolute value
def func(pct, allvalues):
    absolute = int(pct / 100. * np.sum(allvalues))
    return "{:.1f}%\n({:d})".format(pct, absolute)
# Creating plot
fig, ax = plt.subplots(figsize=(10, 7))
wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: func(pct, data),
                                  explode=explode, labels=cars, shadow=True,
                                  colors=colors, startangle=90, wedgeprops=wp,
                                  textprops=dict(color="magenta"))
# Adding legend
ax.legend(wedges, cars, title="Cars", loc="center left",
          bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=8, weight="bold")
ax.set_title("Customizing pie chart")
# Show plot
plt.show()
Output
CONCLUSION: Thus we have learnt to visualize the data using Python by plotting graphs.
ASSIGNMENT-8
Aim:
Perform the following data visualization operations using Tableau on Adult and Iris datasets
1) 1D (Linear) Data visualization
2) 2D (Planar) Data Visualization
3) 3D (Volumetric) Data Visualization
4) Temporal Data Visualization
5) Multidimensional Data Visualization
6) Tree/ Hierarchical Data visualization
7) Network Data visualization
-------------------------------------------------------------------------------------------------------------------
Introduction
Data visualization is viewed by many disciplines as a modern equivalent
of visual communication. It involves the creation and study of the visual representation of data,
meaning "information that has been abstracted in some schematic form, including attributes or
variables for the units of information".
Data visualization refers to the techniques used to communicate data or information by encoding it
as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate
information clearly and efficiently to users. It is one of the steps in data analysis or data science
1D/Linear
2D/Planar
Examples:
choropleth
3D/Volumetric
Examples:
3D computer models
In 3D computer graphics, 3D modeling (or three-dimensional modeling) is the process of
developing a mathematical representation of any surface of an object (either inanimate or
living) in three dimensions via specialized software. The product is called a 3D model.
Someone who works with 3D models may be referred to as a 3D artist. It can be displayed as a
two-dimensional image through a process called 3D rendering or used in a computer
simulation of physical phenomena. The model can also be physically created using 3D
printing devices.
Rendering is the process of generating an image from a model, by means of computer programs.
The model is a description of three-dimensional objects in a strictly defined language or data
structure. It would contain geometry, viewpoint, texture, lighting, and shading information. The
image is a digital image or raster graphics image. The term may be by analogy with an "artist's
rendering" of a scene. 'Rendering' is also used to describe the process of calculating effects in a
video editing file to produce final video output.
computer simulations
Temporal
Examples:
timeline
Image:
Friendly, M. & Denis, D. J. (2001). Milestones in the history of thematic cartography,
statistical graphics, and data visualization. Web document, http://www.datavis.ca/milestones/.
Accessed: August 30, 2012.
time series
nD/Multidimensional
Examples (category proportions, counts):
histogram
pie chart
Tree/Hierarchical
Examples:
dendrogram
Network
Examples:
matrix
node-link diagram (link-based layout algorithm)
Tableau:
Tableau is a Business Intelligence tool for visually analyzing the data. Users can create and
distribute an interactive and shareable dashboard, which depict the trends, variations, and density of
the data in the form of graphs and charts. Tableau can connect to files, relational and Big Data
sources to acquire and process data. The software allows data blending and real-time collaboration,
which makes it very unique. It is used by businesses, academic researchers, and many government
organizations for visual data analysis. It is also positioned as a Leader in Gartner's Magic Quadrant
for Business Intelligence and Analytics Platforms.
Tableau Features:
Tableau provides solutions for all kinds of industries, departments, and data environments.
Following are some unique features which enable Tableau to handle diverse scenarios.
Speed of Analysis − As it does not require a high level of programming expertise, any user
with access to data can start using it to derive value from the data.
Self-Reliant − Tableau does not need a complex software setup. The desktop version which
is used by most users is easily installed and contains all the features needed to start and
complete data analysis.
Visual Discovery − The user explores and analyzes the data by using visual tools like
colors, trend lines, charts, and graphs. There is very little script to be written as nearly
everything is done by drag and drop.
Blend Diverse Data Sets − Tableau allows you to blend different relational, semi-structured,
and raw data sources in real time, without expensive up-front integration costs. The users
don't need to know the details of how data is stored.
Architecture Agnostic − Tableau works in all kinds of devices where data flows. Hence,
the user need not worry about specific hardware or software requirements to use Tableau.
Real-Time Collaboration − Tableau can filter, sort, and discuss data on the fly and embed
a live dashboard in portals like SharePoint site or Salesforce. You can save your view of
data and allow colleagues to subscribe to your interactive dashboards so they see the very
latest data just by refreshing their web browser.
Centralized Data − Tableau server provides a centralized location to manage all of the
organization's published data sources. You can delete, change permissions, add tags, and
manage schedules in one convenient location. It's easy to schedule extract refreshes and
manage them in the data server. Administrators can centrally define a schedule for extracts
on the server for both incremental and full refreshes.
There are three basic steps involved in creating any Tableau data analysis report.
Connect to a data source − It involves locating the data and using an appropriate type of
connection to read the data.
Choose dimensions and measures − This involves selecting the required columns from the
source data for analysis.
Apply visualization technique − This involves applying required visualization methods,
such as a specific chart or graph type to the data being analyzed.
For convenience, let's use the sample data set that comes with Tableau installation named sample –
superstore.xls. Locate the installation folder of Tableau and go to My Tableau Repository. Under
it, you will find the above file at Datasources\9.2\en_US-US.
On opening Tableau, you will get the start page showing various data sources. Under the header
“Connect”, you have options to choose a file or server or saved data source. Under Files, choose
excel. Then navigate to the file “Sample – Superstore.xls” as mentioned above. The excel file has
three sheets named Orders, People and Returns. Choose Orders.
Choose the Dimensions and Measures
Next, choose the data to be analyzed by deciding on the dimensions and measures. Dimensions are
the descriptive data while measures are numeric data. When put together, they help visualize the
performance of the dimensional data with respect to the data which are measures. Choose Category
and Region as the dimensions and Sales as the measure. Drag and drop them as shown in the
following screenshot. The result shows the total sales in each category for each region.
In the previous step, you can see that the data is available only as numbers. You have to read and
calculate each of the values to judge the performance. However, you can see them as graphs or
charts with different colors to make a quicker judgment.
We drag and drop the sum (sales) column from the Marks tab to the Columns shelf. The table
showing the numeric values of sales now turns into a bar chart automatically.
You can apply a technique of adding another dimension to the existing data. This will add more
colors to the existing bar chart as shown in the following screenshot.
Conclusion: Thus we have learnt how to visualize data of different types (1D (Linear) Data
visualization, 2D (Planar) Data Visualization, 3D (Volumetric) Data Visualization, Temporal Data
Visualization, Multidimensional Data Visualization, Tree/Hierarchical Data visualization, and Network
Data visualization) by using the Tableau software.
Group C: Model Implementation
1. Create a review scraper for any e-commerce website to fetch real-time comments, reviews, ratings,
comment tags, and customer names using Python (a minimal scraper sketch follows this list).
2. Develop a mini project in a group using different predictive modeling techniques to solve any real-life
problem. (Refer link dataset: https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters)
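A minimal sketch of task 1 using requests and BeautifulSoup is shown below. The URL and the CSS class names are placeholders that must be replaced after inspecting the chosen e-commerce site; real sites may also require handling JavaScript rendering, pagination, and rate limits.
# Review scraper sketch; the URL and class names below are placeholders
import requests
from bs4 import BeautifulSoup

def text_of(parent, tag, cls):
    # return the stripped text of a child tag, or None if it is missing
    node = parent.find(tag, class_=cls)
    return node.get_text(strip=True) if node else None

url = "https://www.example.com/product/123/reviews"      # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}                  # many sites block the default agent
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

reviews = []
for block in soup.find_all("div", class_="review"):      # placeholder class name
    reviews.append({
        "customer": text_of(block, "span", "customer-name"),
        "rating":   text_of(block, "span", "rating"),
        "comment":  text_of(block, "p", "comment-text"),
    })

for r in reviews:
    print(r)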
Reference Books:
1. Big Data, Black Book, DT Editorial Services, 2015 edition.
2. Data Analytics with Hadoop, Jenny Kim, Benjamin Bengfort, O'Reilly Media, Inc.
3. Python for Data Analysis by Wes McKinney, published by O'Reilly Media, ISBN: 978-1-449-31979-3.
4. Python Data Science Handbook by Jake VanderPlas,
https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf
5. Alex Holmes, Hadoop in Practice, Dreamtech Press.
6. Online references for data sets: http://archive.ics.uci.edu/ml/,
https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters, https://www.kaggle.com