LAB MANUAL
Data Science and Big Data Analytics (DS & BDA)
Prepared by,
Mrs.Rohini Hanchate
TE COMPUTER
Semester II Academic Year 2021-22
Course Objectives:
● To develop in depth understanding and implementation of the key technologies in data science
and big data analytics
● To analyze and demonstrate knowledge of statistical data analysis techniques for decision-
making
● To gain practical, hands-on experience with statistics programming languages and big data
tools
Course Outcomes:
On completion of the course, learner will be able to
CO1: Apply principles of data science for the analysis of real time problems.
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
● http://vlabs.iitb.ac.in/vlabs-dev/labs/cglab/index.php
Suggested List of Laboratory Experiments/Assignments
(All assignments are compulsory)
Group A: Data Science
1. Data Wrangling I
Perform the following operations using Python on any open source dataset (eg. data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset from the web (e.g., https://www.kaggle.com). Provide a
clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using pandas isnull(),
describe() function to get some initial statistics. Provide variable descriptions. Types of
variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data type, apply proper type
conversions.
6. Turn categorical variables into quantitative variables in Python
In addition to the codes and outputs, explain every operation that you do in the above steps
and explain everything that you do to import/read/scrape the data set.
2. Data Wrangling II
Perform the following operations using Python on any open source dataset (eg. data.csv)
1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Provide the codes with outputs and explain everything that you do in this step.
3. Basic Statistics - Measures of Central Tendencies and Variance
Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset
(age, income etc.) with numeric variables grouped by one of the qualitative (categorical) variables.
2. Write a Python program to display some basic statistical details like percentile, mean, standard
deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
4. Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There are
506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
5. Data Analytics II
1. Implement logistic regression using Python/R to perform classification
on Social_Network_Ads.csv dataset
Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on
the
given dataset.
6. Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on
the given dataset.
7. Text Analytics
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
Create representation of document by calculating Term Frequency and Inverse Document
Frequency.
8. Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library
to see if we can find any patterns in the data.
Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram
9. Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about
whether they survived or not. (Column names : 'sex' and 'age')
Write observations on the inference from the above statistics.
10. Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:
1. How many features are there and what are their types (e.g., numeric, nominal)?
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a boxplot for each feature in the dataset. Compare distributions and identify outliers.
Group B: Big Data Analytics – JAVA/SCALA (Any three)
1. Write a code in JAVA for a simple WordCount application that counts the number of
occurrences of each word in a given input set using the Hadoop MapReduce framework on a
local standalone set-up.
2. Design a distributed application using MapReduce which processes a log file of a system.
3. Locate dataset (eg. sample_weather.txt) for working on weather data which reads the text
input files and finds average for temperature, dew point and wind speed.
4. Write a simple program in SCALA using Apache Spark framework
Group C- Mini Projects/ Case Study – PYTHON/R (Any TWO Mini Project)
1. Write a case study on Global Innovation Network and Analysis (GINA). Components of
analytic plan are 1. Discovery business problem framed, 2. Data, 3. Model planning analytic
technique and
4. Results and Key findings.
2. Use the following dataset and classify tweets into positive and negative tweets:
https://www.kaggle.com/ruchi798/data-science-tweets
3. Refer to the dataset:
https://github.com/rashida048/Some-NLP-Projects/blob/master/movie_dataset.csv
4. Use the following covid_vaccine_statewise.csv dataset and perform following analytics on the
given dataset
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons state wise vaccinated for first dose in India
c. Number of persons state wise vaccinated for second dose in India
d. Number of males vaccinated
e. Number of females vaccinated
5. Write a case study on data-driven processing for Digital Marketing OR Health care systems with the
Hadoop Ecosystem components listed below. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
Reference Books:
1. Chirag Shah, "A Hands-On Introduction to Data Science", Cambridge University Press (2020), ISBN 978-1-108-47244-9.
2. Wes McKinney, "Python for Data Analysis", O'Reilly Media, ISBN 978-1-449-31979-3.
3. Trent Hauck, "Scikit-learn Cookbook", Packt Publishing, ISBN 9781787286382.
4. R. Kent Dybvig, "The Scheme Programming Language", MIT Press, ISBN 978-0-262-51298-5.
5. Jenny Kim, Benjamin Bengfort, "Data Analytics with Hadoop", O'Reilly Media, Inc.
6. Jake VanderPlas, "Python Data Science Handbook", https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf
7. Gareth James, "An Introduction to Statistical Learning", https://www.ime.unicamp.br/~dias/Intoduction%20to%20Statistical%20Learning.pdf
8. Cay S. Horstmann, "Scala for the Impatient", Pearson, ISBN 978-81-317-9605-4.
9. Alvin Alexander, "Scala Cookbook", O'Reilly, SPD, ISBN 978-93-5110-263-2.
References:
● https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
● https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
● https://www.edureka.co/blog/hadoop-ecosystem
● https://www.edureka.co/blog/mapreduce-tutorial/#mapreduce_word_count_example
● https://github.com/vasanth-mahendran/weather-data-hadoop
● https://spark.apache.org/docs/latest/quick-start.html#more-on-dataset-operations
● https://www.scala-lang.org/
The CO-PO Mapping Matrix

CO/PO   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
CO1      2    2    2    2    2    2    -    -    -    -     3     -
CO2      2    2    2    2    3    -    -    -    -    -     -     -
CO3      2    2    2    -    2    -    -    -    -    -     -     -
CO4      2    2    2    2    2    2    -    -    -    -     -     -
CO5      2    2    2    2    2    2    -    -    -    -     -     -
CO6      2    2    2    2    2    2    -    -    -    -     3     -
Vision: "To be a recognizable institution for providing quality technical education & ensuring holistic
development of students."
Mission: "Imbibing quality technical education and overall development by endowing students with
technical skills and competency in the Computer Engineering department."
PEO2: To prepare the graduates to work as a committed professional with strong professional ethics
and values, sense of responsibilities, understanding of legal, safety, health, societal, cultural and
environmental issues.
PEO3: To prepare committed and motivated graduates with research attitude, lifelong learning,
investigative approach, and multidisciplinary thinking.
PEO4: To prepare the graduates with strong managerial and communication skills to work effectively
as individual as well as in teams.
GENERAL INSTRUCTIONS:
● Equipment in the lab is meant for the use of students. Students need to maintain a proper
decorum in the computer lab. Students must use the equipment with care.
● Students are required to carry their reference materials, files and records with completed work.
● Students are supposed to occupy the systems allotted to them and are not supposed to talk or disturb others.
● The lab can be used in free time/lunch hours by the students who need to use the systems.
● All the Students are instructed to carry their identity cards when entering the lab.
● Students are not supposed to use pen drives, compact drives or any other storage devices in
the lab.
● For Laboratory related updates and assignments students should refer to the notice board in
the Lab.
WEEKLY PLAN:

Expt. No. 01: Data Wrangling I
Problem Definition: Perform the following operations using Python on any open source dataset (e.g., data.csv). 1. Import all the required Python libraries ...
Objective: To understand basic data wrangling methods.
Expt. No. 04: Data Analytics I
Problem Definition: ... predict the value of prices of the house using the given features.

Expt. No. 05: Data Analytics II
Problem Definition: 1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
Objective: To understand the logistic regression model & implement it for the given dataset.

Expt. No. 06: Data Analytics III
Problem Definition: 1. Implement the Simple Naïve Bayes classification algorithm using Python/R on the iris.csv dataset. 2. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
Objective: To learn the simple naïve Bayes algorithm.

Expt. No. 07: Text Analytics
Problem Definition: 1. Extract a sample document and apply the following document preprocessing methods: Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.
Objective: To understand text analytics methods.

Expt. No. 08: Data Visualization I
Problem Definition: 1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.
Objective: To understand data visualization methods.

Expt. No. 09: Data Visualization II
Problem Definition: 1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along with the information about whether they survived or not (column names: 'sex' and 'age'). Write observations on the inference from the above statistics.
Objective: To understand data visualization methods.

Expt. No. 10: Data Visualization III
Problem Definition: Download the Iris flower dataset or any other dataset into a DataFrame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as: 1. How many features are there and what are their types (e.g., numeric, nominal)? 2. Create a histogram for each feature in the dataset to illustrate the feature distributions. 3. Create a boxplot for each feature in the dataset. Compare distributions and identify outliers.
Objective: To understand data visualization methods for given datasets.
Course Outcomes:

Course Outcome   Statement (at the end of the course, a student will be able to)
310256.1         Apply principles of data science for the analysis of real time problems.
310256.2         Implement data representation using statistical methods.
310256.3         Implement and evaluate data analytics algorithms.
310256.4         Perform text preprocessing.
310256.5         Implement data visualization techniques.
310256.6         Use cutting edge tools and technologies to analyze Big Data.
Course Outcomes vs Program Outcomes:

CO          PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
310256.1     2    2    2    2    2    2    -    -    -    -     3     -
310256.2     2    2    2    2    3    -    -    -    -    -     -     -
310256.3     2    2    2    -    2    -    -    -    -    -     -     -
310256.4     2    2    2    2    2    2    -    -    -    -     -     -
310256.5     2    2    2    2    2    2    -    -    -    -     -     -
310256.6     2    2    2    2    2    2    -    -    -    -     3     -
Course Outcomes vs Program Specific Outcomes:

CO          PSO1  PSO2
310256.1     1     -
310256.2     1     -
310256.3     -     2
310256.4     1     -
310256.5     2     -
310256.6     -     2
Program Outcomes:
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization for the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyze complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering
sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for public
health and safety, and cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: The problems that cannot be solved by straightforward
application of knowledge, theories and techniques applicable to the engineering discipline.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools, including prediction and modeling to complex engineering activities, with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal, and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with the society at large, such as being able to comprehend and write effective reports
and design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes:
1. Understand and apply recent technology trends and modern tools in diverse areas like Cybersecurity, High-
Performance Computing, IoT, Web Design, AI/ML, Data Science, etc.
2. Identify and solve real-world problems with software engineering principles and cater to project-based learning
through experiential and industrial exposure.
EXAMINATION SCHEME
Practical Exam: 25 Marks
Term Work: 50 Marks
Total: 75 Marks
Minimum Marks required: 20 Marks (TW) + 12 Marks (Practical)
PROCEDURE OF EVALUATION
Assignment-wise CO-PO mapping (Sr. No., Group, Title of Assignment, CO, PO):

Sr. No. 01 (Group A)
Title of Assignment: Data Wrangling I. Perform the following operations using Python on any open source dataset (e.g., data.csv): 1. Import all the required Python libraries. 2. Locate an open source dataset from the web (e.g., https://www.kaggle.com); provide a clear description of the data and its source (i.e., URL of the web site). 3. Load the dataset into a pandas dataframe. 4. Data Preprocessing: check for missing values in the data using the pandas isnull() and describe() functions to get some initial statistics; provide variable descriptions, types of variables etc.; check the dimensions of the data frame. 5. Data Formatting and Data Normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set; if variables are not in the correct data type, apply proper type conversions. 6. Turn categorical variables into quantitative variables in Python. In addition to the codes and outputs, explain every operation that you do in the above steps and explain everything that you do to import/read/scrape the data set.
CO: CO1 | PO: PO1, PO2, PO3, PO4

Sr. No. 02 (Group A)
Title of Assignment: Data Wrangling II. Perform the following operations using Python on any open source dataset (e.g., data.csv): 1. Scan all variables for missing values and inconsistencies; if there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them. 2. Scan all numeric variables for outliers; if there are outliers, use any of the suitable techniques to deal with them. 3. Apply data transformations on at least one of the variables; the purpose of this transformation should be one of the following reasons: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution. Reason and document your approach properly.
CO: CO1, CO2 | PO: PO1, PO2, PO4, PO5, PO6

Sr. No. 03 (Group A)
Title of Assignment: Basic Statistics - Measures of Central Tendencies and Variance. Perform the following operations on any open source dataset (e.g., data.csv): 1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric variables grouped by one of the qualitative (categorical) variables; for example, if your categorical variable is age groups and quantitative variable is income, then provide summary statistics of income grouped by the age groups; create a list that contains a numeric value for each response to the categorical variable. 2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
CO: CO1, CO2 | PO: PO3, PO5

Sr. No. 08 (Group A)
Title of Assignment: Data Visualization I. 1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.
CO: CO3, CO4, CO5 | PO: PO3, PO5, PO6

Sr. No. 09 (Group A)
Title of Assignment: Data Visualization II. 1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along with the information about whether they survived or not (column names: 'sex' and 'age'). Write observations on the inference from the above statistics.
CO: CO3, CO4, CO5 | PO: PO3, PO5, PO6

Sr. No. 10 (Group A)
Title of Assignment: Data Visualization III. Download the Iris flower dataset or any other dataset into a DataFrame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as: 1. How many features are there and what are their types (e.g., numeric, nominal)? 2. Create a histogram for each feature in the dataset to illustrate the feature distributions. 3. Create a boxplot for each feature in the dataset. Compare distributions and identify outliers.
CO: CO3, CO5 | PO: PO2, PO4

Sr. No. 11 (Group B)
Title of Assignment: Write a code in JAVA for a simple WordCount application that counts the number of occurrences of each word in a given input set using the Hadoop MapReduce framework on a local standalone set-up.
CO: CO1, CO6 | PO: PO3, PO5, PO11

Sr. No. 12 (Group B)
Title of Assignment: Design a distributed application using MapReduce which processes a log file of a system.
CO: CO1, CO6 | PO: PO2, PO3, PO4

Sr. No. 13 (Group B)
Title of Assignment: Locate a dataset (e.g., sample_weather.txt) for working on weather data which reads the text input files and finds the average for temperature, dew point and wind speed.
CO: CO1, CO6 | PO: PO3, PO4, PO5

Sr. No. 14 (Group C)
Title of Assignment: Group C: Mini Projects / Case Study – PYTHON/R.
CO: CO6 | PO: PO1-PO6
ASSIGNMENT 01
AIM: To perform Data Wrangling I: implement the following operations using Python.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems
● To Study the data wrangling
PROBLEM STATEMENT: Data Wrangling I Perform the following operations using Python on any
open source dataset (eg. data.csv) 1. Import all the required Python Libraries. 2. Locate an open source
data from the web (eg. https://www.kaggle.com). Provide a clear description of the data and its source
(i.e. URL of the web site). 3. Load the Dataset into pandas dataframe. 4. Data Preprocessing: check for
missing values in the data using pandas isnull(), describe() function to get some initial statistics.
Provide variable descriptions. Types of variables etc. Check the dimensions of the data frame. 5. Data
Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in
the correct data type, apply proper type conversions. 6. Turn categorical variables into quantitative
variables in Python In addition to the codes and outputs, explain every operation that you do in the
above steps and explain everything that you do to import/read/scrape the data set
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
THEORY:
Python is one of the most popular programming languages because it is dynamic and easy to use.
Python is a simple, high-level, open-source language used for general-purpose programming. It has
many open-source libraries, and Pandas is one of them. Pandas is a powerful, fast, flexible open-source
library used for data analysis and manipulation of data frames/datasets. Pandas can be used to read and
write data in different formats like CSV (comma separated values), txt, xls (Microsoft Excel) etc.
In this assignment, you will learn about various features of Pandas in Python and how to use them in
practice.
Prerequisites: Basic knowledge about coding in Python.
Installation:
If you are new to Pandas, you should first install it on your system.
Go to the Command Prompt and run it as administrator. Make sure you are connected to the internet to
download and install Pandas on your system.
Then type "pip install pandas" and press the Enter key.
Download the dataset "Iris.csv". The Iris dataset is the "Hello World" of Data Science; if you have
started your career in Data Science and Machine Learning you will be practicing basic ML algorithms
on this famous dataset. The Iris dataset contains five columns: Petal Length, Petal Width, Sepal
Length, Sepal Width and Species Type.
Iris is a flowering plant; researchers have measured various features of different iris flowers and
recorded them digitally.
Displaying the number of columns and the names of the columns. The columns attribute prints all the
columns of the dataset in a list form.
data.columns
data.shape
print(data[10:21])
# it will print the rows from 10 to 20.
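The operations above assume the dataset has already been loaded. A minimal sketch of the full preprocessing flow with pandas is given below; the file name and the column names (e.g., "SepalLengthCm", "Species") are assumptions that may differ in your copy of the dataset.

import pandas as pd

# Load the dataset into a pandas dataframe (file name is an assumption)
data = pd.read_csv("Iris.csv")

# Initial inspection
print(data.shape)            # dimensions of the data frame
print(data.dtypes)           # data type of each variable
print(data.isnull().sum())   # count of missing values per column
print(data.describe())       # initial summary statistics

# Data formatting: apply a type conversion if a column is not numeric
data["SepalLengthCm"] = pd.to_numeric(data["SepalLengthCm"], errors="coerce")

# Turn the categorical variable into a quantitative one
data["Species_code"] = data["Species"].astype("category").cat.codes
# Alternatively, use one-hot encoding
dummies = pd.get_dummies(data["Species"], prefix="Species")
print(dummies.head())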
ASSIGNMENT 02
AIM: To perform Data Wrangling II operations using Python on any open source dataset.
OBJECTIVES:
THEORY:
Identification and Handling of Null Values
Missing Data can occur when no information is provided
for one or more items or for a whole unit. Missing Data is a very big problem in real-life scenarios.
Missing Data can also refer to as NA(Not Available) values in pandas. In DataFrame sometimes many
datasets simply arrive with missing data, either because it exists and was not collected or it never
existed. For example, suppose different users being surveyed may choose not to share their income and
some users may choose not to share their address; in this way many values go missing in datasets. In Pandas,
missing data is represented by two values: 1. None: None is a Python singleton object that is often used
for missing data in Python code. 2. NaN : NaN (an acronym for Not a Number), is a special floating-
point value recognized by all systems that use the standard IEEE floating-point representation
Pandas treat None and NaN as essentially interchangeable for indicating missing or null values. To
facilitate this convention, there are several useful functions for detecting, removing, and replacing null
values in Pandas DataFrame : ● isnull() ● notnull() ● dropna() ● fillna() ● replace()
Checking for missing values using isnull() and notnull()
● Checking for missing values using isnull()
In order to check null values in a Pandas DataFrame, the isnull() function is used. This function returns a
dataframe of Boolean values which are True for NaN values.
Algorithm: Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame import pandas as pd import numpy as np
Step 4: Use isnull() function to check null values in the dataset. df.isnull()
Step 5: To create a series which is True for NaN values of a specific column, for example the math
score column, and display only the rows where math score is NaN:
series = pd.isnull(df["math score"])
df[series]
● Checking for missing values using notnull(): In order to check null values in a Pandas DataFrame, the
notnull() function is used. This function returns a dataframe of Boolean values which are False for
NaN values.
Algorithm: Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame import pandas as pd import numpy as np
Filling missing values using dropna(), fillna(), replace()
In order to fill null values in datasets, the
fillna(), replace() functions are used. These functions replace NaN values with some value of their
own. All these functions help in filling null values in datasets of a DataFrame.
Step 1 : Import pandas and numpy in order to check missing values in Pandas DataFrame import
pandas as pd import numpy as np
Step 3: Display the data frame: df
Step 4: Fill missing values using fillna():
ndf = df
ndf.fillna(0)
In order to drop null values from a dataframe, dropna() function is used. This function drops
Rows/Columns of datasets with Null values in different ways.
Algorithm: Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame import pandas as pd import numpy as np Step 2: Load the dataset in dataframe object df
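A minimal, self-contained sketch of the detection and filling steps described above; the CSV file name is an assumption, and the "math score" column follows the example used in the text.

import pandas as pd
import numpy as np

# Load the dataset into dataframe object df (file name is an assumption)
df = pd.read_csv("StudentsPerformance.csv")

# Detect missing values
print(df.isnull().sum())                 # count of nulls per column
series = pd.isnull(df["math score"])     # True where "math score" is NaN
print(df[series])                        # rows with a missing math score

# Fill missing values
ndf = df.fillna(0)                               # replace NaN with 0
ndf = df.fillna(df.mean(numeric_only=True))      # or replace with column means

# Drop rows that still contain nulls
cleaned = df.dropna()
print(cleaned.shape)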
Scatter Plot:
It is used when you have paired numerical data, or when your dependent variable has multiple values
for each reading of the independent variable, or when trying to determine the relationship between the
two variables. To plot the scatter plot one requires two variables that are somehow related to each
other, so here the Placement Score and Placement Offer Count features are used.
Algorithm:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/content/demo.csv")
df
Step 4: Draw the scatter plot with placement score and placement offer count.
plt.scatter(df['placement score'], df['placement offer count'])
plt.show()
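Only fragments of the outlier-handling and transformation code survive above; the following is a minimal sketch of one common approach (the IQR rule for outliers and a log transform to reduce skewness), reusing the file path from the algorithm and assuming the column names from the text.

import numpy as np
import pandas as pd

df = pd.read_csv("/content/demo.csv")

# Detect outliers in a numeric column using the IQR rule
col = "placement score"                      # column name is an assumption
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df[col] < lower) | (df[col] > upper)])   # the outlier rows

# Remove the outliers
df_clean = df[(df[col] >= lower) & (df[col] <= upper)].copy()

# Reduce right skewness by creating a new, log-transformed variable
df_clean["log_placement_score"] = np.log1p(df_clean[col])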
CONCLUSION: In this way we have explored the functions of the Python library for identifying and
handling outliers. Data transformation techniques were explored with the purpose of creating a new
variable and reducing the skewness of the dataset.
ASSIGNMENT 03
AIM: To perform Basic Statistics - Measures of Central Tendencies and Variance: perform the
following operations on any open source dataset.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● To develop in depth understanding and implementation of the key
technologies in data science and big data analytics
PROBLEM STATEMENT: Basic Statistics - Measures of Central Tendencies and Variance Perform
the following operations on any open source dataset (eg. data.csv) 1. Provide summary statistics
(mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric
variables grouped by one of the qualitative (categorical) variable. For example, if your categorical
variable is age groups and quantitative variable is income, then provide summary statistics of income
grouped by the age groups. Create a list that contains a numeric value for each response to the
categorical variable. 2. Write a Python program to display some basic statistical details like percentile,
mean, standard deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the
iris.csv dataset. Provide the codes with outputs and explain everything that you do in this step.
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO2: Implement data representation using statistical methods
Measures of Center
Mean
The arithmetic mean of a variable, often called the average, is computed by adding up all the values
and dividing by the total number of values.
The population mean is represented by the Greek letter μ (mu). The sample mean is represented
by x̄(x-bar). The sample mean is usually the best, unbiased estimate of the population mean. However,
the mean is influenced by extreme values (outliers) and may not be the best measure of center with
strongly skewed data. The following equations compute the population mean and sample mean.
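In the usual notation, the two means are:

μ = (1/N) Σᵢ₌₁ᴺ xᵢ   (population mean)        x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ   (sample mean)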
where xi is an element in the data set, N is the number of elements in the population, and n is the
number of elements in the sample data set.
Median
The median of a variable is the middle value of the data set when the data are sorted in order from least
to greatest. It splits the data into two equal halves with 50% of the data below the median and 50%
above the median. The median is resistant to the influence of outliers, and may be a better measure of
center with strongly skewed data.
The calculation of the median depends on the number of observations in the data set.
To calculate the median with an odd number of values (n is odd), first sort the data from smallest to
largest.
Mode
The mode is the most frequently occurring value and is commonly used with qualitative data as the
values are categorical. Categorical data cannot be added, subtracted, multiplied or divided, so the mean
and median cannot be computed. The mode is less commonly used with quantitative data as a measure
of center. Sometimes each value occurs only once and the mode will not be meaningful.
Understanding the relationship between the mean and median is important. It gives us insight into the
distribution of the variable. For example, if the distribution is skewed right (positively skewed), the
mean will increase to account for the few larger observations that pull the distribution to the right. The
median will be less affected by these extreme large values, so in this situation, the mean will be larger
than the median. In a symmetric distribution, the mean, median, and mode will all be similar in value.
If the distribution is skewed left (negatively skewed), the mean will decrease to account for the few
smaller observations that pull the distribution to the left. Again, the median will be less affected by
these extreme small observations, and in this situation, the mean will be less than the median.
Measures of Dispersion
Measures of center look at the average or middle values of a data set. Measures of dispersion look at
the spread or variation of the data. Variation refers to the amount that the values vary among
themselves. Values in a data set that are relatively close to each other have lower measures of
variation. Values that are spread farther apart have higher measures of variation.
Examine the two histograms below. Both groups have the same mean weight, but the values of Group
A are more spread out compared to the values in Group B. Both groups have an average weight of 267
lb. but the weights of Group A are more variable.
Range
The range of a variable is the largest value minus the smallest value. It is the simplest measure and
uses only these two values in a quantitative data set.
Variance
The variance uses the difference between each value and its arithmetic mean. The differences are
squared to deal with positive and negative differences. The sample variance (s2) is an unbiased
estimator of the population variance (σ2), with n-1 degrees of freedom.
s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
Standard Deviation
The standard deviation is the square root of the variance (both population and sample). While the
sample variance is the positive, unbiased estimator for the population variance, the units for the
variance are squared. The standard deviation is a common method for numerically describing the
distribution of a variable. The population standard deviation is σ (sigma) and sample standard
deviation is s.
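A minimal sketch in Python of the two tasks in the problem statement (grouped summary statistics and per-species details for Iris); the file names and the column names "age_group", "income" and "Species" are assumptions.

import pandas as pd

df = pd.read_csv("data.csv")     # any open source dataset; name is an assumption

# 1. Summary statistics of a numeric variable grouped by a categorical variable
stats = df.groupby("age_group")["income"].agg(["mean", "median", "min", "max", "std"])
print(stats)
# A list with one numeric summary value per category
income_by_group = df.groupby("age_group")["income"].mean().tolist()
print(income_by_group)

# 2. Basic statistical details for each Iris species
iris = pd.read_csv("iris.csv")
for species, grp in iris.groupby("Species"):
    print(species)
    print(grp.describe())        # percentiles, mean, std, etc.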
CONCLUSION: In this way we have explored the functions of the Python library for basic statistics:
measures of central tendency and dispersion.
ASSIGNMENT 04
AIM: Data Analytics I Create a Linear Regression Model using Python/R to predict home prices using
Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There are 506
samples and 14 feature variables in this dataset. The objective is to predict the value of prices of the
house using the given features.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using linear regression in Python for any
open source dataset
PROBLEM STATEMENT: Data Analytics I Create a Linear Regression Model using Python/R to
predict home prices using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The
Boston Housing dataset contains information about various houses in Boston through different
parameters. There are 506 samples and 14 feature variables in this dataset. The objective is to predict
the value of prices of the house using the given features.
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO2: Implement data representation using statistical methods
● Fig. 2 shown below is about the relation between weight (in Kg) and height (in
cm), a linear relation. It is an approach of studying in a statistical manner to
summarise and learn the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’ is considered as the predictor, explanatory, or
independent variable.
Fig.2 : Relation between weight (in Kg) and height (in cm)
Multivariate Regression: It concerns the study of two or more predictor variables. Usually a
transformation of the original features into polynomial features of a given degree is preferred, and
Linear Regression is then applied to it.
● A simple linear model Y = a + bX in the original features is transformed into polynomial
features, and a linear regression is then applied to it; the model becomes something like
Y = a + bX + cX²
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.
● A simple linear model is the one which involves only one dependent and one independent
variable. Regression Models are usually denoted in Matrix Notations.
● However, for a simple univariate linear model, it can be denoted by the regression
equation
y = β₀ + β₁x        (1)
● The model parameters are estimated by the least-squares technique, which minimizes the sum of squares
of errors at all the points in the sample set. The error is the deviation of the actual sample data point
from the regression line. The technique can be represented by the equation:
min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²        (2)
β₀ = ȳ − β₁x̄        (4)
Once the Linear Model is estimated using equations (3) and (4), we can estimate the value of the
dependent variable in the given range only. Going outside the range is called extrapolation which is
inaccurate if simple regression techniques are used.
3. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model created
based on the given set of observations in the sample. Two or more regression models created using
a given sample data can be compared based on their MSE. The lesser the MSE, the better the
regression model is. When the linear regression model is trained using a given set of observations,
the model with the least mean sum of squares error (MSE) is selected as the best model. The Python
or R packages select the best-fit model as the model with the lowest MSE or lowest RMSE when
training the linear regression models.
Mathematically, the MSE can be calculated as the average sum of the squared difference between
the actual value and the predicted or estimated value represented by the regression model (line or
plane).
An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE:
The Root Mean Squared Error method basically calculates the least-squares error and takes the root of
the summed values. Mathematically speaking, Root Mean Squared Error is the square root of the sum
of all squared errors divided by the total number of values. This is the formula to calculate RMSE.
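In the usual notation:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²        RMSE = √MSE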
R-Squared:
R-Squared is the ratio of the sum of squares due to regression (SSR) to the total sum of squares (SST).
The total sum of squares (SST), the regression sum of squares (SSR) and the sum of squares of errors
(SSE) all measure variation in different ways.
A value of R-squared closer to 1 would mean that the regression model covers most part of
the variance of the values of the response variable and can be termed as a good model.
One can alternatively use MSE or R-Squared based on what is appropriate and the need of the hour.
However, the disadvantage of using MSE rather than R-squared is that it will be difficult to gauge
the performance of the model using MSE as the value of MSE can vary from 0 to any larger
number. However, in the case of R-squared, the value is bounded between 0 and 1.
4. Example of Linear Regression
Consider following data for 5 students.
Each Xi (i = 1 to 5) represents the score of ith student in standard X and corresponding Yi (i = 1 to
5) represents the score of ith student in standard XII.
(i) Linear regression equation best predicts standard XIIth score
(ii) Interpretation for the equation of Linear Regression
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?
Student   Score in X (Xᵢ)   Score in XII (Yᵢ)
1              95                85
2              85                95
3              80                70
4              70                65
5              60                70
x     y     x − x̄   y − ȳ   (x − x̄)²   (x − x̄)(y − ȳ)
95    85      17       8       289           136
85    95       7      18        49           126
80    70       2      -7         4           -14
70    65      -8     -12        64            96
60    70     -18      -7       324           126

Σ(x − x̄)² = 730,   Σ(x − x̄)(y − ȳ) = 470,   x̄ = 78,   ȳ = 77

(i) The linear regression equation that best predicts the standard XII score:
y = β₀ + β₁x
β₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
β₁ = 470 / 730 = 0.644
β₀ = ȳ − β₁x̄ = 77 − 0.644 × 78 = 26.76
y = 26.76 + 0.644x
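A quick numeric check of the hand computation above (a minimal sketch using numpy):

import numpy as np

x = np.array([95, 85, 80, 70, 60])   # scores in standard X
y = np.array([85, 95, 70, 65, 70])   # scores in standard XII

# slope and intercept of the least-squares line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)           # approximately 0.644 and 26.77
print(b0 + b1 * 80)     # predicted XII score for an X score of 80, approximately 78.3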
Interpretation 1
For an increase of one unit in the value of x, there is an increase of 0.644 units in the value of y.
Interpretation 2
The score in standard XII (Yᵢ) depends on the score in standard X (Xᵢ) with a slope of 0.644, while
other factors contribute a constant 26.768 to the standard XII result.
(iii) If a student's score is 80 in std X, then the expected score in standard XII is 26.768 + 0.644 × 80 = 78.288.
● Training dataset is a dataset having attributes and class labels and used for training
Machine Learning algorithms to prepare models.
● Machines can learn when they observe enough relevant data. Using this one can model
algorithms to find relationships, detect patterns, understand complex problems and make
decisions.
● Training error is the error that occurs by applying the model to the same data from
which the model is trained.
● In simple terms, when the actual output of the training data and the predicted output of the model do
not match, the training error Ein is said to have occurred.
● Training error is much easier to compute.
(b) Testing Phase
● Testing dataset is provided as input to this phase.
● Test dataset is a dataset for which the class label is unknown. It is evaluated using the trained model.
● A test dataset used for assessment of the finally chosen model.
● Training and Testing dataset are completely different.
● Testing error is the error that occurs by assessing the model by providing the unknown
data to the model.
● In simple terms, when the actual output of the testing data and the predicted output of the model do
not match, the testing error Eout is said to have occurred.
● E out is generally observed larger than Ein.
(c) Generalization
● Generalization is the prediction of the future based on the past system.
● It needs to generalize beyond the training data to some future data that it might not have
seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never
seen.
● In general, the dataset is divided into two partition training and test sets.
● The fit method is called on the training set to build the model.
● This fit method is applied to the model on the test set to estimate the target value and
evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to estimate
how well the model trained on the training data and how well it would perform on the
unseen data.
plt.scatter(x,y,c='r')
Output:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
plt.show()
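Only fragments of the original listing appear above. The following is a minimal end-to-end sketch (not the original code); it assumes the Boston Housing data is available locally as boston.csv with the 13 feature columns and a MEDV price column.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the data (file name and column names are assumptions)
df = pd.read_csv("boston.csv")
x = df.drop(columns=["MEDV"])      # the 13 feature variables
y = df["MEDV"]                     # house price (target)

# Split into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)

# Train the linear regression model and evaluate it on the test set
model = LinearRegression()
model.fit(xtrain, ytrain)
ypred = model.predict(xtest)
print("MSE :", mean_squared_error(ytest, ypred))
print("R^2 :", r2_score(ytest, ypred))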
Conclusion:
In this way we have done data analysis using linear regression for the Boston dataset and predicted the
price of houses using the features of the Boston dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical
session)
ASSIGNMENT 05
AIM: Data Analytics II 1. Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using logistic regression in Python for any
open source dataset
PROBLEM STATEMENT: Data Analytics II 1. Implement logistic regression using Python/R to
perform classification on Social_Network_Ads.csv dataset Compute Confusion matrix to find TP, FP,
TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
Logistic Regression can be used for various classification problems such as spam detection, diabetes
prediction, whether a given customer will purchase a particular product or churn to another competitor,
whether a user will click on a given advertisement link or not, and many more.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms for
two-class classification. It is easy to implement and can be used as the baseline for any binary
classification problem. Its basic fundamental concepts are also constructive in deep learning.
Logistic regression describes and estimates the relationship between one dependent binary variable
and independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or target
variable is dichotomous in nature. Dichotomous means there are only two possible classes. For
example, it can be used for cancer detection problems. It computes the probability of an event
occurring.
It is a special case of linear regression where the target variable is categorical in nature. It uses a log
of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a
binary event, where y is the dependent variable and x1, x2, ..., xn are the explanatory variables.
Sigmoid Function:
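In the standard form, the model and the sigmoid function are:

z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
p = σ(z) = 1 / (1 + e^(−z)),   so that   log(p / (1 − p)) = z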
3. Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or
Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories such as
predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories such as
restaurant or product rating from 1 to 5.
4. Confusion Matrix Evaluation Metrics
Contingency table or Confusion matrix is often used to measure the performance of classifiers. A
confusion matrix contains information about actual and predicted classifications done by a
classification system. Performance of such systems is commonly evaluated using the data in the
matrix.
The following table shows the confusion matrix for a two class classifier.
Here each row indicates the actual classes recorded in the test data set and the each column indicates
the classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal
concerns prediction errors.
● Number of positive (Pos) : Total number instances which are labelled as positive in a
given dataset.
● Number of negative (Neg) : Total number instances which are labelled as negative in a
given dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as
positive and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as
negative and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as
negative and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as
positive and the class predicted by the classifier is negative.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided
by total number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positive
and true negative (TP + TN) divided by the total number of instances.
Accuracy = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / (Pos + Neg)
● Error Rate: Error Rate is calculated as the number of incorrectly classified instances
divided by total number of instances.
The ideal value of error rate is 0, and the worst is 1. It is also calculated as the sum of false positive and
false negative (FP + FN) divided by the total number of instances.
Error rate = (FP + FN) / (TP + FP + TN + FN) = (FP + FN) / (Pos + Neg), or err = 1 − acc
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances classified as positive.
Precision = TP / (TP + FP)
● Recall: .It is calculated as the number of correctly classified positive instances divided by
the total number of positive instances. It is also called recall or sensitivity. The ideal
value of sensitivity is 1, whereas the worst is 0.
It is calculated as the number of correctly classified positive instances divided by the total number
of positive instances.
Recall = TP / (TP + FN)
Algorithm (Social_Network_Ads Dataset):
Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
Step 2: Import the Social_Media_Adv Dataset
Step 3: Initialize the data frame
Step 4: Perform Data Preprocessing
● Convert Categorical to Numerical Values if applicable
● Check for Null Value
● Covariance Matrix to select the most promising features
● Divide the dataset into Independent(X) and
Dependent(Y)variables.
● Split the dataset into training and testing datasets
● Scale the Features if necessary.
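The remaining steps (training, prediction and evaluation) are sketched below with scikit-learn; the column names Age, EstimatedSalary and Purchased are assumptions based on the usual Social_Network_Ads.csv file.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]].values
y = df["Purchased"].values

# Split the dataset and scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train the logistic regression classifier and predict on the test set
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Confusion matrix and derived metrics
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))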
Conclusion:
In this way we have done data analysis using logistic regression for Social Media Adv. and evaluate
the performance of model.
Value Addition: Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider the binary classification task with two classes positive and
negative. Find out TP, FP, TN, FN, Accuracy, Error rate,
Precision, Recall
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4. and Explain
every step in detail.
ASSIGNMENT 06
AIM: Data Analytics III 1. Implement Simple Naïve Bayes classification algorithm using
Python/R on iris.csv dataset
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform classification using the Naïve Bayes algorithm
in Python for any open source dataset
PROBLEM STATEMENT: Data Analytics III 1. Implement Simple Naïve Bayes classification
algorithm using Python/R on iris.csv dataset. II. Compute Confusion matrix to find TP, FP, TN,
FN, Accuracy, Error rate, Precision, Recall on the given dataset
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
Problem Analysis
To implement the Naive Bayes Classification, we shall use a very famous Iris Flower Dataset
that consists of 3 classes of flowers. In this, there are 4 independent variables namely
the, sepal_length, sepal_width, petal_length and petal_width. The dependent variable is
the species which we will predict using the four independent features of the flowers.
There are 3 classes of species, namely setosa, versicolor and virginica. This dataset was
originally introduced in 1936 by Ronald Fisher. Using the various features of the flower
(independent variables), we have to classify a given flower using Naive Bayes Classification
model.
Step 1: Importing the Libraries
As always, the first step includes importing the required libraries: NumPy, Pandas and Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Step 2: Importing the Dataset
In this step, we shall import the Iris Flower dataset, which is stored as IrisDataset.csv in a GitHub
repository, and save it to the variable dataset. After this, we assign the 4 independent variables to X
and the dependent variable 'species' to y. The first 5 rows of the dataset are displayed.

dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Classification/master/IrisDataset.csv')
X = dataset.iloc[:, :4].values
y = dataset['species'].values
dataset.head(5)
>>
sepal_length sepal_width petal_length petal_width species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
Step 3: Splitting the dataset into the Training set and Test set
Once we have obtained our data set, we have to split the data into the training set and the test set.
In this data set, there are 150 rows with 50 rows of each of the 3 classes. As each class is given in
a continuous order, we need to randomly split the dataset. Here, we have the test_size=0.2, which
means that 20% of the dataset will be used for testing purpose as the test set and the
remaining 80% will be used as the training set for training the Naive Bayes classification
model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
Step 4: Feature Scaling
The dataset is scaled down to a smaller range using feature scaling. In this, both the X_train and
X_test values are scaled down to smaller values to improve the speed of the program.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Step 5: Training the Naive Bayes Classification model on the Training Set
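The training code itself does not appear in this extract; a minimal sketch of this step with scikit-learn's GaussianNB is:

from sklearn.naive_bayes import GaussianNB

# Train the Naive Bayes classifier on the scaled training data
classifier = GaussianNB()
classifier.fit(X_train, y_train)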
Step 6: Predicting the Test Set Results
Once the model is trained, we use classifier.predict() to predict the values for the Test set, and the
values predicted are stored in the variable y_pred.
y_pred = classifier.predict(X_test)
y_pred
Step 7: Confusion Matrix and Accuracy
This is a step that is mostly used in classification techniques. In this, we see the accuracy of the trained
model and plot the confusion matrix.
The confusion matrix is a table that is used to show the number of correct and incorrect
predictions on a classification problem when the real values of the Test Set are known. It is of
the format
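A minimal sketch of computing the confusion matrix and accuracy with scikit-learn, assuming y_test and y_pred from the previous steps:

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))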
From the above confusion matrix, we infer that, out of 30 test set data, 29 were correctly
classified and only 1 was incorrectly classified. This gives us a high accuracy of 96.67%.
In this step, a Pandas DataFrame is created to compare the classified values of both the original
Test set (y_test) and the predicted results (y_pred).
df = pd.DataFrame({'Real Values': y_test, 'Predicted Values': y_pred})
df
>>
Real Values Predicted Values
setosa setosa
setosa setosa
virginica virginica
versicolor versicolor
setosa setosa
setosa setosa
... ... ... ... ...
virginica versicolor
virginica virginica
setosa setosa
setosa setosa
versicolor versicolor
versicolor versicolor
This is an additional step which is not as informative as the confusion matrix, and is mainly used in
regression to check the accuracy of the predicted values.
As you can see, there is one incorrect prediction that has predicted versicolor instead
of virginica.
ASSIGNMENT 07
AIM: Text Analytics 1. Extract a sample document and apply the following document preprocessing methods.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform text preprocessing and analysis using Python
on any sample document
PROBLEM STATEMENT: Text Analytics 1. Extract Sample document and apply following
document preprocessing methods: Tokenization, POS Tagging, stop words removal, Stemming
and Lemmatization. Create representation of document by calculating Term Frequency and
Inverse Document Frequency
OUTCOMES:
o Tokenization
o Stopwords
o POS Tagging
● Sentiment Analysis
● Text Classification
Tokenization
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called tokenization. A token is a single entity that acts as a building block of a sentence or paragraph.
Sentence Tokenization
Sentence tokenizer breaks text paragraph into sentences.
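A minimal sketch, assuming NLTK and its punkt tokenizer data are installed, and using the sample paragraph that appears in the word-tokenization output below:
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')  # required once before using the tokenizers
text = "Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"
tokenized_text = sent_tokenize(text)
print(tokenized_text)
This splits the paragraph into four sentences.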
Word Tokenization
Word tokenizer breaks text paragraph into words.
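The word-tokenization step that produces the output shown below would look like this (text is the paragraph defined in the sentence-tokenization sketch above):
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)
print(tokenized_word)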
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is',
'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Frequency Distribution
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)
<FreqDist with 25 samples and 30 outcomes>
fdist.most_common(2)
[('is', 3), (',', 2)]
# Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()
Stopwords
Stopwords are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an, the, etc.
To remove stopwords with NLTK, you need to create a list of stopwords and filter your list of tokens against it.
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)
{'their', 'then', 'not', 'ma', 'here', 'other', 'won', 'up', 'weren', 'being', 'we', 'those', 'an', 'them', 'which', 'him',
'so', 'yourselves', 'what', 'own', 'has', 'should', 'above', 'in', 'myself', 'against', 'that', 'before', 't', 'just', 'into',
'about', 'most', 'd', 'where', 'our', 'or', 'such', 'ours', 'of', 'doesn', 'further', 'needn', 'now', 'some', 'too', 'hasn',
'more', 'the', 'yours', 'her', 'below', 'same', 'how', 'very', 'is', 'did', 'you', 'his', 'when', 'few', 'does', 'down',
'yourself', 'i', 'do', 'both', 'shan', 'have', 'itself', 'shouldn', 'through', 'themselves', 'o', 'didn', 've', 'm', 'off',
'out', 'but', 'and', 'doing', 'any', 'nor', 'over', 'had', 'because', 'himself', 'theirs', 'me', 'by', 'she', 'whom', 'hers',
're', 'hadn', 'who', 'he', 'my', 'if', 'will', 'are', 'why', 'from', 'am', 'with', 'been', 'its', 'ourselves', 'ain', 'couldn',
'a', 'aren', 'under', 'll', 'on', 'y', 'can', 'they', 'than', 'after', 'wouldn', 'each', 'once', 'mightn', 'for', 'this', 'these',
's', 'only', 'haven', 'having', 'all', 'don', 'it', 'there', 'until', 'again', 'to', 'while', 'be', 'no', 'during', 'herself', 'as',
'mustn', 'between', 'was', 'at', 'your', 'were', 'isn', 'wasn'}
Removing Stopwords
filtered_sent=[]
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_sent)
print("Filtered Sentence:", filtered_sent)
Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Lexicon Normalization
Lexicon normalization deals with another type of noise in the text. For example, the words connection, connected and connecting all reduce to the common word "connect". Normalization reduces derivationally related forms of a word to a common root word.
Stemming
Stemming is a process of linguistic normalization which reduces a word to its root form by chopping off derivational affixes. For example, connection, connected and connecting all reduce to the common word "connect".
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:",filtered_sent)
print("Stemmed Sentence:",stemmed_words)
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?']
Lemmatization
Lemmatization reduces words to their base word (lemma), which is a linguistically correct form. It transforms a word to its root form using vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma; stemming misses this because it requires a dictionary look-up.
#Lexicon Normalization
#performing stemming and Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer  # requires nltk.download('wordnet')
lem = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()
word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))
Lemmatized Word: fly
Stemmed Word: fli
POS Tagging
The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word, i.e., whether it is a noun, pronoun, adjective, verb, adverb, etc., based on the context. POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.
sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens=nltk.word_tokenize(sent)
print(tokens)
['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']
nltk.pos_tag(tokens)
[('Albert', 'NNP'),
('Einstein', 'NNP'),
('was', 'VBD'),
('born', 'VBN'),
('in', 'IN'),
('Ulm', 'NNP'),
(',', ','),
('Germany', 'NNP'),
('in', 'IN'),
('1879', 'CD'),
('.', '.')]
POS tagged: Albert/NNP Einstein/NNP was/VBD born/VBN in/IN Ulm/NNP ,/, Germany/NNP
in/IN 1879/CD ./.
CONCLUSION:- In this way we have explored text analytics.
ASSIGNMENT 08
The Seaborn library can be used to draw a variety of charts such as matrix plots, grid plots and regression plots; in this assignment we will see how it can be used to draw distributional and categorical plots. Regression plots, matrix plots and grid plots are covered separately.
The seaborn library can be downloaded in a couple of ways. If you are using pip installer for
Python libraries, you can execute the following command to download the library:
Alternatively, if you are using the Anaconda distribution of Python, you can execute the following command to download the Seaborn library:
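The install commands (the standard ones for pip and Anaconda respectively) are:
pip install seaborn
conda install seaborn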
The Dataset
The dataset that we are going to use to draw our plots is the Titanic dataset, which Seaborn can download automatically. All you have to do is use the load_dataset function and pass it the name of the dataset.
Let's see what the Titanic dataset looks like. Execute the following script:
import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
Distributional Plots
Distributional plots, as the name suggests, are plots that show the statistical distribution of the data. In this section we will see some of the most commonly used distribution plots in Seaborn.
The Dist Plot
The distplot() shows the histogram distribution of data for a single column. The column name is
passed as a parameter to the distplot() function. Let's see how the price of the ticket for each
passenger is distributed. Execute the following script:
sns.distplot(dataset['fare'])
Output:
You can see that most of the tickets have been sold for between 0 and 50 dollars. The line that you see represents the kernel density estimation. You can remove this line by passing False for the kde parameter, as shown below:
sns.distplot(dataset['fare'], kde=False)
You can also pass a value for the bins parameter in order to see more or less detail in the graph. Take a look at the following script:
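A sketch of that script, passing bins=10 and keeping the kernel density line switched off:
sns.distplot(dataset['fare'], kde=False, bins=10)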
Here we set the number of bins to 10. In the output, you will see data distributed in 10 bins as
shown below:
Output:
You can clearly see that for more than 700 passengers, the ticket price is between 0 and 50.
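The Joint Plot
The joint plot for the 'age' and 'fare' columns discussed below can be drawn with a script along these lines (default scatter kind):
sns.jointplot(x='age', y='fare', data=dataset)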
From the output, you can see that a joint plot has three parts: a distribution plot at the top for the column on the x-axis, a distribution plot on the right for the column on the y-axis, and a scatter plot in between that shows the mutual distribution of data for both columns. You can see that there is no clear correlation between the ages and the fares.
You can change the type of the joint plot by passing a value for the kind parameter. For instance,
if instead of scatter plot, you want to display the distribution of data in the form of a hexagonal
plot, you can pass the value hex for the kind parameter. Look at the following script:
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')
Output:
In the hexagonal plot, the hexagon with the most points gets the darkest color. Looking at the plot, you can see that most of the passengers are between 20 and 30 years old and most of them paid between 10 and 50 for their tickets.
The Pair Plot
The pairplot() is a type of distribution plot that basically plots a joint plot for all possible combinations of the numeric and Boolean columns in your dataset. You only need to pass the name of your dataset as the parameter to the pairplot() function, as shown below:
sns.pairplot(dataset)
CONCLUSION:- In this way we have explored the Data Visualization I.
ASSIGNMENT 09
AIM: Data Visualization II 1. Use the inbuilt dataset 'titanic' as used in the above problem.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using logistic regression in Python for any open source dataset
PROBLEM STATEMENT: Data Visualization II 1. Use the inbuilt dataset 'titanic' as used in the
above problem. Plot a box plot for distribution of age with respect to each gender along with the
information about whether they survived or not. (Column names : 'sex' and 'age') Write
observations on the inference from the above statistics.
OUTCOMES:
Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The categorical plots plot the values in a categorical column against another categorical column or a numeric column. Let's see some of the most commonly used categorical plots.
The barplot() is used to display the mean value of a numeric column for each value in a categorical column. The first parameter is the categorical column, the second parameter is the numeric column, and the third parameter is the dataset. For instance, if you want to know the mean age of the male and female passengers, you can use the bar plot as follows.
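A minimal sketch of this bar plot, using the 'sex' and 'age' columns of the Titanic dataset loaded earlier:
sns.barplot(x='sex', y='age', data=dataset)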
Output:
From the output, you can clearly see that the average age of male passengers is just less than 40
while the average age of female passengers is around 33.
In addition to finding the average, the bar plot can also be used to calculate other aggregate
values for each category. To do so, you need to pass the aggregate function to the estimator. For
instance, you can calculate the standard deviation for the age of each gender as follows:
import numpy as np
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
Notice, in the above script we use the std aggregate function from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output looks like this:
The Count Plot
The count plot is similar to the bar plot; however, it displays the count of each category in a specific column. For instance, if we want to count the number of male and female passengers, we can do so using a count plot as follows:
sns.countplot(x='sex', data=dataset)
The output shows the count as follows:
The Box Plot
The box plot shows the distribution of a numeric column, through its quartiles, for each category. The following script plots the distribution of age for each gender:
sns.boxplot(x='sex', y='age', data=dataset)
Output:
Passengers whose ages do not fall within the quartile whiskers are treated as outliers and are represented by dots on the box plot.
You can make your box plots fancier by adding another layer of distribution. For instance, if you want to see the box plots for the ages of passengers of both genders, along with the information about whether or not they survived, you can pass 'survived' as the value for the hue parameter, as shown below:
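A sketch of this box plot with the hue parameter:
sns.boxplot(x='sex', y='age', hue='survived', data=dataset)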
Output:
Now, in addition to the information about the age of each gender, you can also see the distribution of the passengers who survived. For instance, you can see that among the male passengers, on average more of the younger people survived compared to the older ones. Similarly, you can see that the variation in the ages of female passengers who did not survive is much greater than that of the surviving female passengers.
Instead of plotting two different graphs for the passengers who survived and those who did not, you can have one violin plot divided into two halves, where one half represents the surviving passengers and the other half the non-surviving passengers. To do so, you need to pass True as the value for the split parameter of the violinplot() function. Let's see how we can do this:
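A sketch of the split violin plot described above:
sns.violinplot(x='sex', y='age', hue='survived', data=dataset, split=True)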
Now you can clearly see the comparison between the age of the passengers who survived and
who did not for both males and females.
ASSIGNMENT 10
AIM: Data Visualization III Download the Iris flower dataset or any other dataset into a
DataFrame.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using logistic regression in Python for any open source dataset
PROBLEM STATEMENT: Data Visualization III Download the Iris flower dataset or any other
dataset into a DataFrame. (eg https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and
give the inference as: 1. How many features are there and what are their types (e.g., numeric,
nominal)? 2. Create a histogram for each feature in the dataset to illustrate the feature
distributions. 3. Create a boxplot for each feature in the dataset. Compare distributions and
identify outliers.
OUTCOMES:
path = "iris.csv"
df = pd.read_csv(path, header=None)
26
headers = ["Sepal-length", "Sepal-width", "Petal-length", "Petal-width", "Species"]
df.columns = headers
print(df.head())
print(df.tail())
print(df.info())
print(df.shape)
print(df.dtypes)
print(df.describe())
df.hist()
plt.show()
df.boxplot()
plt.show()
plt.scatter(df["Sepal-length"], df["Sepal-width"])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
plt.scatter(df["Sepal-length"], df["Petal-length"])
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.show()
plt.scatter(df["Sepal-length"], df["Petal-width"])
plt.xlabel('Sepal Length')
plt.ylabel('Petal Width')
plt.show()
plt.scatter(df["Sepal-width"], df["Sepal-length"])
plt.xlabel('Sepal Width')
plt.ylabel('Sepal Length')
plt.show()
plt.scatter(df["Sepal-width"], df["Petal-length"])
plt.xlabel('Sepal Width')
plt.ylabel('Petal Length')
plt.show()
plt.scatter(df["Sepal-width"], df["Petal-width"])
plt.xlabel('Sepal Width')
plt.ylabel('Petal Width')
plt.show()
plt.scatter(df["Petal-length"], df["Sepal-length"])
plt.xlabel('Petal Length')
plt.ylabel('Sepal Length')
plt.show()
plt.scatter(df["Petal-length"], df["Sepal-width"])
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.show()
plt.scatter(df["Petal-length"], df["Petal-width"])
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
plt.scatter(df["Petal-width"], df["Sepal-length"])
plt.xlabel('Petal Width')
plt.ylabel('Sepal Length')
plt.show()
plt.scatter(df["Petal-width"], df["Sepal-width"])
plt.xlabel('Petal Width')
plt.ylabel('Sepal Width')
plt.show()
plt.scatter(df["Petal-width"], df["Petal-length"])
plt.xlabel('Petal Width')
plt.ylabel('Petal Length')
plt.show()
import numpy as np
import pandas as pd
df = pd.read_csv("iris-flower-dataset.csv",header=None)
df.columns = ["col1","col2","col3","col4","col5"]
df.head()
column = len(list(df))
column
df.info()
np.unique(df["col5"])
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
df.describe()
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
fig, axes = plt.subplots(2, 2, figsize=(16, 8))
sns.set_style("whitegrid")
# Creating a figure instance
fig = plt.figure(1, figsize=(12,8))
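The script above only creates the figure and axes; a minimal sketch that fills a 2 x 2 grid with one boxplot per numeric feature (using the generic column names col1 to col4 assigned earlier) could look like this:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("iris-flower-dataset.csv", header=None)
df.columns = ["col1", "col2", "col3", "col4", "col5"]

# One boxplot per numeric feature, arranged on a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(16, 8))
for ax, col in zip(axes.flat, ["col1", "col2", "col3", "col4"]):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.show()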
ASSIGNMENT 11
AIM: Write a code in JAVA for a simple WordCount application that counts the number of
occurrences of each word in a given input set
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using logistic regression in Python for any open source dataset
PROBLEM STATEMENT: Write a code in Java for a simple WordCount application that
counts the number of occurrences of each word in a given input set using the Hadoop
MapReduce framework on local-standalone set-up
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO6: Use cutting edge tools and technologies to analyze Big Data
This is a Hadoop MapReduce application for counting words. It reads the text input files, breaks each line into words, and counts the number of occurrences of each word. The output is a locally sorted list of words and their counts.
Installation
Since this is a MapReduce project, please install Apache Hadoop first. Refer to the site below for more details:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Once you have installed Hadoop, download the project using git commands and install it.
Run
Add the sample input file given in the project to HDFS, and start the MapReduce application to submit the job to Hadoop.
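The assignment asks for a Java implementation; purely to illustrate the map and reduce logic, here is a minimal Hadoop Streaming sketch in Python (the file names mapper.py and reducer.py are assumptions). It is not the required Java code, only an outline of what the two phases do.
mapper.py:
import sys
# Emit "word<TAB>1" for every word read from standard input
for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")
reducer.py:
import sys
# Sum the counts for each word; Hadoop delivers the mapper output sorted by key
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))
Locally, the pipeline can be checked with: cat input.txt | python mapper.py | sort | python reducer.py. On the cluster it would be submitted through the hadoop-streaming jar shipped with Hadoop.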
CONCLUSION:- In this way we have explored the Hadoop MapReduce framework on local-
standalone set-up.
ASSIGNMENT 12
AIM: Design a distributed application using MapReduce which processes a log file of a system.
OBJECTIVES:
● To understand principles of data science for the analysis of real time
problems.
● Students should be able to perform data analysis using logistic regression in Python for any open source dataset
PROBLEM STATEMENT: Design a distributed application using MapReduce which processes
a log file of a system.
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO6: Use cutting edge tools and technologies to analyze Big Data
This section refers to the installation settings of Hadoop on a standalone system as well as on a system existing as a node in a cluster.
SINGLE-NODE INSTALLATION: Running Hadoop on Ubuntu (single node cluster setup)
The report here will describe the required steps for setting up a single-node Hadoop cluster
backed by the Hadoop Distributed File System, running on Ubuntu Linux. Hadoop is a
framework written in Java for running applications on large clusters of commodity hardware and
incorporates features similar to those of the Google File System (GFS) and of the MapReduce
computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like
Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput
access to application data and is suitable for applications that have large data sets. Before we
start, we will understand the meaning of the following:
DataNode: A DataNode stores data in the Hadoop File System. A functional file system has
more than one DataNode, with the data replicated across them.
NameNode: The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.
TaskTracker: A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
Secondary NameNode: The Secondary NameNode's whole purpose is to maintain a checkpoint of the HDFS metadata. It is just a helper node for the NameNode.
Prerequisites: Java is the primary requirement to run Hadoop on any system, so make sure Java is installed using the following command:
# java -version
If you don't have Java installed on your system, follow the steps to install Java (Java download).
CONCLUSION:- In this way we have explored the Hadoop MapReduce framework for processing a system log file.
ASSIGNMENT 13
AIM: Locate a dataset (e.g. sample_weather.txt) for working on weather data; read the text input files and find the average temperature, dew point and wind speed.
OBJECTIVES:
● To analyze and demonstrate knowledge of statistical data analysis techniques for
decision-making
● To gain practical, hands-on experience with statistics programming languages and big
data tools
PROBLEM STATEMENT: Locate a dataset (e.g. sample_weather.txt) for working on weather data; read the text input files and find the average temperature, dew point and wind speed.
OUTCOMES:
CO1: Apply principles of data science for the analysis of real time problems.
CO6: Use cutting edge tools and technologies to analyze Big Data
This is a Hadoop MapReduce application for working on weather data. It reads the text input files, breaks each line into station weather data, and finds the averages for temperature, dew point and wind speed. The output is a locally sorted list of stations, each with a 12-attribute vector of average temperature, dew point and wind speed for the 4 sections of each month.
Installation
Since this is a MapReduce project, please install Apache Hadoop first; refer to the site mentioned in the previous assignment for more details. Once you have installed Hadoop, download the project using git commands and install it.
Run
Add the sample input file given in the project to HDFS, and start the MapReduce application to submit the job to Hadoop.
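As an illustrative sketch only: assuming each line of sample_weather.txt holds a station identifier followed by temperature, dew point and wind speed separated by whitespace (the real application additionally splits the averages by month and section), the map and reduce steps in a Hadoop Streaming style could look like this in Python:
mapper.py:
import sys
# Emit "station<TAB>temperature,dew_point,wind_speed" for every valid record
for line in sys.stdin:
    parts = line.split()
    if len(parts) < 4:
        continue  # skip malformed lines
    station, temp, dew, wind = parts[0], parts[1], parts[2], parts[3]
    print(station + "\t" + temp + "," + dew + "," + wind)
reducer.py:
import sys
from collections import defaultdict

# Accumulate sums and counts per station, then print the three averages
sums = defaultdict(lambda: [0.0, 0.0, 0.0])
counts = defaultdict(int)
for line in sys.stdin:
    station, values = line.rstrip("\n").split("\t", 1)
    t, d, w = (float(v) for v in values.split(","))
    sums[station][0] += t
    sums[station][1] += d
    sums[station][2] += w
    counts[station] += 1
for station in sorted(sums):
    avg = [round(v / counts[station], 2) for v in sums[station]]
    print(station + "\t" + str(avg[0]) + "\t" + str(avg[1]) + "\t" + str(avg[2]))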
CONCLUSION:- In this way we have located a weather dataset (e.g. sample_weather.txt) and used a MapReduce application to read the text input files and compute the average temperature, dew point and wind speed.