Data Analytics Lab Manual
P. R. POTE PATIL
College of Engineering & Management, Amravati.
Program Outcomes:
Engineering Graduates will be able to:
1. Engineering Knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
2. Problem Analysis: Identify, formulate, research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex
engineering problems and design system components or processes that meet
the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments, analysis
and interpretation of data, and synthesis of the information to provide valid
conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modeling to complex engineering activities with an understanding of the
limitations.
6. The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and
demonstrate the knowledge of, and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities
with the engineering community and with society at large, such as, being able
to comprehend and write effective reports and design documentation, make
effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply
these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context
of technological change.
P. R. POTE PATIL
COLLEGE OF ENGINEERING & MANAGEMENT, AMRAVATI.
Certificate
This is to certify that Mr./Ms…………………………………………... of
….... Semester of Bachelor of Technology in Artificial Intelligence &
Data Science of P. R. Pote Patil College of Engineering &
Management, Amravati, has completed the term work satisfactorily for the
course…………………. for the academic year 20 - 20 as
prescribed in the curriculum.
Subject Teacher Head of the Department
Signature of Faculty
Course Outcomes
After successful completion of this laboratory course, students will be able to:
SN Outcomes
1 Understand the Data Analytics life cycle and business challenges.
2 Understand analytical techniques and statistical models.
3 Understand a statistical modeling language.
Rubrics used for continuous assessment in every lab session

Process Related Skills (Allocated Marks: 15)
1. Handle equipment/tools/commands correctly or Logic Formation (5):
   High - Most satisfactory (4-5); Medium - Partially successful (3); Low - Below expectation (0-2)
2. Work cohesively in team (2):
   High - Exceptional (2); Medium - Satisfactory (1); Low - Unsatisfactory (0)
3. Integrate system & measure parameters correctly or Debugging Ability (4):
   High - Highly satisfactory (4); Medium - Partially correct (2-3); Low - Incorrect or unsatisfactory (0-1)
4. Completed experiment as per schedule (4):
   High - Completed in time (4); Medium - Completed but delayed (2); Low - Completed with 50% delay (1)

Product Related Skills (Allocated Marks: 10)
1. Obtain correct results, interpret results (4):
   High - Highly accurate (4); Medium - Partially correct (2-3); Low - Incorrect (0-1)
2. Draw conclusion (3):
   High - Highly accurate (3); Medium - Partially (2); Low - Unsatisfactory (1)
3. Answer practical-related questions & submit the write-up of the experiment on time (3):
   High - Highly satisfactory (3); Medium - Moderately satisfactory (2); Low - Unsatisfactory (1)

Total Marks: 25
Assessment
Marks Rubrics: 25 = a (05) + b (02) + c (04) + d (04) + e (04) + f (03) + g (03)
a: Handle equipment/ tools/ commands correctly or Logic Formation
b: Work cohesively in team
c: Integrate system & measure parameters correctly or Debugging Ability
d: Completed experiment as per schedule
e: Obtain correct results, Interpret results
f: Draw conclusion
g: Answer practical related questions & submit the write up of experiment on time
SN | Title of the Practical / Experiment | (a) 05 | (b) 02 | (c) 04 | (d) 04 | (e) 04 | (f) 03 | (g) 03 | Total (25 Marks)
1 | Install and configure Hadoop framework
2 | Working with Hadoop distributed file system
3 | Implement the MapReduce method in Hadoop and write the Word count program
4 | Develop the Program for Apriori Algorithm
5 | Installation of R Studio and write a simple program for it
6 | To construct a Data Frame and develop an R program for data frame
7 | Construct a program for Manipulating & Processing Data in R
8 | To Generate Graphs Using Plot(), Hist(), Linechart(), Pie(), Boxplot(), and Scatterplots(): Develop an R program
EXPERIMENT NO: 01
Title: Install and configure Hadoop framework
Objective: To enable the use of a distributed computing environment for handling
large volumes of data efficiently
Software Details:
SN | Name of Software/Tools | Specification | Qty Required
01 | Hadoop | 3.12.4 | 01
02 | Java | 8 version | 01
Theory:
Hadoop can be installed in three modes: stand-alone (local) mode, pseudo-distributed
mode, and fully distributed mode.
Hadoop is a Java-based programming framework that supports the processing and
storage of extremely large data sets on a cluster of inexpensive machines. It was the first
major open source project in the big data playing field and is sponsored by the Apache
Software Foundation.
Hadoop-2.7.3 comprises four main layers:
● Hadoop Common is the collection of utilities and libraries that support other
hadoop modules.
● HDFS, which stands for Hadoop Distributed File System, is responsible for
persisting data to disk.
● YARN, short for Yet Another Resource Negotiator, is the "operating system" for
HDFS.
● MapReduce is the original processing model for Hadoop clusters. It distributes
work within the cluster or map, then organizes and reduces the results from the
nodes into a response to a query. Many other processing models are available for
the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone
mode which is suitable for learning about Hadoop, performing simple operations, and
debugging.
Procedure:
We’ll install Hadoop in stand-alone mode and run one of the example MapReduce
programs it includes to verify the installation.
Prerequisites:
Step 1: Install Java 8.
After installation, running java -version should produce output similar to:
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been installed successfully.
Note: Set the JAVA_HOME environment variable to the path of the JDK installation.
Step 2: Install Hadoop.
With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent
stable release. Follow the link to the binary for the current release:
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 02
Title: Working with Hadoop distributed file system
Objective: To efficiently store and manage large volumes of data across a
distributed environment
Theory:
Working with the Hadoop Distributed File System (HDFS) involves interacting with a
distributed storage system designed to handle large amounts of data across multiple
machines. HDFS is part of the Apache Hadoop ecosystem and provides fault-tolerant
storage with high throughput.
Here’s an overview of how to work with HDFS using the Hadoop command-line interface
(CLI); HDFS can also be accessed programmatically, for example from Java or Python.
1. HDFS Overview
HDFS has two key components:
1. NameNode: This is the master node that manages the filesystem namespace and
regulates access to files.
2. DataNode: These are the worker nodes that store the actual data blocks.
HDFS stores large files by splitting them into blocks (the default block size is 128 MB,
often configured to 256 MB) and distributing them across different nodes in the cluster.
For example, a 512 MB file stored with 128 MB blocks occupies four blocks, possibly on
four different DataNodes.
2. Using the HDFS Command-Line Interface (CLI)
You can interact with HDFS using the Hadoop CLI. Some basic commands include:
a. Listing files in HDFS:
hdfs dfs -ls /user/hadoop/
This lists all the files and directories in the /user/hadoop/ directory.
b. Creating directories in HDFS:
hdfs dfs -mkdir /user/hadoop/new_dir
This creates a new directory in HDFS at /user/hadoop/new_dir.
c. Copying files from local filesystem to HDFS:
hdfs dfs -put localfile.txt /user/hadoop/
This uploads localfile.txt from your local file system to HDFS under /user/hadoop/.
d. Copying files from HDFS to local file system:
hdfs dfs -get /user/hadoop/testfile.txt /path/to/local/
This retrieves the testfile.txt from HDFS to your local machine.
e. Reading a file from HDFS:
hdfs dfs -cat /user/hadoop/testfile.txt
This prints the content of testfile.txt from HDFS to the terminal.
f. Deleting files from HDFS:
hdfs dfs -rm /user/hadoop/testfile.txt
This deletes the file testfile.txt from HDFS.
g. Checking the status of HDFS:
hdfs dfsadmin -report
This provides an overview of the HDFS cluster's status, including the amount of data
stored and available space.
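Since the later experiments in this manual use R, the same HDFS commands can also be
invoked from an R script. A minimal sketch, assuming the hdfs binary is on the PATH of
the machine running R and that the paths shown above exist:
# Calling HDFS CLI commands from R with system2()
listing <- system2("hdfs", args = c("dfs", "-ls", "/user/hadoop/"), stdout = TRUE) # capture the listing as text
print(listing)
system2("hdfs", args = c("dfs", "-put", "localfile.txt", "/user/hadoop/")) # upload a local file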
Program:-
Result:-
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 03
Title: Implement the MapReduce method in Hadoop and write the Word count program
Objective: To leverage the distributed computing capabilities of Hadoop to efficiently
process large datasets.
Theory:
MapReduce is a core component of Hadoop that enables distributed data
processing. It allows you to perform operations like filtering, aggregation, and
transformation of large datasets.
Example Program: Word Count
This is a classic example where you count the number of occurrences of each word in a
dataset. Here’s how it works:
Mapper: Reads the input, splits it into words, and emits each word with a count of 1.
Reducer: Aggregates the counts for each word into a total.
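Before writing the Hadoop job in Java, the map and reduce steps can be prototyped
locally. The sketch below is plain R, not Hadoop code: it simulates the mapper, shuffle,
and reducer on an in-memory character vector so the word-count logic is visible; the
input lines are made up for illustration.
# Local simulation of the MapReduce word count (plain R, not a Hadoop job)
input_lines <- c("deer bear river", "car car river", "deer car bear") # stand-in for HDFS input splits

# Map step: for each line, emit (word, 1) pairs
mapped <- lapply(input_lines, function(line) {
  words <- strsplit(line, " ")[[1]]
  data.frame(word = words, count = 1)
})

# Shuffle step: bring all emitted pairs together
shuffled <- do.call(rbind, mapped)

# Reduce step: sum the counts for each word
word_counts <- aggregate(count ~ word, data = shuffled, FUN = sum)
print(word_counts)
#    word count
# 1  bear     2
# 2   car     3
# 3  deer     2
# 4 river     3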
Code (MapReduce in Java):
Program:-
Result:-
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 04
Title: Develop the Program for Apriori Algorithm
Objective: To implement an efficient method for identifying frequent itemsets and
generating association rules from a transactional dataset.
Theory:
The Apriori algorithm is a classic data mining technique used to find frequent itemsets in
a transaction dataset and derive association rules. It was introduced by Rakesh Agrawal
and Ramakrishnan Srikant in 1994 and is mainly applied in market basket analysis.
Key Concepts:
Frequent Itemsets: A set of items that appear together in a transaction dataset with
frequency above a specified threshold.
Association Rules: These rules express relationships between items, showing how the
presence of one item in a transaction affects the presence of another item. For example,
"If a customer buys bread, they are likely to also buy butter."
Example:
Let's say you have a transaction dataset:
Transaction ID | Items Bought
1 | Milk, Bread
2 | Milk, Butter
3 | Milk, Bread, Butter
4 | Bread, Butter
Step-by-step:
Step 1: Find frequent 1-itemsets (e.g., Milk, Bread, Butter) by counting the frequency of
individual items.
Step 2: Generate frequent 2-itemsets (e.g., Milk & Bread, Milk & Butter, Bread & Butter).
Step 3: Generate rules like "If Milk is bought, then Bread is also bought."
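The first two steps can be reproduced directly in base R. The following is a minimal
sketch of Apriori's support counting, not a full implementation: it counts 1-itemsets and
2-itemsets for the four-transaction example above, with a minimum support count of 2. In
practice, the apriori() function from the arules package performs these steps, including
rule generation, on arbitrary datasets.
# Support counting for the four-transaction example (base R sketch)
transactions <- list(
  c("Milk", "Bread"),
  c("Milk", "Butter"),
  c("Milk", "Bread", "Butter"),
  c("Bread", "Butter")
)
min_support <- 2 # minimum number of transactions an itemset must appear in

# Step 1: frequent 1-itemsets
item_counts <- table(unlist(transactions))
frequent_1 <- item_counts[item_counts >= min_support]
print(frequent_1) # Bread, Butter and Milk each appear in 3 transactions

# Step 2: candidate 2-itemsets from the frequent items, then count their support
items <- names(frequent_1)
candidates <- combn(items, 2, simplify = FALSE)
support_2 <- sapply(candidates, function(pair) {
  sum(sapply(transactions, function(t) all(pair %in% t)))
})
names(support_2) <- sapply(candidates, paste, collapse = " & ")
print(support_2[support_2 >= min_support]) # each pair appears in 2 transactions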
Program: -
Result: -
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 05
Title: Installation of R Studio and write a simple program for it
Objective: To set up an environment for statistical computing and data analysis.
Theory:
R is a programming language and free software environment developed by Ross Ihaka and
Robert Gentleman in 1993.
R possesses an extensive catalog of statistical and graphical methods. It includes
machine learning algorithms, linear regression, time series, and statistical inference, to
name a few. Most R libraries are written in R, but for heavy computational tasks, C, C++
and Fortran code is preferred.
R is trusted not only by academia; many large companies also use the R programming
language, including Uber, Google, Airbnb, Facebook and so on.
Data analysis with R is done in a series of steps: programming, transforming, discovering,
modeling and communicating the results.
Program: R is a clear and accessible programming tool
Transform: R is made up of a collection of libraries designed specifically for data science
Discover: Investigate the data, refine your hypothesis and analyze them
Model: R provides a wide array of tools to capture the right model for your data
Communicate: Integrate code, graphs, and outputs into a report with R Markdown or build
Shiny apps to share with the world
What is R used for?
Statistical inference
Data analysis
Machine learning algorithm
Procedure:
Installation of RStudio on Windows:
Step 1: With R-base installed, let's move on to installing RStudio. To begin, go to the
RStudio download page and click on the download button for RStudio Desktop.
Step 2: Click on the link for the Windows version of RStudio and save the .exe file.
Step 3: Run the .exe and follow the installation instructions:
1. Click Next on the welcome window.
2. Enter/browse the path to the installation folder and click Next to proceed.
3. Select the folder for the start menu shortcut, or click "do not create shortcuts", and
then click Next.
4. Wait for the installation process to complete.
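With RStudio installed, a first script can be run in the console to verify the setup. A
minimal example (the values are arbitrary):
# A first R program to verify the installation
print("Hello, R!")
x <- c(10, 20, 30, 40)           # a small numeric vector (sample values)
cat("Mean of x:", mean(x), "\n") # prints: Mean of x: 25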
Program:
Result: -
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 06
Title: To construct a Data Frame and develop an R program for data frame
Objective: To understand how to construct and manipulate a structured, tabular
collection of data in R.
Theory:
In R, a data frame is a two-dimensional, tabular data structure that can hold different
types of data (like numeric, character, factor, etc.) in columns. Each column can have
different data types, similar to a spreadsheet or SQL table. Data frames are a key
component in R for data manipulation and analysis.
Creating a Data Frame
You can create a data frame using the data.frame() function.
Example:
# creating a simple data frame
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Height = c(5.5, 6.0, 5.8)
)
print(df)
# Output:
# Name Age Height
# 1 Alice 25 5.5
# 2 Bob 30 6.0
# 3 Charlie 35 5.8
Accessing Rows
You can access rows using square brackets with row and column indices.
# Access the first row
first_row <- df[1, ]
print(first_row)
# Output:
#    Name Age Height
# 1 Alice  25    5.5
# Access specific rows and columns (e.g., first row, second column)
age_of_first_person <- df[1, 2]
print(age_of_first_person) # Output: [1] 25
# Select specific rows and named columns (e.g., rows 1 and 3, Name and Height)
subset_df <- df[c(1, 3), c("Name", "Height")]
print(subset_df)
# Output:
#      Name Height
# 1   Alice    5.5
# 3 Charlie    5.8
Deleting a Row:
# Remove the second row
df <- df[-2, ] # or df <- df[-which(df$Name == "Bob"), ]
print(df)
# Output:
#      Name Age Height
# 1   Alice  25    5.5
# 3 Charlie  35    5.8
Summarizing the Data:
# Summary statistics for the remaining rows
summary(df)
# Output (values rounded):
#      Name                Age           Height
#  Length:2           Min.   :25.0   Min.   :5.50
#  Class :character   1st Qu.:27.5   1st Qu.:5.58
#  Mode  :character   Median :30.0   Median :5.65
#                     Mean   :30.0   Mean   :5.65
#                     3rd Qu.:32.5   3rd Qu.:5.72
#                     Max.   :35.0   Max.   :5.80
Exporting Data
# Exporting data frame to a CSV file
write.csv(df, "output.csv", row.names = FALSE)
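The exported file can be read back in to confirm the round trip:
# Importing the CSV back into R
df_check <- read.csv("output.csv")
print(df_check)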
Program:-
Result: -
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 07
Title: Construct a program for Manipulating & Processing Data in R.
Objective: To provide an efficient framework for performing essential data operations,
such as filtering, sorting, transforming, aggregating, and summarizing datasets.
Theory:
Data manipulation and processing are fundamental tasks in data analysis, and R is a
powerful tool for handling these operations. By constructing a program for
manipulating and processing data in R, you can automate common data preparation
tasks such as cleaning, transforming, summarizing, and analyzing datasets. This
allows for efficient and reproducible workflows that support data-driven
decision-making.
Steps Involved in Constructing a Data Manipulation Program in R
Import Data: Load the dataset into R using appropriate functions based on the file
type (e.g., read.csv() for CSV files, read_excel() for Excel files).
Inspect the Data:
Check the structure, dimensions, and summary statistics of the data using functions
like head(), str(), summary(), and glimpse().
This helps in identifying the types of variables and spotting potential issues like
missing values or incorrect formats.
Clean Data:
Handle missing values (NAs) by removing them or replacing them with suitable
values (mean, median, or other imputed values).
Remove duplicates using distinct() and correct data types if needed (e.g., convert a
character column to a factor or numeric).
Transform Data:
Create new columns using the mutate() function. This can involve mathematical
operations or string manipulation.
For example, creating a new column to categorize a numerical variable into
categories (e.g., converting scores into letter grades).
Filter and Sort Data:
Use filter() to extract specific rows based on conditions.
Use arrange() to sort the data in a desired order, such as by score in descending
order.
Group and Aggregate Data:
Use group_by() to group the data by categorical variables (e.g., department or
region).
Use summarise() to compute aggregate values (mean, median, sum, etc.) for each
group.
Summarize the Data:
Summarize the dataset using statistical measures, and get insights into its
distribution or central tendency.
Export Processed Data:
Save the processed data to a new file using write.csv() or other appropriate
functions.
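A minimal sketch tying these steps together, assuming the dplyr package is installed; the
employees data frame, its column names, and the output file name are hypothetical:
# Consolidated data manipulation sketch using dplyr (hypothetical sample data)
library(dplyr)

# 1. Import/create data (a real dataset could be loaded with read.csv("input.csv"))
employees <- data.frame(
  name  = c("Asha", "Ravi", "Meena", "Kiran", NA),
  dept  = c("Sales", "Sales", "HR", "HR", "Sales"),
  score = c(72, 85, NA, 91, 64)
)

# 2. Inspect the data
str(employees)
summary(employees)

# 3. Clean: drop rows with missing values
clean <- employees %>% filter(!is.na(name), !is.na(score))

# 4. Transform: add a letter-grade column derived from the score
clean <- clean %>% mutate(grade = ifelse(score >= 80, "A", "B"))

# 5. Filter and sort: keep scores of 70 or more, highest first
top <- clean %>% filter(score >= 70) %>% arrange(desc(score))

# 6. Group and aggregate: mean score per department
dept_summary <- clean %>%
  group_by(dept) %>%
  summarise(mean_score = mean(score), count = n())
print(dept_summary) # e.g., HR: 91.0 (count 1), Sales: 78.5 (count 2)

# 7. Export the processed data
write.csv(clean, "processed.csv", row.names = FALSE)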
Program:-
Result:
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty
EXPERIMENT NO: 08
Title: To Generate Graphs Using Plot(), Hist(), Linechart(), Pie(), Boxplot(), and
Scatterplots() Develop an R program
Objective: To visualize data distributions and the relationships between variables using
R's basic plotting functions.
Theory:
Graphs using graph functions: plot(), hist(), lines(), pie(), boxplot(), and scatter plots
in R programming.
R provides a variety of functions for creating different types of graphs and
visualizations. Below are examples of the most commonly used plotting functions,
including plot(), hist(), lines(), pie(), and boxplot(). Note that base R has no
linechart() or scatterplots() function: line charts are drawn with plot(type = "l") or by
adding lines() to a plot, and scatter plots are the default output of plot().
# Basic line chart
x <- 1:5              # sample x values
y <- c(2, 4, 3, 5, 6) # sample y values
plot(x, y, type = "n", main = "Line Chart", xlab = "X-axis", ylab = "Y-axis") # type = "n" sets up the axes without plotting points
lines(x, y, col = "red", lwd = 2) # add the connecting line
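Minimal sketches of the remaining chart types, using made-up sample data:
# Histogram of a numeric vector
scores <- c(55, 62, 68, 70, 74, 78, 81, 85, 90, 95) # sample data
hist(scores, main = "Histogram", xlab = "Score", col = "lightblue")

# Pie chart of category shares
shares <- c(40, 30, 20, 10)
pie(shares, labels = c("A", "B", "C", "D"), main = "Pie Chart")

# Box plot comparing two groups
group1 <- c(5, 7, 8, 6, 9)
group2 <- c(4, 6, 5, 7, 6)
boxplot(group1, group2, names = c("Group 1", "Group 2"), main = "Box Plot")

# Scatter plot of two related variables
plot(group1, group2, main = "Scatter Plot", xlab = "Group 1", ylab = "Group 2", pch = 19)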
Program: -
Result: -
Conclusion:
Assessment Scheme:
Process Related Skills (15-M) | Product Related Skills (10-M) | Total (25-M) | Signature of Faculty