DSL Lab
SUBJECT: DS USING PYTHON LAB (ITL605)
Institute’s Vision
Institute’s Mission
Department’s Mission
To imbibe problem solving and analytical skills through teaching learning process.
To impart technical and managerial skills to meet the industry requirement.
To encourage ethical and value-based education.
Excelssior’s Education Society
K. C. COLLEGE OF ENGINEERING AND
MANAGEMENT STUDIES AND RESEARCH
THANE (EAST).
Certificate
This is to certify that Mr/Ms Om Ishwar Badhe of Semester: VI, Branch: IT, Roll No: 01 has performed and successfully completed all the practicals in the subject DSL LAB for the academic year 2023 to 2024 as prescribed by the University of Mumbai.
DATE: -
COLLEGE SEAL
Course details
Lab Objectives:
The lab experiments aim:
1. To know the fundamental concepts of data science and analytics
Lab Outcomes (Cognitive levels of attainment as per Bloom’s Taxonomy):
DETAILED SYLLABUS:
Program Outcomes
Engineering Graduates will be able to:
Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences,
and engineering sciences.
Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
Environment and sustainability: Understand the impact of the professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Department of Information Technology
Semester :VI
Class : TE
ITL 605.1 Understand the concept of Data science process and associated
terminologies to solve real-world problems
ITL 605.2 Analyze the data using different statistical techniques and visualize
the outcome using different types of plots.
ITL 605.3 Analyze and apply the supervised machine learning techniques like
Classification, Regression or Support Vector Machine on data for
building the models of data and solve the problems.
ITL 605.4 Apply the different unsupervised machine learning algorithms like
Clustering, Decision Trees, Random Forests or Association to solve
the problems.
ITL 605.6 Design and develop a data science application that can have data
acquisition, processing, visualization and statistical analysis
methods with supported machine learning technique to solve the
real-world problem
Rubrics for Practical
Marks
6 Classification modeling
a. Choose a classifier for classification problems.
b. Evaluate the performance of classifier
7 Clustering
a. Clustering algorithms for unsupervised
classification.
b. Plot the cluster data.
12 Assignment 1
13 Assignment 2
Total Grade / Marks :-
(A) Obtained / Out of    (B) Obtained / Out of    Total (A+B)
Lab Outcome: - Understand the concept of Data science process and associated
terminologies to solve real-world problems
Practical Incharge
EXPERIMENT NO. - 01
THEORY:
Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they may fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of numeric attributes by interval levels or conceptual
levels.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To get rid of this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables us to store a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of data by an encoding mechanism. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are: wavelet transforms and PCA (Principal Component Analysis).
Feature Scaling:
Feature Scaling is a technique to standardize the independent features present in the data in a
fixed range. It is performed during the data pre-processing to handle highly varying
magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the units of the values.
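Below is a minimal sketch of the two common scaling approaches using NumPy and Pandas on a small hypothetical DataFrame; the column names and values are illustrative, not the lab's actual dataset:

import numpy as np
import pandas as pd

# Hypothetical sample data with features on very different scales
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan],
    "salary": [30000, 48000, 72000, 85000, 60000],
})

# Handle a missing value before scaling (mean imputation)
df["age"] = df["age"].fillna(df["age"].mean())

# Min-max normalization: scale each feature into the range 0.0 to 1.0
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): zero mean and unit variance
standardized = (df - df.mean()) / df.std()

print(normalized)
print(standardized)

Scikit-learn's MinMaxScaler and StandardScaler provide the same transformations for full feature matrices.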
OUTPUT :
Data preparation using numpy and pandas
CONCLUSION:
Data preparation using NumPy and Pandas has been successfully studied on a small sample data set in the Google Colab environment.
EXPERIMENT NO. - 02
Aim of the Experiment :- Data Visualization / Exploratory Data Analysis for the selected
data set using Matplotlib and Seaborn
Create a bar graph and a contingency table using any 2 variables.
Create a normalized histogram.
Describe what these graphs and tables indicate.
Practical Incharge
EXPERIMENT NO. - 02
AIM : Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and
Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create a normalized histogram.
c. Describe what these graphs and tables indicate?
THEORY:
A bar graph is the graphical representation of categorical data using rectangular bars where the length
of each bar is proportional to the value they represent. A histogram is the graphical representation of
data where data is grouped into continuous number ranges and each range corresponds to a vertical bar.
Contingency Table is one of the techniques for exploring two or even more variables. It is basically
a tally of counts between two or more categorical variables.
A barplot is basically used to aggregate the categorical data according to some methods and by default
it’s the mean. It can also be understood as a visualization of the group by action. To use this plot we
choose a categorical column for the x-axis and a numerical column for the y-axis, and we see that it
creates a plot taking a mean per categorical column.
Parameters :
● x, y, hue (names of variables in “data“ or vector data, optional): Inputs for plotting long-form data. See examples for interpretation.
● data (DataFrame, array, or list of arrays, optional): Dataset for plotting. If “x“ and “y“ are absent, this is interpreted as wide-form. Otherwise it is expected to be long-form.
● order, hue_order (lists of strings, optional): Order to plot the categorical levels in; otherwise the levels are inferred from the data objects.
● estimator (callable that maps vector -> scalar, optional): Statistical function to estimate within each categorical bin.
● color (matplotlib color, optional): Color for all of the elements, or seed for a gradient palette.
● palette (palette name, list, or dict, optional): Colors to use for the different levels of the “hue“ variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.
● errcolor (matplotlib color): Color for the lines that represent the confidence interval.
● ax (matplotlib Axes, optional): Axes object to draw the plot onto; otherwise uses the current Axes.
● kwargs (key, value mappings): Other keyword arguments are passed through to plt.bar at draw time.
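A minimal sketch of tasks (a) and (b) is shown below; it uses seaborn's built-in tips dataset as a stand-in for the selected data set, so the column names day, sex, and total_bill are only illustrative:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Built-in seaborn dataset keeps the sketch self-contained; substitute the lab's own data here
tips = sns.load_dataset("tips")

# a. Bar graph: mean total_bill per day (barplot aggregates by the mean by default)
sns.barplot(x="day", y="total_bill", data=tips)
plt.title("Mean total bill per day")
plt.show()

# a. Contingency table between two categorical variables
contingency = pd.crosstab(tips["day"], tips["sex"])
print(contingency)

# b. Normalized histogram: bar heights form a probability density
plt.hist(tips["total_bill"], bins=20, density=True)
plt.xlabel("total_bill")
plt.ylabel("density")
plt.title("Normalized histogram of total_bill")
plt.show()

pd.crosstab builds the contingency table directly from the two categorical columns, and density=True rescales the histogram so that the bar areas sum to 1.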
OUTPUT:
Conclusion :
We have successfully implemented Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn.
EXPERIMENT NO. - 03
Practical Incharge
Experiment No. 3
AIM : Data Modeling : Validating partition by performing a two‐sample Z‐test.
Data modeling is the process of creating a simplified diagram of a software system and the data
elements it contains, using text and symbols to represent the data and how it flows. Data models
provide a blueprint for designing a new database or reengineering a legacy application.
Z-test
Z-test is a statistical method to determine whether the distribution of the test statistics can be
approximated by a normal distribution. It is the method to determine whether two sample means
are approximately the same or different when their variance is known and the sample size is
large (should be >= 30).
● The sample size should be greater than 30. Otherwise, we should use the t-test.
● Samples should be drawn at random from the population.
● The standard deviation of the population should be known.
● Samples that are drawn from the population should be independent of each other.
● The data should be normally distributed; however, for a large sample size, it is assumed to be approximately normally distributed.
Hypothesis Testing
● Null Hypothesis: The null hypothesis is a statement that the value of a population
parameter (such as proportion, mean, or standard deviation) is equal to some claimed
value. We either reject or fail to reject the null hypothesis. Null Hypothesis is denoted by
H0.
● Alternate Hypothesis: The alternative hypothesis is the statement that the parameter
has a value that is different from the claimed value. It is denoted by HA.
Two-sampled z-test:
In this test, we are given two normally distributed and independent populations, and we draw samples at random from both populations. Let µ1 and µ2 be the population means and X1 and X2 be the observed sample means. Here, our null hypothesis is:
H0 : µ1 - µ2 = 0
and the alternative hypothesis is
HA : µ1 - µ2 ≠ 0
The test statistic is
z = (X1 - X2) / sqrt(σ1²/n1 + σ2²/n2)
where σ1 and σ2 are the standard deviations and n1 and n2 are the sample sizes of the populations corresponding to µ1 and µ2.
● Type I error: A Type I error occurs when we reject the null hypothesis even though it is true. This error is denoted by alpha.
● Type II error: A Type II error occurs when we fail to reject the null hypothesis even though it is false. This error is denoted by beta.
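A minimal sketch of the two-sample z-test using NumPy and SciPy is shown below; the two synthetic samples are only stand-ins for the two partitions being validated:

import numpy as np
from scipy.stats import norm

# Hypothetical partitions (e.g. two splits of one column of the selected dataset)
rng = np.random.default_rng(42)
sample1 = rng.normal(loc=50, scale=10, size=100)
sample2 = rng.normal(loc=50, scale=10, size=100)

# Two-sample z statistic: (X1 - X2) / sqrt(s1^2/n1 + s2^2/n2)
n1, n2 = len(sample1), len(sample2)
z = (sample1.mean() - sample2.mean()) / np.sqrt(sample1.var(ddof=1) / n1 + sample2.var(ddof=1) / n2)
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed p-value

alpha = 0.05
print("z statistic:", z, "p-value:", p_value)
if p_value <= alpha:
    print("Reject H0: the partition means differ significantly")
else:
    print("Fail to reject H0: the partitions are statistically similar")

If the statsmodels package is available, the same test is also offered as a single call (statsmodels.stats.weightstats.ztest).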
OUTPUT:
CONCLUSION:
Data Modeling: Validating a partition by performing a two-sample Z-test has been successfully performed, along with the study of the output, in the Google Colab environment.
EXPERIMENT NO. - 04
Aim of the Experiment :- Implementation of Statistical Hypothesis Test using Scipy and Sci-kit learn.
Correlation Tests : Chi-Squared Test
Lab Outcome :- Analyze the data using different statistical techniques and visualize the
outcome using different types of plots.
Practical In charge
Experiment No. 4
AIM : Implementation of Statistical Hypothesis Test using Scipy and Sci-kit learn.
THEORY:
The Pearson’s Chi-Square statistical hypothesis test is a test for independence between categorical variables. In this experiment, we will perform the test using a mathematical approach and then using Python’s SciPy module.
A Contingency table (also called crosstab) is used in statistics to summarise the relationship
between several categorical variables. Here, we take a table that shows the number of men and
women buying different types of pets.
The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related
to each other.
Null hypothesis:
We start by defining the null hypothesis (H0) which states that there is no relation between the
variables. An alternate hypothesis would state that there is a significant relation between the two.
● Using p-value:
We define a significance factor to determine whether the relation between the variables is of
considerable significance. Generally a significance factor or alpha value of 0.05 is chosen. This
alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha value
is chosen in cases where we expect more precision. If the p-value for the test comes out to be
strictly greater than the alpha value, then H0 holds true.
If our calculated value of chi-square is less or equal to the tabular(also called critical) value of chi-
square, then H0 holds true.
Expected Values Table: Next, we prepare a similar table of calculated (or expected) values. To do this, we calculate each item in the new table as:
(row total * column total) / grand total
Chi-Square Table :
From this table, we obtain the total of the last column, which gives us the calculated value of chi-
square. Hence the calculated value of chi-square is 4.542228269825232
Now, we need to find the critical value of chi-square. We can obtain this from a table. To use this
table, we need to know the degrees of freedom for the dataset. The degrees of freedom is defined
as : (no. of rows – 1) * (no. of columns – 1).
Hence, the degrees of freedom is (2-1) * (3-1) = 2
Now, look at the table and find the value corresponding to 2 degrees of freedom and 0.05
significance factor :
The tabular value of chi-square here is 5.991
Hence, the critical value of X^2 (5.991) >= the calculated value of X^2 (4.542).
Therefore, H0 is accepted; that is, the variables do not have a significant relation.
Installation:
SciPy can be installed with pip (pip install scipy).
The chi2_contingency() function of scipy.stats module takes as input, the contingency table in
2d array format. It returns a tuple containing test statistics, the p-value, degrees of freedom and
expected table(the one we created from the calculated values) in that order.
Hence, we need to compare the obtained p-value with alpha value of 0.05.
from scipy.stats import chi2_contingency

# table is the 2 x 3 contingency table (gender vs. pet choice) described above
stat, p, dof, expected = chi2_contingency(table)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')
OUTPUT:
Chi-square Test for feature selection
Feature selection, also known as attribute selection, is the process of extracting the most relevant features from the dataset and then applying machine learning algorithms for better performance of the model. A large number of irrelevant features increases the training time exponentially and increases the risk of overfitting.
Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each
feature and the target and select the desired number of features with best Chi-square scores. It
determines if the association between two categorical variables of the sample would reflect their
real association in the population.
The chi-square statistic between a feature and the target is computed as
X^2 = Σ (Observed frequency - Expected frequency)^2 / Expected frequency
where:
Observed frequency = number of observations of a class
Expected frequency = number of expected observations of a class if there were no relationship between the feature and the target.
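A minimal sketch of chi-square feature selection with scikit-learn is shown below; the built-in iris dataset and the choice of k=2 are only illustrative stand-ins for the lab's own data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Built-in dataset keeps the sketch self-contained; substitute the lab's features here
X, y = load_iris(return_X_y=True)

# Select the 2 features with the best chi-square scores against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature matrix shape:", X_selected.shape)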
OUTPUT:
CONCLUSION:
Implementation of Statistical Hypothesis Test using SciPy and scikit-learn, Correlation Tests: Chi-Squared Test has been successfully studied, along with the desired conclusions for the obtained results, in the Google Colab environment.
EXPERIMENT NO. - 05
Aim of the Experiment :- Apply regression Model techniques to predict the data on House prices
dataset . And Prediction of Loan Using Multivariable Regression in Python.
Practical In charge
Experiment No. 5
AIM:- Apply regression Model techniques to predict the data on House prices dataset .
And Prediction of Loan Using Multivariable Regression in Python
When implementing linear regression of some dependent variable 𝑦 on the set of independent
variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship
between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀ ,
𝛽₁,
…, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.
Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.
The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.
To get the best weights, you usually minimize the sum of squared residuals (SSR) for all
observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary least
squares.
If there are just two independent variables, the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space. The goal of regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual responses and yields the minimal SSR.
The case of more than two independent variables is similar, but more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined when the number of inputs is 𝑟.
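A minimal sketch of multivariable linear regression with scikit-learn is shown below; since the actual house prices / loan CSV is not reproduced here, it generates a small synthetic house-price style dataset, and all column names are illustrative:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic house-price style data; the lab's real dataset (read with pd.read_csv) replaces this
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "area": rng.uniform(500, 3500, 200),
    "bedrooms": rng.integers(1, 6, 200),
    "age": rng.uniform(0, 40, 200),
})
df["price"] = 50 * df["area"] + 20000 * df["bedrooms"] - 800 * df["age"] + rng.normal(0, 10000, 200)

X = df[["area", "bedrooms", "age"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Multivariable (multiple) linear regression
model = LinearRegression()
model.fit(X_train, y_train)

print("Intercept b0:", model.intercept_)
print("Coefficients b1..br:", model.coef_)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))

Here model.coef_ holds the estimated weights 𝑏₁, …, 𝑏ᵣ and model.intercept_ holds 𝑏₀, matching the estimated regression function described above.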
Output:-
CONCLUSION:
Regression model techniques to predict the data on the house prices dataset, and prediction of loan using multivariable regression in Python, have been successfully implemented using Google Colab.
EXPERIMENT NO. - 06
Aim of the Experiment: - Classification modelling: Choose a classifier for a classification problem and evaluate the performance of the classifier.
Lab Outcome: - Analyze and apply the supervised machine learning techniques like
Classification, Regression or Support Vector Machine on data for building the
models of data and solve the problems.
Practical Incharge
Experiment No. 6
THEORY: Ensemble learning is a machine learning paradigm where multiple models (often called
“weak learners”) are trained to solve the same problem and combined to get better results. The
main hypothesis is that when weak models are correctly combined we can obtain more accurate
and/or robust models.
Bagging is a homogeneous weak learners’ model that learns from each other independently in
parallel and combines them for determining the model average. Bagging is an acronym for
‘Bootstrap Aggregation’ and is used to decrease the variance in the prediction model. Bagging is a parallel method that fits the different considered learners independently of each other, making it possible to train them simultaneously.
Bagging generates additional data for training from the dataset. This is achieved by random
sampling with replacement from the original dataset. Sampling with replacement may repeat some
observations in each new training data set. Every element in Bagging is equally probable for
appearing in a new dataset.
These multi datasets are used to train multiple models in parallel. The average of all the predictions
from different ensemble models is calculated. The majority vote gained from the voting mechanism
is considered when classification is made. Bagging decreases the variance and tunes the prediction
to an expected outcome.
Example of Bagging: The Random Forest model uses Bagging, where decision tree models with
higher variance are present. It makes random feature selection to grow trees. Several random trees
make a Random Forest.
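A minimal sketch of choosing and evaluating a bagging-based classifier with scikit-learn is shown below; the built-in iris dataset stands in for the lab's chosen dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Built-in dataset keeps the sketch self-contained; substitute the lab's data here
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# a. Choose a classifier: a bagging-based Random Forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# b. Evaluate the performance of the classifier
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

sklearn.ensemble.BaggingClassifier can be used in the same way when a base learner other than a decision tree is preferred.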
OUTPUT:
CONCLUSION:
Classification modeling - a) Choose a classifier for a classification problem, and b) Evaluate the performance of the classifier - has been successfully studied in the Google Colab environment.
EXPERIMENT NO. - 07
Lab Outcome :- ITL605.4 Apply the different unsupervised machine learning algorithms
like Clustering, Decision Trees, Random Forests or Association to solve the problems.
Practical Incharge
Experiment No. 7
AIM : Clustering
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.
THEORY: K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this experiment, we will learn what the K-means clustering algorithm is, how the algorithm works, and the Python implementation of k-means clustering.
K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training. It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for the K centre points or centroids by an iterative process.
o Assigns each data point to its closest k-centre. The data points which are near to a particular k-centre create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.
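A minimal sketch of k-means clustering and plotting with scikit-learn is shown below; it uses synthetic 2-D blobs in place of the lab's unlabeled dataset, and K=3 is only an illustrative choice:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data keeps the sketch self-contained; substitute the lab's data here
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# a. Unsupervised classification with K-Means (K = 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# b. Plot the cluster data and the learned centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.legend()
plt.title("K-Means clusters (K = 3)")
plt.show()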
The below diagram explains the working of the K-means Clustering Algorithm:
OUTPUT:
CONCLUSION:
Clustering (a. clustering algorithms for unsupervised classification; b. plotting the cluster data) has been successfully studied and implemented in the Google Colab environment.
EXPERIMENT NO. - 08
Aim of the experiment :- Using any machine learning techniques using available data
set to develop a recommendation system.
Lab Outcome :- ITL605.3 Analyze and apply the supervised machine learning
techniques like Classification, Regression or Support Vector Machine on data for
building the models of data and solve the problems.
Practical Incharge
EXPERIMENT NO.: 8
AIM: Using any machine learning techniques using available data set to develop a recommendation
system.
THEORY:
Practically, recommender systems encompass a class of techniques and algorithms which are
able to suggest “relevant” items to users. Ideally, the suggested items are as relevant to the user
as possible, so that the user can engage with those items: YouTube videos, news articles, online
products, and so on.
Items are ranked according to their relevancy, and the most relevant ones are shown to the user.
The relevancy is something that the recommender system must determine and is mainly based
on historical data. If you’ve recently watched YouTube videos about elephants, then YouTube is
going to start showing you a lot of elephant videos with similar titles and themes!
Recommender systems are generally divided into two main categories: collaborative filtering
and content-based systems.
Principal component analysis (PCA) is a statistical procedure that is used to reduce the
dimensionality. It uses an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables called principal
components. It is often used as a dimensionality reduction technique.
Step 1: Standardize the data so that each feature has zero mean and unit variance.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
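A minimal sketch of these steps using scikit-learn's PCA is shown below; the small user-item rating matrix is hypothetical and stands in for the lab's actual dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical user-item rating matrix (rows = users, columns = items); illustrative values only
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

# Step 1: standardize; Steps 2-3: PCA uses the covariance structure (via SVD)
# to find the principal components
scaled = StandardScaler().fit_transform(ratings)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Users in the reduced 2-D space:\n", reduced)

Users that lie close together in the reduced space can then be treated as similar when generating recommendations.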
OUTPUT:
CONCLUSION:
Using machine learning techniques on an available data set, a recommendation system has been developed in the Jupyter Notebook environment.
EXPERIMENT NO. - 09
Aim of the Experiment :- Exploratory data analysis using Apache Spark and Pandas
Practical Incharge
Experiment No-9
THEORY:
Exploratory Data Analysis (EDA) in Python is the first step in the data analysis process, developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
For Example, You are planning to go on a trip to the “X” location. Things you do before taking a
decision:
● You will explore the location on what all places, waterfalls, trekking, beaches, restaurants
that location has in Google, Instagram, Facebook, and other social Websites.
● Calculate whether it is in your budget or not.
● Check for the time to cover all the places.
● Type of Travel method.
Similarly, when you are trying to build a machine learning model you need to be pretty sure whether your data is making sense or not. The main aim of exploratory data analysis is to obtain confidence in your data to an extent where you are ready to engage a machine learning algorithm.
Exploratory Data Analysis is a crucial step before jumping to machine learning or modeling of the
data. By doing this you can get to know whether the selected features are good enough to model,
are all the features required, are there any correlations based on which we can either go back to the
Data Pre-processing step or move on to modeling.
Once Exploratory Data Analysis is complete, its features can be used for supervised and unsupervised machine learning modeling.
In every machine learning workflow, the last step is reporting or providing the insights to the stakeholders. By completing the Exploratory Data Analysis, many plots (heat maps, frequency distributions, graphs, a correlation matrix) can be drawn along with the hypotheses, from which any individual can understand what the data is all about and what insights can be gained from exploring the data set.
In the trip example, all the exploration of the selected place is done first; based on it we get the confidence to plan the trip and even share with our friends the insights we got regarding the place so that they can also join.
The main steps involved in exploratory data analysis are:
● Description of data
● Handling missing data
● Handling outliers
● Understanding relationships and new insights through plots
a) Description of data:
We need to know the different kinds of data and other statistics of our data before we can move on to the other steps. A good way to start is with the describe() function in Python. In Pandas, we can
apply describe() on a DataFrame which helps in generating descriptive statistics that summarize
the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
The result’s index will include count, mean, std, min, max as well as lower, 50 and upper
percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50 percentile
is the same as the median.
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names
# creating dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.describe()
The above output indicates that there are no null values in our data set.
b) Handling missing data:
Fill Missing Values:
This is the most common method of handling missing values. This is a process whereby missing
values are replaced with a test statistic like mean, median or mode of the particular feature the
missing value belongs to. Let’s suppose we have a missing value of age in the boston data set.
Then the code below will fill the missing value with the value 30.
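The fill step itself is not reproduced in this manual, so here is a minimal reconstruction; it assumes boston_df from the earlier snippet and uses its 'AGE' column, with 30 as the fill value mentioned above:

# Fill missing AGE values with 30 (boston_df comes from the earlier snippet)
boston_df['AGE'] = boston_df['AGE'].fillna(30)

# Alternatively, fill with a statistic of the feature itself, e.g. the mean:
# boston_df['AGE'] = boston_df['AGE'].fillna(boston_df['AGE'].mean())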
c) Handling outliers:
An outlier is something which is separate or different from the crowd. Outliers can be a result of a
mistake during data collection or it can be just an indication of variance in your data. Some of the
methods for detecting and handling outliers:
● BoxPlot
● Scatterplot
● Z-score
● IQR(Inter-Quartile Range)
BoxPlot:
A box plot is a method for graphically depicting groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The
whiskers extend from the edges of the box to show the range of the data. Outlier points are those
past the ends of the whiskers. Boxplots show robust measures of location and spread, as well as providing information about symmetry and outliers.
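A minimal sketch of outlier detection with a box plot is shown below; it assumes boston_df from the earlier snippet and uses its 'DIS' column purely as an illustrative numeric feature:

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of a single numeric column; points beyond the whiskers are potential outliers
sns.boxplot(x=boston_df['DIS'])
plt.title("Box plot of DIS")
plt.show()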
Output:-
CONCLUSION:
Exploratory data analysis using Apache Spark and Pandas has been successfully studied and implemented in the Jupyter Notebook environment.
EXPERIMENT NO. - 10
Aim of the Experiment :- Batch and Streamed Data Analysis using Spark.
Practical Incharge
EXPERIMENT NO. - 10
AIM :- Batch and Streamed Data Analysis using Spark.
THEORY:
Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms involving large data and a high amount of computation are often run on a distributed computing system. A distributed computing system involves nodes (networked computers) that run processes in parallel and communicate (if necessary).
MapReduce – The programming model that is used for Distributed computing is known as
MapReduce. The MapReduce model involves two stages, Map and Reduce.
1. Map – The mapper processes each line of the input data (it is in the form of a file), and
produces key – value pairs.
Input data → Mapper → list([key, value])
2. Reduce – The reducer processes the list of key – value pairs (after the Mapper’s
function). It outputs a new set of key – value pairs.
list([key, value]) → Reducer → list([key, list(values)])
Spark – Spark (the open-source Big Data processing engine by Apache) is a cluster computing system. It is faster compared to other cluster computing systems (such as Hadoop). It provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this makes the learning curve flatter. We will see how to create RDDs (the fundamental data structure of Spark).
RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. This will become clearer further on.
RDD transformations – Once a SparkContext object is created, we can create RDDs and apply transformations on them. One major advantage of using Spark is that it does not immediately load the dataset into memory; an RDD created with, for example, lines = sc.textFile('file_name.txt') is just a lazy pointer to the 'file_name.txt' file.
Steps:
1. Our text file is in the following format – (each line represents an edge of a
directed graph)
1 2
1 3
2 3
3 4
. .
. .
. .
2. Large Datasets may contain millions of nodes, and edges.
3. First few lines set up the SparkContext. We create an RDD lines from it.
4. Then, we transform the lines RDD to an edges RDD. The function conv acts on each line, and key-value pairs of the form (1, 2), (1, 3), (2, 3), (3, 4), … are stored in the edges RDD.
5. After this, reduceByKey aggregates all the pairs corresponding to a particular key, and the numNeighbours function is used for generating each vertex’s degree in a separate RDD Adj_list, which has the form (1, 2), (2, 1), (3, 1), …
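A minimal sketch of these steps in PySpark is shown below; 'file_name.txt' is a placeholder path, and the bodies of conv and numNeighbours are assumptions reconstructed from the description above rather than the manual's original code:

from pyspark import SparkConf, SparkContext

# Step 3: set up the SparkContext and the lines RDD (a lazy pointer to the file)
conf = SparkConf().setAppName("vertex-degree").setMaster("local[*]")
sc = SparkContext(conf=conf)
lines = sc.textFile("file_name.txt")  # placeholder path for the edge-list file

# Step 4: conv turns a line such as "1 2" into the key-value pair (1, 2)
def conv(line):
    src, dst = line.split()
    return (int(src), int(dst))

edges = lines.map(conv)

# Step 5: count one neighbour per edge and aggregate per vertex with reduceByKey
def numNeighbours(a, b):
    return a + b

Adj_list = edges.map(lambda edge: (edge[0], 1)).reduceByKey(numNeighbours)

print(Adj_list.collect())  # e.g. [(1, 2), (2, 1), (3, 1), ...]
sc.stop()

For streamed data, the same transformations can be applied on micro-batches using Spark Streaming (pyspark.streaming) or Structured Streaming.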
OUTPUT:
CONCLUSION:
Batch and Streamed Data Analysis using Spark has been successfully implemented and studied using the Google Colab environment.