0% found this document useful (0 votes)

283 views42 pages

DA Lab Manual

Uploaded by

jeyakarthika cs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

283 views42 pages

DA Lab Manual

Uploaded by

jeyakarthika cs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 42

Ex No 1: Write a program to perform numerical operations (MAX,

MIN, AVG, SUM, SQRT, ROUND) using R.

AIM:
To write a program to perform numerical operations (MAX, MIN,
AVG, SUM, SQRT, ROUND).
ALGORITHM:
❖ Use the read.csv() function to read the data from the CSV file
and stores it in a variable called csv_data.
❖ Print the contents (age column) of the csv_data variable,
displaying the data from the CSV file.
❖ Calculate the minimum and maximum values of the 'Age'
column in the csv_data data-frame using the min() and max()
functions.
❖ Calculate the average (mean) of the 'Age' column in the
csv_data data frame using the mean() function.
❖ Calculate the square root of each value in the 'Age' column
using the sqrt() function.
❖ Apply the round() function to the avg variable to round the
average age.
❖ Calculate the sum of all values in the 'Age' column using the
sum() function.

1
CODE:
# Import the data using read.csv()
csv_file_path <- "C:\\Users\\HARI RAGHAV\\Downloads\\MMM.csv"
csv_data <- read.csv(csv_file_path)
print(csv_data$Age)
# Compute the min&max value
print(min(csv_data$Age))
print(max(csv_data$Age))
# Compute the Average
avg = mean(csv_data$Age)
print(avg)
# Compute the Squareroot
print(sqrt(csv_data$Age))
# Apply round function for the Average Value
round(avg)
# Compute the Sum
print(sum(csv_data$Age))

OUTPUT:
20 22 100 100
Min & Max: 20 & 100
Avg: 60.5
Sqrt: 4.472136 ,4.690416,10.000000,10.000000,
Round: 60
Sum: 242
RESULT:
Thus, a program to perform numerical operations has been
successfully executed using R.

2
Ex No 2: Implement a program for statistical operations such as
Mean, Median, Mode and Standard deviation.
AIM:
To write a program for performing statistical operations such as mean
median mode.
ALGORITHM:
❖ Use the read.csv() function to read the data from the CSV file and
stores it in a variable called csv_data.

❖ Print the contents (age column) of the csv_data variable, displaying

the data from the CSV file.

❖ Calculate the mean (average) value of the 'Age' column in the

csv_data dataframe using the mean() function.

❖ Calculate the median value of the 'Age' column in the csv_data

dataframe using the median() function.

❖ The function uses the table() function to create a frequency table and
identifies the mode.

❖ Calculate the standard deviation of the 'Age' column in the csv_data

dataframe using the sd() function.

❖ Print the results.

3
CODE:
# Import the data using read.csv()
csv_file_path <- "C:\\Users\\HARI RAGHAV\\Downloads\\MMM.csv"
csv_data <- read.csv(csv_file_path)
print(csv_data$Age)
# Compute the mean value
mean = mean(csv_data$Age)
print(mean)
# Compute the median value
median = median(csv_data$Age)
print(median)
# Compute the mode value
mode = function()
{
return(sort(-table(csv_data$Age))[1])
}
mode()
# Compute the Standard Deviation
print(sd(csv_data$Age))
OUTPUT:
Age: 20 22 100 100
Mean: 60.5
Median: 61
Mode: 100, -2
Standard Deviation: 45.61798
RESULT:
Thus, a program to perform statistical operations has been successfully
executed using R.

4
Ex No 3: Write a Program to Read and Write operations on
different types of Files (csv, xls, txt etc).
AIM:
To write a program to read and write operations on different types of
Files (csv, xls, txt etc).
ALGORITHM:
❖ Reading CSV File: Use the `read.csv()` function to read the CSV
file specified by `csv_file_path`.
❖ Writing CSV File: Use the `write.csv()` function to write the
`csv_data` data frame to `csv_output_file`.
❖ Reading Excel File: Use the `read_excel()` function from the
`readxl` library to read the Excel file specified by `excel_file_path`.
❖ Writing Excel File: Use the `write_xlsx()` function to write the data
frame to `excel_output_file`.
❖ Reading TXT file: Use the `read.delim()` function to read the tab-
delimited text file specified by `txt_file_path`.
❖ Writing TXT file:Use the `write.table()` function to write `data_txt`
to `txt_output_file`, using `*` as the separator.
CODE:
#Reading CSV File
csv_file_path <- "C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1a.csv"
csv_data <- read.csv(csv_file_path)
print(csv_data)
5
#Writing CSV File
csv_data<-data.frame(Name =
c("sdfsafasdca","vva","bffbb"),Age=c(14,15,16),City=c("ddd1","aaa1","
bbb1"))
csv_output_file <- "C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1b.csv"
write.csv(csv_data, csv_output_file,row.names = FALSE)
print(csv_data)
#Reading Excel File
library(readxl)
Data1<-read_excel("C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1c.xlsx")
print(Data1)
#Writing Excel File
write_xlsx(data.frame(Name=c("ggg"),Age=c(23),City=("df")),"C:\\Use
rs\\HARI RAGHAV\\Downloads\\HelloEX1cii.xlsx")
Data2<-read_excel("C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1cii.xlsx")
print(Data2)
# Reading TXT file
myData = read.delim("C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1d.txt", header = FALSE)
print(myData)

6
#Writing TXT file
data_txt="Welcome to R programing!!"
data=write.table(data_txt,file = "C:\\Users\\HARI
RAGHAV\\Downloads\\HelloEX1d.txt", sep = "*")
print(data)
OUTPUT:
1 Hari 20 MDU
2 Gopi 22 MDU
# A tibble: 1 × 3
Name Age City
<chr> <dbl> <chr>
1 ggg 23 df
2 1*Welcome to R programing!!

RESULT:
Thus, a program to perform Read & write operations has been
successfully executed using R.

7
Ex No 4: Implement data pre-processing operations:
❖ Handling Missing data
❖ Min-Max normalization
AIM:
To implement data pre-processing operations:
❖ Handling Missing data
❖ Min-Max normalization
ALGORITHM:
❖ Handling Missing data:
➢ Import dplyr Library.
➢ Create a Data Frame.
➢ Identify Columns with Missing Values.
➢ Calculate Means and Medians of Columns with Missing
Values.
➢ Impute Missing Values with Medians.
➢ Print the Resulting Data Frame.
❖ Min-Max Normalization:
➢ Import the `caret` library.
➢ Create a Dataset (`data`).
➢ Define a Custom Min-Max Scaling Function (`minMax`).

8
➢ Normalize Data Using the Custom Function
(`normalisedMydata`).
➢ Print the First Few Rows of Normalized Data
(`head(normalisedMydata)`).
➢ Check Summary Statistics After Normalization
(`summary(normalisedMydata)`).
CODE:
Handling Missing data:
library(dplyr)
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai",
"Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(NA, 84, 93, 87),
Mathematics = c(91, 86, NA, NA) )
dataframe
listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2,
anyNA)]
meanMissing <- apply(dataframe[,colnames(dataframe) %in%
listMissingColumns], 2, mean, na.rm = TRUE)
medianMissing <- apply(dataframe[,colnames(dataframe) %in%
listMissingColumns], 2, median, na.rm = TRUE)
newDataFrameMedian <- dataframe %>% mutate(
Chemistry = ifelse(is.na(Chemistry), medianMissing[1], Chemistry),

9
Mathematics = ifelse(is.na(Mathematics),
medianMissing[2],Mathematics))
print(newDataFrameMedian)
Min-Max Normalization:
library(caret)
data = data.frame(var1 = c(120,345,145,122,596,285,211),
var2 = c(10,15,45,22,53,28,12),
var3 = c(-34,0.05,0.15,0.12,-6,0.85,0.11))
minMax <- function(x) {
(x - min(x)) / (max(x) - min(x)) }
#normalise data using custom function
normalisedMydata <- as.data.frame(lapply(data, minMax))
head(normalisedMydata)
summary(normalisedMydata)
OUTPUT:
Handling Missing data:
Name Physics Chemistry Mathematics
Bhuwanesh 98 NA 91
Anil 87 84 86
Jai 91 93 NA
Naveen 94 87 NA

10
Name Physics Chemistry Mathematics
Bhuwanesh 98 87 91.0
Anil 87 84 86.0
Jai 91 93 88.5
Naveen 94 87 88.5

MIN-MAX Normalization:
var1 var2 var3
1 0.000000000 0.0000000 0.0000000
2 0.472689076 0.1162791 0.9770445
3 0.052521008 0.8139535 0.9799139
4 0.004201681 0.2790698 0.9790531
5 1.000000000 1.0000000 0.8034433
6 0.346638655 0.4186047 1.0000000

var1 var2 var3

Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.02836 1st Qu.:0.0814 1st Qu.:0.8902
Median :0.19118 Median :0.2791 Median :0.9788
Mean :0.29532 Mean :0.3821 Mean :0.8169
3rd Qu.:0.40966 3rd Qu.:0.6163 3rd Qu.:0.9795
Max. :1.00000 Max. :1.0000 Max. :1.0000

RESULT:
Thus, a program to implement data pre-processing operations has been
successfully executed using R.

11
Ex No 5: Write a Program to implement Principal Component
Analysis for House dataset
AIM:
To implement Principal Component Analysis for House dataset.
ALGORITHM:
❖ Generate a sample dataset with features and house prices.
❖ Load necessary R libraries: dplyr, caret, and pcaMethods.
❖ Prepare and separate the data into features (X) and target variable
(y).
❖ Perform PCA on the features and choose to retain 3 principal
components.
❖ Train a linear regression model on the PCA-transformed data.
❖ Create a new dataset with similar features and different values.
❖ Apply the same PCA transformation to the new data.
❖ Use the model to predict house prices for the new dataset.
CODE:
# Load necessary libraries
library(dplyr)
# Set a random seed for reproducibility
set.seed(123)
# Create a sample dataset with 100 observations
n <- 100
# Generate random values for features
bedrooms <- sample(1:5, n, replace = TRUE)

12
bathrooms <- sample(1:3, n, replace = TRUE)
square_feet <- rnorm(n, mean = 1500, sd = 500)
garage <- sample(0:1, n, replace = TRUE)
year_built <- sample(1950:2020, n, replace = TRUE)
# Generate house prices based on the features (with some random noise)
house_prices <- 50000 + 25000 * bedrooms + 30000 * bathrooms +
200 * square_feet + 15000 * garage - 10 * (2023 - year_built) +
rnorm(n, mean = 0, sd = 50000)
# Create the sample dataset
sample_data <- data.frame(
Bedrooms = bedrooms,
Bathrooms = bathrooms,
SquareFeet = square_feet,
Garage = garage,
YearBuilt = year_built,
HousePrice = house_prices
)
# Display the first few rows of the dataset
head(sample_data)
# Load necessary libraries
install.packages("caret")

13
library(caret)
install.packages("pcaMethods")
library(pcaMethods)
# Prepare your data
X <- sample_data[, -6] # Features (exclude the target variable)
y <- sample_data$HousePrice # Target variable
# Perform PCA
pca_result <- prcomp(X, center = TRUE, scale = TRUE)
# Examine explained variance
summary(pca_result)
cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)
# Select the number of principal components (e.g., 3 components)
n_components <- 3
# Ensure that 'X_pca' is available before converting to a data frame
X_pca <- pca_result$x
# convert 'X_pca' to a data frame
X_pca_df <- as.data.frame(X_pca)
# Train a linear regression model
model <- lm(y ~ ., data = X_pca_df)
# Display the summary of the regression model
summary(model)
14
# Create a new dataset with the same feature columns
new_data <- data.frame(
Bedrooms = c(3, 2, 4, 1), # Example values for Bedrooms
Bathrooms = c(2, 1, 3, 1), # Example values for Bathrooms
SquareFeet = c(1800, 1600, 2100, 1400), # Example values for
SquareFeet
Garage = c(1, 0, 1, 0), # Example values for Garage
YearBuilt = c(1990, 2005, 1980, 2010) # Example values for
YearBuilt
)
# Load necessary libraries (if not already loaded)
library(caret)
# Assuming you've already created and loaded the new_data dataset
# Prepare your new data
X_new <- new_data[, -6] # Features (exclude the target variable)
print(X_new)
# Use the same PCA transformation as the original dataset
X_new_pca <- predict(pca_result, newdata = X_new)
# Convert X_new_pca to a data frame
X_new_pca_df <- as.data.frame(X_new_pca)
# Predict house prices using the trained linear regression model

15
predicted_prices <- predict(model, newdata = X_new_pca_df)
# The 'predicted_prices' variable now contains the predicted house prices
for the new dataset.
print(predicted_prices)
OUTPUT:

RESULT:
Thus, a program to implement Principal Component Analysis for
House dataset has been successfully executed using R.
16
Ex No 6: Implement simple linear regression program to predict the
future values and analyze the goodness of fit.
AIM:
To Implement simple linear regression program to predict the future
values and analyze the goodness of fit.
ALGORITHM:
❖ Data Collection: Create vectors for height and weight to store height
and weight data.
❖ Linear Regression Model: Build a linear regression model relation
using the lm function to predict weight from height.
❖ Create a data frame newData with a specific height value (170) for
prediction.
❖ Predict weight for the new data point using the predict function and
print the predicted weight value.
❖ Data Visualization: Create a scatter plot of height and weight using
plot, with labeled axes and visualize the result.
CODE:
height <- c (150, 160, 140, 155, 148, 177, 167, 126, 149, 131)
weight <- c (60, 70, 55, 55, 58, 80, 75, 45, 50, 44)
relation <- lm(weight ~ height)
print (relation)
print (summary(relation))
newData <- data.frame(height = 170)
print (newData)

17
result <- predict(relation, newData)
result
plot(weight, height)
plot(height, weight)
OUTPUT:

RESULT:
Thus, a program to implement linear regression program to predict the
future values and analyze the goodness of fit has been successfully
executed using R.

18
Ex No 7: Write a Program to implement Simple Naïve Bayes
classification algorithm for predicting the weather forecast
AIM:
To implement Simple Naive Bayes classification algorithm for
predicting the weather forecast
ALGORITHM:
❖ Load weather data from a CSV file.
❖ Convert the "weather" column to categories (e.g., rainy, sunny).
❖ Split the data into training and testing sets.
❖ Build a Naive Bayes model to predict weather based on features like
precipitation, temperature, and wind using the training data.
❖ Use the model to predict the weather on the test data and store the
predictions.
❖ Prepare a new data frame with sample values for weather-related
features.
❖ Apply the trained model to predict weather conditions for the new
data.
❖ Print the predicted weather conditions for each observation in the
new data.
CODE:
install.packages("el071")
library(e1071)
install.packages("caTools")
library(caTools)

19
# Load your dataset
data <- read.csv("C:\\Users\\HARI RAGHAV\\Downloads\\seattle-
weather.csv")
data$weather <- as.factor(data$weather)
set.seed(123) # For reproducibility
split <- sample.split(data$weather, SplitRatio = 0.7)
train_data <- subset(data, split == TRUE)
test_data <- subset(data, split == FALSE)
model <- naiveBayes(weather ~ precipitation + temp_max + temp_min +
wind, data = train_data)
predictions <- predict(model, newdata = test_data)
# Create a new data frame with sample data
new_data <- data.frame(
precipitation = c(0.1, 0.5, 0.0, 0.2, 0.0), # Example precipitation values
temp_max = c(25, 22, 18, 10, 28), # Example max temperature
values
temp_min = c(15, 10, 8, 0, 20), # Example min temperature
values
wind = c(10, 5, 15, 20, 8) # Example wind values
)
# Use the trained model to make predictions on the new data
predictions <- predict(model, newdata = new_data)

20
# Print the predicted weather conditions
for (i in 1:length(predictions))
{
cat("Observation", i, ": ",
levels(predictions)[as.integer(predictions[i])], "\n")
}
OUTPUT:

RESULT:
Thus, a program to implement Simple Naive Bayes classification
algorithm for predicting the weather forecast has been successfully
executed using R.

21
Ex No 8: Write a Program to implement K-Means clustering
operation and visualize for iris dataset
AIM:
To Write a Program to implement K-Means clustering operation and
visualize for iris dataset.
ALGORITHM:
❖ Import the necessary libraries, `ggplot2` and `cluster`, for data
visualization and clustering, respectively.
❖ Data Preparation: Create a dataframe `df` and assign the Iris dataset
to it.
❖ Data Visualization: Create a scatter plot using `ggplot2`, where
`Petal.Length` is on the x-axis, `Petal.Width` is on the y-axis.
❖ K-means Clustering: Perform k-means clustering on the numeric
columns (1:4) of the Iris dataset with three centers (`center=3`).
❖ Cluster Label Assignment: Add cluster labels to the original
dataframe `df` based on the clustering results.
❖ Create a scatter plot with cluster labels using `ggplot2` & Apply a
minimal theme to the plot using `theme_minimal()`.
❖ Cluster Analysis and Visualization: Compute and display a
contingency table (cross-tabulation) between the cluster
assignments and the actual species
❖ Visualize the clusters using a cluster plot with `clusplot` from the
`cluster` library.
❖ Performing elbow Method for Optimal Cluster Number Selection

22
CODE:
library(ggplot2)
df <- iris
head(iris)
ggplot(df, aes(Petal.Length, Petal.Width)) +
geom_point(aes(col=Species), size=4)
set.seed(101)
irisCluster <- kmeans(df[,1:4], center=3, nstart=20)
irisCluster
# Add cluster labels to the original dataframe
df$Cluster <- as.factor(irisCluster$cluster)
# Create a scatter plot with cluster labels
ggplot(df, aes(Petal.Length, Petal.Width, color = Cluster)) +
geom_point(size = 4) +
scale_color_manual(values = c("red", "blue", "green")) + # Customize
cluster colors
labs(color = "Cluster") +
theme_minimal()
table(irisCluster$cluster, df$Species)
library(cluster)
clusplot(iris, irisCluster$cluster, color=T, shade=T, labels=0, lines=0)

23
tot.withinss <- vector(mode="character", length=10)
for (i in 1:10){
irisCluster <- kmeans(df[,1:4], center=i, nstart=20)
tot.withinss[i] <- irisCluster$tot.withinss
}
plot(1:10, tot.withinss, type="b", pch=19)
OUTPUT:

RESULT:
Thus, a program to implement K-Means clustering operation and
visualize for iris dataset has been successfully executed using R.

24
Ex No 9: Write a Program to diagnose any disease using KNN
classification and plot the results
AIM:
To diagnose any disease using KNN classification and plot the results
in R.
ALGORITHM:
❖ Load the "class," "caret," "e1071," and "ggplot2" libraries for
various functions and data visualization.
❖ Read a dataset (assumed to be in CSV format) named "diabetes.csv"
into a variable called "data."
❖ Split the dataset into training and testing sets
❖ Set the number of neighbors (K) to 5.
❖ Build KNN Model: Use the "train" function from the "caret"
package to build a KNN classification model & Specify "Outcome"
as the target variable, and use all other columns as features for
prediction.
❖ Create a new data frame named "new_data" containing information
for five hypothetical individuals.
❖ Apply the previously trained KNN model to the "new_data" to make
predictions for each of the five hypothetical individuals.
❖ Convert the numeric predictions into "Yes" or "No" labels based on
a threshold of 0.5.
❖ Create a bar chart using "ggplot2" to visualize the count of "Yes"
and "No" predictions.

25
CODE:
# Load necessary libraries
library(class)
# Load the dataset
data <- read.csv("E:\\Semester Notes\\7th sem\\Data Analytics
Lab\\diabetes.csv")
# Split the dataset into training and testing sets
set.seed(123) # for reproducibility
sample_indices <- sample(nrow(data), size = 0.7 * nrow(data))
train_data <- data[sample_indices, ]
test_data <- data[-sample_indices, ]
# Define the number of neighbors (K) - you can tune this value
K <- 5
library(caret)
library(e1071)
# Assuming 'trainData' is your training dataset
knn_model <- train(Outcome ~ ., data = train_data, method = "knn",
trControl = trainControl(method = "cv", number = 5))
# Create a new data frame with the same column names
new_data <- data.frame(
Pregnancies = c(2, 5, 1, 7, 0),

26
Glucose = c(120, 160, 90, 200, 80),
BloodPressure = c(70, 80, 60, 90, 70),
SkinThickness = c(25, 30, 20, 35, 10),
Insulin = c(80, 120, 40, 200, 0),
BMI = c(27.5, 32.1, 25.0, 28.6, 21.4),
DiabetesPedigreeFunction = c(0.245, 0.625, 0.186, 0.956, 0.402),
Age = c(30, 45, 28, 54, 23)
)
# Print the new data frame
print(new_data)
# Make sure 'new_data' is properly formatted with the same column names
as your training data
# Exclude the 'Outcome' column since you're predicting it
new_data <- new_data[, names(train_data) != "Outcome"]
# Step 2: Apply the KNN model to the new dataset
new_predictions <- predict(knn_model, newdata = new_data)
# Step 3: Convert numeric predictions to "Yes" or "No" based on the
threshold
threshold <- 0.5
diabetes_predictions <- ifelse(new_predictions >= threshold, "Yes",
"No")
# Print the predictions
27
print(diabetes_predictions)
install.packages("ggplot2")
library(ggplot2)
predictions_df <- data.frame(Prediction = diabetes_predictions)
# Create a bar chart
ggplot(predictions_df, aes(x = Prediction)) +
geom_bar() +
labs(title = "Diabetes Predictions", x = "Prediction", y = "Count")
OUTPUT:

RESULT:
Thus, a program to diagnose any disease using KNN classification and
plotting the results has been successfully executed using R.

28
Ex No 10: Create the following visualization plots for the movie
recommendations system
1. Bar, Pie, Box and scatter plot.
2. Find the outliers using plot
AIM:
To visualize bar, pie, box, scatter plot and to find the outliers using
plots for a movie recommendation system.
ALGORITHM:
❖ Data Preparation:
➢ Create a data frame named `data` containing movie
information.
➢ Preprocess the `tagline` column by converting text to
lowercase, removing punctuation, numbers, stopwords, and
extra whitespace.
➢ Create a Document-Term Matrix (DTM) from the
preprocessed text.
❖ Define a Function to Get Similarity Scores:
➢ Create a function called `get_single_movie_recommendation`
that takes a target movie name as input.
➢ Find the row index of the target movie in the dataset.
➢ Calculate cosine similarity between the target movie and all
other movies based on the DTM.
➢ Create a data frame with movie titles and similarity scores.
➢ Sort the data frame by similarity and exclude the target movie.
➢ Get the top recommended movie based on similarity and return
its title.

29
❖ Data Visualization: Install and load necessary R packages, including
`tm`, `proxy`, and `ggplot2`.
❖ Box Plot Visualization: Create a box plot to visualize the
distribution of similarity scores for all movies.
❖ Scatter Plot Visualization: Create a scatter plot to show cosine
similarity scores between the target movie and other movies.
❖ Bar Chart Visualization: Create a bar chart with reordered movie
titles based on similarity scores.
❖ Pie Chart Visualization for Recommendations:
➢ Get similarity scores for all movies using the
`get_single_movie_recommendation` function.
➢ Create a pie chart to display recommended movies based on
similarity to the target movie.
❖ Outlier Detection and Visualization:
➢ Get similarity scores for all movies relative to a target movie
using the get_single_movie_recommendation function.
➢ Create a box plot to visualize the distribution of similarity
scores.
➢ Customize the plot to make outliers transparent.
➢ Identify the movie with the maximum similarity score and
highlight it as an outlier.
➢ This is done by adding a red dot to the box plot at the position
of the maximum similarity score.
CODE:
data <- data.frame(
movie_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
title = c(

30
"The Shawshank Redemption", "The Godfather", "Pulp Fiction", "The
Dark Knight", "Schindler's List",
"Forrest Gump", "The Matrix", "Inception", "Fight Club", "The Silence
of the Lambs"
),
tagline = c(
"Fear can hold you prisoner. Hope can set you free.", "An offer you can't
refuse.","Just because you are a character doesn't mean you have
character.", "Why So Serious?",
"The list is an absolute good. The list is life.", "Life is like a box of
chocolates...",
"Welcome to the Real World.", "Your mind is the scene of the crime.",
"Mischief. Mayhem. Soap.",
"To enter the mind of a killer, she must challenge the mind of a madman."
),
vote_average = c(9.3, 9.2, 8.9, 9.0, 8.9, 8.8, 8.7, 8.8, 8.9, 8.6),
vote_count = c(13878, 10121, 19354, 24352, 8340, 18847, 17286, 24178,
18123, 16341)
)
# Preprocess text-based columns
install.packages("tm")
library(tm)
corpus <- Corpus(VectorSource(data$tagline))

31
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)

# Function to get similarity scores for all movies (excluding the target
movie)
get_similarity_scores <- function(target_movie_name) {
# Find the row index corresponding to the target movie by name
target_movie_index <- which(data$title == target_movie_name)
if (length(target_movie_index) == 0) {
return(NULL)
}
else {
# Calculate cosine similarity between the target movie and all other
movies
target_movie_vector <- as.matrix(dtm)[target_movie_index, , drop =
FALSE]
similarity_scores <- proxy::simil(as.matrix(dtm), target_movie_vector,
method = "cosine")

32
# Ensure similarity_scores has the same length as the number of movies
similarity_scores <- rep(similarity_scores, nrow(data))
# Create a data frame with movie titles and similarity scores
similarity_df <- data.frame(
movie_title = data$title,
similarity = similarity_scores
)
# Sort by similarity and exclude the target movie
similarity_df <- similarity_df[order(-similarity_df$similarity), ]
similarity_df <- similarity_df[similarity_df$movie_title !=
target_movie_name, ]
# Get the top recommended movie
top_recommended_movie <- head(similarity_df, 1)
# Return the title of the top recommended movie
return(top_recommended_movie$movie_title)
}
}
# Get a recommended movie for a specific movie ("Forrest Gump" in this
example)
install.packages("proxy")
target_movie_name <- "Forrest Gump"

33
recommended_movie <-
get_single_movie_recommendation(target_movie_name)
recommended_movie
# Filter the similarity scores for recommended movies only
recommended_movies <- similarity_df[similarity_df$movie_title ==
recommended_movie, ]
install.packages("ggplot2")
library(ggplot2)
# Create a data frame with movie titles and similarity scores
similarity_df <- data.frame(
movie_title = data$title,
similarity = similarity_scores
)
# Create a data frame for the box plot
boxplot_data <- data.frame(
Movie_Title = similarity_df$movie_title,
Similarity_Score = similarity_df$similarity
)
# Create a box plot with custom x-axis labels and no y-axis label
ggplot(boxplot_data, aes(x = Movie_Title)) +
geom_boxplot() +

34
labs(title = "Similarity Scores Distribution", x = "Movie Titles") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create a scatter plot
ggplot(similarity_df, aes(x = movie_title, y = similarity)) +
geom_point(size = 3) +
labs(
title = paste("Movie Similarity to", target_movie_name),
x = "Movie Title",
y = "Cosine Similarity"
)+
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
# Load the ggplot2 library
library(ggplot2)
# Calculate cosine similarity between the target movie and all other
movies
target_movie_vector <- as.matrix(dtm)[which(data$title ==
target_movie_name), , drop = FALSE]
similarity_scores <- proxy::simil(as.matrix(dtm), target_movie_vector,
method = "cosine")
# Sort the data frame by similarity score in descending order
similarity_df <- similarity_df[order(-similarity_df$similarity), ]
35
# Exclude the target movie
similarity_df <- similarity_df[similarity_df$movie_title !=
target_movie_name, ]
# Create a bar chart
ggplot(similarity_df, aes(x = reorder(movie_title, -similarity), y =
similarity)) +
geom_bar(stat = "identity", fill = "blue") +
labs(
title = paste("Movie Similarity to", target_movie_name),
x = "Movie Title",
y = "Cosine Similarity"
)+
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
# Load the ggplot2 library
library(ggplot2)
# Get similarity scores for all movies
similarity_scores_df <- get_similarity_scores(target_movie_name)
if (!is.null(similarity_scores_df)) {
# Create a pie chart
ggplot(similarity_scores_df, aes(x = "", y = similarity, fill = movie_title))
+
36
geom_bar(stat = "identity") +
coord_polar(theta = "y") +
labs(
title = paste("Recommended Movies based on Similarity to",
target_movie_name),
x = NULL,
y = NULL,
fill = "Movie Title"
)+
theme_void() +
theme(legend.position = "bottom")
}
else {
# Handle the case when the target movie is not found
print("Target movie not found in the dataset.")
}
# Load the ggplot2 library
library(ggplot2)
# Function to get similarity scores for all movies (excluding the target
movie)
get_similarity_scores <- function(target_movie_name) {

37
# Find the row index corresponding to the target movie by name
target_movie_index <- which(data$title == target_movie_name)
if (length(target_movie_index) == 0) {
return(NULL)
} else {
# Calculate cosine similarity between the target movie and all other
movies
target_movie_vector <- as.matrix(dtm)[target_movie_index, , drop =
FALSE]
similarity_scores <- proxy::simil(as.matrix(dtm), target_movie_vector,
method = "cosine")
# Ensure similarity_scores has the same length as the number of movies
similarity_scores <- rep(similarity_scores, nrow(data))
# Create a data frame with movie titles and similarity scores
similarity_df <- data.frame(
movie_title = data$title,
similarity = similarity_scores
)
# Sort by similarity and exclude the target movie
similarity_df <- similarity_df[order(-similarity_df$similarity), ]
similarity_df <- similarity_df[similarity_df$movie_title !=
target_movie_name, ]

38
return(similarity_df)
}
}
# Get similarity scores for all movies
similarity_scores_df <- get_similarity_scores(target_movie_name)
if (!is.null(similarity_scores_df)) {
# Create a box plot and highlight the outlier with a red dot
ggplot(similarity_scores_df, aes(x = reorder(movie_title, -similarity), y =
similarity)) +
geom_boxplot(outlier.colour = "transparent") + # Make outliers
transparent in the box plot
geom_point(data =
similarity_scores_df[which(similarity_scores_df$similarity ==
max(similarity_scores_df$similarity)), ],
aes(x = reorder(movie_title, -similarity), y = similarity),
color = "red", size = 3) + # Highlight the maximum similarity as an outlier
in red
labs(
title = paste("Similarity Scores Distribution to", target_movie_name),
x = "Movie Title",
y = "Cosine Similarity"
)+

39
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
}
else
{
# Handle the case when the target movie is not found
print("Target movie not found in the dataset.")
}
OUTPUT:
Schindler's List

40
41
RESULT:
Thus, a program to visualize bar, pie, box, scatter plot and to find the
outliers using plots for a movie recommendation system has been
successfully executed using R.

Vulnerability Assessment Report
No ratings yet
Vulnerability Assessment Report
23 pages
Da Lab It
No ratings yet
Da Lab It
20 pages
Galgotias College of Engineering & Technology: Inroduction To Data Analytics and Visualization Lab File (KDS-551)
No ratings yet
Galgotias College of Engineering & Technology: Inroduction To Data Analytics and Visualization Lab File (KDS-551)
47 pages
Lab file AD pdf
No ratings yet
Lab file AD pdf
25 pages
DSDA MANUAL
No ratings yet
DSDA MANUAL
64 pages
R Programmimg Practical Journal All-1
No ratings yet
R Programmimg Practical Journal All-1
25 pages
Experiment 5
No ratings yet
Experiment 5
13 pages
arunav da prac
No ratings yet
arunav da prac
55 pages
L3 Notes-1
No ratings yet
L3 Notes-1
8 pages
Prerequisites: R Installation
No ratings yet
Prerequisites: R Installation
11 pages
Ida Lab Final
No ratings yet
Ida Lab Final
29 pages
R Assignment 10
No ratings yet
R Assignment 10
12 pages
18 3 24 Upto Week 6 A B Latest 1
No ratings yet
18 3 24 Upto Week 6 A B Latest 1
25 pages
R-Programming Lab Mannual
No ratings yet
R-Programming Lab Mannual
33 pages
Statistics With R
No ratings yet
Statistics With R
5 pages
R Programing Bhagu
No ratings yet
R Programing Bhagu
40 pages
R_Lab_rec_final
No ratings yet
R_Lab_rec_final
31 pages
saurabh
No ratings yet
saurabh
22 pages
Basics of Statistics and Probability - FP: Statistical Measures
No ratings yet
Basics of Statistics and Probability - FP: Statistical Measures
12 pages
Lab Manual Record: St. Josephs PG College
No ratings yet
Lab Manual Record: St. Josephs PG College
14 pages
Final Cost Practical
No ratings yet
Final Cost Practical
29 pages
Analysis Using Statistical: Introduction & Data Exploration
No ratings yet
Analysis Using Statistical: Introduction & Data Exploration
23 pages
r file code
No ratings yet
r file code
16 pages
R456
No ratings yet
R456
8 pages
R Practicals
No ratings yet
R Practicals
32 pages
Big Data Analytics Programs Only
No ratings yet
Big Data Analytics Programs Only
22 pages
da lab file 2
No ratings yet
da lab file 2
13 pages
Data Analytics Practical
No ratings yet
Data Analytics Practical
6 pages
R Programs
No ratings yet
R Programs
12 pages
Aditya Garg DMDW
No ratings yet
Aditya Garg DMDW
40 pages
Data Analytics Lab
No ratings yet
Data Analytics Lab
14 pages
R Lab File Deepak
No ratings yet
R Lab File Deepak
27 pages
data analytics lab manual
No ratings yet
data analytics lab manual
26 pages
Bigdata Programs&Solutions
No ratings yet
Bigdata Programs&Solutions
7 pages
R Program Record Book Iba
No ratings yet
R Program Record Book Iba
24 pages
R-Lab p-4,2,1
No ratings yet
R-Lab p-4,2,1
12 pages
Cost Practical
No ratings yet
Cost Practical
13 pages
Practical 1- Basics of R
No ratings yet
Practical 1- Basics of R
8 pages
50 R Exercises
No ratings yet
50 R Exercises
44 pages
R Programming LAB
No ratings yet
R Programming LAB
32 pages
MODEL 1
No ratings yet
MODEL 1
14 pages
as
No ratings yet
as
22 pages
Lab 02 - Compound Data Structures
No ratings yet
Lab 02 - Compound Data Structures
12 pages
DA
No ratings yet
DA
10 pages
RStudio
No ratings yet
RStudio
31 pages
R Record
No ratings yet
R Record
16 pages
Data Analytics-Lab Manual
No ratings yet
Data Analytics-Lab Manual
19 pages
DS Lab
No ratings yet
DS Lab
31 pages
An Introduction To R Language
No ratings yet
An Introduction To R Language
11 pages
R study material I
No ratings yet
R study material I
8 pages
materi 4
No ratings yet
materi 4
30 pages
DA Practical File
No ratings yet
DA Practical File
36 pages
R File New
No ratings yet
R File New
22 pages
R Lab
No ratings yet
R Lab
15 pages
Strategic Management Questions
No ratings yet
Strategic Management Questions
2 pages
R Lab Manual
No ratings yet
R Lab Manual
19 pages
Sheet
No ratings yet
Sheet
2 pages
R Programming Practical File
No ratings yet
R Programming Practical File
38 pages
R - Tutorial: Matrices Are Vectors
No ratings yet
R - Tutorial: Matrices Are Vectors
13 pages
R Exam
No ratings yet
R Exam
18 pages
Rust Package 100 Knocks: One-Hour Mastery Series 2024 Edition
From Everand
Rust Package 100 Knocks: One-Hour Mastery Series 2024 Edition
Kanto
No ratings yet
george2008
No ratings yet
george2008
5 pages
CIT -2 -ISP-NOTES
No ratings yet
CIT -2 -ISP-NOTES
19 pages
20oe108 -Isp-question Bank 2024-2025 Odd
No ratings yet
20oe108 -Isp-question Bank 2024-2025 Odd
2 pages
Networking Class Test 2
No ratings yet
Networking Class Test 2
5 pages
Enjoy Kwanzaa Newsletter by Slidesgo
No ratings yet
Enjoy Kwanzaa Newsletter by Slidesgo
41 pages
Nitrox 125 Rpiezas
No ratings yet
Nitrox 125 Rpiezas
34 pages
56th CONVOCATION APPLICATION
No ratings yet
56th CONVOCATION APPLICATION
1 page
Cycle Inventory
No ratings yet
Cycle Inventory
40 pages
Can, Could, Be Able - Context Exercises With Key (2 Pages + Key) .
No ratings yet
Can, Could, Be Able - Context Exercises With Key (2 Pages + Key) .
4 pages
Project Planning and Execution
100% (3)
Project Planning and Execution
5 pages
Completion - Drilling Formulas
100% (2)
Completion - Drilling Formulas
13 pages
Prosthetics and Orthotics Manufacturing Guidelines: Lower Limb Prosthetics: Partial Foot Prosthesis
100% (2)
Prosthetics and Orthotics Manufacturing Guidelines: Lower Limb Prosthetics: Partial Foot Prosthesis
27 pages
M. Sc. (11/01-05/02) Final Year Project: CV: Jahanzeb Ahmed
No ratings yet
M. Sc. (11/01-05/02) Final Year Project: CV: Jahanzeb Ahmed
5 pages
Questions - Worksheet_homework - 9th - Science - 2024-04-06T01_13
No ratings yet
Questions - Worksheet_homework - 9th - Science - 2024-04-06T01_13
8 pages
Being and Time
82% (17)
Being and Time
590 pages
Acta Colombiana Cuidado Intensivo
No ratings yet
Acta Colombiana Cuidado Intensivo
12 pages
Ad 19123
No ratings yet
Ad 19123
1 page
8177B KFC Annual Report 2023 DPS Digital - Compressed
No ratings yet
8177B KFC Annual Report 2023 DPS Digital - Compressed
15 pages
Difference Theory: Genderlect Theory of Deborah Tannen
No ratings yet
Difference Theory: Genderlect Theory of Deborah Tannen
2 pages
MODULE 1 - ECONOMIC GLOBALIZATION - CONTEMPORARY WORLD - Pot
No ratings yet
MODULE 1 - ECONOMIC GLOBALIZATION - CONTEMPORARY WORLD - Pot
21 pages
A. Office Sundram Fasteners LTD., Padi, Chennai-600050: Biodata of Assessor
No ratings yet
A. Office Sundram Fasteners LTD., Padi, Chennai-600050: Biodata of Assessor
2 pages
Ronco 2017
No ratings yet
Ronco 2017
10 pages
Electronic Support For Placing in London: Designed by The Market For The Market
No ratings yet
Electronic Support For Placing in London: Designed by The Market For The Market
3 pages
Citation510 Mustang
No ratings yet
Citation510 Mustang
62 pages
Ignition System - 1
100% (1)
Ignition System - 1
5 pages
Edge Computing
No ratings yet
Edge Computing
14 pages
Indian Minerals Yearbook 2018: (Part-I)
No ratings yet
Indian Minerals Yearbook 2018: (Part-I)
12 pages
L1S2 - Opportunity Cost and Marginal Analysis Exercises
No ratings yet
L1S2 - Opportunity Cost and Marginal Analysis Exercises
2 pages
MFAT Individual Form
No ratings yet
MFAT Individual Form
2 pages
Sekolah Menengah Kejuruan at Taqwa: A. Choose The Correct Answer by Crossing (X) A, B, C, D, or E!
No ratings yet
Sekolah Menengah Kejuruan at Taqwa: A. Choose The Correct Answer by Crossing (X) A, B, C, D, or E!
3 pages
Chapter 2 The Classical Linear Regression Model (CLRM)
No ratings yet
Chapter 2 The Classical Linear Regression Model (CLRM)
20 pages
Nguyen Van Tuan - CV
No ratings yet
Nguyen Van Tuan - CV
3 pages
Ethics: Slam Shraf Ahmy
No ratings yet
Ethics: Slam Shraf Ahmy
6 pages

DA Lab Manual

Uploaded by

DA Lab Manual

Uploaded by

Ex No 1: Write a program to perform numerical operations (MAX,

MIN, AVG, SUM, SQRT, ROUND) using R.

❖ Print the contents (age column) of the csv_data variable, displaying

❖ Calculate the mean (average) value of the 'Age' column in the

❖ Calculate the median value of the 'Age' column in the csv_data

❖ Calculate the standard deviation of the 'Age' column in the csv_data

❖ Print the results.

var1 var2 var3

You might also like