Harsh PSDA Practical File
[CSIT366]
FUNDAMENTALS OF DATA SCIENCE AND
ANALYTICS
LAB FILE
Submitted to: Dr. Rashmi Vashisth
Submitted by: Harsh Singh Parmar
Program: BCA 5th semester
Enrollment no: A10046622002
INDEX

S.No.  Name of the Experiment                                                              Date       Signature
1.     To implement Bar Plot, Histogram and Line Chart.                                    11-07-24
2.     To implement frequency distribution and trend chart using a data set.               18-07-24
3.     Write a program to create a matrix of random numbers and convert it to a vector.    25-07-24
4.     Write a program to perform different text analysis operations using NLTK.           01-08-24
5.     Write a program to generate a random word using (i) HTTP Request (ii) A Text File.  08-08-24
6.     To implement Image Morphing in Python.                                              06-09-24
7.     To implement and verify EDA techniques.                                             19-09-24
8.     To implement KNN algorithm on iris dataset.                                         26-09-24
9.     To perform sentiment analysis on text data using a pretrained model.                03-10-24
10.    To implement Principal Component Analysis on a dataset.                             10-10-24
Date-11-07-24
Experiment-1
Aim: To implement Bar Plot, Histogram and Line Chart.
Theory
Bar Plot: A bar plot is a visual representation of data using rectangular bars,
where the length of each bar corresponds to the value it represents. It's effective
for comparing categories or showing changes over time.
Histogram: A histogram is a visual representation of data distribution where
continuous data is grouped into intervals and displayed as bars. The height of
each bar corresponds to the frequency of values within that interval, providing
insights into data shape and spread.
Line Chart: A line chart is a visual representation of data points connected by
straight lines, often used to show trends and changes over time. It's ideal for
displaying continuous data and highlighting patterns of increase, decrease, or
stability.
Code:
import matplotlib.pyplot as plt
import numpy as np
# Line chart: ten points joined by straight line segments
x = np.arange(10)
y = np.random.rand(10)
plt.figure(figsize=(8, 6))
plt.plot(x, y, marker='x')
plt.title('Line Chart Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
# Histogram: 1000 normally distributed values grouped into 30 bins
data = np.random.randn(1000)
plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Bar plot: sample categories with random values
categories = ['A', 'B', 'C', 'D', 'E']
values = np.random.randint(1, 10, size=5)
plt.figure(figsize=(8, 6))
plt.bar(categories, values)
plt.title('Bar Plot Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()
OUTPUT:
Date-18-07-24
Experiment-2
Aim: To implement frequency distribution and trend chart using a data set.
Theory: Frequency distribution and trend charts are complementary tools for
data analysis. A frequency distribution organizes data into intervals, showing
how often values occur within each range. This helps identify patterns and
central tendencies. A trend chart, on the other hand, visualizes data points
over time, revealing upward, downward, or stable patterns. By combining these
methods, analysts can gain insights into data behavior, make predictions, and
inform decision-making.
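The program below counts exact values with value_counts; for continuous data, the interval-based frequency distribution described above can be built with pandas' cut function. A minimal sketch (the price values are illustrative):

import pandas as pd
# Hypothetical continuous data, binned into 5 equal-width intervals
prices = pd.Series([199, 249, 299, 449, 599, 649, 799, 899, 999, 1099])
freq = pd.cut(prices, bins=5).value_counts().sort_index()
print(freq)  # frequency of values falling in each interval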
Code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('camera_dataset.csv')
# Trend chart: highest resolution released in each year
plt.figure(figsize=(10, 6))
df.groupby('Release date')['Max resolution'].max().plot(kind='line', marker='o')
plt.xlabel("Year")
plt.ylabel("Maximum Resolution")
plt.title("Trend Chart: Maximum Resolution over Years")
plt.show()
# Frequency distribution: how often each effective-pixel value occurs
plt.figure(figsize=(10, 6))
effective_pixels = df['Effective pixels'].value_counts()
plt.bar(effective_pixels.index, effective_pixels.values, edgecolor="black")
plt.xlabel("Effective Pixels")
plt.ylabel("Frequency")
plt.title("Bar Chart: Frequency of Effective Pixels")
plt.show()
OUTPUT
Date-25-07-24
Experiment-3
Aim: Write a program to create a matrix of random numbers and convert it
to a vector.
Software used: PyCharm
Theory
A matrix is a rectangular array of numbers arranged in rows and columns. In
Python, it's often represented as a nested list. The NumPy library provides
efficient matrix operations.
A vector is a one-dimensional array of numbers, essentially a special case of a
matrix with a single row or column. In Python, it can be represented as a list or a
NumPy array. Vectors are used for various calculations and linear algebra
operations.
Both matrices and vectors are fundamental data structures in fields like linear
algebra, machine learning, and data science.
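As a quick illustration of the conversion step, flatten() copies a matrix into a one-dimensional vector, while ravel() returns a view where possible (the 2 x 3 shape is arbitrary):

import numpy as np
m = np.random.rand(2, 3)  # 2 x 3 matrix
v = m.flatten()           # copy, shape (6,)
w = m.ravel()             # view where possible, shape (6,)
print(v.shape, w.shape)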
Code:
import numpy as np
def create_random_matrix(n, k):
    """Create an n x k matrix of random numbers.
    Args:
        n (int): The number of rows in the matrix.
        k (int): The number of columns in the matrix.
    Returns:
        np.ndarray: An n x k matrix of random coefficients.
    """
    matrix = np.random.rand(n, k)
    return matrix
matrix = create_random_matrix(3, 4)
print("Matrix:")
print(matrix)
vector = matrix.flatten()  # convert the matrix to a one-dimensional vector
print("Vector:")
print(vector)
OUTPUT
Date-01-08-24
Experiment-4
Aim: Write a program to perform different text analysis operations using
NLTK.
Code:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
def pos_tagging(text):
    tokens = word_tokenize(text)       # split the text into word tokens
    tagged_tokens = pos_tag(tokens)    # attach a part-of-speech tag to each token
    return tagged_tokens
# Example usage:
text = "This is a sample sentence. Another sentence. And the last one."
tagged_tokens = pos_tagging(text)
print(tagged_tokens)
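POS tagging is only one of the operations the aim mentions. A short sketch of two more common NLTK analyses, token frequency and stopword removal, assuming the stopwords corpus is downloaded in addition to punkt above:

import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
tokens = word_tokenize("This is a sample sentence. Another sentence. And the last one.")
fdist = FreqDist(tokens)            # frequency of each token
print(fdist.most_common(5))
filtered = [t for t in tokens if t.lower() not in stopwords.words('english')]
print(filtered)                     # tokens with English stopwords removed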
Date-08-08-24
Experiment-5.1
Aim: Write a program to generate a random word using
(i) HTTP Request
(ii) A Text File
Code:
import random
import requests
def get_random_word():
    url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
    response = requests.get(url)
    if response.status_code == 200:
        words = response.text.splitlines()   # one word per line in the source file
        random_word = random.choice(words)
        return random_word
    else:
        return None
random_word = get_random_word()
if random_word:
    print("Random word:", random_word)
else:
    print("Failed to get random word")
OUTPUT
Date-08-08-24
Experiment-5.2
Aim: Write a program to generate a random word using
(i) HTTP Request
(ii) A Text File
Code:
import random
def get_random_word(filename):
    with open(filename, 'r') as f:
        words = [line.strip() for line in f]   # one word per line
    random_word = random.choice(words)
    return random_word
filename = 'words.txt'
random_word = get_random_word(filename)
print("Random word:", random_word)
OUTPUT
Date-06-09-24
Experiment-6
Aim: To implement Image Morphing in Python
Software Used: PyCharm
Theory: Image morphing in Python involves smoothly transforming one image
into another by manipulating pixels, shapes, or features. This can be achieved
using libraries like OpenCV, which allows for image warping and blending
techniques. By defining corresponding points between two images, algorithms
like Delaunay triangulation or thin plate splines can interpolate the intermediate
images. The result is a seamless transition from one image to another, often used
in applications like animation, facial recognition, or artistic effects. Libraries such
as NumPy are also helpful in managing pixel-level operations efficiently.
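Before the full script, a minimal sketch of the warping step the theory mentions, mapping one triangle onto another with OpenCV's affine transform (assumes opencv-python is installed; a full feature-based morph would repeat this for every Delaunay triangle):

import cv2
import numpy as np
img = np.zeros((200, 200, 3), dtype=np.uint8)       # placeholder image
src_tri = np.float32([[10, 10], [150, 30], [60, 160]])
dst_tri = np.float32([[30, 20], [160, 60], [50, 180]])
M = cv2.getAffineTransform(src_tri, dst_tri)        # 2 x 3 affine matrix
warped = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))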
Code:
from PIL import Image
import requests
from io import BytesIO
import os
image1_url = "https://cdn.pixabay.com/photo/2014/02/27/16/10/flowers-276014_1280.jpg"
image2_url = "https://cdn.pixabay.com/photo/2015/04/23/22/00/tree-736885_1280.jpg"
response1 = requests.get(image1_url)
response2 = requests.get(image2_url)
num_frames = 30
output_folder = 'morphing_frames/'
os.makedirs(output_folder, exist_ok=True)
if response1.status_code == 200 and response2.status_code == 200:
    img1 = Image.open(BytesIO(response1.content)).convert('RGB')
    img2 = Image.open(BytesIO(response2.content)).convert('RGB')
    img2 = img2.resize(img1.size)  # blending requires images of the same size
    # Simple cross-dissolve morph: blend factor moves from 0 (image 1) to 1 (image 2)
    for i in range(num_frames):
        alpha = i / (num_frames - 1)
        frame = Image.blend(img1, img2, alpha)
        frame.save(os.path.join(output_folder, f'frame_{i:02d}.png'))
else:
    print("Failed to download images from the provided URLs.")
OUTPUT
Date-19-09-24
Experiment-7
Aim: To implement and verify EDA techniques
Software Used: PyCharm
Theory: Exploratory Data Analysis (EDA) techniques in Python are crucial for
understanding the underlying patterns, trends, and relationships within a
dataset. Common EDA techniques include descriptive statistics (like mean,
median, and mode), data visualization, and distribution analysis. Libraries such
as Pandas and NumPy allow for easy data manipulation and summary statistics,
while Matplotlib and Seaborn are used for plotting histograms, boxplots, scatter
plots, and correlation heatmaps. Outlier detection, handling missing data, and
normalization or transformation techniques are often applied to clean and better
understand the dataset before applying predictive models.
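Alongside the plots below, the non-visual techniques mentioned above (summary statistics, missing values, outlier detection) can be checked with a short sketch like this, assuming the same titanic.csv:

import pandas as pd
df = pd.read_csv('titanic.csv')
print(df.describe())       # descriptive statistics for numeric columns
print(df.isnull().sum())   # missing values per column
# IQR rule for outliers in Fare
q1, q3 = df['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Fare'] < q1 - 1.5 * iqr) | (df['Fare'] > q3 + 1.5 * iqr)]
print(len(outliers), "potential Fare outliers")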
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
file_path = 'titanic.csv'
df = pd.read_csv(file_path)
plt.figure(figsize=(15, 8))
# 1. Countplot of Survived
plt.subplot(2, 3, 1)
sns.countplot(x='Survived', data=df)
plt.title('Countplot of Survived')
# 5. Bubble chart: Age vs Fare with Pclass as colour and Fare as size
plt.subplot(2, 3, 5)
bubble_sizes = df['Fare'] / 2  # scale down bubble sizes for better visualization
sns.scatterplot(x='Age', y='Fare', size=bubble_sizes, hue='Pclass', data=df,
                palette='viridis', alpha=0.6)
plt.title('Bubble Chart: Age vs Fare')
plt.tight_layout()
plt.show()
Date-26-09-24
Experiment-8
Aim: To implement KNN algorithm on iris dataset
Software Used: PyCharm
Theory: KNN algorithm operates on the principle of "similarity is proximity."
Given a new data point, KNN finds the K closest data points (neighbours) from
the training set. The class or value of the majority of these neighbours is then
assigned to the new data point. In classification, this means predicting the
category, while in regression, it involves predicting a numerical value. KNN is
often used for tasks like image recognition, recommendation systems, and
anomaly detection.
The Iris dataset consists of 150 samples, 50 from each of three species of Iris (Iris
setosa, Iris virginica and Iris versicolor). Four features were measured from each
sample: the length and the width of the sepals and petals, in centimetres.
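The "majority of the K nearest neighbours" rule can be written out directly; a minimal NumPy sketch of the idea (the scikit-learn version used in the code below handles this internally):

import numpy as np
def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote
X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 6]), k=3))  # prints 1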
Code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
file = "iris.csv"
iris = pd.read_csv(file)
X = iris.iloc[:, :-1]  # measurement columns (assumes the label is the last column)
y = iris.iloc[:, -1]   # species label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred), classification_report(y_test, y_pred), sep="\n")
OUTPUT
Date-03-10-24
Experiment-9
Aim: To perform sentiment analysis on text data using a pretrained model.
Software Used: PyCharm
Theory: Sentiment analysis on text data involves automatically determining the
emotional tone of a given piece of text, typically classified as positive, negative,
or neutral. A pretrained model, like those based on deep learning architectures
such as Recurrent Neural Networks (RNNs) or Transformers, can be effectively
used for this task. These models are trained on large datasets of labeled text,
allowing them to learn complex patterns and nuances in language that are
indicative of sentiment. By feeding a new piece of text into a pretrained model,
it can predict the sentiment with a certain degree of accuracy, providing valuable
insights for applications like social media monitoring, customer feedback
analysis, movie reviews and market research.
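The program below relies on the pipeline's default checkpoint. For reproducibility, the model can be pinned explicitly; a sketch assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint (the usual default for this pipeline):

from transformers import pipeline
clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
for r in clf(["The plot was gripping", "The pacing dragged"]):
    print(r['label'], round(r['score'], 3))  # each result is a dict with a label and a score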
Code:
import torch  # provides the backend for the pipeline
from transformers import pipeline
# Pretrained sentiment pipeline (downloads a default checkpoint on first run)
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love this movie", "This movie sucks!", "This movie is damn good!",
        "The movie wasn't up to the mark", "The movie could have been better!",
        "That movie was awesome", "I found it very touching",
        "It feels like movie of the year"]
result = sentiment_pipeline(data)
print(result)
OUTPUT
Date-10-10-24
Experiment-10
Aim: To implement Principal Component Analysis on a dataset
Software Used: PyCharm
Theory: PCA (Principal Component Analysis) is a dimensionality reduction
technique widely used in machine learning. It transforms a large dataset of
interrelated variables into a smaller set of uncorrelated variables called principal
components. These components capture the most variance in the data, allowing
for efficient data representation and analysis. PCA is commonly used for tasks
like feature engineering, visualization, and noise reduction, making it a valuable
tool in various machine learning applications.
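Under the hood, the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue. A minimal NumPy sketch of that derivation (random data for illustration; scikit-learn's PCA, used below, reaches the same components via SVD):

import numpy as np
X = np.random.rand(100, 4)
Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, since the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
components = eigvecs[:, order]
X_reduced = Xc @ components[:, :2]      # project onto the top 2 components
print(eigvals[order] / eigvals.sum())   # explained variance ratios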
Code:
For Reduction to 1-Dimension
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load and standardise the iris data before projecting it
iris = load_iris()
X = iris.data
y = iris.target
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
X_pca_1d = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 4))
plt.scatter(X_pca_1d, np.zeros_like(X_pca_1d), c=y, cmap='viridis', edgecolor='k', s=100)
plt.xlabel('First Principal Component')
plt.title('PCA of Iris Dataset (1 Dimension)')
plt.yticks([])
plt.show()
explained_variance = pca.explained_variance_ratio_
print(f'Explained variance by the first principal component: {explained_variance[0]}')
OUTPUT
For Reduction to 2-Dimensions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = load_iris()
X = iris.data
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.colorbar(label='Iris Species')
plt.show()
explained_variance = pca.explained_variance_ratio_
print(f'Explained variance by each principal component: {explained_variance}')
OUTPUT