Data Analytics Lab Manual
by Preeti Sahu
Course Objectives
Reference Books
1. Introduction to Data Mining, P.-N. Tan, M. Steinbach, and V. Kumar, Addison Wesley, 2006.
2. Data Mining and Analysis: Fundamental Concepts and Algorithms, M. J. Zaki and W. Meira Jr., Cambridge University Press, 2014.
3. Mining of Massive Datasets, J. Leskovec (Stanford Univ.), A. Rajaraman (Milliway Labs), and J. D. Ullman (Stanford Univ.).
Data Preprocessing
Overview
Data preprocessing is a critical first step in any data analytics project. It involves handling missing values, detecting and removing noise, and identifying and eliminating data redundancy.
An initial inspection of the data reveals whether the data set contains missing values. This inspection is carried out through Exploratory Data Analysis (EDA), which is why a data scientist should always perform EDA in order to identify missing values correctly.
Handling Missing Values
Common Techniques for Missing Value Imputation
1 Mean/Median Imputation
Replace missing values with the mean or median of the respective
feature.
2 Forward/Backward Fill
Replace missing values with the previous or next value in the
sequence.
The following minimal sketch creates a sample DataFrame with missing values and replaces them with the mean of each column.
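import pandas as pd
import numpy as np

# Sample DataFrame with missing values (illustrative values)
df = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                   'B': [np.nan, 2, np.nan, 4, 5]})

# Replace each missing value with its column mean
df_imputed = df.fillna(df.mean())
print(df_imputed)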
Forward/Backward Fill Example
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                   'B': [np.nan, 2, np.nan, 4, 5]})

# Forward fill: propagate the last valid observation forward
df['A'] = df['A'].ffill()
df['B'] = df['B'].ffill()
print(df)
This code demonstrates the forward fill method, which replaces missing values with the previous value in the
sequence.
KNN Imputation Example
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Impute each missing value from its 2 nearest neighbours
df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'B': [2, 3, np.nan, 5]})
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
print(imputed)
The KNN imputation method uses the K-Nearest Neighbors algorithm to find similar data points and impute missing
values based on them.
Noise Detection and Removal
Methods for Handling Noisy Data
1 Statistical Methods
Use statistical methods such as mean, median, and standard
deviation to detect and remove outliers.
import pandas as pd

df = pd.DataFrame({'A': [10, 12, 11, 13, 100, 9, 11]})  # 100 is an outlier
mean, std_dev = df['A'].mean(), df['A'].std()
# Remove outliers: keep only values within 2 standard deviations of the mean
df_cleaned = df[(df['A'] >= mean - 2*std_dev) & (df['A'] <= mean + 2*std_dev)]
print(df_cleaned)
This example uses statistical methods to identify and remove outliers by filtering out values that lie beyond 2
standard deviations from the mean.
Machine Learning Methods for Noise Detection
from sklearn.svm import OneClassSVM
import pandas as pd

# Sample data where 100 is an obvious anomaly
df = pd.DataFrame({'A': [10, 11, 12, 11, 10, 100]})

# Fit a one-class SVM to model the region occupied by normal points
model = OneClassSVM(nu=0.1, gamma='auto')
model.fit(df[['A']])

# Predict anomalies: +1 = inlier, -1 = outlier
anomaly = model.predict(df[['A']])

# Remove anomalies
df_cleaned = df[anomaly == 1]
print(df_cleaned)
This example fits a One-Class SVM to the data, flags anomalous points, and keeps only the inliers (those the model labels +1).
Removing Redundant Features
A further preprocessing step is eliminating redundancy: calculate the correlation between features and drop those that are highly correlated, as they likely provide redundant information. Below is a minimal sketch; the 0.9 threshold and the sample columns are illustrative choices.
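import pandas as pd
import numpy as np

# Sample data: column B is perfectly correlated with column A
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 6, 8, 10],
                   'C': [5, 3, 8, 1, 7]})

# Upper triangle of the absolute correlation matrix
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any feature highly correlated (> 0.9) with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print(df_reduced)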
KNN Imputation Model Implementation
Let's implement the K-Nearest Neighbors (KNN) imputation model in Python using the scikit-learn library. KNN
imputation works by finding the k most similar data points to the row with missing values and imputing based on
these neighbors.
import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [np.nan, 3, 4, 5, 6],
                   'C': [2, np.nan, 6, 8, 10]})

print("Original DataFrame:")
print(df)

# Impute each missing value from its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)

# Fit the imputer to the data and transform the missing values
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nImputed DataFrame:")
print(imputed_df)
Linear Regression Implementation
Linear Regression is a supervised learning algorithm used for predicting the value of a continuous output variable
based on one or more input features. Let's look at a simple implementation using Python and NumPy.
Purpose
Linear Regression predicts continuous values by finding the best-fit line through the data points, minimizing the distance between observed values and the regression line.
Applications
Used for sales forecasting, risk assessment, housing price prediction, and other scenarios requiring numerical prediction based on existing data patterns.
Linear Regression Code Example
import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient Descent
        for _ in range(self.n_iters):
            y_predicted = np.dot(X, self.weights) + self.bias
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            # Update parameters along the negative gradient
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt

    # Sample data: y = 2x + 1 plus noise
    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = 2 * X.flatten() + 1 + np.random.randn(50)

    model = LinearRegression(learning_rate=0.01, n_iters=1000)
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)

    # Plot data
    plt.scatter(X, y, label="Data")
    plt.plot(X, predicted, label="Linear Regression", color="red")
    plt.legend()
    plt.show()
This example demonstrates how to use the LinearRegression class we defined. It generates sample data, creates
and trains the model, makes predictions, and plots the results for visualization.
Logistic Regression Implementation
Logistic Regression is a supervised learning algorithm used for classification problems. It predicts the probability of
an instance belonging to a particular class.
Purpose Applications
Unlike Linear Regression, Logistic Regression is Used for spam detection, disease diagnosis,
used for binary classification problems, predicting customer churn prediction, and other binary
the probability of an outcome using the sigmoid classification scenarios.
function.
Logistic Regression Code Example
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient Descent
        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self._sigmoid(linear_model)
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self._sigmoid(linear_model)
        # Classify with a 0.5 probability threshold
        return np.where(y_predicted > 0.5, 1, 0)

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Two-feature binary classification data
    X, y = make_classification(n_samples=100, n_features=2,
                               n_informative=2, n_redundant=0, random_state=42)

    model = LogisticRegression(learning_rate=0.01, n_iters=1000)
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)
    print(predicted)

    # Plot data
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()
The predict method uses the sigmoid function to convert linear predictions to probabilities, then classifies based on
a threshold of 0.5. The example shows how to use this class with sample data.
Decision Tree Induction for Classification
Decision Tree Induction is a supervised learning algorithm used for classification and regression tasks. It creates a
model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
import numpy as np

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def _entropy(self, y):
        # Shannon entropy of the class label distribution
        counts = np.bincount(y)
        probs = counts[counts > 0] / len(y)
        return -np.sum(probs * np.log2(probs))

    def _information_gain(self, y, left_indices, right_indices):
        # Entropy of the parent node
        parent_entropy = self._entropy(y)
        if len(np.unique(y[left_indices])) == 1:
            left_entropy = 0
        else:
            left_entropy = self._entropy(y[left_indices])
        if len(np.unique(y[right_indices])) == 1:
            right_entropy = 0
        else:
            right_entropy = self._entropy(y[right_indices])
        n = len(y)
        n_left, n_right = np.sum(left_indices), np.sum(right_indices)
        child_entropy = (n_left / n) * left_entropy + (n_right / n) * right_entropy
        ig = parent_entropy - child_entropy
        return ig
Decision Tree Growth and Prediction
    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Stopping criteria
        if (self.max_depth is not None and depth >= self.max_depth
                or n_labels == 1 or n_samples == 1):
            leaf_value = np.argmax(np.bincount(y))
            return leaf_value
        # _best_split (not shown in this listing) is assumed to return the
        # feature/threshold pair with the highest information gain
        feature, threshold = self._best_split(X, y)
        left = X[:, feature] <= threshold
        node = {"feature": feature, "threshold": threshold,
                "left": self._grow_tree(X[left], y[left], depth + 1),
                "right": self._grow_tree(X[~left], y[~left], depth + 1)}
        return node

# Example usage (assumes the class also defines fit and predict wrappers)
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = DecisionTree(max_depth=5)
    model.fit(X_train, y_train)
    # Make predictions
    predictions = model.predict(X_test)
Random Forest Classifier Implementation
Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes of the individual trees.
Key Features
Bootstrap sampling (bagging) of training data, random feature selection for each tree, majority voting for final prediction.
Benefits
Reduces overfitting compared to single decision trees, handles high-dimensional data well, provides feature importance measures.
Random Forest Class Implementation
import numpy as np

class RandomForest:
    def __init__(self, n_trees=100, max_depth=None, n_feats=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.trees = []

    def fit(self, X, y):
        # Train each tree on a bootstrap sample (bagging);
        # reuses the DecisionTree class defined above
        for _ in range(self.n_trees):
            idxs = np.random.choice(len(X), len(X), replace=True)
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X[idxs], y[idxs])
            self.trees.append(tree)

    def predict(self, X):
        # Collect one prediction per tree, then take a majority vote per sample
        predictions = [tree.predict(X) for tree in self.trees]
        predictions = np.array(predictions).T
        predictions = [np.bincount(prediction).argmax() for prediction in predictions]
        return predictions

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForest(n_trees=10, max_depth=5)
    model.fit(X_train, y_train)
    # Make predictions
    predictions = model.predict(X_test)
ARIMA for Time Series Data
ARIMA (AutoRegressive Integrated Moving Average) is a popular statistical model used for forecasting time series
data. It combines autoregressive (AR), differencing (I), and moving average (MA) components to model and predict
time series behavior.
Components Applications
Autoregressive (p): Uses past values to predict Stock price forecasting, sales prediction,
future values. Integrated (d): Differencing to make temperature forecasting, and other time-dependent
the time series stationary. Moving Average (q): Uses data analysis.
past forecast errors in a regression model.
ARIMA Implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
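The listing below is a minimal sketch of fitting and evaluating an ARIMA model; the synthetic random-walk series and the (1, 1, 1) order are illustrative assumptions.
# Synthetic random-walk series (illustrative stand-in for real data)
series = pd.Series(np.cumsum(np.random.randn(100)))

# Hold out the last 20 points for evaluation
train, test = series[:80], series[80:]

# Fit ARIMA(p=1, d=1, q=1) and forecast the held-out horizon
fitted = ARIMA(train, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=len(test))
print("Test RMSE:", np.sqrt(mean_squared_error(test, forecast)))

# Plot the actual series against the forecast
plt.plot(series, label="Actual")
plt.plot(forecast, label="Forecast", color="red")
plt.legend()
plt.show()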
1 Image Representation
Represent the input image as a hierarchical structure, such as a tree or a graph, where each node represents a region or a pixel in the image.
2 Hierarchical Clustering
Apply hierarchical clustering algorithms to group similar regions or pixels together based on their features, such as color, texture, or intensity.
3 Region Merging
Merge regions or nodes based on their similarity and the desired level of detail, guided by a merging criterion.
4 Object Segmentation
Identify objects by selecting the regions or nodes that correspond to objects of interest, guided by prior knowledge or learned models.
5 Refinement
Refine segmentation results with additional processing steps like boundary refinement or region growing.
Hierarchical Segmentation Methods
The advantages of hierarchical methods include efficient representation of complex images, multi-scale analysis
capabilities, and robustness to noise. However, these methods can be computationally expensive, difficult to
parameterize correctly, and sensitive to initial conditions.
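As a minimal illustration of the clustering step above, the sketch below groups the pixels of a tiny synthetic grayscale image by position and intensity using scikit-learn's AgglomerativeClustering; the image, the intensity weighting, and the choice of three clusters are all illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Tiny synthetic grayscale "image": two bright squares on a dark background
image = np.zeros((20, 20))
image[2:8, 2:8] = 1.0
image[12:18, 12:18] = 0.8

# Represent each pixel as a (row, col, weighted intensity) feature vector
rows, cols = np.indices(image.shape)
features = np.stack([rows.ravel(), cols.ravel(), 10 * image.ravel()], axis=1)

# Hierarchically merge similar pixels into three regions
labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)
segmented = labels.reshape(image.shape)
print(segmented)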
Visualization Techniques: Bar Chart
A bar chart is a fundamental visualization used to compare values across different categories. It uses rectangular
bars with heights proportional to the values they represent.
import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Draw bars with heights proportional to the values
plt.bar(categories, values)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart Example')
plt.show()
Line Chart
A line chart connects data points in order, making trends easy to see.
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot y against x as a connected line
plt.plot(x, y, marker='o')
plt.title('Line Chart Example')
plt.show()
Bubble Chart
A bubble chart extends a scatter plot with a third variable encoded as marker size.
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 20, 30, 40, 50]

# Marker sizes are taken from z
plt.scatter(x, y, s=z)
plt.title('Bubble Chart Example')
plt.show()
Heatmap
A heatmap encodes the values of a 2D array as colors.
import matplotlib.pyplot as plt
import numpy as np

# Data
data = np.random.rand(10, 10)

# Map each cell's value to a color
plt.imshow(data, cmap='viridis')
plt.colorbar()
plt.title('Heatmap Example')
plt.show()
3D Scatter Plot
A 3D scatter plot shows the joint distribution of three variables.
import matplotlib.pyplot as plt
import numpy as np

# Data
x = np.random.rand(10)
y = np.random.rand(10)
z = np.random.rand(10)

# Plot the points on 3D axes
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(x, y, z)
plt.show()
Descriptive Analytics for Healthcare Data
Purpose
To describe and summarize patient data, treatment patterns, and outcomes to understand what happened in the past.
Applications
Patient demographics analysis, treatment efficacy assessment, resource utilization tracking, and disease prevalence monitoring.
Healthcare Data Example
Let's consider a dataset containing patient information for diabetes patients, including variables such as:
Patient ID
Age
Gender
Blood pressure
Blood glucose level
Medication (yes/no)
Hospitalization (yes/no)
This data can be analyzed using descriptive analytics techniques to understand patient characteristics and treatment outcomes, as sketched below.
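A minimal sketch of such an analysis with pandas; the records and values are made up purely for illustration.
import pandas as pd

# Hypothetical diabetes patient records (values are illustrative only)
df = pd.DataFrame({
    'Age': [54, 61, 47, 70, 58],
    'Blood_Pressure': [130, 142, 125, 150, 138],
    'Glucose': [160, 180, 140, 200, 170],
    'Medication': ['yes', 'yes', 'no', 'yes', 'no'],
    'Hospitalization': ['no', 'yes', 'no', 'yes', 'no'],
})

# Summary statistics (mean, std, quartiles) for the numeric variables
print(df.describe())

# Frequency counts for the categorical variables
print(df['Medication'].value_counts())
print(df['Hospitalization'].value_counts())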
Summary Statistics for Healthcare Data
Variable Mean Median Mode Std Dev
2 Medication Efficacy
Patients who receive medication tend to have better treatment
outcomes, highlighting the importance of medication adherence.
3 Hospitalization Risk
Hospitalization rates are higher among patients with poorer
treatment outcomes, indicating a need for early intervention
strategies.
Predictive Analytics for Product Sales
Purpose
To forecast future sales volume, identify sales trends, and
understand factors influencing product performance.
Applications
Demand forecasting, inventory management, pricing optimization,
and targeted marketing campaigns.
Product Sales Dataset
A typical product sales dataset for predictive analytics might include variables such as the date, units sold, price, promotion indicators, and marketing spend. These variables help in understanding factors that influence sales and building accurate predictive models.
Data Preprocessing for Sales Prediction
import pandas as pd
from sklearn.preprocessing import StandardScaler
Data preprocessing is a crucial step in preparing sales data for predictive modeling. This includes handling missing
values, converting categorical variables into numerical format, and scaling the data to ensure all features contribute
equally to the model.
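A minimal sketch of these three steps on a hypothetical sales table; the column names and values are illustrative assumptions.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sales records (illustrative)
data = pd.DataFrame({
    'Sales': [200, 220, np.nan, 250, 270],
    'Price': [9.9, 9.9, 10.5, 10.5, 11.0],
    'Promotion': ['yes', 'no', 'yes', 'no', 'yes'],
})

# Handle missing values
data['Sales'] = data['Sales'].fillna(data['Sales'].median())

# Convert categorical variables into numerical format
data = pd.get_dummies(data, columns=['Promotion'], drop_first=True)

# Scale the numeric features so all contribute equally
data[['Sales', 'Price']] = StandardScaler().fit_transform(data[['Sales', 'Price']])
print(data)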
Feature Engineering for Sales Prediction
# Create a new feature called Lag_Sales
data['Lag_Sales'] = data['Sales'].shift(1)
The choice of model depends on the characteristics of the sales data, including seasonality, trend, and the
influence of external factors like marketing and promotions.
LSTM Model for Sales Prediction
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

# Hypothetical windowed series: 100 samples of 5 time steps, 1 feature
X = np.random.rand(100, 5, 1)
y = 1 + np.random.rand(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# A small LSTM followed by a single regression output
model = Sequential([LSTM(50, input_shape=(5, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=0)

# Make predictions
y_pred = model.predict(X_test).flatten()

# Calculate MAE
mae = np.mean(np.abs(y_test - y_pred))
print(f'MAE: {mae}')

# Calculate MAPE
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f'MAPE: {mape}%')
Predictive Analytics for Weather Forecasting
Weather forecasting uses historical climate data, machine learning algorithms, and statistical models to predict
future weather conditions. Accurate forecasts are crucial for agriculture, transportation, emergency management,
and daily planning.
Purpose Applications
To predict future weather conditions based on Agricultural planning, disaster preparedness,
historical patterns and current atmospheric transportation scheduling, and event planning.
measurements.
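A minimal sketch of this idea: predicting the next day's temperature from the previous three days with a linear model. The synthetic seasonal series and the 3-day window are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily temperatures: a seasonal cycle plus noise
temps = pd.Series(20 + 5 * np.sin(np.arange(365) * 2 * np.pi / 365)
                  + np.random.randn(365))

# Features: temperatures on the previous 1, 2, and 3 days
X = np.column_stack([temps.shift(i) for i in (1, 2, 3)])[3:]
y = temps[3:].to_numpy()

# Train on all but the last 30 days, evaluate on the final month
model = LinearRegression().fit(X[:-30], y[:-30])
pred = model.predict(X[-30:])
print("MAE:", np.mean(np.abs(y[-30:] - pred)))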