Data Analytics Lab Manual
by Preeti Sahu
Course Objectives
Reference Books
1. Introduction to Data Mining, P.-N. Tan, M. Steinbach, and V. Kumar, Addison Wesley, 2006.
2. Data Mining and Analysis: Fundamental Concepts and Algorithms, M. J. Zaki and W. Meira Jr., Cambridge University Press, 2014.
3. Mining of Massive Datasets, J. Leskovec (Stanford Univ.), A. Rajaraman (Milliway Labs), and J. D. Ullman (Stanford Univ.).
Data Preprocessing
Overview
Data preprocessing is a critical first step in any data analytics project. It involves handling missing values, detecting and removing noise, and identifying and eliminating data redundancy.
An initial inspection of the data reveals whether the data set contains missing values. This inspection is carried out through Exploratory Data Analysis (EDA), which is why a data scientist should always perform EDA in order to identify missing values correctly.
Handling Missing Values
Common Techniques for Missing Value Imputation
1 Mean/Median Imputation
Replace missing values with the mean or median of the respective
feature.
2 Forward/Backward Fill
Replace missing values with the previous or next value in the
sequence.
The following minimal sketch creates a sample DataFrame with missing values and replaces them with the mean of each column.
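import pandas as pd
import numpy as np

# Sample DataFrame with missing values (illustrative values)
df = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                   'B': [np.nan, 2, np.nan, 4, 5]})

# Replace each missing value with its column mean
df_imputed = df.fillna(df.mean())
print(df_imputed)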
Forward/Backward Fill Example
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, np.nan, 3, np.nan, 5],
                   'B': [np.nan, 2, np.nan, 4, 5]})

# Forward fill: propagate the last valid observation forward
df['A'] = df['A'].ffill()
df['B'] = df['B'].ffill()
print(df)
This code demonstrates the forward fill method, which replaces missing values with the previous value in the
sequence.
KNN Imputation Example
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Impute each missing value from its 2 nearest neighbours
df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'B': [2, 3, np.nan, 5]})
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
print(imputed)
The KNN imputation method uses the K-Nearest Neighbors algorithm to find similar data points and impute missing
values based on them.
Noise Detection and Removal
Methods for Handling Noisy Data
1 Statistical Methods
Use statistical methods such as mean, median, and standard
deviation to detect and remove outliers.
import pandas as pd

df = pd.DataFrame({'A': [10, 12, 11, 13, 100, 9, 11]})  # 100 is an outlier
mean, std_dev = df['A'].mean(), df['A'].std()
# Remove outliers: keep only values within 2 standard deviations of the mean
df_cleaned = df[(df['A'] >= mean - 2*std_dev) & (df['A'] <= mean + 2*std_dev)]
print(df_cleaned)
This example uses statistical methods to identify and remove outliers by filtering out values that lie beyond 2
standard deviations from the mean.
Machine Learning Methods for Noise Detection
from sklearn.svm import OneClassSVM
import pandas as pd

# Sample data where 100 is an obvious anomaly
df = pd.DataFrame({'A': [10, 11, 12, 11, 10, 100]})

# Fit a one-class SVM to model the region occupied by normal points
model = OneClassSVM(nu=0.1, gamma='auto')
model.fit(df[['A']])

# Predict anomalies: +1 = inlier, -1 = outlier
anomaly = model.predict(df[['A']])

# Remove anomalies
df_cleaned = df[anomaly == 1]
print(df_cleaned)
This example fits a One-Class SVM to the data, flags anomalous points, and keeps only the inliers (those the model labels +1).
Removing Redundant Features
A further preprocessing step is eliminating redundancy: calculate the correlation between features and drop those that are highly correlated, as they likely provide redundant information. Below is a minimal sketch; the 0.9 threshold and the sample columns are illustrative choices.
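import pandas as pd
import numpy as np

# Sample data: column B is perfectly correlated with column A
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 6, 8, 10],
                   'C': [5, 3, 8, 1, 7]})

# Upper triangle of the absolute correlation matrix
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any feature highly correlated (> 0.9) with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print(df_reduced)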
KNN Imputation Model Implementation
Let's implement the K-Nearest Neighbors (KNN) imputation model in Python using the scikit-learn library. KNN
imputation works by finding the k most similar data points to the row with missing values and imputing based on
these neighbors.
import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [np.nan, 3, 4, 5, 6],
                   'C': [2, np.nan, 6, 8, 10]})

print("Original DataFrame:")
print(df)

# Impute each missing value from its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2)

# Fit the imputer to the data and transform the missing values
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nImputed DataFrame:")
print(imputed_df)
Linear Regression Implementation
Linear Regression is a supervised learning algorithm used for predicting the value of a continuous output variable
based on one or more input features. Let's look at a simple implementation using Python and NumPy.
Purpose
Linear Regression predicts continuous values by finding the best-fit line through the data points, minimizing the distance between observed values and the regression line.
Applications
Used for sales forecasting, risk assessment, housing price prediction, and other scenarios requiring numerical prediction based on existing data patterns.
Linear Regression Code Example
import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient Descent
        for _ in range(self.n_iters):
            y_predicted = np.dot(X, self.weights) + self.bias
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            # Update parameters along the negative gradient
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt

    # Sample data: y = 2x + 1 plus noise
    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = 2 * X.flatten() + 1 + np.random.randn(50)

    model = LinearRegression(learning_rate=0.01, n_iters=1000)
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)

    # Plot data
    plt.scatter(X, y, label="Data")
    plt.plot(X, predicted, label="Linear Regression", color="red")
    plt.legend()
    plt.show()
This example demonstrates how to use the LinearRegression class we defined. It generates sample data, creates
and trains the model, makes predictions, and plots the results for visualization.
Logistic Regression Implementation
Logistic Regression is a supervised learning algorithm used for classification problems. It predicts the probability of
an instance belonging to a particular class.
Purpose Applications
Unlike Linear Regression, Logistic Regression is Used for spam detection, disease diagnosis,
used for binary classification problems, predicting customer churn prediction, and other binary
the probability of an outcome using the sigmoid classification scenarios.
function.
Logistic Regression Code Example
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient Descent
        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self._sigmoid(linear_model)
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self._sigmoid(linear_model)
        # Classify with a 0.5 probability threshold
        return np.where(y_predicted > 0.5, 1, 0)

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Two-feature binary classification data
    X, y = make_classification(n_samples=100, n_features=2,
                               n_informative=2, n_redundant=0, random_state=42)

    model = LogisticRegression(learning_rate=0.01, n_iters=1000)
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)
    print(predicted)

    # Plot data
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()
The predict method uses the sigmoid function to convert linear predictions to probabilities, then classifies based on
a threshold of 0.5. The example shows how to use this class with sample data.
Decision Tree Induction for Classification
Decision Tree Induction is a supervised learning algorithm used for classification and regression tasks. It creates a
model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
import numpy as np

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def _entropy(self, y):
        # Shannon entropy of the class label distribution
        counts = np.bincount(y)
        probs = counts[counts > 0] / len(y)
        return -np.sum(probs * np.log2(probs))

    def _information_gain(self, y, left_indices, right_indices):
        # Entropy of the parent node
        parent_entropy = self._entropy(y)
        if len(np.unique(y[left_indices])) == 1:
            left_entropy = 0
        else:
            left_entropy = self._entropy(y[left_indices])
        if len(np.unique(y[right_indices])) == 1:
            right_entropy = 0
        else:
            right_entropy = self._entropy(y[right_indices])
        n = len(y)
        n_left, n_right = np.sum(left_indices), np.sum(right_indices)
        child_entropy = (n_left / n) * left_entropy + (n_right / n) * right_entropy
        ig = parent_entropy - child_entropy
        return ig
Decision Tree Growth and Prediction
    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))
        # Stopping criteria
        if (self.max_depth is not None and depth >= self.max_depth
                or n_labels == 1 or n_samples == 1):
            leaf_value = np.argmax(np.bincount(y))
            return leaf_value
        # _best_split (not shown in this listing) is assumed to return the
        # feature/threshold pair with the highest information gain
        feature, threshold = self._best_split(X, y)
        left = X[:, feature] <= threshold
        node = {"feature": feature, "threshold": threshold,
                "left": self._grow_tree(X[left], y[left], depth + 1),
                "right": self._grow_tree(X[~left], y[~left], depth + 1)}
        return node

# Example usage (assumes the class also defines fit and predict wrappers)
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = DecisionTree(max_depth=5)
    model.fit(X_train, y_train)
    # Make predictions
    predictions = model.predict(X_test)
Random Forest Classifier Implementation
Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes of the individual trees.
Key Features
Bootstrap sampling (bagging) of training data, random feature selection for each tree, majority voting for final prediction.
Benefits
Reduces overfitting compared to single decision trees, handles high-dimensional data well, provides feature importance measures.
Random Forest Class Implementation
import numpy as np

class RandomForest:
    def __init__(self, n_trees=100, max_depth=None, n_feats=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.trees = []

    def fit(self, X, y):
        # Train each tree on a bootstrap sample (bagging);
        # reuses the DecisionTree class defined above
        for _ in range(self.n_trees):
            idxs = np.random.choice(len(X), len(X), replace=True)
            tree = DecisionTree(max_depth=self.max_depth)
            tree.fit(X[idxs], y[idxs])
            self.trees.append(tree)

    def predict(self, X):
        # Collect one prediction per tree, then take a majority vote per sample
        predictions = [tree.predict(X) for tree in self.trees]
        predictions = np.array(predictions).T
        predictions = [np.bincount(prediction).argmax() for prediction in predictions]
        return predictions

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForest(n_trees=10, max_depth=5)
    model.fit(X_train, y_train)
    # Make predictions
    predictions = model.predict(X_test)
ARIMA for Time Series Data
ARIMA (AutoRegressive Integrated Moving Average) is a popular statistical model used for forecasting time series
data. It combines autoregressive (AR), differencing (I), and moving average (MA) components to model and predict
time series behavior.
Components Applications
Autoregressive (p): Uses past values to predict Stock price forecasting, sales prediction,
future values. Integrated (d): Differencing to make temperature forecasting, and other time-dependent
the time series stationary. Moving Average (q): Uses data analysis.
past forecast errors in a regression model.
ARIMA Implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
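The listing below is a minimal sketch of fitting and evaluating an ARIMA model; the synthetic random-walk series and the (1, 1, 1) order are illustrative assumptions.
# Synthetic random-walk series (illustrative stand-in for real data)
series = pd.Series(np.cumsum(np.random.randn(100)))

# Hold out the last 20 points for evaluation
train, test = series[:80], series[80:]

# Fit ARIMA(p=1, d=1, q=1) and forecast the held-out horizon
fitted = ARIMA(train, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=len(test))
print("Test RMSE:", np.sqrt(mean_squared_error(test, forecast)))

# Plot the actual series against the forecast
plt.plot(series, label="Actual")
plt.plot(forecast, label="Forecast", color="red")
plt.legend()
plt.show()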
1 Image Representation
Represent the input image as a hierarchical structure, such as a tree or a graph, where each node represents a region or a pixel in the image.
2 Hierarchical Clustering
Apply hierarchical clustering algorithms to group similar regions or pixels together based on their features, such as color, texture, or intensity.
3 Region Merging
Merge regions or nodes based on their similarity and the desired level of detail, guided by a merging criterion.
4 Object Segmentation
Identify objects by selecting the regions or nodes that correspond to objects of interest, guided by prior knowledge or learned models.
5 Refinement
Refine segmentation results with additional processing steps like boundary refinement or region growing.
Hierarchical Segmentation Methods
The advantages of hierarchical methods include efficient representation of complex images, multi-scale analysis
capabilities, and robustness to noise. However, these methods can be computationally expensive, difficult to
parameterize correctly, and sensitive to initial conditions.
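As a minimal illustration of the clustering step above, the sketch below groups the pixels of a tiny synthetic grayscale image by position and intensity using scikit-learn's AgglomerativeClustering; the image, the intensity weighting, and the choice of three clusters are all illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Tiny synthetic grayscale "image": two bright squares on a dark background
image = np.zeros((20, 20))
image[2:8, 2:8] = 1.0
image[12:18, 12:18] = 0.8

# Represent each pixel as a (row, col, weighted intensity) feature vector
rows, cols = np.indices(image.shape)
features = np.stack([rows.ravel(), cols.ravel(), 10 * image.ravel()], axis=1)

# Hierarchically merge similar pixels into three regions
labels = AgglomerativeClustering(n_clusters=3).fit_predict(features)
segmented = labels.reshape(image.shape)
print(segmented)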
Visualization Techniques: Bar Chart
A bar chart is a fundamental visualization used to compare values across different categories. It uses rectangular
bars with heights proportional to the values they represent.
import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Draw bars with heights proportional to the values
plt.bar(categories, values)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart Example')
plt.show()
Line Chart
A line chart connects data points in order, making trends easy to see.
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Plot y against x as a connected line
plt.plot(x, y, marker='o')
plt.title('Line Chart Example')
plt.show()
Bubble Chart
A bubble chart extends a scatter plot with a third variable encoded as marker size.
import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 20, 30, 40, 50]

# Marker sizes are taken from z
plt.scatter(x, y, s=z)
plt.title('Bubble Chart Example')
plt.show()
Heatmap
A heatmap encodes the values of a 2D array as colors.
import matplotlib.pyplot as plt
import numpy as np

# Data
data = np.random.rand(10, 10)

# Map each cell's value to a color
plt.imshow(data, cmap='viridis')
plt.colorbar()
plt.title('Heatmap Example')
plt.show()
3D Scatter Plot
A 3D scatter plot shows the joint distribution of three variables.
import matplotlib.pyplot as plt
import numpy as np

# Data
x = np.random.rand(10)
y = np.random.rand(10)
z = np.random.rand(10)

# Plot the points on 3D axes
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(x, y, z)
plt.show()
Descriptive Analytics for Healthcare Data
Purpose
To describe and summarize patient data, treatment patterns, and outcomes to understand what happened in the past.
Applications
Patient demographics analysis, treatment efficacy assessment, resource utilization tracking, and disease prevalence monitoring.
Healthcare Data Example
Let's consider a dataset containing patient information for diabetes patients, including variables such as:
Patient ID
Age
Gender
Blood pressure
Blood glucose level
Medication (yes/no)
Hospitalization (yes/no)
This data can be analyzed using descriptive analytics techniques to understand patient characteristics and treatment outcomes, as sketched below.
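A minimal sketch of such an analysis with pandas; the records and values are made up purely for illustration.
import pandas as pd

# Hypothetical diabetes patient records (values are illustrative only)
df = pd.DataFrame({
    'Age': [54, 61, 47, 70, 58],
    'Blood_Pressure': [130, 142, 125, 150, 138],
    'Glucose': [160, 180, 140, 200, 170],
    'Medication': ['yes', 'yes', 'no', 'yes', 'no'],
    'Hospitalization': ['no', 'yes', 'no', 'yes', 'no'],
})

# Summary statistics (mean, std, quartiles) for the numeric variables
print(df.describe())

# Frequency counts for the categorical variables
print(df['Medication'].value_counts())
print(df['Hospitalization'].value_counts())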
Summary Statistics for Healthcare Data
Variable Mean Median Mode Std Dev
2 Medication Efficacy
Patients who receive medication tend to have better treatment
outcomes, highlighting the importance of medication adherence.
3 Hospitalization Risk
Hospitalization rates are higher among patients with poorer
treatment outcomes, indicating a need for early intervention
strategies.
Predictive Analytics for Product Sales
Purpose
To forecast future sales volume, identify sales trends, and
understand factors influencing product performance.
Applications
Demand forecasting, inventory management, pricing optimization,
and targeted marketing campaigns.
Product Sales Dataset
A typical product sales dataset for predictive analytics might include variables such as the date, units sold, price, promotion indicators, and marketing spend. These variables help in understanding factors that influence sales and building accurate predictive models.
Data Preprocessing for Sales Prediction
import pandas as pd
from sklearn.preprocessing import StandardScaler
Data preprocessing is a crucial step in preparing sales data for predictive modeling. This includes handling missing
values, converting categorical variables into numerical format, and scaling the data to ensure all features contribute
equally to the model.
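A minimal sketch of these three steps on a hypothetical sales table; the column names and values are illustrative assumptions.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sales records (illustrative)
data = pd.DataFrame({
    'Sales': [200, 220, np.nan, 250, 270],
    'Price': [9.9, 9.9, 10.5, 10.5, 11.0],
    'Promotion': ['yes', 'no', 'yes', 'no', 'yes'],
})

# Handle missing values
data['Sales'] = data['Sales'].fillna(data['Sales'].median())

# Convert categorical variables into numerical format
data = pd.get_dummies(data, columns=['Promotion'], drop_first=True)

# Scale the numeric features so all contribute equally
data[['Sales', 'Price']] = StandardScaler().fit_transform(data[['Sales', 'Price']])
print(data)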
Feature Engineering for Sales Prediction
# Create a new feature called Lag_Sales
data['Lag_Sales'] = data['Sales'].shift(1)
The choice of model depends on the characteristics of the sales data, including seasonality, trend, and the
influence of external factors like marketing and promotions.
LSTM Model for Sales Prediction
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

# Hypothetical windowed series: 100 samples of 5 time steps, 1 feature
X = np.random.rand(100, 5, 1)
y = 1 + np.random.rand(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# A small LSTM followed by a single regression output
model = Sequential([LSTM(50, input_shape=(5, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=0)

# Make predictions
y_pred = model.predict(X_test).flatten()

# Calculate MAE
mae = np.mean(np.abs(y_test - y_pred))
print(f'MAE: {mae}')

# Calculate MAPE
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f'MAPE: {mape}%')
Predictive Analytics for Weather Forecasting
Weather forecasting uses historical climate data, machine learning algorithms, and statistical models to predict
future weather conditions. Accurate forecasts are crucial for agriculture, transportation, emergency management,
and daily planning.
Purpose Applications
To predict future weather conditions based on Agricultural planning, disaster preparedness,
historical patterns and current atmospheric transportation scheduling, and event planning.
measurements.
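A minimal sketch of this idea: predicting the next day's temperature from the previous three days with a linear model. The synthetic seasonal series and the 3-day window are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily temperatures: a seasonal cycle plus noise
temps = pd.Series(20 + 5 * np.sin(np.arange(365) * 2 * np.pi / 365)
                  + np.random.randn(365))

# Features: temperatures on the previous 1, 2, and 3 days
X = np.column_stack([temps.shift(i) for i in (1, 2, 3)])[3:]
y = temps[3:].to_numpy()

# Train on all but the last 30 days, evaluate on the final month
model = LinearRegression().fit(X[:-30], y[:-30])
pred = model.predict(X[-30:])
print("MAE:", np.mean(np.abs(y[-30:] - pred)))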