
DATA ANALYTICS LAB

III-B.Tech II-Semester Course Code: A2AM605PC

by Preeti Sahu
Course Objectives

1. Explore Fundamental Concepts: To explore the fundamental concepts of data analytics.
2. Learn Statistical Analysis: To learn the principles and methods of statistical analysis.
3. Discover Patterns: Discover interesting patterns, analyze supervised and unsupervised models, and
estimate the accuracy of the algorithms.
4. Understand Search & Visualization: To understand the various search methods and visualization techniques.
Course Outcomes

1. Regression Understanding: Understand linear regression and logistic regression.
2. Classifier Functionality: Understand the functionality of different classifiers.
3. Visualization Implementation: Implement visualization techniques using different graphs.
4. Analytics Application: Apply descriptive and predictive analytics for different types of data.
List of Experiments - Part 1

1. Data Preprocessing: Handling missing values, noise detection and removal, identifying data redundancy
and elimination.
2. Imputation Model: Implement any one imputation model.
3. Linear Regression: Implement Linear Regression.
4. Logistic Regression: Implement Logistic Regression.
5. Decision Tree Induction: Implement Decision Tree Induction for classification.
6. Random Forest Classifier: Implement Random Forest Classifier.
List of Experiments - Part 2

1. ARIMA Implementation: Implement ARIMA on time series data.
2. Object Segmentation: Object segmentation using hierarchical-based methods.
3. Visualization Techniques: Perform visualization techniques (types of maps: Bar, Column, Line, Scatter,
3D Cubes, etc.).
4. Descriptive Analytics: Perform descriptive analytics on healthcare data.
5. Predictive Analytics: Perform predictive analytics on product sales data.
6. Weather Forecasting: Apply predictive analytics for weather forecasting.
Recommended Reading Materials
Text Books

1. Student's Handbook for Associate Analytics – II, III.
2. Data Mining: Concepts and Techniques, Han and Kamber, 3rd Edition, Morgan Kaufmann Publishers.

Reference Books

1. Introduction to Data Mining, Tan, Steinbach and Kumar, Addison Wesley, 2006.
2. Data Mining and Analysis: Fundamental Concepts and Algorithms, M. Zaki and W. Meira.
3. Mining of Massive Datasets, Jure Leskovec (Stanford Univ.), Anand Rajaraman (Milliway Labs), and
Jeffrey D. Ullman (Stanford Univ.).
Data Preprocessing
Overview
Data preprocessing is a critical first step in any data analytics project. It
involves handling missing values, detecting and removing noise, and
identifying and eliminating data redundancy.

An initial inspection of the data reveals whether the dataset contains
missing values. This inspection is typically done through Exploratory Data
Analysis (EDA), which is why a data scientist should always perform EDA
before modeling so that missing values are identified correctly.
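
As a minimal sketch of such an inspection, assuming the data has already been loaded into a pandas
DataFrame, the missing-value count and percentage per column can be checked directly:

import pandas as pd
import numpy as np

# Sample DataFrame with a few missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Number and percentage of missing values per column
print(df.isnull().sum())
print((df.isnull().mean() * 100).round(1))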
Handling Missing Values
Common Techniques for Missing Value Imputation

1. Mean/Median Imputation: Replace missing values with the mean or median of the respective feature.
2. Forward/Backward Fill: Replace missing values with the previous or next value in the sequence.
3. K-Nearest Neighbors (KNN): Replace missing values using the KNN algorithm to find similar data points.
Mean/Median Imputation Example
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Replace missing values with the column mean (assignment is preferred over
# inplace=True on a single column, which relies on deprecated chained assignment)
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].mean())
print(df)

This code creates a sample DataFrame with missing values and replaces them with the mean of each column.
Forward/Backward Fill Example
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Forward fill: propagate the previous valid value forward
# (fillna(method='ffill') is deprecated in recent pandas; ffill() replaces it)
df['A'] = df['A'].ffill()
df['B'] = df['B'].ffill()
print(df)

This code demonstrates the forward fill method, which replaces missing values with the previous value in the
sequence.
KNN Imputation Example
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Create a KNN imputer
imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data
df_imputed = imputer.fit_transform(df)
print(df_imputed)

The KNN imputation method uses the K-Nearest Neighbors algorithm to find similar data points and impute missing
values based on them.
Noise Detection and Removal
Methods for Handling Noisy Data

1. Statistical Methods: Use statistical methods such as mean, median, and standard deviation to detect and
remove outliers.
2. Machine Learning Methods: Use machine learning algorithms such as One-Class SVM and Local Outlier
Factor (LOF) to detect anomalies.
Statistical Methods for Noise Detection
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 100]
})

# Calculate the mean and standard deviation
mean = df['A'].mean()
std_dev = df['A'].std()

# Remove outliers beyond 2 standard deviations from the mean
df_cleaned = df[(df['A'] >= mean - 2*std_dev) & (df['A'] <= mean + 2*std_dev)]
print(df_cleaned)

This example uses statistical methods to identify and remove outliers by filtering out values that lie beyond 2
standard deviations from the mean.
Machine Learning Methods for Noise Detection
from sklearn.svm import OneClassSVM
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 100]
})

# Create a One-Class SVM model
model = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.1)

# Fit the model
model.fit(df[['A']])

# Predict anomalies (+1 for inliers, -1 for outliers)
anomaly = model.predict(df[['A']])

# Remove anomalies
df_cleaned = df[anomaly == 1]
print(df_cleaned)

Machine learning methods like One-Class SVM can be used to detect anomalies by learning the pattern of
normal data and identifying points that deviate from this pattern.
Identifying Data Redundancy
Techniques for Redundancy Detection and Elimination

1. Correlation Analysis: Use correlation analysis to identify highly correlated features and eliminate
redundant ones.
2. Principal Component Analysis (PCA): Use PCA to reduce the dimensionality of the data and eliminate
redundant features. A short PCA sketch follows the correlation example below.
Correlation Analysis Example
import pandas as pd
import numpy as np

# Create a sample DataFrame with perfectly correlated columns
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [3, 6, 9, 12, 15]
})

# Calculate the absolute correlation matrix
corr_matrix = df.corr().abs()

# Keep only the upper triangle so each feature pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Identify columns that are highly correlated (> 0.9) with an earlier column
high_corr_features = [col for col in upper.columns if (upper[col] > 0.9).any()]

# Eliminate the redundant features
df_eliminated = df.drop(columns=high_corr_features)
print(df_eliminated)

This example calculates the correlation between features and eliminates those that are highly correlated, as they
likely provide redundant information.
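
For the second technique listed above, PCA, a minimal sketch with scikit-learn looks like the following;
keeping 95% of the variance is an illustrative threshold, not a fixed rule:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [3, 6, 9, 12, 15]
})

# Standardize, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # far fewer columns than the original
print(pca.explained_variance_ratio_)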
KNN Imputation Model Implementation
Let's implement the K-Nearest Neighbors (KNN) imputation model in Python using the scikit-learn library. KNN
imputation works by finding the k most similar data points to the row with missing values and imputing based on
these neighbors.

import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [np.nan, 3, 4, 5, 6],
        'C': [7, 8, 9, np.nan, 11]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Create a KNN imputer with k=3
imputer = KNNImputer(n_neighbors=3)

# Fit the imputer to the data and transform the missing values
imputed_data = imputer.fit_transform(df)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nImputed DataFrame:")
print(imputed_df)
Linear Regression Implementation
Linear Regression is a supervised learning algorithm used for predicting the value of a continuous output variable
based on one or more input features. Let's look at a simple implementation using Python and NumPy.

Purpose
Linear Regression predicts continuous values by finding the best-fit line through the data points,
minimizing the distance between observed values and the regression line.

Applications
Used for sales forecasting, risk assessment, housing price prediction, and other scenarios requiring
numerical prediction based on existing data patterns.
Linear Regression Code Example
import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient Descent
        for _ in range(self.n_iters):
            y_predicted = np.dot(X, self.weights) + self.bias

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update weights and bias
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        y_approximated = np.dot(X, self.weights) + self.bias
        return y_approximated
Linear Regression Example Usage
# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt

    # Generate sample data
    X = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
    y = np.array([2, 3, 5, 7, 11])

    # Create and train model
    model = LinearRegression()
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)

    # Plot data
    plt.scatter(X, y, label="Data")
    plt.plot(X, predicted, label="Linear Regression", color="red")
    plt.legend()
    plt.show()

This example demonstrates how to use the LinearRegression class we defined. It generates sample data, creates
and trains the model, makes predictions, and plots the results for visualization.
Logistic Regression Implementation
Logistic Regression is a supervised learning algorithm used for classification problems. It predicts the probability of
an instance belonging to a particular class.

Purpose
Unlike Linear Regression, Logistic Regression is used for binary classification problems, predicting the
probability of an outcome using the sigmoid function.

Applications
Used for spam detection, disease diagnosis, customer churn prediction, and other binary classification
scenarios.
Logistic Regression Code Example
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient Descent
        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self._sigmoid(linear_model)

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update weights and bias
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
Logistic Regression Prediction
    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self._sigmoid(linear_model)
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return y_predicted_cls

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt

    # Generate sample data
    X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
    y = np.array([0, 0, 0, 1, 1])

    # Create and train model
    model = LogisticRegression()
    model.fit(X, y)

    # Make predictions
    predicted = model.predict(X)
    print(predicted)

    # Plot data
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

The predict method uses the sigmoid function to convert linear predictions to probabilities, then classifies based on
a threshold of 0.5. The example shows how to use this class with sample data.
Decision Tree Induction for Classification
Decision Tree Induction is a supervised learning algorithm used for classification and regression tasks. It creates a
model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Key Components
Split points, information gain calculation, and a tree structure with nodes and leaves that represent
decisions and outcomes.

Advantages
Easy to understand and interpret, requires little data preparation, and can handle both numerical and
categorical data.
Decision Tree Helper Functions
import numpy as np

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def _entropy(self, y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])

    def _gain(self, X_column, X_threshold, y):
        parent_entropy = self._entropy(y)
        left_indices, right_indices = X_column < X_threshold, X_column >= X_threshold

        # A threshold that sends every sample to one side provides no gain
        if np.sum(left_indices) == 0 or np.sum(right_indices) == 0:
            return 0

        if len(np.unique(y[left_indices])) == 1:
            left_entropy = 0
        else:
            left_entropy = self._entropy(y[left_indices])

        if len(np.unique(y[right_indices])) == 1:
            right_entropy = 0
        else:
            right_entropy = self._entropy(y[right_indices])

        n = len(y)
        n_left, n_right = np.sum(left_indices), np.sum(right_indices)
        child_entropy = (n_left / n) * left_entropy + (n_right / n) * right_entropy

        ig = parent_entropy - child_entropy
        return ig
Decision Tree Growth and Prediction
    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))

        # Stopping criteria
        if (self.max_depth is not None and depth >= self.max_depth
                or n_labels == 1 or n_samples == 1):
            leaf_value = np.argmax(np.bincount(y))
            return leaf_value

        # Find the best split
        best_feat = None
        best_thr = None
        best_gain = -1

        for idx in range(n_features):
            X_column = X[:, idx]
            thresholds = np.unique(X_column)

            for threshold in thresholds:
                gain = self._gain(X_column, threshold, y)

                if gain > best_gain:
                    best_gain = gain
                    best_feat = idx
                    best_thr = threshold

        # If no split improves purity, stop and return a majority-class leaf
        if best_gain <= 0:
            return np.argmax(np.bincount(y))

        # Split the data
        left_indices = X[:, best_feat] < best_thr
        right_indices = X[:, best_feat] >= best_thr

        left = self._grow_tree(X[left_indices, :], y[left_indices], depth+1)
        right = self._grow_tree(X[right_indices, :], y[right_indices], depth+1)

        return {"feature": best_feat, "threshold": best_thr, "left": left, "right": right}


Decision Tree Implementation
    def fit(self, X, y):
        self.tree = self._grow_tree(X, y)

    def predict(self, X):
        return [self._predict(inputs) for inputs in X]

    def _predict(self, inputs):
        node = self.tree

        while isinstance(node, dict):
            feature = node["feature"]
            threshold = node["threshold"]

            if inputs[feature] < threshold:
                node = node["left"]
            else:
                node = node["right"]

        return node

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate sample data
    X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                               n_redundant=0, random_state=42)

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)

    # Create and train model
    model = DecisionTree(max_depth=5)
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)
Random Forest Classifier Implementation
Random Forest is an ensemble learning method that operates by constructing multiple decision trees during
training and outputting the class that is the mode of the classes of the individual trees.

Key Features
Bootstrap sampling (bagging) of training data, random feature selection for each tree, majority voting for
final prediction.

Benefits
Reduces overfitting compared to single decision trees, handles high-dimensional data well, provides feature
importance measures.
Random Forest Class Implementation
class RandomForest:
    def __init__(self, n_trees=100, max_depth=None, n_feats=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.trees = []
        self.tree_feats = []   # feature subset used by each tree

    def _bootstrap(self, X, y):
        n_samples = X.shape[0]
        idxs = np.random.choice(n_samples, n_samples, replace=True)
        return X[idxs], y[idxs]

    def _feature_sampling(self, X):
        # Randomly choose a subset of features and return both the reduced
        # matrix and the chosen indices so predict() can reuse them
        n_feats = X.shape[1]
        if self.n_feats is None:
            self.n_feats = int(np.sqrt(n_feats))

        feats = np.random.choice(n_feats, self.n_feats, replace=False)
        return X[:, feats], feats

    def fit(self, X, y):
        self.trees = []
        self.tree_feats = []
        for _ in range(self.n_trees):
            tree = DecisionTree(max_depth=self.max_depth)
            X_boot, y_boot = self._bootstrap(X, y)
            X_boot_feat, feats = self._feature_sampling(X_boot)
            tree.fit(X_boot_feat, y_boot)
            self.trees.append(tree)
            self.tree_feats.append(feats)
Random Forest Prediction Method
    def predict(self, X):
        predictions = []
        # Each tree votes using the same feature subset it was trained on
        for tree, feats in zip(self.trees, self.tree_feats):
            prediction = tree.predict(X[:, feats])
            predictions.append(prediction)

        # Majority vote across trees for every sample
        predictions = np.array(predictions).T
        predictions = [np.bincount(prediction).argmax() for prediction in predictions]
        return predictions

# Example usage
if __name__ == "__main__":
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate sample data
    X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                               n_redundant=0, random_state=42)

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)

    # Create and train model
    model = RandomForest(n_trees=100, max_depth=5)
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)
ARIMA for Time Series Data
ARIMA (AutoRegressive Integrated Moving Average) is a popular statistical model used for forecasting time series
data. It combines autoregressive (AR), differencing (I), and moving average (MA) components to model and predict
time series behavior.

Components
Autoregressive (p): uses past values to predict future values. Integrated (d): differencing to make the
time series stationary. Moving Average (q): uses past forecast errors in a regression model.

Applications
Stock price forecasting, sales prediction, temperature forecasting, and other time-dependent data analysis.
ARIMA Implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Load the time series data
df = pd.read_csv('data.csv', index_col='Date', parse_dates=['Date'])

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(df['Value'])
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Split the data into training and testing sets
train_size = int(len(df) * 0.8)
train_data, test_data = df[0:train_size], df[train_size:len(df)]

# Build the ARIMA model
model = ARIMA(train_data, order=(1, 1, 1))
model_fit = model.fit()

# Print the summary of the model
print(model_fit.summary())
ARIMA Forecasting and Evaluation
# Plot the residuals
residuals = pd.DataFrame(model_fit.resid)
plt.figure(figsize=(10, 6))
plt.plot(residuals)
plt.title('Residuals')
plt.xlabel('Date')
plt.ylabel('Residual Value')
plt.show()

# Plot the density plot of residuals
residuals.plot(kind='kde', figsize=(10, 6))
plt.title('Density Plot of Residuals')
plt.xlabel('Residual Value')
plt.ylabel('Density')
plt.show()

# Print the statistics of residuals
print(residuals.describe())

# Forecast the test data
# (statsmodels' ARIMAResults.forecast returns only the point forecasts)
forecast_steps = len(test_data)
forecast = model_fit.forecast(steps=forecast_steps)

# Plot the forecasted data against the actual test data
plt.figure(figsize=(10, 6))
plt.plot(train_data, label='Training Data')
plt.plot(test_data, label='Actual Test Data')
plt.plot(test_data.index, forecast, label='Forecasted Test Data', marker='o')
plt.title('Forecasted Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

# Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(test_data, forecast)
print('Mean Squared Error (MSE):', mse)
Object Segmentation: Hierarchical Methods
Object segmentation using hierarchical-based methods involves representing an image as a hierarchical
structure, where each level of the hierarchy represents a different scale or level of detail. A typical
pipeline follows the steps below; a minimal code sketch follows the list.

1. Image Representation: Represent the input image as a hierarchical structure, such as a tree or a graph,
where each node represents a region or a pixel in the image.
2. Hierarchical Clustering: Apply hierarchical clustering algorithms to group similar regions or pixels
together based on their features, such as color, texture, or intensity.
3. Region Merging: Merge regions or nodes based on their similarity and the desired level of detail, guided
by a merging criterion.
4. Object Segmentation: Identify objects by selecting the regions or nodes that correspond to objects of
interest, guided by prior knowledge or learned models.
5. Refinement: Refine segmentation results with additional processing steps like boundary refinement or
region growing.
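
As a minimal sketch of the clustering and merging steps only (not the full pipeline), assuming a small
synthetic grayscale image, agglomerative clustering with a pixel-grid connectivity constraint groups
neighbouring pixels of similar intensity into regions:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Synthetic 32x32 grayscale image: a bright square on a dark background plus noise
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0
image += 0.05 * np.random.rand(32, 32)

# Connectivity graph so only neighbouring pixels can be merged
connectivity = grid_to_graph(*image.shape)

# Hierarchical (agglomerative) clustering on pixel intensities
X = image.reshape(-1, 1)
model = AgglomerativeClustering(n_clusters=2, linkage='ward',
                                connectivity=connectivity)
labels = model.fit_predict(X).reshape(image.shape)

print(np.unique(labels, return_counts=True))  # two regions: object and background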
Hierarchical Segmentation Methods

1. Hierarchical Clustering: Groups similar regions together using hierarchical clustering algorithms like
agglomerative or divisive clustering.
2. Region Growing: Starts with small seed regions and grows them by adding similar adjacent regions or pixels.
3. Watershed Transform: Represents the image as a topographic surface and identifies objects as catchment
basins.
4. Hierarchical CRFs: Uses hierarchical conditional random fields to model relationships between regions at
different scales.
Hierarchical Segmentation Applications
Object Detection: Identifying specific objects of interest within an image, such as vehicles, people, or
buildings.
Image Segmentation: Partitioning images into meaningful regions or objects for better understanding and
analysis.
Scene Understanding: Interpreting the meaning of an image by identifying and labeling its constituent parts
and their relationships.

The advantages of hierarchical methods include efficient representation of complex images, multi-scale
analysis capabilities, and robustness to noise. However, these methods can be computationally expensive,
difficult to parameterize correctly, and sensitive to initial conditions.
Visualization Techniques: Bar Chart
A bar chart is a fundamental visualization used to compare values across different categories. It uses rectangular
bars with heights proportional to the values they represent.

import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Create the figure and axis
fig, ax = plt.subplots()

# Create the bar chart
ax.bar(categories, values)

# Set title and labels
ax.set_title('Bar Chart Example')
ax.set_xlabel('Categories')
ax.set_ylabel('Values')

# Show the plot
plt.show()
Visualization Techniques: Column Chart
A column chart is similar to a bar chart, but here the bars are drawn horizontally instead of vertically.
This is particularly useful when category labels are long. (Note that many tools use the opposite
convention, with column charts vertical and bar charts horizontal.)

import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Create the figure and axis
fig, ax = plt.subplots()

# Create the column chart (horizontal bars)
ax.barh(categories, values)

# Set title and labels
ax.set_title('Column Chart Example')
ax.set_xlabel('Values')
ax.set_ylabel('Categories')

# Show the plot
plt.show()
Visualization Techniques: Line Chart
A line chart displays information as a series of data points connected by straight line segments. It is
particularly effective for showing trends over time or continuous data.

import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Create the figure and axis
fig, ax = plt.subplots()

# Create the line chart
ax.plot(categories, values)

# Set title and labels
ax.set_title('Line Chart Example')
ax.set_xlabel('Categories')
ax.set_ylabel('Values')

# Show the plot
plt.show()
Visualization Techniques: Scatter Plot
A scatter plot uses dots to represent values for two different variables. The position of each dot represents the
value for each observation, and patterns in the dot positions can reveal relationships between variables.

import matplotlib.pyplot as plt

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Create the figure and axis
fig, ax = plt.subplots()

# Create the scatter plot
ax.scatter(x, y)

# Set title and labels
ax.set_title('Scatter Plot Example')
ax.set_xlabel('X')
ax.set_ylabel('Y')

# Show the plot
plt.show()
Visualization Techniques: 3D Plots
A 3D plot allows visualization of data across three dimensions, providing a more comprehensive view of
relationships between three variables.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 20, 30, 40, 50]

# Create the figure and axis
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Create the 3D plot
ax.scatter(x, y, z)

# Set title and labels
ax.set_title('3D Plot Example')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

# Show the plot
plt.show()
Visualization Techniques: Heatmap
A heatmap represents data values as colors in a two-dimensional matrix. It's effective for visualizing patterns,
correlations, and the distribution of data across two categorical dimensions.

import matplotlib.pyplot as plt
import numpy as np

# Data
data = np.random.rand(10, 10)

# Create the figure and axis
fig, ax = plt.subplots()

# Create the heatmap
ax.imshow(data, cmap='hot', interpolation='nearest')

# Set title and labels
ax.set_title('Heatmap Example')
ax.set_xlabel('X')
ax.set_ylabel('Y')

# Show the plot
plt.show()
Visualization Techniques: 3D Cubes
A 3D cube plot is an advanced visualization that uses three-dimensional bars to represent data. This is particularly
useful for showing the relationship between three variables where one represents the height of the cubes.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Data
x = np.random.rand(10)
y = np.random.rand(10)
z = np.random.rand(10)

# Create the figure and axis
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Create the 3D cube plot
ax.bar3d(x, y, np.zeros(10), 0.1, 0.1, z, color='b')

# Set title and labels
ax.set_title('3D Cube Plot Example')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

# Show the plot
plt.show()
Descriptive Analytics on Healthcare Data
Descriptive analytics focuses on summarizing historical data to understand past patterns and behaviors. In
healthcare, it helps analyze patient information, medical history, and treatment outcomes to gain valuable insights.

Purpose
To describe and summarize patient data, treatment patterns, and outcomes to understand what happened in
the past.

Applications
Patient demographics analysis, treatment efficacy assessment, resource utilization tracking, and disease
prevalence monitoring.
Healthcare Data Example
Let's consider a dataset containing patient information for diabetes patients, including variables such as:

Patient ID
Age
Gender
Blood pressure
Blood glucose level
Medication (yes/no)
Hospitalization (yes/no)

Treatment outcome (improved, stable, worsened)

This data can be analyzed using descriptive analytics techniques to understand patient characteristics and
treatment outcomes.
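
Since the manual does not ship a healthcare CSV, a small synthetic dataset with these columns can be
generated for practice. The column names mirror the list above, but the values are made up for
illustration; the means and spreads loosely follow the summary statistics on the next page:

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Synthetic diabetes-patient records with the columns described above
df = pd.DataFrame({
    'Patient ID': range(1, n + 1),
    'Age': rng.integers(30, 80, n),
    'Gender': rng.choice(['Male', 'Female'], n),
    'Blood Pressure': rng.normal(130.5, 15.2, n).round(1),
    'Blood Glucose Level': rng.normal(180.2, 30.5, n).round(1),
    'Medication': rng.choice(['Yes', 'No'], n, p=[0.75, 0.25]),
    'Hospitalization': rng.choice(['Yes', 'No'], n, p=[0.2, 0.8]),
    'Treatment Outcome': rng.choice(['Improved', 'Stable', 'Worsened'], n,
                                    p=[0.4, 0.3, 0.3]),
})

df.to_csv('healthcare_data.csv', index=False)
print(df.head())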
Summary Statistics for Healthcare Data

Variable                Mean     Median   Mode   Std Dev
Blood pressure          130.5    128      120    15.2
Blood glucose level     180.2    175      160    30.5

Summary statistics provide a comprehensive view of continuous variables like blood pressure and blood
glucose levels, showing central tendency and variability measures.

Variable              Frequency                                 Percentage
Medication            Yes: 75, No: 25                           Yes: 75%, No: 25%
Hospitalization       Yes: 20, No: 80                           Yes: 20%, No: 80%
Treatment outcome     Improved: 40, Stable: 30, Worsened: 30    Improved: 40%, Stable: 30%, Worsened: 30%
Healthcare Data Visualization
Visualizing healthcare data helps identify patterns and relationships that might not be apparent from summary
statistics alone.

1. Histogram of Blood Pressure: Shows the distribution of blood pressure readings across the patient
population, helping identify common ranges and outliers.
2. Bar Chart of Treatment Outcomes: Compares the frequency of different treatment outcomes (improved,
stable, worsened) to assess overall effectiveness.
3. Scatter Plot of Blood Glucose vs. Hospitalization: Examines the relationship between blood glucose
levels and hospitalization rates to identify potential risk factors.
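
As a minimal sketch of these three plots, assuming a DataFrame with the columns described earlier (for
example the synthetic healthcare_data.csv generated above):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('healthcare_data.csv')

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of blood pressure readings
axes[0].hist(data['Blood Pressure'], bins=15)
axes[0].set_title('Blood Pressure Distribution')

# Bar chart of treatment outcome frequencies
data['Treatment Outcome'].value_counts().plot(kind='bar', ax=axes[1])
axes[1].set_title('Treatment Outcomes')

# Scatter plot of blood glucose vs. hospitalization (encoded 0/1)
hosp = (data['Hospitalization'] == 'Yes').astype(int)
axes[2].scatter(data['Blood Glucose Level'], hosp, alpha=0.5)
axes[2].set_title('Blood Glucose vs. Hospitalization')
axes[2].set_xlabel('Blood Glucose Level')
axes[2].set_ylabel('Hospitalized (1 = Yes)')

plt.tight_layout()
plt.show()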
Correlation Analysis in Healthcare
Correlation analysis identifies relationships between variables in healthcare data, helping to understand
factors that might influence patient outcomes.

Blood Pressure and Glucose
Correlation: 0.6, suggesting a moderate positive relationship between blood pressure and blood glucose
levels.

Medication and Outcome
Correlation: 0.4, indicating a positive association between medication use and improved treatment outcomes.

Age and Hospitalization
Correlation: 0.35, showing a weak positive relationship between patient age and likelihood of
hospitalization.
Healthcare Analytics Code Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the healthcare data
data = pd.read_csv('healthcare_data.csv')

# View the first few rows of the data
print(data.head())

# Calculate summary statistics
summary_stats = data.describe()
print(summary_stats)

# Create frequency distributions
medication_freq = data['Medication'].value_counts()
hospitalization_freq = data['Hospitalization'].value_counts()
treatment_outcome_freq = data['Treatment Outcome'].value_counts()

# Encode the categorical columns as numeric codes so they can be correlated
for col in ['Medication', 'Hospitalization', 'Treatment Outcome']:
    data[col] = pd.factorize(data[col])[0]

# Calculate correlation coefficients (numeric columns only)
corr_coef = data.corr(numeric_only=True)
print(corr_coef)

# Identify relationships between variables
print("Relationship between Blood Pressure and Blood Glucose Level:",
      corr_coef['Blood Pressure']['Blood Glucose Level'])
print("Relationship between Medication and Treatment Outcome:",
      corr_coef['Medication']['Treatment Outcome'])
Insights and Recommendations for Healthcare
Descriptive analytics on healthcare data yields valuable insights that can inform clinical decision-making
and improve patient care:

1. Blood Pressure Management: Patients with higher blood pressure tend to have higher blood glucose levels,
suggesting a need for comprehensive management of both conditions.
2. Medication Efficacy: Patients who receive medication tend to have better treatment outcomes, highlighting
the importance of medication adherence.
3. Hospitalization Risk: Hospitalization rates are higher among patients with poorer treatment outcomes,
indicating a need for early intervention strategies.

These insights can guide the development of targeted interventions to improve patient outcomes and reduce
healthcare costs.
Predictive Analytics on Product Sales Data
Predictive analytics uses statistical and machine learning techniques to forecast future sales based on
historical data patterns. This helps businesses optimize inventory, marketing strategies, and resource
allocation.

Purpose
To forecast future sales volume, identify sales trends, and understand factors influencing product
performance.

Applications
Demand forecasting, inventory management, pricing optimization, and targeted marketing campaigns.
Product Sales Dataset
A typical product sales dataset for predictive analytics might include:

Date: The date of each sale
Sales: The number of units sold on each date
Price: The price of the product on each date
Advertising: The amount spent on advertising on each date
Seasonality: Whether the sale occurred during a peak season or not
Promotions: Whether a promotion was active during the sale
Competitor Activity: Information about competitor pricing and promotions

These variables help in understanding factors that influence sales and building accurate predictive models.
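
The preprocessing and modelling code in the following pages assumes a file named sales_data.csv. A small
synthetic version with these columns can be generated as follows; the values and the relationship driving
Sales are purely illustrative, and Competitor Activity is omitted for brevity:

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
dates = pd.date_range('2022-01-01', periods=365, freq='D')

# Synthetic daily product sales with the variables described above
df = pd.DataFrame({
    'Date': dates,
    'Price': rng.uniform(8, 12, len(dates)).round(2),
    'Advertising': rng.uniform(100, 500, len(dates)).round(0),
    'Seasonality': np.where(dates.month.isin([11, 12]), 'Peak', 'Off-Peak'),
    'Promotions': rng.choice([0, 1], len(dates), p=[0.8, 0.2]),
})

# Sales loosely driven by advertising, promotions, and seasonality plus noise
df['Sales'] = (0.2 * df['Advertising'] + 30 * df['Promotions']
               + np.where(df['Seasonality'] == 'Peak', 50, 0)
               + rng.normal(0, 10, len(dates))).round(0)

df.to_csv('sales_data.csv', index=False)
print(df.head())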
Data Preprocessing for Sales Prediction
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = pd.read_csv('sales_data.csv')

# Handle missing values in the numeric columns
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Convert the categorical Seasonality variable into numerical dummy columns
data = pd.get_dummies(data, columns=['Seasonality'], drop_first=True)

# Scale the numeric model inputs
scaler = StandardScaler()
data[['Sales', 'Price', 'Advertising']] = scaler.fit_transform(
    data[['Sales', 'Price', 'Advertising']])

Data preprocessing is a crucial step in preparing sales data for predictive modeling. This includes handling missing
values, converting categorical variables into numerical format, and scaling the data to ensure all features contribute
equally to the model.
Feature Engineering for Sales Prediction
# Create a new feature called Lag_Sales (previous day's sales)
data['Lag_Sales'] = data['Sales'].shift(1)

# Create a 7-day moving average feature
data['MA_7days'] = data['Sales'].rolling(window=7).mean()

# Create day-of-week feature
data['DayOfWeek'] = pd.to_datetime(data['Date']).dt.dayofweek

# Create month feature
data['Month'] = pd.to_datetime(data['Date']).dt.month

# Create holiday flag
data['IsHoliday'] = data['Date'].isin(['2022-12-25', '2022-01-01',
                                       '2022-07-04']).astype(int)

Feature engineering enhances the predictive power of sales forecasting models by creating new features from
existing data. These might include lag features, moving averages, time-based features, and special event
flags.
Model Selection for Sales Prediction

1. Time Series Models: ARIMA, SARIMA, and Prophet are specialized for forecasting data with temporal
patterns and seasonality.
2. Machine Learning Models: Linear Regression, Random Forest, and Gradient Boosting can capture complex
relationships between sales and multiple features.
3. Deep Learning Models: LSTM and RNN networks excel at learning sequential patterns in sales data over time.

The choice of model depends on the characteristics of the sales data, including seasonality, trend, and the
influence of external factors like marketing and promotions.
LSTM Model for Sales Prediction
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Drop rows with NaNs introduced by the lag and moving-average features
data = data.dropna()

# Separate the features from the Sales target (Date is kept out of the model)
X = data.drop(columns=['Sales', 'Date']).values.astype('float32')
y = data['Sales'].values.astype('float32')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                     random_state=42)

# Reshape data for LSTM (samples, time steps, features)
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))

# Create an LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True,
               input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(LSTM(units=50))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32,
          validation_data=(X_test, y_test))
Model Evaluation for Sales Prediction

Mean Squared Error (MSE)
Measures the average of the squares of the errors between actual and predicted sales values.

Mean Absolute Error (MAE)
Measures the average magnitude of errors without considering their direction.

Mean Absolute Percentage Error (MAPE)
Expresses accuracy as a percentage of the error, providing relative performance measurement.

import numpy as np

# Evaluate the model
mse = model.evaluate(X_test, y_test)
print(f'MSE: {mse}')

# Make predictions (flatten so the shape matches y_test)
y_pred = model.predict(X_test).flatten()

# Calculate MAE
mae = np.mean(np.abs(y_test - y_pred))
print(f'MAE: {mae}')

# Calculate MAPE
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f'MAPE: {mape}%')
Predictive Analytics for Weather Forecasting
Weather forecasting uses historical climate data, machine learning algorithms, and statistical models to predict
future weather conditions. Accurate forecasts are crucial for agriculture, transportation, emergency management,
and daily planning.

Purpose
To predict future weather conditions based on historical patterns and current atmospheric measurements.

Applications
Agricultural planning, disaster preparedness, transportation scheduling, and event planning.
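
The manual does not include code for this experiment. As a minimal sketch under assumed column names (a
hypothetical weather_data.csv with Date, Temperature, Humidity, and Pressure columns), a Random Forest
regressor can serve as a simple predictive model for the next day's temperature:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical daily weather data; the column names are assumptions
data = pd.read_csv('weather_data.csv', parse_dates=['Date'])

# Target: the next day's temperature, predicted from today's measurements
data['NextDayTemp'] = data['Temperature'].shift(-1)
data = data.dropna()

X = data[['Temperature', 'Humidity', 'Pressure']]
y = data['NextDayTemp']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, predictions))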
