ML Programs

This document provides an overview of Python libraries for statistical analysis and machine learning, including examples using NumPy, SciPy, Pandas, and Matplotlib. It covers central tendency measures, measures of dispersion, and the implementation of simple linear regression and other common ML algorithms, highlighting the key features and common operations of each library for data manipulation, analysis, and visualization.

1. Write a Python program to compute Central Tendency Measures (Mean, Median, Mode) and Measures of Dispersion (Variance, Standard Deviation)
import numpy as np
import statistics as stats

def compute_central_tendency_and_dispersion(data):
    # Central Tendency Measures
    mean = np.mean(data)
    median = np.median(data)
    try:
        mode = stats.mode(data)
    except stats.StatisticsError:
        mode = "No unique mode"

    # Dispersion Measures
    data_range = np.ptp(data)  # Range: max - min
    variance = np.var(data)
    standard_deviation = np.std(data)

    # Printing Results
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print(f"Range: {data_range}")
    print(f"Variance: {variance}")
    print(f"Standard Deviation: {standard_deviation}")

# Example data
data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]

# Call the function


compute_central_tendency_and_dispersion(data)
output
Mean: 15.7
Median: 15.0
Mode: 12
Range: 15
Variance: 19.009999999999998
Standard Deviation: 4.360045871318328

Explanation:

 Central Tendency:
o Mean: The average of all values.
o Median: The middle value when the data is sorted.
o Mode: The value that occurs most frequently (if no unique mode exists, it
returns a message).
 Dispersion:
o Range: The difference between the maximum and minimum values.
o Variance: A measure of how spread out the data is.
o Standard Deviation: The square root of the variance, representing the spread
in the same units as the data.

2. Study of Python Basic Libraries such as Statistics, Math, NumPy and SciPy

1. Statistics Library:

The statistics module provides functions to compute mathematical statistics of numeric data.
It is a built-in library that does not require installation.

Common functions:

 mean(): Computes the arithmetic mean (average) of a list of numbers.


 median(): Computes the median (middle value) of a list of numbers.
 mode(): Computes the mode (most frequent value) of a list.
 variance(): Computes the variance (spread of data from the mean).
 stdev(): Computes the standard deviation (how much variation exists from the mean).

import statistics

data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)
stdev = statistics.stdev(data)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {stdev}")

output
Mean: 15.7
Median: 15.0
Mode: 12
Variance: 21.122222222222224
Standard Deviation: 4.595891885393109

2. Math Library:

The math module provides mathematical functions such as trigonometric functions, logarithmic functions, constants like pi, and many other tools for advanced math operations. It is also a built-in library.

Common functions:

 math.sqrt(): Computes the square root of a number.


 math.sin(), math.cos(), math.tan(): Trigonometric functions.
 math.log(): Computes the logarithm of a number (natural log by default).
 math.factorial(): Computes the factorial of a number.
 math.pi: The constant π.
 math.e: The constant e (Euler's number).

import math

number = 16
sqrt_val = math.sqrt(number)
factorial_val = math.factorial(5)
log_val = math.log(100, 10)

print(f"Square root of {number}: {sqrt_val}")


print(f"Factorial of 5: {factorial_val}")
print(f"Log base 10 of 100: {log_val}")

output
Square root of 16: 4.0
Factorial of 5: 120
Log base 10 of 100: 2.0
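
The constants and trigonometric functions listed above are not exercised in the example; a minimal sketch of their use is shown below (the angle pi/4 is chosen purely for illustration):

import math

angle = math.pi / 4                       # 45 degrees expressed in radians
print(f"pi: {math.pi}, e: {math.e}")      # the constants pi and e
print(f"sin(pi/4): {math.sin(angle)}")    # about 0.7071
print(f"cos(pi/4): {math.cos(angle)}")    # about 0.7071
print(f"tan(pi/4): {math.tan(angle)}")    # about 1.0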

3. NumPy Library:
NumPy is a powerful library for numerical computations in Python. It provides support for
large multi-dimensional arrays and matrices, as well as a collection of high-level
mathematical functions to operate on these arrays.

Key features:

 Arrays: Provides numpy.ndarray for efficient storage and manipulation of data.


 Vectorized operations: Perform operations on entire arrays, without explicit loops.
 Statistical operations: Mean, median, standard deviation, variance, etc.
 Linear algebra: Matrix multiplication, eigenvalues, etc.
 Random number generation: Generating random data with various distributions.

import numpy as np

# Creating a NumPy array


data = np.array([12, 15, 12, 10, 18, 20, 25, 12, 15, 18])

# Basic stats
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)

print(f"NumPy Mean: {mean}")


print(f"NumPy Median: {median}")
print(f"NumPy Standard Deviation: {std_dev}")
print(f"NumPy Variance: {variance}")

output
NumPy Mean: 15.7
NumPy Median: 15.0
NumPy Standard Deviation: 4.360045871318328

NumPy Variance: 19.009999999999998
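
The linear algebra and random number generation features mentioned above can also be sketched briefly; the matrices and the seed below are illustrative, not part of the original example:

import numpy as np

# Matrix multiplication and eigenvalues
A = np.array([[2, 1], [1, 3]])
B = np.array([[1, 0], [0, 1]])
print("A @ B:\n", A @ B)                        # matrix product
print("Eigenvalues of A:", np.linalg.eigvals(A))

# Random number generation with a fixed seed for reproducibility
rng = np.random.default_rng(42)
print("Uniform samples:", rng.random(3))
print("Normal samples:", rng.normal(loc=0, scale=1, size=3))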

4. SciPy Library:

SciPy builds on top of NumPy and provides a collection of algorithms and mathematical tools
for scientific computing. It covers optimization, integration, interpolation, eigenvalue
problems, and more.

Key Features:

 Integration: Functions for numerical integration (e.g., scipy.integrate.quad).


 Optimization: Functions for optimization problems (e.g., scipy.optimize.minimize).
 Signal Processing: Tools for Fourier transforms, filtering, etc.
 Statistics: Advanced statistical distributions and hypothesis testing.
 Linear Algebra: More advanced linear algebra functions than NumPy.
import scipy.stats as stats

# Normal Distribution - finding the probability density function (PDF) at a point


x=3
mean = 0
std_dev = 1
pdf_value = stats.norm.pdf(x, loc=mean, scale=std_dev)

print(f"PDF value of Normal Distribution at x={x}: {pdf_value}")

# T-test for comparing two sample means


sample1 = [23, 21, 18, 24, 30]
sample2 = [17, 22, 19, 23, 25]
t_stat, p_value = stats.ttest_ind(sample1, sample2)

print(f"T-statistic: {t_stat}, P-value: {p_value}")

output
PDF value of Normal Distribution at x=3: 0.0044318484119380075
T-statistic: 0.8178608201095307, P-value: 0.4371160707340265
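
Besides statistics, the integration and optimization features listed earlier can be sketched as follows (the integrand and objective function here are illustrative assumptions):

from scipy import integrate, optimize
import numpy as np

# Numerical integration: integrate x^2 from 0 to 1 (exact answer is 1/3)
value, error = integrate.quad(lambda x: x**2, 0, 1)
print(f"Integral of x^2 over [0, 1]: {value} (estimated error: {error})")

# Optimization: minimize the quadratic (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3)**2, x0=np.array([0.0]))
print(f"Minimum found at x = {result.x[0]:.4f}")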

Comparison and Use Cases:

 Statistics: Best for basic statistical operations (mean, median, variance, etc.).
 Math: For basic math operations, constants, and trigonometric functions.
 NumPy: Best for handling large datasets, array manipulation, vectorized operations,
and matrix operations.
 SciPy: Provides higher-level operations for scientific and engineering tasks like
optimization, integration, and advanced statistical tests.

Summary:

 statistics is simple and useful for basic statistical measures (mean, median, variance).
 math is perfect for simple mathematical operations and constants.
 numpy is used for high-performance numerical computing, especially when handling
large datasets or working with arrays.
 scipy is built on top of NumPy and extends its capabilities with advanced tools for
optimization, integration, signal processing, and more.

3. Study of Python Libraries for ML application such as Pandas and Matplotlib

1. Pandas Library

Pandas is a powerful library for data manipulation and analysis. It provides easy-to-use data
structures and data analysis tools, particularly for working with structured data (i.e., data in
the form of tables, such as CSV, Excel files, SQL databases, or JSON).
Key Features:

 DataFrame: A 2D table-like data structure (similar to a spreadsheet or SQL table), with labeled axes (rows and columns).
 Series: A one-dimensional array-like object that can hold any data type.
 Data manipulation: Functions for filtering, sorting, grouping, merging, and
reshaping datasets.
 Handling missing data: Easy handling of NaN (Not a Number) values.
 Time series support: Efficient handling of date and time data.

Common Operations in Pandas:

 Loading Data: Pandas supports reading data from various formats, including CSV,
Excel, SQL, and JSON.
 Data Selection and Indexing: You can select rows, columns, and perform slicing.
 Data Aggregation: Group by operations to aggregate data based on categories.
 Data Cleaning: Handling missing values, dropping duplicates, etc.
 Data Transformation: Operations like normalization, encoding, and scaling.

import pandas as pd
# Create a simple DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [23, 35, 40, 28, 22],
'Salary': [70000, 80000, 120000, 90000, 95000]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Basic Statistics
print("\nBasic Statistics:")
print(df.describe()) # Get summary statistics
# Filtering Data
print("\nFiltered Data (Age > 30):")
filtered_df = df[df['Age'] > 30]
print(filtered_df)
# Sorting Data
print("\nSorted Data by Salary:")
sorted_df = df.sort_values(by='Salary', ascending=False)
print(sorted_df)
# Handling Missing Data (if present)
df_with_nan = df.copy()
df_with_nan.loc[2, 'Age'] = None # Introduce a NaN value (use .loc to avoid chained-assignment issues)
print("\nData with Missing Values:")
print(df_with_nan)
df_cleaned = df_with_nan.dropna() # Remove rows with NaN values
print("\nCleaned Data:")
print(df_cleaned)

output
Original DataFrame:
      Name  Age  Salary
0    Alice   23   70000
1      Bob   35   80000
2  Charlie   40  120000
3    David   28   90000
4      Eva   22   95000

Basic Statistics:
Age Salary
count 5.000000 5.000000
mean 29.600000 91000.000000
std 7.765307 18841.443681
min 22.000000 70000.000000
25% 23.000000 80000.000000
50% 28.000000 90000.000000
75% 35.000000 95000.000000
max 40.000000 120000.000000

Filtered Data (Age > 30):


Name Age Salary
1 Bob 35 80000
2 Charlie 40 120000

Sorted Data by Salary:


Name Age Salary
2 Charlie 40 120000
4 Eva 22 95000
3 David 28 90000
1 Bob 35 80000
0 Alice 23 70000
Data with Missing Values:
Name Age Salary
0 Alice 23.0 70000
1 Bob 35.0 80000
2 Charlie NaN 120000
3 David 28.0 90000
4 Eva 22.0 95000

Cleaned Data:
Name Age Salary
0 Alice 23.0 70000
1 Bob 35.0 80000
3 David 28.0 90000
4 Eva 22.0 95000
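
The loading and aggregation operations listed earlier are not shown in the example above; a minimal sketch (the CSV filename and the 'Department' column are hypothetical) could look like this:

import pandas as pd

# Loading data from a file (filename is hypothetical)
# df = pd.read_csv('employees.csv')

# Group-by aggregation on a small illustrative DataFrame
df_dept = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR', 'IT'],
    'Salary': [70000, 80000, 120000, 90000, 95000]
})
print(df_dept.groupby('Department')['Salary'].agg(['mean', 'max']))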

2. Matplotlib Library

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is widely used for plotting data and providing insights visually, which is critical for analyzing ML models and datasets.

Key Features:

 2D plotting: Create line plots, bar plots, histograms, scatter plots, and more.
 Customizable plots: Customize titles, labels, legends, and axes.
 Subplots: Ability to create multiple plots in one figure.
 Animation: Create animated visualizations.
 Interactive plots: Supports interactive visualizations in Jupyter Notebooks.

Common Plot Types:

 Line Plot: For showing trends over time or continuous data.


 Bar Plot: For comparing categories or discrete data.
 Scatter Plot: For showing relationships between two continuous variables.
 Histogram: For showing the distribution of data.
 Box Plot: For showing the spread and outliers in the data.

import matplotlib.pyplot as plt


# Line Plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

# Bar Plot
categories = ['A', 'B', 'C', 'D']
values = [7, 15, 23, 8]
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

# Scatter Plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

# Histogram
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5]
plt.hist(data, bins=5)
plt.title('Histogram')
plt.xlabel('Bins')
plt.ylabel('Frequency')
plt.show()

# Box Plot
data = [ [1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [7, 8, 9, 10, 11] ]
plt.boxplot(data)
plt.title('Box Plot')
plt.show()

output
4. Write a Python program to implement Simple Linear Regression

Simple Linear Regression is a fundamental algorithm in machine learning that is used to model the relationship between two variables. In this example, we will use Python to implement Simple Linear Regression using Pandas for data manipulation, NumPy for mathematical operations, and Matplotlib for visualization.

Here's the step-by-step implementation:

1. Import necessary libraries:

We'll need numpy for mathematical operations, matplotlib for visualization, and pandas for
data manipulation.

2. Create or Load Data:

We will create a simple dataset where we have an independent variable (X) and a dependent
variable (Y).

3. Calculate the parameters (slope and intercept) of the line using the Simple Linear
Regression formula:

 Slope (m) = (n·Σ(xy) − Σx·Σy) / (n·Σ(x²) − (Σx)²)
 Intercept (b) = (Σy − m·Σx) / n

4. Make Predictions:

Once we have the slope and intercept, we can make predictions using the formula:

 Y = mX + b

5. Visualize:

Finally, we will plot the data and the regression line using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: Create a simple dataset


# Let's assume the data represents years of experience (X) vs salary (Y)
data = {
'Years_of_Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [40000, 42000, 44000, 46000, 48000, 50000, 52000, 54000, 56000, 58000]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 2: Extract the independent and dependent variables


X = df['Years_of_Experience'].values # Independent variable (Years of Experience)
Y = df['Salary'].values # Dependent variable (Salary)

# Step 3: Calculate the slope (m) and intercept (b)


# We use the formulas for Simple Linear Regression
n = len(X)
sum_x = np.sum(X)
sum_y = np.sum(Y)
sum_xy = np.sum(X * Y)
sum_x_squared = np.sum(X ** 2)

# Calculating slope (m) and intercept (b)


m = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(f"Slope (m): {m}")


print(f"Intercept (b): {b}")

# Step 4: Make predictions


# Using the equation Y = mX + b
Y_pred = m * X + b

# Step 5: Visualize the results


plt.scatter(X, Y, color='blue', label='Actual Data')
plt.plot(X, Y_pred, color='red', label='Regression Line')
plt.title('Simple Linear Regression: Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

# Step 6: (Optional) Evaluate the model using R-squared value (goodness of fit)
# R-squared value represents the proportion of the variance in Y that is explained by X.
ss_total = np.sum((Y - np.mean(Y)) ** 2) # Total sum of squares
ss_residual = np.sum((Y - Y_pred) ** 2) # Residual sum of squares
r_squared = 1 - (ss_residual / ss_total)

print(f"R-squared value: {r_squared}")

output
Slope (m): 2000.0
Intercept (b): 38000.0
R-squared value: 1.0
Explanation of the Code:

1. Dataset:
o We created a simple dataset with Years_of_Experience and corresponding
Salary.

2. Formula for Simple Linear Regression:


o We calculated the slope (m) and intercept (b) using the standard formulas for
Simple Linear Regression.

3. Prediction:
o Using the formula Y = mX + b, we predicted Y values (Salaries) for each X value (Years of Experience).

4. Visualization:
o We plotted both the actual data points (in blue) and the regression line (in red)
using matplotlib.

5. R-squared:
o An optional evaluation metric, the R-squared value, is calculated to measure
how well the regression line fits the data. An R-squared value close to 1
indicates a good fit.
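
As a sanity check, the manually computed slope and intercept can be compared against np.polyfit, which fits the same least-squares line (a brief sketch reusing the X and Y arrays defined above):

import numpy as np

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(X, Y, 1)
print(f"polyfit slope: {slope}, intercept: {intercept}")  # expected to match m and b above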

5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn

Implementing Multiple Linear Regression for house price prediction using scikit-learn is a
common use case in machine learning. Multiple Linear Regression is an extension of Simple
Linear Regression that uses more than one independent variable (feature) to predict the
dependent variable (target).

Steps for the Implementation:

1. Load Data: We will use a dataset that contains multiple features (e.g., size of the
house, number of bedrooms, location, etc.) to predict the house price.
2. Preprocess the Data: This step involves handling missing data, encoding categorical
variables, and scaling the features if necessary.
3. Split Data: Divide the data into training and testing sets.
4. Train the Model: Fit the model using the training data.
5. Evaluate the Model: Evaluate the performance of the model on the test data.
6. Predict: Use the model to predict house prices for new data.

# Step 1: Import necessary libraries

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, mean_absolute_error

import matplotlib.pyplot as plt

# Step 2: Create a larger sample dataset (with more data points for meaningful results)

data = {
    'Square_Feet': [1500, 1800, 2400, 3000, 3500, 4000, 4500, 5000, 5500, 6000],
    'Bedrooms': [3, 3, 4, 4, 5, 5, 6, 6, 7, 7],
    'Age_of_House': [10, 15, 20, 5, 8, 12, 18, 3, 4, 9],
    'Price': [400000, 450000, 500000, 600000, 650000, 700000, 750000, 800000, 850000, 900000]
}

# Create a DataFrame from the data

df = pd.DataFrame(data)

# Step 3: Define independent variables (X) and dependent variable (y)

X = df[['Square_Feet', 'Bedrooms', 'Age_of_House']] # Independent variables

y = df['Price'] # Dependent variable (Price)

# Step 4: Split the data into training and testing sets (80% for training, 20% for testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train the Multiple Linear Regression model

model = LinearRegression()

model.fit(X_train, y_train)

# Step 6: Make predictions on the test set

y_pred = model.predict(X_test)

# Step 7: Evaluate the model performance

mse = mean_squared_error(y_test, y_pred) # Mean Squared Error


mae = mean_absolute_error(y_test, y_pred) # Mean Absolute Error

print("Mean Squared Error:", mse)

print("Mean Absolute Error:", mae)

# Step 8: Visualize the predictions (Optional: for simple datasets like this, we can plot actual vs predicted prices)

plt.scatter(y_test, y_pred)

plt.xlabel('Actual Prices')

plt.ylabel('Predicted Prices')

plt.title('Actual vs Predicted House Prices')

plt.show()

# Step 9: Show the model coefficients and intercept

print(f"Intercept (b): {model.intercept_}")

print(f"Coefficients (m): {model.coef_}")

# Step 10: Use the model to predict a new data point (optional)

new_data = np.array([[2000, 3, 10]]) # Example: Square_Feet=2000, Bedrooms=3, Age_of_House=10

predicted_price = model.predict(new_data)

print(f"Predicted price for the new house: ${predicted_price[0]:,.2f}")

output

Mean Squared Error: 33740072.40275199

Mean Absolute Error: 5057.868643881695


Explanation of the Code:

1. Loading Data:
o In this example, we manually created a small dataset with Square_Feet,
Bedrooms, Age_of_House, and Price (which is the target variable).
o Normally, you would load a dataset using pd.read_csv('your_dataset.csv').

2. Defining Features and Target:


o X (features) contains the independent variables: Square_Feet, Bedrooms, and
Age_of_House.
o y (target) contains the dependent variable, which is the Price.

3. Splitting the Data:


o The dataset is split into training (80%) and testing (20%) sets using
train_test_split() from sklearn.model_selection.

4. Training the Model:


o We use LinearRegression() from sklearn.linear_model to create the model and
fit it using X_train and y_train.

5. Making Predictions:
o After training the model, we use it to predict the house prices on the testing set
X_test.

6. Evaluating the Model:


o We use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics:
 MSE gives us the average squared difference between the actual and predicted values (lower values indicate a better fit).
 MAE gives us the average absolute difference between the actual and predicted values, expressed in the same units as the price.

7. Visualizing Predictions:
o We plot the actual vs predicted house prices using a scatter plot to visually
inspect the model's performance.
8. Model Coefficients:
o We print the model's intercept (b) and coefficients (m), which tell us how
much each feature contributes to the prediction.

9. Predicting for New Data:


o We use the trained model to predict the price of a new house with 2000 square
feet, 3 bedrooms, and 10 years old.
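
The preprocessing step mentioned earlier (encoding categorical variables such as location) is not needed for this all-numeric dataset, but a minimal sketch of how it could be done with pandas is shown below (the 'Location' column and its values are hypothetical):

import pandas as pd

# Hypothetical data with a categorical feature
houses = pd.DataFrame({
    'Square_Feet': [1500, 2400, 3000],
    'Location': ['Suburb', 'City', 'Suburb'],
    'Price': [400000, 500000, 600000]
})

# One-hot encode the categorical column before fitting a linear model
encoded = pd.get_dummies(houses, columns=['Location'], drop_first=True)
print(encoded)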

6. Implementation of Decision tree using sklearn and its parameter tuning

Implementation of Decision Tree using Scikit-learn and Parameter Tuning

A Decision Tree is a popular supervised machine learning algorithm used for both regression
and classification tasks. It splits the data into subsets based on feature values and builds a
tree-like structure to predict the target variable.

We'll implement a Decision Tree Classifier for classification or a Decision Tree Regressor
for regression, and we'll also explore how to tune the hyperparameters to improve the model's
performance using Grid Search.

For this example, I'll implement a Decision Tree Classifier and perform parameter tuning
using GridSearchCV.

Steps Involved:

1. Import Libraries: Import necessary libraries.


2. Load Dataset: We'll use a popular dataset, such as the Iris dataset for classification.
3. Preprocess Data: Prepare the data, split it into training and testing sets.
4. Train the Model: Create and fit the Decision Tree model.
5. Evaluate the Model: Use accuracy or other metrics to evaluate the model.
6. Hyperparameter Tuning: Perform hyperparameter tuning using GridSearchCV to
find the best parameters.

# Step 1: Import necessary libraries

import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, classification_report

from sklearn.datasets import load_iris

from sklearn.model_selection import GridSearchCV


# Step 2: Load a dataset (Iris dataset)

# For simplicity, we're using a well-known Iris dataset that comes with sklearn.

data = load_iris()

X = data.data # Features

y = data.target # Target labels

# Step 3: Split the data into training and testing sets (80% training, 20% testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Initialize and train the Decision Tree Classifier

model = DecisionTreeClassifier(random_state=42)

model.fit(X_train, y_train)

# Step 5: Evaluate the initial model

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Initial Accuracy: {accuracy * 100:.2f}%")

print("Classification Report:\n", classification_report(y_test, y_pred))

# Step 6: Hyperparameter Tuning using GridSearchCV

# Define the parameter grid

param_grid = {
    'criterion': ['gini', 'entropy'],        # Criteria for splitting
    'max_depth': [None, 10, 20, 30],         # Max depth of the tree
    'min_samples_split': [2, 5, 10],         # Minimum samples to split an internal node
    'min_samples_leaf': [1, 2, 4],           # Minimum samples for a leaf node
    'max_features': [None, 'sqrt', 'log2']   # Number of features to consider when splitting
}

# Step 7: Create the GridSearchCV object to search for the best hyperparameters
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1,
verbose=1)

# Step 8: Fit GridSearchCV to the training data

grid_search.fit(X_train, y_train)

# Step 9: Get the best parameters and best score from GridSearchCV

print(f"Best Parameters: {grid_search.best_params_}")

print(f"Best Cross-validation Score: {grid_search.best_score_:.4f}")

# Step 10: Evaluate the tuned model

best_model = grid_search.best_estimator_

y_pred_tuned = best_model.predict(X_test)

tuned_accuracy = accuracy_score(y_test, y_pred_tuned)

print(f"Tuned Model Accuracy: {tuned_accuracy * 100:.2f}%")

print("Tuned Classification Report:\n", classification_report(y_test, y_pred_tuned))

output

Initial Accuracy: 96.67%

Classification Report:

precision recall f1-score support

0 1.00 1.00 1.00 6

1 1.00 0.80 0.89 5

2 0.89 1.00 0.94 4

accuracy 0.97 15

macro avg 0.96 0.93 0.94 15

weighted avg 0.97 0.97 0.96 15


Fitting 5 folds for each of 72 candidates, totalling 360 fits

Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 10}

Best Cross-validation Score: 0.9667

Tuned Model Accuracy: 100.00%

Tuned Classification Report:

precision recall f1-score support

0 1.00 1.00 1.00 6

1 1.00 1.00 1.00 5

2 1.00 1.00 1.00 4

accuracy 1.00 15

macro avg 1.00 1.00 1.00 15

weighted avg 1.00 1.00 1.00 15

Explanation of the Code:

1. Import Libraries:
o We import necessary libraries like pandas, numpy, train_test_split for splitting
the dataset, DecisionTreeClassifier for the model, GridSearchCV for
hyperparameter tuning, and classification_report for performance evaluation.

2. Load Dataset:
o We use the Iris dataset, which is a simple classification dataset that contains
three classes of flowers, each with four features (sepal length, sepal width,
petal length, and petal width).

3. Split Data:
o The dataset is split into training (80%) and testing (20%) sets using
train_test_split.

4. Train the Model:


o A DecisionTreeClassifier is created, trained on the training data (X_train,
y_train), and then used to predict the test data (X_test).
5. Evaluate the Model:
o The initial accuracy of the model is calculated using accuracy_score and we
print the classification report, which includes precision, recall, and F1-score.

6. Hyperparameter Tuning:
o We use GridSearchCV to search for the best combination of hyperparameters.
The parameters to tune include:
 criterion: The function to measure the quality of a split, either 'gini' or
'entropy'.
 max_depth: The maximum depth of the tree.
 min_samples_split: The minimum number of samples required to split
an internal node.
 min_samples_leaf: The minimum number of samples required to be at
a leaf node.
 max_features: The number of features to consider when looking for the
best split.
o We use 5-fold cross-validation to evaluate each combination of parameters
and parallelize the grid search with n_jobs=-1.

7. Get Best Parameters and Evaluate the Tuned Model:


o After the grid search completes, we retrieve the best combination of
parameters using grid_search.best_params_.
o We then use the best estimator (model with the optimal parameters) to predict
the test set and evaluate the model's performance again.
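
If you also want to inspect the tuned tree itself, scikit-learn provides a plotting helper; a brief sketch (assuming the best_model and data variables from the code above are still in scope) is:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the tuned decision tree with feature and class names from the Iris dataset
plt.figure(figsize=(12, 8))
plot_tree(best_model, feature_names=data.feature_names,
          class_names=list(data.target_names), filled=True)
plt.show()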

7. Implementation of KNN using sklearn

Implementation of K-Nearest Neighbors (KNN) using Scikit-learn

K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both
classification and regression. For classification, it predicts the class of a sample based on the
majority class of its K-nearest neighbors in the training dataset.

Here, I will demonstrate how to implement KNN for classification using Scikit-learn. We
will:

1. Load a dataset.
2. Preprocess the data (split into training and testing).
3. Train the KNN model.
4. Evaluate the model using accuracy and other classification metrics.
5. Optionally, perform hyperparameter tuning to find the optimal value of K (the
number of neighbors).

Steps to Implement KNN:

1. Import Libraries: Import necessary libraries for loading the dataset, model training,
and evaluation.
2. Load Dataset: We'll use the Wine dataset from Scikit-learn for this example.
3. Preprocess Data: Split the data into training and testing sets.
4. Train the Model: Create and train a KNN classifier using Scikit-learn.
5. Evaluate the Model: Use accuracy and other classification metrics to evaluate the
model.
6. Hyperparameter Tuning: Perform hyperparameter tuning to find the best number of
neighbors (K).

# Step 1: Import necessary libraries


import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

# Step 2: Load the Wine dataset


wine = load_wine()
X = wine.data # Features
y = wine.target # Target variable (wine classes)

# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Initialize and train the KNN classifier


knn = KNeighborsClassifier(n_neighbors=5) # Set K=5 for this example
knn.fit(X_train, y_train)

# Step 5: Make predictions on the test set


y_pred = knn.predict(X_test)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print the classification report


print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 7: Hyperparameter Tuning to find the best value for K


# Using GridSearchCV to tune the value of K (number of neighbors)
param_grid = {'n_neighbors': np.arange(1, 21)} # Searching for values of K from 1 to 20

# Initialize GridSearchCV with KNN and the parameter grid


grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, n_jobs=-1)

# Fit the grid search on the training data


grid_search.fit(X_train, y_train)
# Get the best parameters and score from GridSearchCV
print(f"\nBest value for K: {grid_search.best_params_['n_neighbors']}")
print(f"Best Cross-validation Score: {grid_search.best_score_:.4f}")

# Step 8: Re-train the model with the best K


best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)

# Step 9: Evaluate the tuned model


tuned_accuracy = accuracy_score(y_test, y_pred_best)
print(f"\nTuned Model Accuracy: {tuned_accuracy * 100:.2f}%")
print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_best))

# Step 10: Optionally, plot the accuracy with different values of K


k_values = np.arange(1, 21)
accuracy_scores = []

for k in k_values:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train, y_train)
    y_pred_temp = knn_temp.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_temp))

# Plot the accuracy vs K values


plt.plot(k_values, accuracy_scores, marker='o')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs K (Number of Neighbors)')
plt.show()

output
Accuracy: 72.22%

Classification Report:
precision recall f1-score support

0 0.86 0.86 0.86 14


1 0.79 0.79 0.79 14
2 0.38 0.38 0.38 8

accuracy 0.72 36
macro avg 0.67 0.67 0.67 36
weighted avg 0.72 0.72 0.72 36

Best value for K: 17


Best Cross-validation Score: 0.7254
Tuned Model Accuracy: 77.78%

Tuned Classification Report:


precision recall f1-score support

0 0.93 1.00 0.97 14


1 0.82 0.64 0.72 14
2 0.50 0.62 0.56 8

accuracy 0.78 36
macro avg 0.75 0.76 0.75 36
weighted avg 0.79 0.78 0.78 36
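
Because KNN relies on distances, the unscaled Wine features (which have very different ranges) limit the accuracy above. A brief sketch of adding feature scaling, which typically improves KNN on this dataset, is shown below (the exact accuracy you get may differ); it reuses X_train, X_test, y_train, and y_test from the code above:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Scale features to zero mean and unit variance before applying KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
print(f"Scaled KNN Accuracy: {accuracy_score(y_test, scaled_knn.predict(X_test)) * 100:.2f}%")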

8. Implementation of Logistic Regression using sklearn

Implementation of Logistic Regression using Scikit-learn

Logistic Regression is a simple and widely used classification algorithm. Despite its name, it
is a classification algorithm, not a regression algorithm. It is used to predict the probability of
a categorical dependent variable based on one or more independent variables.

Here, we'll implement a Logistic Regression classifier using Scikit-learn on a popular dataset
called the Iris dataset (for simplicity), which is a multiclass classification problem.

Steps to Implement Logistic Regression:

1. Import Libraries: We will import necessary libraries.


2. Load Dataset: We will use the Iris dataset.
3. Preprocess Data: Split the dataset into training and testing sets.
4. Train Logistic Regression Model: We will use Scikit-learn's LogisticRegression
class to train the model.
5. Evaluate the Model: We will evaluate the model using accuracy and classification
report.
6. Hyperparameter Tuning: Optionally, we can tune the hyperparameters to improve
the model.

# Step 1: Import necessary libraries

import numpy as np

import pandas as pd

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, classification_report

from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt

# Step 2: Load the Iris dataset

iris = load_iris()

X = iris.data # Features (sepal length, sepal width, petal length, petal width)

y = iris.target # Target variable (species of Iris)

# Step 3: Split the data into training and testing sets (80% training, 20% testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Initialize and train the Logistic Regression model

logreg = LogisticRegression(max_iter=200) # Set max_iter=200 to ensure convergence

logreg.fit(X_train, y_train)

# Step 5: Make predictions on the test set

y_pred = logreg.predict(X_test)

# Step 6: Evaluate the model

accuracy = accuracy_score(y_test, y_pred)


print(f"Accuracy: {accuracy * 100:.2f}%")

# Print the classification report

print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 7: Hyperparameter Tuning using GridSearchCV

# Tuning the solver and regularization parameter (C)

param_grid = {'solver': ['liblinear', 'lbfgs', 'newton-cg'],
              'C': np.logspace(-4, 4, 20)}  # Trying different values for C (regularization strength)

# Initialize GridSearchCV

grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, n_jobs=-1)

# Fit the grid search on the training data

grid_search.fit(X_train, y_train)

# Get the best parameters and score from GridSearchCV

print(f"\nBest parameters from GridSearchCV: {grid_search.best_params_}")

print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Step 8: Re-train the model with the best parameters

best_logreg = grid_search.best_estimator_

y_pred_best = best_logreg.predict(X_test)

# Step 9: Evaluate the tuned model

tuned_accuracy = accuracy_score(y_test, y_pred_best)

print(f"\nTuned Model Accuracy: {tuned_accuracy * 100:.2f}%")

print("\nTuned Classification Report:\n", classification_report(y_test, y_pred_best))

# Step 10: Plot the accuracy with different regularization strength values (C)

C_values = np.logspace(-4, 4, 20)


accuracy_scores = []

for C in C_values:
    logreg_temp = LogisticRegression(C=C, max_iter=200)
    logreg_temp.fit(X_train, y_train)
    y_pred_temp = logreg_temp.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_temp))

# Plot the accuracy vs C values

plt.plot(C_values, accuracy_scores, marker='o')

plt.xscale('log')

plt.xlabel('Regularization strength (C)')

plt.ylabel('Accuracy')

plt.title('Accuracy vs Regularization Strength (C)')

plt.show()

output

Accuracy: 94.44%

Classification Report:

precision recall f1-score support

0 1.00 1.00 1.00 9

1 0.91 0.85 0.88 13

2 0.88 1.00 0.94 8

accuracy 0.94 30

macro avg 0.93 0.95 0.94 30


weighted avg 0.94 0.94 0.94 30

Best parameters from GridSearchCV: {'C': 1.0, 'solver': 'lbfgs'}

Best cross-validation score: 0.9667

Tuned Model Accuracy: 96.67%

Tuned Classification Report:

precision recall f1-score support

0 1.00 1.00 1.00 9

1 0.94 0.92 0.93 13

2 0.93 1.00 0.96 8

accuracy 0.97 30

macro avg 0.96 0.97 0.96 30

weighted avg 0.97 0.97 0.97 30


Explanation of the Code:

1. Import Libraries:
o We use libraries like numpy, pandas, and matplotlib for data manipulation and
visualization.
o We import Scikit-learn modules to load the dataset, split the data, create the
Logistic Regression model, and evaluate the performance.

2. Load Dataset:
o We use the Iris dataset (a multiclass classification dataset) available directly
from Scikit-learn using load_iris().
o X contains the features (sepal length, sepal width, petal length, petal width),
and y contains the target variable (species).

3. Train-Test Split:
o We split the dataset into training and testing sets using train_test_split() (80%
for training, 20% for testing).

4. Train Logistic Regression:


o We initialize the LogisticRegression model and set max_iter=200 to ensure
the model converges (sometimes, the default iteration count may not be
enough).
o We train the model using the fit() method.

5. Model Evaluation:
o After training, we use predict() to make predictions on the test set (X_test).
o We calculate the accuracy using accuracy_score() and print the classification
report to check the precision, recall, and F1-score for each class.

6. Hyperparameter Tuning:
o We perform hyperparameter tuning using GridSearchCV to find the best
hyperparameters for the model.
o We tune the solver and C (regularization strength) parameters.
 solver: Algorithm to use for optimization (e.g., liblinear, lbfgs,
newton-cg).
 C: Regularization strength (inverse of regularization strength, where
smaller values mean more regularization).
o We use cross-validation (cv=5) to evaluate the model and find the best
parameters.

7. Re-train with Best Parameters:


o After identifying the best parameters, we re-train the model with the best
estimator and evaluate it on the test set.

8. Plot Regularization Strength vs Accuracy:


o Finally, we plot the accuracy against different values of C (regularization
strength) to see how regularization affects model performance.
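
Since logistic regression models class probabilities, those probabilities can also be inspected directly with predict_proba; a short sketch (reusing the best_logreg model and X_test from the code above) is:

# Class membership probabilities for the first three test samples
probabilities = best_logreg.predict_proba(X_test[:3])
for sample_probs in probabilities:
    print([f"{p:.3f}" for p in sample_probs])  # one probability per Iris class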
9. Implementation of K-Means Clustering

K-Means Clustering is a popular unsupervised machine learning algorithm used to partition a dataset into K distinct clusters based on similarity. It works by minimizing the sum of squared distances between data points and their assigned cluster centers (centroids).

Here's an implementation of K-Means Clustering using Python and the Scikit-Learn library.

Steps for Implementing K-Means:

1. Load a Dataset: We'll use the Iris dataset, a commonly used dataset for clustering
problems.
2. Preprocess Data: Ensure the data is in the right format.
3. Apply K-Means Clustering: Use Scikit-Learn's KMeans class.
4. Evaluate the Results: Visualize the clustering and check the accuracy of the model
(though accuracy is not a direct measure for unsupervised learning).

# Step 1: Import necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Step 2: Load the Iris dataset


iris = load_iris()
X = iris.data # Features
y = iris.target # Actual labels (for reference)

# Step 3: Standardize the data (optional but recommended for K-Means)


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: Apply K-Means Clustering


kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)

# Step 5: Get the predicted cluster labels


y_pred = kmeans.predict(X_scaled)

# Step 6: Visualize the clustering result


plt.figure(figsize=(8, 6))

# Plot first and second feature


plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X',
s=200, label='Centroids')

plt.title("K-Means Clustering on Iris Dataset")


plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Step 7: Evaluate clustering performance using Silhouette Score


silhouette_avg = silhouette_score(X_scaled, y_pred)
print(f"Silhouette Score: {silhouette_avg:.4f}")

# Step 8: Print the cluster centers


print("Cluster Centers (Centroids):")
print(kmeans.cluster_centers_)

output
Silhouette Score: 0.4799
Cluster Centers (Centroids):
[[ 0.57100359 -0.37176778 0.69111943 0.66315198]
[-0.81623084 1.31895771 -1.28683379 -1.2197118 ]
[-1.32765367 -0.373138 -1.13723572 -1.11486192]]

Explanation of Code:

1. Data Loading:
o We load the Iris dataset from Scikit-Learn. This dataset consists of 4 features
(sepal length, sepal width, petal length, and petal width) and the target labels
(setosa, versicolor, and virginica).

2. Data Standardization:
o We standardize the data using StandardScaler to ensure that each feature
contributes equally to the distance metric. This is important because K-Means
uses distance to assign points to clusters.

3. Applying K-Means:
o We set n_clusters=3 because there are three species in the Iris dataset, and we
want to find 3 clusters.
o KMeans(n_clusters=3) is the key function used to perform K-Means
clustering.
o We fit the model on the scaled data using .fit() and predict the cluster labels
using .predict().

4. Visualization:
o We use a scatter plot to visualize how the data points are clustered. We use the
first two features for simplicity. Points are colored based on the predicted
cluster labels, and the centroids of the clusters are marked with red X symbols.

5. Evaluation:
o Silhouette Score: This metric measures how similar points within a cluster are compared to points in other clusters. A higher silhouette score indicates well-defined clusters.
o We print the cluster centroids, which are the coordinates of the cluster
centers that the K-Means algorithm has identified.
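
The number of clusters was fixed at 3 here because the true species count is known. When it is not, a common heuristic is the elbow method, sketched below using the model's inertia_ (within-cluster sum of squared distances) on the scaled data X_scaled from above; the range of K values is illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for several values of K and record the inertia
inertias = []
k_range = range(1, 9)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(list(k_range), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing K')
plt.show()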

10. Performance Analysis of Classification Algorithms on a Specific Dataset (Mini Project)

Mini Project: Performance Analysis of Classification Algorithms on a Specific Dataset

In this mini-project, we will perform a performance analysis of several classification algorithms on a specific dataset. We will evaluate these algorithms based on various metrics like accuracy, precision, recall, F1-score, and confusion matrix. Additionally, we'll perform hyperparameter tuning to improve model performance.

We'll be using the Iris dataset, which is a widely used dataset in machine learning. It
contains 150 samples of iris flowers, with 4 features: sepal length, sepal width, petal length,
and petal width, and a target variable (species) with 3 possible classes: setosa, versicolor,
and virginica.

Classification Algorithms to be Compared:

1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machine (SVM)
4. Decision Tree Classifier
5. Random Forest Classifier

We'll evaluate these classifiers based on cross-validation and classification performance metrics.

Steps:

1. Load and Preprocess Data


2. Train Multiple Classification Models
3. Evaluate Performance:
o Accuracy
o Precision, Recall, and F1-Score
o Confusion Matrix
4. Hyperparameter Tuning (Optional)
5. Performance Comparison

# Step 1: Import Necessary Libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

# Step 2: Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Standardize the features (important for models like Logistic Regression, KNN, and SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Initialize models


models = {
'Logistic Regression': LogisticRegression(max_iter=200),
'K-Nearest Neighbors (KNN)': KNeighborsClassifier(),
'Support Vector Machine (SVM)': SVC(),
'Decision Tree Classifier': DecisionTreeClassifier(),
'Random Forest Classifier': RandomForestClassifier()
}

# Step 6: Train and evaluate each model


results = {}
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Performance Metrics
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)

    # Store results
    results[model_name] = {
        'accuracy': accuracy,
        'classification_report': classification_rep,
        'confusion_matrix': confusion_mat
    }

    # Print Results
    print(f"\n{model_name} - Accuracy: {accuracy * 100:.2f}%")
    print("Classification Report:\n", classification_rep)
    print("Confusion Matrix:\n", confusion_mat)

# Step 7: Hyperparameter Tuning (optional, for Random Forest as an example)


print("\nTuning Random Forest Classifier...")
param_grid = {
'n_estimators': [50, 100, 150],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters for Random Forest:", grid_search.best_params_)

# Step 8: Compare results visually


model_names = list(results.keys())
accuracies = [results[model]['accuracy'] for model in model_names]

plt.figure(figsize=(10, 6))
plt.barh(model_names, accuracies, color='skyblue')
plt.xlabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.show()

# Optional: Display confusion matrices for all models (if required)


for model_name, result in results.items():
    print(f"\nConfusion Matrix for {model_name}:")
    print(result['confusion_matrix'])
output
Training Logistic Regression...

Logistic Regression - Accuracy: 100.00%


Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Training K-Nearest Neighbors (KNN)...

K-Nearest Neighbors (KNN) - Accuracy: 100.00%


Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Training Support Vector Machine (SVM)...

Support Vector Machine (SVM) - Accuracy: 100.00%


Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Training Decision Tree Classifier...

Decision Tree Classifier - Accuracy: 100.00%


Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Training Random Forest Classifier...

Random Forest Classifier - Accuracy: 100.00%


Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Tuning Random Forest Classifier...


Best parameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1,
'min_samples_split': 2, 'n_estimators': 150}
Confusion Matrix for Logistic Regression:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Confusion Matrix for K-Nearest Neighbors (KNN):


[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Confusion Matrix for Support Vector Machine (SVM):


[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Confusion Matrix for Decision Tree Classifier:


[[10 0 0]
[ 0 9 0]
[ 0 0 11]]

Confusion Matrix for Random Forest Classifier:


[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
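
The code above imports cross_val_score but evaluates each model only on a single train/test split; since every model reaches 100% on this small test set, cross-validation gives a more informative comparison. A brief sketch (reusing the models dictionary and the scaled training data defined above) is:

# 5-fold cross-validated accuracy for each model on the training data
for model_name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"{model_name}: mean CV accuracy = {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")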
