ML Programs
1. Write a Python program to compute Central Tendency Measures (Mean, Median, Mode) and Measures of Dispersion (Range, Variance, Standard Deviation)
import numpy as np
import statistics

def compute_central_tendency_and_dispersion(data):
    # Central Tendency Measures
    mean = np.mean(data)
    median = np.median(data)
    try:
        mode = statistics.mode(data)
    except statistics.StatisticsError:
        mode = "No unique mode"
    # Dispersion Measures
    data_range = np.ptp(data)          # Range: max - min
    variance = np.var(data)            # Population variance
    standard_deviation = np.std(data)  # Population standard deviation
    # Printing Results
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print(f"Range: {data_range}")
    print(f"Variance: {variance}")
    print(f"Standard Deviation: {standard_deviation}")

# Example data
data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]
compute_central_tendency_and_dispersion(data)
Explanation:
Central Tendency:
o Mean: The average of all values.
o Median: The middle value when the data is sorted.
o Mode: The value that occurs most frequently (if no unique mode exists, the function returns a message instead).
Dispersion:
o Range: The difference between the maximum and minimum values.
o Variance: A measure of how spread out the data is around the mean.
o Standard Deviation: The square root of the variance, representing the spread in the same units as the data.
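One detail worth noting: np.var and np.std use the population formulas (dividing by n), while the statistics module shown in the next section uses the sample formulas (dividing by n - 1), so the two give different numbers for the same data. A quick check of both conventions:
import numpy as np
import statistics

data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]
print(np.var(data))               # population variance: 19.01
print(statistics.variance(data))  # sample variance: about 21.12
print(np.var(data, ddof=1))       # ddof=1 makes NumPy use the sample formula as well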
2. Study of Python Basic Libraries such as Statistics, Math, NumPy and SciPy
1. Statistics Library:
The statistics module provides functions to compute mathematical statistics of numeric data.
It is a built-in library that does not require installation.
Common functions:
import statistics
data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)
stdev = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {stdev}")
Output:
Mean: 15.7
Median: 15.0
Mode: 12
Variance: 21.122222222222224
Standard Deviation: 4.595891885393109
2. Math Library:
The math module provides standard mathematical functions (square roots, factorials, logarithms, trigonometry) and constants such as pi and e. It is part of the standard library.
Common functions:
import math
number = 16
sqrt_val = math.sqrt(number)
factorial_val = math.factorial(5)
log_val = math.log(100, 10)
print(f"Square root of {number}: {sqrt_val}")
print(f"Factorial of 5: {factorial_val}")
print(f"Log base 10 of 100: {log_val}")
Output:
Square root of 16: 4.0
Factorial of 5: 120
Log base 10 of 100: 2.0
3. NumPy Library:
NumPy is a powerful library for numerical computations in Python. It provides support for
large multi-dimensional arrays and matrices, as well as a collection of high-level
mathematical functions to operate on these arrays.
Key features:
import numpy as np
data = [12, 15, 12, 10, 18, 20, 25, 12, 15, 18]
# Basic stats
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
print(f"NumPy Mean: {mean}")
print(f"NumPy Median: {median}")
print(f"NumPy Standard Deviation: {std_dev}")
Output:
NumPy Mean: 15.7
NumPy Median: 15.0
NumPy Standard Deviation: 4.360045871318328
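The basic statistics above only touch one side of NumPy. A minimal sketch of the array and matrix features mentioned in the description (the array values here are arbitrary):
import numpy as np

a = np.array([[1, 2], [3, 4]])    # 2x2 matrix
b = np.arange(6).reshape(2, 3)    # 2x3 matrix built from a range
print(a.T)                        # transpose
print(a @ np.array([[5], [6]]))   # matrix-vector product
print(b * 10)                     # vectorized (element-wise) operation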
4. SciPy Library:
SciPy builds on top of NumPy and provides a collection of algorithms and mathematical tools
for scientific computing. It covers optimization, integration, interpolation, eigenvalue
problems, and more.
Key Features:
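The code that produced the output below is not included; the following is a minimal sketch using scipy.stats that matches the first output line (the t-test samples are assumptions, so its exact numbers depend on the data used):
from scipy import stats

# Probability density of the standard normal distribution at x = 3
print("PDF value of Normal Distribution at x=3:", stats.norm.pdf(3))

# Independent two-sample t-test on two small (assumed) samples
sample1 = [12, 15, 12, 10, 18]
sample2 = [14, 13, 16, 11, 12]
t_stat, p_value = stats.ttest_ind(sample1, sample2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")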
Output:
PDF value of Normal Distribution at x=3: 0.0044318484119380075
T-statistic: 0.8178608201095307, P-value: 0.4371160707340265
Statistics: Best for basic statistical operations (mean, median, variance, etc.).
Math: For basic math operations, constants, and trigonometric functions.
NumPy: Best for handling large datasets, array manipulation, vectorized operations, and matrix operations.
SciPy: Provides higher-level operations for scientific and engineering tasks like optimization, integration, and advanced statistical tests.
Summary:
statistics is simple and useful for basic statistical measures (mean, median, variance).
math is perfect for simple mathematical operations and constants.
numpy is used for high-performance numerical computing, especially when handling large datasets or working with arrays.
scipy is built on top of NumPy and extends its capabilities with advanced tools for optimization, integration, signal processing, and more.
3. Study of Python Libraries for ML applications such as Pandas and Matplotlib
1. Pandas Library
Pandas is a powerful library for data manipulation and analysis. It provides easy-to-use data
structures and data analysis tools, particularly for working with structured data (i.e., data in
the form of tables, such as CSV, Excel files, SQL databases, or JSON).
Key Features:
Loading Data: Pandas supports reading data from various formats, including CSV,
Excel, SQL, and JSON.
Data Selection and Indexing: You can select rows, columns, and perform slicing.
Data Aggregation: Group by operations to aggregate data based on categories.
Data Cleaning: Handling missing values, dropping duplicates, etc.
Data Transformation: Operations like normalization, encoding, and scaling.
import pandas as pd
# Create a simple DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [23, 35, 40, 28, 22],
'Salary': [70000, 80000, 120000, 90000, 95000]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Basic Statistics
print("\nBasic Statistics:")
print(df.describe()) # Get summary statistics
# Filtering Data
print("\nFiltered Data (Age > 30):")
filtered_df = df[df['Age'] > 30]
print(filtered_df)
# Sorting Data
print("\nSorted Data by Salary:")
sorted_df = df.sort_values(by='Salary', ascending=False)
print(sorted_df)
# Handling Missing Data (if present)
df_with_nan = df.copy()
df_with_nan.loc[2, 'Age'] = None # Introduce a NaN value (using .loc avoids chained-assignment issues)
print("\nData with Missing Values:")
print(df_with_nan)
df_cleaned = df_with_nan.dropna() # Remove rows with NaN values
print("\nCleaned Data:")
print(df_cleaned)
Output (abridged):
Original DataFrame:
      Name  Age  Salary
0    Alice   23   70000
1      Bob   35   80000
2  Charlie   40  120000
3    David   28   90000
4      Eva   22   95000
Basic Statistics:
Age Salary
count 5.000000 5.000000
mean 29.600000 91000.000000
std 7.765307 18841.443681
min 22.000000 70000.000000
25% 23.000000 80000.000000
50% 28.000000 90000.000000
75% 35.000000 95000.000000
max 40.000000 120000.000000
Cleaned Data:
Name Age Salary
0 Alice 23.0 70000
1 Bob 35.0 80000
3 David 28.0 90000
4 Eva 22.0 95000
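Group-by aggregation and data transformation are listed among the key features but not exercised above; a brief sketch reusing the same data (the Department column is added here purely for illustration):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [23, 35, 40, 28, 22],
    'Salary': [70000, 80000, 120000, 90000, 95000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'Finance']  # assumed column for the group-by demo
})

# Data Aggregation: average salary per department
print(df.groupby('Department')['Salary'].mean())

# Data Transformation: min-max scaling of the Salary column
df['Salary_scaled'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
print(df[['Name', 'Salary', 'Salary_scaled']])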
2. Matplotlib Library
Matplotlib is a widely used library for creating static, animated, and interactive visualizations in Python.
Key Features:
2D plotting: Create line plots, bar plots, histograms, scatter plots, and more.
Customizable plots: Customize titles, labels, legends, and axes.
Subplots: Ability to create multiple plots in one figure.
Animation: Create animated visualizations.
Interactive plots: Supports interactive visualizations in Jupyter Notebooks.
import matplotlib.pyplot as plt
# Bar Plot
categories = ['A', 'B', 'C', 'D']
values = [7, 15, 23, 8]
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
# Scatter Plot
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
# Histogram
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5]
plt.hist(data, bins=5)
plt.title('Histogram')
plt.xlabel('Bins')
plt.ylabel('Frequency')
plt.show()
# Box Plot
data = [ [1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [7, 8, 9, 10, 11] ]
plt.boxplot(data)
plt.title('Box Plot')
plt.show()
Output: the script displays four figures in turn: a bar plot, a scatter plot, a histogram, and a box plot.
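The line-plot and subplot features listed above are not covered by these snippets; a minimal sketch:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# Two subplots in one figure: a line plot and a bar plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y, marker='o')
ax1.set_title('Line Plot')
ax2.bar(x, y)
ax2.set_title('Bar Plot')
plt.tight_layout()
plt.show()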
4. Write a Python program to implement Simple Linear Regression
1. Import Libraries:
We'll need numpy for mathematical operations, matplotlib for visualization, and pandas for data manipulation.
2. Create the Dataset:
We will create a simple dataset with an independent variable (X) and a dependent variable (Y).
3. Calculate the Parameters:
Calculate the slope and intercept of the line using the least-squares formulas:
m = Σ[(Xi − mean(X)) · (Yi − mean(Y))] / Σ[(Xi − mean(X))²]
b = mean(Y) − m · mean(X)
4. Make Predictions:
Once we have the slope and intercept, we can make predictions using the formula Y = mX + b.
5. Visualize:
Finally, we will plot the data and the regression line using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a simple dataset (values assumed for illustration) and a DataFrame
data = {'Years_of_Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Salary': [40000, 42000, 44000, 46000, 48000, 50000, 52000, 54000, 56000, 58000]}
df = pd.DataFrame(data)
X, Y = df['Years_of_Experience'].values, df['Salary'].values
# Calculate the slope (m) and intercept (b) with the least-squares formulas
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - m * X.mean()
Y_pred = m * X + b  # Make predictions: Y = mX + b
print(f"Slope (m): {m}")
print(f"Intercept (b): {b}")
# Step 6: (Optional) Evaluate the model using the R-squared value (goodness of fit)
# The R-squared value represents the proportion of the variance in Y that is explained by X.
ss_total = np.sum((Y - np.mean(Y)) ** 2)  # Total sum of squares
ss_residual = np.sum((Y - Y_pred) ** 2)   # Residual sum of squares
r_squared = 1 - (ss_residual / ss_total)
print(f"R-squared: {r_squared}")
# Visualize the data points (blue) and the regression line (red)
plt.scatter(X, Y, color='blue'); plt.plot(X, Y_pred, color='red'); plt.show()
Output:
Slope (m): 2000.0
Intercept (b): 38000.0
Explanation of the Code:
1. Dataset:
o We created a simple dataset with Years_of_Experience and corresponding Salary.
2. Parameter Calculation:
o We computed the slope (m) and intercept (b) with the least-squares formulas.
3. Prediction:
o Using the formula Y = mX + b, we predicted Y values (Salaries) for each X value (Years of Experience).
4. Visualization:
o We plotted both the actual data points (in blue) and the regression line (in red)
using matplotlib.
5. R-squared:
o An optional evaluation metric, the R-squared value, is calculated to measure
how well the regression line fits the data. An R-squared value close to 1
indicates a good fit.
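As a quick cross-check of the manual calculation, NumPy's polyfit returns the same slope and intercept (a sketch, reusing X and Y from the script above):
import numpy as np
# Fit a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(X, Y, 1)
print(f"Slope (m): {slope}, Intercept (b): {intercept}")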
Implementing Multiple Linear Regression for house price prediction using scikit-learn is a
common use case in machine learning. Multiple Linear Regression is an extension of Simple
Linear Regression that uses more than one independent variable (feature) to predict the
dependent variable (target).
1. Load Data: We will use a dataset that contains multiple features (e.g., size of the
house, number of bedrooms, location, etc.) to predict the house price.
2. Preprocess the Data: This step involves handling missing data, encoding categorical
variables, and scaling the features if necessary.
3. Split Data: Divide the data into training and testing sets.
4. Train the Model: Fit the model using the training data.
5. Evaluate the Model: Evaluate the performance of the model on the test data.
6. Predict: Use the model to predict house prices for new data.
# Step 1: Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Step 2: Create a larger sample dataset (Bedrooms, Age_of_House and Price values are assumed for illustration)
data = {
    'Square_Feet': [1500, 1800, 2400, 3000, 3500, 4000, 4500, 5000, 5500, 6000],
    'Bedrooms': [3, 3, 4, 4, 5, 5, 5, 6, 6, 7],
    'Age_of_House': [10, 15, 20, 5, 8, 12, 4, 3, 20, 25],
    'Price': [300000, 340000, 420000, 500000, 560000, 620000, 700000, 760000, 800000, 850000]
}
df = pd.DataFrame(data)
# Step 3: Separate the features (X) and the target (y)
X = df[['Square_Feet', 'Bedrooms', 'Age_of_House']]
y = df['Price']
# Step 4: Split the data into training and testing sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Train the Multiple Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 6: Make predictions on the test set
y_pred = model.predict(X_test)
# Step 7: Evaluate the model with the R-squared score
print("R-squared on test data:", r2_score(y_test, y_pred))
# Step 8: Visualize the predictions (actual vs. predicted prices)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
# Step 9: Print the model's intercept and coefficients
print("Intercept (b):", model.intercept_)
print("Coefficients (m):", model.coef_)
# Step 10: Use the model to predict a new data point (feature values assumed)
new_data = pd.DataFrame({'Square_Feet': [3200], 'Bedrooms': [4], 'Age_of_House': [10]})
predicted_price = model.predict(new_data)
print("Predicted price for the new house:", predicted_price[0])
Output: the exact numbers depend on the dataset used; the script prints the R-squared score, the model's intercept and coefficients, and the predicted price for the new data point, and displays the actual vs. predicted scatter plot.
Explanation of the Code:
1. Loading Data:
o In this example, we manually created a small dataset with Square_Feet,
Bedrooms, Age_of_House, and Price (which is the target variable).
o Normally, you would load a dataset using pd.read_csv('your_dataset.csv').
5. Making Predictions:
o After training the model, we use it to predict the house prices on the testing set
X_test.
7. Visualizing Predictions:
o We plot the actual vs predicted house prices using a scatter plot to visually
inspect the model's performance.
8. Model Coefficients:
o We print the model's intercept (b) and coefficients (m), which tell us how
much each feature contributes to the prediction.
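To see how each feature contributes, a prediction can be reproduced by hand from the fitted parameters (a sketch, reusing model and new_data from the script above):
import numpy as np
# Manually recompute the prediction: intercept + sum(coefficient * feature value)
x_new = new_data.iloc[0].values
manual_prediction = model.intercept_ + np.dot(model.coef_, x_new)
print("Manual prediction:", manual_prediction)
print("model.predict agrees:", model.predict(new_data)[0])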
A Decision Tree is a popular supervised machine learning algorithm used for both regression
and classification tasks. It splits the data into subsets based on feature values and builds a
tree-like structure to predict the target variable.
We'll implement a Decision Tree Classifier for classification or a Decision Tree Regressor
for regression, and we'll also explore how to tune the hyperparameters to improve the model's
performance using Grid Search.
For this example, I'll implement a Decision Tree Classifier and perform parameter tuning
using GridSearchCV.
Steps Involved:
1. Import the required libraries.
2. Load the Iris dataset.
3. Split the data into training and testing sets.
4. Train a baseline Decision Tree Classifier and evaluate it.
5. Define a hyperparameter grid and tune the model with GridSearchCV.
6. Evaluate the tuned model on the test set.
# Step 1: Import the required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
# Step 2: Load the dataset. For simplicity, we're using the well-known Iris dataset that comes with sklearn.
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels
# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train a baseline Decision Tree Classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# Step 5: Evaluate the baseline model
y_pred = model.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Step 6: Define the hyperparameter grid (the candidate values here are illustrative)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 6, 8],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}
# Step 7: Create the GridSearchCV object to search for the best hyperparameters
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
# Step 9: Get the best parameters from GridSearchCV and evaluate the tuned model
best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
y_pred_tuned = best_model.predict(X_test)
print("Tuned accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_tuned))
Output (abridged):
Classification Report (baseline model): accuracy 0.97 (15 test samples)
Classification Report (tuned model): accuracy 1.00 (15 test samples)
1. Import Libraries:
o We import necessary libraries like pandas, numpy, train_test_split for splitting
the dataset, DecisionTreeClassifier for the model, GridSearchCV for
hyperparameter tuning, and classification_report for performance evaluation.
2. Load Dataset:
o We use the Iris dataset, which is a simple classification dataset that contains
three classes of flowers, each with four features (sepal length, sepal width,
petal length, and petal width).
3. Split Data:
o The dataset is split into training (80%) and testing (20%) sets using
train_test_split.
6. Hyperparameter Tuning:
o We use GridSearchCV to search for the best combination of hyperparameters.
The parameters to tune include:
criterion: The function to measure the quality of a split, either 'gini' or
'entropy'.
max_depth: The maximum depth of the tree.
min_samples_split: The minimum number of samples required to split
an internal node.
min_samples_leaf: The minimum number of samples required to be at
a leaf node.
max_features: The number of features to consider when looking for the
best split.
o We use 5-fold cross-validation to evaluate each combination of parameters
and parallelize the grid search with n_jobs=-1.
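Once the best estimator is found, the fitted tree can also be drawn, which often makes the chosen hyperparameters easier to interpret (a sketch, reusing best_model and data from the script above):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(best_model, feature_names=data.feature_names, class_names=data.target_names, filled=True)
plt.show()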
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both
classification and regression. For classification, it predicts the class of a sample based on the
majority class of its K-nearest neighbors in the training dataset.
Here, I will demonstrate how to implement KNN for classification using Scikit-learn. We
will:
1. Load a dataset.
2. Preprocess the data (split into training and testing).
3. Train the KNN model.
4. Evaluate the model using accuracy and other classification metrics.
5. Optionally, perform hyperparameter tuning to find the optimal value of K (the
number of neighbors).
1. Import Libraries: Import necessary libraries for loading the dataset, model training,
and evaluation.
2. Load Dataset: We'll use the popular Iris dataset for this example.
3. Preprocess Data: Split the data into training and testing sets.
4. Train the Model: Create and train a KNN classifier using Scikit-learn.
5. Evaluate the Model: Use accuracy and other classification metrics to evaluate the
model.
6. Hyperparameter Tuning: Perform hyperparameter tuning to find the best number of
neighbors (K).
# Import the libraries and load the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train and evaluate a KNN classifier for a range of K values
k_values, accuracy_scores = range(1, 21), []
for k in k_values:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train, y_train)
    y_pred_temp = knn_temp.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_temp))
print("Best K:", list(k_values)[accuracy_scores.index(max(accuracy_scores))])
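A simple way to choose K is to plot the accuracies collected in the loop above (a sketch, reusing k_values and accuracy_scores):
import matplotlib.pyplot as plt
plt.plot(list(k_values), accuracy_scores, marker='o')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Accuracy')
plt.title('KNN Accuracy for Different K Values')
plt.show()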
Output (abridged):
Accuracy: 72.22%
Classification Report (baseline):
              precision    recall  f1-score   support
    accuracy                           0.72        36
   macro avg       0.67      0.67      0.67        36
weighted avg       0.72      0.72      0.72        36
Classification Report (best K):
    accuracy                           0.78        36
   macro avg       0.75      0.76      0.75        36
weighted avg       0.79      0.78      0.78        36
Logistic Regression is a simple and widely used classification algorithm. Despite its name, it
is a classification algorithm, not a regression algorithm. It is used to predict the probability of
a categorical dependent variable based on one or more independent variables.
Here, we'll implement a Logistic Regression classifier using Scikit-learn on a popular dataset
called the Iris dataset (for simplicity), which is a multiclass classification problem.
# Step 1: Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Step 2: Load the Iris dataset
iris = load_iris()
X = iris.data    # Features (sepal length, sepal width, petal length, petal width)
y = iris.target  # Target variable (species)
# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train and evaluate a baseline Logistic Regression model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
# Step 5: Tune C and solver with GridSearchCV (the candidate values here are illustrative)
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'solver': ['lbfgs', 'newton-cg']}
# Initialize GridSearchCV
grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_logreg = grid_search.best_estimator_
y_pred_best = best_logreg.predict(X_test)
print(f"Tuned accuracy: {accuracy_score(y_test, y_pred_best) * 100:.2f}%")
# Step 10: Plot the accuracy with different regularization strength values (C)
C_values = param_grid['C']
accuracy_scores = []
for C in C_values:
    logreg_temp = LogisticRegression(C=C, max_iter=200)
    logreg_temp.fit(X_train, y_train)
    y_pred_temp = logreg_temp.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred_temp))
plt.plot(C_values, accuracy_scores, marker='o')
plt.xscale('log')
plt.xlabel('Regularization strength C')
plt.ylabel('Accuracy')
plt.show()
Output (abridged):
Accuracy: 94.44%
Classification Report (baseline model): accuracy 0.94 (30 test samples)
Classification Report (tuned model): accuracy 0.97 (30 test samples)
1. Import Libraries:
o We use libraries like numpy, pandas, and matplotlib for data manipulation and
visualization.
o We import Scikit-learn modules to load the dataset, split the data, create the
Logistic Regression model, and evaluate the performance.
2. Load Dataset:
o We use the Iris dataset (a multiclass classification dataset) available directly
from Scikit-learn using load_iris().
o X contains the features (sepal length, sepal width, petal length, petal width),
and y contains the target variable (species).
3. Train-Test Split:
o We split the dataset into training and testing sets using train_test_split() (80%
for training, 20% for testing).
5. Model Evaluation:
o After training, we use predict() to make predictions on the test set (X_test).
o We calculate the accuracy using accuracy_score() and print the classification
report to check the precision, recall, and F1-score for each class.
6. Hyperparameter Tuning:
o We perform hyperparameter tuning using GridSearchCV to find the best
hyperparameters for the model.
o We tune the solver and C (regularization strength) parameters.
solver: Algorithm to use for optimization (e.g., liblinear, lbfgs,
newton-cg).
C: Inverse of regularization strength; smaller values mean stronger regularization.
o We use cross-validation (cv=5) to evaluate the model and find the best
parameters.
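Because logistic regression predicts class probabilities, the probability estimates behind a prediction can be inspected directly (a sketch, reusing best_logreg, X_test and iris from the script above):
# Probability of each Iris species for the first test sample
probabilities = best_logreg.predict_proba(X_test[:1])
for species, p in zip(iris.target_names, probabilities[0]):
    print(f"{species}: {p:.3f}")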
K-Means is an unsupervised clustering algorithm that partitions data into K clusters by repeatedly assigning each point to its nearest centroid and updating the centroids. We will apply it to the Iris dataset using Scikit-Learn, following these steps:
1. Load a Dataset: We'll use the Iris dataset, a commonly used dataset for clustering problems.
2. Preprocess Data: Ensure the data is in the right format.
3. Apply K-Means Clustering: Use Scikit-Learn's KMeans class.
4. Evaluate the Results: Visualize the clustering and check the accuracy of the model
(though accuracy is not a direct measure for unsupervised learning).
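The code for these steps is not included above; the following is a minimal sketch that matches the explanation below (the exact silhouette score and centroid values depend on the run):
# Import libraries and load the Iris dataset
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data

# Standardize the features so each contributes equally to the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Apply K-Means with 3 clusters (one per Iris species)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

# Visualize the clusters using the first two (scaled) features, centroids marked with red X
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100)
plt.title('K-Means Clusters (first two features)')
plt.show()

# Evaluate the clustering
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.4f}")
print("Cluster Centers (Centroids):")
print(kmeans.cluster_centers_)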
Output:
Silhouette Score: 0.4799
Cluster Centers (Centroids):
[[ 0.57100359 -0.37176778 0.69111943 0.66315198]
[-0.81623084 1.31895771 -1.28683379 -1.2197118 ]
[-1.32765367 -0.373138 -1.13723572 -1.11486192]]
Explanation of Code:
1. Data Loading:
o We load the Iris dataset from Scikit-Learn. This dataset consists of 4 features
(sepal length, sepal width, petal length, and petal width) and the target labels
(setosa, versicolor, and virginica).
2. Data Standardization:
o We standardize the data using StandardScaler to ensure that each feature
contributes equally to the distance metric. This is important because K-Means
uses distance to assign points to clusters.
3. Applying K-Means:
o We set n_clusters=3 because there are three species in the Iris dataset, and we
want to find 3 clusters.
o KMeans(n_clusters=3) is the key function used to perform K-Means
clustering.
o We fit the model on the scaled data using .fit() and predict the cluster labels
using .predict().
4. Visualization:
o We use a scatter plot to visualize how the data points are clustered. We use the
first two features for simplicity. Points are colored based on the predicted
cluster labels, and the centroids of the clusters are marked with red X symbols.
5. Evaluation:
o Silhouette Score: This metric measures how similar points within a cluster are
compared to points in other clusters. A higher silhouette score indicates well-
defined clusters.
o We print the cluster centroids, which are the coordinates of the cluster
centers that the K-Means algorithm has identified.
We'll be using the Iris dataset, which is a widely used dataset in machine learning. It
contains 150 samples of iris flowers, with 4 features: sepal length, sepal width, petal length,
and petal width, and a target variable (species) with 3 possible classes: setosa, versicolor,
and virginica.
We will train and compare the following classification algorithms:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machine (SVM)
4. Decision Tree Classifier
5. Random Forest Classifier
Steps:
1. Load the Iris dataset.
2. Split the data into training and testing sets.
3. Standardize the features.
4. Train each of the five classifiers.
5. Evaluate each model with accuracy, a classification report, and a confusion matrix.
6. Compare the accuracies with a bar chart.
# Step 1: Import the required libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Step 2: Load the Iris dataset
X, y = load_iris(return_X_y=True)
# Step 3: Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Standardize the features (important for models like Logistic Regression, KNN, and SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 5: Define the models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'K-Nearest Neighbors (KNN)': KNeighborsClassifier(),
    'Support Vector Machine (SVM)': SVC(),
    'Decision Tree Classifier': DecisionTreeClassifier(random_state=42),
    'Random Forest Classifier': RandomForestClassifier(random_state=42)
}
# Step 6: Train and evaluate each model on the scaled features
results = {}
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    # Performance Metrics
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    # Store results
    results[model_name] = {
        'accuracy': accuracy,
        'classification_report': classification_rep,
        'confusion_matrix': confusion_mat
    }
    # Print Results
    print(f"\n{model_name} - Accuracy: {accuracy * 100:.2f}%")
    print("Classification Report:\n", classification_rep)
    print("Confusion Matrix:\n", confusion_mat)
# Step 7: Compare model accuracies with a horizontal bar chart
model_names = list(results.keys())
accuracies = [results[name]['accuracy'] for name in model_names]
plt.figure(figsize=(10, 6))
plt.barh(model_names, accuracies, color='skyblue')
plt.xlabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.show()
Output (abridged):
Each of the five classifiers reaches an accuracy of 1.00 on the 30 test samples (precision, recall, and F1-score of 1.00 for the macro and weighted averages), and all produce the same confusion matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]