
Concrete Strength.ipynb - Colab

The document outlines a machine learning workflow for predicting concrete strength using a dataset that includes various components of concrete. It details steps for data preprocessing, including handling missing values and outliers, normalizing data, and splitting the dataset into training and testing sets. The document also describes the training and evaluation of different regression models, including Decision Tree, Random Forest, and XGBoost, along with their performance metrics.


# Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from scipy.stats import zscore

# Load the dataset

data = pd.read_csv('/content/ConcreteStrengthData.csv')

# Display the first few rows of the data

print("Initial Data:\n", data.head())

Initial Data:
CementComponent BlastFurnaceSlag FlyAshComponent WaterComponent \
0 540.0 0.0 0.0 162.0
1 540.0 0.0 0.0 162.0
2 332.5 142.5 0.0 228.0
3 332.5 142.5 0.0 228.0
4 198.6 132.4 0.0 192.0

SuperplasticizerComponent CoarseAggregateComponent \
0 2.5 1040.0
1 2.5 1055.0
2 0.0 932.0
3 0.0 932.0
4 0.0 978.4

FineAggregateComponent AgeInDays Strength
0 676.0 28 79.99
1 676.0 28 61.89
2 594.0 270 40.27
3 594.0 365 41.05
4 825.5 360 44.30

# Step 1: Handle Missing Values

missing_value_strategy = input("Enter 'remove' to drop missing values or 'median' to fill with median: ")
if missing_value_strategy == 'remove':
    data = data.dropna()
    print("Missing values removed.")
elif missing_value_strategy == 'median':
    data = data.fillna(data.median())
    print("Missing values filled with median.")

Enter 'remove' to drop missing values or 'median' to fill with median: median
Missing values filled with median.


# Display data after handling missing values

print("\nData after handling missing values:\n", data.head())

Data after handling missing values:
CementComponent BlastFurnaceSlag FlyAshComponent WaterComponent \
0 540.0 0.0 0.0 162.0
1 540.0 0.0 0.0 162.0
2 332.5 142.5 0.0 228.0
3 332.5 142.5 0.0 228.0
4 198.6 132.4 0.0 192.0

SuperplasticizerComponent CoarseAggregateComponent \
0 2.5 1040.0
1 2.5 1055.0
2 0.0 932.0
3 0.0 932.0
4 0.0 978.4

FineAggregateComponent AgeInDays Strength
0 676.0 28 79.99
1 676.0 28 61.89
2 594.0 270 40.27
3 594.0 365 41.05
4 825.5 360 44.30

# Step 2: Handle Outliers using IQR and Z-score

def remove_outliers_iqr(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

def remove_outliers_zscore(df, threshold=3):
    z_scores = np.abs(zscore(df))
    return df[(z_scores < threshold).all(axis=1)]

outlier_method = input("Enter 'iqr' for IQR method or 'zscore' for Z-score method to handle outliers: ")
if outlier_method == 'iqr':
    data = remove_outliers_iqr(data)
    print("Outliers handled using IQR.")
elif outlier_method == 'zscore':
    data = remove_outliers_zscore(data)
    print("Outliers handled using Z-score.")

Enter 'iqr' for IQR method or 'zscore' for Z-score method to handle outliers: zscore
Outliers handled using Z-score.

# Display data after handling outliers

print("\nData after handling outliers:\n", data.head())


Data after handling outliers:
CementComponent BlastFurnaceSlag FlyAshComponent WaterComponent \
0 540.0 0.0 0.0 162.0
1 540.0 0.0 0.0 162.0
5 266.0 114.0 0.0 228.0
7 380.0 95.0 0.0 228.0
8 266.0 114.0 0.0 228.0

SuperplasticizerComponent CoarseAggregateComponent \
0 2.5 1040.0
1 2.5 1055.0
5 0.0 932.0
7 0.0 932.0
8 0.0 932.0

FineAggregateComponent AgeInDays Strength
0 676.0 28 79.99
1 676.0 28 61.89
5 670.0 90 47.03
7 594.0 28 36.45
8 670.0 28 45.85

# Step 3: Normalize Data

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
data = pd.DataFrame(scaled_data, columns=data.columns)
print("\nData after normalization:\n", data.head())

Data after normalization:
CementComponent BlastFurnaceSlag FlyAshComponent WaterComponent \
0 2.560014 -0.858514 -0.88112 -0.931973
1 2.560014 -0.858514 -0.88112 -0.931973
2 -0.112045 0.480231 -0.88112 2.346817
3 0.999687 0.257107 -0.88112 2.346817
4 -0.112045 0.480231 -0.88112 2.346817

SuperplasticizerComponent CoarseAggregateComponent \
0 -0.673726 0.839761
1 -0.673726 1.032749
2 -1.129625 -0.549747
3 -1.129625 -0.549747
4 -1.129625 -0.549747

FineAggregateComponent AgeInDays Strength
0 -1.288508 -0.229254 2.672454
1 -1.288508 -0.229254 1.590217
2 -1.365815 1.453139 0.701707
3 -2.345042 -0.229254 0.069106
4 -1.365815 -0.229254 0.631152
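
Note that the scaler above was fit on the entire dataset, including the Strength target, so all error metrics reported later are in standardized units rather than MPa. A minimal alternative sketch, reusing the imports from the first cell and applied in place of the cell above (before data is overwritten with scaled values), that scales only the input features and fits the scaler on the training split alone to avoid test-set leakage:

# Keep the target in its original units and fit the scaler on training data only,
# so no statistics from the test set leak into the transformation.
X = data.drop(columns=['Strength'])
y = data['Strength']  # target stays in MPa

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the training split
X_test = scaler.transform(X_test)        # apply the same statistics to the test split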

# Splitting the data into features and target

X = data.drop(columns=['Strength'])  # 'Strength' is the target column
y = data['Strength']


# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the shapes of the split datasets

print("Training feature set shape:", X_train.shape)
print("Testing feature set shape:", X_test.shape)
print("Training target set shape:", y_train.shape)
print("Testing target set shape:", y_test.shape)

Training feature set shape: (784, 8)
Testing feature set shape: (197, 8)
Training target set shape: (784,)
Testing target set shape: (197,)

# Define Models
models = {
"Decision Tree": DecisionTreeRegressor(random_state=42),
"Random Forest": RandomForestRegressor(random_state=42),
"XGBoost": XGBRegressor(random_state=42, objective='reg:squarederror')
}

# Hyperparameter tuning for Random Forest and XGBoost using GridSearchCV

rf_param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10]
}

xgb_param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10],
'learning_rate': [0.01, 0.1, 0.3]
}

# Tune and fit models

for model_name, model in models.items():
    if model_name == "Random Forest":
        grid_search = GridSearchCV(estimator=model, param_grid=rf_param_grid, cv=3,
                                   scoring='neg_mean_squared_error')  # scoring metric assumed
        grid_search.fit(X_train, y_train)
        models[model_name] = grid_search.best_estimator_
        print(f"\n{model_name} Best Parameters: {grid_search.best_params_}")
    elif model_name == "XGBoost":
        grid_search = GridSearchCV(estimator=model, param_grid=xgb_param_grid, cv=3,
                                   scoring='neg_mean_squared_error')  # scoring metric assumed
        grid_search.fit(X_train, y_train)
        models[model_name] = grid_search.best_estimator_
        print(f"\n{model_name} Best Parameters: {grid_search.best_params_}")
    else:
        model.fit(X_train, y_train)

Random Forest Best Parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimat

XGBoost Best Parameters: {'learning_rate': 0.3, 'max_depth': 5, 'n_estimators': 200}
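
GridSearchCV also stores the cross-validated score of the winning configuration; a small sketch of how it could be printed inside the loop above, right after each fit (with the assumed negative-MSE scoring, values closer to zero are better):

# best_score_ is the mean cross-validated score achieved by best_params_
print(f"{model_name} best CV score: {grid_search.best_score_:.4f}")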

# Model Evaluation Function

def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    return mae, r2, rmse
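
As an aside, recent scikit-learn releases (1.4 and later) expose RMSE directly, so the np.sqrt wrapper is optional; a one-line equivalent, assuming such a version is installed:

from sklearn.metrics import root_mean_squared_error  # requires scikit-learn >= 1.4
rmse = root_mean_squared_error(y_test, predictions)  # same value as np.sqrt(MSE)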

# Evaluate each model

for model_name, model in models.items():
    mae, r2, rmse = evaluate_model(model, X_test, y_test)
    print(f"\n{model_name} Evaluation Metrics:")
    print(f" MAE: {mae}\n R²: {r2}\n RMSE: {rmse}")

Decision Tree Evaluation Metrics:
 MAE: 0.3109036376665524
 R²: 0.7531568581871906
 RMSE: 0.4824773835905145

Random Forest Evaluation Metrics:
 MAE: 0.23158315377649288
 R²: 0.8737109631559321
 RMSE: 0.3451034120134297

XGBoost Evaluation Metrics:
 MAE: 0.1958459295404801
 R²: 0.8916473524633806
 RMSE: 0.3196584515582839

# Delta Buckets Evaluation for Each Model

def delta_buckets(y_true, y_pred, delta=10):
    differences = np.abs(y_true - y_pred)
    within_delta = np.sum(differences <= delta) / len(differences)
    return within_delta * 100  # Percentage of predictions within delta

# Display Delta Bucket Evaluation

for model_name, model in models.items():
    y_pred = model.predict(X_test)
    delta_10 = delta_buckets(y_test, y_pred, delta=10)
    print(f"\n{model_name} - Percentage of predictions within ±10 units of true value: {delta_10}%")

Decision Tree - Percentage of predictions within ±10 units of true value: 100.0%

Random Forest - Percentage of predictions within ±10 units of true value: 100.0%

XGBoost - Percentage of predictions within ±10 units of true value: 100.0%
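
The 100% figures are expected rather than impressive: Strength was standardized to unit variance in Step 3, so a ±10 band spans roughly ±10 standard deviations and covers the entire target range. A sketch of a more informative check, mapping predictions back to MPa with the statistics stored by the fitted scaler (assuming the scaler and data objects from Step 3):

# Recover the original mean and standard deviation of the Strength column.
strength_idx = list(data.columns).index('Strength')
mean_s, std_s = scaler.mean_[strength_idx], scaler.scale_[strength_idx]

for model_name, model in models.items():
    # Convert standardized predictions and targets back to MPa before bucketing.
    y_pred_mpa = model.predict(X_test) * std_s + mean_s
    y_test_mpa = y_test * std_s + mean_s
    delta_10 = delta_buckets(y_test_mpa, y_pred_mpa, delta=10)
    print(f"{model_name} - within ±10 MPa of true value: {delta_10:.1f}%")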

# Function to evaluate model performance

def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    return mae, mse, r2

# Initialize models
models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42, objective='reg:squarederror')
}

# Dictionary to store model performance

performance = {}

# Evaluate each model

for model_name, model in models.items():
    mae, mse, r2 = evaluate_model(model, X_train, X_test, y_train, y_test)
    performance[model_name] = {
        "Mean Absolute Error": mae,
        "Mean Squared Error": mse,
        "R^2 Score": r2
    }
    print(f"{model_name} Performance:")
    print(f"Mean Absolute Error: {mae:.2f}")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R^2 Score: {r2:.2f}")
    print("-" * 30)

# Display performance comparison

import pandas as pd
performance_df = pd.DataFrame(performance).T
print("\nPerformance Comparison:")
print(performance_df)

Decision Tree Performance:
Mean Absolute Error: 0.31
Mean Squared Error: 0.23
R^2 Score: 0.75
------------------------------
Random Forest Performance:
Mean Absolute Error: 0.23
Mean Squared Error: 0.12
R^2 Score: 0.87
------------------------------
XGBoost Performance:
Mean Absolute Error: 0.20
Mean Squared Error: 0.11
R^2 Score: 0.89

------------------------------

Performance Comparison:
Mean Absolute Error Mean Squared Error R^2 Score
Decision Tree 0.310904 0.232784 0.753157
Random Forest 0.230687 0.120433 0.872293
XGBoost 0.199964 0.107200 0.886325
