Machine Learning-1
ASSIGNMENT-1
NAME : RAMAN MAURYA
REGNO : 21DBCAD023
Decision Tree Classifier
A decision tree is a non-parametric supervised learning algorithm that is used
for both classification and regression tasks. It has a hierarchical tree structure
consisting of a root node, branches, internal nodes, and leaf nodes.
Its primary purpose is to create a model that predicts the target variable (a
categorical label in classification problems) based on a set of input features.
It does this by recursively partitioning the input data into subsets, making decisions
at each step, ultimately forming a tree-like structure where the leaves represent the
class labels or predicted values.
Purpose of Decision Tree Classifier
Classification : Decision trees are commonly used for classification tasks, where
the goal is to predict a categorical target variable. This could include tasks like spam
detection, disease diagnosis, sentiment analysis, or customer churn prediction.
Feature Importance: Decision trees can implicitly rank the importance of input
features, allowing you to identify which features are most relevant for making
predictions (see the sketch below).
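A minimal sketch of the feature-importance point, using scikit-learn's bundled iris dataset (the exact scores depend on the fitted tree):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a small tree and inspect the implicit ranking of the input features
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0, max_depth=3).fit(iris.data, iris.target)

for name, score in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")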
Export_text
The export_text function in scikit-learn is used to generate a textual representation
of a decision tree model, which can be useful for understanding how the tree makes
predictions.
This textual representation provides information about the decision rules at each
node in the tree, including feature names, threshold values, and class predictions.
Debugging: When working with decision tree models, especially deep or complex
ones, export_text can be used for debugging. You can visually inspect the tree structure
and decision rules to identify potential issues or sources of misclassification.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the iris dataset; the 150 samples are ordered 50 per species
iris = load_iris()
X = iris['data']
y = ['setosa'] * 50 + ['versicolor'] * 50 + ['virginica'] * 50

# Fit a shallow tree so the printed rules stay readable
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=3)
decision_tree = decision_tree.fit(X, y)

# Print the decision rule, threshold, and class weights at each node
r = export_text(decision_tree, feature_names=iris['feature_names'],
                decimals=0, show_weights=True)
print(r)
Output:
Seaborn
Seaborn is a Python data visualization library based on Matplotlib.
Its primary purpose is to provide a high-level interface for creating attractive and
informative statistical graphics.
Aesthetically Pleasing Plots: Seaborn comes with a set of appealing color palettes
and themes that make it easy to create visually pleasing and publication-quality plots
with minimal customization.
Faceted Data Exploration: Seaborn provides built-in support for faceted data
exploration, allowing you to create multi-plot grids to examine interactions between
variables more easily (see the faceting sketch after the histogram example below).
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white")
# Draw 100 samples from a standard normal distribution with a fixed seed
rs = np.random.RandomState(10)
d = rs.normal(size=100)
# Histogram with a kernel density estimate overlaid, in magenta
sns.histplot(d, kde=True, color="m")
plt.show()
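A minimal faceting sketch, assuming seaborn's bundled "iris" sample dataset is available through load_dataset; it draws one histogram panel per species:
import matplotlib.pyplot as plt
import seaborn as sns

# One histogram panel per species, laid out side by side
iris_df = sns.load_dataset("iris")
sns.displot(iris_df, x="sepal_length", col="species")
plt.show()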
astype
In Pandas, the astype method is used to cast a Series or DataFrame to a specified
data type. Its primary purpose is to allow you to explicitly specify the data type for
columns in your DataFrame.
It can be useful for data type conversion, data manipulation, and data analysis tasks.
Purpose of astype
Data Type Conversion: The primary purpose of astype is to change the data type of
one or more columns in a DataFrame. This can be helpful when you need to ensure
that a column has the correct data type for your analysis or when you want to convert
data from one type to another (e.g., from a string to a numeric type).
Memory Optimization: By changing data types, you can reduce the memory usage of
your DataFrame. For example, converting integer columns to smaller integer types or
using float32 instead of float64 can lead to significant memory savings, which can be
crucial when dealing with large datasets (a small sketch follows this list).
Data Cleaning: It can be used to clean data by converting erroneous or inconsistent
data into the correct data type. For example, converting string representations of
numbers into actual numeric types.
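A small sketch of the memory-optimization point (exact byte counts vary with the Pandas version and index overhead): downcasting 64-bit integers to 16-bit cuts per-value storage from 8 bytes to 2.
import numpy as np
import pandas as pd

# 1,000 small integers stored as 64-bit, then downcast to 16-bit
df = pd.DataFrame({'n': np.arange(1000, dtype='int64')})
print(df['n'].memory_usage(deep=True))                  # ~8000 bytes plus index
print(df['n'].astype('int16').memory_usage(deep=True))  # ~2000 bytes plus index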
import pandas as pd

# Column A holds numbers as strings; column B holds 64-bit floats
data = {'A': ['1', '2', '3'], 'B': [4.1, 5.2, 6.3]}
df = pd.DataFrame(data)
print("Initial Data Types:")
print(df.dtypes)

# Cast A from string to integer and B from float64 down to float32
df['A'] = df['A'].astype(int)
df['B'] = df['B'].astype('float32')
print("\nUpdated Data Types:")
print(df.dtypes)
Output :
cat.codes
In Pandas, the cat.codes attribute is used to obtain the category codes (or integer
codes) of the values in a categorical or "category" data type column.
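A minimal sketch with made-up city values; note that Pandas assigns the integer codes following the (sorted) order of the categories.
import pandas as pd

# A column of city names stored with the "category" dtype
s = pd.Series(['delhi', 'mumbai', 'delhi', 'chennai'], dtype='category')

print(s.cat.categories)      # Index(['chennai', 'delhi', 'mumbai'], dtype='object')
print(s.cat.codes.tolist())  # [1, 2, 1, 0] -- one integer code per value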
Classification Report
The classification report is a tool in machine learning for evaluating the performance of
a classification model. Its primary purpose is to provide a detailed summary of the
model's performance in terms of various evaluation metrics for each class or category in
a classification problem.
It's particularly useful for understanding how well a model is performing across different
classes and can help in identifying where the model may be making mistakes.
The classification report typically includes metrics such as precision, recall, F1-score,
and support for each class, along with an overall accuracy score. These metrics are
valuable for assessing a model's performance in tasks like binary classification,
multi-class classification, and multi-label classification.
Purpose of Classification Report
Per-Class Evaluation: The report breaks precision, recall, F1-score, and support out
for each class, so you can see which classes the model predicts well and which it
confuses.
Identify Mistakes: Comparing the per-class scores against each class's support helps
pinpoint where the model is making mistakes, for example on under-represented
classes.
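A minimal sketch of a typical classification_report call, reusing the iris data from the decision-tree example with a held-out test split:
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Train on 70% of iris, evaluate on the remaining 30%
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, F1-score, and support for each of the three species
print(classification_report(y_test, y_pred, target_names=iris.target_names))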
Standard Scaler
The Standard Scaler is a preprocessing technique commonly used in machine learning
to scale and center numerical features (variables) in a dataset.
Its primary purpose is to transform the features so that they have a mean of 0 and a
standard deviation of 1.
Standardization (also known as z-score normalization) is particularly useful when
dealing with features that have different scales or units because it ensures that all
features have the same scale.
It can improve the performance of many machine learning algorithms, especially those
sensitive to the scale of features.
Purpose of Standard Scaler
Scale Features: Standard Scaler scales each feature independently, transforming them
to have a mean of 0 and a standard deviation of 1. This scaling makes it easier to
compare and interpret the impact of different features on a model.
Improve Model Performance: Many machine learning algorithms, such as support
vector machines, k-nearest neighbors, and principal component analysis, perform
better when features are standardized (see the pipeline sketch after the example
below). Standardization helps prevent features with larger scales from dominating
the learning process.
Normalize Distributions: Standard Scaler can help normalize the feature distributions,
making the data more suitable for models that assume normality.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0, 5.0], [20.0, 10.0], [30.0, 15.0]])

# Each column is transformed as z = (x - mean) / std
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print("Original Data:")
print(data)
print("\nScaled Data:")
print(scaled_data)

# Each column of the scaled data now has mean 0 and standard deviation 1
print(scaled_data.mean(axis=0))  # ~[0. 0.]
print(scaled_data.std(axis=0))   # ~[1. 1.]
Output :
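As a sketch of the model-performance point above, StandardScaler is commonly chained with a scale-sensitive estimator (an SVM is one assumed choice) in a scikit-learn pipeline:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaler and classifier are fitted together, so the test data is scaled
# using statistics learned from the training data only
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
model = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
print(model.score(X_test, y_test))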
Label Encoding
Label encoding is a technique used to convert categorical data (data that consists of
labels or categories) into numerical format.
Its primary purpose is to prepare categorical data for machine learning algorithms that
require numerical inputs.
Label encoding assigns a unique integer value to each category, effectively converting
them into numeric labels.
Purpose of Label Encoding
Preserve Ordinal Information: Label encoding can be useful for ordinal categorical
data, where the order or ranking among categories matters. Note, however, that
scikit-learn's LabelEncoder assigns integers in sorted (alphabetical) order, so the
codes reflect the intended ranking only if the labels happen to sort that way; for an
explicit ordering, an ordered Pandas Categorical can be used instead (see the sketch
below).
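A minimal sketch of an explicitly ordered encoding using Pandas; the size labels are made up for illustration.
import pandas as pd

# 'small' < 'medium' < 'large': declare the order instead of relying on sorting
sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=['small', 'medium', 'large'], ordered=True)
print(sizes.codes)  # [0 2 1 0] -- codes follow the declared order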
from sklearn.preprocessing import LabelEncoder

data = ['red', 'green', 'blue', 'green', 'red']

# Classes are sorted alphabetically: blue -> 0, green -> 1, red -> 2
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print("Original Data:")
print(data)
print("\nEncoded Data:")
print(encoded_data)  # [2 1 0 1 2]

# inverse_transform maps the integer codes back to the original labels
decoded_data = label_encoder.inverse_transform(encoded_data)
print("\nDecoded Data:")
print(decoded_data)
Output :