DS Journal-1
CERTIFICATE
20_________.
Place:
Date:
Introduction to Excel
● Perform conditional formatting on a dataset using
8 20/02/25 Hypothesis Testing
● Formulate null and alternative hypotheses for a given problem.
● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
● Interpret the results and draw conclusions based on the test outcomes.
Practical No: 01 - Data Frames and Basic Data Pre-processing
In [3]: df.tail()
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
In [4]: df.describe()
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [6]: df.shape
Out[6]: (150, 6)
In [7]: df.dtypes
Out[7]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
In [8]: df.columns
In [10]: df = df.drop(columns=['Species'])
In [15]: # Step 6: Filtering Data (Example: Select rows where a column value is greater than a threshold)
threshold_value = 3 # Example threshold value
filtered_df = df[df['SepalLengthCm'] > threshold_value] # Boolean mask keeps rows above the threshold
print(f"\nFiltered Data (Rows with SepalLengthCm > {threshold_value}):")
print(filtered_df)
In [17]: # Step 8: Grouping Data (Example: Group by a column and aggregate other columns)
df_grouped = df_sorted.groupby('SepalLengthCm').agg({
'PetalLengthCm': 'mean', # Mean of PetalLengthCm for each group
'PetalWidthCm': 'sum' # Sum of PetalWidthCm for each group
}).reset_index() # Turn the group key back into a regular column
print("\nGrouped Data (Mean and Sum for Each Group):")
print(df_grouped.head())
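The filtering and grouping steps above can be tried on a small frame where the results are easy to check by hand. This is only a sketch: the `toy` DataFrame and its column names (`group`, `length`, `width`) are hypothetical and are not part of the Iris data used in this practical.

```python
import pandas as pd

# Hypothetical toy data (not taken from the Iris CSV)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 5.0, 7.0],
    "width": [0.5, 1.5, 2.5, 3.5],
})

# Filtering: a boolean mask keeps only the rows where the condition holds
filtered = toy[toy["length"] > 2]

# Grouping: one aggregation per column, then the group key becomes a column again
grouped = toy.groupby("group").agg({"length": "mean", "width": "sum"}).reset_index()

print(filtered)
print(grouped)
```

Grouping on a categorical key such as `group` usually gives more readable summaries than grouping on a continuous measurement, where almost every value forms its own group.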
In [2]: df = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv")
df
Original DataFrame:
Out[3]: classlabel Alcohol Malic Acid
0 1 13.20 1.78
1 1 13.16 2.36
2 1 14.37 1.95
3 1 13.24 2.59
4 1 14.20 1.76
MinMax Scaler
MinMax scaling is another way to scale data: the minimum of each feature is mapped to zero
and the maximum to one. MinMaxScaler transforms each feature to a given range, which is
0 to 1 by default. Because the transformation is linear, it changes the range of the
values without changing the shape of the original distribution.
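As a rough check of the description above, min-max scaling can be written by hand as (x - min) / (max - min) and compared with scikit-learn's `MinMaxScaler`. The array `x` below is a hypothetical toy feature, not a column from wine.csv.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature with one column (not taken from wine.csv)
x = np.array([[10.0], [12.0], [15.0], [20.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation using scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```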
In [ ]: scaling=MinMaxScaler()
scaled_value=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_value
print("\n Dataframe after MinMax Scaling")
df1
StandardScaler
StandardScaler is a preprocessing technique in scikit-learn that standardizes features
by removing the mean and scaling to unit variance. It transforms each feature to have a
mean of zero and a standard deviation of one. This puts all features on the same scale,
preventing any single feature from dominating the learning process because of its larger
magnitude.
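The standardization described above can be sketched by hand as z = (x - mean) / std and compared with `StandardScaler`. The values in `x` are hypothetical. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature with one column (not taken from wine.csv)
x = np.array([[2.0], [4.0], [4.0], [6.0]])

# Manual standardization: subtract the column mean, divide by the population std
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation using scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```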
In [4]: scaling=StandardScaler()
scaled_standardvalue=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_standardvalue
print("\n Dataframe after Standard Scaling")
df1
0 1 0.255824 -0.501624
1 1 0.206229 0.018020
2 1 1.706501 -0.349315
3 1 0.305420 0.224086
4 1 1.495719 -0.519543
Out[6]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
In [7]: le=LabelEncoder()
iris['code']=le.fit_transform(iris.Species)
iris
Out[7]:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species  code
0      1            5.1           3.5            1.4           0.2     Iris-setosa     0
1      2            4.9           3.0            1.4           0.2     Iris-setosa     0
2      3            4.7           3.2            1.3           0.2     Iris-setosa     0
3      4            4.6           3.1            1.5           0.2     Iris-setosa     0
4      5            5.0           3.6            1.4           0.2     Iris-setosa     0
145  146            6.7           3.0            5.2           2.3  Iris-virginica     2
146  147            6.3           2.5            5.0           1.9  Iris-virginica     2
147  148            6.5           3.0            5.2           2.0  Iris-virginica     2
148  149            6.2           3.4            5.4           2.3  Iris-virginica     2
149  150            5.9           3.0            5.1           1.8  Iris-virginica     2
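The codes in the output above follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0. A minimal sketch, using a short hypothetical Series in place of the full `iris.Species` column, shows how to inspect that mapping and reverse it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical short sample standing in for the iris.Species column
species = pd.Series(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(le.classes_)}

print(mapping)
print(codes)
print(le.inverse_transform(codes))  # recover the original labels
```

The integer codes imply an order that the species do not have, so they suit target labels better than input features for models that treat numbers as magnitudes.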
Practical No: 03 - Regression and Its Types
In [2]: df = pd.read_csv(r"C:\Users\DELL\Downloads\fetch_california_housing.csv")
df.head()
In [3]: df.tail()
In [4]: df.shape
Out[4]: (20640, 9)
In [5]: df.size
Out[5]: 185760
In [6]: df.describe()
In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
In [8]: df.dtypes
housing = fetch_california_housing()
# Convert to DataFrame
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df.head() # Print first few rows
In [10]: housing_df['PRICE']=housing.target
X=housing_df[['AveRooms']]
y=housing_df[['PRICE']]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
In [11]: model=LinearRegression()
model.fit(X_train,y_train)
Out[11]: ▾ LinearRegression
LinearRegression()
In [12]: y_pred = model.predict(X_test) # Predict once and reuse for both metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
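To make the two metrics above concrete, the sketch below computes MSE and R² by hand and compares them with `mean_squared_error` and `r2_score`. It uses synthetic data (a hypothetical line y = 3x + 2 plus noise) rather than the California housing data, so the numbers are not the journal's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption for illustration): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# MSE: mean of the squared residuals
mse_manual = np.mean((y - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(mse_manual, mean_squared_error(y, y_pred))
print(r2_manual, r2_score(y, y_pred))
```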
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[5]: ▾ LogisticRegression
LogisticRegression()
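The ConvergenceWarning above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch of both is given below. It assumes the Iris data (loaded here with `load_iris`, which may differ from the CSV used in the journal) and an arbitrary `max_iter=1000`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the journal's dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features first, then fit with a larger iteration budget
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Iterations actually used by the solver, and accuracy on the held-out split
n_iter_used = int(clf[-1].n_iter_.max())
accuracy = clf.score(X_test, y_test)

print(n_iter_used)
print(accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the preprocessing.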
In [6]: # Predictions
y_pred_logistic = logistic_model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
Accuracy: 1.0
Classification Report
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
model.fit(X_train, y_train)
y_pred_tree = model.predict(X_test)
y_pred_tree
Classification Report
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Practical No: 05 - K-Means Clustering
# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
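After fitting PCA as above, a common next step is to check how much variance each component explains before deciding how many to keep. The sketch below assumes the bundled Iris features as a stand-in for `X_scaled`, since the journal's dataset for this practical is not shown.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: Iris features stand in for the journal's data
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# With no n_components given, PCA keeps every component
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Share of total variance per component, and the running total
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print(ratios)
print(cumulative)
```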