ADS LAB Merged
Theory:
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. They provide a
quick overview of the data and help in understanding its distribution, central tendency, and
variability. Key measures include:
1. Measures of Central Tendency:
○ Mean: The average value of the dataset.
○ Median: The middle value when the data is sorted.
○ Mode: The most frequently occurring value.
2. Measures of Dispersion:
○ Range: The difference between the maximum and minimum values.
○ Variance: The average of the squared differences from the mean.
○ Standard Deviation: The square root of variance, indicating the spread of data.
○ Interquartile Range (IQR): The range between the first quartile (Q1) and the
third quartile (Q3).
3. Shape of the Distribution:
○ Skewness: Measures the asymmetry of the data distribution.
○ Kurtosis: Measures the "tailedness" of the data distribution.
4. Other Measures:
○ Coefficient of Variation (CV): The ratio of the standard deviation to the mean,
expressed as a percentage.
○ Trimmed Mean: The mean after removing a certain percentage of the
smallest and largest values.
○ Sum of Squares: The sum of squared deviations from the mean.
5. Visualizations:
○ Box-and-Whisker Plot: Displays the distribution of data based on quartiles and
identifies outliers.
○ Scatter Plot: Shows the relationship between two numeric variables.
○ Correlation Matrix: Displays the correlation coefficients between numeric
variables.
Inferential Statistics
Inferential statistics make inferences about a population based on a sample of data. They
help in testing hypotheses and drawing conclusions.
1. Distributions:
○ Normal Distribution: A symmetric, bell-shaped distribution where most values
cluster around the mean.
○ Poisson Distribution: A discrete distribution that describes the probability of a
given number of events occurring in a fixed interval.
2. Population Parameters and Sampling Errors:
○ Population Parameters: Characteristics of the entire population (e.g.,
population mean, population variance).
○ Sampling Errors: Differences between the sample statistic and the population
parameter.
3. Confidence Intervals:
○ A range of values within which the population parameter is expected to lie,
with a certain level of confidence (e.g., 95%).
4. Hypothesis Testing:
○ Null Hypothesis (H0): A statement that there is no effect or no difference.
○ Alternative Hypothesis (H1): A statement that contradicts the null hypothesis.
○ Type I Error: Rejecting the null hypothesis when it is true (false positive).
○ Type II Error: Failing to reject the null hypothesis when it is false (false
negative).
○ Z-Test: A statistical test used when the sample size is large and the
population variance is known.
○ T-Test: A statistical test used when the sample size is small and the
population variance is unknown.
○ ANOVA (Analysis of Variance): A test used to compare the means of three or
more groups.
#importing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
df = pd.read_csv("loan_data_set.csv")
print(df.head())
Output:
A. DESCRIPTIVE STATISTICS
print("Descriptive Statistics:")
print(df.describe())
print("\nMean:")
print(df.mean(numeric_only=True))
print("\nMedian:")
print(df.median(numeric_only=True))
print("\nMode:")
print(df.mode().iloc[0])
print("\nMin:")
print(df.min(numeric_only=True))
print("\nMax:")
print(df.max(numeric_only=True))
print("\nSum:")
print(df.sum(numeric_only=True))
print("\nRange:")
print(df.max(numeric_only=True) - df.min(numeric_only=True))
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print("\nFirst Quartile (Q1):")
print(Q1)
variance = df.var(numeric_only=True)
print("\nCorrelation Matrix:")
corr = df.corr(numeric_only=True)
print(corr)
# Standard Error of the Mean (SEM)
print("\nStandard Error of the Mean (SEM):")
print(df.sem(numeric_only=True))
print("Coefficient of Variation:")
print(cv_results)
print("\nMissing Values:")
print(df.isnull().sum())
# Total Rows
print("\nTotal Rows (N total):")
print(len(df))
# Cumulative Sum
print("\nCumulative Sum:")
print(df.select_dtypes(include='number').cumsum())
# Sum of Squares
print("\nSum of Squares:")
print((df.select_dtypes(include='number') ** 2).sum())
# Skewness
print("\nSkewness:")
print(df.skew(numeric_only=True))
# Kurtosis
print("\nKurtosis:")
print(df.kurtosis(numeric_only=True))
# Box-and-Whisker Plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="ApplicantIncome")
plt.title("Box-and-Whisker Plot of ApplicantIncome")
plt.show()
# Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x="ApplicantIncome", y="LoanAmount")
plt.title("Scatter Plot of ApplicantIncome vs LoanAmount")
plt.show()
B. INFERENTIAL STATISTICS
Theory:
Imputation is the process of replacing missing or null values (like NaN or NA) in a dataset
with estimated or calculated values. This helps in maintaining the completeness of the data
for accurate analysis and modeling.
# Visualize missing values per column
missing_values = df.isnull().sum()
plt.figure(figsize=(12, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, palette='viridis')
plt.xticks(rotation=45)
plt.title('Missing Values by Column')
plt.xlabel('Column')
plt.ylabel('Number of Missing Values')
plt.show()
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns
# Impute numeric columns with the mean and with the median
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
df_mean = df.copy()
df_median = df.copy()
df_mean[numeric_cols] = mean_imputer.fit_transform(df[numeric_cols])
df_median[numeric_cols] = median_imputer.fit_transform(df[numeric_cols])
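The tight_layout/show calls below imply a comparison figure whose construction is not included in this excerpt. A minimal sketch, assuming a side-by-side comparison of the LoanAmount distribution before and after imputation:
# Compare the original and imputed distributions of LoanAmount (assumed column choice)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df['LoanAmount'], kde=True, ax=axes[0]).set_title("Original")
sns.histplot(df_mean['LoanAmount'], kde=True, ax=axes[1]).set_title("Mean Imputed")
sns.histplot(df_median['LoanAmount'], kde=True, ax=axes[2]).set_title("Median Imputed")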
plt.tight_layout()
plt.show()
EXPERIMENT 3
Theory:
Data visualization is the graphical representation of information and data using visual
elements such as charts, graphs, and maps. The primary purpose of data visualization is to
simplify complex data and make it more understandable, accessible, and useful for
decision-making.
Key Purposes:
1. Simplifies Complex Data – Large datasets can be difficult to interpret in raw form.
Visualizations help summarize and present data in an intuitive way.
2. Enhances Understanding – Patterns, trends, and correlations are easier to recognize
when displayed visually.
3. Improves Decision-Making – Businesses and analysts can make informed decisions
quickly based on visual insights.
4. Identifies Patterns & Trends – Helps in detecting trends, correlations, and outliers
that may not be obvious in numerical data.
5. Facilitates Communication – Visual data representation is more effective for
presentations and reports, making it easier to share insights with stakeholders.
6. Increases Engagement – Interactive and visually appealing dashboards keep users
engaged and interested in the data.
Data visualization tools help users create insightful visual representations of data with
minimal effort. These tools come with built-in features for customization, interactivity, and
analytics.
Key Benefits:
● Faster Analysis – Converts raw data into meaningful visuals within seconds, making analysis quicker and more efficient.
● Improved Accuracy – Reduces errors caused by manual data interpretation and helps in making more precise predictions.
● Better Data Storytelling – Enables users to tell compelling stories through visuals that highlight important insights.
● Enhanced Productivity – Saves time by automating data representation, allowing teams to focus on decision-making rather than data processing.
● Real-time Data Monitoring – Many tools offer live dashboards that update dynamically, helping organizations track performance in real-time.
● Customizable & Interactive – Users can filter, drill down, and explore data interactively for deeper insights.
● Supports Multiple Data Sources – Most visualization tools integrate with databases, spreadsheets, APIs, and cloud storage for seamless data analysis.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from bokeh.plotting import figure, output_notebook, show
import missingno as msno
# Load dataset
df = pd.read_csv("loan_data_set.csv")
#Univariate Visualization
#Histogram
plt.figure(figsize=(8, 5))
sns.histplot(df['LoanAmount'], kde=True, bins=20)
plt.title("Histogram of Loan Amount")
plt.xlabel("Loan Amount")
plt.ylabel("Frequency")
plt.show()
#BarChart
plt.figure(figsize=(8, 5))
sns.countplot(x="Education", data=df)
plt.title("Bar Chart of Education Levels")
plt.xlabel("Education")
plt.ylabel("Count")
plt.show()
#Multivariate Visualization
#Scatter Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='ApplicantIncome', y='LoanAmount',
hue='Education', data=df)
plt.title("Scatter Plot: Applicant Income vs Loan Amount")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")
plt.show()
#Scatter Matrix
from pandas.plotting import scatter_matrix
scatter_matrix(df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']], figsize=(8, 8), diagonal='kde')
plt.show()
#Bubble Chart
fig = px.scatter(df, x="ApplicantIncome", y="LoanAmount",
                 size="CoapplicantIncome", color="Education",
                 title="Bubble Chart: Loan Amount vs Applicant Income")
fig.show()
#density chart
plt.figure(figsize=(8, 5))
# Corrected kdeplot syntax
sns.kdeplot(x=df['ApplicantIncome'], y=df['LoanAmount'],
cmap="Reds", fill=True)
plt.title("Density Chart: Applicant Income vs Loan Amount")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")
plt.show()
#Heat Map
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
plt.figure(figsize=(8, 5))
# Generate heatmap on numeric data only
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm",
linewidths=0.5)
plt.title("Heat Map of Correlation Matrix")
plt.show()
EXPERIMENT 4
AIM: Implement and explore performance evaluation metrics for data models (Supervised)
Theory:
To measure how well the model performs, we use different evaluation metrics, which vary
depending on the type of supervised learning problem:
Classification Metrics
Classification models predict discrete categories (e.g., spam detection, disease
classification). The following metrics evaluate their performance:
1. Accuracy: The proportion of correct predictions out of all predictions made.
2. Error Rate: The proportion of incorrect predictions (1 - Accuracy).
3. Precision: The fraction of predicted positive cases that are actually positive.
4. Recall (Sensitivity): The fraction of actual positive cases that the model correctly identifies.
5. ROC Curve and AUC:
● ROC Curve: A plot of True Positive Rate (TPR) vs. False Positive Rate (FPR).
● AUC (Area Under Curve): Measures the model’s ability to distinguish between classes.
6. F1 Score: The harmonic mean of Precision and Recall, balancing false positives and false negatives in a single measure.
Regression Metrics
Regression models predict continuous values (e.g., stock prices, temperature). The following
metrics evaluate their performance:
1. Pearson Correlation Coefficient (r): Measures the strength and direction of the linear relationship between actual and predicted values.
2. R² (Coefficient of Determination): Indicates how much of the variance in the target variable is explained by the model.
3. Error Metrics (MAE, MSE, RMSE): Measure the average magnitude of prediction errors, as computed in the code below.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Load dataset
df = pd.read_csv("supermarket_sales.csv") # Update the path if
required
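The confusion-matrix and ROC code below assumes a trained classifier clf and a held-out split (X_test_class, y_test_class, y_pred_class) that are not shown in this excerpt. A minimal sketch of how they could be prepared follows; the binary target ('Gender') and the feature columns 'Total' and the 80/20 split are assumptions about the supermarket dataset, not specified in the original ('Unit price' and 'Quantity' do appear later in this manual).
# Sketch (assumed setup): build a binary classification task from the sales data
X = df[["Unit price", "Quantity", "Total"]]        # assumed numeric features
y = (df["Gender"] == "Male").astype(int)           # assumed binary target

X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_class, y_train_class)
y_pred_class = clf.predict(X_test_class)

print("Accuracy:", accuracy_score(y_test_class, y_pred_class))
print("Precision:", precision_score(y_test_class, y_pred_class))
print("Recall:", recall_score(y_test_class, y_pred_class))
print("F1 Score:", f1_score(y_test_class, y_pred_class))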
# Confusion Matrix
conf_matrix = confusion_matrix(y_test_class, y_pred_class)
# ROC Curve
fpr, tpr, _ = roc_curve(y_test_class,
clf.predict_proba(X_test_class)[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)
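The confusion matrix and ROC values above are computed but not displayed in this excerpt; a short sketch of one way to visualize them (plot styling is an assumption):
# Display the confusion matrix as a heatmap
plt.figure(figsize=(5, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# Plot the ROC curve with its AUC
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()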
EXPERIMENT 5
AIM: Implement and explore performance evaluation metrics for data models (Unsupervised Learning)
Theory:
Clustering is an unsupervised learning technique used to group similar data points based on
certain features. Unlike classification, clustering does not rely on predefined labels, making
evaluation more challenging. To assess the performance of clustering models, various
internal and external evaluation metrics are used. Below are the key metrics used in
clustering evaluation:
1. Rand Index (RI): The Rand Index (RI) measures the similarity between the predicted
clustering assignments and the true class labels. It evaluates how well the clustering model
has assigned data points by comparing pairs of samples.
2. Adjusted Rand Index (ARI): The Adjusted Rand Index (ARI) is an improved version of the
Rand Index that accounts for the probability of random clustering. It ensures that the score
remains close to zero for random assignments and one for perfect clustering.
3. Mutual Information (MI): Mutual Information (MI) measures the amount of shared
information between the true labels and predicted clusters. It is based on entropy and
evaluates how much knowing one variable (true labels) reduces uncertainty about the other
variable (predicted clusters).
4. Silhouette Coefficient: The Silhouette Coefficient is an internal clustering metric that
measures how similar a data point is to its own cluster compared to other clusters. It is
based on the distances between data points.
Code:
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from collections import Counter
import math
# 1. Implementing Rand Index (RI) and Adjusted Rand Index (ARI)
def rand_index(y_true, y_pred):
    pairs = list(combinations(range(len(y_true)), 2))
    a = sum((y_true[i] == y_true[j]) and (y_pred[i] == y_pred[j]) for i, j in pairs)
    b = sum((y_true[i] != y_true[j]) and (y_pred[i] != y_pred[j]) for i, j in pairs)
    return (a + b) / len(pairs)
# 2. Implementing Mutual Information (MI)
def mutual_information(y_true, y_pred):
    n = len(y_true)
    clusters_true = Counter(y_true)   # cluster sizes for the true labels
    clusters_pred = Counter(y_pred)   # cluster sizes for the predicted labels
    mi = 0
    for c in clusters_true:
        for k in clusters_pred:
            n_ck = sum((y_true[i] == c and y_pred[i] == k) for i in range(n))
            if n_ck > 0:
                p_ck = n_ck / n
                p_c = clusters_true[c] / n
                p_k = clusters_pred[k] / n
                mi += p_ck * math.log(p_ck / (p_c * p_k))
    return mi
# 3. Implementing the Silhouette Coefficient
def euclidean_distance(p, q):
    return np.sqrt(np.sum((np.array(p) - np.array(q)) ** 2))

def silhouette_coefficient(X, labels):
    n = len(X)
    unique_clusters = set(labels)
    silhouette_scores = []
    for i in range(n):
        same_cluster = [X[j] for j in range(n) if labels[j] == labels[i] and i != j]
        other_clusters = {c: [X[j] for j in range(n) if labels[j] == c]
                          for c in unique_clusters if c != labels[i]}
        if same_cluster:
            a_i = np.mean([euclidean_distance(X[i], p) for p in same_cluster])  # intra-cluster distance
        else:
            a_i = 0
        # Mean distance to the nearest other cluster (inter-cluster distance)
        b_i = min(np.mean([euclidean_distance(X[i], p) for p in pts])
                  for pts in other_clusters.values()) if other_clusters else 0
        denom = max(a_i, b_i)
        silhouette_scores.append((b_i - a_i) / denom if denom > 0 else 0)
    return np.mean(silhouette_scores)
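The print statements below reference metric values (rand_idx, adjusted_rand, mutual_info, silhouette_coeff) whose computation is not shown in this excerpt. A minimal sketch of how they could be produced with the helper functions defined above, using synthetic blob data and sklearn's adjusted_rand_score in place of a hand-written ARI (both are assumptions):
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score  # stand-in for a hand-written ARI

# Assumed toy data: three well-separated clusters
X, y_true = make_blobs(n_samples=150, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

rand_idx = rand_index(y_true, labels)
adjusted_rand = adjusted_rand_score(y_true, labels)
mutual_info = mutual_information(y_true, labels)
silhouette_coeff = silhouette_coefficient(X_scaled, labels)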
# Print results
print(f"Rand Index: {rand_idx:.4f}")
print(f"Adjusted Rand Index: {adjusted_rand:.4f}")
print(f"Mutual Information: {mutual_info:.4f}")
print(f"Silhouette Coefficient: {silhouette_coeff:.4f}")
Output:
EXPERIMENT 6
Theory:
Time series analysis and forecasting are essential for understanding patterns in data
collected over time. By studying historical trends, businesses and researchers can make
informed decisions about future events. The key reasons for conducting time series analysis
and forecasting include:
Time series data often contains short-term irregular variations that can mislead
decision-making. By analyzing the data, we can determine whether a sudden change is a
natural fluctuation or an outlier. Outliers are extreme values that deviate significantly from
the trend and may be caused by unexpected events, errors in data collection, or external
influences. Identifying outliers helps in making more accurate predictions and avoiding
misleading conclusions.
Many datasets exhibit seasonality, meaning they follow a recurring pattern over specific time
intervals (e.g., daily, monthly, yearly). For example, sales of winter clothing increase in colder
months and decline in summer. Without time series analysis, it can be difficult to distinguish
whether a trend is genuine growth or just a seasonal effect. By decomposing the data into
trend, seasonal, and residual components, we can better understand the underlying behavior
and make accurate forecasts.
Time series analysis helps in recognizing how a variable evolves over different periods. For
instance, stock prices, temperature variations, or economic indicators change continuously,
and analyzing past patterns helps predict future movements. By examining trends,
seasonality, and irregular components, we can determine whether a process is stable,
increasing, or declining over time, which aids in better planning and decision-making.
One of the primary objectives of time series forecasting is to understand the direction in
which data is heading. Trends indicate whether a variable is experiencing consistent growth,
decline, or stability over time. For example, if a company's revenue shows a long-term
upward trend despite short-term fluctuations, it suggests business growth. Identifying these
trends helps organizations plan future strategies, allocate resources effectively, and respond
to market changes efficiently.
Code & Output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings("ignore")
# Load dataset
df = pd.read_csv("supermarket_sales.csv", parse_dates=['Date'])
df.set_index('Date', inplace=True)
print(df.info())
# PACF plot
plt.figure(figsize=(12, 5))
plot_pacf(df['Quantity'].dropna(), lags=30)
plt.show()
# ADF test
adf_result = adfuller(df['Unit price'].dropna())
print(f"ADF Statistic: {adf_result[0]}")
print(f"p-value: {adf_result[1]}")
print("Critical Values:")
for key, value in adf_result[4].items():
print(f" {key}: {value}")
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(y_pred)
# Moving average smoothing model
df['Moving_Avg'] = df['Quantity'].rolling(window=3).mean()
plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Quantity'], label='Original')
plt.plot(df.index, df['Moving_Avg'], label='Moving Average',
linestyle='dashed')
plt.legend()
plt.show()
# ARIMA model
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]
p, d, q = 1, 1, 1
arima_model = ARIMA(train['Unit price'], order=(p, d, q))
arima_fit = arima_model.fit()
predictions = arima_fit.forecast(steps=len(test))
plt.figure(figsize=(10, 5))
plt.plot(train.index, train['Unit price'], label='Train Data')
plt.plot(test.index, test['Unit price'], label='Test Data')
plt.plot(test.index, predictions, label='Predictions',
linestyle='dashed')
plt.legend()
plt.show()
# Model Evaluation
mae = mean_absolute_error(y_test, y_pred)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.4f}, MAPE: {mape:.2f}%, MSE: {mse:.4f}, RMSE: {rmse:.4f}")
Theory:
An outlier is a data point that significantly deviates from the rest of the dataset. It does not
conform to the general pattern of the data and may result from measurement errors,
variability in data, or genuine anomalies. Outliers can distort statistical analyses, affect
machine learning models, and lead to incorrect conclusions if not properly handled.
Types of Outliers
1. Global (Point) Outliers:
○ A single data point that deviates significantly from all other values in the dataset.
2. Contextual Outliers:
○ A data point that is unusual only within a specific context, such as a particular time or location.
3. Collective Outliers:
○ A group of data points that together deviate from the norm, even though individual values might not be outliers.
● Outliers can skew trends and forecasts in datasets, affecting data interpretation.
● In business analytics, they can lead to poor strategic decisions based on incorrect
insights.
● Supervised learning: Outliers can bias model training, leading to incorrect predictions.
● Unsupervised learning: Clustering algorithms like K-Means and DBSCAN may fail to
correctly classify data if outliers are present.
● Outlier detection is widely used in fraud detection for banking, insurance, and
cybersecurity.
● Example:
○ A credit card being used in two different countries within minutes may
indicate fraud.
● In healthcare, unusual spikes in patient vital signs could signal a medical emergency.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
from sklearn.cluster import DBSCAN
# Load dataset
df = pd.read_csv("Churn_Modelling.csv")
# Distribution plot
sns.displot(df['Balance'], kde=True)
plt.xlabel('Balance')
plt.ylabel('Density')
plt.title('Distribution of Balance')
plt.show()
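The mean-distance plot and the distance-based outlier selection below assume a distance matrix dist and a feature matrix X that are not defined in this excerpt. A minimal sketch, assuming a k-nearest-neighbours fit on the CreditScore and Age columns used later (the value of k is an assumption):
# Sketch (assumed setup): k-nearest-neighbour distances on two numeric features
X = df[['CreditScore', 'Age']].values
nbrs = NearestNeighbors(n_neighbors=3).fit(X)
dist, idx = nbrs.kneighbors(X)   # distances and indices of each point's nearest neighbours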
plt.plot(dist.mean(axis=1))
plt.title("Mean Distance to Nearest Neighbors")
plt.show()
outlier_index = np.where(dist.mean(axis=1) >
np.percentile(dist.mean(axis=1), 95))
outlier_values = df.iloc[outlier_index]
print("Distance-based Outliers:")
print(outlier_values[['CreditScore', 'Age']])
# Plot outliers
plt.scatter(df['CreditScore'], df['Age'], color="b", label='Normal Data')
plt.scatter(outlier_values['CreditScore'], outlier_values['Age'], color="r", label='Outliers')
plt.xlabel('Credit Score')
plt.ylabel('Age')
plt.legend()
plt.title("Distance-Based Outlier Detection")
plt.show()
# Density-Based Outlier Detection (LOF)
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.05)
preds = lof.fit_predict(X)
outliers = np.where(preds == -1)[0]
outlier_values = df.iloc[outliers]
print("LOF Outliers:")
print(outlier_values[['CreditScore', 'Age']])
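The boxplot and Winsorization steps below use trimming limits (lower_limit, upper_limit) and a trimmed_df that are not defined in this excerpt. A minimal sketch using the common 1.5 × IQR rule (an assumption about the original approach):
# Sketch (assumption): derive limits for CreditScore with the 1.5 * IQR rule
Q1 = df['CreditScore'].quantile(0.25)
Q3 = df['CreditScore'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Trimming: drop rows whose CreditScore falls outside the limits
trimmed_df = df[(df['CreditScore'] >= lower_limit) & (df['CreditScore'] <= upper_limit)]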
sns.boxplot(y=trimmed_df['CreditScore'])
plt.title('Boxplot after Trimming')
plt.show()
# Removing Outliers (Winsorization)
df['CreditScore'] = np.where(df['CreditScore'] >= upper_limit, upper_limit,
                             np.where(df['CreditScore'] <= lower_limit, lower_limit, df['CreditScore']))
AIM: Use SMOTE technique to generate synthetic data (to solve the problem of class
imbalance)
Theory:
Class imbalance occurs when one class in a dataset has significantly more samples than
another. This imbalance can lead to biased models that favor the majority class while poorly
predicting the minority class. Several techniques can be used to handle class imbalance:
1. Data-Level Approaches
These methods focus on modifying the dataset before training the model.
a. Oversampling:
● Increases the number of samples in the minority class, either by duplicating existing points or by generating synthetic ones (e.g., SMOTE).
b. Undersampling:
● Reduces the number of samples in the majority class to balance the dataset.
c. Hybrid Methods:
● Example: SMOTE + Tomek Links removes overlapping majority class samples after
applying SMOTE.
2. Algorithm-Level Approaches
These methods modify the learning algorithm to be more sensitive to class imbalance.
a. Cost-Sensitive Learning:
● Adjusts class weights in algorithms like Random Forest and Logistic Regression
(class_weight="balanced" in Scikit-Learn).
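As a brief illustration (a sketch, not taken from the original code):
# Give minority-class errors more weight during training
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight="balanced", max_iter=1000)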
3. Ensemble Methods
● Combine multiple models (e.g., bagging or boosting with resampling) so that the minority class receives adequate weight during training.
SMOTE (Synthetic Minority Oversampling Technique)
SMOTE is an oversampling technique that generates synthetic samples for the minority class instead of simply duplicating existing data points. Each synthetic sample is created by interpolating between a minority-class point and one of its nearest minority-class neighbours.
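No SMOTE code is included in this excerpt. A minimal sketch using the imbalanced-learn library; the Churn_Modelling.csv file from the previous experiment, its 'Exited' target column, and the chosen feature columns are assumptions:
import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE

df = pd.read_csv("Churn_Modelling.csv")
X = df[['CreditScore', 'Age', 'Balance']]   # assumed numeric features
y = df['Exited']                            # assumed binary target (1 = churned)

print("Class distribution before SMOTE:", Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Class distribution after SMOTE:", Counter(y_resampled))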
Theory:
When working with data, understanding its statistical properties is crucial for making
informed decisions. Various inferential statistics techniques help analyze and interpret the
data, estimate population parameters, and test hypotheses. Below are key statistical
methods and concepts used in data analysis.
1. Inferential Statistics
a) Distributions
A distribution represents how data values are spread over a range. Common probability
distributions include:
1. Normal Distribution:
● A symmetric, bell-shaped distribution where most values cluster around the mean.
2. Poisson Distribution:
● Used for modeling rare events (e.g., number of calls at a help desk per hour).
3. Population Parameters and Sample Statistics:
● Population parameters (e.g., population mean μ, population variance σ²) describe the entire population.
● Sample statistics (e.g., sample mean X̄, sample variance S²) estimate population parameters based on a sample.
4. Sampling Errors:
● The difference between a sample statistic and the true population parameter.
b) Confidence Intervals
A confidence interval provides a range within which we expect the true population parameter to lie with a certain probability (e.g., 95%).
Example: A 95% confidence interval of (50, 60) means we are 95% confident that the
population mean lies between 50 and 60.
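As an illustration (a sketch with assumed sample data, not taken from the original), a 95% confidence interval for a mean can be computed with scipy:
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=55, scale=10, size=30)   # assumed sample data
# t-based interval: confidence level, degrees of freedom, sample mean, standard error
ci = stats.t.interval(0.95, len(sample) - 1,
                      loc=np.mean(sample), scale=stats.sem(sample))
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")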
c) Hypothesis Testing
1) Z-Test
● Used when sample size (n) > 30 and the population standard deviation is known.
● Tests whether the sample mean significantly differs from the population mean.
Example: Checking if the average height of students differs from 170 cm.
2) T-Test
● Used when sample size (n) < 30 or population standard deviation is unknown.
Types of t-tests:
1. One-sample t-test: Compares sample mean to a known population mean.
2. Independent (two-sample) t-test: Compares means of two independent groups.
3. Paired t-test: Compares means of the same group before and after a treatment.
Example: Comparing the effectiveness of two different drugs on blood pressure reduction.
3) ANOVA (Analysis of Variance)
● Determines if at least one group mean is significantly different from the others.
Types of ANOVA:
1. One-Way ANOVA: Compares means across one factor (e.g., test scores of students in
three different schools).
2. Two-Way ANOVA: Compares means across two factors (e.g., test scores based on both
school and gender).
# c) Hypothesis Testing
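The comment above has no accompanying code in this excerpt. A minimal sketch of the three tests using scipy.stats and statsmodels, with synthetic sample data (the data, group sizes, and the 170 cm reference value from the example above are assumptions):
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

np.random.seed(42)
sample_a = np.random.normal(loc=172, scale=8, size=40)   # assumed sample data
sample_b = np.random.normal(loc=168, scale=8, size=40)
sample_c = np.random.normal(loc=170, scale=8, size=40)

# 1) Z-Test: does sample_a differ from a population mean of 170 cm?
z_stat, z_p = ztest(sample_a, value=170)
print(f"Z-test: statistic={z_stat:.3f}, p-value={z_p:.3f}")

# 2) Independent two-sample T-Test: do sample_a and sample_b differ?
t_stat, t_p = stats.ttest_ind(sample_a, sample_b)
print(f"T-test: statistic={t_stat:.3f}, p-value={t_p:.3f}")

# 3) One-Way ANOVA: does at least one of the three group means differ?
f_stat, f_p = stats.f_oneway(sample_a, sample_b, sample_c)
print(f"ANOVA: F={f_stat:.3f}, p-value={f_p:.3f}")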
Theory:
1. Objective
The purpose of this case study is to conduct a comprehensive exploratory data analysis
(EDA) on a dataset containing restaurant-related information from Zomato. As a group, our
objective is to examine the structure and characteristics of the data, identify underlying
patterns, and gain meaningful insights. We aim to explore aspects such as:
- Popular types of cuisines and restaurant categories
- Average ratings across different cities
- Distribution of cost for two people
- Influence of online delivery and table booking on customer ratings
Through this study, we intend to develop proficiency in using data analysis tools,
conducting descriptive statistics, and presenting data-driven conclusions with appropriate
visualizations.
2. Dataset Description
Dataset Used: Zomato Restaurants Dataset – Kaggle Link
(https://www.kaggle.com/datasets/PromptCloudHQ/zomato-restaurants-data)
This dataset consists of approximately 9500 records of restaurants listed on Zomato. The
data covers various attributes such as restaurant name, location, cuisines offered, average
cost for two people, rating metrics, availability of online delivery and table booking, and
customer votes. These features offer a balanced mix of numerical and categorical data for
analysis.
The dataset primarily represents restaurant data from Indian cities, though some entries
from other countries are included. It serves as a suitable dataset to understand consumer
preferences, market segmentation, and regional trends in the restaurant industry.
3. Tools Used
- Python
- Jupyter Notebook or Google Colab
- Pandas and NumPy for data handling
- Matplotlib and Seaborn for data visualization
4. Steps in the Case Study
Step 7: Conclusion
- Explain how users and restaurant owners can benefit from these insights.
5. Deliverables
- Python notebook with code and plots
- PDF report or PPT summarizing:
• Objectives
• Visualizations
• Key insights
• Conclusion
6. Optional Extensions
- Create interactive visualizations using Plotly.
- Use WordCloud to visualize frequent cuisine words.
- Use geopandas or folium to plot restaurant locations (if latitude/longitude data is
available).
7. Sample Visualizations
Here are some sample visualizations that can enhance the report and provide clear insights:
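A minimal sketch of such visualizations (the file name, encoding, and column names such as 'Average Cost for two', 'Cuisines', 'Has Table booking', and 'Aggregate rating' are assumptions about the Kaggle file and may need adjusting):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

zomato = pd.read_csv("zomato.csv", encoding="latin-1")   # assumed file name and encoding

# Distribution of cost for two people
plt.figure(figsize=(8, 5))
sns.histplot(zomato['Average Cost for two'], bins=30, kde=True)
plt.title("Distribution of Average Cost for Two")
plt.show()

# Top 10 most common cuisine listings
plt.figure(figsize=(8, 5))
zomato['Cuisines'].value_counts().head(10).plot(kind='barh')
plt.title("Top 10 Cuisines")
plt.show()

# Ratings by table-booking availability
plt.figure(figsize=(8, 5))
sns.boxplot(x='Has Table booking', y='Aggregate rating', data=zomato)
plt.title("Ratings vs Table Booking")
plt.show()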