R Program Project

Uploaded by

sricharans5656
Copyright
© All Rights Reserved

A MINI PROJECT ON THE ANALYSIS OF DATASETS

A Study on Exploring Various Datasets Using “R Programming”


To Analyse the Data
Submitted in partial fulfilment of the requirements for the award of
the R Programming certificate course conducted in the 2nd Semester by
Edupinnacle
MASTER OF BUSINESS ADMINISTRATION

Bangalore University
2023 – 2025
By
SRICHARAN.S
P03LZ23M015080
Under the Guidance of
Mr. Bharath Rajanna
Edupinnacle
Project- 1
Analysing Students' Daily Lives Through Data
Introduction
The daily routines and lifestyle choices of students significantly influence their academic
outcomes, mental health, social connections, and overall quality of life. With the growing
accessibility of data, we can now examine students' habits, behaviours, and day-to-day
experiences in greater depth. The Student Lifestyle Dataset offers a comprehensive resource for
examining various dimensions of student life, including study patterns, time management,
social interactions, participation in activities, and sleep schedules.

This dataset is typically gathered through surveys, interviews, or digital tracking methods,
capturing diverse factors that affect student lifestyles. By analysing this information, we can
uncover patterns that reveal how students manage academics alongside personal
commitments, the impact of lifestyle choices on academic success and mental well-being, and
how these behaviours evolve over time.

In this project, we will investigate key aspects such as:

• Study routines: Strategies students use to allocate time for classes, assignments, and
exam preparation.

• Social interactions: The influence of social media, friendships, and group activities
on their lives.

• Physical and mental health: Examining sleep habits, exercise, diet, and stress
management practices.

• Extracurricular activities: Engagement in sports, clubs, volunteering, and part-time
work.

The objective of this project is to leverage the Student Lifestyle Dataset to uncover patterns
that can help educators, policymakers, and mental health professionals provide better support
to students. By gaining a deeper understanding of these trends, we can make well-informed
decisions to enhance student services, academic resources, and wellness programs, ultimately
improving the overall student experience.

By providing these insights, the project also aims to empower students to make more
informed decisions about their own lives. Additionally, institutions can design more effective
interventions and outreach strategies to address specific challenges faced by students, such as
stress, burnout, or lack of engagement. In the long term, these findings could contribute to
building more supportive educational environments, ensuring that students thrive both
academically and personally, and are better prepared for future challenges.
Analysis and Interpretation
Data Pre-Processing: -

Output after cleaning Data: -


General Analysis
Analysis 1: Boxplot of Sleep Hours

A boxplot of sleep hours provides a clear visualization of the distribution and central
tendencies of students' sleeping patterns. It highlights key statistical measures such as the
median, the range (minimum and maximum), and the interquartile range (IQR), which
includes the lower and upper quartiles. Additionally, it identifies potential outliers—students
whose sleep duration deviates significantly from the norm.
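
As an illustration of the measures described above, here is a minimal base R sketch. It assumes a hypothetical column named sleep_hours and uses simulated values, not the Student Lifestyle Dataset itself.

```r
# Hypothetical data: daily sleep hours for 200 students (simulated)
set.seed(42)
students <- data.frame(sleep_hours = round(rnorm(200, mean = 7, sd = 1.2), 1))

# The five numbers a boxplot draws:
# lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
stats <- boxplot.stats(students$sleep_hours)
stats$stats
stats$out   # points more than 1.5 times the hinge spread beyond the hinges

iqr <- IQR(students$sleep_hours)   # interquartile range

boxplot(students$sleep_hours,
        main = "Boxplot of Sleep Hours",
        ylab = "Sleep hours per day",
        col = "lightblue")
```

Any values listed in stats$out are the ones the plot draws as individual outlier points.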

Code:

Output:

Interpretation:
A boxplot provides a concise summary of the distribution of sleep hours among students,
illustrating the spread, central tendency, and variability of sleep patterns. By plotting the
median, interquartile range (IQR), and potential outliers, we can identify students with
unusually high or low sleep durations. This helps in analysing the overall balance of students'
sleep habits and spotting extreme cases that might indicate stress, time management issues, or
health concerns.
Analysis 2: Scatterplot of Study Hours vs. GPA

A scatterplot visually represents the relationship between study hours and GPA, offering
insights into whether increased study time correlates with higher academic performance. By
plotting individual data points, patterns or trends—such as positive, negative, or no
correlation—can be easily observed. This analysis helps determine whether study habits have
a significant impact on GPA, while also identifying any anomalies or unique cases that might
warrant further investigation.

Code:

Output:

Interpretation:
The scatterplot offers a graphical representation of the relationship between study hours and
GPA. By observing the data points, we can identify trends or correlations, such as whether
students who dedicate more hours to studying tend to achieve higher GPAs. This visual
analysis provides clarity on the extent to which study time influences academic success and
helps identify outliers or inconsistent patterns that may be due to individual differences or
external factors.
Analysis 3: Density Plot for Stress Levels by GPA

A density plot is used to visualize the distribution of GPAs across varying levels of stress.
This analysis enables a comparison of how stress levels correlate with academic performance,
with the density curves highlighting concentrations of students within specific GPA ranges
for each stress level. By examining the shapes and overlaps of these distributions, we can
identify trends, such as whether students with higher stress levels tend to achieve lower GPAs
or whether moderate stress correlates with optimal performance. This insight can guide
interventions aimed at managing stress for better academic outcomes.
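
One way such a plot could be produced in base R is sketched here. The column names (stress, gpa) and the data are assumptions made for illustration. GPA is simulated independently of stress, so the curves show the mechanics only, not a finding.

```r
set.seed(1)
# Hypothetical data: a stress label and a GPA (4-point scale) per student
stress <- factor(sample(c("Low", "Moderate", "High"), 300, replace = TRUE),
                 levels = c("Low", "Moderate", "High"))
gpa <- pmin(4, pmax(0, rnorm(300, mean = 3.0, sd = 0.4)))
students <- data.frame(stress, gpa)

# One kernel density estimate per stress level
dens <- lapply(split(students$gpa, students$stress), density)

# Shared axis limits so the three curves can be compared directly
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

cols <- c(Low = "darkgreen", Moderate = "orange", High = "red")
plot(0, type = "n", xlim = xlim, ylim = ylim,
     xlab = "GPA", ylab = "Density", main = "GPA Density by Stress Level")
for (lvl in names(dens)) lines(dens[[lvl]], col = cols[[lvl]], lwd = 2)
legend("topleft", legend = names(dens), col = cols[names(dens)], lwd = 2)
```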

Code:

Output:

Interpretation:
A density plot provides a visual comparison of GPA distributions across different stress
levels: high, moderate, and low. By analysing the curves for each category, we can observe
patterns such as shifts in GPA concentrations or overlaps among the groups. This helps
determine whether stress levels significantly influence academic performance.
Statistical Analysis
Analysis 1: Scatterplot with Regression Line (Study Hours vs. GPA)

A scatterplot combined with a regression line offers a detailed visualization of the
relationship between study hours and GPA. Each point represents an individual student, with
the regression line indicating the overall trend. A steeper slope suggests a stronger influence
of study hours on GPA, while a flatter line may indicate a weaker relationship. This analysis
also highlights whether the relationship is positive or negative and helps in estimating how
much variation in GPA can be explained by study habits.
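
The slope, the correlation, and the share of variation explained can all be read from a fitted model. This is a minimal sketch on simulated data. The column names and the built-in linear relationship are assumptions, not the project's results.

```r
set.seed(7)
# Hypothetical data: GPA simulated as a noisy linear function of study hours
study_hours <- runif(150, min = 1, max = 10)
gpa <- 2.0 + 0.15 * study_hours + rnorm(150, sd = 0.3)
students <- data.frame(study_hours, gpa)

fit <- lm(gpa ~ study_hours, data = students)

slope <- coef(fit)[["study_hours"]]               # estimated GPA change per extra study hour
r     <- cor(students$study_hours, students$gpa)  # Pearson correlation, between -1 and 1
r2    <- summary(fit)$r.squared                   # share of GPA variance explained by the line

plot(gpa ~ study_hours, data = students, pch = 19, col = "steelblue",
     xlab = "Study hours per day", ylab = "GPA",
     main = "Study Hours vs. GPA")
abline(fit, col = "red", lwd = 2)                 # fitted regression line
```

With a single predictor, r2 equals the square of r, so the correlation coefficient and the explained variation describe the same fit.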

Code:

Output:

Interpretation:
The correlation coefficient provides a numerical value between -1 and 1 that measures the
strength and direction of the relationship between study hours and GPA. A value closer to 1
indicates a strong positive correlation, meaning that as study hours increase, GPA tends to
increase as well. Conversely, a correlation near -1 suggests a strong negative relationship,
while a value close to 0 implies little to no correlation. This statistical measure enhances the
understanding of how closely study habits are linked to academic performance.
Analysis 2: Correlation Between Sleep and Physical Activity
This analysis calculates the correlation between sleep duration and physical activity hours,
helping to determine whether there is a relationship between the two variables. A positive
correlation would suggest that students who get more sleep tend to engage in more physical
activity, possibly indicating a healthier lifestyle overall. Conversely, a weak or negative
correlation could highlight patterns such as students with irregular sleep schedules engaging
in less physical activity. This insight could be useful in promoting better sleep and exercise
habits for improved student well-being.
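
A minimal sketch of the calculation, assuming hypothetical columns sleep_hours and activity_hours. The two columns are simulated independently, so no particular relationship is implied.

```r
set.seed(3)
# Hypothetical data: 120 students, two independently simulated columns
students <- data.frame(
  sleep_hours    = rnorm(120, mean = 7, sd = 1),
  activity_hours = rnorm(120, mean = 1.5, sd = 0.5)
)

# Pearson correlation (the default method of cor)
r <- cor(students$sleep_hours, students$activity_hours)

# cor.test adds a p-value and a 95% confidence interval for the correlation
test <- cor.test(students$sleep_hours, students$activity_hours)
test$estimate   # same value as r
test$p.value
test$conf.int
```

The confidence interval is worth reporting alongside the coefficient, since a small sample can produce a non-zero value by chance.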
Code:

Output:

Interpretation:
This analysis examines the relationship between sleep hours and physical activity hours,
helping to determine if students who get more sleep tend to engage in more physical activity.
A positive correlation would suggest that students who sleep more are also more physically
active, possibly indicating a balanced lifestyle. On the other hand, a lack of correlation or a
negative correlation could suggest that sleep and physical activity are independent of each
other in this context.
Analysis 3: Correlation Between Study Hours and Extracurricular Hours
This analysis calculates the correlation between study hours and extracurricular activity hours
to assess whether there is a trade-off between academic and extracurricular involvement. A
negative correlation might indicate that students who spend more time studying have less
time available for extracurricular activities, suggesting a balance or prioritization between
academic and non-academic commitments. Conversely, a positive correlation would imply
that students are able to effectively manage both study and extracurricular activities, leading
to a more well-rounded experience.
Code:

Output:

Interpretation:
This analysis explores whether there is a relationship between study hours and extracurricular
activity hours. A positive correlation would suggest that students who dedicate more time to
their studies are also involved in extracurricular activities, whereas a negative correlation
would point to a trade-off between the two.
Predictive Analysis
Analysis 1: Visualizing Residuals

This plot visualizes the residuals (the differences between actual and predicted values) to
assess the accuracy of a predictive model. By examining the residuals, we can identify any
patterns or trends that suggest the model is not fully capturing the underlying data
relationships. Ideally, the residuals should appear randomly scattered, indicating that the
model is well-fitted. However, if patterns emerge, it may suggest areas where the model
could be improved.
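
The text does not say which model the residuals come from. The sketch assumes a simple linear regression of GPA on study hours, fitted to simulated data with hypothetical column names.

```r
set.seed(11)
# Hypothetical data and model (assumption: GPA predicted from study hours)
study_hours <- runif(150, min = 1, max = 10)
gpa <- 2.0 + 0.15 * study_hours + rnorm(150, sd = 0.3)
fit <- lm(gpa ~ study_hours)

res  <- resid(fit)    # actual GPA minus predicted GPA
pred <- fitted(fit)   # predicted GPA

plot(pred, res, pch = 19, col = "steelblue",
     xlab = "Fitted GPA", ylab = "Residual",
     main = "Residuals vs. Fitted Values")
abline(h = 0, lty = 2, col = "red")   # a well-fitted model scatters evenly around this line
```

A funnel shape in this plot would point to non-constant variance, and a curve would point to a relationship the straight line misses.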

Code:

Output:

Interpretation:

A residuals plot shows the differences between the actual and predicted values of a model. It
helps determine if the model is appropriate by checking if the residuals are randomly
distributed around zero. If there is a clear pattern in the residuals, it suggests that the model
might not be fitting the data well. This plot also helps assess whether key assumptions, like
homoscedasticity (consistent variance of residuals), are met. If the residuals are evenly spread
without any discernible pattern, it indicates that the model is appropriate for the data.
Project- 2
Exploring Energy Investment Trends:
Insights from The World Energy
Investment Report
INTRODUCTION
The World Energy Investment (WEI) report, published by the International Energy Agency
(IEA), offers a comprehensive analysis of global investment trends in the energy sector. It
highlights key developments in the energy landscape, including the shift toward renewable
energy, the ongoing investment in fossil fuels, and growing concerns about energy security.
The 2024 edition of the report focuses on the global transition to cleaner energy sources, the
factors driving investment decisions, and the geographical distribution of these investments.
Given the rising importance of clean energy and regional energy security, the report plays a
crucial role in understanding the future direction of the global energy market, especially in
the context of recent energy crises.

OBJECTIVES:

1. Understanding Global Energy Investment Trends:
To explore the shifting dynamics between investments in traditional fossil fuels and
renewable energy, with an emphasis on global energy priorities and transitions.

2. Assessing Energy Security:
To evaluate how concerns over energy security influence investment patterns,
particularly in the aftermath of recent global energy challenges.

3. Evaluating Economic Impacts:
To understand the economic benefits of energy investments, focusing on job creation,
economic growth, and technological innovation.

4. Supporting Climate Goals:
To measure the alignment of energy investments with climate change mitigation goals
and track progress toward achieving low-carbon energy systems.

The project aims to extract actionable insights from the WEI report by visualizing data and
applying statistical tools to understand the impact of energy investments on security, the
economy, and climate goals. By analysing these key metrics, the project offers valuable
perspectives on how economic growth, policy decisions, and energy investments intersect,
ultimately providing recommendations for stakeholders in the energy sector. Through these
findings, the project aims to inform decision-making and drive forward sustainable energy
practices.
Generic Analysis
Analysis 1: Trend Analysis of Total Investments
Code:

Output:

Interpretation:
This plot illustrates the overall trend in global energy investments. The upward trajectory
starting from 2020 suggests a recovery following the pandemic, with increasing attention and
capital flowing into energy projects. Notable peaks in 2023 and 2024 likely reflect a surge in
investments in clean energy, driven by the growing demand for sustainable solutions,
governmental policies supporting the energy transition, and the global push to combat climate
change. These trends highlight the shift towards cleaner, more secure energy sources, which
is crucial for achieving long-term sustainability goals.
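
A trend plot of this kind could be drawn as sketched here. The numbers are placeholders chosen for illustration and are not figures from the WEI report. The column names, and the assumption that total investment is the sum of clean and fossil investment, are also hypothetical.

```r
# Placeholder values only: these are NOT figures from the WEI report
energy <- data.frame(
  year   = 2015:2024,
  clean  = c(10, 11, 12, 13, 14, 15, 17, 19, 21, 23),
  fossil = c(14, 13, 13, 13, 13, 11, 12, 13, 13, 13)
)
# Assumption: total investment is the sum of the two categories
energy$total <- energy$clean + energy$fossil

plot(energy$year, energy$total, type = "o", pch = 19, col = "darkblue",
     xlab = "Year", ylab = "Total investment (placeholder units)",
     main = "Total Energy Investment by Year")

# Year-over-year change, useful for spotting dips and recoveries
yoy_change <- diff(energy$total)
names(yoy_change) <- energy$year[-1]
yoy_change
```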
Analysis 2: Comparison of Clean Energy vs. Fossil Fuel Investments (2024)
Code:

Output:

Interpretation:
This bar plot compares investments in clean energy versus fossil fuels for the year 2024. The
taller bar representing clean energy indicates the ongoing global shift towards renewable
energy sources. This trend is largely driven by climate goals, increasing government support,
and the declining costs of renewable energy technologies. It underscores the growing
prioritization of sustainability and the global commitment to reducing reliance on fossil fuels,
highlighting clean energy as the central focus of future energy investments.
Analysis 3: Proportion of Clean Energy Investment Over Total Investment (2015–2024)
Code:

Output:

Interpretation:
This graph illustrates the proportion of clean energy investments within the total energy
investments. An increasing proportion over time signifies the growing global emphasis on
clean energy projects. As more funds are allocated to renewable energy, it reflects the
increasing prioritization of sustainability, driven by climate change mitigation goals,
international agreements, and technological advancements in clean energy.
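
The share calculation and the 2024 comparison can be sketched together. The values are placeholders, not WEI report figures, and the column names are hypothetical.

```r
# Placeholder values only: these are NOT figures from the WEI report
energy <- data.frame(
  year   = 2015:2024,
  clean  = c(10, 11, 12, 13, 14, 15, 17, 19, 21, 23),
  fossil = c(14, 13, 13, 13, 13, 11, 12, 13, 13, 13)
)
# Assumption: total investment is the sum of the two categories
energy$clean_share <- energy$clean / (energy$clean + energy$fossil)

plot(energy$year, 100 * energy$clean_share, type = "o", pch = 19,
     col = "darkgreen", xlab = "Year", ylab = "Clean energy share (%)",
     main = "Clean Energy Share of Total Investment")

# Side-by-side comparison for the latest year
latest <- energy[energy$year == 2024, ]
barplot(c(Clean = latest$clean, Fossil = latest$fossil),
        col = c("darkgreen", "grey40"),
        ylab = "Investment (placeholder units)",
        main = "Clean vs. Fossil Fuel Investment, 2024")
```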
Statistical Analysis
Analysis 1: Descriptive Statistics of Total Investments (2015–2024)
Code:

Analysis:

Interpretation:
This table shows key statistics (mean, median, min, max) for total energy investments,
helping to understand the average investment, typical investment, and the range of values
over the years.

Analysis 2: Correlation Between Clean Energy and Fossil Fuels Investments


Code:

Analysis:

Interpretation:
The correlation coefficient measures the strength and direction of the linear relationship
between clean energy and fossil fuel investments over the period. A value near 1 means the two
have tended to rise and fall together, a value near -1 means one has tended to rise as the
other falls, and a value near 0 indicates little linear association. Because there is only one
observation per year, the coefficient is best read as descriptive rather than conclusive.
Analysis 3: ANOVA: Yearly Variations in Total Investments
Code:

Analysis:

Interpretation:
The ANOVA test assesses whether total energy investments vary significantly across different
years. A significant result suggests that investment patterns change over time, reflecting shifts
in global energy priorities and factors such as economic conditions, policy changes, and
emerging energy trends.
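
A one-way ANOVA needs several observations per year, because a single total per year leaves no within-year variation to test against. The sketch therefore assumes a hypothetical panel with six regions per year. All names and values are simulated.

```r
set.seed(5)
# Hypothetical panel: investment for 6 regions in each of 10 years (simulated)
panel <- expand.grid(region = paste0("Region", 1:6), year = 2015:2024)
panel$investment <- 100 + 5 * (panel$year - 2015) + rnorm(nrow(panel), sd = 10)

# Year is wrapped in factor() so each year is treated as a separate group
fit <- aov(investment ~ factor(year), data = panel)
tab <- summary(fit)[[1]]   # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab

p_value <- tab[["Pr(>F)"]][1]   # a small value suggests mean investment differs across years
```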
Predictive Model
Linear Regression to Predict Total Investments in 2025
Code:

Analysis:

Interpretation:
The linear regression model predicts the total energy investment for 2025. If the prediction
shows an upward trend, it confirms the ongoing global focus on energy investments,
especially in clean energy projects. This projection aligns with broader trends of increasing
investment in renewable energy, supporting long-term sustainability goals.
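
A forecast of this kind could be produced as sketched here. The yearly totals are placeholders, not WEI report figures, and the column names are hypothetical.

```r
# Placeholder values only: these are NOT figures from the WEI report
energy <- data.frame(
  year  = 2015:2024,
  total = c(24, 24, 25, 26, 27, 26, 29, 32, 34, 36)
)

# Straight-line trend of total investment over time
fit <- lm(total ~ year, data = energy)

# Point forecast for 2025 with a 95% prediction interval
forecast <- predict(fit, newdata = data.frame(year = 2025),
                    interval = "prediction")
forecast   # columns: fit (point forecast), lwr and upr (interval bounds)
```

The prediction interval is worth reporting with the point forecast, since ten yearly observations leave considerable uncertainty about the next value.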
Project- 3
E-Commerce Sales Trends Analysis: A
Hypothetical Dataset
Introduction
In today's fast-paced and competitive e-commerce landscape, businesses face growing
pressure to understand consumer behaviour and optimize sales strategies. The ability to
analyse trends, forecast demand, and identify profitable segments is critical for maintaining a
competitive edge. Leveraging data analytics can provide organizations with a deeper
understanding of sales dynamics and customer preferences, enabling informed decision-
making.
This project focuses on a comprehensive dataset of e-commerce transactions, capturing key
details such as order information, customer behaviour, sales metrics, and profit margins. By
systematically analysing this data, we aim to uncover meaningful patterns and trends,
providing valuable insights for stakeholders.
The scope of the analysis is threefold:

1. Identifying Sales and Product Trends:
Understanding which product categories drive the highest sales and recognizing
seasonal or monthly trends can help businesses align their strategies with consumer
demand.
2. Evaluating Key Metrics Relationships:
By examining the relationship between metrics like sales and profits, businesses can
pinpoint factors that influence profitability and optimize their operations for
maximum return.
3. Developing Predictive Models:
Forecasting profits and classifying orders based on their profitability enables better
planning and resource allocation. These models support decision-making by
identifying opportunities for growth and addressing areas of risk.
The analysis presented in this project offers a comprehensive approach to understanding
e-commerce sales trends and profitability. It provides actionable intelligence that
businesses can use to enhance their operational efficiency, marketing strategies, and financial
performance, and it aims to offer a holistic perspective on how e-commerce data
can be transformed into valuable insights for strategic decision-making.
Analysis and Interpretation
Data Pre-Processing Steps:

Output after Cleaning Data:


General Analysis
Analysis 1: Total Sales by Product Category
Code:

Output:

Interpretation:
This bar plot highlights that the "Electronics" category generates the highest sales among all
categories, indicating its dominant role in the e-commerce market. Following Electronics,
"Apparel" and "Home Decor" also contribute significantly but at a lower level. These insights
suggest that the Electronics category should be a priority for inventory optimization and
promotional strategies. Meanwhile, further analysis of Apparel and Home Decor could
uncover opportunities to boost their sales performance. Businesses can use this information to
allocate resources more effectively and align marketing campaigns with high-performing
categories.
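
Category totals like those described above could be computed as sketched here. The orders, their amounts, and the column names are made up for illustration.

```r
# Hypothetical orders (illustrative values only)
orders <- data.frame(
  category = c("Electronics", "Apparel", "Home Decor",
               "Electronics", "Apparel", "Electronics"),
  sales    = c(1200, 300, 250, 800, 150, 950)
)

# Sum sales within each category, then sort from highest to lowest
sales_by_cat <- aggregate(sales ~ category, data = orders, FUN = sum)
sales_by_cat <- sales_by_cat[order(-sales_by_cat$sales), ]
sales_by_cat

barplot(sales_by_cat$sales, names.arg = sales_by_cat$category,
        col = "steelblue", ylab = "Total sales",
        main = "Total Sales by Product Category")
```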
Analysis 2: Monthly Sales Trends
Code:

Output:

Interpretation:
The line graph reveals significant fluctuations in monthly sales, with certain months
experiencing noticeable peaks. These peaks are likely tied to seasonal demands, promotional
events, or holiday shopping periods. Understanding these trends allows businesses to prepare
for high-demand months by optimizing inventory levels, staffing, and marketing efforts.
Additionally, the analysis provides opportunities to identify slower months and develop
strategies, such as targeted discounts, to increase sales during these periods. Leveraging this
data ensures that resources are used efficiently to maximize revenue throughout the year.
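
Monthly totals can be derived from order dates as sketched here. The dates, amounts, and column names are simulated assumptions, so any peaks in the plot are random rather than seasonal.

```r
set.seed(9)
# Hypothetical orders spread over one calendar year (simulated)
orders <- data.frame(
  order_date = as.Date("2023-01-01") + sample(0:364, 500, replace = TRUE),
  sales      = round(runif(500, min = 20, max = 500), 2)
)

# Label each order with its year and month, then sum sales per month
orders$month <- format(orders$order_date, "%Y-%m")
monthly <- aggregate(sales ~ month, data = orders, FUN = sum)

plot(seq_len(nrow(monthly)), monthly$sales, type = "o", pch = 19,
     xaxt = "n", xlab = "", ylab = "Total sales",
     main = "Monthly Sales Trend")
axis(1, at = seq_len(nrow(monthly)), labels = monthly$month, las = 2)

# Month with the highest total sales
peak_month <- monthly$month[which.max(monthly$sales)]
peak_month
```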
Analysis 3: Distribution of Quantity Ordered
Code:

Output:

Interpretation:
The histogram shows that most orders involve small quantities, typically between 1 and 3
units. This indicates a strong consumer preference for purchasing single or few items at a
time, reflecting individual or personal use buying behaviour. Bulk purchases are less
frequent, suggesting a potential opportunity for businesses to encourage larger orders through
bulk discounts or bundled offers. By analysing this trend, companies can design targeted
campaigns to increase the average order value, contributing to improved profitability and
customer engagement.
Statistical Analysis
Analysis 1: Descriptive Statistics and Visualization
We will start by calculating basic descriptive statistics for continuous variables, such as the
mean, median, standard deviation, and range. Alongside, we will visualize the distributions
using histograms or boxplots. This approach will provide a clearer understanding of the data’s
spread, central tendency, and help identify any potential outliers, offering a solid foundation
for further analysis and insights.

Code:

Output:

Interpretation:

The summary function provides essential statistics for continuous variables, including the
minimum, 1st quartile, median, mean, 3rd quartile, and maximum, offering a comprehensive
overview of the data's central tendency and spread. Additionally, the ggplot2 histograms
display the distribution of variables like Age and Mutual Funds, helping to identify skewness
and outliers, which can inform further analysis or data cleaning steps.
Analysis 2: Correlation Analysis
Given that the dataset contains multiple continuous variables such as Mutual Funds, Equity
Market, Debentures, and others, we will examine the correlations between these variables. By
calculating the correlation coefficients, we can identify any significant relationships or
patterns between them.

Code:

Output:

Interpretation:

The logistic regression model uses a binary target variable (Stock Market) and includes
predictors such as age, Mutual Funds, Equity Market, and others. The summary function
provides insights into the coefficients, p-values, and significance of each predictor, helping to
identify which variables significantly influence the likelihood of investing in the stock
market. Additionally, the predicted probabilities offer the likelihood that each individual will
invest in the stock market based on their attributes.
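
A logistic regression of the kind described could be fitted as sketched here. The data are simulated, and the column names (age, mutual_funds, stock_market) and the way the target is generated are assumptions.

```r
set.seed(21)
n <- 300
# Hypothetical predictors (simulated)
investors <- data.frame(
  age          = sample(21:60, n, replace = TRUE),
  mutual_funds = sample(1:7, n, replace = TRUE)   # hypothetical preference rank
)

# Hypothetical binary target: 1 = invests in the stock market, 0 = does not
p_true <- plogis(-3 + 0.06 * investors$age)
investors$stock_market <- rbinom(n, size = 1, prob = p_true)

# Logistic regression: binomial family with the default logit link
model <- glm(stock_market ~ age + mutual_funds,
             data = investors, family = binomial)
summary(model)$coefficients   # estimates, standard errors, z values, p-values

# Predicted probability of investing for each individual
investors$prob <- predict(model, type = "response")
head(investors)
```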
Project- 4
Data-Driven Insights into Customer
Purchasing Decisions
Introduction
The Customer Purchase Dataset offers valuable insights into consumer purchasing behaviour,
a key element for businesses to refine their marketing strategies, optimize product offerings,
and enhance customer satisfaction. This dataset includes customer attributes such as
demographics, purchase history, and location, along with transaction details. By analysing
this data, businesses can understand how consumers engage with their products and services.

By leveraging statistical and machine learning techniques, this project will uncover factors
influencing purchasing decisions, identify customer segments, and predict future buying
behaviours. The analysis will help identify correlations between product types, seasonal
trends, and customer demographics, providing actionable insights to improve sales strategies.

Objectives of the Project:

• Identifying Popular Products and Categories:
Analyse the dataset to find the most frequently purchased products and product
categories.

• Customer Segmentation:
Segment customers based on their purchasing habits to tailor marketing efforts and
product offerings.

• Exploring Purchase Patterns:
Analyse factors such as time of purchase, customer demographics, and product
preferences to understand what drives purchasing decisions.

• Developing Predictive Models:
Use predictive modelling to forecast future customer purchases, helping businesses
anticipate demand and adjust strategies.

By leveraging the insights from this project, businesses can refine their inventory
management strategies, implement targeted marketing campaigns, and improve sales
forecasting, ensuring they are well-positioned to meet consumer expectations. As a result,
businesses can optimize their operations, improve customer engagement, and drive long-term
growth in a competitive market. This project aims to provide businesses with actionable
insights, allowing them to make data-driven decisions that enhance customer retention,
optimize product offerings, and drive revenue growth.
Analysis and Interpretation
Data Pre-Processing: -

Output after cleaning Data: -


General Analysis
Analysis 1: Distribution of Age

We will create a histogram to visualize the distribution of customer ages, allowing us to identify
common age ranges and trends. This will help us understand which age groups are most
prominent in the customer base, providing valuable insights that can inform targeted marketing
campaigns and product strategies tailored to specific demographics.

Code:

Output:

Interpretation:
This histogram will display the distribution of customer ages, allowing us to see how ages are
spread across the dataset. By visualizing this, we can identify which age groups are most
represented, whether younger or older customers dominate, and understand the overall
demographic profile. This insight can guide businesses in tailoring marketing campaigns,
adjusting product offerings, and even determining the best channels for reaching different age
segments.
Analysis 2: Boxplot of Spending by Membership Duration
We will create a boxplot to analyse how membership duration (in months) affects spending.
This will help identify trends and variations in spending behaviour, revealing whether
customers with longer membership durations tend to spend more or less, and highlight any
outliers or patterns linked to the length of membership.
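
Membership duration in months is continuous, so one option is to group it into bands and draw one box per band. The sketch uses simulated data. The column names and the 12-month bands are assumptions.

```r
set.seed(13)
# Hypothetical customers (simulated; spending is independent of tenure here)
customers <- data.frame(
  membership_months = sample(1:60, 250, replace = TRUE),
  spending          = round(rlnorm(250, meanlog = 6, sdlog = 0.5), 2)
)

# Group duration into 12-month bands so each band gets its own box
customers$tenure_band <- cut(customers$membership_months,
                             breaks = c(0, 12, 24, 36, 48, 60),
                             labels = c("0-12", "13-24", "25-36", "37-48", "49-60"))

boxplot(spending ~ tenure_band, data = customers, col = "lightblue",
        xlab = "Membership duration (months)", ylab = "Spending",
        main = "Spending by Membership Duration")

# Median spending per band, to read alongside the plot
aggregate(spending ~ tenure_band, data = customers, FUN = median)
```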

Code:

Output:

Interpretation:

The boxplot will illustrate how spending amounts differ across various membership
durations, providing insight into whether customers who have been members for a longer
period tend to spend more or less on average. It will also highlight the spread of spending
within each membership group and reveal any outliers, helping to identify trends and
behaviours that can inform customer retention and engagement strategies.
Analysis 3: Correlation Heatmap Between Numerical Variables

We will generate a correlation heatmap for numerical variables such as Age, Income,
Spending, Purchase, and Last Purchase Amount. This will help identify relationships between
these variables, highlighting any strong or weak correlations, which can provide insights into
how these factors influence each other and guide data-driven decision-making for marketing
and product strategies.
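
A heatmap of this kind can be drawn from a correlation matrix using only base R. The sketch uses simulated data. The column names and the built-in link between income and spending are assumptions.

```r
set.seed(17)
n <- 200
income <- rnorm(n, mean = 50000, sd = 12000)
# Hypothetical customers (simulated)
customers <- data.frame(
  age           = sample(18:70, n, replace = TRUE),
  income        = income,
  spending      = 0.1 * income + rnorm(n, sd = 1500),  # assumed link to income
  last_purchase = runif(n, min = 10, max = 500)
)

# Pairwise Pearson correlations between all numeric columns
corr <- cor(customers)
round(corr, 2)

# Blue for negative, white for near zero, red for positive correlations
pal <- colorRampPalette(c("blue", "white", "red"))(21)
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        col = pal, zlim = c(-1, 1), margins = c(8, 8),
        main = "Correlation Heatmap")
```

Fixing zlim at -1 to 1 keeps white centred on zero, so the colours stay comparable between datasets.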

Code:

Output:

Interpretation:

The heatmap will show the correlation between different variables, with positive correlations
represented in red, negative correlations in blue, and weak correlations in white. This analysis
will help identify which variables are most strongly related, such as the potential link
between Income and Spending, providing valuable insights for targeting and optimizing
business strategies.
Statistical Analysis
Analysis 1: Histogram – Distribution of Income
A histogram can help visualize the distribution of continuous data like Income. By plotting
the frequency of income values, we can identify patterns such as the range of incomes, the
presence of skewness, and the most common income ranges within the dataset.
Code:

Output:

Interpretation:

The histogram shows the distribution of the Income variable. If the histogram is skewed to
the right (positively skewed), it indicates that most people have lower incomes, with fewer
individuals having very high incomes. On the other hand, if the histogram is more symmetric,
it suggests that income is more evenly distributed across the dataset. This analysis helps to
understand the overall income distribution within the customer base, which can guide
targeted marketing and product offerings.
Analysis 2: Density Plot of Purchase Amount by Age Group
We will create a density plot to visualize the distribution of Purchase Amount across different
Age Groups. This will help us understand if age influences purchasing behaviour, showing
whether certain age groups tend to spend more or less. The density plot will reveal patterns in
how purchase amounts vary within each age group, offering valuable insights into how age
demographics impact consumer spending.

Code:

Output:

Interpretation:
The density plot will show how the Purchase Amount is distributed across different Age
Groups. This analysis will help identify which age groups are more likely to make larger
purchases and uncover any age-related spending habits or trends, providing valuable insights
for targeted marketing and product strategy.
Analysis 3: Box Plot of Birth Date Year Distribution
We will use a boxplot to explore how spending varies based on the year of birth (or
membership duration, depending on context). This analysis will help identify if there are any
significant differences in spending patterns between different age groups or membership
tenures. It will also provide insights into whether long-term members tend to spend more than
newer members, highlighting potential trends in customer loyalty and spending behaviour.

Code:

Output:

Interpretation:
The boxplot will show the distribution of spending across different membership durations. It
will help assess whether longer memberships lead to higher or more consistent spending and
identify any outliers or significant variations in spending behaviour. This can provide insights
into customer loyalty and the potential impact of membership length on purchasing habits.
Predictive Analysis:
Analysis 1: Linear Regression Model – Predicting Spending from Income
Linear regression will be used to predict Spending, a continuous outcome, based on Income,
one of the predictor variables. This model helps understand the relationship between income
and spending, allowing us to predict how changes in income may influence consumer
spending. By fitting a linear regression line to the data, we can quantify the strength of this
relationship and use it to make informed predictions about future spending patterns.

Code:

Output:

Interpretation:

The scatter plot shows individual data points representing the relationship between Income and
Spending. The red regression line indicates the best linear fit, showing the overall trend
between the two variables. By analysing this line, the model can predict Spending based on
Income, helping to understand how changes in income might influence consumer spending
behaviours.
