Group_2_Practical
Objectives:
● Intuition
○ Understanding the basics of probability distributions
○ Conceptualizing the Normal, Poisson, and Bernoulli distributions
● Dataset Specification
○ Describing the datasets used in the practical
○ Identifying the distribution parameters and their significance
● Program
○ Implementing Python programs that generate and visualize data for each distribution
● Output
○ Interpreting the resulting histograms and theoretical probability functions
● Applications
○ Identifying real-world uses of probability distributions
Intuition
A probability distribution is a mathematical function or model that describes how the values of a random variable are distributed. In other words, it tells you the likelihood of the possible outcomes of a random event or experiment. Probability distributions are fundamental in statistics and probability theory and are used to understand, analyse, and model uncertainty in many real-world scenarios.
Normal Distribution:
• The normal distribution, often referred to as the Gaussian distribution, is one of the most
important and widely used probability distributions.
• It is characterized by a bell-shaped curve with two parameters: the mean (μ) and the standard
deviation (σ).
• Many natural phenomena, such as heights, weights, and measurement errors, closely follow a
normal distribution.
• The distribution is symmetric and unimodal, with a mean of μ and a variance of σ²; a short numerical check of these properties is sketched below.
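As a quick, illustrative check of these properties (not part of the practical's program; the mean of 5 and standard deviation of 2 are arbitrary example values), the following Python sketch draws a large sample, compares its sample mean and variance with μ and σ², and confirms the symmetry of the PDF:
import numpy as np
from scipy.stats import norm
# Arbitrary example parameters (not taken from the practical's dataset)
mu, sigma = 5.0, 2.0
samples = np.random.default_rng(0).normal(mu, sigma, 100_000)
# Sample statistics should be close to the theoretical mean and variance
print("sample mean:", samples.mean(), " expected:", mu)
print("sample variance:", samples.var(), " expected:", sigma ** 2)
# Symmetry: the PDF takes equal values at points equidistant from the mean
print(norm.pdf(mu - 1, mu, sigma), norm.pdf(mu + 1, mu, sigma))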
Poisson Distribution:
• The Poisson distribution models the number of events occurring in a fixed interval of time or
space.
• It is characterized by a single parameter λ (lambda), which represents the average rate of
occurrence.
• The distribution is discrete and often used for rare events or phenomena that occur randomly but with a known average rate; a small worked example of its probability mass function is sketched below.
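As a small worked example, assuming the same average rate λ = 3 that is used later in the Output section, the probabilities P(X = k) = e^(−λ) λ^k / k! for the first few values of k can be computed with scipy.stats.poisson:
from scipy.stats import poisson
lam = 3  # average event rate, matching the lambda used in the Output section
for k in range(6):
    # P(X = k) = exp(-lam) * lam**k / k!
    print(f"P(X = {k}) = {poisson.pmf(k, lam):.4f}")
# The probabilities over all possible counts sum to 1 (checked here over 0..20)
print(sum(poisson.pmf(k, lam) for k in range(21)))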
Bernoulli Distribution:
• The Bernoulli distribution models a binary outcome with two possible results: success (1) and
failure (0).
• It is characterized by a single parameter p, representing the probability of success.
• The Bernoulli distribution is a fundamental building block for modelling binary events and serves as the basis for the binomial distribution, which models the number of successes in a fixed number of independent Bernoulli trials; this relationship is illustrated in the sketch below.
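The following sketch is an illustrative aside (the values p = 0.5, n = 10, the 100,000 simulated groups, and the seed are arbitrary choices): it simulates groups of independent Bernoulli trials and checks that the number of successes per group matches the binomial PMF:
import numpy as np
from scipy.stats import bernoulli, binom
p, n = 0.5, 10   # arbitrary success probability and number of trials per group
# Each row is one group of n independent Bernoulli trials
trials = bernoulli.rvs(p, size=(100_000, n), random_state=42)
successes = trials.sum(axis=1)
# Empirical frequency of exactly 4 successes vs. the binomial PMF
print("simulated :", np.mean(successes == 4))
print("binom PMF :", binom.pmf(4, n, p))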
Dataset Specification
In the context of our practical project, we utilized carefully generated datasets that align with specific
probability distributions. These datasets were designed to capture the characteristics of three key
probability distributions, namely the Normal, Poisson, and Bernoulli distributions.
• We crafted a dataset to closely mimic the properties of the normal distribution. This dataset
was created by generating random data points that follow a bell-shaped curve. The data is
characterized by a defined mean (average) and standard deviation (spread). By leveraging
this dataset, we aimed to demonstrate the application of the normal distribution in modelling
real-world phenomena, such as measurements, where data often exhibits a symmetrical
distribution around a central value.
• To simulate the Poisson distribution, we generated a dataset that emulates the occurrence of
events within a fixed interval. The data in this dataset is characterized by a known average
event rate (λ). The dataset was crafted to illustrate the properties of the Poisson distribution,
which is commonly used to model rare and unpredictable events, such as customer arrivals at
a store, machine failures, or accidents.
• For the Bernoulli distribution, we created a dataset representing binary outcomes with two
possible results, typically labelled as "success" and "failure." The probability of success (p)
was predefined in the dataset, allowing us to illustrate the concept of a Bernoulli trial and
how it serves as the foundation for more complex models like the binomial distribution. This
dataset highlights the application of the Bernoulli distribution in modelling events with only
two possible outcomes, such as coin flips or product defect detection; a sketch of how all three datasets can be generated in Python follows below.
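The snippet below is a minimal sketch of how datasets like these can be generated with NumPy. The sample size of 1,000, the fixed seed, and the normal parameters (mean 0, standard deviation 1) are assumptions made here for illustration, while λ = 3 and p = 0.3 match the values reported in the Output section:
import numpy as np
rng = np.random.default_rng(42)   # fixed seed for reproducibility
n = 1000                          # assumed sample size
# Normal dataset: assumed mean 0 and standard deviation 1
normal_data = rng.normal(loc=0.0, scale=1.0, size=n)
# Poisson dataset: average event rate lambda = 3 (as in the Output section)
poisson_data = rng.poisson(lam=3, size=n)
# Bernoulli dataset: probability of success p = 0.3 (as in the Output section)
bernoulli_data = rng.binomial(n=1, p=0.3, size=n)
print(normal_data.mean(), poisson_data.mean(), bernoulli_data.mean())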
Program:
Normal_distribution.py
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Simulated samples (assumed mean 0, std 1) with the theoretical PDF overlaid
data = np.random.normal(0, 1, 1000)
x = np.linspace(-4, 4, 200)
plt.hist(data, bins=30, density=True, color='blue', alpha=0.6)
plt.plot(x, norm.pdf(x), 'k-')   # black curve: theoretical PDF
plt.title("Normal Distribution")
plt.show()
Poisson_distribution.py
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
# Simulated counts with lambda = 3 (blue histogram) and theoretical PMF (black bars)
data = np.random.poisson(3, 1000)
plt.hist(data, bins=np.arange(12) - 0.5, density=True, color='blue', alpha=0.6)
plt.vlines(np.arange(11), 0, poisson.pmf(np.arange(11), 3), colors='black')
plt.title("Poisson Distribution")
plt.show()
Bernoulli_distribution.py
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
# Parameter for the Bernoulli distribution (probability of success)
p = 0.3
# Simulated trials (red histogram) and the theoretical PMF (black lines)
data = bernoulli.rvs(p, size=1000)
plt.hist(data, bins=[-0.5, 0.5, 1.5], density=True, color='red', alpha=0.6)
plt.vlines([0, 1], 0, bernoulli.pmf([0, 1], p), colors='black')
plt.title("Bernoulli Distribution")
plt.xticks([0, 1], ['Failure', 'Success'])
plt.show()
Output:
Normal Distribution:
• This figure represents the normal distribution, characterized by its symmetric, bell-shaped curve centred on the mean μ, with spread determined by the standard deviation σ.
Poisson Distribution:
• This figure represents the Poisson distribution, which models the number of events occurring
in a fixed interval of time or space.
• The histogram in blue displays a simulated dataset generated from a Poisson distribution with
a parameter λ (lambda) set to 3.
• The black vertical bars represent the probability mass function (PMF) of the Poisson
distribution. They show the theoretical probabilities of observing a specific number of events
in the given range (0 to 10); these theoretical values are compared with the simulated frequencies in the sketch below.
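To make the comparison between the blue histogram and the black PMF bars concrete, the sketch below (illustrative only; the sample size of 1,000 and the seed are assumptions) tabulates the empirical frequency of each count in a simulated Poisson(λ = 3) sample next to the corresponding theoretical probability:
import numpy as np
from scipy.stats import poisson
data = np.random.default_rng(0).poisson(3, 1000)   # simulated counts, lambda = 3
for k in range(11):
    empirical = np.mean(data == k)                 # relative frequency in the sample
    theoretical = poisson.pmf(k, 3)                # height of the black PMF bar
    print(f"k={k:2d}  empirical={empirical:.3f}  theoretical={theoretical:.3f}")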
Bernoulli Distribution:
• This figure represents the Bernoulli distribution, which models a binary outcome (e.g.,
success or failure) with a single probability parameter (p).
• The histogram in red displays a simulated dataset generated from a Bernoulli distribution
with a probability of success (p) set to 0.3.
• The black vertical lines represent the probability mass function (PMF) of the Bernoulli
distribution. They show the theoretical probabilities of observing a failure (0) or a success
(1).
Applications
Probability distributions have a wide range of applications across various fields, including statistics,
science, engineering, finance, and more. Here are some common applications of probability
distributions:
1. Statistics and Data Analysis:
• Describing and modelling data distributions.
• Hypothesis testing and confidence interval estimation.
• Regression analysis and curve fitting.
2. Quality Control:
• Monitoring and controlling the quality of manufactured products and processes.
• Detecting defects or deviations from expected standards.
3. Finance:
• Modelling financial market returns and asset prices.
• Risk management and portfolio optimization.
• Option pricing using the Black-Scholes model (see the sketch after this list).
4. Natural Phenomena:
• Modelling various natural phenomena with continuous or discrete data, including:
○ Temperature distributions.
○ IQ score distributions.
○ Errors in measurements.
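As a concrete illustration of the option-pricing application mentioned above, the Black-Scholes price of a European call option is built directly on the normal distribution's cumulative distribution function. The sketch below is a minimal example: the helper function black_scholes_call is defined here for illustration, and the spot price, strike, interest rate, volatility, and maturity are arbitrary values.
import numpy as np
from scipy.stats import norm
def black_scholes_call(S, K, r, sigma, T):
    # d1 and d2 terms of the Black-Scholes formula
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    # Call price uses the normal CDF, norm.cdf
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
# Arbitrary example: spot 100, strike 105, 5% rate, 20% volatility, 1-year maturity
print(black_scholes_call(100, 105, 0.05, 0.2, 1.0))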