CS-601 Machine Learning Unit-1 New
Unit-I
CS601-Machine Learning
Syllabus
Unit –I
Introduction to machine learning, scope and limitations, regression,
probability, statistics and linear algebra for machine learning, convex
optimization, data visualization, hypothesis function and testing, data
distributions, data preprocessing, data augmentation, normalizing data sets,
machine learning models, supervised and unsupervised learning.
Course Outcomes:
CO601.1: Students will be able to explain the fundamental concepts of machine
learning and apply knowledge of computing and mathematics to machine
learning problems, models, and algorithms.
Key Concepts:
• Dependent Variable (Y): The variable you're trying to predict or explain.
• Independent Variable(s) (X): The variable(s) that are used to predict the dependent
variable.
• Regression Model: A mathematical model that describes how the independent
variables influence the dependent variable.
• Prediction: Regression can be used to predict the value of the dependent variable
based on the values of the independent variables.
• Relationship: Regression helps to understand and quantify the relationship between
the dependent and independent variables.
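As a minimal sketch of these concepts, the snippet below fits a regression model with a single independent variable by ordinary least squares and then uses it for prediction. The data values and the helper names `fit_line` and `predict` are made up for illustration and are not part of the course material.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(X, Y) / variance(X)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the dependent variable for a new value of the independent variable."""
    return slope * x + intercept


# Illustrative (made-up) data: X is the independent variable, Y the dependent one.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]  # generated from Y = 2 * X + 1

slope, intercept = fit_line(X, Y)
prediction = predict(5.0, slope, intercept)
print(slope, intercept, prediction)
```

The slope quantifies the relationship (how much Y changes per unit of X), and `predict` covers the prediction use case.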
Examples of Regression in Machine Learning:
• Predicting house prices: understanding how factors like size, location, and age affect the price of a house.
• Predicting sales: understanding how factors like advertising spending, pricing, and seasonality affect sales.
• Rotation
• Translation
Problem Statement:
A coin is tossed. What is the probability of getting a head?
Solution:
Total number of equally likely outcomes (n) = 2 (i.e. head or tail)
Number of outcomes favorable to head (h) = 1
Number of outcomes favorable to tail (t) = 1
Probability of getting a head = h/n = 1/2
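The same calculation can be sketched in code, assuming the classical definition of probability (favorable outcomes divided by equally likely outcomes). The function name `classical_probability` is a hypothetical helper introduced here for illustration.

```python
from fractions import Fraction


def classical_probability(favorable, total):
    """Probability = favorable outcomes / total equally likely outcomes."""
    return Fraction(favorable, total)


n = 2  # total equally likely outcomes: head or tail
h = 1  # outcomes favorable to head
t = 1  # outcomes favorable to tail

p_head = classical_probability(h, n)
p_tail = classical_probability(t, n)
print(p_head, p_tail)        # 1/2 1/2
print(p_head + p_tail == 1)  # True
```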
• The "mean" is the "average" you're used to, where you add up all the
numbers and then divide by the number of numbers.
• The "median" is the "middle" value in the list of numbers. To find the median,
your numbers have to be listed in numerical order from smallest to largest,
so you may have to rewrite your list before you can find the median.
• The "mode" is the value that occurs most often. If no number in the
list is repeated, then there is no mode for the list.
Example: for the list 1, 2, 4, 7
mean: (1 + 2 + 4 + 7) / 4 = 3.5
median: (2 + 4) / 2 = 3
mode: none
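These three measures can be checked with Python's standard `statistics` module. This is a small sketch: `mode_or_none` is a hypothetical helper that applies the convention used above (no repeated value means no mode), which differs from the library's own `mode` function.

```python
from statistics import mean, median, multimode


def mode_or_none(values):
    """Return the list of most frequent values, or None if no value repeats."""
    if len(set(values)) == len(values):
        return None  # every value occurs exactly once: no mode
    return multimode(values)


data = [1, 2, 4, 7]
print(mean(data))          # 3.5
print(median(data))        # 3.0 (average of the two middle values, 2 and 4)
print(mode_or_none(data))  # None
```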
Linear Algebra
Convex Optimization:
Convex Set
• Convex optimization in machine learning involves finding the best solution of a problem in which a convex
objective function is minimized (or, equivalently, a concave one is maximized) and all constraints are also convex.
Examples:
• Linear Programming: Where the objective function and constraints are linear.
• Support Vector Machines (SVMs): Where training is posed as a convex quadratic optimization problem.
• Least Squares Problems: Where the objective is to minimize the sum of squared errors.
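To illustrate the least squares case, here is a minimal sketch that minimizes a convex sum-of-squared-errors objective with plain gradient descent. The one-parameter model y ≈ w * x, the data, the learning rate, and the step count are all assumptions chosen for illustration. Because the objective is convex, gradient descent with a small enough step size approaches the global minimum.

```python
def sum_squared_error(w, xs, ys):
    """Convex objective: sum of squared errors of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def gradient(w, xs, ys):
    """Derivative of the objective with respect to w."""
    return 2 * sum(x * (w * x - y) for x, y in zip(xs, ys))


def gradient_descent(xs, ys, w=0.0, learning_rate=0.01, steps=200):
    """Repeatedly step against the gradient, starting from w."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, xs, ys)
    return w


# Made-up data lying exactly on y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_gd = gradient_descent(xs, ys)
# Closed-form minimizer for this one-parameter problem, for comparison
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_gd, w_closed, sum_squared_error(w_gd, xs, ys))
```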
Data Visualization
• Data visualization is an important skill in applied statistics and machine learning.
• Statistics focuses on quantitative descriptions and estimations of data, while data visualization provides an
important suite of tools for gaining a qualitative understanding.
• This can be helpful when exploring and getting to know a dataset and can help with identifying patterns, corrupt
data, outliers, and much more.
• With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in
plots and charts that are more visceral to yourself and stakeholders than measures of association or significance.
There are four key plots that you need to know well for basic data visualization. They are:
• Line Plot
• Bar Chart
• Histogram Plot
• Scatter Plot
With knowledge of these plots, you can quickly get a qualitative understanding of most data that you come across.
Line Plot
A line plot is generally used to present observations collected at regular intervals.
The x-axis represents the regular interval, such as time. The y-axis shows the
observations, ordered by the x-axis and connected by a line.
Bar Chart
A bar chart is generally used to present relative quantities for multiple categories.
A bar chart can be created by calling the bar() function and passing the category names
for the x-axis and the quantities for the y-axis.
Bar charts can be useful for comparing multiple point quantities or estimations.
Histogram Plot
A histogram plot is generally used to summarize the distribution of a data sample.
The x-axis represents discrete bins or intervals for the observations. For example,
observations with values between 1 and 10 may be split into five bins: the values [1,2]
would be allocated to the first bin, [3,4] to the second bin, and so
on.
The y-axis represents the frequency or count of the number of observations in the
dataset that belong to each bin.
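The binning described above can be sketched without a plotting library. The sample values are made up, `histogram_counts` is a hypothetical helper, and the sketch assumes integer observations that all lie within the stated range.

```python
def histogram_counts(values, low, high, num_bins):
    """Count integer observations per equal-width bin over the range low..high."""
    width = (high - low + 1) // num_bins  # e.g. 10 values / 5 bins = 2 per bin
    counts = [0] * num_bins
    for v in values:
        counts[(v - low) // width] += 1
    return counts


# Made-up sample of integer observations between 1 and 10
sample = [1, 2, 2, 3, 5, 5, 6, 9, 10, 10]
counts = histogram_counts(sample, low=1, high=10, num_bins=5)

# Text rendering: one row per bin (x-axis), one '#' per observation (y-axis)
for i, count in enumerate(counts):
    start = 1 + 2 * i
    print(f"[{start},{start + 1}] {'#' * count}")
```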
Scatter Plot
A scatter plot is generally used to summarize the relationship between two paired data samples.
Paired data samples means that two measures were recorded for a given observation, such as the
weight and height of a person.
The x-axis represents observation values for the first sample, and the y-axis represents the
observation values for the second sample. Each point on the plot represents a single observation.
Scatter plots are useful for showing the association or correlation between two variables. The
relationship can also be quantified, for example with a line of best fit, which can be drawn as a
line plot on the same chart to make the relationship clearer.
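One common way to quantify the association shown by a scatter plot is the Pearson correlation coefficient. The sketch below computes it from scratch. The height and weight values are made up for illustration, and `pearson_correlation` is a hypothetical helper name.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two paired samples (between -1 and 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up paired observations: one (height, weight) pair per person
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]

r = pearson_correlation(heights_cm, weights_kg)
print(round(r, 3))
```

A value of r close to 1 corresponds to points that rise together along a nearly straight line, and a value close to 0 corresponds to no linear association.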
Hypothesis function and testing
Hypothesis testing is a statistical method used to make statistical decisions from experimental data. A hypothesis
is an assumption that we make about a population parameter. The equations used to represent the methods are
called hypothesis functions.
To draw inferences, we have to make some assumptions, which lead to two terms that are used in hypothesis
testing.
• Null hypothesis: The assumption that there is no anomaly or real effect, so that the observation is attributed to
chance.
• Alternate hypothesis: Contrary to the null hypothesis, it states that the observation is the result of a real effect.
2. Z Test
3. ANOVA Test
4. Chi-Square Test
Hypothesis Testing
• All hypotheses are tested using a four-step process:
• The first step is for the analyst to state the two hypotheses so that only one can
be right.
• The next step is to formulate an analysis plan, which outlines how the data will
be evaluated.
• The third step is to carry out the plan and physically analyze the sample data.
• The fourth and final step is to analyze the results and either reject the null
hypothesis, or state that the null hypothesis is plausible, given the data.
• The alternative hypothesis would be denoted as "Ha" and be identical to the null
hypothesis, except with the equal sign struck-through, meaning that it does not
equal 50%.
• Type II error: A Type II error occurs when the researcher fails to reject a null
hypothesis that is false. The probability of committing a Type II error is called
Beta, and is often denoted by β. The probability of not committing a Type II error
is called the Power of the test.
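The four-step process can be sketched with a two-sided one-sample z-test for a proportion, assuming a null hypothesis that a coin lands heads 50% of the time. The sample counts (60 heads in 100 tosses), the significance level of 0.05, and the helper names are assumptions made for illustration.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_z_test(successes, trials, p0=0.5):
    """Two-sided z-test of H0: p = p0 against Ha: p != p0."""
    p_hat = successes / trials
    standard_error = math.sqrt(p0 * (1.0 - p0) / trials)
    z = (p_hat - p0) / standard_error
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value


# Step 1: state the hypotheses. H0: p = 0.5, Ha: p != 0.5
# Step 2: analysis plan. Two-sided z-test at significance level alpha.
alpha = 0.05
# Step 3: analyze the sample data (made-up counts).
z, p_value = proportion_z_test(successes=60, trials=100)
# Step 4: reject H0, or state that H0 is plausible given the data.
reject_null = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_null)
```

Rejecting a true null hypothesis at this step would be a Type I error, and failing to reject a false one would be the Type II error described above.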
Supervised machine learning is a machine learning technique that uses labeled data to train models to
predict outcomes.
How it works:
• A model is trained using input data and desired output values.
• The model compares its predicted value to the actual value.
• The model updates its solution based on the difference between the predicted and actual values.
• The model repeats this process for each labeled example in the dataset.
• The model learns the mathematical relationship between the features and the label.
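The loop described above can be sketched for one of the simplest possible models, a line y = w * x + b, updated after every labeled example. The labeled data, the learning rate, the number of passes, and the function name `train` are assumptions chosen for illustration. Real supervised learners differ in the model and in the update rule.

```python
def train(examples, learning_rate=0.05, epochs=500):
    """Fit y = w * x + b by correcting the model after each labeled example."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:                # x is the feature, y is the label
            predicted = w * x + b            # model's prediction
            error = predicted - y            # difference from the actual value
            w -= learning_rate * error * x   # update proportional to the error
            b -= learning_rate * error
    return w, b


# Made-up labeled data generated from y = 2 * x + 1
labeled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = train(labeled)
print(round(w, 3), round(b, 3))
```

After training, w and b encode the learned relationship between the feature and the label, so `w * x + b` can be used to predict the label of an unseen x.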
Key Components: