Chapter 1 (6)

The document outlines a course on Data Science Fundamentals with a focus on K-means clustering and Python programming. It introduces key concepts of data science, including types of data, analytics methods, and machine learning techniques. The course aims to equip learners with the foundational knowledge and skills necessary to apply data science in real-world scenarios.

Uploaded by

yosefmuluye42
Welcome

Data Science Fundamentals with Python (ITec 5032)
_________________________________________
@ 2024 FTVT Institute All Rights Reserved

OF/FTI/ALL /18 Issue No: 1


PPT 1
Course description
The objective of this course is to introduce you to one of the
emerging technologies: data science. There are many data
science techniques, and this course introduces one of the most
commonly used basic techniques, K-means clustering, from
different perspectives (ML, Mathematics, Statistics and
Programming).
In parallel, you will be introduced to the basics of Python
programming for data science.

2
Chapter 1: Foundations of Data Science

3
1.1. Introduction - Data Science
Chapter Objectives:
The objective of this chapter is to gently introduce
you to Data Science through some real-world examples of
where Data Science is used, and by highlighting some
of the main concepts involved:
- How Data Science might be applied to real-world situations.
- How the K-Means clustering algorithm works.
- What we mean by data.
- What we mean by machine learning.
- How to use Python programming for data science.

4
Data science
We are living in the age of “data science and
advanced analytics”, where almost everything in
our daily lives is digitally recorded as data.
The data can be structured (?), semi-structured (?),
or unstructured (?), and its volume increases day by day.
Data science is typically a “concept to unify
statistics, data analysis, and their related
methods” to understand and analyze the actual
phenomena with data, which can be a discovery,
prediction, service, suggestion, insight into
decision-making, thought, model, paradigm, tool,
or system.
5
… Data science
With our lives being increasingly digitized, and with us constantly
on smartphones and making electronic payments, we've all
become data-generating machines. This data says a
lot about you.
When gathered from large populations, this data
becomes even more valuable. It gives the companies
and governments collecting it the ability to analyze
not only what you've done, but also to
predict what you're about to do in the future.

6
Data science
Data science is all about extracting intelligence from
data.
Data science draws on methods from statistics and
computer science to process the vast amounts of data
that we generate in our daily lives and to turn this
into something meaningful and valuable.
Whatever you do in today's world, you generate data.
example:

7
… Data science
Most of us have no idea about exactly how our data is
being used. The algorithms used by social media
companies, banks, and health insurers can seem like
mysterious black boxes.
What are the different applications of data science?
This course focuses on one of the most commonly used applications of data
science: data clustering using the K-means algorithm.

8
What do we mean by data?
The word "data" is often used interchangeably
with "information".
With respect to data science,
“data can refer to a collection of factual
information that can be used by a computer as a
basis for calculation.”
Data can be represented either as a number
such as age or a price, or as a category such as
hair color or type of fruit.

9
… data
Data becomes particularly interesting when we're
dealing with large volumes of it. Take information on
the hair color and age of a handful of people. With
only a few examples, not much can be said. But if
this is scaled up to 10,000 people, then patterns
emerge in the data which can be useful to, for
example, a marketing company that's trying to
market new hair care products.
They might use this to decide which age group they
should market their products to.
10
Types of data
There are two broad types of data,
i. Categorical and
ii. Numeric.
i. Categorical
is when a textual description or label is used to
represent specific categories of objects.
Categories include things like
primary colors: red, green, and blue; or
fruit: banana, apple, orange.

11
ii. Numerical
Numeric data can be continuous. Something that is measured on a
continuous scale, like air temperatures at the North Pole or the weight of
a cow.
Numeric data can also be discrete. Something that is countable like the
number of beans in a jar, or the number of people who survived the
sinking of the Titanic.
Ultimately, all data is stored on a computer as a discrete binary number.
This means that when we're working with categorical data or continuous
numeric data, the computer is actually mapping this to some discrete
value behind the scenes.
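As a minimal sketch of the point above, the snippet below maps categorical labels to discrete integer codes, which is roughly what happens behind the scenes. The helper name `encode_categories` and all sample values are illustrative assumptions, not part of the course material.

```python
def encode_categories(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer code
        encoded.append(codes[value])
    return encoded, codes

# Hypothetical sample values for each type of data.
fruit = ["banana", "apple", "orange", "apple"]  # categorical
beans_in_jar = 412                              # discrete numeric (a count)
cow_weight_kg = 512.7                           # continuous numeric (a measurement)

encoded_fruit, fruit_codes = encode_categories(fruit)
print(encoded_fruit)  # [0, 1, 2, 1]
print(fruit_codes)    # {'banana': 0, 'apple': 1, 'orange': 2}
print(type(beans_in_jar).__name__, type(cow_weight_kg).__name__)  # int float
```

Note that the integer codes carry no order or magnitude: "orange" is not "greater than" "banana" just because its code is larger.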
12
Check point
Note: You can select more than one.
Question 1
Which of the following is best represented
as categorical data?
o Volvo, Citroen, Honda, Toyota
o 1.02, 0.30, 0.03, 12.10
o The length of a swimming pool
o A person's age in months

13
… Check point
Question 2
Which of the following might best be
represented by continuous numeric data?
o Types of fruit
o Cost of coffee
o A, B, C, D, E
o Height of a bird in flight
o 1.03, 2.01, 13/2, 19, 101.10, 1/3

14
Data science
Data science is typically a “concept to unify statistics, data
analysis, and their related methods” to understand and
analyze the actual phenomena with data, which can be a
discovery, prediction, service, suggestion, insight into
decision-making, thought, model, paradigm, tool, or system.
The popularity of “data science” is increasing day by day,
as shown in the next figure of Google Trends data over
the last 5 years.

15
In the figure, the average popularity score is 71 for machine
learning, 60 for data science, 30 for data analytics, and 12
for data mining. This shows how popular data science has
become, and that machine learning, a recent advanced
analytics technology used in data science, is even more
popular.
16
Data science
Usually, data science is the field of applying
advanced analytics methods and scientific
concepts to derive useful business information
from data.
Advanced analytics is a step forward in offering
a deeper understanding of data and helping to
analyze granular data, while basic analytics
offers a description of data in general.

17
Data science: Types of analytics
In the field of data science, several types of analytics are popular:
i. "Descriptive analytics", which answers the question of what happened;
ii. "Diagnostic analytics", which answers the question of why it happened;
iii. "Predictive analytics", which predicts what will happen in the future; and
iv. "Prescriptive analytics", which prescribes what action should be taken.

Although the area of “data science” is huge, the main focus is on
deriving useful insights through advanced analytics,
where the results are used to make smart decisions in various real-world
application areas.

18
…. Types of analytics

19
Data science
Neural network, or deep learning, analysis can provide
deeper knowledge about data, and thus can be used
to develop data-driven intelligent applications.
More specifically, regression analysis, classification,
clustering analysis, association rules, time-series
analysis, sentiment analysis, behavioral patterns,
anomaly detection, factor analysis, log analysis, and
deep learning, which originated from the artificial
neural network, are playing a major role.

20
Data science related terms
“Data analysis” refers to the processing of data by
conventional (e.g., classic statistical, empirical,
or logical) theories, technologies, and tools for
extracting useful information and for practical
purposes.
“Data analytics”, on the other hand, refers to the
theories, technologies, instruments, and
processes that allow for an in-depth
understanding and exploration of actionable data
insight. Statistical and mathematical analysis of
the data is the major concern in this process.

21
Data science related terms
“Data mining” is also referred to as knowledge mining
from data, knowledge extraction, knowledge
discovery from data (KDD), data/pattern analysis,
data archaeology, and data dredging. It is the
process of discovering interesting patterns and
knowledge from large amounts of data.
“Big data” is data that is “massive, high dimensional,
heterogeneous, complex, unstructured,
incomplete, noisy, and erroneous”. Several
unique features including volume, velocity,
variety, veracity, value (5Vs), and complexity are
used to understand and describe big data.
22
Data science related terms
“Machine learning”, a branch of artificial intelligence (AI), is
one of the major techniques used in advanced analytics, and it
can automate analytical model building. It is based on the
premise that systems can learn from data, recognize trends,
and make decisions with minimal human involvement.
“Deep learning” is a subfield of machine learning that
deals with algorithms inspired by the human brain’s structure
and function, called artificial neural networks.

23
…… Data science
Unlike the above data-related terms, “data science” is an
umbrella term that encompasses advanced data analytics, data
mining, machine learning and deep learning modeling, and several other
related disciplines like statistics, to extract insights or useful
knowledge from datasets and transform them into
actionable business strategies.
From the disciplinary perspective, data science can be defined as
“a new interdisciplinary field that synthesizes and builds on
statistics, informatics, computing, communication,
management, and sociology to study data and its environments
to transform data to insights and decisions by following a
data-to-knowledge-to-wisdom thinking and methodology.”

24
Data science
How can data science play a significant role in the real-world
business process?

25
How data science can play a significant role …
Understanding business problems: getting a clear
understanding of the problem. To understand and identify the
business problems, data scientists formulate relevant
questions while working with the end-users and other
stakeholders.
Understanding data: real-world data sets are often noisy,
have missing values or inconsistencies, or have other data issues
that need to be handled effectively. Finding out what data is available
and how it aligns with the business problem could be the first
step in data understanding, followed by deciding what data is needed
and the best ways to acquire it.

26
How data science can play a significant role …
Data pre-processing and exploration: examines a broad data collection to
discover initial trends, attributes, points of interest, etc. in an unstructured
manner to construct meaningful summaries of the data. This includes visualizing and
interpreting the data through graphical representations such as a chart, plot, or
histogram, and using data summarization and visualization to audit the quality of
the data.
Machine learning modeling and evaluation: Once the data is prepared for
building the model, data scientists design a model, algorithm, or set of
models to address the business problem. Model building depends on
the type of analytics required. Data scientists typically separate the given
dataset into training and test subsets, usually in the ratio of 80:20. This is to
observe whether the model performs well on data it was not trained on, and to maximize
the model performance. (Model validation and assessment metrics: error
rate, accuracy, true positive, false positive, true negative, false negative,
precision, recall, f-score, ROC (receiver operating characteristic curve)
analysis, applicability analysis.)
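To make the 80:20 split and a few of the listed metrics concrete, here is a minimal sketch using only the Python standard library. The function names `train_test_split` and `binary_metrics`, the fixed seed, and the toy labels are assumptions made for illustration; real projects usually rely on a library for this.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F-score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

# 100 hypothetical records split 80:20.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20

# Hypothetical true labels and model predictions on a small test set.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so accuracy, precision, recall and F-score all come out as 0.75.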
27
More on ML
“Advanced analytics” can be defined as the autonomous or
semi-autonomous analysis of data or content using
advanced techniques and methods to discover deeper
insights, make predictions, or produce recommendations,
where machine learning-based analytical modeling is
considered the key technology in the area.
A wide range of methods such as regression and classification
analysis, association rule analysis, time-series analysis,
behavioral analysis, log analysis, and so on can be applied.

28
ML
Fig. A general structure of a machine learning based predictive model.

29
… Machine Learning
Machine learning is the term used to describe a
series of processes in which a computer learns
from evidence, or from lots of examples of
data, to help it perform certain data-based tasks.
Common to all machine learning algorithms is a
training step. Training is where the computer
learns something about the world or a particular
problem, based on data drawn from that world.

30
… Machine Learning
Training may allow the computer to build some
internal representation, or model, about that world.
Alternatively, the computer can be trained to
search for patterns in the data to help structure
the world.
The outcome of training is an algorithm that can
be used for a variety of tasks such as predicting
future events, automatically recognizing objects,
or structuring data in a manageable way.

31
Examples of machine learning
Example 1-(regression)
Machine learning can be used to predict the future.
Take earthquake prediction. This has important
applications in predicting natural disasters and
helping people plan to minimize the impact.
When trained on information such as time, location,
and magnitude of historical earthquakes in a region,
a machine learning algorithm may be used by
geologists to work out the probability of another
earthquake occurring at a certain time and place in
the future. This sort of machine learning is known
as regression.
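As a rough illustration of regression (not of earthquake modelling itself), the sketch below fits a straight line to a handful of made-up points by ordinary least squares and uses it to predict a continuous value for an unseen input. The function name `fit_line` and the numbers are assumptions chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data that happens to lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict for a new, unseen x = 5
print(slope, intercept, prediction)   # 2.0 1.0 11.0
```

The key point is that the output is a number on a continuous scale, not a label.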
32
Example 2-(classification)
Machine learning can also be used to recognize and
classify previously unseen objects.
Given a large database of possible road signs, a
machine learning algorithm in a self-driving car may
be able to correctly recognize a stop sign in an
unfamiliar location or recognize a stop sign that is
drawn slightly differently from the others it's seen.
This information can be used to direct the car to
take appropriate action, such as stopping. This sort of
machine learning where the output is a discrete
label, a stop sign, is known as classification.
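A very small sketch of classification is given below. It uses a 1-nearest-neighbour rule, which is just one simple classifier chosen here for illustration and is not claimed to be what a self-driving car uses. The two features (number of sides, amount of red) and all values are hypothetical.

```python
import math

def nearest_neighbor_label(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], query))
    return train_labels[best]

# Hypothetical labelled examples: (number of sides, fraction of red pixels).
train_points = [(8, 0.90), (8, 0.95), (3, 0.70), (3, 0.60), (4, 0.10)]
train_labels = ["stop", "stop", "yield", "yield", "info"]

# Previously unseen signs, drawn slightly differently from the training ones.
label_a = nearest_neighbor_label(train_points, train_labels, (8, 0.70))
label_b = nearest_neighbor_label(train_points, train_labels, (3, 0.65))
print(label_a, label_b)  # stop yield
```

Unlike regression, the output here is one of a fixed set of discrete labels.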
33
Example 3-(Clustering)
Machine learning can also be used to structure lots of
data into manageable chunks or clusters.
Information from thousands of people's shopping
habits for example, can be used to link some groups
of those people into specific clusters for more
targeted marketing. Clustering is an example of
unsupervised learning. It allows us to find patterns of
behavior in data, based on similarities in that data.

34
The same approach may make use of data drawn
from social media posts to automatically cluster
people into groups of similar political affiliation, or
use sequences of information from samples of DNA
to cluster people into groups with similar genetics.
In summary, machine learning is really just about
building algorithms that can help a computer to
learn from a body of data so that it can make sense of,
or make predictions about, new and previously
unseen data.
Simply put, machine learning is when a machine learns
from examples.

35
1.3. Supervised vs. Unsupervised Learning
Machine learning algorithms fall into two
categories,
i. supervised learning and
ii. unsupervised learning.
What do I mean by supervised?
Consider a parent teaching an infant the difference
between dogs and cats. Every time the child sees a
dog, the parent will point out "dog", and similarly
for a cat.

36
…. i. supervised learning
All that is needed are lots of examples of data:
images of animals that have been pre-labelled
as either cat or dog by the child's supervisor, the
parent.
Supervised learning is essentially this:
providing the computer with lots of training
data (images of dogs and cats in this example)
alongside the class labels that we have
assigned (either dog or cat).

37
…. i. supervised learning
When completely new data is then presented, the
trained algorithm can infer the most likely
corresponding class or value. The trained algorithm
will make a decision on whether it's a dog or
a cat. This is an example of classification.
Alternatively, the algorithm can give some
measure indicating the degree of dogishness or
catishness. This is an example of regression, where a
continuous-valued output or a probability is given.

38
ii. unsupervised learning.
Unsupervised learning is when we don't know beforehand the
structure of the data.
We don't use any labels. In our child learning example, the
parent never tells the child what it is they're looking at.
The child may be exposed to lots of dogs and cats, but they
have to learn the similarities and differences between
examples of the creatures based on some other criteria. Over
time, the child may well learn the traditional dichotomy of dog
versus cat. But equally, they could form a different clustering,
for example, furry versus non-furry. They might
not even be limited to just two classes or clusters. Perhaps,
the data suggests that three or more clusters may be
appropriate.

39
… ii. unsupervised learning.
For example, if the similarity criterion was color of fur.
This division of data into groups based on some
measure of similarity is why this type of unsupervised
learning is referred to as data clustering.
Data clustering is a very powerful tool and
exemplifies many of the most important aspects of
machine learning and data science.
In this course, we will explore clustering in greater
detail, and in particular, the most common and useful
clustering algorithm, K-means clustering.

40
Check point
Select all that apply (Note: you can select more
than one)
Question 1
Which of the following are commonly
associated with supervised learning?
o classification
o clustering
o regression

41
Check point
Question 2
Which of the following is true:
o Clustering organizes data using pre-selected
labelling information.
o Regression is a supervised method for modelling
and predicting continuous valued data.
o Classification is the process of making a
decision based on data, and returning a
categorical or discrete output.

42
Neural Networks and Deep Learning
Deep learning is a form of machine learning that uses artificial
neural networks to create a computational architecture that learns
from data by combining multiple processing layers, such as the
input, hidden, and output layers.
The key benefit of deep learning over conventional machine
learning methods is that it performs better in a variety of
situations, particularly when learning from large datasets.
The most common deep learning algorithms are: multilayer
perceptron (MLP), convolutional neural network (CNN or ConvNet), and
long short-term memory recurrent neural network (LSTM-RNN).

43
Neural Networks …..
Fig. An artificial neural network modeling with multiple processing layers.

44
How data science can play a significant role
Data product and automation: a data product is typically the output of any data
science activity. A data product, or data-enabled product or guide,
can be a discovery, prediction, service, suggestion, insight into
decision-making, thought, model, paradigm, tool, application, or
system that processes data and generates results. Businesses can use
the results of such data analysis.

45
K-Means
loading ……………………………….

46
K-Means …..

47
1.4. K-Means Clustering
Data clustering is a method of unsupervised
machine learning, where data is separated into
groups or clusters based on some similarity
measure.
K-means clustering is probably the most common
example of data separation into groups or clusters
based on some similarity measure.
To show how K-means clustering works, let's start
with an example.

48
… k-Means clustering (Example)

• Imagine a one-dimensional axis, a line representing income. Each dot
shown on this line represents a population of people with that level of
income.
• Say we want to uncover some pattern in this data. Specifically, can
we find out something about the relationships between these people
based only on their level of income? Using traditional statistics, we
can compute something like the overall average income.

49
But the average income information fails to capture the
fact that there are clearly two groupings of income here,
which we might label as wealthy and everyone else.

The K-means algorithm is pretty good at finding such
patterns without us having to tell it beforehand.

50
steps of K-means
K-means has five basic steps and works as follows.
Step one.
First, we select the number of clusters we want to look for.
This is the k in K-means. Here we choose k equals two. The
algorithm then randomly selects k points on our data axis.
Note that it doesn't matter that these points do not
necessarily correspond to existing data. These points are
called our data centers or centroids.
Step two.
The distance from each data point to each of our k
centroids is calculated. In this case, distance is simply a
measure of the difference in income between the points.
51
Step three.
Clusters are formed by assigning each data point to either
centroid one or two, depending on which is closest.
Step four.
This is the update step. The average value calculated over the
members of each cluster is then set as the new centroid. We
ignore and dispense with the previous centroid value.
Finally, step five.
We then repeatedly run steps two to four, recalculating the
centroids until eventually the centroid positions do not change.
When the centroids remain stable like this, the algorithm is
said to have converged. In our example, we were able to
discover two clusters. But if we were to set k equal to three,
looking for three clusters, we could find another pattern in the
data, which we could then map to wealthy, average, and poor.
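The five steps above can be sketched in a few lines of Python for the one-dimensional income example. This is a minimal illustration under stated assumptions: the function name `kmeans_1d`, the fixed random seed, the iteration cap, and the income values (in thousands) are all made up for the example.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster one-dimensional values into k groups with basic K-means."""
    rng = random.Random(seed)
    # Step 1: choose k random starting centroids somewhere on the data axis.
    lo, hi = min(values), max(values)
    centroids = [rng.uniform(lo, hi) for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Step 2: distance from this point to every centroid.
            distances = [abs(v - c) for c in centroids]
            # Step 3: assign the point to the closest centroid.
            clusters[distances.index(min(distances))].append(v)
        # Step 4: the mean of each cluster becomes its new centroid
        # (an empty cluster keeps its old centroid).
        new_centroids = [sum(m) / len(m) if m else centroids[i]
                         for i, m in enumerate(clusters)]
        # Step 5: repeat until the centroids stop moving (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical incomes (in thousands): a large group and a wealthy group.
incomes = [20, 22, 25, 27, 30, 200, 210, 220]
centroids, clusters = kmeans_1d(incomes, k=2)
print(sorted(centroids))  # [24.8, 210.0]
```

The overall average income here is 94.25, a value that describes nobody in the data, while the two centroids describe the two groups well.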

52
K-means steps
Most useful cases involve more than one
dimension or feature.
The same basic principle can be applied to two dimensions.
The distance measure between points here might be a simple
Euclidean distance.
It turns out that K-means can be applied to any number of
dimensions, provided there is sufficient data to train the
algorithm.
K-means converges to what is known as a local minimum.
This basically means that although the algorithm seems to
have found the best groupings, a better result may yet be
found if the algorithm were started again with different
initial centroid positions. It turns out that the selection of
initial centroids in step one is crucial to finding a good
solution.
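One common way to reduce the effect of a poor starting position is to run K-means several times from different random starts and keep the run with the lowest total squared distance to the centroids. The sketch below shows this in two dimensions with Euclidean distance. The function names, the choice to start from randomly picked data points, the number of restarts, and the sample points are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Run K-means once from a random start; return (centroids, cost)."""
    # Assumption: start from k data points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (Euclidean distance).
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # Mean of each coordinate over the cluster members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    # Cost: total squared distance from each point to its nearest centroid.
    cost = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, cost

def kmeans_with_restarts(points, k, n_restarts=20, seed=0):
    """Run K-means n_restarts times and keep the lowest-cost result."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical 2-D data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
single_centroids, single_cost = kmeans(points, 3, random.Random(0))
best_centroids, best_cost = kmeans_with_restarts(points, 3, n_restarts=20, seed=0)
print(single_cost, best_cost)
```

The best-of-many result can never be worse than the first single run, because that run is one of the candidates; it is still not guaranteed to be the global optimum.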

53
standardization
Having good data to begin with is crucial to the
success of any data science analysis.
If the data going in is bad, then the algorithms won't
work as well as you'd like.
Much of the work carried out by data scientists is
spent cleaning and adjusting the data to make it
usable.
For K-means, it's particularly important that the data
used is comparable across different features.
Continuous-valued data such as income, times, and
weights can be compared using the Euclidean distance.
54
… standardization
However, when there are two or more features with
different ranges, for example, income levels
between 1,000 and a million, and weights from 20 kilograms
to 150 kilograms, it's important to scale the data so that
the features are comparable. Usually, this means
adjusting all values to fall between zero and one. This scaling
of the data is sometimes referred to as standardization
(rescaling to the range zero to one is also known as min-max normalization).
Categorical feature data like oranges, apples, or cat, dog, are
not as easily handled by K-means. If the categories fall on
some kind of scale, like very dog, slightly dog, less dog,
these may be converted into a number range like 1, 0.8, 0.6,
0.4 and K-means can then be used.
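A minimal sketch of rescaling features to the range zero to one is shown below. The function name `min_max_scale` and the sample incomes and weights are illustrative assumptions; each feature is scaled separately using its own minimum and maximum.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features with very different ranges.
incomes = [1_000, 250_000, 1_000_000]
weights_kg = [20, 85, 150]

scaled_incomes = min_max_scale(incomes)
scaled_weights = min_max_scale(weights_kg)
print(scaled_incomes)
print(scaled_weights)  # [0.0, 0.5, 1.0]
```

Without this step, a Euclidean distance computed on the raw values would be dominated almost entirely by income, because its numbers are thousands of times larger than the weights.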

55
Check point
Question 1
These are the final cluster assignments from a
run of K-means on 12 data points (the red
marks). What value of K was used in the
algorithm?

56
Question 2
K-means clustering is run on the data shown below.
During the first round of the algorithm, the cluster
centroids are placed in the positions indicated by the
large blue and green crosses.

o (a,b,c,d) and (e,f,g)
o (a,b,c) and (d,e,f,g)
o (a,b,c) and (d) and (e,f,g)
o (a,b) and (c,d) and (e,f) and (g)

57
1.5. Real world data set
A publicly available real-world dataset, based on
two sources of information: the World Bank income
inequality index and the Gallup poll of happiness,
covering over 120 countries.

Assignment (Group)
List some real-world datasets available to data
scientists, along with where they can be found.

58
Assignment #1 - 5%, individual
(select all that apply)
Question 1
The K-means algorithm is an example of:
o unsupervised learning
o supervised learning
o data clustering
o classification

59
Question 2
The "k" in K-means represents:
o the number of data points in a cluster
o the number of clusters to find in a dataset
o none of the above
o the number of steps in the K-means algorithm
Question 3
Before running K-means, data scaling is applied
to the data in order to:
o standardize features to make them more
comparable.
o make the plots look nicer.
o remove noisy data.

60
Question 4 (select the two that apply)
Supervised machine learning includes:
o clustering
o regression
o none of the above
o classification
Question 5
K-means is run on a dataset. One of the clusters
contains 6 items of data with the following values: 2,
3, 2, 3, 1, 1. What number corresponds to the data
center, or centroid, of this cluster? (Answer this with
justification)

61
End of Chapter One

62
