Machine
Learning
Source: Introduction to Machine Learning with Python
Authors: Andreas C. Müller and Sarah Guido
Unit - I
Introduction to Machine
Learning with Python
Agenda
Introduction
Why Machine Learning?
Problems Machine Learning
Can Solve
Knowing Your Task and
Knowing Your Data
Agenda
Why Python?
Scikit-learn
Essential Libraries and Tools
A First Application:
Classifying Iris Species
Introduction
Introduction to Machine
learning
What is ML?
§ “Machine learning enables a machine to automatically
learn from data, improve performance from
experiences, and predict things without being
explicitly programmed”
§ Machine learning is a subset of artificial
intelligence
§ Concerned with the development of algorithms that allow
a computer to learn on its own from data and past
experience
§ The term was coined by Arthur Samuel in 1959
§ With the help of sample historical data, which is known as
training data, machine learning algorithms build a
mathematical model that helps in making predictions
or decisions without being explicitly programmed
Why ML?
§ The need for machine learning is increasing
day by day
§ Machine learning can perform tasks that are too
complex for a person to implement directly
§ Humans cannot process huge amounts of data
manually, so we need computer systems; machine
learning makes this easier for us
§ With the help of machine learning, we can save
both time and money
Why ML?
§ Rapid growth in the production of data
§ Solving complex problems that are difficult for
a human
§ Decision making in various sectors, including
finance
§ Finding hidden patterns and extracting useful
information from data
History
§ About 40-50 years ago, machine learning was science
fiction
§ But today ML is part of our lives
§ However, the idea behind machine learning is
old and has a long history
History
§ Machine intelligence in games -
§ 1952: Arthur Samuel, a pioneer of machine learning, created a program that
enabled an IBM computer to play checkers. It performed better the more it
played.
§ 1959: The term "Machine Learning" was coined by Arthur Samuel.
§ The first "AI winter" -
§ The period from 1974 to 1980 was a tough time for AI and ML researchers and
became known as the AI winter.
§ During this period, machine translation failed to deliver, interest in AI
declined, and governments reduced their funding for research.
§ Machine Learning from theory to reality -
§ 1959: The first neural network was applied to a real-world problem: removing
echoes over phone lines using an adaptive filter.
§ 1985: Terry Sejnowski and Charles Rosenberg invented the neural
network NETtalk, which was able to teach itself how to correctly pronounce
20,000 words in one week.
§ 1997: IBM's Deep Blue computer won a chess match against world champion
Garry Kasparov, becoming the first computer to beat a reigning world chess
champion in a match.
History
§ Machine Learning in the 21st century -
§ 2006: Computer scientist Geoffrey Hinton gave neural net
research a new name, "deep learning", which has since
become one of the most prominent technologies
§ 2012: Google created a deep neural network which
learned to recognize images of humans and cats in YouTube
videos.
§ 2014: The chatbot "Eugene Goostman" was reported to have passed the
Turing Test. It convinced 33% of the human judges
that it was not a machine.
§ 2014: DeepFace was a deep neural network created by Facebook,
which claimed that it could recognize a person with the same
precision as a human.
§ 2016: AlphaGo beat Lee Sedol, one of the world's top-ranked players, at
the game of Go. In 2017 it beat the number one player, Ke Jie.
§ 2017: Alphabet's Jigsaw team built an intelligent system
that learned to identify online trolling. It read millions of
comments from different websites to learn to stop online trolling
Block
Diagram of
ML
§ Machine Learning system learns from historical data,
builds the prediction models, and whenever it receives
new data, predicts the output for it
§ Machine learning has changed our way of thinking about
the problem
§ Block Diagram of ML -
Prerequisites
§ Mathematics -
§ Linear Algebra
§ Probability
§ Calculus
§ Derivatives
§ Single variable and multivariate functions
§ Programming Languages -
§ Python
§ R
§ C++
§ Java
§ JavaScript
§ MATLAB
Applications
of ML
Machine
Learning
Life-Cycle
Introduction
§ Machine learning is about extracting knowledge from data
§ It is a research field at the intersection of
§ Statistics
§ Mathematics
§ Artificial intelligence
§ Computer science
§ Also known as predictive analytics or statistical
learning
§ Ubiquitous in everyday life
§ Examples –
§ From automatic recommendations of which movies to watch
§ To what food to order
§ Which products to buy
§ To personalized online radio
§ Recognizing your friends in your photos
§ Many modern websites and devices have machine learning
algorithms at their core
Introduction
§ Outside of commercial applications, machine
learning has had a tremendous influence on the
way data-driven research is done today
§ Diverse scientific problems such as
§ Understanding stars
§ Finding distant planets
§ Discovering new particles
§ Analyzing DNA sequences
§ Providing personalized cancer treatments
Why
Machine
Learning
Hand coded Rules
advantages
Hand coded Rules
disadvantages
Hand coded Rules Example
Why
Machine
Learning?
Why Machine Learning?
§ Many systems used hand coded rules of “if” and “else”
decisions to process data or adjust to user input.
§ Example –
§ Spam filter -->Blacklist of words
§ Possible with hand coded rules
§ Hand coded rules advantages
§ Expert-designed rule system
§ Manually crafting decision rules is feasible --> for
some applications --> in which humans have a good
understanding of the process to model
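The blacklist idea above can be sketched as a small hand-coded rule. This is only an illustration: the word list and the helper name is_spam are made up for this example and are not from the slides.

```python
# Hypothetical blacklist; a real filter would need a much larger, curated list.
BLACKLIST = {"lottery", "winner", "prize"}


def is_spam(message):
    """Flag a message as spam if it contains any blacklisted word."""
    words = message.lower().split()
    # Strip common punctuation so "WINNER!" still matches "winner".
    return any(word.strip(".,!?") in BLACKLIST for word in words)


print(is_spam("You are a WINNER! Claim your prize now"))
print(is_spam("Meeting moved to 3 pm"))
```

A rule like this is easy to write and inspect, but every new spam pattern needs a manual change to the list, which is the limitation that motivates learning the rule from data.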
Why
Machine
Learning
§ Disadvantages of Hand coded rules
§ The logic required to make a decision is specific
to a single domain and task --> Changing the task
even slightly might require a rewrite of the whole
system
§ Designing rules requires a deep understanding of
how a decision should be made by a human expert
Why
Machine
Learning
§ Example - Face recognition
§ Unsolved problem until as recently as 2001
§ The way pixels are perceived by the computer is very
different from how humans perceive a face
§ Not possible with hand coded rules
§ Basically impossible for a human to come up with a
good set of rules to describe what constitutes a face
in a digital image.
§ Hence, the need for ML
Problems
Machine
Learning Can
Solve
Supervised learning
Disadvantages of supervised
learning
Example of supervised
learning
Problems
Machine
Learning Can
Solve
Unsupervised learning
Disadvantages of
unsupervised learning
Example of unsupervised
learning
Problems
Machine
Learning Can
Solve
Supervised learning
§ Decision-making processes by generalizing from known
examples
§ Supervised learning -
§ The user provides the algorithm with pairs of desired inputs
and outputs, and the algorithm finds a way to produce the
desired output given an input
§ Example –
§ Spam classification
§ User provides the algorithm with a large number of emails
(which are the input), together with information about whether
any of these emails are spam (which is the desired output).
§ Given a new email, the algorithm will then produce a prediction
as to whether the new email is spam.
Problems
Machine
Learning Can
Solve
§ Supervised Learning
§ Machine learning algorithms that learn from input/output pairs are
called supervised learning algorithms.
§ Disadvantages of supervised learning
§ creating a dataset of inputs and outputs is often a laborious manual
process
§ Examples of supervised machine learning tasks
include:
§ Identifying the zip code from handwritten digits on an envelope
§ Determining whether a tumor is benign based on a medical image
§ Detecting fraudulent activity in credit card transactions
Problems
Machine
Learning Can
Solve
§ Unsupervised algorithms
§ only the input data is known, and no known output data is given to the
algorithm
§ Disadvantages of Unsupervised learning
§ Usually harder to understand and evaluate
§ Example –
§ Identifying topics in a set of blog posts [Information Analysis]
§ Large collection of text data --> Summarize it --> Find prevalent themes in it
§ Might not know beforehand what these topics are, or how many topics there might be
§ Segmenting customers into groups with similar preferences
[Segmentation/Clustering]
§ Given a set of customer records --> Identify which customers are similar --> Group
customers with similar preferences
§ Shopping site --> Group into bookworms, gamers, parents, etc.
§ Detecting abnormal access patterns to a website
§ To find abuse or bugs, it is often helpful to find access patterns that are different
from the norm.
Important
Points
Note 1:
§ No machine learning algorithm will be able to make a prediction on
data for which it has no information
§ Example -
§ Can we predict gender using last name as the only feature?
Note 2:
§ For both supervised and unsupervised learning tasks representation of
input data that a computer can understand is important
§ Think of your data as a table
§ Row -
§ Each data point that you want to reason about is a row
§ Each entity or row is known as a sample
§ Column -
§ Each property that describes that data point is a column
§ Columns are also called features
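The table view described above maps directly onto a two-dimensional NumPy array. The following is a minimal sketch with made-up numbers, where rows are samples and columns are features.

```python
import numpy as np

# Made-up measurements: each row is one sample, each column is one feature.
X = np.array([
    [5.1, 3.5],
    [6.2, 2.9],
    [4.7, 3.2],
])

# The shape of the array is (number of samples, number of features).
n_samples, n_features = X.shape
print("samples:", n_samples, "features:", n_features)
```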
Knowing Your
Task and
Knowing Your
Data
§ The most important part in the machine learning process
is understanding the data you are working with and how
data relates to the task you want to solve
§ Never throw your data at a randomly chosen ML
algorithm
§ Very important to understand what is in your dataset
before you begin building a model
§ Each algorithm is different in terms of what kind of data
and what problem setting it works best for
Before building an ML model, always have an answer to
the following questions
§ What question(s) am I trying to answer? Do I think the
data collected can answer that question?
§ What is the best way to phrase my question(s) as a
machine learning problem?
§ Have I collected enough data to represent the problem
I want to solve?
§ What features of the data did I extract, and will these
enable the right predictions?
§ How will I measure success in my application?
§ How will the machine learning solution interact with
other parts of my research or business product?
§ Many people spend a lot of time building complex
machine learning solutions, only to find out they don’t
solve the right problem
Why Python
Why Python
§ Python has become the lingua franca for many data
science applications
§ Combines the power of general-purpose programming
languages with the ease of use of domain-specific
scripting languages
§ Python has libraries for –
§ Data loading
§ Visualization
§ Statistics
§ Natural language processing
§ Image processing
§ This vast toolbox provides data scientists with a large
array of general- and special-purpose functionality
Why Python
§ One of the main advantages of using Python is
§ the ability to interact directly with the code, using a
terminal or other tools like the Jupyter Notebook
§ Machine learning and data analysis are fundamentally
iterative processes, in which the data drives the analysis
§ Python has tools that allow quick iteration and easy
interaction
§ As a general-purpose programming language, Python also
allows for the
§ Creation of complex graphical user interfaces (GUIs)
§ Web services
§ Integration into existing systems
scikit-learn
§ Open source project
§ Free to use and distribute, and anyone can easily obtain the
source code to see what is going on behind the scenes
§ Constantly being developed and improved
§ Very active user community
§ It contains a number of state-of-the-art machine
learning algorithms, as well as comprehensive
documentation about each algorithm
§ Scikit-learn is a very popular tool, and the most prominent
Python library for machine learning
§ It is widely used in industry and academia
§ It has a wealth of tutorials
§ It has a number of code snippets available online
Scikit -learn
Scikit-learn depends on two other Python packages
§ NumPy
§ SciPy
§ For plotting and interactive development we should install
§ Matplotlib
§ IPython
§ Jupyter Notebook
Prepackaged
Python
Distributions
Anaconda
Enthought Canopy
Python(x,y)
Anaconda
Anaconda -
§ A Python distribution made for –
§ large-scale data processing
§ predictive analytics
§ scientific computing
§ Anaconda comes with -
§ NumPy
§ SciPy
§ Matplotlib
§ Pandas
§ IPython
§ Jupyter Notebook
§ scikit-learn
§ Available on
§ Mac OS
§ Windows
§ Linux
§ Includes the commercial Intel MKL library for free -
§ this gives significant speed improvements for many algorithms in scikit-learn
Enthought
Canopy
§ Another Python distribution for scientific computing
§ Enthought Canopy comes with -
§ NumPy
§ SciPy
§ Matplotlib
§ Pandas
§ IPython
§ Free version does not come with scikit-learn
§ Academic, degree-granting institutions can request an
academic license and get free access to the paid
subscription version of Enthought Canopy
Python(x,y)
Python(x,y)
§ Free Python distribution for scientific computing
§ Python(x,y) comes with -
§ NumPy
§ SciPy
§ Matplotlib
§ Pandas
§ IPython
§ scikit-learn
§ If you already have a Python installation, you can use pip to install all of
these packages: pip install numpy scipy matplotlib ipython scikit-learn pandas
Essential
Libraries
and Tools
Jupyter Notebook
Numpy
Scipy
Matplotlib
Pandas
Print versions
Jupyter
Notebooks
§ Jupyter Notebook
§ Jupyter notebook is an interactive environment for
running code in the browser.
§ Great tool for exploratory data analysis
§ Widely used by data scientists
§ The Jupyter Notebook makes it easy to incorporate code,
text, and images
numpy
§ NumPy (Numpy arrays)
§ NumPy is one of the fundamental packages for scientific computing in Python
§ It contains functionality for –
§ Multidimensional arrays
§ High-level mathematical functions such as linear algebra operations
§ The Fourier transform
§ Pseudorandom number generators
§ In scikit-learn, the NumPy array is the fundamental data structure
§ Any data you’re using will have to be converted to a NumPy array
§ All elements of a NumPy array must be of the same type
§ Example –
§ Output :
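The example on the original slide is not preserved in this transcript. A minimal sketch of creating and printing a two-dimensional NumPy array might look like this (the variable name x is illustrative):

```python
import numpy as np

# A 2D array with two rows and three columns; all elements share one type.
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
print("shape:", x.shape)
```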
SciPy
SciPy
§ Collection of functions for scientific computing in Python
§ Other functionality
§ Advanced linear algebra routines (scipy.linalg)
§ Mathematical function optimization (scipy.optimize)
§ Signal processing (scipy.signal)
§ Special mathematical functions (scipy.special)
§ Statistical distributions (scipy.stats)
§ Scikit-learn draws from SciPy’s collection of functions for
implementing its algorithms
scipy
scipy.sparse
§ The most important part of SciPy for us is scipy.sparse
§ this provides sparse matrices, which are another
representation that is used for data in scikit-learn.
§ Sparse matrices are used whenever we want to store a 2D
array that contains mostly zeros
§ Example 1-
Output :
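The slide's example image is missing here. As a starting point for the sparse-matrix discussion, a dense array that contains mostly zeros can be sketched as follows (the 4x4 identity matrix is an assumed example):

```python
import numpy as np

# A 2D array with ones on the diagonal and zeros everywhere else.
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
```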
scipy
CSR format - Compressed Sparse Row
§ Example 2-
Output :
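The original example is not preserved. A sketch of converting a dense NumPy array to CSR format, assuming the 4x4 identity matrix as input, could be:

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)

# Convert the dense array to a SciPy sparse matrix in CSR format.
# Only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy sparse CSR matrix:\n{}".format(sparse_matrix))
```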
Essential
Libraries
and Tools
COO format - Coordinate format
§ Example 3-
Output :
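The original example is missing. Building a dense array first is often impractical for large data, so a sparse matrix can be created directly. A sketch using the COO format, again assuming a 4x4 identity matrix:

```python
import numpy as np
from scipy import sparse

# COO stores each nonzero value together with its row and column index.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))
```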
Matplotlib
Matplotlib
§ Primary scientific plotting library in Python
§ Provides functions for making publication-quality visualizations
such as
§ Line charts
§ Histograms
§ Scatter plots
§ We use Matplotlib for all our visualizations
§ Visualizing data helps us understand different
aspects of the data, which makes the data easier
to analyze
§ Inside the Jupyter Notebook, you can show figures directly in the
browser by using the %matplotlib notebook and %matplotlib
inline commands
Matplotlib
§ Example 1 –
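The plot from the original slide is not preserved. A minimal line-chart sketch, assuming a sine curve as the data, might be:

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points between -10 and 10.
x = np.linspace(-10, 10, 100)
# A second array holding the sine of each point.
y = np.sin(x)
# Draw one array against the other as a line chart with "x" markers.
lines = plt.plot(x, y, marker="x")
```

In a notebook with %matplotlib inline the figure renders automatically; in a plain script, call plt.show() afterwards.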
Pandas
Pandas
§ Python library for data wrangling and analysis
§ It is built around a data structure called the DataFrame
§ Pandas DataFrame is a table, similar to an Excel
spreadsheet
§ Pandas provides a great range of methods to modify
and operate on this table
§ Allows SQL-like queries and joins of tables
§ pandas allows each column to have a separate type
§ Integers
§ Dates
§ Floating-point numbers
§ Strings
Pandas
§ Pandas has the ability to ingest from a great variety of
file formats and databases, like -
§ SQL
§ Excel files
§ comma-separated values (CSV) files
§ Example 1-
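The slide's example is missing from the transcript. A small sketch of building a DataFrame from a dictionary and querying it follows; the names, locations, and ages are made-up illustration data.

```python
import pandas as pd

# Made-up data: each key becomes a column, and columns can have different types.
data = {
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Age": [24, 13, 53, 33],
}
data_pandas = pd.DataFrame(data)
print(data_pandas)

# Select all rows whose Age column is greater than 30.
over_30 = data_pandas[data_pandas.Age > 30]
print(over_30)
```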
Import
§ Import various libraries
§ import numpy as np
§ import matplotlib.pyplot as plt
§ import pandas as pd
§ import mglearn
§ from IPython.display import display
§ Print versions –
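The version-printing code is not preserved in the transcript. One way to do it, assuming the core packages are installed (IPython and mglearn are left out of this sketch because they may not be present everywhere):

```python
import sys

import matplotlib
import numpy as np
import pandas as pd
import scipy as sp
import sklearn

# Collect the version string of each library in one place.
versions = {
    "Python": sys.version,
    "NumPy": np.__version__,
    "SciPy": sp.__version__,
    "pandas": pd.__version__,
    "matplotlib": matplotlib.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print("{} version: {}".format(name, version))
```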
A First
Application:
Classifying
Iris Species
Meet the Data
Measuring success: Training and
Testing Data
Scatter plot
Pair plot
K Nearest Neighbors
Making Predictions
Evaluating the model
A First
Application:
Classifying
Iris Species
§ A hobby botanist is interested in distinguishing the species
of some iris flowers that she has found.
§ She has collected some measurements associated with
each iris:
§ The length and width of the petals
§ And the length and width of the sepals,
§ All measured in centimeters
§ Each iris belongs to one of three species: setosa,
versicolor, or virginica
§ The goal is to build a machine learning model that can
learn from the measurements of these irises whose species
is known, so that we can predict the species for a new iris.
A First
Application:
Classifying
Iris Species
§ We have measurements for which we know the correct species of
iris
§ This is a supervised learning problem
§ This is an example of a classification problem
§ Different species of irises are called classes
§ Three-class classification problem
§ For a particular data point, the species it belongs to is called its label
IRIS:
Meet the
Data
§ Iris dataset, a classical dataset in machine learning and
statistics
§ It is included in scikit-learn in the datasets module
§ The data is loaded by calling the load_iris function
§ The object returned by load_iris is a Bunch object, which is very similar
to a dictionary: it contains keys and values
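The inputs and outputs shown on the original slides are not preserved. A sketch of loading the dataset and inspecting its main entries could look like this (the variable name iris_dataset is illustrative):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# The returned object behaves like a dictionary.
print("Keys of iris_dataset:", list(iris_dataset.keys()))
# Names of the three species to predict.
print("Target names:", iris_dataset["target_names"])
# Description of each of the four features.
print("Feature names:", iris_dataset["feature_names"])
# One row per flower, one column per measurement.
print("Shape of data:", iris_dataset["data"].shape)
print("First five rows of data:\n", iris_dataset["data"][:5])
# Species of each flower, encoded as integers 0 to 2.
print("Shape of target:", iris_dataset["target"].shape)
```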
IRIS:
Measuring
Success
Measuring Success: Training and Testing Data
§ Before we apply our model to new measurements, we
need to know whether it actually works (i.e., we should
know whether we can trust its predictions)
§ Cannot use the data we used to build the model to
evaluate it
§ Our model can always simply remember the whole
training set, and will therefore always predict the
correct label for any point in the training set
§ Remembering does not indicate to us whether our
model will generalize well
§ To assess the model’s performance, we show it new
data
IRIS:
Measuring
Success
§ This is usually done by splitting the labeled data we
have collected (here, our 150 flower measurements) into
two parts
§ One part of the data is used to build our machine learning
model, and is called the training data
§ The rest of the data will be used to assess how well the
model works; this is called the test data, test set, or hold-out
set
§ scikit-learn contains a function that shuffles the dataset and
splits it for you: the train_test_split function
§ This function extracts 75% of the rows in the data as the
training set, together with the corresponding labels for this
data. The remaining 25% of the data, together with the
remaining labels, is declared as the test set
§ Note:
§ Deciding how much data you want to put into the training and
the test set respectively is somewhat arbitrary, but using a test
set containing 25% of the data is a good rule of thumb
IRIS:
Measuring
Success
§ Data is usually denoted with a capital X
§ Labels are denoted by a lowercase y
§ This is inspired by the standard formulation f(x)=y in
mathematics, where x is the input to a function and y is the
output
§ use a capital X because the data is a two-dimensional
array (a matrix) and a lowercase y because the target is
a one-dimensional array (a vector)
§ Example 1 –
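The example from the slide is missing. A sketch of the split using train_test_split with its default 75%/25% proportions follows; the seed value 0 is an assumption chosen for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# X (capital) is the 2D data array, y (lowercase) the 1D label array.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
```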
IRIS:
Measuring
Success
§ Before making the split, the train_test_split function
shuffles the dataset using a pseudorandom number
generator
§ To make sure that we will get the same output if we run the
same function several times, we provide the
pseudorandom number generator with a fixed seed using
the random_state parameter
§ This will make the outcome deterministic, so this line will
always have the same outcome
IRIS:
Measuring
Success
§ The output of the train_test_split function is X_train, X_test,
y_train, and y_test, which are all NumPy arrays.
§ X_train contains 75% of the rows of the dataset
§ X_test contains the remaining 25%
§ Input :
§ Output:
§ Input :
§ Output:
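The original inputs and outputs are not preserved. The shapes of the four arrays can be checked with a sketch like this one, which assumes random_state=0 and the default split size:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# 75% of the 150 rows go to training, the remaining 25% to testing.
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```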
IRIS:
Inspect your
data
§ One of the best ways to inspect data is to visualize it
§ Scatter plot
§ A scatter plot of the data puts one feature along the x-axis and
another along the y-axis, and draws a dot for each data point
§ Disadvantages of scatter plot:
§ Allows us to plot only two (or maybe three) features at a time
§ Difficult to plot datasets with more than three features this way
Pair plot
§ Looks at all possible pairs of features
§ First convert the NumPy array into a pandas DataFrame
§ pandas has a function to create pair plots, called scatter_matrix
§ Disadvantages of pair plot:
§ A pair plot does not show the interaction of all features at once, so
some interesting aspects of the data may not be revealed
IRIS:
Pair Plot
Example
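The figure from the original slide is missing. A sketch of producing a pair plot of the training data with pandas follows; the styling arguments (figure size, marker, bins, point size, transparency) are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# Label the columns with the feature names so the axes are readable.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset["feature_names"])

# One scatter plot per pair of features, colored by species (y_train);
# the diagonal shows a histogram of each feature.
axes = pd.plotting.scatter_matrix(
    iris_dataframe, c=y_train, figsize=(15, 15), marker="o",
    hist_kwds={"bins": 20}, s=60, alpha=0.8)
```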
IRIS:
Building
First Model
K-Nearest Neighbors
§ We use a k-nearest neighbors classifier, which is easy to understand
§ Estimator Classes:
§ All machine learning models in scikit-learn are implemented
in their own classes, which are called Estimator classes
§ The k-nearest neighbors classification algorithm is implemented
in the KNeighborsClassifier class in the neighbors module
§ Before we can use the model, we need to instantiate the class into
an object
§ The most important parameter of KNeighborsClassifier is the
number of neighbors, which we will set to 1
§ Example 1 -
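The code from the slide is not preserved. Instantiating the classifier with a single neighbor can be sketched as:

```python
from sklearn.neighbors import KNeighborsClassifier

# Create the estimator object; n_neighbors=1 uses only the closest training point.
knn = KNeighborsClassifier(n_neighbors=1)
```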
IRIS:
Building
First Model
§ The knn object encapsulates the algorithm that will be used to
build the model from the training data, as well as the algorithm to
make predictions on new data points
§ To build the model on the training set, we call the fit method of
the knn object which takes as arguments
§ NumPy array X_train containing the training data
§ NumPy array y_train of the corresponding training labels
§ Example 2 –
§ Output :
§ The fit method returns the knn object itself
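The fitting step described above can be sketched as follows, assuming the same split as before (random_state=0):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
# fit stores the training data inside the estimator and returns the estimator.
fitted = knn.fit(X_train, y_train)
print(fitted)
```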
IRIS:
Building
First Model
§ Most models in scikit-learn have many parameters, but the
majority of them are either speed optimizations or for very
special use cases
Making predictions
§ Imagine we found an iris in the wild with a sepal length of 5 cm, a
sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of
0.2 cm. What species of iris would this be?
§ Example –
§ Output :
§ To make a prediction, we call the predict method of the knn
object
IRIS:
Building
First Model
§ Example -
§ Output :
§ Our model predicts that this new iris belongs to the class 0,
meaning its species is setosa.
§ But how do we know whether we can trust our model?
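The prediction step described above can be sketched as follows. scikit-learn expects a two-dimensional array, so the single flower is wrapped as one row; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# One sample (row) with four features (columns):
# sepal length, sepal width, petal length, petal width.
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)

prediction = knn.predict(X_new)
predicted_name = iris_dataset["target_names"][prediction][0]
print("Prediction:", prediction)
print("Predicted target name:", predicted_name)
```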
IRIS:
Evaluating
the Model
Evaluating the model
§ This is where the test set that we created earlier comes in
§ This data was not used to build the model, but we do know
what the correct species is for each iris in the test set
§ Input -
§ Output :
§ Input -
§ Output:
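The original inputs and outputs are not preserved. One way to evaluate the model is to predict a label for every iris in the test set and compute the fraction of correct predictions, sketched here under the same assumed split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a species for each flower the model has not seen during training.
y_pred = knn.predict(X_test)
print("Test set predictions:\n", y_pred)

# Accuracy: the fraction of test flowers whose predicted label is correct.
accuracy = np.mean(y_pred == y_test)
print("Test set score: {:.2f}".format(accuracy))
```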
IRIS:
Evaluating
the Model
Score method :
§ Input –
§ Output :
§ Test set accuracy is about 0.97, which means that the
predictions for 97% of the irises in the test set were correct
§ This high level of accuracy means that our model may
be trustworthy enough to use
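The score method mentioned above computes the test set accuracy in one call. A sketch, assuming the same split and model as in the earlier steps:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# score predicts on X_test and compares the result with y_test.
test_score = knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))
```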
Thank you
Ad

More Related Content

Similar to Machine Learning - Implementation with Python - 1 (20)

Hala GPT - Samer Desouky.pdf
Hala GPT - Samer Desouky.pdfHala GPT - Samer Desouky.pdf
Hala GPT - Samer Desouky.pdf
Samer Desouky
 
AI - Artificial Intelligence - Implications for Libraries
AI - Artificial Intelligence - Implications for LibrariesAI - Artificial Intelligence - Implications for Libraries
AI - Artificial Intelligence - Implications for Libraries
Brian Pichman
 
Deep learning
Deep learningDeep learning
Deep learning
AnimaSinghDhabal
 
MACHINE LEARNING PPT.pptx for the machine learning studnets
MACHINE LEARNING PPT.pptx for the machine learning studnetsMACHINE LEARNING PPT.pptx for the machine learning studnets
MACHINE LEARNING PPT.pptx for the machine learning studnets
AadityaRathi4
 
Introduction of machine learning.pptx
Introduction of machine learning.pptxIntroduction of machine learning.pptx
Introduction of machine learning.pptx
Dr.Shweta
 
Artificial Intelligence with Python | Edureka
Artificial Intelligence with Python | EdurekaArtificial Intelligence with Python | Edureka
Artificial Intelligence with Python | Edureka
Edureka!
 
How to use Artificial Intelligence with Python? Edureka
How to use Artificial Intelligence with Python? EdurekaHow to use Artificial Intelligence with Python? Edureka
How to use Artificial Intelligence with Python? Edureka
Edureka!
 
AI_in_6_Hours_lyst1728638806090-invert.pdf
AI_in_6_Hours_lyst1728638806090-invert.pdfAI_in_6_Hours_lyst1728638806090-invert.pdf
AI_in_6_Hours_lyst1728638806090-invert.pdf
magadhcyberworld
 
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
RudrakshAmar
 
Lect 7 intro to M.L..pdf
Lect 7 intro to M.L..pdfLect 7 intro to M.L..pdf
Lect 7 intro to M.L..pdf
HassanElalfy4
 
Introduction ML - Introduçao a Machine learning
Introduction ML - Introduçao a Machine learningIntroduction ML - Introduçao a Machine learning
Introduction ML - Introduçao a Machine learning
julianaantunes58
 
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MAHIRA
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a profession
Jose Quesada
 
Machine-Learning-and-Robotics.pptx
Machine-Learning-and-Robotics.pptxMachine-Learning-and-Robotics.pptx
Machine-Learning-and-Robotics.pptx
shohel rana
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Introduction_to_Machine_Learning_KUMAR.pdf
Introduction_to_Machine_Learning_KUMAR.pdfIntroduction_to_Machine_Learning_KUMAR.pdf
Introduction_to_Machine_Learning_KUMAR.pdf
ssuser012286
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Darshan Ambhaikar
 
Machine learing
Machine learingMachine learing
Machine learing
Abu Saleh Muhammad Shaon
 
Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learning
HarshitBarde
 
Artificial intelligence slides beginners
Artificial intelligence slides beginners Artificial intelligence slides beginners
Artificial intelligence slides beginners
Antonio Fernandes
 
Hala GPT - Samer Desouky.pdf
Hala GPT - Samer Desouky.pdfHala GPT - Samer Desouky.pdf
Hala GPT - Samer Desouky.pdf
Samer Desouky
 
AI - Artificial Intelligence - Implications for Libraries
AI - Artificial Intelligence - Implications for LibrariesAI - Artificial Intelligence - Implications for Libraries
AI - Artificial Intelligence - Implications for Libraries
Brian Pichman
 
MACHINE LEARNING PPT.pptx for the machine learning studnets
MACHINE LEARNING PPT.pptx for the machine learning studnetsMACHINE LEARNING PPT.pptx for the machine learning studnets
MACHINE LEARNING PPT.pptx for the machine learning studnets
AadityaRathi4
 
Introduction of machine learning.pptx
Introduction of machine learning.pptxIntroduction of machine learning.pptx
Introduction of machine learning.pptx
Dr.Shweta
 
Artificial Intelligence with Python | Edureka
Artificial Intelligence with Python | EdurekaArtificial Intelligence with Python | Edureka
Artificial Intelligence with Python | Edureka
Edureka!
 
How to use Artificial Intelligence with Python? Edureka
How to use Artificial Intelligence with Python? EdurekaHow to use Artificial Intelligence with Python? Edureka
How to use Artificial Intelligence with Python? Edureka
Edureka!
 
AI_in_6_Hours_lyst1728638806090-invert.pdf
AI_in_6_Hours_lyst1728638806090-invert.pdfAI_in_6_Hours_lyst1728638806090-invert.pdf
AI_in_6_Hours_lyst1728638806090-invert.pdf
magadhcyberworld
 
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
AI in 6 Hours this pdf contains a general idea of how AI will be asked in the...
RudrakshAmar
 
Lect 7 intro to M.L..pdf
Lect 7 intro to M.L..pdfLect 7 intro to M.L..pdf
Lect 7 intro to M.L..pdf
HassanElalfy4
 
Introduction ML - Introduçao a Machine learning
Introduction ML - Introduçao a Machine learningIntroduction ML - Introduçao a Machine learning
Introduction ML - Introduçao a Machine learning
julianaantunes58
 
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MACHINE LEARNING PRESENTATION (ARTIFICIAL INTELLIGENCE)
MAHIRA
 
Future of data science as a profession
Future of data science as a professionFuture of data science as a profession
Future of data science as a profession
Jose Quesada
 
Machine-Learning-and-Robotics.pptx
Machine-Learning-and-Robotics.pptxMachine-Learning-and-Robotics.pptx
Machine-Learning-and-Robotics.pptx
shohel rana
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Introduction_to_Machine_Learning_KUMAR.pdf
Introduction_to_Machine_Learning_KUMAR.pdfIntroduction_to_Machine_Learning_KUMAR.pdf
Introduction_to_Machine_Learning_KUMAR.pdf
ssuser012286
 
Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learning
HarshitBarde
 
Artificial intelligence slides beginners
Artificial intelligence slides beginners Artificial intelligence slides beginners
Artificial intelligence slides beginners
Antonio Fernandes
 

Machine Learning - Implementation with Python - 1

  • 1. Machine Learning Source: Introduction to Machine Learning with Python Authors: Andreas C. Müller and Sarah Guido
  • 2. Unit - I Introduction to Machine Learning with Python
  • 3. Agenda Introduction Why Machine Learning? Problems Machine Learning Can Solve Knowing Your Task and Knowing Your Data
  • 4. Agenda Why Python? Scikit-learn Essential Libraries and Tools A First Application: Classifying Iris Species
  • 6. What is ML? § “Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed” § Machine learning is a subset of artificial intelligence § Concerned with the development of algorithms that allow a computer to learn on its own from data and past experience § The term was coined by Arthur Samuel in 1959 § With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed
  • 7. Why ML? § The need for machine learning is increasing day by day § Machine learning can perform tasks that are too complex for a person to implement directly § Humans cannot process huge amounts of data manually --> we need computer systems --> machine learning makes this easier for us § With the help of machine learning, we can save both time and money
  • 8. Why ML? § Rapid growth in the production of data § Solving complex problems that are difficult for a human § Decision making in various sectors, including finance § Finding hidden patterns and extracting useful information from data
  • 9. History § About 40-50 years ago, machine learning was science fiction § Today ML is part of our lives § However, the idea behind machine learning is old and has a long history
  • 11. History § Machine intelligence in games - § 1952: Arthur Samuel, a pioneer of machine learning, created a program that enabled an IBM computer to play checkers. It performed better the more it played. § 1959: The term "Machine Learning" was coined by Arthur Samuel. § The first "AI winter" - § The period from 1974 to 1980 was a tough time for AI and ML researchers and became known as the AI winter. § During this period, machine translation failed to meet expectations and interest in AI declined, which led to reduced government funding for research. § Machine learning from theory to reality - § 1959: The first neural network was applied to a real-world problem: removing echoes over phone lines using an adaptive filter. § 1985: Terry Sejnowski and Charles Rosenberg invented the neural network NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in one week. § 1997: IBM's Deep Blue computer won a chess match against world champion Garry Kasparov, becoming the first computer to defeat a reigning world chess champion in a match.
  • 12. History § Machine learning in the 21st century - § 2006: Computer scientist Geoffrey Hinton gave neural net research a new name, "deep learning", and it has since become one of the most popular technologies § 2012: Google created a deep neural network that learned to recognize images of humans and cats in YouTube videos. § 2014: The chatbot "Eugene Goostman" was reported to have passed the Turing Test. It was the first chatbot to convince 33% of the human judges that it was not a machine. § 2014: DeepFace was a deep neural network created by Facebook, which claimed that it could recognize a person with the same precision as a human. § 2016: AlphaGo beat Lee Sedol, then ranked second in the world, at the game of Go. In 2017 it beat the top-ranked player, Ke Jie. § 2017: Alphabet's Jigsaw team built an intelligent system that learned to identify online trolling. It read millions of comments from different websites in order to learn to stop online trolling
  • 13. Block Diagram of ML § Machine Learning system learns from historical data, builds the prediction models, and whenever it receives new data, predicts the output for it § Machine learning has changed our way of thinking about the problem § Block Diagram of ML -
  • 14. Prerequisites § Mathematics - § Linear Algebra § Probability § Calculus § Derivatives § Single variable and multivariate functions § Programming Languages - § Python § R § C++ § Java § JavaScript § MATLAB
  • 17. Introduction § Machine learning is about extracting knowledge from data § It is a research field at the intersection of § Statistics § Mathematics § Artificial intelligence § Computer science § Also known as predictive analytics or statistical learning § Ubiquitous in everyday life § Examples – § From automatic recommendations of which movies to watch § To what food to order § Which products to buy § To personalized online radio § Recognizing your friends in your photos § Many modern websites and devices have machine learning algorithms at their core
  • 18. Introduction § Outside of commercial applications, machine learning has had a tremendous influence on the way data-driven research is done today § Diverse scientific problems such as § Understanding stars § Finding distant planets § Discovering new particles § Analyzing DNA sequences § Providing personalized cancer treatments
  • 19. Why Machine Learning Hand coded Rules advantages Hand coded Rules disadvantages Hand coded Rules Example
  • 20. Why Machine Learning? § Many systems used hand-coded rules of “if” and “else” decisions to process data or adjust to user input. § Example – § Spam filter --> Blacklist of words § Possible with hand-coded rules § Advantages of hand-coded rules § Expert-designed rule system § Manually crafting decision rules is feasible --> for some applications --> in which humans have a good understanding of the process to model
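The blacklist spam filter mentioned above can be sketched as a hand-coded rule. This is only an illustration: the `BLACKLIST` words and the `is_spam` helper are hypothetical and do not come from the slides.

```python
# Hypothetical blacklist; the words are illustrative only.
BLACKLIST = {"lottery", "winner", "prize"}


def is_spam(message: str) -> bool:
    """Flag a message as spam if it contains any blacklisted word."""
    # Normalize: split on whitespace, strip common punctuation, lowercase.
    words = {word.strip(".,!?").lower() for word in message.split()}
    return not BLACKLIST.isdisjoint(words)


print(is_spam("You are a lottery WINNER!"))  # True
print(is_spam("Meeting moved to 3 pm"))      # False
```

The rule is easy to read and to write, but any new kind of spam needs a manual change to the word list, which is the limitation the following slides describe.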
  • 21. Why Machine Learning § Disadvantages of Hand coded rules § The logic required to make a decision is specific to a single domain and task --> Changing the task even slightly might require a rewrite of the whole system § Designing rules requires a deep understanding of how a decision should be made by a human expert
  • 22. Why Machine Learning § Example - Face recognition § Unsolved problem until as recently as 2001 § The way pixels are perceived by the computer is very different from how humans perceive a face § Not possible with hand coded rules § Basically impossible for a human to come up with a good set of rules to describe what constitutes a face in a digital image. § Hence, the need for ML
  • 23. Problems Machine Learning Can Solve Supervised learning Disadvantages of supervised learning Example of supervised learning
  • 24. Problems Machine Learning Can Solve Unsupervised learning Disadvantages of unsupervised learning Example of unsupervised learning
  • 25. Problems Machine Learning Can Solve Supervised learning § Decision-making processes by generalizing from known examples § Supervised learning - § The user provides the algorithm with pairs of desired inputs and outputs, and the algorithm finds a way to produce the desired output given an input § Example – § Spam classification § User provides the algorithm with a large number of emails (which are the input), together with information about whether any of these emails are spam (which is the desired output). § Given a new email, the algorithm will then produce a prediction as to whether the new email is spam.
  • 26. Problems Machine Learning Can Solve § Supervised Learning § Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms. § Disadvantages of supervised learning § creating a dataset of inputs and outputs is often a laborious manual process § Examples of supervised machine learning tasks include: § Identifying the zip code from handwritten digits on an envelope § Determining whether a tumor is benign based on a medical image § Detecting fraudulent activity in credit card transactions
  • 27. Problems Machine Learning Can Solve § Unsupervised algorithms § Only the input data is known, and no known output data is given to the algorithm § Disadvantages of unsupervised learning § Usually harder to understand and evaluate § Example – § Identifying topics in a set of blog posts [Information Analysis] § Large collection of text data --> Summarize it --> Find prevalent themes in it § Might not know beforehand what these topics are, or how many topics there might be § Segmenting customers into groups with similar preferences [Segmentation/Clustering] § Given a set of customer records --> Identify which customers are similar --> Group customers with similar preferences § Shopping site --> Group into bookworms, gamers, parents, etc. § Detecting abnormal access patterns to a website § To find abuse or bugs, it is often helpful to find access patterns that are different from the norm.
  • 28. Important Points Note 1: § No machine learning algorithm will be able to make a prediction on data for which it has no information § Example - § Can we predict gender using last name as the only feature? Note 2: § For both supervised and unsupervised learning tasks, a representation of the input data that a computer can understand is important § Think of your data as a table § Row - § Each data point that you want to reason about is a row § Each entity or row is known as a sample § Column - § Each property that describes that data point is a column § Columns are also called features
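The table view in Note 2 can be illustrated with a small pandas DataFrame. The customer records and column names here are made up for this sketch.

```python
import pandas as pd

# Hypothetical customer table: each row is a sample, each column a feature.
customers = pd.DataFrame(
    {
        "age": [24, 41, 35],
        "purchases": [3, 12, 7],
        "city": ["Kakinada", "Chennai", "Pune"],
    }
)

n_samples, n_features = customers.shape
print("samples (rows):", n_samples)
print("features (columns):", n_features)
print("feature names:", list(customers.columns))
```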
  • 30. Knowing Your Task and Knowing Your Data § The most important part of the machine learning process is understanding the data you are working with and how it relates to the task you want to solve § Never throw your data at a randomly chosen ML algorithm § It is very important to understand what is in your dataset before you begin building a model § Each algorithm is different in terms of what kind of data and what problem setting it works best for
  • 31. Knowing Your Task and Knowing Your Data Always have an answer to the following questions when building an ML model § What question(s) am I trying to answer? Do I think the data collected can answer that question? § What is the best way to phrase my question(s) as a machine learning problem? § Have I collected enough data to represent the problem I want to solve? § What features of the data did I extract, and will these enable the right predictions? § How will I measure success in my application? § How will the machine learning solution interact with other parts of my research or business product? § Many people spend a lot of time building complex machine learning solutions, only to find out they don’t solve the right problem
  • 33. Why Python § Python has become the lingua franca for many data science applications § Combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages § Python has libraries for – § Data loading § Visualization § Statistics § Natural language processing § Image processing § This vast toolbox provides data scientists with a large array of general- and special-purpose functionality
  • 34. Why Python § One of the main advantages of using Python is § the ability to interact directly with the code, using a terminal or other tools like the Jupyter Notebook § Machine learning and data analysis are fundamentally iterative processes, in which the data drives the analysis § Python has tools that allow quick iteration and easy interaction § As a general-purpose programming language, Python also allows for the § Creation of complex graphical user interfaces (GUIs) § Web services § Integration into existing systems
  • 36. Scikit-learn § Open source project § Free to use and distribute, and anyone can easily obtain the source code to see what is going on behind the scenes § Constantly being developed and improved § Very active user community § It contains a number of state-of-the-art machine learning algorithms, as well as comprehensive documentation about each algorithm § Scikit-learn is a very popular tool, and the most prominent Python library for machine learning § It is widely used in industry and academia § It has a wealth of tutorials § It has a number of code snippets available online
  • 37. Scikit-learn Scikit-learn depends on two other Python packages § NumPy § SciPy § For plotting and interactive development we should install § Matplotlib § IPython § Jupyter Notebook
  • 39. Anaconda § A Python distribution made for – § large-scale data processing § predictive analytics § scientific computing § Anaconda comes with - § NumPy § SciPy § Matplotlib § pandas § IPython § Jupyter Notebook § scikit-learn § Available on § Mac OS § Windows § Linux § Includes the commercial Intel MKL library for free - § significant speed improvements for many algorithms in scikit-learn
  • 40. Enthought Canopy § Another Python distribution for scientific computing § Enthought Canopy comes with - § NumPy § SciPy § Matplotlib § pandas § IPython § Free version does not come with scikit-learn § Academic, degree-granting institutions can request an academic license and get free access to the paid subscription version of Enthought Canopy
  • 41. Python(x,y) § Free Python distribution for scientific computing § Python(x,y) comes with - § NumPy § SciPy § Matplotlib § pandas § IPython § scikit-learn § If you already have a Python installation, use pip to install all of these packages:
  • 43. Jupyter Notebooks § Jupyter Notebook § Jupyter notebook is an interactive environment for running code in the browser. § Great tool for exploratory data analysis § Widely used by data scientists § The Jupyter Notebook makes it easy to incorporate code, text, and images
  • 44. NumPy § NumPy arrays § NumPy is one of the fundamental packages for scientific computing in Python § It contains functionality for – § Multidimensional arrays § High-level mathematical functions such as linear algebra operations § The Fourier transform § Pseudorandom number generators § In scikit-learn, the NumPy array is the fundamental data structure § Any data you’re using will have to be converted to a NumPy array § All elements of a NumPy array must be of the same type. § Example – § Output :
  • 45. SciPy § Collection of functions for scientific computing in Python § Other functionality § Advanced linear algebra routines § Mathematical function optimization § Signal processing § Special mathematical functions § Statistical distributions § Scikit-learn draws from SciPy’s collection of functions for implementing its algorithms
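The example and its output did not survive on this slide. A minimal sketch of creating a NumPy array follows; the values are illustrative.

```python
import numpy as np

# A two-dimensional array with 2 rows and 3 columns.
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
print("shape:", x.shape)

# All elements share one type: mixing ints and floats upcasts to float.
mixed = np.array([1, 2.5])
print("dtype of mixed array:", mixed.dtype)
```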
  • 46. scipy scipy.sparse § The most important part of SciPy for us is scipy.sparse § this provides sparse matrices, which are another representation that is used for data in scikit-learn. § Sparse matrices are used whenever we want to store a 2D array that contains mostly zeros § Example 1- Output :
  • 47. scipy CSR format - Compressed Sparse Row § Example 2- Output :
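The two sparse-matrix examples are missing from these slides. The sketch below is a possible reconstruction, not the original: it converts a dense array to CSR format and then builds the same matrix from coordinates before converting it to CSR.

```python
import numpy as np
from scipy import sparse

# A dense 2D array that is mostly zeros: ones on the diagonal only.
eye = np.eye(4)

# Convert the dense array to Compressed Sparse Row (CSR) format.
# Only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy sparse CSR matrix:\n{}".format(sparse_matrix))

# Build the same matrix directly from (value, (row, column)) triples,
# which avoids creating the dense array first, then convert to CSR.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
eye_csr = eye_coo.tocsr()
print("Stored nonzero entries:", eye_csr.nnz)
```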
  • 49. Matplotlib § Primary scientific plotting library in Python § Provides functions for making publication-quality visualizations such as § Line charts § Histograms § Scatter plots § We use Matplotlib for all our visualizations § Visualizing data helps us understand its different aspects --> leads to easier analysis of the data § Inside the Jupyter Notebook, you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib inline commands
  • 51. Pandas § Python library for data wrangling and analysis § It is built around a data structure called the DataFrame § A pandas DataFrame is a table, similar to an Excel spreadsheet § pandas provides a great range of methods to modify and operate on this table § Allows SQL-like queries and joins of tables § pandas allows each column to have a separate type § Integers § Dates § Floating-point numbers § Strings
  • 52. Pandas § Pandas has the ability to ingest from a great variety of file formats and databases, like - § SQL § Excel files § comma-separated values (CSV) files § Example 1-
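The example for this slide is missing. The following sketch shows one way to create a DataFrame, query it, and read CSV text; the records and column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical records; each dictionary key becomes a column.
data = {
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Age": [24, 13, 53, 33],
}
data_pandas = pd.DataFrame(data)
print(data_pandas)

# SQL-like query: select all rows where the Age column is greater than 30.
older_than_30 = data_pandas[data_pandas.Age > 30]
print(older_than_30)

# Reading CSV: an in-memory string stands in for a file on disk here.
csv_text = "Name,Age\nJohn,24\nAnna,13\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv)
```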
  • 53. Import § Import various libraries § import numpy as np § import matplotlib.pyplot as plt § import pandas as pd § import mglearn § from IPython.display import display § Print versions –
  • 54. A First Application: Classifying Iris Species Meet the Data Measuring success: Training and Testing Data Scatter plot Pair plot K Nearest Neighbors Making Predictions Evaluating the model
  • 55. A First Application: Classifying Iris Species § A hobby botanist is interested in distinguishing the species of some iris flowers that she has found. § She has collected some measurements associated with each iris: § The length and width of the petals § And the length and width of the sepals, § All measured in centimeters § Each iris belongs to one of three species: setosa, versicolor, or virginica § The goal is to build a machine learning model that can learn from the measurements of these irises whose species is known, so that we can predict the species for a new iris.
  • 56. A First Application: Classifying Iris Species § We have measurements for which we know the correct species of iris § This is a supervised learning problem § This is an example of a classification problem § The different species of irises are called classes § Three-class classification problem § For a particular data point, the species it belongs to is called its label
  • 57. IRIS: Meet the Data § Iris dataset, a classical dataset in machine learning and statistics § It is included in scikit-learn in the datasets module § load_iris function: § It is very similar to a dictionary. It contains keys and values
  • 59. IRIS: Meet the Data § Input: § Output : § Input: § Output : § Input: § Output :
  • 60. IRIS: Meet the Data § Input: § Output : § Input: § Output : § Input: § Output :
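The inputs and outputs on these slides were not preserved. A sketch of the kind of inspection they likely showed, using `load_iris` from scikit-learn:

```python
from sklearn.datasets import load_iris

# load_iris returns a dictionary-like object with keys and values.
iris_dataset = load_iris()
print("Keys:", list(iris_dataset.keys()))

# Names of the three species (the classes) and of the four features.
print("Target names:", iris_dataset["target_names"])
print("Feature names:", iris_dataset["feature_names"])

# The data is a NumPy array: one row per flower, one column per feature.
print("Shape of data:", iris_dataset["data"].shape)
print("First five rows:\n", iris_dataset["data"][:5])

# The target holds the species of each flower, encoded as 0, 1 or 2.
print("Shape of target:", iris_dataset["target"].shape)
```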
  • 62. IRIS: Measuring Success Training and Testing Data § Before we apply our model to new measurements, we need to know whether it actually works (i.e., we should know whether we can trust its predictions) § Cannot use the data we used to build the model to evaluate it § Our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set § Remembering does not indicate to us whether our model will generalize well § To assess the model’s performance, we show it new data
  • 63. IRIS: Measuring Success § This is usually done by splitting the labeled data we have collected (here, our 150 flower measurements) into two parts § One part of the data is used to build our machine learning model, and is called the training data § The rest of the data will be used to assess how well the model works; this is called the test data, test set, or hold-out set § scikit-learn contains a function that shuffles the dataset and splits it for you: the train_test_split function § This function extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, is declared as the test set § Note: § Deciding how much data you want to put into the training and the test set respectively is somewhat arbitrary, but using a test set containing 25% of the data is a good rule of thumb
  • 64. IRIS: Measuring Success § Data is usually denoted with a capital X § Labels are denoted by a lowercase y § This is inspired by the standard formulation f(x)=y in mathematics, where x is the input to a function and y is the output § We use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector) § Example 1 –
  • 65. IRIS: Measuring Success § Before making the split, the train_test_split function shuffles the dataset using a pseudorandom number generator § To make sure that we will get the same output if we run the same function several times, we provide the pseudorandom number generator with a fixed seed using the random_state parameter § This will make the outcome deterministic, so this line will always have the same outcome
  • 66. IRIS: Measuring Success § The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays. § X_train contains 75% of the rows of the dataset § X_test contains the remaining 25% § Input : § Output: § Input : § Output:
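The split described on the slides above can be sketched as follows. The seed `random_state=0` is an assumption; the slides only say that a fixed seed is used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# Shuffle and split: by default 75% of the rows go to the training set
# and 25% to the test set. random_state fixes the shuffle so the split
# is the same on every run (the value 0 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```

With 150 flowers, this gives 112 training rows and 38 test rows.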
  • 67. IRIS: Inspect your data § One of the best ways to inspect data is to visualize it § Scatter plot § A scatter plot of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point § Disadvantages of scatter plot: § Allows us to plot only two (or maybe three) features at a time § Difficult to plot datasets with more than three features this way Pair plot § Looks at all possible pairs of features § First convert the NumPy array into a pandas DataFrame § pandas has a function to create pair plots § Disadvantages of pair plot: § A pair plot does not show the interaction of all of the features at once, so some interesting aspects of the data may not be revealed
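A pair plot along these lines could be produced with `pandas.plotting.scatter_matrix`. This is a sketch: the styling arguments and the seed are assumptions, and the `Agg` backend is selected only so the script runs without a display (omit that line inside a Jupyter Notebook).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit in a notebook

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

# Convert the NumPy array to a DataFrame, labelling the columns
# with the feature names.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset["feature_names"])

# One scatter plot per pair of features, colored by species;
# histograms of each feature appear on the diagonal.
axes = pd.plotting.scatter_matrix(
    iris_dataframe,
    c=y_train,
    figsize=(10, 10),
    marker="o",
    hist_kwds={"bins": 20},
    alpha=0.8,
)
print("Grid of plots:", axes.shape)
```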
  • 69. IRIS: Building First Model K-Nearest Neighbors § We use a k-nearest neighbors classifier, which is easy to understand § Estimator Classes: § All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes § The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the neighbors module § Before we can use the model, we need to instantiate the class into an object § The most important parameter of KNeighborsClassifier is the number of neighbors, which we will set to 1 § Example 1 -
  • 70. IRIS: Building First Model § The knn object encapsulates the algorithm that will be used to build the model from the training data, as well as the algorithm to make predictions on new data points § To build the model on the training set, we call the fit method of the knn object, which takes as arguments § NumPy array X_train containing the training data § NumPy array y_train of the corresponding training labels § Example 2 – § Output : § The fit method returns the knn object itself
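The two missing examples, instantiating the classifier and fitting it, can be sketched as below. The seed used for the split is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

# Instantiate the estimator; n_neighbors=1 means each prediction
# uses only the single closest training point.
knn = KNeighborsClassifier(n_neighbors=1)

# Build the model from the training data and labels.
# fit returns the estimator itself.
fitted = knn.fit(X_train, y_train)
print(fitted is knn)
```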
  • 71. IRIS: Building First Model § Most models in scikit-learn have many parameters, but the majority of them are either speed optimizations or for very special use cases Making predictions § Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? § Example – § Output : § To make a prediction, we call the predict method of the knn object
  • 72. IRIS: Building First Model § Example - § Output : § Our model predicts that this new iris belongs to the class 0, meaning its species is setosa. § But how do we know whether we can trust our model?
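The prediction step can be sketched as follows, using the measurements given on the slide. The name `X_new` and the seed are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# scikit-learn expects a 2D array, so the single flower is one row:
# sepal length, sepal width, petal length, petal width (all in cm).
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape:", X_new.shape)

prediction = knn.predict(X_new)
print("Prediction:", prediction)
print("Predicted species:", iris_dataset["target_names"][prediction])
```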
  • 73. IRIS: Evaluating the Model Evaluating the model § This is where the test set that we created earlier comes in § This data was not used to build the model, but we do know what the correct species is for each iris in the test set § Input - § Output : § Input - § Output:
  • 74. IRIS: Evaluating the Model Score method : § Input – § Output : § Test set accuracy is about 0.97, which means that the predictions for 97% of the irises in the test set were correct § A high level of accuracy means that our model may be trustworthy enough to use
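The evaluation steps whose inputs and outputs are missing above can be sketched in two equivalent ways: comparing predictions with the known labels by hand, and calling the `score` method. The seed is an assumption, so the exact accuracy may differ from the figure on the slide.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict a label for every iris in the test set.
y_pred = knn.predict(X_test)
print("Test set predictions:\n{}".format(y_pred))

# Accuracy is the fraction of test flowers whose species was predicted
# correctly.
manual_accuracy = np.mean(y_pred == y_test)
print("Test set accuracy: {:.2f}".format(manual_accuracy))

# The score method computes the same accuracy in one call.
score_accuracy = knn.score(X_test, y_test)
print("Test set accuracy (score): {:.2f}".format(score_accuracy))
```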