AI, DS & ML

Contextual Commerce
An online retail model in which content - videos, articles, reviews, photos - lets consumers buy the items featured within it directly, without being redirected to another site.
Technologies: Optimization, NLP, ML
Companies: PUMA, Bazaar, Ted Baker

Conversational Commerce
Benefits: Computers will converse with clients in human languages, understand their needs and emotions, and assist them.

Actionable Analytics
Benefits: Understanding needs, suggesting the best match.

Augmented Reality
How will things look in their actual place?
Benefits: iOS 11

Python for Data Science:
Python is an open source, general-purpose programming language. It supports both structured and object-oriented styles of programming. It can be utilized for developing a wide range of applications including web applications, data analytics, machine learning applications etc.

DATA TYPES: Python provides various data types and data structures for storing and processing data. For handling single values, there are data types like int, float, str, and bool. For handling data in groups, python provides data structures like list, tuple, dictionary, set, etc.

LIBRARIES: Python has a wide range of libraries and built-in functions which aid in rapid development of applications. Python libraries are collections of pre-written code to perform specific tasks. This eliminates the need of rewriting the code from scratch.

EXAMPLE: John is a software developer. His project requires developing an application that connects to various database servers like MySQL, PostgreSQL, MongoDB etc. To implement this requirement from scratch, John needs to invest his time and effort to understand the underlying architectures of the respective databases. Instead, John can choose to use pre-defined libraries to perform the database operations, which abstracts the complexities involved.

Use of libraries will help John in the following ways:
Faster application development – Libraries promote code reusability and help the developers save time and focus on building the functional logic.
Enhance code efficiency – Use of pre-tested libraries enhances the quality and stability of the application.
Achieve code modularization – Libraries can be coupled or decoupled based on requirement.

Over the last two decades, python has emerged as a first-choice tool for tasks that involve scientific computing, including the analysis and visualization of large datasets. Python has gained popularity, particularly in the field of data science, because of its large and active ecosystem of third-party libraries. A few of the popular libraries in data science include NumPy, Pandas, Matplotlib and Scikit-Learn.

NUMPY:
Basic example: A python List can be used to store a group of elements together in a sequence. It can contain heterogeneous elements.
Following are some examples of Lists:
item_list = ['Bread', 'Milk', 'Eggs', 'Butter', 'Cocoa']
student_marks = [78, 47, 96, 55, 34]
hetero_list = [1, 2, 3.0, 'text', True, 3+2j]
To perform operations on the List elements, one needs to iterate over the list, for example as sketched below.
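A minimal sketch of such an iteration, using the student_marks list defined above:

# iterate over the list and operate on each element
total = 0
for marks in student_marks:
    total = total + marks
print(total)  # sum of all the marks in the list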
NumPy serves as a building block for many libraries available in Python.

Data structures in Numpy
The main data structure of NumPy is the ndarray or n-dimensional array. The ndarray is a multidimensional container of elements of the same type. It can easily deal with matrix and vector operations.
1. As the array size increases, Numpy can execute more parallel operations, thereby making computation faster. When the array size gets close to 5,000,000, NumPy gets around 120 times faster than a Python List.
2. NumPy has many optimized built-in mathematical functions. These functions help in performing a variety of complex mathematical computations faster and with very minimal code.
3. Another great feature of NumPy is that it has multidimensional array data structures that can represent vectors and matrices. This is useful as a lot of machine learning algorithms rely on matrix operations.

IMPORTING NUMPY: The Numpy library needs to be imported in the environment before it can be used, as shown below. 'np' is the standard alias used for Numpy.

A Numpy array can be created by using the array() function. The array() function in Numpy returns an array object named ndarray.
Syntax: np.array(object, dtype)
object – A python object (for example, a list)
dtype – data type of object (for example, integer)

Example: Consider the following marks scored by students:
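A minimal sketch of the import and of the marks example, assuming three students with five subject scores each (the first row reuses the student_marks values from earlier; the other rows are hypothetical):

import numpy as np   # 'np' is the standard alias

# marks: 3 students (rows) x 5 subjects (columns)
marks = [[78, 47, 96, 55, 34],
         [65, 80, 72, 90, 58],
         [70, 62, 85, 77, 91]]

marks_arr = np.array(marks, dtype=int)
print(marks_arr.shape)   # (3, 5) -> 3 rows, 5 elements per row
print(marks_arr.dtype)   # integer data type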
Since each column contains homogenous values, Numpy arrays can be used to represent them. Here, 3 represents the number of rows and 5 represents the number of elements in each row. 'dtype' refers to the data type of the data contained by the array; Numpy supports data types such as int, float and bool.
Let us understand how to represent the car 'horsepower' values in a Numpy array.
#creating a list of 5 horsepower values
horsepower = [130, 165, 150, 150, 140]

#creating a numpy array from horsepower list
horsepower_arr = np.array(horsepower)

x = np.where(horsepower_arr >= 150)
print(x)  # gives the indices

# With the indices, we can find those values
horsepower_arr[x]

SORTING: The NumPy array can be sorted by passing the array to the function np.sort(array) or by calling array.sort(). So, what is the difference between these two functions though they are used for the same functionality? The difference is that the array.sort() function modifies the original array (it sorts in place), whereas np.sort(array) does not; it returns a sorted copy.
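A small sketch of this difference, using hypothetical values:

import numpy as np

arr = np.array([3, 1, 2])

print(np.sort(arr))   # returns a sorted copy: [1 2 3]
print(arr)            # original unchanged: [3 1 2]

arr.sort()            # sorts the array in place
print(arr)            # [1 2 3]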
Figure 1 NUMPY OPERATORS

BROADCASTING:
"Broadcasting" refers to how Numpy handles arrays with different shapes during arithmetic operations.
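A minimal sketch of broadcasting, with illustrative values:

import numpy as np

arr = np.array([10, 20, 30, 40])

# the scalar 5 is "broadcast" across every element of arr
print(arr + 5)        # [15 25 35 45]

# a (3, 1) column and a length-4 row broadcast to a (3, 4) result
col = np.array([[1], [2], [3]])
row = np.array([10, 20, 30, 40])
print(col * row)      # 3 x 4 array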
img = imread(os.path.join(data_dir, 'astronaut.png'))

To view the image as a matrix, the below command must be followed:
print(img)

#Slicing out the rocket
img_slice = img.copy()
img_slice[np.greater_equal(img_slice[:,:,0],100) &
img[0:300,360:480,:] = img_slice
plt.imshow(img)

Arange:
This method returns evenly spaced values between the start and end limit. The values are generated based on the step value and by default, the step value is 1.

Linspace:
This method returns the given number of evenly spaced values between the given start and end values.

Ones:
Returns an array of given shape filled with ones.

Eye:
Returns an identity matrix for the given shape.

#generating 5 random numbers from a uniform distribution
np.random.rand(5)

#random integer values low=1, high=10, number of values=5
np.random.randint(1,10, size=5)
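A minimal sketch of these array-creation routines together (values are illustrative):

import numpy as np

print(np.arange(1, 10, 2))      # start=1, end limit=10 (exclusive), step=2 -> [1 3 5 7 9]
print(np.linspace(0, 1, 5))     # 5 evenly spaced values between 0 and 1
print(np.ones((2, 3)))          # 2x3 array filled with ones
print(np.eye(3))                # 3x3 identity matrix
print(np.random.rand(5))        # 5 random numbers from a uniform distribution
print(np.random.randint(1, 10, size=5))   # 5 random integers, low=1, high=10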
PANDAS:
Pandas is an open-source library for real world data analysis in python. It is built on top of Numpy. Using Pandas, data can be loaded, explored and prepared for analysis.
Some of the operations supported by pandas for data manipulation are as follows:
Grouping operations
Sorting operations
Masking operations
Merging operations
Concatenating operations
Visualizing data
Performing operations on the data

To get started with Pandas, Numpy and Pandas need to be imported. In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices, as sketched below.
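A minimal sketch of the imports and of labelled Pandas objects (the names and values are illustrative):

import numpy as np
import pandas as pd

# a Series is a labelled one-dimensional array
marks = pd.Series([78, 47, 96], index=['Alex', 'Bob', 'Cathy'])

# a DataFrame labels both rows and columns
df = pd.DataFrame({'marks': [78, 47, 96],
                   'grade': ['B', 'D', 'A']},
                  index=['Alex', 'Bob', 'Cathy'])
print(df)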
The data is converted into a Dataframe using the pd.DataFrame command. We can also arrange data that is not well organised to begin with.

From an existing file
In most real-world scenarios, the data is in different file formats like csv, xlsx, json etc. Pandas supports reading the data from these files. Below is an example of creating a DataFrame from a json file.
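A minimal sketch, assuming a hypothetical local file named 'cars.json' in place of the file used in the demo:

import pandas as pd

# read_json parses the json file into a DataFrame
df = pd.read_json('cars.json')
print(df.head())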
Reading the data from XYZ custom cars
Pandas can read a variety of files. For example, a table of fixed width formatted lines (read_fwf), excel sheets (read_excel), and csv files (read_csv).

import pandas as pd
import numpy as np
df = pd.read_csv('auto_mpg.csv')
print(df)

axis keyword:
One of the important parameters used while performing operations on DataFrames is 'axis'. Axis takes two values: 0 and 1.
axis = 0 represents row specific operations.
axis = 1 represents column specific operations.

HEAD & TAIL FUNCTIONS:
To view the first few rows or the last few rows, the functions that can be used are df.head() and df.tail() respectively. If the number of rows to be viewed is not passed, then the head and tail functions provide five rows by default. Example for head is given below.
X = df.head()
print(X)

DESCRIBE:
The describe function can be used to generate a quick summary of data statistics. It provides the mean, max, min and standard deviation values for the data.
X = df.describe()
INFO
To know the datatypes and the number of non-null values in the respective columns, the info() function can be used.
df.info()
The rows with null horsepower values are then dropped (a sketch is given below); after dropping them, it can be observed that the number of rows has been reduced to 392.
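A minimal sketch of one way to drop those rows, assuming the column is named 'horsepower':

# drop rows where 'horsepower' is null; 392 rows remain in the course data
df = df.dropna(subset=['horsepower'])
print(df.shape)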
Example: selecting the value in the first column and 2nd row. To select a subset of columns, the column names can be passed as a list.
Note: While retrieving records using loc, the upper range of the slice is inclusive. A small sketch of these selections is given below.
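A minimal sketch of these selection patterns (the column names follow the auto-mpg data used above and should be treated as assumptions):

# value in the 2nd row of the first column
print(df.iloc[1, 0])

# subset of columns by passing the names as a list
print(df.loc[:, ['mpg', 'horsepower']].head())

# with loc, the upper range of a slice is inclusive (labels 0..4 -> 5 rows)
print(df.loc[0:4, 'mpg'])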
REMOVING:
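A minimal sketch of removing rows and columns with df.drop, assuming that is the operation this section covers (the chosen labels are illustrative):

# remove a column
df_no_origin = df.drop(columns=['origin'])

# remove rows by index label
df_small = df.drop(index=[0, 1])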
Race car – Cars specifically designed for race tracks – Low weight, High acceleration

Their experienced engineers and mechanics have come up with the following parameters for these categories.

df.loc[(df['mpg'] > 29) & (df['horsepower'] < 93.5) & (df['weight'] < 2500)]
Output: This returns 83 rows x 9 columns of the Dataframe (83 cars).
# Muscle cars
# Displacement > 262, Horsepower > 126, Weight in range [2800, 3600]
df.loc[(df['displacement'] > 262) & (df['horsepower'] > 126) & (df['weight'] >= 2800) & (df['weight'] <= 3600)]
Output: It returns an 11 x 9 table (11 cars).

Race cars and SUVs are classified from the entire dataframe using the df.loc function.

MASKING OPERATION: The masking operation replaces values where the condition is True.
Syntax:
DataFrame.mask(cond, other = nan, inplace = False, axis = None)
cond – Where cond is False, keep the original value. Where True, replace with the corresponding value from other.
other – Entries where cond is True are replaced with the corresponding value from other.
inplace – Whether to perform the operation in place on the data.
axis – alignment axis.

The teacher does not want to reveal the marks of students who have failed. The condition is that if a student has scored marks >= 33, then they have passed, otherwise failed. The marks of failed students are therefore masked (Figure 8 Masking Code).
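The masking code itself appears as Figure 8 in the course material; a minimal sketch of the idea, using hypothetical marks:

import pandas as pd

marks = pd.Series([45, 28, 76, 30, 89])

# replace marks where the condition (marks < 33) is True, i.e. hide failed students
masked = marks.mask(marks < 33)
print(masked)   # failed students' marks become NaN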
DataFrame.agg(func, axis = 0)
func – Function to use for aggregating the data. If a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.
axis – If 0 or 'index': apply the function to each column. If 1 or 'columns': apply the function to each row.

GROUPING:
XYZ custom cars want to know the number of cars manufactured in each year. This would require a grouping operation. Pandas supports a group by feature to group our data for aggregate operations.
DataFrame.groupby(by = column_name, axis, sort)

The teacher wants to combine the marks of these students.
Solution: Using concatenation to combine the marks.

One of the engineers suggests checking the mean, minimum and maximum horsepower based on the number of cylinders and model year. For such a requirement, the 'agg' function can be combined with the groupby function as shown below:
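A minimal sketch of the grouping and aggregation described above, together with the concatenation solution for the teacher's marks; the column names ('model year', 'cylinders', 'horsepower') and the marks values are assumptions:

import pandas as pd

# number of cars manufactured in each year
print(df.groupby('model year')['mpg'].count())

# mean, min and max horsepower by number of cylinders and model year
print(df.groupby(['cylinders', 'model year'])['horsepower'].agg(['mean', 'min', 'max']))

# combining two hypothetical sets of marks with concatenation
marks_a = pd.Series([78, 47, 96])
marks_b = pd.Series([55, 34])
all_marks = pd.concat([marks_a, marks_b], ignore_index=True)
print(all_marks)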
Pivot
Syntax:
ax.plot(x, y) #ax represents axes
x = data on the horizontal axis
y = data on the vertical axis
Though it seems like plot 2 is embedded in plot 1 due to the placement of the axes, plot 1 and plot 2 are completely different plots and can only be accessed through their respective axes.

A subplot is created with two different plots aligned in 1 row and 2 columns. In this plot, it is seen that the y_label of the second plot is overlapping with the first plot. To avoid this, 'fig.tight_layout()' must be added.
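A minimal sketch of a 1 x 2 subplot where fig.tight_layout() prevents the second plot's y_label from overlapping the first (the data is illustrative):

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)

fig, (ax1, ax2) = plt.subplots(1, 2)   # 1 row, 2 columns
ax1.plot(x, np.sin(x))
ax1.set_ylabel('sin(x)')
ax2.plot(x, np.cos(x))
ax2.set_ylabel('cos(x)')

fig.tight_layout()   # avoids the overlapping y_label
plt.show()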
TYPES OF PLOTS:
Box plot
Scatter plot
Bar chart
Histogram
Pie chart
Line chart
BOX PLOT:
A boxplot gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on the five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum").
First, let us plot the average mileage 'mpg' from the data using a boxplot.
Syntax:
ax.boxplot(data) #ax represents axes

There is no data for city mileage, but city mileage is 25% less than the average mileage, i.e., 'mpg'. Next is to process the data for city mileage: a new column 'city_mileage' is created. Then, the distribution of the average mileage and the city mileage has to be compared.
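A minimal sketch of both steps, assuming the 'mpg' column from the auto-mpg data and the 25% rule stated above:

import matplotlib.pyplot as plt

# derive city mileage: 25% less than the average mileage 'mpg'
df['city_mileage'] = df['mpg'] * 0.75

fig, ax = plt.subplots()
ax.boxplot([df['mpg'], df['city_mileage']])   # compare the two distributions
ax.set_xticklabels(['mpg', 'city_mileage'])
plt.show()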
SCATTER PLOT:
A scatter plot uses dots or markers to
represent a value on the axes. The scatter
plot is one of the simplest plots which can
accept both quantitative and qualitative
values, with a wide variety of applications
in primitive data analysis.
Several meaningful insights can be drawn from a scatter plot. For example, identifying the type of correlation between variables before diving deeper into predictions.
Syntax:
ax.scatter(x, y, marker) #ax represents axes
x = data on the horizontal axis
y = data on the vertical axis
marker = shape of data points (example: 'o' for circle)

Visualizing the correlation between mileage and horsepower based on the origin of the cars:
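A minimal sketch of that scatter plot, assuming 'mpg', 'horsepower' and 'origin' columns (origin values are treated as simple labels here):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for origin, group in df.groupby('origin'):
    ax.scatter(group['horsepower'], group['mpg'], marker='o', label=str(origin))
ax.set_xlabel('horsepower')
ax.set_ylabel('mpg')
ax.legend(title='origin')
plt.show()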
From the histogram, it is observed that most of the cars have horsepower values ranging between 70 and 110.

PIE CHART:
The details on the origin of the cars and their numbers can be presented to the stakeholders visually for their easy understanding. Let us visualize the data using a pie chart as follows:
Syntax:
ax.pie(x, labels) #ax represents axes
x = wedge size, a one-dimensional array
labels = sequence of strings providing the labels for the wedges
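A minimal sketch of the pie chart, assuming the 'origin' column holds the region of each car:

import matplotlib.pyplot as plt

origin_counts = df['origin'].value_counts()

fig, ax = plt.subplots()
ax.pie(origin_counts, labels=origin_counts.index)   # x = wedge sizes, labels = wedge labels
plt.show()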
A figure can be saved as an image in the local system using Matplotlib. This will help document the plots easily. The 'fig.savefig()' function is used for this functionality. Let us save the figure from the previous example as follows:

# To save a figure --
fig.savefig("multiple-axes-plots.jpg", dpi = 200)

Here, we have saved the image as "multiple-axes-plots.jpg". 'dpi' – Dots Per Inch – indicates the resolution of the image; the higher the number, the higher the resolution.

While using plt.plot() to create a plot, 'plt.savefig()' can be used to save the figure as an image.
#when graph plotted using plt
plt.savefig("filename.jpg", dpi = 200)

In addition to color and linestyle, the width of the line can be customized to get a unique plot. Below is a small example of how linestyle, width, and color can be used.
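A minimal sketch of customising linestyle, width and colour (the data and styling choices are illustrative):

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), linestyle='--', linewidth=2.5, color='green')
plt.show()

# the figure can then be saved as described above
fig.savefig("custom-line-plot.jpg", dpi=200)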
Several other things, such as adding text and annotations, can also be done in Matplotlib line graphs, which can be seen in the ppt:
https://infyspringboard.onwingspan.com/web/en/viewer/web-module/lex_auth_013331489980080128217_shared?collectionId=lex_auth_01360970789049139215&collectionType=Learning%20Path&pathId=lex_auth_01333063698060902494_shared,lex_auth_013331379525869568215_shared

Seaborn
Seaborn is a statistical data visualization library in python. It is integrated to work with Pandas DataFrames with a straightforward approach. Seaborn extends the plotting capabilities of Matplotlib and provides a high-level interface to generate attractive plots that are visually appealing.

Plotly
Plotly is another data visualization library that is used to generate highly interactive plots.
SCIKIT-LEARN:
Scikit-learn (also referred to as sklearn) is a python library widely used for machine learning. It is characterized by a clean, uniform and streamlined API.
Machine Learning (ML) is a branch of artificial intelligence that aims at building systems that can learn from data, identify patterns and make decisions with minimal human intervention.

PROBLEM: Engineers at XYZ custom cars now want to create a machine learning model that can predict the mpg of any car that comes to their garage. MPG refers to Miles Per Gallon. A linear regression model is to be made for this problem. The different stages to be followed in ML model building are shown below:

STEP 2: DATA PREPROCESSING
Data Properties:
df.info()
Output:
A simple approach is to remove the rows with null values as shown below. The 'mpg' column is the target, as it is the value to be predicted. Since the origin feature is a categorical variable, the get_dummies function from Pandas can be used to encode it as shown below:
X = pd.get_dummies(X)
X
From the below image, it can be observed that the categorical variable 'origin' has been encoded with 0s and 1s.
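A minimal sketch of the preprocessing and model-building flow described here; the column names follow the auto-mpg data used above, and the train/test split and model choice follow the linear-regression plan stated in the problem rather than code shown in the course extract:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('auto_mpg.csv')
df = df.dropna()                      # remove rows with null values

y = df['mpg']                         # mpg is the target (to be predicted)
X = df.drop(columns=['mpg'])          # remaining columns are the features
X = pd.get_dummies(X)                 # one-hot encode categorical columns such as 'origin'

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))    # R^2 score on the held-out data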
Visualising the data helps to select the right type of graph and to infer information like outliers, correlation of variables, redundant features etc.

Generally, data can be of two types:
Qualitative/Categorical Data: Data that deals with characteristics and descriptions. It is further categorised as:
1. Binary: Data that is dichotomous. For example, True/False, Yes/No, 1/0 etc.
2. Nominal: Data with no ordering or ranking. For example, different colours, blood groups, nationality etc.
3. Ordinal: Data with specific order or ranking. For example, height (short, medium, tall).
Quantitative/Numerical Data: Data that deals with numerical values and measurable quantities.

Outliers:
Outliers are the extreme values present in the dataset. They affect the properties of data like mean and variance which are used in model building. Hence, they may impact the accuracy of the model.
So, the question that arises is, how to know if a value is an outlier? And how to deal with such values? Let us find out.
Quartiles
Quartiles divide the number of data points into four equal-sized groups, or quarters.
Following are the steps to find quartiles:
1. Sort the dataset in ascending order.
2. Find the median of the sorted dataset (the median divides the dataset into two halves – Quartile 2 or Q2).
3. Repeat step 2 with the first and second half of the data (this gives Q1 and Q3, dividing the dataset into four equal parts).
With the help of quartiles, a value called Inter-Quartile Range (IQR) can be calculated using the formula:
IQR = Q3 - Q1

Inter-Quartile Range
Inter-Quartile Range, also called mid-spread, H-spread, or IQR, indicates where most of the data is lying. As IQR is calculated using the median, the outlying values do not affect it. A formula is used to calculate the upper limit and lower limit of this range. Any data point lying outside these limits is an outlier.
Upper Limit: Q3 + (1.5 * IQR)
Lower Limit: Q1 – (1.5 * IQR)

Identifying Patterns in Data
Visualising the tuples as scatter plots can be useful to spot gaps in the values and hence identify the data points crucial to the dataset. It can help draw decisive inferences about the type of predictor or classifier to be used.
A line chart is used to analyse historic variations and trends in data.
A bar chart is a graph with rectangular bars that compares different categories.
A histogram represents data as rectangular bars. Unlike the bar chart, it is used for continuous data, to obtain the frequency distribution of a given data.
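A minimal sketch of flagging outliers with the IQR rule, applied to the 'horsepower' column as an assumed example:

q1 = df['horsepower'].quantile(0.25)
q3 = df['horsepower'].quantile(0.75)
iqr = q3 - q1

upper_limit = q3 + (1.5 * iqr)
lower_limit = q1 - (1.5 * iqr)

# rows whose horsepower lies outside the limits are treated as outliers
outliers = df[(df['horsepower'] > upper_limit) | (df['horsepower'] < lower_limit)]
print(outliers.shape[0])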
A dist plot, or distribution plot, depicts the variation in a data distribution. It represents the overall distribution of continuous data variables.

Heat map
In a heat map, similar values are depicted by the same colours, and the colours vary based on the intensity of the results. One example of a heat map is finding the correlation between the variables in a dataset, as depicted in the figure below. Another example, from 2020, represents the count of the spread on a given date: a deeper shade corresponds to a higher value, while a lighter shade marks the safe regions.

Word cloud
The significance of the words is depicted by the font, font size and colour of the text in the cluster. Words with greater significance and occurrence are depicted in a bigger and bolder font towards the central location of the cluster, and other latent words occupy peripheral places with smaller fonts and faded colours. The most insignificant words, stop words and irrelevant information are eliminated from the cluster while plotting it.
A word cloud finds its usage more in Natural Language Processing.
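A minimal sketch of a correlation heat map with Seaborn, assuming df holds the numeric auto-mpg columns:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes('number').corr()   # pairwise correlation of the numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()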