UNIT I INTRODUCTION
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities - Types
of data - Exploring structure of data - Data quality and remediation - Data Pre-processing
What is Artificial Intelligence?
Definition 1: Intelligence is the ability to learn from the environment and change our behavior based on the inputs we receive.
Definition 2: The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
Applications: Voice and Speech recognition, Face recognition, Customer service (Virtual
agents) and Computer vision, Recommendation engines
HUMAN LEARNING
To do a task in a proper way, we need to have prior information on one or more things related to the task. Also, as we keep learning more, or in other words acquiring more information, the efficiency in doing the tasks keeps improving.
For example, with more knowledge, the ability to do homework with fewer mistakes increases.
In the same way, information from past rocket launches helps in taking the right precautions and makes rocket launches more successful.
In all phases of life of a human being, there is an element of guided learning. This learning is imparted by someone purely because he/she has already gathered the knowledge by virtue of his/her experience in that field. So guided learning is the process of gaining information from a person having sufficient knowledge due to past experience.
Example 1: In school, a baby starts with basic familiarization with alphabets and digits. Then the baby learns how to form words from the alphabets and numbers from the digits. Slowly, more complex learning happens in the form of sentences, paragraphs, complex mathematics, science, etc. The baby is able to learn all these things from his teacher, who already has knowledge of these areas.
Example 2: Engineering students get skilled in one of the disciplines like civil, computer science, electrical, mechanical, etc. Medical students learn about anatomy, physiology, pharmacology, etc. There are some experts, in general the teachers, in the respective field who have in-depth subject matter knowledge and who help the students in learning these skills.
An essential part of learning also happens with the knowledge which has been imparted by a teacher or mentor at some point of time in some other form/context.
In this method, there is NO direct learning. Some past information, shared in a different context, is used as learning to make decisions.
Example 1: A baby can group together all objects of the same colour even if his parents have not specifically taught him to do so. He is able to do so because at some point of time or other his parents have told him which colour is blue, which is red, which is green, etc.
Example 2: A grown-up kid can select the one odd word from a set of words because it is a verb while the other words are all nouns. He can do this because of his ability to label the words as verbs or nouns, taught by his English teacher long back.
Not all things are taught by others. A lot of things need to be learnt only from mistakes made in
the past.
We tend to form a check list on things that we should do, and things that we should not do,
based on our experiences.
Machine Learning VS Human Learning:
ML works with the following aspects:
data,
association,
equations,
predictions,
decision trees and
memory.
Human learning is similar in kind, but differs when it comes to emotions and memories. It works with:
data,
association,
equations and
emotions,
short-term memory and
long-term memory.
Definition 1: Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its
accuracy.
Explanation
1. Data Input
a. During the machine learning process, knowledge is fed in the form of input data. The
vast pool of knowledge is available from the data input.
b. However, the data cannot be used in the original shape and form.
2. Abstraction
a. Machine will perform knowledge abstraction based on the input data. This is called a model – it is the summarized knowledge representation of the raw data.
b. The model may be in any one of the following forms –
i. Computational blocks like if/else rules
ii. Mathematical equations
iii. Specific data structures like trees or graphs
iv. Logical groupings of similar observations
Note: The choice of the model used to solve a specific learning problem is a human task. Several aspects need to be considered while choosing the model.
Once the model is chosen, the next task is to fit the model based on the input data.
The process of fitting the model based on the input data is known as training.
Also, the input data based on which the model is being finalized is known as
training data.
3. Generalization
In this step, the trained model is applied to new, unseen data (test data). Two problems may arise:
1. The trained model is aligned too closely with the training data, hence it may not portray the actual trend.
2. The test data possess certain characteristics apparently unknown to the training data.
Hence, a precise approach to decision making will not work. So, an approximate or heuristic approach, much like gut-feeling-based decision-making in human beings, has to be adopted. This approach has the risk of not making a correct decision.
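The gap between performance on training data and test data is what generalization is about. Below is a minimal sketch of this idea, assuming scikit-learn is available; the dataset and model choice are illustrative, not prescribed by this text.

```python
# Sketch: comparing performance on training data vs. unseen test data.
# Assumes scikit-learn is installed; dataset and model are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back some data (test data) that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()   # the chosen model
model.fit(X_train, y_train)        # training: fitting the model to the training data

# A large gap between these two scores signals poor generalization (overfitting).
print("Accuracy on training data:", model.score(X_train, y_train))
print("Accuracy on test data:   ", model.score(X_test, y_test))
```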
How do we define a well-posed learning problem that can be solved using Machine
Learning?
For defining a new problem, which can be solved using machine learning, a simple framework, given
below, can be used. This framework also helps in deciding whether the problem is a right candidate to
be solved using machine learning. The framework involves answering three questions:
Step 1: What is the problem?
Describe the problem informally and formally and list assumptions and similar problems.
Step 2: Why does the problem need to be solved?
List the motivation for solving the problem, the benefits that the solution will provide and how
the solution will be used.
Step 3: How would I solve the problem?
Describe how the problem would be solved manually to flush out domain knowledge.
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on enabling
computers to learn from data without being explicitly programmed. It involves creating algorithms that
can analyze patterns in data, identify relationships, and make predictions or decisions. ML systems
improve their performance and accuracy over time as they are exposed to more data and experience.
What is Machine Learning?
Machine Learning is a concept which allows the machine to learn from examples and experience, and
that too without being explicitly programmed. So instead of you writing the code, what you do is you
feed data to the generic algorithm, and the algorithm/ machine builds the logic based on the given
data.
Simply put, machine learning finds patterns in data and uses those patterns to predict the future. It allows us to discover patterns in existing data and to create and make use of a model that identifies those patterns in new data. It has gone mainstream.
Why Machine Learning Strategy?
Machine learning is the foundation of countless important applications, including web search, email
anti-spam, speech recognition, product recommendations, and more. I assume that you or your team
is working on a machine learning application, and that you want to make rapid progress.
Why is Machine Learning so Popular Currently?
• Plenty of data.
• Lots of computer power.
• An effective machine learning algorithm.
As we provide it with more and more examples, it is able to learn better, so that it can undertake the task and yield more accurate output.
1. Supervised Learning
Supervised learning builds a predictive model: we have labeled data, and the model learns the mapping from inputs to known outputs.
The main types of supervised learning problems include regression and classification problems.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given below:
Classification
Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output variable
is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
Classification is a type of supervised learning where a target feature, which is of type
categorical, is predicted for test data based on the information imparted by training data. The
target categorical feature is known as class.
Examples of Typical classification problems
a) Image classification
b) Prediction of disease
c) Win–loss prediction of games
d) Prediction of natural calamity like earthquake, flood, etc.
e) Recognition of handwriting
Common classification algorithms include Naïve Bayes, Decision Tree, and K-Nearest Neighbour algorithms.
b) Regression
Regression algorithms are used to solve regression problems in which the output variable is a continuous numeric value, such as price, salary, or temperature. Some real-world examples are market trend analysis and weather prediction.
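A minimal sketch contrasting the two problem types, assuming scikit-learn; the tiny data set (hours studied/slept versus pass-fail and exam score) is invented for illustration:

```python
# Sketch: the same features viewed as a classification task (categorical target)
# and a regression task (numeric target). Assumes scikit-learn; data is made up.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Features: [hours studied, hours slept]
X = [[2, 8], [4, 7], [6, 6], [8, 5], [10, 4]]

# Classification: predict a category ("pass"/"fail")
y_class = ["fail", "fail", "pass", "pass", "pass"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[5, 6]]))   # predicted category for a new student

# Regression: predict a continuous value (exam score)
y_reg = [35.0, 50.0, 65.0, 80.0, 95.0]
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[5, 6]]))   # a numeric score (continuous output)
```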
Disadvantages:
o These algorithms are not able to solve complex tasks.
o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past data labelled for disease conditions. With such a process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using the
same, such as voice-activated passwords, voice commands, etc.
Supervised Learning Use Case
Facial Recognition is one of the most popular applications of Supervised Learning and more
specifically –
Artificial Neural Networks.
Convolutional Neural Networks (CNNs) are a type of ANN used for identifying the faces of people. These models are able to draw features from the image through various filters. Finally, if there is a high similarity score between the input image and an image in the database, a positive match is provided.
Baidu, China’s premier search engine company, has been investing in facial recognition.
While it has already installed facial recognition systems in its security systems, it is
now extending this technology to the major airports of China. Baidu will provide the airports
with facial recognition technology that will provide access to the ground crew and the staff.
Therefore, the passengers do not have to wait in long queues for flight check-in when they can
simply board their flight by scanning their faces.
2. Unsupervised Learning
In the case of an unsupervised learning algorithm, the data is not explicitly labeled
into different classes, that is, there are no labels. The model is able to learn from the data
by finding implicit patterns.
Unsupervised Learning algorithms identify the data based on their densities, structures,
similar segments, and other similar features. Unsupervised Learning Algorithms are
based on Hebbian Learning.
Cluster analysis is one of the most widely used techniques in unsupervised learning.
Unsupervised learning builds a descriptive model.
The main types of unsupervised learning algorithms include Clustering algorithms and Association rule learning algorithms.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
Clustering
Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way
to group the objects into a cluster such that the objects with the most similarities remain in one group
and have fewer or no similarities with the objects of other groups. An example of the clustering
algorithm is grouping the customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
Principal Component Analysis
Independent Component Analysis
(Strictly speaking, PCA and ICA are dimensionality-reduction techniques rather than clustering algorithms, but they are commonly used alongside clustering.)
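A minimal clustering sketch using K-Means from scikit-learn (an assumed dependency); the two customer features and the cluster count are invented for illustration:

```python
# Sketch: grouping customers by purchasing behaviour with K-Means.
# Assumes scikit-learn; features and cluster count are illustrative.
from sklearn.cluster import KMeans

# Features per customer: [annual spend, number of visits]
X = [[200, 4], [180, 5], [900, 40], [950, 38], [210, 6], [880, 42]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the centre of each discovered group
```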
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting relations among
variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it can
generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage
mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
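As a sketch of association rule learning, the snippet below uses the Apriori implementation from the third-party mlxtend library (an assumed dependency, installable with pip install mlxtend); the baskets are toy data:

```python
# Sketch: Market Basket analysis with Apriori via mlxtend (assumed installed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is a basket, each column an item.
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 0]],
    columns=["bread", "butter", "milk"],
).astype(bool)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)  # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```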
Advantages and Disadvantages of Unsupervised Learning Algorithm
Advantages:
o These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output in advance.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.
Applications of Unsupervised Learning
o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright infringement through document network analysis of text data in scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-
commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each user
located at a particular location.
Unsupervised Learning Use Case
One of the most popular unsupervised learning techniques is clustering. Using clustering, businesses
are able to capture potential customer segments for selling their products.
Sales companies are able to identify customer segments that are most likely to use their services.
Companies can evaluate the customer segments and then decide to sell their product to
maximize the profits.
One such company performing brand marketing analytics using Machine Learning is an Israel-based startup, Optimove. The goal of this company is to ingest and process customer data in order to make it accessible to marketers.
They take it one step further by providing smart insights to the marketing team, allowing them to
reap the maximum profit out of their product marketing.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground between
Supervised (With Labelled training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabeled data. Labels are costly to obtain, so for corporate purposes an organization may have only a few labelled examples. This setting differs from both supervised and unsupervised learning, which are defined by the complete presence or absence of labels.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only labelled data like in
supervised learning.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
o The algorithms are simple and easy to understand.
o It is highly efficient.
o It is used to solve the drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:
o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement Learning covers a broader area of Artificial Intelligence, allowing machines to interact with their dynamic environment in order to reach their goals. With this, machines and software agents are able to evaluate the ideal behavior in a specific context. The agent receives reward feedback from the environment for each action it takes.
With the help of this reward feedback, agents are able to learn the behavior and improve it in the longer run. This simple feedback reward is known as a reinforcement signal.
The agent in the environment is required to take actions based on its current state. This type of learning is different from Supervised Learning in the sense that, in supervised learning, the training data has the output mapping provided, so the model is capable of learning the correct answer. In reinforcement learning, there is no answer key provided to the agent when it has to perform a particular task. With no training dataset, the agent learns from its own experience.
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour would occur again by adding something. It enhances the
strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour would occur
again by avoiding the negative condition.
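To make the reward-feedback loop concrete, here is a minimal Q-learning sketch in plain Python. The 5-state corridor environment, rewards, and hyperparameters are all invented for illustration:

```python
# Minimal Q-learning sketch on a toy 5-state corridor: the agent starts at
# state 0 and earns a reward only upon reaching state 4. All numbers here
# are illustrative assumptions, not a published configuration.
import random

n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.3    # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != 4:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = 0 if Q[state][0] >= Q[state][1] else 1
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0   # the reinforcement signal
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # after training, the "right" action has the higher value in every state
```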
Reinforcement Learning Use Case
Google’s Active Query Answering (AQA) system makes use of reinforcement learning. It reformulates the questions asked by the user.
For example, if you ask the AQA bot the question – “What is the birth date of Nikola Tesla” then the
bot would reformulate it into different questions like “What is the birth year of Nikola Tesla”,
“When was Tesla born?” and “When is Tesla’s birthday”.
This process of reformulation utilized the traditional sequence-to-sequence (seq2seq) model, but Google has integrated reinforcement learning into its system to better interact with the query-based environment.
This is a deviation from the traditional seq2seq model in that the tasks are carried out using reinforcement learning and policy gradient methods. That is, for a given question q0, we want to obtain the best possible answer a*.
The goal is to maximize the reward: a* = argmax_a R(a | q0).
Real-world Use cases of Reinforcement Learning
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to attain super-human performance. Some popular game-playing systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed that how to
use RL in computer to automatically learn and schedule resources to wait for different jobs
in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI and
Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help
of Reinforcement Learning by Salesforce company.
Disadvantages
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken the
results.
Differences between SUPERVISED vs UNSUPERVISED vs REINFORCEMENT
Types:
- Supervised: 2 types – 1. classification, 2. regression
- Unsupervised: 2 types – 1. clustering, 2. association
- Reinforcement: no such types

Complexity:
- Supervised: simple to understand
- Unsupervised: more difficult to understand than supervised learning
- Reinforcement: most complex to understand and apply

Standard algorithms:
- Supervised: 1. Naïve Bayes, 2. K-nearest neighbour (KNN), 3. Decision tree, 4. Linear regression, 5. Logistic regression, 6. Support vector machine (SVM), etc.
- Unsupervised: 1. K-means, 2. Principal Component Analysis (PCA), 3. Self-organizing map (SOM), 4. Apriori algorithm, 5. DBSCAN, etc.
- Reinforcement: 1. Q-learning, 2. SARSA

Practical applications:
- Supervised: 1. Handwriting recognition, 2. Stock market prediction, 3. Disease prediction, 4. Fraud detection, etc.
- Unsupervised: 1. Market basket analysis, 2. Recommender systems, 3. Customer segmentation, etc.
- Reinforcement: 1. Self-driving cars, 2. Intelligent robots, 3. AlphaGo Zero (the latest version of DeepMind's AI system playing Go)
LANGUAGES/TOOLS
There are several programming languages and tools commonly used for machine learning (ML).
Here are some of the most popular ones:
Programming Languages:
1. Python:
The increasing adoption of machine learning worldwide is a major factor contributing to its growing popularity. Surveys suggest that around 69% of machine learning engineers use Python, and it has become the favourite choice for data analytics, data science, machine learning, and AI.
Python is the preferred programming language for machine learning for some of the giants in the IT world, including Google, Instagram, Facebook, Dropbox, Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable leader and by far the best language for machine learning today.
6. OpenNN
OpenNN is developed by the artificial intelligence company Artelnics. OpenNN is an advanced analytics software library written in C++ that implements neural networks, one of the most successful machine learning methods. It is high in performance: the execution speed and memory allocation of this library stand out.
7. Amazon SageMaker
Amazon SageMaker is a fully managed service that allows data researchers and developers to
build, train and implement machine learning models on any scale quickly and easily. Amazon
SageMaker supports open-source web application Jupyter notebooks that help developers share live
code. These notebooks include drivers, packages, and libraries for common deep learning platforms
and frameworks for SageMaker users. Amazon SageMaker optionally encrypts models both at rest and in transit through the AWS Key Management Service, and API requests are performed over a secure sockets layer (SSL) connection. SageMaker also stores code in volumes that are protected and encrypted by security groups.
Issues:
Although machine learning is being used in every industry and helps organizations make more
informed and data-driven choices that are more effective than classical methodologies, it still has so
many problems that cannot be ignored. Here are some common issues in Machine Learning that
professionals face while building ML skills and creating an application from scratch.
1. Inadequate training data
Noisy data – It is responsible for inaccurate predictions that affect the decision as well as accuracy in classification tasks.
Incorrect data- It is also responsible for faulty programming and results obtained in machine
learning models. Hence, incorrect data may affect the accuracy of the results also.
Generalizing of output data- Sometimes, it is also found that generalizing output data
becomes complex, which results in comparatively poor future actions.
2. Poor quality of data
As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be considered as a
major common problem while processing machine learning algorithms.
3. Non-representative training data
To make sure our training model generalizes well, we have to ensure that the sample training data is representative of the new cases to which we need to generalize. The training data must cover all cases that have already occurred as well as those that are occurring now.
Further, if we use non-representative training data in the model, it results in less accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized cases and provides accurate decisions. If there is too little training data, there will be sampling noise in the model; such a set is called a non-representative training set. It won't be accurate in predictions and will be biased against one class or group. Hence, we should use representative data in training to protect against bias and to make accurate predictions without any drift.
8. Customer Segmentation
Customer segmentation is also an important issue while developing a machine learning algorithm: it is necessary to identify the customers who pay attention to the recommendations shown by the model and those who do not even check them. Hence, an algorithm is necessary to recognize customer behaviour and trigger a relevant recommendation for the user based on past experience.
DATA SET
A data set is a collection of related information or records. The information may be on some
entity or some subject area.
Each row of a data set is called a record. Each data set also has multiple attributes, each of
which gives information on a specific characteristic.
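As a minimal illustration of records and attributes, the sketch below builds a tiny data set with pandas; the student data is invented:

```python
# Sketch: a small data set as rows (records) and columns (attributes).
# Assumes pandas; the values are invented for illustration.
import pandas as pd

data = pd.DataFrame({
    "Name":  ["Asha", "Ravi", "Meena"],   # attribute 1
    "Age":   [21, 22, 20],                # attribute 2
    "Marks": [88.5, 76.0, 91.2],          # attribute 3
})

print(data)          # each printed row is one record
print(data.columns)  # the attributes of the data set
```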
TYPES OF DATA
1. Qualitative data
2. Quantitative data
Qualitative data provides information about the quality of an object or information which cannot
be measured.
a. Nominal data
Nominal data is one which has no numeric value, but a named value. It is used for assigning
named values to attributes. Nominal values cannot be quantified.
b. Ordinal data
is data for which a natural order exists among the values, but the exact difference between values cannot be measured. Ex: customer satisfaction ratings (low, medium, high), grades.
Quantitative data relates to information about the quantity of an object and hence can be measured numerically.
a. Interval data
is numeric data for which not only the order is known, but the exact difference between values is also known. Ex: Celsius temperature, date and time.
b. Ratio data
represents numeric data for which the exact value can be measured and an absolute zero exists. Ex: height, weight, age, salary, etc.
Other types of attributes
Attributes can also be categorized into 2 types based on a number of values that can be
assigned.
a. Discrete attributes
b. Continuous attributes
Discrete attributes can assume only a finite or countably infinite set of values. Ex: number of cars, number of students in a class, etc.
Continuous attributes can assume any possible value which is a real number. Ex: length, height, weight, price, etc.
NOTE: In general, nominal and ordinal attributes are discrete. On the other hand, interval and
ratio attributes are continuous
TYPES OF DATA
Data has to be converted into numeric representation so that the machines are able to learn
the patterns within data. Understanding the different data types can help us identify correct
preprocessing techniques & convert the data appropriately. Furthermore, it will also enable us to
perform the best visualizations and uncover hidden knowledge.
Structured data
This type of data is usually composed of numbers or words. They are usually stored in
Relational databases and can be easily searched using SQL queries.
Numeric/Quantitative data
As the name suggests, this encompasses data that can be represented through numbers. Examples of
such data are sales price, metric quantities such as temperature, time, length, height & weight of a
person, and so on. Numeric data is further divided into two categories, namely discrete and
continuous.
Discrete
In this category, the data takes on discrete values or whole numbers i.e numbers without decimal
points. Examples are the number of houses in a city, the number of consumers in a grocery store
over the last month, the number of Instagram followers that you have, and so on.
E.G.: – No. Of Cars You Have, No. Of Marbles In Containers, Students In A Class, Etc.
Continuous
In this category, the data takes on real values, i.e. numbers with decimal points. Examples of continuous numeric data are house prices in the city, sale prices of grocery store items, Instagram earnings that you received, and so on.
E.g.: height, weight, time, area, distance, measurement of rainfall, etc.
Categorical/Qualitative data
As the name suggests, this encompasses data that can be represented through words. It usually defines
groups or categories & is therefore known as categorical data. Some examples are the names of all
items in a supermarket, movie ratings (good, average, bad), country of birth of individuals & so
on.
Ordinal
This type of data has an inherent ordering present within the categories. For instance, if you
consider movie ratings with good, average & bad as the different categories, good has a higher
ranking than average which is higher than bad. This needs to be taken into account while
converting this type of data into numbers so that the models can learn this ranking as well. There is a
fixed, finite number of categories/groups. Examples will be movie ratings, student grades,
Employee performance, and so on.
E.g., Likert rating scale, shirt sizes, ranks, grades, etc.
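A minimal sketch of converting ordinal categories to numbers while preserving their ranking, using pandas ordered categoricals (the rating values are illustrative):

```python
# Sketch: encoding ordinal data so a model can learn its ranking.
# Assumes pandas; the movie ratings are invented.
import pandas as pd

ratings = pd.Series(["bad", "good", "average", "good", "bad"])

# Declare the inherent ordering explicitly: bad < average < good
ordered = pd.Categorical(ratings, categories=["bad", "average", "good"], ordered=True)

print(ordered.codes)  # numeric codes respecting the order: [0 2 1 2 0]
```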
1. Linear Data Structures
Stacks:
Stacks are based on the concept of LIFO (Last In First Out) or FILO (First In Last Out). It is used for binary classification in deep learning. Although stacks are easy to learn and implement in ML models, having a good grasp of them can help in many computer science aspects such as parsing grammar, etc.
Stacks enable the undo and redo buttons on your computer, as they function similar to a stack of blog content: there is no sense in adding a blog at the bottom of the stack, and we can only check the most recent one that has been added. Addition and removal occur at the top of the stack.
Linked List:
A linked list is a collection of several separately allocated nodes. In other words, it is a collection of data elements in which each node consists of a value and a pointer that points to the next node in the list.
In a linked list, insertion and deletion are constant-time operations and are very efficient, but accessing a value is slow and often requires scanning. So, a linked list is preferable to a dynamic array where shifting of elements would otherwise be required. Although insertion of an element can be done at the head, middle or tail position, it is relatively costly. However, linked lists are easy to splice together and split apart. Also, the list can be converted to a fixed-length array for fast access.
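A minimal linked-list sketch in plain Python (the Node class and values are illustrative):

```python
# Minimal linked-list sketch: each node holds a value and a pointer
# to the next node (None marks the end of the list).
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Build the list 1 -> 2 -> 3 by inserting at the head (a constant-time operation).
head = Node(3)
head = Node(2, head)
head = Node(1, head)

# Accessing a value requires scanning from the head (linear time).
node = head
while node is not None:
    print(node.value)
    node = node.next
```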
Queue:
A Queue is based on "FIFO" (first in, first out). It is useful for modeling queuing scenarios in real-time programs, such as people waiting in line to withdraw cash at a bank. Hence, the queue is significant in a program where multiple lists of code need to be processed.
The queue data structure can be used to record the split time of a car in F1 racing.
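Below is a minimal sketch of both linear structures in Python: a plain list as a stack and collections.deque as a queue (the stored strings are illustrative):

```python
# Sketch: a stack (LIFO) with a plain list and a queue (FIFO) with
# collections.deque, which offers constant-time operations at both ends.
from collections import deque

stack = []
stack.append("edit 1")   # push: addition occurs at the top
stack.append("edit 2")
print(stack.pop())       # pop: removal also occurs at the top -> "edit 2" (undo)

queue = deque()
queue.append("car A split time")   # enqueue at the back
queue.append("car B split time")
print(queue.popleft())             # dequeue from the front -> "car A split time"
```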
2. Non-linear Data Structures
As the name suggests, in non-linear data structures, elements are not arranged in any sequence. All the elements are arranged and linked with each other in a hierarchical manner, where one element can be linked with one or more elements.
1) Trees – Binary Tree:
The concept of a binary tree is very similar to that of a linked list; the only difference lies in the nodes and their pointers. In a linked list, each node contains a data value with a pointer that points to the next node in the list, whereas in a binary tree, each node has two pointers to subsequent nodes instead of just one.
Binary search trees keep their nodes sorted, so insertion and deletion operations can be done with O(log N) average time complexity. Similar to the linked list, a binary tree can also be converted to an array on the basis of tree sorting.
In a binary search tree there are child and parent nodes, where the value of the left child node is always less than the value of the parent node, while the value of the right child node is always more than the parent node. Hence, in such a tree structure, data sorting happens automatically, which makes insertion and deletion efficient.
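A minimal binary search tree sketch in Python illustrating the left-smaller/right-larger rule and tree sorting (the inserted values are illustrative):

```python
# Sketch: binary search tree insertion, keeping smaller values in the
# left subtree and larger values in the right subtree.
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None    # values less than self.value
        self.right = None   # values greater than self.value

def insert(root, value):
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

root = None
for v in [50, 30, 70, 20, 40]:
    root = insert(root, v)

# An in-order traversal visits the values in sorted order (tree sort).
def in_order(node):
    return [] if node is None else in_order(node.left) + [node.value] + in_order(node.right)

print(in_order(root))   # [20, 30, 40, 50, 70]
```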
2) Graphs
A graph data structure is also very useful in machine learning for link prediction. A graph consists of nodes connected by edges, which may be directed (ordered pairs of nodes) or undirected (unordered pairs). Hence, you must have good exposure to the graph data structure for machine learning and deep learning.
3) Maps
Maps are a popular data structure in the programming world, mostly useful for minimizing the run-time of algorithms and for fast searching of data. A map stores data in the form of (key, value) pairs, where the key must be unique; however, the value can be duplicated. Each key corresponds to, or maps to, a value; hence it is named a Map.
In different programming languages, core libraries have built-in maps or, rather, HashMaps with
different names for each implementation.
In Java: Maps
In Python: Dictionaries
C++: hash_map, unordered_map, etc.
Python Dictionaries are very useful in machine learning and data science as various functions and
algorithms return the dictionary as an output. Dictionaries are also much used for implementing
sparse matrices, which is very common in Machine Learning.
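As a sketch of the sparse-matrix use mentioned above, a dictionary can store only the non-zero entries (the coordinates and values here are illustrative):

```python
# Sketch: a sparse matrix stored as a dictionary (dictionary-of-keys),
# keeping only the non-zero entries; everything else defaults to 0.
sparse = {(0, 2): 5.0, (3, 1): 2.5}   # (row, column) -> value

def get(matrix, row, col):
    # Missing keys are implicit zeros, so storage stays proportional
    # to the number of non-zero entries.
    return matrix.get((row, col), 0.0)

print(get(sparse, 0, 2))   # 5.0 (stored)
print(get(sparse, 1, 1))   # 0.0 (implicit zero, not stored)
```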
4) Heap data structure:
Heap is a hierarchically ordered data structure. A heap is very similar to a tree, but it involves vertical ordering instead of horizontal ordering. Ordering in a heap is applied along the hierarchy but not across it: in a max-heap, the value of the parent node is always more than that of its child nodes, whether on the left or the right side.
Here, the insertion and deletion operations are performed on the basis of promotion. It means, firstly,
the element is inserted at the highest available position. After that, it gets compared with its parent
and promoted until it reaches the correct ranking position. Most of the heaps data structures can be
stored in an array along with the relationships between the elements.
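Python's built-in heapq module is a convenient way to experiment with heaps; note that it maintains the mirror-image min-heap ordering (each parent smaller than its children). A minimal sketch:

```python
# Sketch: heapq maintains a min-heap inside a plain list, so the
# smallest element is always available at index 0.
import heapq

heap = []
for value in [42, 7, 19, 3]:
    heapq.heappush(heap, value)   # insert, then "promote" to the correct position

print(heap[0])              # 3 -- the smallest element sits at the top
print(heapq.heappop(heap))  # removal also preserves the heap ordering
```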
EXPLORING STRUCTURE OF DATA
Exploring the structure of data means identifying the OUTLIERS, DATA SPREAD, MISSING values, etc.
Outliers are values which are unusually high or low compared to the other values.
1) Exploring Numerical data
To understand the nature of numeric variables, we can apply the measures of central
tendency of data, i.e. mean and median.
In statistics, measures of central tendency help us understand the central point of a
set of data.
Mean
Mean is the sum of all observations divided by the number of observations.
Ex: Mean of the set of observations 21, 89, 34, 67, and 96 is (21 + 89 + 34 + 67 + 96) / 5 = 307 / 5 = 61.4.
Median is the value of the element appearing in the middle of an ordered list of data elements.
Ex: Median of a set of observations – 21, 89, 34, 67, and 96 is calculated as below.
The ordered list would be -> 21, 34, 67, 89, and 96. Since there are 5 data elements, the
3rd element in the ordered list is considered as the median. Hence, the median value
of this set of data is 67.
MEAN is likely to get shifted drastically even due to the presence of a small number of
outliers
b) Understanding the data spread
Variance of the data is used to measure the extent of dispersion of the data, or to find out how much the different values of the data are spread out.
Standard deviation is the square root of the variance; both measure how far the values spread out around the mean.
For Attribute 1:
Mean = (44 + 46 + 48 + 45 + 47) / 5 = 46
For Attribute 2:
Mean = (34 + 46 + 59 + 39 + 52) / 5 = 46
So it is quite clear from the measure that attribute 1 values are quite concentrated
around the mean while attribute 2 values are extremely spread out.
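The claim can be verified with Python's built-in statistics module; pvariance and pstdev below treat the five values as the whole population (a convention chosen here for illustration):

```python
# Verifying the spread of the two attributes with the statistics module.
import statistics

attr_1 = [44, 46, 48, 45, 47]
attr_2 = [34, 46, 59, 39, 52]

print(statistics.mean(attr_1), statistics.mean(attr_2))          # 46 and 46
print(statistics.pvariance(attr_1), statistics.pvariance(attr_2))  # 2 vs 79.6
print(statistics.pstdev(attr_1), statistics.pstdev(attr_2))      # ~1.41 vs ~8.92
```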
ii) Measuring the position of different data values
Any data set has five summary values –
Minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum
When the data values of an attribute are arranged in an increasing order, we have
seen earlier that median gives the central data value, which divides the entire data
set into two halves.
Similarly, if the first half of the data is divided into two halves so that each half
consists of one quarter of the data set, then that median of the first half is known as
first quartile or Q1.
In the same way, if the second half of the data is divided into two halves, then that
median of the second half is known as third quartile or Q3.
Quantiles: Refer to specific points in a data set which divide the data set into equal parts or equally sized quantities.
Quartile: When the entire ordered data set is split into 4 equal parts, each part is known as a quartile.
Percentile: When the data set is split into 100 equal parts, each part is known as a percentile.
Plotting and exploring numerical data
Following two techniques are used to plot and explore the numerical data
i) Box plots
The central rectangle or the box spans from first to third quartile (i.e. Q1 to Q3), thus
giving the inter-quartile range (IQR).
IQR = Q3-Q1
ii) Histogram
a) The focus of a histogram is to plot ranges of data values (acting as ‘bins’); the number of data elements in each range will depend on the data distribution. Based on that, the size of each bar corresponding to the different ranges will vary.
b) In contrast, the focus of a box plot is to divide the data elements in a data set into four equal portions, such that each portion contains an equal number of data elements.
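A minimal plotting sketch assuming matplotlib is installed; the sample values (including the deliberately extreme 150) are invented:

```python
# Sketch: plotting the same numeric attribute as a box plot and a histogram.
import matplotlib.pyplot as plt

values = [21, 34, 67, 89, 96, 45, 52, 58, 61, 70, 150]  # 150 is a likely outlier

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(values)        # box spans Q1..Q3 (the IQR); points beyond whiskers flag outliers
ax1.set_title("Box plot")
ax2.hist(values, bins=5)   # bar heights count the data elements per range (bin)
ax2.set_title("Histogram")
plt.show()
```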
MODE is the only measure we can apply to explore categorical data.
Mode is also a statistical measure of central tendency.
The mode of a data set is the data value which appears most often.
a) Scatter plot
For example, in a data set there are two attributes – attr_1 and attr_2. We want to understand the relationship between the two attributes, i.e. with a change in the value of one attribute, say attr_1, how does the value of the other attribute, say attr_2, change.
We can draw a scatter plot, with attr_1 mapped to x-axis and attr_2 mapped in y-axis.
So, every point in the plot will have value of attr_1 in the x-coordinate and value of
attr_2 in the y coordinate.
b) Cross-tab
A cross-tab has a matrix format that presents a summarized view of the bivariate frequency distribution. A cross-tab, very much like a scatter plot, helps to understand how the data values of one attribute change with a change in the data values of another attribute.
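A minimal cross-tab sketch using pandas.crosstab (the two toy attributes are invented):

```python
# Sketch: a cross-tab of two categorical attributes, giving a matrix
# view of the bivariate frequency distribution. Assumes pandas.
import pandas as pd

df = pd.DataFrame({
    "attr_1": ["red", "red", "blue", "blue", "red"],
    "attr_2": ["small", "large", "small", "small", "large"],
})

# Rows are values of attr_1, columns are values of attr_2,
# and each cell counts how often the pair occurs together.
print(pd.crosstab(df["attr_1"], df["attr_2"]))
```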
1) DATA QUALITY
The success of machine learning depends largely on the quality of data. Data of the right quality helps to achieve better prediction accuracy in the case of supervised learning.
The data may not reflect normal or regular quality due to incorrect selection of the sample set. If we select a sample set of sales transactions from a festive period and try to use that data to predict future sales, the prediction will be far from the actual scenario, just because the sample set has been selected at the wrong time.
In many cases, a person or group of persons are responsible for the collection of
data to be used in a learning activity.
In this manual process, there is the possibility of wrongly recording data either
in terms of value (say 20.67 is wrongly recorded as 206.7 or 2.067) or in terms
of a unit of measurement (say cm. is wrongly recorded as m. or mm.).
This may result in data elements which have abnormally high or low value
from other elements. Such records are termed as outliers.
2) DATA REMEDIATION
The issues in data quality need to be remediated if the right amount of efficiency is to be achieved in the learning activity.
(1) For Incorrect sample set selection – Remedy is “Proper sampling technique”
(2) For missing data values – In a data set, one or more data elements may have missing values in multiple records. There are multiple strategies to handle missing values of data elements. Some of those strategies are:
If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be imputed in place of the missing value.
Ex:The weight of a Russian student having age 12 years and
height 5 ft. is missing. Then the weight of any other Russian
student having age close to 12 years and height close to 5 ft. can
be assigned.
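A minimal imputation sketch in the spirit of the example above, assuming pandas; the student records are invented, and the similar-record rule is simplified to a same-group mean:

```python
# Sketch: imputing a missing weight from similar records (same-group mean).
import pandas as pd

students = pd.DataFrame({
    "country": ["Russia", "Russia", "Russia", "India"],
    "age":     [12, 12, 13, 12],
    "weight":  [41.0, None, 43.0, 39.0],   # one missing value
})

# Fill each missing weight with the mean weight of students from the same country.
students["weight"] = students.groupby("country")["weight"].transform(
    lambda w: w.fillna(w.mean())
)
print(students)
```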
DATA PRE-PROCESSING
Two techniques are applied as part of data pre-processing
1) Dimensionality reduction
2) Feature subset selection
1) Dimensionality reduction
High-dimensional data sets need a high amount of computational space and time. At the same time, not all features are useful – some may be irrelevant or redundant.
As part of this process, a few features which are irrelevant will be eliminated.
A feature is considered as irrelevant if it plays an insignificant role (or
contributes almost no information) in classifying or grouping together a set of
data instances.
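A minimal dimensionality-reduction sketch using PCA from scikit-learn (an assumed dependency); the Iris data set stands in for any high-dimensional data:

```python
# Sketch: dimensionality reduction with PCA, keeping 2 derived features
# out of the 4 original ones.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 rows x 4 features
X_reduced = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", X_reduced.shape)    # (150, 4) -> (150, 2)
```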
Categorical features: Features whose explanations or values are taken from a defined set of
possible explanations or values. Categorical values can be colors of a house; types of animals;
months of the year; True/False; positive, negative, neutral, etc. The set of possible categories
that the features can fit into is predetermined.
Numerical features: Features with values that are continuous on a scale, statistical, or integer-
related. Numerical values are represented by whole numbers, fractions, or percentages.
Numerical features can be house prices, word counts in a document, time it takes to travel
somewhere, etc.
Data Preprocessing Steps
Let’s take a look at the established steps you’ll need to go through to make sure
your data is successfully preprocessed.
Discretization: Discretization pools data into smaller intervals. It’s somewhat similar to binning, but usually happens after data has been cleaned. For example, when calculating average daily exercise, rather than using the exact minutes and seconds, you could join together data to fall into 0-15 minutes, 15-30, etc.
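A minimal discretization sketch with pandas.cut, matching the 0-15 / 15-30 minute bins above (the exercise values are invented):

```python
# Sketch: discretizing exact exercise minutes into 15-minute bins.
import pandas as pd

minutes = pd.Series([5, 12, 17, 29, 41])
bins = pd.cut(minutes, bins=[0, 15, 30, 45], labels=["0-15", "15-30", "30-45"])
print(bins)   # each exact value is replaced by its interval label
```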
Concept hierarchy generation: Concept hierarchy generation can add a hierarchy within
and between your features that wasn’t present in the original data. If your analysis contains
wolves and coyotes, for example, you could add the hierarchy for their genus: Canis.
4. Data reduction
The more data you’re working with, the harder it will be to analyze, even after cleaning and
transforming it. Depending on your task at hand, you may actually have more data than you
need. Especially when working with text analysis, much of regular human speech is
superfluous or irrelevant to the needs of the researcher. Data reduction not only makes the
analysis easier and more accurate, but cuts down on data storage.
It will also help identify the most important features to the process at hand.
Attribute selection: Similar to discretization, attribute selection can fit your data into smaller pools. It essentially combines tags or features, so that tags like male/female and professor could be combined into male professor/female professor.
Numerosity reduction: This will help with data storage and transmission. You can use a
regression model, for example, to use only the data and variables that are relevant to your
analysis.
We can use data cleaning to simply remove these rows, as we know the data was improperly
entered or is otherwise corrupted.
Name        | Age | Company
Karen Lynch | 57  | CVS Health
Tim Cook    | 60  | Apple
Short Answers