
UNIT I INTRODUCTION
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities - Types
of data - Exploring structure of data - Data quality and remediation - Data Pre-processing
What is Artificial Intelligence?

Definition 1: As per the Oxford Dictionary – “Ability to learn, understand and think”.

 Intelligence is also the ability to learn from the environment and change our behavior
based on the inputs we receive.

Definition 2: The theory and development of computer systems able to perform tasks
normally requiring human intelligence, such as visual perception, speech recognition,
decision-making, and translation between languages.

 At its simplest form, artificial intelligence is a field which combines computer science and robust datasets to enable problem-solving.

 Artificial intelligence leverages computers and machines to mimic the problem-solving and decision-making capabilities of the human mind.

Applications: Voice and speech recognition, face recognition, customer service (virtual agents), computer vision, and recommendation engines.


Traditional programming vs Machine Learning

Traditional Programming Example: Device for Activity Recognition (SMARTWATCH)

What is (Human) Learning and Why do we need to learn?

Learning is typically referred to as the process of gaining information through observation.

To do a task in a proper way, we need to have prior information on one or more things
related to the task. Also, as we keep learning more, or in other words acquiring more
information, the efficiency in doing the tasks keeps improving.

For example, with more knowledge, the ability to do homework with fewer
mistakes increases.

In the same way, information from past rocket launches helps in taking the right precautions and
makes rocket launches more successful.

Thus, with more learning, tasks can be performed more efficiently.

TYPES OF HUMAN LEARNING


Human Learning happens in one of the three ways –

(1) Learning under expert guidance


(2) Learning guided by knowledge gained from experts
(3) Learning by self

(1) Learning under expert guidance

In all phases of life of a human being, there is an element of guided learning. This learning is
imparted by someone, purely because of the fact that he/she has already gathered the knowledge
by virtue of his/her experience in that field. So guided learning is the process of gaining
information from a person having sufficient knowledge due to the past experience.

Example1: In school, a baby starts with basic familiarization with alphabets and digits. Then the
baby learns how to form words from the alphabets and numbers from the digits. Slowly, more
complex learning happens in the form of sentences, paragraphs, complex mathematics,
science, etc. The baby is able to learn all these things from his teacher, who already has
knowledge of these areas.

Example2: Engineering students get skilled in one of the disciplines like civil, computer
science, electrical, mechanical, etc.; medical students learn about anatomy, physiology,
pharmacology, etc. There are experts, in general the teachers, in the respective fields who
have in-depth subject matter knowledge and who help the students in learning these skills.

(2) Guided by knowledge gained from experts

An essential part of learning also happens with the knowledge which has been imparted by
teacher or mentor at some point of time in some other form/context.

In this method, there is NO direct learning. It is some past information shared on some different
context, which is used as a learning to make decisions.

Example1: A baby can group together all objects of the same colour even if his parents have not
specifically taught him to do so. He is able to do so because at some point of time or other his
parents have told him which colour is blue, which is red, which is green, etc.
Example2: A grown-up kid can select the one odd word from a set of words because it is a verb, the
other words all being nouns. He could do this because of his ability to label the words as verbs or
nouns, taught by his English teacher long back.

(3) Learning by self

In many situations, humans are left to learn on their own.

 A classic example is a baby learning to walk through obstacles. He bumps on to obstacles and falls down multiple times till he learns that whenever there is an obstacle, he needs to cross over it.
 He faces the same challenge while learning to ride a cycle as a kid or drive a car as an adult.

Not all things are taught by others. A lot of things need to be learnt only from mistakes made in
the past.

We tend to form a check list on things that we should do, and things that we should not do,
based on our experiences.
Machine Learning vs Human Learning:
ML works with the following aspects:
 data,
 association,
 equations,
 predictions,
 decision trees,
 memory.
Human learning is similar in many ways, but differs when it comes to emotions and memories. It involves:
 data,
 association,
 equations and
 emotions,
 short-term memory,
 long-term memory.

WHAT IS MACHINE LEARNING

Definition 1: Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its
accuracy.

Definition 2 (Tom Mitchell): A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.

Example1: In the context of image classification,


 E represents the past data with images having labels or assigned classes (for example
whether the image is of a class cat or a class dog or a class elephant etc.),
 T is the task of assigning class to new, unlabelled images and
 P is the performance measure indicated by the percentage of images correctly classified.
Example2: In the context of learning to play checkers,
 E represents the experience of playing the game,
 T is the task of playing checkers and
 P is the performance measure indicated by the percentage of games won by the player.

Type of problems to be solved using Machine Learning


The problems related to
(1) forecast
(2) prediction
(3) analysis of a trend
(4) understanding the different segments or groups of objects, etc.
Type of problems NOT to be solved using Machine Learning
Machine Learning should NOT be applied to
1) tasks in which humans are very effective
2) tasks in which frequent human intervention is needed (Ex: Air traffic control)
3) tasks that are very simple which can be implemented using traditional programming
paradigms (Ex: Price calculator engine)
4) the situations where training data is NOT sufficient.

How do machines learn (Process of Machine Learning)?


The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future decision-making
2. Abstraction (Training the Model): The input data is represented in a broader way through
the underlying algorithm
3. Generalization (Future Decisions/Testing the model for accuracy): The abstracted
representation is generalized to form a framework for making decisions

Explanation
1. Data Input
a. During the machine learning process, knowledge is fed in the form of input data. The
vast pool of knowledge is available from the data input.
b. However, the data cannot be used in the original shape and form.
2. Abstraction
a. The machine will perform knowledge abstraction based on the input data. The result is
called a model – it is the summarized knowledge representation of the raw data.
b. The model may be in any one of the following forms –
i. Computational blocks like if/else rules
ii. Mathematical equations
iii. Specific data structures like trees or graphs
iv. Logical groupings of similar observations

Note: The choice of the model used to solve a specific learning problem is a human
task. Following are some of the aspects to be considered for choosing the model –

i. The type of the problem to be solved


ii. Nature of the input data
iii. Domain of the problem

Once the model is chosen, the next task is to fit the model based on the input data.
The process of fitting the model based on the input data is known as training.
Also, the input data based on which the model is being finalized is known as
training data.
3. Generalization

This is the key part and quite difficult to achieve.


In this, we will apply the model to take decisions on a set of unknown data, usually called test
data.

But, with test data we may encounter two problems –

1. The trained model is aligned with the training data too much, hence may not portray the
actual trend.
2. The test data possess certain characteristics apparently unknown to the training data.

Hence, a precise approach of decision making will not work. So, an approximate or heuristic
approach, much like gut-feeling-based decision-making in human beings, has to be adopted.
This approach has the risk of not making a correct decision.
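To make abstraction (training) and generalization (testing on unknown data) concrete, here is a minimal, hedged sketch in Python using scikit-learn; the synthetic dataset, the logistic regression model, and the 70/30 split are illustrative assumptions and not part of the original notes.

# A minimal sketch of training (abstraction) vs. generalization (testing).
# Assumptions: scikit-learn is available; the synthetic dataset and the
# logistic regression model are illustrative choices only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Past data (experience): features X with known labels y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 30% of the records as unseen "test data"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Abstraction: fit (train) the model on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Generalization: apply the trained model to unknown (test) data
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, model.predict(X_test)))
# A large gap between the two accuracies suggests the model is aligned
# too closely with the training data, i.e. it does not generalize well.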

How do we define a well-posed learning problem that can be solved using Machine
Learning?
For defining a new problem, which can be solved using machine learning, a simple framework, given
below, can be used. This framework also helps in deciding whether the problem is a right candidate to
be solved using machine learning. The framework involves answering three questions:
Step 1: What is the problem?
Describe the problem informally and formally and list assumptions and similar problems.
Step 2: Why does the problem need to be solved?
List the motivation for solving the problem, the benefits that the solution will provide and how
the solution will be used.
Step 3: How would I solve the problem?
Describe how the problem would be solved manually, to flush out domain knowledge.
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on enabling
computers to learn from data without being explicitly programmed. It involves creating algorithms that
can analyze patterns in data, identify relationships, and make predictions or decisions. ML systems
improve their performance and accuracy over time as they are exposed to more data and experience.
What is Machine Learning?

Machine Learning is a concept which allows the machine to learn from examples and experience, and
that too without being explicitly programmed. So instead of you writing the code, what you do is you
feed data to the generic algorithm, and the algorithm/ machine builds the logic based on the given
data.

Simply put, machine learning finds patterns in data and uses those patterns to predict the future. It allows us
to discover patterns in existing data and create and make use of a model that identifies those patterns in new
data. It has gone mainstream.
Why Machine Learning Strategy?

Machine learning is the foundation of countless important applications, including web search, email
anti-spam, speech recognition, product recommendations, and more. I assume that you or your team
is working on a machine learning application, and that you want to make rapid progress.
Why is Machine Learning so Popular Currently?
• Plenty of data.
• Lots of computer power.
• An effective machine learning algorithm.

Top Machine Learning Companies


It is becoming an important part of our everyday life. It is utilized in financial procedures,
medical examinations, logistics, posting, and a variety of other fast-rising industries.
• Google: Neural Networks and Machines
• Tesla: Autopilot
• Amazon: Echo Speaker Alexa
• Apple: Personalized Hey Siri
• TCS: Machine First Delivery Model with Robotics
• Facebook: Chatbot Army etc.
How does Machine Learning Work?
A Machine Learning algorithm is trained using a training data set to create a model. When new input data
is introduced to the ML algorithm, it makes a prediction on the basis of the model. The prediction is
evaluated for accuracy and, if the accuracy is acceptable, the Machine Learning algorithm is deployed. If
the accuracy is not acceptable, the Machine Learning algorithm is trained again and again with an
augmented training data set.
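This train–evaluate–redeploy loop can be sketched in Python; the accuracy threshold and the helper callables (get_training_data, get_more_data, train_model, evaluate) are hypothetical placeholders used only to illustrate the workflow described above.

# Hedged sketch of the ML workflow: train, evaluate, and retrain with more
# data until the accuracy is acceptable. All helper functions are hypothetical.
ACCEPTABLE_ACCURACY = 0.90

def build_acceptable_model(get_training_data, get_more_data, train_model, evaluate):
    data = get_training_data()                 # initial training data set
    while True:
        model = train_model(data)              # create a model from the training set
        accuracy = evaluate(model)             # prediction accuracy on held-out data
        if accuracy >= ACCEPTABLE_ACCURACY:
            return model                       # acceptable: deploy this model
        data = data + get_more_data()          # augment the training data and retrain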

Types of Machine Learning


Machine learning is sub-categorized to three types:

 Supervised Learning – Train Me!

 Unsupervised Learning – I am self-sufficient in learning

 Reinforcement Learning – My life My rules! (Hit & Trial)

Types of Machine Learning


1. Supervised Learning
 Supervised Learning is the most popular paradigm for performing machine learning
operations. It is widely used for data where there is a precise mapping between input-
output data.
 The dataset, in this case, is labeled, meaning that the algorithm identifies the features
explicitly and carries out predictions or classification accordingly.
 As the training period progresses, the algorithm is able to identify the relationships
between the two variables such that we can predict a new outcome.

Supervised learning algorithms are therefore task-oriented.

As we provide it with more and more examples, it is able to learn better, so that it can
undertake the task and yield more accurate output.
 Predictive Model
 we have labeled data
 The main types of supervised learning problems include regression and classification
problems
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given below:
 Classification
 Regression
a) Classification
Classification algorithms are used to solve classification problems in which the output variable
is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
Classification is a type of supervised learning where a target feature, which is of type
categorical, is predicted for test data based on the information imparted by training data. The
target categorical feature is known as class.
Examples of Typical classification problems

a) Image classification
b) Prediction of disease
c) Win–loss prediction of games
d) Prediction of natural calamity like earthquake, flood, etc.
e) Recognition of handwriting

Few Popular ML algorithms which help in solving classification problems

a) Naïve Bayes
b) Decision Tree
c) K-Nearest Neighbour algorithms

Some popular classification algorithms are given below:


 Random Forest Algorithm
 Decision Tree Algorithm
 Logistic Regression Algorithm
 Support Vector Machine Algorithm
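To illustrate a typical classification workflow, here is a small hedged sketch using scikit-learn's decision tree classifier on the built-in Iris dataset; the dataset, model choice and split are assumptions for illustration only, not part of the original notes.

# Minimal classification sketch: predict a categorical class (iris species).
# Assumes scikit-learn is installed; dataset and classifier are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # labelled data (class = species)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)                    # learn rules from labelled examples

y_pred = clf.predict(X_test)                 # assign classes to new, unseen records
print("Classification accuracy:", accuracy_score(y_test, y_pred))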
b) Regression
Regression algorithms are used to solve regression problems in which there is a relationship
between the input variables and a continuous output variable. They are used to predict continuous
output values, such as market trends, weather prediction, etc.
Some popular Regression algorithms are given below:
 Simple Linear Regression Algorithm
 Multivariate Regression Algorithm
 Decision Tree Algorithm
 Lasso Regression
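A corresponding regression sketch, again hedged: a simple linear regression that predicts a continuous value; the synthetic data (roughly y = 3x plus noise) is an assumption made purely for illustration.

# Minimal regression sketch: predict a continuous numeric value.
# The synthetic data is illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))        # one input feature
y = 3.0 * X.ravel() + rng.normal(0, 1, 100)  # continuous target with noise

reg = LinearRegression().fit(X, y)
print("Learned slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("Prediction for x=7:", reg.predict([[7.0]])[0])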
Advantages and Disadvantages of Supervised Learning
Advantages:
o Since supervised learning works with labelled datasets, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:
o These algorithms are not able to solve complex tasks.
o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:

Supervised algorithms are also used in the medical field for diagnosis purposes. It is done by
using medical images and past data labelled with disease conditions. With such a
process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using the
same, such as voice-activated passwords, voice commands, etc.
Supervised Learning Use Case
Facial recognition is one of the most popular applications of Supervised Learning and, more
specifically, of Artificial Neural Networks.
Convolutional Neural Networks (CNNs) are a type of ANN used for identifying the faces of
people. These models are able to draw features from the image through various filters. Finally, if
there is a high similarity score between the input image and an image in the database, a
positive match is provided.
Baidu, China’s premier search engine company has been investing in facial recognition.
While it has already installed facial recognition systems in its security systems, it is
now extending this technology to the major airports of China. Baidu will provide the airports
with facial recognition technology that will provide access to the ground crew and the staff.
Therefore, the passengers do not have to wait in long queues for flight check-in when they can
simply board their flight by scanning their faces.

2. Unsupervised Learning
 In the case of an unsupervised learning algorithm, the data is not explicitly labeled
into different classes, that is, there are no labels. The model is able to learn from the data
by finding implicit patterns.
 Unsupervised Learning algorithms identify the data based on their densities, structures,
similar segments, and other similar features. Unsupervised Learning Algorithms are
based on Hebbian Learning.
 Cluster analysis is one of the most widely used techniques in unsupervised learning.

 Descriptive Model
 The main types of unsupervised learning algorithms include Clustering algorithms
and Association rule learning algorithms.
Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:
 Clustering
 Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way
to group the objects into a cluster such that the objects with the most similarities remain in one group
and have fewer or no similarities with the objects of other groups. An example of the clustering
algorithm is grouping the customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
 Principal Component Analysis
 Independent Component Analysis
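To make clustering concrete, here is a hedged K-Means sketch that groups synthetic, unlabelled points; the blob data, the number of clusters (k = 3) and the features are illustrative assumptions.

# Minimal clustering sketch: group unlabelled points into clusters.
# Assumes scikit-learn; blob data and k=3 are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels are not used

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)               # each point is assigned to a cluster

print("Cluster sizes:", [list(labels).count(c) for c in range(3)])
print("Cluster centres:\n", kmeans.cluster_centers_)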
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting relations among
variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it can
generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage
mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
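As an illustration of association rule learning, here is a hedged market-basket sketch using the Apriori implementation from the third-party mlxtend package; mlxtend is an assumption (it is not part of scikit-learn), and the tiny transaction list, support and confidence thresholds are made up for illustration.

# Minimal association-rule sketch (market basket analysis).
# Assumes the mlxtend package is installed; transactions are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "bread", "butter"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])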
Advantages and Disadvantages of Unsupervised Learning Algorithm
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones because
these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate as the dataset is not
labelled, and algorithms are not trained with the exact output in prior.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.
Applications of Unsupervised Learning
o Network Analysis: Unsupervised learning is used in document network analysis of text data,
e.g. for identifying plagiarism and copyright issues in scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-
commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each user
located at a particular location.
Unsupervised Learning Use Case
One of the most popular unsupervised learning techniques is clustering. Using clustering, businesses
are able to capture potential customer segments for selling their products.
Sales companies are able to identify customer segments that are most likely to use their services.
Companies can evaluate the customer segments and then decide to sell their product to
maximize the profits.
One such company that is performing brand marketing analytics using Machine Learning is an
Israel-based startup, Optimove. The goal of this company is to ingest and
process customer data in order to make it accessible to marketers.
They take it one step further by providing smart insights to the marketing team, allowing them to
reap the maximum profit out of their product marketing.

3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground between
Supervised (With Labelled training data) and Unsupervised learning (with no labelled training data)
algorithms and uses the combination of labelled and unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on data that contains a few labels, it mostly consists of unlabeled data.
Labels are costly to obtain, so for corporate purposes a dataset may have only a few of them. It differs
from supervised and unsupervised learning, which are based purely on the presence or absence of labels.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only labelled data like in
supervised learning.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
o It is simple and easy to understand the algorithm.

o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:
o Iterations results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement Learning covers a broader area of Artificial Intelligence: it allows machines to
interact with their dynamic environment in order to reach their goals. With this, machines and
software agents are able to evaluate the ideal behaviour in a specific context.
With the help of reward feedback, agents are able to learn the behaviour and improve it in the
longer run. This simple feedback reward is known as a reinforcement signal.

The agent in the environment is required to take actions based on its current state. This
type of learning is different from Supervised Learning in the sense that in supervised learning the
training data has the output mapping provided, so that the model is capable of learning the correct answer.
In reinforcement learning, there is no answer key provided to the agent when
it has to perform a particular task. When there is no training dataset, the agent learns from its own
experience.
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the
tendency that the required behaviour would occur again by adding something. It enhances the
strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour would occur
again by avoiding the negative condition.
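A minimal, hedged sketch of tabular Q-learning (one of the standard RL algorithms named later in the comparison); the toy one-dimensional corridor environment, the reward values and the hyper-parameters are all illustrative assumptions, not something described in the original notes.

# Tabular Q-learning sketch on a toy 1-D corridor: states 0..4, actions
# left/right, reward +1 only when the agent reaches the goal state 4.
# Environment, rewards, and hyper-parameters are illustrative assumptions.
import random

n_states, actions = 5, [-1, +1]              # move left or right
Q = [[0.0, 0.0] for _ in range(n_states)]    # Q[state][action index]
alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration

for episode in range(300):
    state = 0
    while state != n_states - 1:             # episode ends at the goal state
        if random.random() < epsilon:
            a = random.randrange(2)          # explore
        else:
            a = Q[state].index(max(Q[state]))  # exploit the best known action
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: reinforce actions that lead to higher future reward
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

print("Learned policy (0=left, 1=right):", [q.index(max(q)) for q in Q])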
Reinforcement Learning Use Case
Google’s Active Query Answering (AQA) system makes use of reinforcement learning. It
reformulates the questions asked by the user.

For example, if you ask the AQA bot the question – “What is the birth date of Nikola Tesla” then the
bot would reformulate it into different questions like “What is the birth year of Nikola Tesla”,
“When was Tesla born?” and “When is Tesla’s birthday”.
This process of reformulation utilized the traditional sequence-to-sequence (seq2seq) model, but Google has
integrated reinforcement learning into its system to better interact with the query-based environment.
This is a deviation from the traditional seq2seq model in that the tasks are carried out using
reinforcement learning and policy gradient methods. That is, for a given question q0, we want
to obtain the best possible answer a*.
The goal is to maximize the reward: a* = argmax_a R(a | q0).
Real-world Use cases of Reinforcement Learning
o Video Games:
RL algorithms are very popular in gaming applications, where they are used to attain super-human
performance. Some popular game-playing systems that use RL algorithms are AlphaGo and AlphaGo
Zero.
o Resource Management:

The "Resource Management with Deep Reinforcement Learning" paper showed that how to
use RL in computer to automatically learn and schedule resources to wait for different jobs
in order to minimize average job slowdown.
o Robotics:

RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI and
Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help
of Reinforcement Learning by Salesforce company.

Advantages and Disadvantages of Reinforcement Learning
Advantages


o It helps in solving complex real-world problems which are difficult to be solved by
general techniques.
o The learning model of RL is similar to the learning of human beings; hence most accurate
results can be found.
o Helps in achieving long term results.

Disadvantage
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken the
results.
Differences between SUPERVISED vs UNSUPERVISED vs REINFORCEMENT

The comparison below is organized parameter by parameter (Supervised / Unsupervised / Reinforcement):

When it is used
 Supervised: Used when we know how to classify the given data, or in other words the classes or labels are available.
 Unsupervised: Used when there is no idea about the class or label of a particular data.
 Reinforcement: Used when there is no idea about the class or label of a particular data; the model has to find patterns in the data.

Type of work to be performed
 Supervised: The model has to predict the output.
 Unsupervised: The model has to find patterns in the data.
 Reinforcement: The model has to do the classification – it will get rewarded if the classification is correct, else get punished.

Model building
 Supervised: Labelled training data is needed. The model is built based on the training data.
 Unsupervised: Any unknown and unlabelled data set is given to the model as input and the records are grouped.
 Reinforcement: The model learns and updates itself through reward/punishment.

Model performance
 Supervised: Performance can be evaluated based on how many misclassifications have been done, using a comparison between predicted and actual values.
 Unsupervised: Difficult to measure whether the model did something useful or interesting. Homogeneity is the only measure.
 Reinforcement: The model is evaluated by means of the reward function after it has had some time to learn.

Types
 Supervised: 2 types – (1) classification, (2) regression.
 Unsupervised: 2 types – (1) clustering, (2) association.
 Reinforcement: No such types.

Complexity
 Supervised: Simple to understand.
 Unsupervised: More difficult to understand and implement than supervised learning.
 Reinforcement: Most complex to understand and apply.

Standard algorithms
 Supervised: Naïve Bayes, K-nearest neighbour (KNN), decision tree, linear regression, logistic regression, support vector machine (SVM), etc.
 Unsupervised: K-means, principal component analysis (PCA), self-organizing map (SOM), Apriori algorithm, DBSCAN, etc.
 Reinforcement: Q-learning, SARSA.

Practical applications
 Supervised: Handwriting recognition, stock market prediction, disease prediction, fraud detection, etc.
 Unsupervised: Market basket analysis, recommender systems, customer segmentation, etc.
 Reinforcement: Self-driving cars, intelligent robots, AlphaGo Zero (the latest version of DeepMind's AI system playing Go).

Problems that CAN NOT be solved using Machine Learning

 Machine learning should not be applied to following tasks -


a) In which humans are very effective or frequent human intervention is needed
Ex: Air traffic control
b) For very simple tasks which can be implemented using traditional programming
paradigms
c) For situations where training data is not sufficient, machine learning can not be used
effectively.
 Machine Learning should be used only when the business process has some lapses.

APPLICATIONS OF MACHINE LEARNING (ML)


Machine learning is transforming various industries by enabling systems to learn from data,
make predictions, and automate complex tasks. Here are key application areas:
1. Healthcare
 Disease Diagnosis: Detecting diseases like cancer, diabetes, and COVID-19 through
medical imaging and patient data analysis.
 Predictive Analytics: Forecasting disease outbreaks and patient outcomes.
 Personalized Medicine: Tailoring treatments based on individual patient data.
 Drug Discovery: Accelerating drug development using predictive models.
2. Finance
 Fraud Detection: Identifying fraudulent transactions in real-time.
 Credit Scoring: Assessing creditworthiness using customer financial data.
 Algorithmic Trading: Optimizing trading strategies based on market data.
 Risk Management: Predicting and mitigating financial risks.
3. E-commerce & Retail
 Recommendation Systems: Suggesting products based on user behavior (e.g., Amazon,
Netflix).
 Customer Segmentation: Grouping customers for targeted marketing.
 Dynamic Pricing: Adjusting prices in real-time based on demand and competition.
 Inventory Management: Optimizing stock levels using demand forecasting.
4. Autonomous Vehicles
 Self-Driving Cars: Enabling autonomous navigation, obstacle detection, and traffic
management.
 Traffic Prediction: Predicting traffic patterns for route optimization.
 ADAS (Advanced Driver Assistance Systems): Enhancing driver safety with features like
lane detection and collision avoidance.
5. Natural Language Processing (NLP)
 Chatbots and Virtual Assistants: Powering systems like Siri, Alexa, and Google
Assistant.
 Sentiment Analysis: Understanding customer opinions from text data.
 Language Translation: Real-time language translation (e.g., Google Translate).
 Text Summarization: Automatically generating concise summaries of long documents.
6. Manufacturing & Industry 4.0
 Predictive Maintenance: Forecasting equipment failures to reduce downtime.
 Quality Control: Automated defect detection using image analysis.
 Supply Chain Optimization: Enhancing logistics and production planning.
7. Education
 Personalized Learning: Adapting educational content to individual learning styles.
 Automated Grading: Evaluating exams and assignments using AI.
 Student Performance Prediction: Identifying students at risk of dropping out.
8. Marketing & Advertising
 Targeted Advertising: Delivering personalized ads based on user behavior.
 Customer Churn Prediction: Identifying users likely to stop using a service.
 A/B Testing: Optimizing marketing strategies by analyzing user responses.
9. Gaming
 Game AI: Creating intelligent, adaptive game characters.
 Procedural Content Generation: Automatically generating game levels and content.
 Player Behavior Analysis: Understanding player patterns to improve
engagement.
10. Cybersecurity
 Intrusion Detection: Identifying unusual network activity indicative of
cyberattacks.
 Malware Detection: Classifying and mitigating malware threats.
 User Authentication: Enhancing security through behavioral biometrics.
11. Energy & Utilities
 Energy Demand Forecasting: Predicting energy consumption for efficient distribution.
 Smart Grids: Optimizing energy usage and distribution.
 Renewable Energy Optimization: Enhancing the efficiency of solar and wind energy
systems.

LANGUAGES/TOOLS
There are several programming languages and tools commonly used for machine learning (ML).
Here are some of the most popular ones:
Programming Languages:
1. Python:
The increasing adoption of machine learning worldwide is a major factor contributing to Python's
growing popularity: surveys suggest that around 69% of machine learning engineers use it, and Python has
become the favourite choice for data analytics, data science, machine learning, and AI.
Python is the preferred programming language for machine learning for some of the
giants in the IT world, including Google, Instagram, Facebook, Dropbox, Netflix, Walt Disney,
YouTube, Uber, Amazon, and Reddit. Python is an indisputable leader and by far the best language
for machine learning today, and here's why:

 Extensive Collection of Libraries and Packages


Python’s in-built libraries and packages provide base-level code so machine learning
engineers don’t have to start writing from scratch. Machine learning requires continuous data
processing and Python has in-built libraries and packages for almost every task.
This helps machine learning engineers reduce development time and improve
productivity when working with complex machine learning applications. The best part of
these libraries and packages is that there is zero learning curve, once you know the basics of
Python programming, you can start using these libraries.
1. Working with textual data – use NLTK, scikit-learn, and NumPy
2. Working with images – use scikit-image and OpenCV
3. Working with audio – use Librosa
4. Implementing deep learning – use TensorFlow, Keras, PyTorch
5. Implementing basic machine learning algorithms – use scikit-learn
6. Want to do scientific computing – use SciPy
7. Want to visualise the data clearly – use Matplotlib and Seaborn
 Code Readability
 Flexibility
2. R Programming Language
R is an open-source programming language, making it a highly cost-effective choice for
machine learning projects of any size. R is an incredible programming language for machine
learning, written by a statistician for statisticians. The R language can also be used by non-
programmers, including data miners, data analysts, and statisticians.
A critical part of a machine learning engineer's day-to-day job is understanding
statistical principles so they can apply these principles to big data. The R programming language is a
fantastic choice when it comes to crunching large numbers and is the preferred choice for
machine learning applications that use a lot of statistical data. With user-friendly IDEs like
RStudio and various tools to draw graphs and manage libraries, R is a must-have programming
language in a machine learning engineer's toolkit.
R has an exhaustive list of packages for machine learning –
1. MICE for dealing with missing values.
2. CARET for working with classification and regression problems.
3. PARTY and rpart for creating data partitions.
4. randomForest for creating decision trees.
5. dplyr and tidyr for data manipulation.
6. ggplot2 for creating beautiful visualisations.
7. R Markdown and Shiny for communicating insights through reports.
3. Java and JavaScript
Though Python and R continue to be the favourites of machine learning enthusiasts, Java is
gaining popularity among machine learning engineers who hail from a Java development
background as they don’t need to learn a new programming language like Python or R to implement
machine learning. Many organizations already have huge Java codebases, and most of the open-
source tools for big data processing like Hadoop, Spark are written in Java.
 Java has plenty of third-party libraries for machine learning. Java-ML is a machine
learning library that provides a collection of machine learning algorithms implemented in
Java.
 Also, you can use the Arbiter Java library for hyperparameter tuning, which is an integral part of
making ML algorithms run effectively, or the Deeplearning4j library,
which supports popular machine learning algorithms like K-Nearest Neighbour
and lets you create neural networks, or Neuroph for neural networks.
 The Java Virtual Machine is one of the best platforms for machine learning, as engineers can write
the same code for multiple platforms. The JVM also helps machine learning engineers create
custom tools at a rapid pace and has various IDEs that help improve overall productivity.
Java works best for speed-critical machine learning projects as it executes fast.
4. Julia
Julia is a high-performance, general-purpose dynamic programming language emerging as a
potential competitor for Python and R with many predominant features exclusively for machine
learning.
Why use Julia for machine learning?
 Julia is particularly designed for implementing basic mathematics and scientific queries that
underlies most machine learning algorithms.
 Julia code is compiled just-in-time (at run time) using the LLVM framework. This gives
machine learning engineers great speed without any handcrafted profiling or
optimization techniques, solving many performance problems.
 Julia code is universally executable: once a machine learning application is written, it can
be called from other languages like Python or R through wrappers like PyCall
or RCall.
 Scalability, as discussed, is crucial for machine learning engineers and Julia makes it easier to
be deployed quickly at large clusters. With powerful tools like TensorFlow, MLBase.jl,
Flux.jl, SciKitlearn.jl, and many others that utilise the scalability provided by Julia, it is an
apt choice for machine learning applications.
 Julia offers support for editors like Emacs and Vim and also IDEs like Visual Studio and Juno.
5. LISP
Created in 1958 by John McCarthy, LISP (List Processing) is the second oldest
programming language still in use and was mainly developed for AI-centric applications. LISP is a
dynamically typed programming language that has influenced the creation of many machine
learning programming languages like Python, Julia, and Java. LISP works on a Read-
Eval-Print Loop (REPL) and has the capability to code, compile, and run code in 30+
programming languages.
The first AI chatbot ELIZA was developed using LISP and even today machine learning
practitioners can use it to create chatbots for eCommerce.

Machine Learning Tools


What is Machine Learning Tool?
Machine learning tools are artificial-intelligence algorithmic applications that provide
systems with the ability to understand and improve without considerable human input. They enable
software, without being explicitly programmed, to predict results more accurately.
Machine learning tools (Caffe2, Scikit-learn, Keras, TensorFlow, etc.) are defined as
artificial-intelligence algorithmic applications that give a system the ability to understand and
improve without being explicitly programmed. These tools are capable of performing complex
processing tasks such as image awareness, speech-to-text, generating natural language,
etc. Tools used for applications with training wheels (where the individual
schedules the input and the desired output) are termed supervised, while
tools without training wheels are unsupervised; the selection of a machine
learning tool entirely depends upon the type of algorithm that needs to be used for the
application.
Machine learning tools consist of:
• Preparation and data collection
• Building models
• Application deployment and training
Local Tools vs Remote Tools
Machine learning tools can be compared as local vs remote. You can download and
install a local tool and use it locally, whereas a remote tool runs on an external server.
1. Local Tools
You can download, install and run a local tool in your local environment.
Characteristics of Local Tools are as follows:
• Adapted for data and algorithms in memory.
• Configuration and parameterization execution control.
• Integrate your systems to satisfy your requirements.
Examples of Local Tools are Shogun, Golearn for Go, etc.
2. Remote Tools
These tools are hosted on a server and called from your local environment. They are
often called Machine Learning as a Service (MLaaS).
Characteristics of Remote Tools are as follows:
• Customized for larger datasets to run on a scale.
• Execute multiple devices, multiple nuclei, and shared storage.
• Simpler interfaces provide less configuration control and parameterizing of the
algorithm.
Examples of these Tools Are Machine Learning in AWS, Prediction in Google, Apache Mahout, etc.
TOOLS FOR MACHINE LEARNING
Given below are the different tools for machine learning:
1. TensorFlow
TensorFlow is a machine learning library from the Google Brain team of Google's AI organization,
released in 2015. TensorFlow allows you to create your own libraries. We can also use C++ and Python
with it because of its flexibility. An important characteristic of this library is that data flow diagrams
are used to represent numerical computations with the help of nodes and edges. Mathematical
operations are represented by nodes, whereas edges denote multidimensional data arrays on which
operations are performed. TensorFlow is used by many famous companies like eBay, Twitter,
Dropbox, etc. It also provides great development tools, especially in Android.
2. Keras
Keras is a deep-learning Python library that can run on top of Theano or TensorFlow.
Francois Chollet, a member of the Google Brain team, developed it to give data scientists the ability
to run machine learning programs fast. Because of using the high-level, understandable interface of
the library and dividing networks into sequences of separate modules, rapid prototyping is possible.
It is more popular because of the user interface, ease of extensibility, and modularity. It runs on CPU
as well as GPU.
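A hedged sketch of how a small Keras model is typically defined, compiled, and trained; the layer sizes, the random dummy data, and the training settings are illustrative assumptions rather than something prescribed by these notes.

# Minimal Keras sketch: define, compile, and train a small neural network.
# Assumes TensorFlow/Keras is installed; data and layer sizes are illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(200, 8)                   # 200 samples, 8 features (dummy data)
y = (X.sum(axis=1) > 4).astype("float32")    # dummy binary labels

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),     # network built from separate modules
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)   # runs on CPU or GPU
print("Training accuracy:", model.evaluate(X, y, verbose=0)[1])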
3. Scikit-learn
Scikit-learn, which was first released in 2007, is an open-source library for machine learning.
Python is the language of this framework, and it includes several machine learning models
such as classification, regression, clustering, and dimensionality reduction.
Scikit-learn is built on three open-source projects – Matplotlib, NumPy, and SciPy – and
provides users with a large number of machine learning algorithms. The library focuses
on data modelling, not on loading, summarizing, or manipulating data.
4. Caffe2
Caffe2 is an updated version of Caffe. It is a lightweight, open-source machine learning tool
developed by Facebook. It has an extensive machine learning library to run complex models. Also,
it supports mobile deployment. This library has C++ and Python API, which allows developers to
prototype first, and optimization can be done later.

5. Apache Spark MLlib


Apache Spark MLlib is a distributed framework for machine learning, developed on top of
Spark Core. Apache Spark MLlib is nine times faster than a disk-based implementation. It is
widely used as an open-source project that focuses on making machine learning easy.
Apache Spark MLlib provides a library for scalable machine learning. MLlib includes algorithms for
regression, collaborative filtering, clustering, decision trees, and higher-level pipeline APIs.

6. OpenNN
OpenNN is developed by the artificial intelligence company Artelnics. OpenNN is an
advanced analytics software library written in C++ that implements neural networks, the most
successful machine learning method. It is high in performance; the execution
speed and memory allocation of this library stand out.
7. Amazon SageMaker
Amazon SageMaker is a fully managed service that allows data researchers and developers to
build, train and implement machine learning models on any scale quickly and easily. Amazon
SageMaker supports open-source web application Jupyter notebooks that help developers share live
code. These notebooks include drivers, packages, and libraries for common deep learning platforms
and frameworks for SageMaker users. Amazon SageMaker optionally encrypts models both at rest
and in transit through AWS Key Management Service, and API requests are performed over a
secure sockets layer (SSL) connection. SageMaker also stores code in volumes that are protected
by security groups and optionally encrypted.
Issues:
Although machine learning is being used in every industry and helps organizations make more
informed and data-driven choices that are more effective than classical methodologies, it still has so
many problems that cannot be ignored. Here are some common issues in Machine Learning that
professionals face while building ML skills and creating an application from scratch.

1. Inadequate Training Data


The major issue that comes up while using machine learning algorithms is the lack of quality as
well as quantity of data. Although data plays a vital role in the processing of machine learning
algorithms, many data scientists claim that inadequate data, noisy data, and unclean data make
the machine learning algorithms extremely exhausting to work with. For example, a simple task requires
thousands of sample data, and an advanced task such as speech or image recognition needs millions
of sample data examples. Further, data quality is also important for the algorithms to work ideally,
but the absence of data quality is also found in Machine Learning applications. Data quality can be
affected by the following factors:

 Noisy Data- It is responsible for an inaccurate prediction that affects the decision as well as
accuracy in classification tasks.
 Incorrect data- It is also responsible for faulty programming and results obtained in machine
learning models. Hence, incorrect data may affect the accuracy of the results also.
 Generalizing of output data- Sometimes, it is also found that generalizing output data
becomes complex, which results in comparatively poor future actions.
2. Poor quality of data
As we have discussed above, data plays a significant role in machine learning, and it must be
of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be considered as a
major common problem while processing machine learning algorithms.
3. Non-representative training data
To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases that we need to generalize to. The training data must
cover all cases that have already occurred as well as those that are occurring.
Further, if we are using non-representative training data in the model, it results in less
accurate predictions. A machine learning model is said to be ideal if it predicts well for generalized
cases and provides accurate decisions. If there is too little training data, there will be sampling
noise in the model – this is called a non-representative training set – and it won't be accurate in
predictions; it will be biased towards one class or group. Hence, we should use representative
data in training to protect against bias and make accurate predictions without any drift.

4. Overfitting and Underfitting


Overfitting:
Overfitting is one of the most common issues faced by Machine Learning engineers and data
scientists. Whenever a machine learning model is trained with a huge amount of data, it starts
capturing noise and inaccurate data into the training data set. It negatively affects the performance of
the model. Let's understand with a simple example where we have a few training data sets such as
1000 mangoes, 1000 apples, 1000 bananas, and 5000 papayas. Then there is a considerable
probability of identifying an apple as a papaya, because we have a massive amount of biased data
in the training data set; hence the prediction gets negatively affected. A main reason behind overfitting
is using highly flexible non-linear methods in machine learning algorithms, as they can build
unrealistic data models. We can reduce overfitting by using simpler linear and parametric algorithms
in the machine learning models.
Methods to reduce overfitting:
 Increase training data in the dataset.
 Reduce model complexity by simplifying the model, e.g. selecting one with fewer parameters.
 Use Ridge regularization and Lasso regularization.
 Use early stopping during the training phase.
 Reduce the noise.
 Reduce the number of attributes in the training data.
 Constrain the model.
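One of the remedies listed above, Ridge regularization, can be sketched as follows; the polynomial toy data, the degree, and the regularization strength (alpha) are illustrative assumptions chosen only to show the effect.

# Hedged sketch: Ridge regularization to reduce overfitting.
# A high-degree polynomial model overfits noisy data; the Ridge penalty
# (alpha) shrinks the coefficients. Data and alpha are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, (40, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for name, est in [("No regularization", LinearRegression()),
                  ("Ridge (alpha=1.0)", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=12), est)
    model.fit(X_tr, y_tr)
    # A higher test R^2 for the regularized model indicates less overfitting
    print(name, "test R^2:", round(model.score(X_te, y_te), 3))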
Underfitting:
Underfitting is just the opposite of overfitting. Whenever a machine learning model is trained
with too little data, it gives incomplete and inaccurate results and destroys
the accuracy of the machine learning model.
Underfitting occurs when our model is too simple to capture the underlying structure of the
data, just like an undersized pant. This generally happens when we have limited data in the data
set and we try to build a linear model with non-linear data. In such scenarios, the model is too
simple, the rules of the machine learning model become too easy to apply to this data
set, and the model starts making wrong predictions as well.
Methods to reduce underfitting:
 Increase model complexity.
 Remove noise from the data.
 Train on more and better features.
 Reduce the constraints.
 Increase the number of epochs to get better results.
5. Monitoring and maintenance
Generalized output data is mandatory for any machine learning model; hence, regular
monitoring and maintenance become compulsory. Different results for different
actions require data changes; hence, editing of code as well as resources for monitoring also
becomes necessary.
6. Getting bad recommendations
A machine learning model operates within a specific context, which can result in bad recommendations
and concept drift in the model. Let's understand with an example: at a specific time a customer is
looking for some gadgets, but the customer's requirement changes over time; still, the machine
learning model shows the same recommendations to the customer even though the customer's
expectation has changed. This incident is called data drift. It generally occurs when new data is
introduced or the interpretation of data changes. However, we can overcome this by regularly
updating and monitoring the data according to the expectations.

7. Lack of skilled resources


Although Machine Learning and Artificial Intelligence are continuously growing in the market,
these industries are still relatively new compared to others. The absence of skilled resources in the form of
manpower is also an issue. Hence, we need manpower having in-depth knowledge of
mathematics, science, and technology for developing and managing scientific substance for
machine learning.

8. Customer Segmentation
Customer segmentation is also an important issue while developing a machine learning algorithm:
we need to identify the customers who act on the recommendations shown by the model and those who don't
even check them. Hence, an algorithm is necessary to recognize customer behaviour and trigger a
relevant recommendation for the user based on past experience.

9. Process Complexity of Machine Learning


The machine learning process is very complex, which is also another major issue faced by machine
learning engineers and data scientists. Machine Learning and Artificial Intelligence are
very new technologies, still in an experimental phase and continuously changing over
time. A majority of the work involves hit-and-trial experiments, so the probability of error is higher than
expected. Further, it also includes analyzing the data, removing data bias, training data, applying
complex mathematical calculations, etc., making the procedure more complicated and quite tedious.

10. Data Bias


Data bias is also a big challenge in Machine Learning. These errors exist when certain
elements of the dataset are heavily weighted or given more importance than others. Biased data leads
to inaccurate results, skewed outcomes, and other analytical errors. However, we can resolve this
error by determining where the data is actually biased in the dataset and then taking the necessary
steps to reduce it.
Methods to remove Data Bias:
Research more for customer segmentation.
Be aware of your general use cases and potential outliers.
Combine inputs from multiple sources to ensure data diversity.
Include bias testing in the development process.
Analyze data regularly and keep tracking errors to resolve them easily.
Review the collected and annotated data.
Use multi-pass annotation such as sentiment analysis, content moderation, and intent recognition.
11. Lack of Explainability
This basically means the outputs cannot be easily comprehended, as the system is programmed in specific
ways to deliver outputs for certain conditions. Hence, a lack of explainability is also found in machine
learning algorithms, which reduces the credibility of the algorithms.
12. Slow implementations and results
This issue is also very commonly seen in machine learning models. However, machine learning
models are highly efficient in producing accurate results but are time-consuming. Slow
programs, excessive requirements, and overloaded data take more time to provide accurate
results than expected. This needs continuous maintenance and monitoring of the model for
delivering accurate results.
13. Irrelevant features
Although machine learning models are intended to give the best possible outcome, if we feed
garbage data as input, then the result will also be garbage. Hence, we should use relevant features in
our training sample. A machine learning model is said to be good if the training data has a good set of
features with few to no irrelevant features.
MACHINE LEARNING ACTIVITIES (OR) MACHINE LEARNING LIFE CYCLE

Following figure depicts the four-step process of machine learning.

Activities in Machine Learning:


Following are the typical preparation activities done once the input data comes into the
machine learning system:

1. Understand the type of data in the given input data set.


2. Explore the data to understand the nature and quality.
3. Explore the relationships amongst the data elements, e.g. interfeature relationship.
4. Find potential issues in data.
5. Do the necessary remediation, e.g. impute missing data values, etc., if needed.
6. Apply pre-processing steps, as necessary.
7. Once the data is prepared for modelling, then the learning tasks start off. As a part of it, do
the following activities:
a) The input data is first divided into two parts – the training data and the test data
(called the holdout). This step is applicable for supervised learning only.
b) Consider different models or learning algorithms for selection.
c) Train the model based on the training data for supervised learning problem and
apply to unknown data.
d) Directly apply the chosen unsupervised model on the input data for unsupervised
learning problem.
8. After the model is selected, trained (for supervised learning), and applied on input data, the
performance of the model is evaluated.
9. Based on options available, specific actions can be taken to improve the performance of the
model, if possible.
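A minimal sketch of steps 7–9 is given below. It assumes the scikit-learn library and a made-up feature matrix X and label vector y (neither the library choice nor the data come from the text; they are only illustrative).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 4)           # 100 records, 4 features (dummy data)
y = np.random.randint(0, 2, 100)     # binary labels (dummy data)

# Step 7a: partition the input data into training data and test data (holdout)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Steps 7b-7c: choose a learning algorithm and train it on the training data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 8: evaluate the performance of the trained model on the unseen test data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

Any other algorithm could be substituted for the decision tree; the overall flow of split, train and evaluate stays the same.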

Step 1 – Preparing to Model
  1. Understand the type of data in the given input data set
  2. Explore the data to understand data quality
  3. Explore the relationship amongst data elements, e.g. inter-feature relationship
  4. Find potential issues in data
  5. Remediate data, if needed
  6. Apply the following pre-processing steps, as necessary:
     a. Dimensionality reduction
     b. Feature subset selection

Step 2 – Learning
  1. Data partitioning / holdout
  2. Model selection
  3. Cross-validation

Step 3 – Performance Evaluation
  1. Examine the model performance, e.g. confusion matrix in case of classification
  2. Visualize performance trade-offs using ROC curves

Step 4 – Performance Improvement
  1. Tuning the model
  2. Ensembling
  3. Bagging
  4. Boosting
BASIC TYPES OF DATA IN MACHINE LEARNING

DATA SET

A data set is a collection of related information or records. The information may be on some
entity or some subject area.
Example

Each row of a data set is called a record. Each data set also has multiple attributes, each of
which gives information on a specific characteristic.
TYPES OF DATA

Data can broadly be divided into following two types:

1. Qualitative data
2. Quantitative data

1. Qualitative data or Categorical data

Qualitative data provides information about the quality of an object or information which cannot
be measured.

Ex: a) Quality of performance of students: GOOD, AVERAGE and POOR


b) Name and Roll Number

Types of qualitative data
Qualitative data can be further subdivided into two types as follows:
a. Nominal data
b. Ordinal data

a. Nominal data

Nominal data is one which has no numeric value, but a named value. It is used for assigning
named values to attributes. Nominal values cannot be quantified.

Ex: i. Blood group : A, B, O, AB, etc.


ii. Nationality: India, American, British, etc.
iii. Gender: Male, Female, Other

b. Ordinal data

Ordinal data = Nominal data + Natural ordering

Ex: i. Customer satisfaction : Very Happy, Happy, Unhappy


ii. Grades: A,B,C, etc.
iii. Hardness of metal: Very Hard, Hard, Soft, etc.

2. Quantitative data or Numeric data

It relates to information about the quantity of an object – hence it can be measured.


Ex: Marks – can be measured using scale of measurement.
Types of quantitative data

Quantitative data can be further subdivided into two types as follows:


a. Interval data
b. Ratio data
a. Interval data

Interval data is numeric data for which not only the order is known, but the exact
difference between values is also known; however, it has no true zero point.
Ex: Celsius temperature, date and time.

b. Ratio data

Ratio data represents numeric data for which the exact value can be measured and which has a
true zero point. Ex: height, weight, age, salary, etc.
Other types of attributes

Attributes can also be categorized into 2 types based on a number of values that can be
assigned.
a. Discrete attributes

Discrete attributes can assume a finite or countably infinite number of values.


Ex: Roll number, street number, rank of students, etc.
i) Binary attribute is a special type of discrete attribute which can have only two values
– male/female, positive/negative, yes/no, etc.

b. Continuous attributes

Continuous attributes can assume any possible value which is a real number. Ex:
length, height, weight, price, etc.
NOTE: In general, nominal and ordinal attributes are discrete. On the other hand, interval and
ratio attributes are continuous
TYPES OF DATA: STRUCTURED AND UNSTRUCTURED
Data has to be converted into numeric representation so that the machines are able to learn
the patterns within data. Understanding the different data types can help us identify correct
preprocessing techniques & convert the data appropriately. Furthermore, it will also enable us to
perform the best visualizations and uncover hidden knowledge.
Structured data
This type of data is usually composed of numbers or words. They are usually stored in
Relational databases and can be easily searched using SQL queries.

Numeric/Quantitative data
As the name suggests, this encompasses data that can be represented through numbers. Examples of
such data are sales price, metric quantities such as temperature, time, length, height & weight of a
person, and so on. Numeric data is further divided into two categories, namely discrete and
continuous.
Discrete
In this category, the data takes on discrete values or whole numbers i.e numbers without decimal
points. Examples are the number of houses in a city, the number of consumers in a grocery store
over the last month, the number of Instagram followers that you have, and so on.
E.g.: number of cars you have, number of marbles in a container, students in a class, etc.
Continuous
In this category, the data takes on real values, i.e. numbers with decimal points. Examples of
continuous numeric data are house prices in the city, sale prices of grocery store items, Instagram
earnings that you received, and so on.
E.g.: height, weight, time, area, distance, measurement of rainfall, etc.

Categorical/Qualitative data
As the name suggests, this encompasses data that can be represented through words. It usually defines
groups or categories & is therefore known as categorical data. Some examples are the names of all
items in a supermarket, movie ratings (good, average, bad), country of birth of individuals & so
on.
Ordinal
This type of data has an inherent ordering present within the categories. For instance, if you
consider movie ratings with good, average & bad as the different categories, good has a higher
ranking than average which is higher than bad. This needs to be taken into account while
converting this type of data into numbers so that the models can learn this ranking as well. There is a
fixed, finite number of categories/groups. Examples will be movie ratings, student grades,
Employee performance, and so on.
E.g.: Likert rating scale, shirt sizes, ranks, grades, etc.

Fig: Rating (good, average, poor) – an example of the ordinal data type


Nominal
This type of data has categories that don’t have any particular order or ranking associated with them.
The total number of categories is usually finite in this type of data as well. Examples will be the
country of birth of individuals, all items in a supermarket, educational degrees of individuals, and so
on.
E.g.: male or female (gender), race, country, etc.

Fig: Gender (female, male) – an example of the nominal data type
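Since categorical values must be converted into numbers before a model can use them, the sketch below (a minimal illustration assuming the pandas library and made-up column names) shows one common way to encode ordinal data with an explicit mapping and nominal data with one-hot encoding.

import pandas as pd

# Hypothetical data frame with one ordinal and one nominal column
df = pd.DataFrame({
    "rating": ["good", "average", "bad", "good"],     # ordinal
    "gender": ["male", "female", "female", "male"],   # nominal
})

# Ordinal data: map categories to numbers that preserve the natural ordering
rating_order = {"bad": 0, "average": 1, "good": 2}
df["rating_encoded"] = df["rating"].map(rating_order)

# Nominal data: one-hot encode, since there is no ordering to preserve
df = pd.get_dummies(df, columns=["gender"])
print(df)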

III. Interval Data Type:

This is numeric data which has a proper order, and the exact difference between values is known.
However, there is no absolute (true) zero – a value of zero does not mean complete absence; it still
carries some value. This is a local scale.
E.g.: temperature measured in degrees Celsius, time, SAT score, credit score, pH, etc.

Fig: Temperature – an example of the interval data type


IV. Ratio Data Type:
This quantitative data type is the same as the interval data type but has an absolute zero.
Here zero means complete absence and the scale starts from zero. This is a global scale.
E.g.: temperature in Kelvin, height, weight, etc.
Fig: Weight – an example of the ratio data type
Unstructured data
This type of data is usually composed of everything else including texts, images, videos,
speech/audio, time series, and so on. They are usually stored in non-relational databases and cannot
be searched easily.
Image
As the name suggests, this type of data usually consists of image files of different types. A
key attribute of this data type is the presence of spatial features/relationships within images that need
to be understood to extract insightful information from the images. Examples are images of all items
in the grocery store, photos of all students in a college, and so on.
Video
This type of data is also pretty self-explanatory as it consists of videos in different
formats. A distinguishing factor with this data type is the relationships between different frames in
the video with respect to positions, movements of objects/people etc. need to be taken into account
to better obtain information from the videos.
Audio/Time series
This type of data has a sequence of ordered data points each having a timestamp. The most
salient feature in this data is the relationship between the different data points such as periodic
patterns, seasonal behaviors, and so on. For example, if you consider the temperature recorded in a
city over last year, looking at the changes over time, we can easily identify that winter months are
colder and summer months are hotter. This type of insight is basic but can only be observed if you
look at the data points with their timestamps.
Text
This type of data has textual data composed of multiple words occurring such that they make
sense as a whole. The most important attribute is understanding the overall context and relationships
between different words within a sentence as well as understanding that each word can have
multiple meanings as well as associations with other words.

EXPLORING STRUCTURE OF DATA


The data structure used for machine learning is quite similar to other software development
fields where it is often used. Machine Learning is a subset of artificial intelligence that includes
various complex algorithms to solve mathematical problems to a great extent. Data structure helps to
build and understand these complex problems. Understanding the data structure also helps you to
build ML models and algorithms in a much more efficient way than other ML professionals.
What is Data Structure?
The data structure is defined as the basic building block of computer programming that helps
us to organize, manage and store data for efficient search and retrieval. In other words, the data
structure is the collection of data type 'values' which are stored and organized in such a way that it
allows for efficient access and modification.

Types of Data Structure


A data structure is an ordered arrangement of data, and it tells the compiler how a programmer intends to use
the data, such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:


The linear data structure is a special type of data structure that helps to organize and manage data in a
specific order where the elements are attached adjacently.
There are mainly 4 types of linear data structure as follows:
Array:
An array is one of the most basic and common data structures used in Machine Learning. It is also
used in linear algebra to solve complex mathematical problems. You will use arrays constantly in
machine learning, whether it's:
 To convert the column of a data frame into a list format in pre-processing analysis
 To order the frequency of words present in datasets.
 Using a list of tokenized words to begin clustering topics.
 In word embedding, by creating multi-dimensional matrices.
An array contains index numbers to represent an element starting from 0. The lowest index is
arr[0] and corresponds to the first element.
Let's take an example of a Python array used in machine learning. Although the Python array is quite
different from arrays in other programming languages, the Python list is more popular as it
offers flexibility in data types and length. If anyone is using Python for ML algorithms,
then it is better to start the journey with arrays.
Python Array method:
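The original example is not reproduced here; the snippet below is a minimal sketch of how a Python list and a NumPy array (the use of NumPy is an assumption, chosen because it is common in ML code) are typically created and indexed.

import numpy as np

marks = [67, 89, 45, 92]        # a Python list used as an array
print(marks[0])                 # lowest index is 0 -> first element (67)

arr = np.array(marks)           # NumPy array for numeric / linear-algebra work
print(arr.mean(), arr.shape)    # vectorised operations and shape information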

Stacks:
Stacks are based on the concept of LIFO (Last in First out) or FILO (First In Last Out). It is used for
binary classification in deep learning. Although stacks are easy to learn and implement in ML
models but having a good grasp can help in many computer science aspects such as parsing
grammar, etc.
Stacks enable the undo and redo buttons on your computer, as they function like a stack of
items – there is no sense in adding an item at the bottom of the stack.
However, we can only check the most recent one that has been added. Addition and removal occur at
the top of the stack.
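A Python list can act as a stack, as in the minimal sketch below (push with append, pop from the top), illustrating the LIFO behaviour described above.

stack = []            # empty stack
stack.append("A")     # push - addition happens at the top
stack.append("B")
stack.append("C")
print(stack.pop())    # pop - removes the most recently added item -> "C"
print(stack[-1])      # peek at the current top -> "B"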
Linked List:
A linked list is the type of collection having several separately allocated nodes. Or in other
words, a list is the type of collection of data elements that consist of a value and pointer that point to
the next node in the list.
In a linked list, insertion and deletion are constant-time operations and are very efficient, but
accessing a value is slow and often requires scanning. So, a linked list is very useful where a
dynamic array would require shifting of elements. Although insertion of an element can be
done at the head, middle or tail position, insertion in the middle is relatively costly. However, linked lists are
easy to splice together and split apart. Also, the list can be converted to a fixed-length array for fast
access.

Queue:
A Queue is defined as the "FIFO" (first in, first out). It is useful to predict a queuing scenario
in real-time programs, such as people waiting in line to withdraw cash in the bank. Hence, the queue
is significant in a program where multiple lists of codes need to be processed.
The queue data structure can be used to record the split time of a car in F1 racing.
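A minimal sketch of FIFO behaviour using Python's collections.deque (one of several possible implementations) is shown below.

from collections import deque

queue = deque()
queue.append("customer-1")   # enqueue at the rear
queue.append("customer-2")
queue.append("customer-3")
print(queue.popleft())       # dequeue from the front -> "customer-1" (first in, first out)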
2. Non-linear Data Structures
As the name suggests, in Non-linear data structures, elements are not arranged in any sequence. All
the elements are arranged and linked with each other in a hierarchal manner, where one element can
be linked with one or more elements.
1) Trees Binary Tree:
The concept of a binary tree is very much similar to a linked list, but the only difference of nodes
and their pointers. In a linked list, each node contains a data value with a pointer that points to the
next node in the list, whereas; in a binary tree, each node has two pointers to subsequent nodes
instead of just one.
Binary search trees keep their elements sorted, so insertion and deletion operations can be done with O(log N) time
complexity for a balanced tree. Similar to the linked list, a binary tree can also be converted to an array on the basis of
tree sorting.

In a binary tree, there are some child and parent nodes shown in the above image. Where the value
of the left child node is always less than the value of the parent node while the value of the right-
side child nodes is always more than the parent node. Hence, in a binary tree structure, data sorting
is done automatically, which makes insertion and deletion efficient.

2) Graphs
A graph data structure is also very much useful in machine learning for link prediction. Graphs are
directed or undirected concepts with nodes and ordered or unordered pairs. Hence, you must have
good exposure to the graph data structure for machine learning and deep learning.
3) Maps
Maps are the popular data structure in the programming world, which are mostly useful for
minimizing the run-time algorithms and fast searching the data. It stores data in the form of (key,
value) pair, where the key must be unique; however, the value can be duplicated. Each key
corresponds to or maps a value; hence it is named a Map.
In different programming languages, core libraries have built-in maps or, rather, HashMaps with
different names for each implementation.
 In Java: Maps
 In Python: Dictionaries
 C++: hash_map, unordered_map, etc.
Python Dictionaries are very useful in machine learning and data science as various functions and
algorithms return the dictionary as an output. Dictionaries are also much used for implementing
sparse matrices, which is very common in Machine Learning.
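The snippet below is a small illustration of a Python dictionary used as a sparse representation – only the non-zero positions of a hypothetical feature vector are stored as (key, value) pairs.

# Sparse vector of length 10 with only two non-zero entries
sparse_vector = {2: 0.5, 7: 1.3}          # key = index, value = non-zero entry

def get(vector, index):
    # Missing keys are treated as zero, which is what makes the storage sparse
    return vector.get(index, 0.0)

print(get(sparse_vector, 7))   # 1.3
print(get(sparse_vector, 4))   # 0.0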
4) Heap data structure:
Heap is a hierarchically ordered data structure. The heap data structure is very similar to a
tree, but it uses vertical ordering instead of horizontal ordering. Ordering in a heap is
applied along the hierarchy but not across it: in a max-heap the value of the parent node is always more than
that of its child nodes, whether on the left or the right side (in a min-heap it is always less).

Here, the insertion and deletion operations are performed on the basis of promotion. It means, firstly,
the element is inserted at the highest available position. After that, it gets compared with its parent
and promoted until it reaches the correct ranking position. Most of the heaps data structures can be
stored in an array along with the relationships between the elements.
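Python's heapq module implements a binary min-heap on top of a list; the sketch below shows insertion and removal of the smallest element (note that heapq is a min-heap, whereas the paragraph above describes the max-heap ordering).

import heapq

heap = []
for value in [42, 7, 19, 3]:
    heapq.heappush(heap, value)    # insert and let the element rise to its correct position

print(heapq.heappop(heap))         # 3 - the smallest element is always at the root
print(heapq.heappop(heap))         # 7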

Dynamic array data structure:


This is one of the most important types of data structure used in linear algebra to solve 1-D, 2-D, 3-D
as well as 4-D arrays for matrix arithmetic. Further, it requires good exposure to Python libraries
such as Python NumPy for programming in deep learning.
How is Data Structure used in Machine Learning?
For a Machine learning professional, apart from knowledge of machine learning skills, it is required
to have mastery of data structure and algorithms.
When we use machine learning for solving a problem, we need to evaluate the model performance,
i.e., which model is fastest and requires the smallest amount of space and resources with accuracy.
Moreover, if a model is built using algorithms, comparing and contrasting two algorithms to
determine the best for the job is crucial to the machine learning professional. For such cases, skills in
data structures become important for ML professionals.
With the knowledge of data structure and algorithms with ML, we can answer the following
questions easily:
 How much memory is required to execute?
 How long will it take to run?
 With the business case on hand, which algorithm will offer the best performance?
EXPLORING STRUCTURE OF DATA

Exploring structure of data is to identify the OUTLIERS, DATA SPREAD, MISSING values, etc.

Outliers are the values which are unusually high or low, compared to the other values.
1) Exploring Numerical data

a) Understanding the central tendency

 To understand the nature of numeric variables, we can apply the measures of central
tendency of data, i.e. mean and median.
 In statistics, measures of central tendency help us understand the central point of a
set of data.

Mean is the sum of all data values divided by the count of data elements.

Ex: Mean of a set of observations – 21, 89, 34, 67, and 96 – is calculated as
Mean = (21 + 89 + 34 + 67 + 96) / 5 = 307 / 5 = 61.4

Median is the value of the element appearing in the middle of an ordered list of data elements.
Ex: Median of a set of observations – 21, 89, 34, 67, and 96 is calculated as below.
The ordered list would be -> 21, 34, 67, 89, and 96. Since there are 5 data elements, the
3rd element in the ordered list is considered as the median. Hence, the median value
of this set of data is 67.
MEAN is likely to get shifted drastically even due to the presence of a small number of
outliers
b) Understanding the data spread

i) Measuring data dispersion


ii) Measuring different data values position
i) Measuring data dispersion

Variance of the data is used to measure the extent of dispersion of data or to find out
how much the different values of a data are spread out.

Standard deviation is the square root of the variance.

A larger value of variance or standard deviation indicates more dispersion in the
data and vice versa.

Example: Consider the data values of two attributes

For Attribute 1: 44, 46, 48, 45, 47

Mean = (44 + 46 + 48 + 45 + 47) / 5 = 46

Median = 46 (after arranging into sorted order 44, 45, 46, 47, 48)

Variance = [(44 - 46)² + (46 - 46)² + (48 - 46)² + (45 - 46)² + (47 - 46)²] / 5 = 10 / 5 = 2

For Attribute 2: 34, 46, 59, 39, 52

Mean = (34 + 46 + 59 + 39 + 52) / 5 = 46

Median = 46 (after arranging into sorted order 34, 39, 46, 52, 59)

Variance = [(34 - 46)² + (46 - 46)² + (59 - 46)² + (39 - 46)² + (52 - 46)²] / 5 = 398 / 5 = 79.6

So it is quite clear from the measure that attribute 1 values are quite concentrated
around the mean while attribute 2 values are extremely spread out.
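The same measures can be computed programmatically; the sketch below (using Python's statistics module, one possible choice) reproduces the numbers worked out above.

import statistics

attr1 = [44, 46, 48, 45, 47]
attr2 = [34, 46, 59, 39, 52]

for name, values in [("Attribute 1", attr1), ("Attribute 2", attr2)]:
    print(name,
          "mean =", statistics.mean(values),
          "median =", statistics.median(values),
          "variance =", statistics.pvariance(values))   # population variance (divide by n)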
ii) Measuring different data values position
Any data set has five values –

Minimum, first quartile (Q1),median (Q2), third quartile (Q3), and maximum

 When the data values of an attribute are arranged in an increasing order, we have
seen earlier that median gives the central data value, which divides the entire data
set into two halves.

 Similarly, if the first half of the data is divided into two halves so that each half
consists of one quarter of the data set, then that median of the first half is known as
first quartile or Q1.

 In the same way, if the second half of the data is divided into two halves, then that
median of the second half is known as third quartile or Q3.

 The overall median is also known as second quartile or Q2.

Quantiles: Refer to specific points in a data set which divide the data set into equal
parts or equally sized quantities.
Quartiles: When the entire (ordered) data set is split into 4 equal parts, the dividing points are
known as quartiles.
Percentiles: When the data set is split into 100 equal parts, the dividing points are known as percentiles.
Plotting and exploring numerical data
Following two techniques are used to plot and explore the numerical data

i) Box plots

 A box plot is an extremely effective mechanism to get a one-shot view and


understand the nature of the data.

Inter-quartile range (IQR)

 The central rectangle or the box spans from first to third quartile (i.e. Q1 to Q3), thus
giving the inter-quartile range (IQR).
IQR = Q3-Q1
ii) Histogram

 Histogram is another plot which helps in effective visualization of numeric


attributes. It helps in understanding the distribution of a numeric data into series of
intervals, also termed as ‘bins’.

Difference between histogram and box plot is

a) The focus of histogram is to plot ranges of data values (acting as ‘bins’), the
number of data elements in each range will depend on the data distribution.
Based on that, the size of each bar corresponding to the different ranges will
vary.

b) The focus of box plot is to divide the data elements in a data set into four equal
portions, such that each portion contains an equal number of data elements.

General Histogram shapes


 The histogram is composed of a number of bars, one bar appearing for each of the
‘bins’. The height of the bar reflects the total count of data elements whose value
falls within the specific bin value, or the frequency.
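A minimal sketch of both plots, assuming matplotlib is available and using a small made-up numeric sample, is given below.

import matplotlib.pyplot as plt

values = [44, 46, 48, 45, 47, 34, 59, 39, 52, 46, 61, 28]   # hypothetical numeric attribute

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(values)          # box spans Q1 to Q3 (the IQR); whiskers/points mark extreme values
ax1.set_title("Box plot")
ax2.hist(values, bins=5)     # 5 bins; bar height = frequency of values falling in each bin
ax2.set_title("Histogram")
plt.show()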

2) Exploring Categorical data

 There are not many options for exploring categorical data.

 MODE is the only measure we can apply to explore categorical data.
Mode is also a statistical measure of central tendency of data.
The mode of a data set is the data value which appears most often.

 Count and proportion (percentage) are two parameters used to measure


categorical data.

 Ex: Count of CAR names


 Ex: Percentage of count of data elements (for CAR names)

Exploring relationship between variables

One more important angle of data exploration is to explore relationship between


attributes. There are multiple plots to enable us explore the relationship between
variables. The basic and most commonly used plot is scatter plot and two-way cross-
tabulations.

a) Scatter plot

 A scatter plot helps in visualizing bivariate relationships, i.e. relationship


between two variables. It is a two dimensional plot in which points or dots are
drawn on coordinates provided by values of the attributes.

 For example, in a data set there are two attributes – attr_1 and attr_2. We
want to understand the relationship between two attributes, i.e. with a change
in value of one attribute, say attr_1, how does the value of the other attribute,
say attr_2, changes.

 We can draw a scatter plot, with attr_1 mapped to x-axis and attr_2 mapped in y-axis.

 So, every point in the plot will have value of attr_1 in the x-coordinate and value of
attr_2 in the y coordinate.

 As in a two-dimensional plot, attr_1 is said to be the independent variable and attr_2


as the dependent variable.


b) Two-way cross-tabulations

Two-way cross-tabulations (also called cross-tab or contingency table) are used to


understand the relationship of two categorical attributes in a concise way.

It has a matrix format that presents a summarized view of the bivariate frequency distribution.

A cross-tab, very much like a scatter plot, helps to understand how much the data values
of one attribute changes with the change in data values of another attribute.
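The sketch below (assuming pandas and matplotlib, with made-up attribute names) shows both a scatter plot of two numeric attributes and a two-way cross-tab of two categorical attributes.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "attr_1": [1, 2, 3, 4, 5],            # independent variable (x-axis)
    "attr_2": [2.1, 3.9, 6.2, 8.1, 9.8],  # dependent variable (y-axis)
    "gender": ["M", "F", "F", "M", "F"],
    "owns_car": ["yes", "no", "yes", "yes", "no"],
})

df.plot.scatter(x="attr_1", y="attr_2")   # bivariate relationship between two numeric attributes
plt.show()

# Two-way cross-tabulation (contingency table) of two categorical attributes
print(pd.crosstab(df["gender"], df["owns_car"]))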

DATA QUALITY AND REMEDIATION

1) DATA QUALITY

Success of machine learning depends largely on the quality of data. A data which has the
right quality helps to achieve better prediction accuracy, in case of supervised learning.

Two types of data quality issues


1. Certain data elements without a value or data with a missing value.
2. Data elements having value surprisingly different from the other elements, which we term as
outliers.
Few factors which lead to the above quality issues
a) Incorrect sample set selection

The data may not reflect the normal or regular scenario due to incorrect selection of the sample set.


Example

If we are selecting a sample set of sales transactions from a festive period and trying to
use that data to predict sales in future. In this case, the prediction will be far apart from
the actual scenario, just because the sample set has been selected in a wrong time.

b) Errors in data collection: resulting in outliers and missing values

 In many cases, a person or group of persons are responsible for the collection of
data to be used in a learning activity.

 In this manual process, there is the possibility of wrongly recording data either
in terms of value (say 20.67 is wrongly recorded as 206.7 or 2.067) or in terms
of a unit of measurement (say cm. is wrongly recorded as m. or mm.).

 This may result in data elements which have abnormally high or low value
from other elements. Such records are termed as outliers.

 It may also happen that the data is not recorded at all.

 In case of a survey conducted to collect data, it is all the more possible as


survey responders may choose not to respond to a certain question. So the data
value for that data element in that responder’s record is missing.

2) DATA REMEDIATION

The issues in data quality need to be remediated, if the right amount of efficiency has
to be achieved in the learning activity.

(1) For Incorrect sample set selection – Remedy is “Proper sampling technique”

(2) Handling outliers


 Outliers are data elements with an abnormally high or low value which may
impact prediction accuracy, especially in regression models.
 One of the following approaches are used to handle outliers
i. Remove outliers: If the number of records which are
outliers is not many, a simple approach may be to remove
them.
ii. Imputation: One other way is to impute the value with mean or
median or mode. The value of the most similar data element may
also be used for imputation.
iii. Capping: For values that lie outside the 1.5 × IQR limits, we can
cap them by replacing those observations below the lower limit
with the value of the 5th percentile and those that lie above the upper
limit with the value of the 95th percentile, as illustrated in the sketch below.
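A minimal sketch of the capping approach (approach iii), assuming pandas and a made-up numeric column, is given below.

import pandas as pd

values = pd.Series([21, 23, 22, 24, 25, 23, 120, 2])   # 120 and 2 look like outliers

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

capped = values.copy()
capped[values < lower] = values.quantile(0.05)   # cap low outliers at the 5th percentile
capped[values > upper] = values.quantile(0.95)   # cap high outliers at the 95th percentile
print(capped)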


(3) Handling missing values

 In a data set, one or more data elements may have missing values in
multiple records.
 There are multiple strategies to handle missing value of data elements.
Some of those strategies are:

i. Eliminate records having a missing value of data elements

 In case the proportion of data elements having missing


values is within a tolerable limit, a simple but effective
approach is to remove the records having such data
elements.
 This is possible if the quantum of data left after removing
the data elements having missing values is sizeable.

ii. Imputing missing values

 Imputation is a method to assign a value to the data elements


having missing values. Mean/mode/median is most frequently
assigned value.
 For quantitative attributes, all missing values are imputed
with the mean, median, or mode of the remaining values
under the same attribute.
 For qualitative attributes, all missing values are imputed by
the mode of all remaining values of the same attribute.

iii. Estimate missing values

 If there are data points similar to the ones with missing attribute
values, then the attribute values from those similar data points
can be planted in place of the missing value.
 Ex:The weight of a Russian student having age 12 years and
height 5 ft. is missing. Then the weight of any other Russian
student having age close to 12 years and height close to 5 ft. can
be assigned.
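A minimal sketch of the elimination and imputation strategies, assuming pandas and a made-up data set, is shown below.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [22, 25, np.nan, 24, 23],
    "height": [5.2, np.nan, 5.5, 5.1, 5.4],
    "city":   ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Strategy i: eliminate records having a missing value
dropped = df.dropna()

# Strategy ii: impute - mean/median for quantitative attributes, mode for qualitative attributes
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["height"] = imputed["height"].fillna(imputed["height"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])
print(imputed)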
DATA PRE-PROCESSING
 Two techniques are applied as part of data pre-processing
1) Dimensionality reduction
2) Feature subset selection
1) Dimensionality reduction
 High-dimensional data sets need a high amount of computational
space and time. At the same time, not all features are useful – irrelevant ones
degrade the performance of machine learning algorithms. Most of the
machine learning algorithms perform better if the dimensionality of
data set, i.e. the number of features in the data set, is reduced.

 Dimensionality reduction refers to the techniques of reducing the


dimensionality of a data set by creating new attributes by combining the
original attributes.

a. The most common approach for dimensionality reduction is known as Principal


Component Analysis (PCA).

 PCA is a statistical technique to convert a set of correlated


variables into a set of transformed, uncorrelated variables called
principal components.

 The principal components are a linear combination of the original


variables. They are orthogonal to each other.

 Since principal components are uncorrelated, they capture the


maximum amount of variability in the data. However, the only
challenge is that the original attributes are lost due to the
transformation.

b. Another commonly used technique which is used for


dimensionality reduction is Singular Value Decomposition
(SVD)
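A minimal sketch of PCA using scikit-learn (one common implementation, which uses SVD internally) on a hypothetical feature matrix X is shown below.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)          # 100 records with 10 (possibly correlated) features

pca = PCA(n_components=3)            # keep only 3 principal components
X_reduced = pca.fit_transform(X)     # transformed, uncorrelated attributes

print(X_reduced.shape)                    # (100, 3)
print(pca.explained_variance_ratio_)      # variability captured by each component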
2) Feature subset selection
 Feature subset selection or simply called feature selection, both for
supervised as well as unsupervised learning, try to find out the optimal
subset of the entire feature set which significantly reduces computational
cost without any major impact on the learning accuracy.

 As part of this process, few features will be eliminated which are irrelevant.
A feature is considered as irrelevant if it plays an insignificant role (or
contributes almost no information) in classifying or grouping together a set of
data instances.
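A minimal sketch of feature subset selection for a supervised problem, assuming scikit-learn and dummy data, is shown below; it keeps the k features most related to the target and drops the rest.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(100, 8)            # 8 candidate features (dummy data)
y = np.random.randint(0, 2, 100)      # class labels (dummy data)

selector = SelectKBest(score_func=f_classif, k=4)   # keep the 4 most relevant features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)               # (100, 4)
print(selector.get_support())         # boolean mask showing which features were kept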

What Is Data Preprocessing?


Data preprocessing is a step in the data mining and data analysis process that takes
raw data and transforms it into a format that can be understood and analyzed by computers
and machine learning.
Machines like to process nice and tidy information – they read data as 1s and 0s. So
calculating structured data, like whole numbers and percentages is easy. However,
unstructured data, in the form of text and images, must first be cleaned and formatted before
analysis.
Data Preprocessing Importance
When using data sets to train machine learning models, you’ll often hear the phrase
“garbage in, garbage out.” This means that if you use bad or “dirty” data to train your model,
you’ll end up with a bad, improperly trained model that won’t actually be relevant to your
analysis.
Good, preprocessed data is even more important than the most powerful algorithms,
to the point that machine learning models trained with bad data could actually be harmful to
the analysis you’re trying to do – giving you “garbage” results.
Depending on your data gathering techniques and sources, you may end up with data
that’s out of range or includes an incorrect feature, like household income below zero or an
image from a set of “zoo animals” that is actually a tree. Your set could have missing values
or fields. Or text data, for example, will often have misspelled words and irrelevant symbols,
URLs, etc.
When you properly preprocess and clean your data, you’ll set yourself up for much
more accurate downstream processes. We often hear about the importance of “data- driven
decision making,” but if these decisions are driven by bad data, they’re simply bad decisions.
Understanding Machine Learning Data Features
Data sets can be explained with or communicated as the “features” that make them
up. This can be by size, location, age, time, color, etc. Features appear as columns in datasets
and are also known as attributes, variables, fields, and characteristics.
A machine learning data feature is defined as “an individual measurable property or
characteristic of a phenomenon being observed”. It’s important to understand what “features”
are when preprocessing your data because you’ll need to choose which ones to focus on
depending on what your business goals are. Later, we’ll explain how you can improve the
quality of your dataset’s features and the insights you gain with processes like feature
selection.
First, let’s go over the two different types of features that are used to
describe data: categorical and numerical:

Categorical features: Features whose explanations or values are taken from a defined set of
possible explanations or values. Categorical values can be colors of a house; types of animals;
months of the year; True/False; positive, negative, neutral, etc. The set of possible categories
that the features can fit into is predetermined.
Numerical features: Features with values that are continuous on a scale, statistical, or integer-
related. Numerical values are represented by whole numbers, fractions, or percentages.
Numerical features can be house prices, word counts in a document, time it takes to travel
somewhere, etc.
Data Preprocessing Steps
Let’s take a look at the established steps you’ll need to go through to make sure
your data is successfully preprocessed.


1. Data quality assessment
2. Data cleaning
3. Data transformation
4. Data reduction
1. Data quality assessment
Take a good look at your data and get an idea of its overall quality, relevance to your project,
and consistency. There are a number of data anomalies and inherent problems to look out for
in almost any data set, for example:
Mismatched data types: When you collect data from many different sources, it
may come to you in different formats. While the ultimate goal of this entire
process is to reformat your data for machines, you still need to begin with
similarly formatted data. For example, if part of your analysis involves family
income from multiple countries, you’ll have to convert each income amount into a
single currency.
Mixed data values: Perhaps different sources use different descriptors for features
– for example, man or male. These value descriptors should all be made uniform.
Data outliers: Outliers can have a huge impact on data analysis results. For
example if you're averaging test scores for a class, and one student didn’t respond to
any of the questions, their 0% could greatly skew the results.
Missing data: Take a look for missing data fields, blank spaces in text, or
unanswered survey questions. This could be due to human error or incomplete data.
To take care of missing data, you’ll have to perform data cleaning.
2. Data cleaning
Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set. Data cleaning is the most important
step of preprocessing because it will ensure that your data is ready to go for your downstream
needs.
Data cleaning will correct all of the inconsistent data you uncovered in your data
quality assessment. Depending on the kind of data you’re working with, there are a number
of possible cleaners you’ll need to run your data through.
Missing data
There are a number of ways to correct for missing data, but the two most common are:
Ignore the tuples: A tuple is an ordered list or sequence of numbers or entities. If
multiple values are missing within tuples, you may simply discard the tuples with that
missing information. This is only recommended for large data sets, when a few ignored
tuples won’t harm further analysis.
Manually fill in missing data: This can be tedious, but is definitely necessary when
working with smaller data sets.


Noisy data
Data cleaning also includes fixing “noisy” data. This is data that includes unnecessary data
points, irrelevant data, and data that’s more difficult to group together.
Binning: Binning sorts data of a wide data set into smaller groups of more similar
data. It’s often used when analyzing demographics. Income, for example, could be grouped:
$35,000-$50,000, $50,000-$75,000, etc.
Regression: Regression is used to decide which variables will actually apply to
your analysis. Regression analysis is used to smooth large amounts of data. This will help
you get a handle on your data, so you’re not overburdened with unnecessary data.
Clustering: Clustering algorithms are used to properly group data, so that it can be
analyzed with like data. They’re generally used in unsupervised learning, when not a lot is
known about the relationships within your data.
If you’re working with text data, for example, some things you should consider when
cleaning your data are:
Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
Translate all text into the language you’ll be working in
Remove HTML tags
Remove boilerplate email text
Remove unnecessary blank text between words
Remove duplicate data
After data cleaning, you may realize you have insufficient data for the task at hand. At this
point you can also perform data wrangling or data enrichment to add new data sets and run
them through quality assessment and cleaning again before adding them to your original data.
3. Data transformation
With data cleaning, we’ve already begun to modify our data, but data transformation will
begin the process of turning the data into the proper format(s) you’ll need for analysis and
other downstream processes.
This generally happens in one or more of the below:
1. Aggregation
2. Normalization
3. Feature selection
4. Discretization
5. Concept hierarchy generation
Aggregation: Data aggregation combines all of your data together in a uniform format.
Normalization: Normalization scales your data into a regularized range so that you can

compare it more accurately. For example, if you’re comparing employee loss or gain within a
number of companies (some with just a dozen employees and some with 200+), you’ll have to
scale them within a specified range, like -1.0 to 1.0 or 0.0 to 1.0.
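A minimal sketch of min-max normalization into the 0.0–1.0 range, assuming scikit-learn and a made-up column of values, is shown below.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

employee_change = np.array([[-12], [5], [200], [40]])   # raw values on very different scales

scaler = MinMaxScaler(feature_range=(0.0, 1.0))
print(scaler.fit_transform(employee_change))            # every value now lies in [0, 1]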
Feature selection: Feature selection is the process of deciding which variables (features,
characteristics, categories, etc.) are most important to your analysis. These features will be
used to train ML models. It’s important to remember
that the more features you choose to use, the longer the training process and, sometimes, the
less accurate your results, because some feature characteristics may overlap or be less present
in the data.

Discretization: Discretization pools data into smaller intervals. It’s somewhat similar to
binning, but usually happens after data has been cleaned. For example, when calculating
average daily exercise, rather than using the exact minutes and seconds, you could join
together data to fall into 0-15 minutes, 15-30, etc.
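A minimal sketch of discretization with pandas.cut, using a hypothetical list of daily exercise minutes, is shown below.

import pandas as pd

minutes = pd.Series([4, 17, 22, 9, 41, 33])

# Pool the exact values into 15-minute intervals
bins = [0, 15, 30, 45]
labels = ["0-15", "15-30", "30-45"]
print(pd.cut(minutes, bins=bins, labels=labels))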
Concept hierarchy generation: Concept hierarchy generation can add a hierarchy within
and between your features that wasn’t present in the original data. If your analysis contains
wolves and coyotes, for example, you could add the hierarchy for their genus: Canis.
4. Data reduction
The more data you’re working with, the harder it will be to analyze, even after cleaning and
transforming it. Depending on your task at hand, you may actually have more data than you
need. Especially when working with text analysis, much of regular human speech is
superfluous or irrelevant to the needs of the researcher. Data reduction not only makes the
analysis easier and more accurate, but cuts down on data storage.
It will also help identify the most important features to the process at hand.
Attribute selection: Similar to discretization, attribute selection can fit your data into
smaller pools. It essentially combines tags or features, so that tags like male/female and
professor could be combined into male professor/female professor.
Numerosity reduction: This will help with data storage and transmission. You can use a
regression model, for example, to use only the data and variables that are relevant to your
analysis.


Dimensionality reduction: This, again, reduces the amount of data used to help facilitate
analysis and downstream processes. Algorithms like K-nearest neighbors use pattern
recognition to combine similar data and make it more manageable.
Data Preprocessing Examples
Take a look at the table below to see how preprocessing works. In this example, we have three
variables: name, age, and company. In the first example we can tell that #2 and #3 have been
assigned the incorrect companies.

Name          Age   Company
Karen Lynch   57    CVS Health
Elon Musk     49    Amazon
Jeff Bezos    57    Tesla
Tim Cook      60    Apple

We can use data cleaning to simply remove these rows, as we know the data was improperly
entered or is otherwise corrupted.
Name          Age   Company
Karen Lynch   57    CVS Health
Tim Cook      60    Apple
Short Answers

1. Define Artificial Intelligence. Give an example


2. Write the differences between Artificial Intelligence vs Machine Learning
3. Write the differences between Traditional programming and Machine Learning
4. List out the type of problems to be solved using Machine Learning
5. List out the type of problems NOT to be solved using Machine Learning
6. Write the applications of Machine Learning
Long Answers
1. What are the Types of Human Learning? Describe in detail.
2. How do Machines Learn (Process of Machine Learning)? Explain the steps in detail.
3. Discuss and Differentiate Types of Machine Learning
4. Describe the Issues (Ethical Issues) in Machine Learning
5. What are the activities in Machine Learning (OR) Explain Life Cycle of Machine Learning
6. Explain Different Types of Data used in Machine Learning
7. Discuss Exploring the Structure of the data
8. Write short notes on Data Quality and Remediation process
9. What is Pre-Processing? Explain the steps in detail.
