Unit 2b AI Project Cycle

2.4 AI Project Cycle: Data Exploration


Learning Outcomes
Understand the purpose of data exploration.
Learn ways to visualise data using various graphical tools.

Our learning about the second stage of the AI Project Cycle, Data Acquisition, gave us two
main points to think about:
1. Data gathering is a complex task due to the variety of data and data sources.
2. Data gathered from various sources cannot be used immediately, as it is, to train an AI algorithm.
The second point is the core reason we need to explore the data. The third stage of the AI
project cycle, Data Exploration, is dedicated to "tidying up" the raw data collected from
various sources and through various techniques. This "tidied up" data is finally used as
training data.
Data exploration is used to obtain a basic understanding of the data to determine its suitability
for training an AI algorithm.

Structured and unstructured data


Structured data is usually found in symmetrical documents such as tables, spreadsheets,
database tables, XML files, etc. It stores the data under a fixed number of fields, as sets
of horizontal records. Structured data is the easiest to explore and refine as training data.
Unstructured data does not have a predefined, fixed model. Such data is not organised in
a pre-defined manner; for example, pieces of text, photographs, maps or satellite images,
audio clips and video clips. Handling unstructured data demands a higher level of expertise
in exploring it to prepare training data.
An XML file contains data in a hierarchical, tree-like structure defined by mark-up elements.

Data Exploration and Missing Values


The data-set you obtain after gathering the data may have certain values missing in it.
There are several possible reasons for this, such as problems with data extraction,
complete information not given by the person filling the survey form, a value missing in
the database, a value not read properly by an ink character reader, a value deleted for
any reason, a value not yet filled in while the data was being extracted, etc.
Treatment of missing values is necessary to prevent bias in the AI system. For example, in
a social service AI application, if data is missing for a particular region, the predictions
made by the algorithm will miss that region. This is called biased prediction and it
leads to human rights issues. Hence, missing values need to be taken care of in the data-
sets. Some major ways are listed here:
Remove the records that contain missing values.
Find the missing values and fill the gaps.
Look for the value in another similar record and use that to fill the gap.
Estimate or calculate the missing value (e.g. Missing Sale Value = Price × Quantity).
Predict a value for the missing value by careful analysis of existing data.
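
As a minimal sketch, here is how three of these treatments might look in pandas; the data-set, column names and values below are hypothetical, not from the chapter:

```python
import pandas as pd

# Hypothetical sales records with missing values (NaN); the column
# names are illustrative assumptions, not from the chapter's data-sets.
df = pd.DataFrame({
    "Price":    [100, 250, 180, 250],
    "Quantity": [3, 2, None, 4],
    "Sale":     [300, None, 540, None],
})

# Way 1: remove the records that contain missing values.
dropped = df.dropna()

# Way 2: fill the gaps with a default value (here, the column mean).
filled = df.fillna(df.mean(numeric_only=True))

# Way 3: estimate/calculate the missing value (Sale = Price * Quantity).
df["Sale"] = df["Sale"].fillna(df["Price"] * df["Quantity"])
print(df)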

Data Exploration and Information


Many times, a huge set of data holds some meaningful piece of information which is directly
useful as training data. That information is not visible directly but hidden in the data-set,
and can only be found after careful analysis of the data. This exercise is called feature engineering.
Feature engineering is the technique of extracting relevant and useful information from the
existing data-sets for the training of the AI algorithm.
For example, in the Library MS scenario, we try to identify and extract only those data features which
help us find the most frequently issued and the least read books.
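
A small pandas sketch of this idea, deriving a "times issued" feature from a hypothetical raw issue log (the book titles and months are illustrative assumptions):

```python
import pandas as pd

# Hypothetical raw issue log from the Library MS scenario.
issues = pd.DataFrame({
    "book":  ["Java Bible", "Java Bible", "C Primer",
              "Java Bible", "C Primer", "Atlas 2020"],
    "month": ["Jan", "Feb", "Feb", "Mar", "Apr", "Apr"],
})

# Engineered feature: how often each book has been issued.
times_issued = issues.groupby("book").size().rename("times_issued")
print(times_issued.sort_values(ascending=False))  # most issued first
```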
Data Exploration through Data Visualisation
The final purpose of data exploration is to understand the data, identify the relevant and useful data,
discard unwanted data, and analyse and draw conclusions about the useful information locked in the data-sets.
Data visualisation occurs when we begin to identify patterns, trends and logical relationships among
data values while exploring and analysing the data.
To analyse the data, visualising it in some form is very helpful. Data-sets are mostly complex structures
of numbers, text, characters, symbols, etc. The best way to visualise them is to create a graphical
representation. One of the most common and efficient ways to visualise data is graphs and charts.
Graphical representation of data can be done in many forms such as:
Charts and graphs (Bar, Column, Pie, Area, Line, etc.)
Process flows (Flowchart, Illustration diagram, etc.)
Patterns (Histogram, Parallel coordinates plot, timeline, etc.)
Ranges (Gantt chart, Open-high-low-close chart, etc.)
Distribution (Bubble chart, density chart, etc.)
Maps and Locations (Bubble map, dot map, population pyramid, etc.)
Comparisons (Bar chart, line graph, etc.)
Relationships (Venn diagram, Parallel coordinates plot, Scattered Column, etc.)
Hierarchy (Tree diagram, tree map, etc.)
ACTIVITY: VARIOUS GRAPHICAL REPRESENTATIONS

Visit https://datavizcatalogue.com/ and explore any 5 graphical representations.

Try to explore more about the Bar chart, Pie chart, Column chart, Line chart, Parallel Coordinates
Plot, Open-high-low-close chart and Bubble chart.
List the usage of these data representations, how each is drawn (how values are shown) and
which type of data each is suitable for.

Visualising data for various requirements


While working with various data-sets, we must know which kind of visualisation is suitable
for analysing the data, to decide which data values would be suitable as training data.
Broadly, we work with data for the following purposes:
Comparing the values.
Establishing relationships.
Distributions and compositions.

Comparing Values
Simple tables (for smaller data-sets), Bar chart, Column chart, Line chart, Area chart, etc.
are common tools.
One of the most common data charts is the Column or Bar chart. These charts have 2 axes – one axis
displays the items compared and the other displays the values. These charts present a simple,
straightforward, comparative view of the data values. They can be used to find items which are
similar in properties, for example, finding similar furniture items in a product catalogue. Consider
the following data-set:

This data-set lists the values of 4 properties – material, size, colour and weight – of various furniture
items. The values for these properties (except weight) are taken from the property codes defined beside
the data-set (like 100 for Material means Plastic).
Now, consider the Column chart created on this data-set. You can easily compare similar properties
by observing data columns of similar height.
One limitation of these charts is their size. With a higher number of values, they become too
cluttered to understand.
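
Since the chapter's furniture figure is not reproduced here, the following matplotlib sketch uses hypothetical property codes to show how such a grouped column chart could be plotted:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical coded furniture properties (e.g. 100 = Plastic).
items    = ["Chair", "Table", "Stool", "Desk"]
material = [100, 200, 100, 200]   # property codes
size     = [10, 30, 10, 30]

x = np.arange(len(items))
plt.bar(x - 0.2, material, width=0.4, label="Material code")
plt.bar(x + 0.2, size, width=0.4, label="Size code")
plt.xticks(x, items)
plt.legend()
plt.title("Columns of similar height indicate similar items")
plt.show()
```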

ACTIVITY: BOOKS CATEGORY ISSUE TREND

This is the data set of the number of books issued in the 4 categories: reference books, fiction,
journals, and textbooks. A Line chart shows the trend and altering patterns across the months.
Plot a Line chart for this data-set.

Write down your observations of the chart about consistent and inconsistent numbers of
book issues.

Establishing Relationships
Relationships are helpful in finding how various values correlate with each other.
A Line chart can also be used to find relationships. Other useful charts are the Scatter plot
and the Bubble chart.
Consider the data-set of total books issued in each of the 6 months and the scatter chart on it.
The closer the values are to each other, the more similar they are. The values of Jan, Mar and
Apr are at the same level.
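
A matplotlib sketch of such a scatter chart, using hypothetical month-wise totals (the real data-set figure is not reproduced here):

```python
import matplotlib.pyplot as plt

# Hypothetical totals of books issued in each of the 6 months.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
issued = [40, 65, 40, 40, 75, 55]   # Jan, Mar, Apr at the same level

plt.scatter(range(len(months)), issued)
plt.xticks(range(len(months)), months)
plt.ylabel("Books issued")
plt.title("Points at the same level indicate similar values")
plt.show()
```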
Analysing Distribution and Composition
Distribution refers to the amount of parts in a whole; for example, the percentage performance
of the students of a class in a subject, or the contribution to sales made by each salesman in
a team.
The Pie chart is one of the most common charts used to analyse distribution. A Scatter chart can
also be used.
The Tree Map chart is a 2-D stacked chart. It is suitable for analysing distributions where there
are two or more sets of variables. It immediately shows the largest, smallest and similar
values. See the Tree Map created on the similar data set.
Composition charts show how individual parts make up the whole. These charts help us see the
importance of each part in the whole. The Pie chart is a good tool to analyse such data; it shows
each data value as a part of the whole in terms of percentage. Consider the pie chart on the same
data set of total issued books, month-wise.
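
A pie chart sketch on the same hypothetical month-wise totals used in the scatter example above:

```python
import matplotlib.pyplot as plt

# Same hypothetical month-wise totals of issued books.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
issued = [40, 65, 40, 40, 75, 55]

# autopct prints each slice as a percentage of the whole.
plt.pie(issued, labels=months, autopct="%1.1f%%")
plt.title("Each month's share of the total books issued")
plt.show()
```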

A histogram shows ranges of values. Consider the histogram on the same data set of total
issued books. This histogram has 4 bins of range 20 each; the second bin has no values.
A histogram is useful in reducing the number of plotted values when the data set is very large.
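
A histogram sketch on the same hypothetical totals; note that which bins end up empty depends entirely on the data:

```python
import matplotlib.pyplot as plt

# Same hypothetical data; 4 bins of range 20 covering values 20-100.
issued = [40, 65, 40, 40, 75, 55]
plt.hist(issued, bins=[20, 40, 60, 80, 100], edgecolor="black")
plt.xlabel("Books issued (bin ranges of 20)")
plt.ylabel("Number of months in each bin")
plt.show()
```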

Data Visualisation Tools

Google Data Studio: Google Data Studio is an online tool for converting data into customisable,
informative reports and dashboards.
Tableau: One of the most preferred tools for data analytics. Tableau has various versions such as
online, desktop, server and mobile. It allows you to create visual charts and dashboards dynamically
and reveal the power of data in creative and effective ways.
Candela: This is a JavaScript-based, open-source tool to create rich data visualisations. It provides a
library of various charts, graphs and plots to integrate with your data-sets and generate dynamic and
scalable visuals of your data.

Other data visualisation tools are Datawrapper, Charted, Leaflet, MyHeatMap, RAWGraphs,
Zoho Analytics, Sisense, Jupyter, Infogram, Domo, Microsoft Power BI and SAP Analytics Cloud, etc.

PRACTICAL ACTIVITY: VARIOUS CHARTS

Go to visme.co and click on any FREE chart template to use it. Click on the chart and change
the data in the data sheet to see how the chart changes. Try out the values given in the
activities and examples in this chapter.

2.5 AI Project Cycle: Modelling & Evaluation


Learning Outcomes
Understand rule-based and learning-based approaches of machine learning.
Learn the significance of decision trees.
Draw decision trees.
Appreciate the significance of evaluation.

Having narrowed down to the learning data (Data Exploration) from problem scoping, now is the time to
create models to train the machines in order to get the results (predictions) we desire.

AI, ML and DL Revisited


Before looking into the detailed concept of data modelling for training the machine, let us first
clearly distinguish between artificial intelligence, machine learning and deep learning.
Artificial intelligence: This is the field that refers to the development of machines that exhibit human
intelligence.
Machine learning: This is the sub-set of AI which refers to the techniques that enable a machine to learn
from data and predictions and use that learning further.
Deep learning: This is the advanced level of machine learning. It works on the concept of Artificial
Neural Networks, which are composed of multiple layers of nodes; each node is a functional artificial
neuron. This technique enables a machine to learn from immense volumes of data. The data-sets are
vast and, many times, so dynamic in nature that they get bigger in volume with passing time. For
example, analysing the views of several thousand users to do sentiment analysis about a product or
public figure, or locating and identifying an image in a collection of several thousand images. So, AI
is the concept, of which ML is a sub-set, and DL is an advanced level of ML.

AI Modelling Approaches
At the heart of AI is data. The algorithm's logic is focused on the data and the relationships between
various data elements. We have already seen during data exploration that most data is represented in
the form of numbers, text and dates. At machine level, all kinds of data are represented only by numbers
and translated into their binary equivalent – sequences of 0s and 1s. In addition to the above-mentioned
data types, there are images, audio, video and multimedia. At machine level these are also represented as
complex sets of numbers. Computers do not understand data the way the human brain does. For a computer
system, every piece of data has to be represented in some form of arithmetic expression, which is why AI
data needs to be modelled in mathematical form.
There are broadly two types of AI modelling approaches: Rule-based and Learning-based.

Rule-based Approach
This approach is best suited for systems that need to work in a restricted application area. They follow
a set of rules in order to accomplish a task. Rules are defined in the system in a well-structured format of
if-else-elseif branches along with a set of facts. For any question asked, the machine picks up the key
data from the question and looks up the rules. Depending on the rules, it finds the information in
the database and presents it to the user. This way, in the rule-based approach, the intelligence of the
machine is only simulated as defined by the human (developer). This intelligence does not exceed what
is defined: the machine does not gain any learning experience, and its intelligence does not grow unless
new rules and facts are added to it. Challenges with rule-based systems are:
Contradicting rules may be added while adding new rules.
Upgrade and upkeep of such systems are time-consuming and expensive.
They are not versatile enough to be used in multiple domains, since not all problems can be
defined by a structured set of rules.
The rule-based approach is widely used in industry in various domains. Expert systems in fields
such as medicine and law are examples of the rule-based approach to machine learning. The
automated manufacturing, gaming and education fields also pursue rule-based modelling for AI.
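
As a minimal sketch of the idea (the rules and facts below are hypothetical, not from any real expert system), a rule-based model is just fixed branches written by the developer:

```python
# Fixed if-else rules written by a human developer; the machine
# never learns beyond them. The rules and thresholds are hypothetical.
def triage(temperature_c, has_rash):
    if temperature_c >= 38 and has_rash:
        return "Refer to a specialist"
    elif temperature_c >= 38:
        return "Fever - rest and monitor"
    else:
        return "No action needed"

print(triage(38.5, has_rash=False))  # Fever - rest and monitor
```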

Learning-based Approach
The limitations of the rule-based approach are addressed by the learning-based approach to AI
modelling. In this approach, the data is fed to the machine and it is supposed to analyse the data to
find possible patterns and trends in it for making predictions. This is the reason such systems are
suitable for handling abstract, unstructured and random data. The learning-based approach needs a
high level of data expertise and modelling. This approach is complex in nature and needs to be dealt
with great care.
The learning-based approach is suitable where step-by-step rule-based learning cannot be applied
easily. It is useful where predictions are based on a number of factors which are difficult or impossible
for humans to compute. Predicting customer behaviour, monitoring financial transactions for fraud,
medical diagnostics, legal research and advice, etc. are application areas for learning-based approaches.
Machine learning and deep learning are both learning-based approaches. Now, let us learn about some
popular AI models.

AI Models
An AI model's prime goal is to learn a way to establish a meaningful relationship between the input data
and the predictable output.
Various data values in a data-set, or across multiple data-sets, are mapped and analysed to arrive at
near-accurate, useful predictions.
There are various AI models that can be used in various scenarios. Some of the common AI models are
decision trees, regression, bagging, random forests and neural networks.

Decision Trees
Amongst all the AI models, the decision tree is the simplest yet an efficient tool based on the
rule-based approach. It is used to classify data values or predict the outcome on the basis of input
values. A decision tree looks like a classic binary tree that propagates by answering simple questions
with "Yes" and "No". A decision tree is the inverted figure of a real tree, where the roots are at the
top (or left) and the leaves at the bottom (or right).
Every component of a decision tree is called a node. A node can be categorised as a Root node, a
Decision node or a Leaf node. Each node is connected to the next nodes through branches. Branches are
called either Yes branches or No branches. Nodes can be drawn using rectangles, and branches are
represented by straight lines or arrows.
Root node: A decision tree has exactly one root node, which stays at the very top. It specifies the
starting condition on which the decision is to be taken.
Decision node: There are always two or more decision nodes in a tree. Each specifies a new condition
that emerges from the previous condition. Decision nodes are the outcome of a decision taken on a
condition, either positive (Yes) or negative (No).
Leaf node: There are always several leaf nodes, or leaves, which depict the end of the decision tree. That
is why leaves constitute the bottom of the decision tree. Leaves specify the final decision taken after
considering all the conditions.
Branches: A node is connected to the next node through either a positive branch (Yes branch) or a
negative branch (No branch). In certain cases, multiple branches can be used.

Decision Tree by Example
Let us understand decision trees with an example from our Library MS scenario. Consider the following
data set on which a decision tree needs to be made. When the development team discussed the requirements
for the solution in detail with the library manager, they found out the following:
Journals can be issued but they should not be recommended by the AI algorithm.
All other types of books should be recommended.
The criterion for being classified as a popular book is now different for each type of book
(earlier it was 95% or more for all). This is given below:
BOOK TYPE POPULARITY CRITERIA
Reference 98%
Textbook 95%
Fiction 90%
Journal 80%
Notice the sample data given here about two types of book (Reference & Textbook). If you have to
classify the book Java Bible as popular or not, then the decision tree for it is given here. This tree is
created considering only two criteria in the context, i.e. Reference book & Textbook.

The complete decision tree for classifying the books as popular or not is given here.

You may ask: if journals are not to be recommended, why keep them in the decision tree? We
are including the journals here because their issue percentage is being calculated and they are part of
the problem scope. But for such a small scenario, you can do away with the journals.
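
One way to see that a decision tree is rule-based is to express the same tree as nested Yes/No branches in code. This sketch uses the popularity criteria from the table above; the function name and structure are illustrative:

```python
# The complete tree as nested Yes/No branches, using the popularity
# criteria from the table above (journals are never recommended).
def classify(book_type, issue_percent):
    if book_type == "Journal":                 # root condition
        return "Do not recommend"
    if book_type == "Reference":
        return "Popular" if issue_percent >= 98 else "Not popular"
    if book_type == "Textbook":
        return "Popular" if issue_percent >= 95 else "Not popular"
    # Fiction is the remaining recommended type.
    return "Popular" if issue_percent >= 90 else "Not popular"

print(classify("Reference", 99))  # e.g. Java Bible -> Popular
```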

Tips on drawing decision trees

While planning to draw decision trees, follow these tips:
Decision trees are suitable for depicting concrete decisions based on a "Yes" or a "No", so carefully
analyse the data-set.
Only parameters which directly affect the prediction or output should be chosen for the decision
tree.
If there is any redundant data, eliminate it.
Avoid parameters that have missing values, since they might translate into missing nodes in the
decision tree.
If the chosen parameters produce more than one decision tree, keep the simplest one.
While drawing the decision tree, keep all the "Yes" branches on one side (usually left) and the "No"
branches on the opposite side (usually right).
Every node, except the leaves, must have a "Yes" as well as a "No" branch.
Coloured decision trees can be made, but the choice of colours should not create confusion in
understanding the tree.

ACTIVITY: A QUICK DECISION TREE

Draw a decision tree to predict if a student should opt for the Math stream or the Arts stream. To opt
for the Math stream, a student should secure above 90% in Physics and Chemistry as well as distinction
marks in Math; otherwise, they must opt for the Arts stream.

Drawing a Decision Tree

Consider this data-set about 3 parameters, or variables, in a furniture shop. Create a decision tree to
predict whether the sale shall be high, low or moderate, starting with the root condition: Is the Price
Range between 1K and 5K? Also, try to find and eliminate any redundant records first.

PRICE RANGE MATERIAL COLOUR SALE

1000-5000 METAL BLACK HIGH
1000-5000 PLASTIC BROWN MODERATE
6000-10000 METAL BROWN MODERATE
1000-5000 PLASTIC BLACK LOW
6000-10000 PLASTIC BLACK MODERATE
6000-10000 PLASTIC BROWN HIGH
1000-5000 METAL BLACK HIGH
1000-5000 METAL BROWN MODERATE
6000-10000 PLASTIC BLACK MODERATE
6000-10000 METAL BLACK HIGH

Let us first look for redundant records. It is easier to find redundant records in a sorted data-set.
The sorted data-set looks like this, with the duplicate records marked. Remove them.

PRICE RANGE MATERIAL COLOUR
1000-5000 METAL BLACK
1000-5000 METAL BLACK (duplicate)
1000-5000 METAL BROWN
1000-5000 PLASTIC BLACK
1000-5000 PLASTIC BROWN
6000-10000 METAL BLACK
6000-10000 METAL BROWN
6000-10000 PLASTIC BLACK
6000-10000 PLASTIC BLACK (duplicate)
6000-10000 PLASTIC BROWN

Data set after removing the redundant records:

PRICE RANGE MATERIAL COLOUR
1000-5000 METAL BLACK
1000-5000 METAL BROWN
1000-5000 PLASTIC BLACK
1000-5000 PLASTIC BROWN
6000-10000 METAL BLACK
6000-10000 METAL BROWN
6000-10000 PLASTIC BLACK
6000-10000 PLASTIC BROWN
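
If the data-set lives in a pandas DataFrame, sorting and de-duplicating can be done in one step; this sketch uses a small subset of the records above:

```python
import pandas as pd

records = [
    ["1000-5000", "METAL", "BLACK"], ["6000-10000", "PLASTIC", "BLACK"],
    ["1000-5000", "METAL", "BLACK"], ["1000-5000", "PLASTIC", "BROWN"],
    ["6000-10000", "PLASTIC", "BLACK"],
]
df = pd.DataFrame(records, columns=["Price Range", "Material", "Colour"])

# Sorting brings duplicates together; drop_duplicates removes them.
clean = df.sort_values(list(df.columns)).drop_duplicates()
print(clean)
```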

Some designers take the liberty of not following the "Yes/No" branch rule. They make decision
trees as hierarchical charts, which is also acceptable as long as the scenario is well explained.

AI Model for Visual or Graphical Data

Imagine a big box filled with a variety of chocolates. You are looking for a particular type. Your brain
scans almost all the chocolates in the box until it finds the one you desire. This is image (or object)
detection.
Now, imagine that in the box of chocolates, you need to sort the chocolates. For this, your brain will
try to categorise the chocolates that are similar in flavour, size, price, colour or whatever the criterion
is. This is image (or object) classification.
Image detection and classification are possible for our brain due to our ability of image recognition,
which is a very high level of brain capability for an AI system to achieve. In visual data models, the
learning-based approach is used.
Various applications like scanning QR codes and bar codes, identifying faces, maps and tags,
converting handwritten text to digital text, etc. are common examples of image recognition.
Images are a collection of millions of pixels (picture elements) – microscopic dots that together make up
an image. Each pixel has its own colour information (colour and intensity). Advanced image feature
descriptor technologies such as HAAR, HOG, SIFT, etc. are deployed to identify an object in an image.
Training an AI system to recognise an image needs a rich collection of several different images of an
object, along with many images which are "not" that object. Image properties like colour depths,
vectors, etc. are used by the AI algorithm to train itself to identify the intended object.
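
As a toy illustration of pixel-by-pixel matching (real systems rely on the feature descriptors named above rather than raw comparison), consider two tiny hypothetical grayscale grids:

```python
import numpy as np

# Two tiny hypothetical grayscale images as pixel grids (0-255).
image_a = np.array([[0, 255], [255, 0]])
image_b = np.array([[0, 255], [255, 32]])

# Naive pixel-by-pixel comparison, like the strip-matching activity.
identical  = np.array_equal(image_a, image_b)    # exact match?
similarity = np.mean(image_a == image_b)         # fraction of matching pixels
print(identical, similarity)  # False 0.75
```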

Google Vision, CamFind and MS Kinect

Maruti Automobiles uses Google Vision technology to identify the images of cars uploaded by their
clients, to authenticate the car and to detect and discard cars which are not dealt with by Maruti.
The CamFind app does a live product search on the image clicked by the user.
Revisit quickdraw.withgoogle.com and experience how AI guesses what you draw in 20 seconds.

ACTIVITY: PIXEL IT

This activity needs 2 copies each of different images of the same size. The number of images should
be half the number of participants; e.g. for 20 participants, 2 copies of 10 images each are needed.
Randomly distribute one image to each participant. They should neither show the images to each
other nor compare the images at this moment.
Then, each participant should draw lines making square marks on the image, as shown here, with a
coloured pen.
Cut the image along the horizontal marks as shown here. Counting from the top, there is piece 1,
piece 2, piece 3, piece 4 and piece 5.
Stick piece 2 to the end of piece 1, piece 3 to piece 2, and so on, to form a long chain as shown here.
Match the image: Now, try matching your image strip with the strips of other participants to find
out who got the second identical image.

Learning: What you did is divide the image into 5 strips and match the set of strips to identify
the identical image. Computers match images pixel by pixel to identify the matching image.
Variation: A similar activity can be done by replacing the images with pieces of paper on which you
write a large letter or a word in your own handwriting. This is how computers match handwriting.

Evaluation and Deployment
The Evaluation stage follows the Modelling stage. At this point, the desired model has been trained with
the training data set and is ready for testing with the help of the testing data set. The testing data set is a
completely different, separate and new entity for the model being tested. The model has been trained with
a relatively larger training data set compared to the testing one. In most cases, the ratio of the sizes of the
training and testing data sets is 3:2, but it may vary on a case-to-case basis and with the domain.
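
A minimal sketch of such a split with scikit-learn, assuming a 3:2 ratio and hypothetical records:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))        # 100 hypothetical prepared records
y = [v % 2 for v in X]      # hypothetical labels

# A 3:2 train-to-test ratio means 60% training, 40% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)
print(len(X_train), len(X_test))  # 60 40
```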
The testing data set is prepared very carefully by trained professionals after exploring and cleaning the
raw big data. First, the results of testing are compared with those of training. If the results are satisfactory,
then the testing results are compared with actual data in the domain. If that is also satisfactory, then the
model is considered ready for deployment.
In most cases, after passing this rigorous evaluation, the model is deployed in the real domain and
monitored regularly for its performance over a period of time. After that, it is signed off as a completely
reliable and independent model.

The Scenario and Confusion Matrix

Evaluation essentially means measuring a model's reliability in performing as and when required. The
measurement is always done against certain parameters whose values decide the course of the evaluation.
Let us understand this in detail.

Scenario
The problem area for which a model has been developed is called the scenario. The scenario is the reality
in which the real problem exists and in which the model has to be deployed. The model has to deal with the
scenario the way it has been trained. The scenario is the source of the real data which is fed into the model
for processing, either at regular intervals (hourly, daily, weekly, etc.) or in real time. A regular-interval
scenario can be a less critical problem area with no emergent threat; for example, pollution monitoring
in a region needs only weekly or monthly data, as does studying a cancer patient for research, monitoring
the diet and fitness of sportspersons, or monitoring the performance of students and their study habits.
Real-time scenarios represent non-stop, critical, emergent, life-threatening or disastrous situations such
as online transactions, war situations, floods, monitoring a patient in critical care, or air, road, sea and
rail traffic.
The nature of the scenario describes what is expected of a model in terms of robustness and reliability.
The parameters which are used to monitor and study the scenario determine the real performance of
the model.
There are two aspects considered here – the prediction made by the machine, and the reality of the
scenario when the prediction is made.
There are 2 possibilities when the prediction made by the machine and the reality of the scenario are
compared:
The prediction matches the reality.
The prediction does not match the reality.
Predictions are made in two terms – Yes and No (or True and False, or Positive and Negative).
For example, if a Flood Forecasting System has to predict a flood, then the possibilities are Yes or
No. Another example is to predict whether the Air Quality Index of a region will be fatal next year or not.
Here also, the answer would be Yes or No.
Predictions which are made as Yes are called Positive predictions. If such a prediction matches the
reality, it is called a True Positive prediction. For example, if the flood forecasting system predicts
that there will be a flood in a particular month in a region, and that really happens, then it is a True
Positive prediction.
What, then, is a True Negative prediction?
When the prediction is a No, and in reality the event really does not occur, it is called a True Negative
prediction. For example, if the flood forecasting system predicts that there will be no flood in a
particular month in a region, and a flood really does not happen, then it is a True Negative prediction.
The reverses of the above two possibilities are called False Positive and False Negative predictions.
A prediction which is made as Yes but does not match the reality is called a False Positive
prediction. For example, if the flood forecasting system predicts that there will be a flood in a particular
month in a region, and that really does not happen, then it is a False Positive prediction.
A prediction which is made as No but does not match the reality is called a False Negative
prediction. For example, if the flood forecasting system predicts that there will be no flood in a
particular month in a region, and a flood does happen, then it is a False Negative prediction.

Confusion Matrix
The above permutations can be summarised neatly in a tabular structure called a confusion matrix. It is
also known as an error matrix. A confusion matrix is a tabular representation used to visualise the
performance of an algorithm or model. Confusion matrices are mostly useful for supervised learning models.
The confusion matrix shows all 4 permutations. Let us summarise them once again.
True Positive: Prediction is yes and it is true. (E.g. flood predicted and it did occur.)
True Negative: Prediction is no and it is true. (E.g. flood not predicted and it didn't occur.)
False Positive: Prediction is yes and it is false. (E.g. flood predicted but it didn't occur.)
False Negative: Prediction is no and it is false. (E.g. flood not predicted but it occurred.)
In the confusion matrix table, the reality parameters are kept at the top and the prediction parameters on
the left-hand side.
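
A sketch of building such a matrix with scikit-learn on hypothetical flood data; note that scikit-learn places reality in the rows, whereas the table described above keeps reality at the top:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical flood outcomes: 1 = flood, 0 = no flood.
reality    = [1, 0, 1, 1, 0, 0, 1, 0]
prediction = [1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[1, 0] the layout is:
#   [[TP, FN],
#    [FP, TN]]
print(confusion_matrix(reality, prediction, labels=[1, 0]))
# -> [[3 1]
#     [1 3]]
```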

Model Deployment
The purpose of evaluation is to assess whether the model is fit to be used as desired in the real-life
problem area. Once the model is found perfectly fit to perform, it is deployed in the real problem area
for all the beneficiaries to use.
Forms of model deployment: Model is deployed in various forms depending on the need. Some
common forms are:
A fully-functional software application.
As part of a software library in the form of modules or packages.
A fully-functional web service.
An on-demand web service where users subscribe only for selective features.
A mobile application.
A combination of software and hardware such as driving simulator, AI-robotics solution, or self-
driven vehicle, etc.

Post-deployment optimisations: Over a period of time, the performance of the deployed model is
monitored against certain performance parameters and ethical parameters. Its performance data is
gathered and analysed. The feedback of the users is also collected on whether the model is easily
accessible and usable by all. On the basis of these analyses, further improvements are suggested and
made to the features and working of the model, making it more robust and efficient. In this way, in
the long term, the model is completely accepted into the system for regular use.

A problem scope is a mutual understanding among all stakeholders about what is to be done to
solve the problem.
AI project stages include problem scoping, data acquisition, data exploration, data modelling
and evaluation.
Problem scoping gives a clear vision of the problem, which is otherwise very abstract and
undefined.

Problem scoping covers 4Ws – Who (stakeholders), What (the problem), Where (Context
of the problem) and Why (Rationale of the solution).
Problem scoping includes identifying and defining the problem and goals to achieve.
Logical relationship between data values generates meaningful information.
Data used to train the machine is called training data.
Data used to evaluate the performance of the machine is called testing data.
Data quality refers to the accuracy and relevance of data.
A system map is a tool to show the relationship among various elements of a problem area
in a graphical form.
Data exploration is used to obtain a basic understanding of data to determine its suitability
for training an AI algorithm.
Structured data is usually found in symmetrical documents like databases.
Unstructured data does not have a predefined, fixed model like images, videos, etc.
Treatment of missing values is necessary to prevent the AI system from being biased.
Feature engineering is the technique of extracting useful information from the data-sets.
One of the most common and efficient ways to visualise data is graphs and charts.
Data visualisation includes comparing data, establishing relationships among data values
and finding distribution and composition of data.
ML is the sub-set of AI, and Deep Learning means learning from immense volumes of data.
AI modelling has two approaches – rule-based and learning-based.
Rule-based approach involves a set of rules and facts.
The learning-based approach is useful where predictions are based on a number of
factors which are difficult or impossible for humans to compute.
Decision tree is the simplest yet efficient means to model the data using rule-based
approach.
In the evaluation stage, the AI model is tested for its accuracy and reliability.
If the AI model passes the evaluation, then it is deployed in the problem area.
A model can be deployed as software, mobile app, web service, etc.

Scoping: Defining what to do to solve a problem.


Acquisition: Collection, compilation.
Exploration: Go through, observe, analyse.
Model: Present or arrange in a particular order or format.
Evaluation: To compare and assess for selection.

Abstract: Having no definite shape or definition, brief and unclear.
Scope: Working boundary of a problem area, premise of proposed solution.
Resource: Person, tools and infrastructure required to work on a project.
Budget: Projection of expected expenses and allocation of finances to different parts of an
organisation or project.
Context: Problem area that needs the solution.
Problem statement: A short, concise description of the problem.
Goal: A piece of problem-solving purpose to be achieved.
Data feature: Data type such as number, text, date, image, etc.
Training Data: Data used to train the machine.
Testing data: Data used to test the AI model after training.
Visualisation: To create a pictorial representation of data.
Raw data: Original form of data collected from various sources.
XML: eXtensible Markup Language, used to describe document structure and data.
Parameter: A variable identifying a category of data such as Price, Colour, Weight, etc.
Abstract: Not defined distinctly, having no definite structure.

CONCEPTUAL SKILLS ASSESSMENT


A. Multiple Choice Questions.

1. For successful completion, a project needs better __________.


a. Imagination b. People

c. Plan d. Organisation

2. The purpose of the project is translated into achievable ________________.


a. Heights b. Output

c. Plan d. Goals

3. Project _______________ enables us to see what is to be done in a project.


a. Detail b. Scope

c. Plan d. Activities

4. Existing businesses are suitable for AI implementation due to bulk or immense ____________.
a. Scope b. Data

c. Revenue d. Information
5. Data exploration is followed by ___________________.
a. Modelling data b. Data acquisition

c. Problem scoping d. Evaluation


6. Problem scoping is the __________ stage of AI project cycle.
a. Last b. First

c. Optional d. Random
7. Problem scoping gives a clear vision of otherwise ________________ problem.
a. Distinct b. Specific

c. Large d. Undefined
8. Problem statement includes ____________________________________________________.
a. The problem description b. Proposed solution

c. Both a) and b) d. Problem details


9. “Ravi”, “@” and “10-Jun-2022” are examples of __________.
a. Data feature b. Data format

c. Complex data-type d. Testing data


10. Testing data is ________________.
a. Input b. Prediction

c. Processing d. Big data


11. Training data is _______________.
a. Input b. Prediction

c. Processing d. None of these


12. API stands for Application _______________ Interface.
a. Processing b. Prediction

c. Primary d. Programming
13. Data acquisition gives us _____________ data while data exploration provides us ____________
data.
a. Raw, training b. Raw, testing

c. Clean, graphical d. Clean, dynamic

14. Data exploration is necessary because data available from data ________________ cannot be
used directly for machine learning.
a. Scoping b. Acquisition

c. Analysis d. Processing
15. A data-set containing names of 200 students and their marks in 5 subjects is an example of
_______________ data.
a. Structured b. Unstructured

c. Big d. Raw
16. A data-set containing 1000 samples of handwriting images is an example of ______________ data.
a. Structured b. Unstructured

c. Big d. Visual
17. In a data-set, the amount paid is not there in some records. These values are called ____________
values and can be _____________________.
a. Numeric, calculated b. Missing, omitted

c. Missing, calculated d. Numeric, omitted


18. An AI system trained on a data-set with missing values can be _____________.
a. Confused b. Accurate

c. Efficient d. Biased
19. Extracting hidden insight from the existing data is called ______________________.
a. Artificial engineering b. Data engineering

c. Feature engineering d. Data visualisation


20. For a comparative view of data, ________________ chart is suitable.
a. Column b. Bar

c. Both a) and b) d. None of these


21. To view trends and altering patterns, ______________ chart is a good tool.
a. Line b. Pie

c. Stacked column d. Any of these


22. To view relationships, ______________ chart is a good tool.
a. Line b. Scatter plot

c. Waterfall d. Any of these

23. To view percentage contribution of parts in a whole, ____________ chart is suitable.
a. Pie b. Waterfall

c. Line d. Bar
24. Stacked column chart is suitable to view ___________________.
a. Distribution b. Composition

c. Both a) and b) d. None of these


25. Which of the following is most suitable to show a process flow?
a. Line chart b. Flowchart

c. Venn diagram d. Gantt chart


26. For a computer system, data has to be represented in some form of __________________ expression.
a. Numerical b. Artificial

c. Pixel d. Logical
27. In rule-based approach, ______________ are defined as sets of if-else.
a. Rules b. Facts

c. Relationships d. Data
28. In rule-based approach, ______________ are defined by the help of data.
a. Rules b. Facts

c. Relationships d. References
29. In rule-based approach, the intelligence of the machine is defined by the _____________.
a. Developer b. Humans

c. Both a) and b) d. Computer


30. In ______________-based approach, machines are on their own to analyse data and see trends and
patterns in it.
a. Rule b. Learning

c. Both a) and b) d. None of these

B. Fill in the blank.

Why, Goals, What, Scope, Data, Solution, Project, Bottom, Top, Data exploration, Rule, Stakeholders,
Web scraping, Spreadsheet, Bias, Complex, ANNs, AI, Prediction, ML, Learning

1. Every problem drives us to find ____________.


2. A ________________ is a piece of planned work.
3. ____________ are defined out of the purpose of a project.
4. Project ____________ helps create an efficient project plan.
5. Newer systems are not suitable for AI implementation because they do not have
enough ____________________.
6. ____________ are those affected by the solution of the problem directly or indirectly.
7. In 4W framework, _______________ defines the context or boundary of the problem.
8. Rationale of the solution is defined in _________ part of 4W framework.
9. Training and testing data are the outcome of ____________________ stage of AI project cycle.
10. ________________________ is an example of structured data.
11. Phone recording of a customer complaint is an example of _______________ data.
12. Collecting data from a web site in an organised form is called__________________________.
13. Missing values may cause ____________ in an AI model.
14. Deep learning algorithms work on __________.
15. DL and _____ are sub-sets of _______.
16. Chess playing algorithm is an example of ________________-based learning.
17. A hate-speech detector algorithm running on ANN is an example of ___________-based
learning.
18. Decision trees are useful tools in doing _____________ on the basis of input variables.
19. Leaf node is found at the ______________ of a decision tree.
20. Root node is found at the ____________ of a decision tree.

C. State whether True or False.


1. A solution cannot be defined without understanding the problem.
2. For day-to-day difficulties, a detailed project plan is necessary.
3. Project goals together map with the purpose of the project.
4. An existing, old business is suitable to implement AI.
5. An AI project cannot be executed in stages.

6. Problem scoping can be done at any point of time while developing a solution.
7. Smaller projects do not need problem scoping.
8. Problem is identified in problem scoping.
9. Vision and outcome of a problem-solving exercise are different.
10. A problem statement needs to be in fine details.
11. Line chart is useful in showing the distribution of data values as part of a whole.

D. Match the AI Project Cycle Stages with their purpose.


AI Project Cycle Stages Purpose
1. Problem scoping a. Using data to train AI systems.
2. Data acquisition b. Define the goals of AI system.
3. Data exploration c. Collect and compile relevant data.
4. Modelling data d. Analyse and select suitable AI system.
5. Evaluation e. Well-organised and laid out data.

E. Match the Purpose with correct data visualisation tool.


1. Process flows a. Pie chart
2. Patterns b. Tree diagram
3. Ranges c. Open-high-low-close chart
4. Distribution d. Time line
5. Maps and Locations e. Flowchart
6. Comparisons f. Bar chart
7. Relationships g. Population pyramid
8. Hierarchy h. Scatter chart

F. Short answer type questions.


1. What do you mean by data features and data formats?
2. What do you mean by data acquisition?
3. Name any 4 sources to gather data in an AI Project.
4. What is web scraping?
5. Give three examples of structured data.
6. Give 3 examples of unstructured or complex data.
7. What is the use of a system map?
8. What do you mean by data exploration?

9. Give one consequence of missing value in the training data.
10. What do you mean by feature engineering?
11. List any 4 types of data visualisations and their use.
12. How are ML and DL related to AI?
13. Give two examples of rule-based learning systems.
14. Give two examples of learning-based machine learning models.
15. Name the final stage of AI Project Cycle. What is the use of this stage?

G. Long answer type questions.


1. Briefly describe the stages of AI Project Cycle.
2. What is the use of training data and testing data?
3. Briefly explain any 4 factors that determine data quality.
4. What is the significance of data exploration after data acquisition?
5. Discuss data feature and explain structured and complex data with examples.
6. Discuss any 5 factors of data quality.
7. Why is it not suitable to use unstructured data directly to train an AI model?
8. Briefly discuss various ways to handle the problem of missing values and their impact
on training the AI model.
9. Briefly discuss the use of some data visualisations in comparing, establishing
relationships and analysing distribution.
10. Discussing the two types of learning approaches, give 2 basic differences between
them.
11. Briefly discuss the application of rule-based learning model and learning-based model.
12. Giving a simple example, explain the structure of decision tree.

COMPETENCY-BASED QUESTIONS

A. Consider the following scenario and answer the questions that follow:
An educational institute wants to enhance the learning experience of its students. For this
purpose, they have conceived of a learning mobile app which students can use to ask questions regarding
any topic in the course they are pursuing. It should also be able to learn from the questions asked
by the students and relate the topics in the course in such a way that if a student asks about one topic, it
displays links to other related topics and lists the questions that have been asked on that
topic in past years. This will help students quickly revise the topics and learn on the move, anytime
and anywhere. This will also reduce the load on teachers in helping students revise the lessons.
1. Create the Problem Statement Template for the given scenario.
2. What major data feature do you suggest would be most useful for the learning app?

B. Make the decision tree for the given scenario.
A furniture trader needs to predict whether the sale of furniture would be high in a region of the country,
considering the 4 parameters given in the data-set below. The sale is higher during festivals. Sales
go down during rainy seasons due to problems like transport and damage by water. A higher
number of dealers promises higher sales. Sales can be high or low depending on the reviews by the
customers of a region. Create a decision tree on this data-set. Also, try to find out if there are any
redundant records and eliminate them.

FESTIVAL RAINY SEASON NUMBER OF DEALERS CUSTOMER REVIEWS IN THE REGION SALE

YES NO HIGH SATISFACTORY MODERATE
NO YES LOW EXCELLENT LOW
YES NO HIGH GOOD HIGH
YES NO HIGH SATISFACTORY MODERATE
YES NO HIGH EXCELLENT HIGH
YES YES LOW EXCELLENT MODERATE
YES NO LOW GOOD HIGH
YES NO LOW SATISFACTORY MODERATE
YES NO HIGH GOOD HIGH
NO NO HIGH GOOD MODERATE
NO NO HIGH EXCELLENT HIGH
YES YES LOW SATISFACTORY LOW
YES YES LOW EXCELLENT MODERATE
NO NO HIGH SATISFACTORY MODERATE
NO NO HIGH GOOD LOW
NO NO LOW GOOD LOW
NO NO LOW SATISFACTORY LOW

CASE STUDY/SCENARIO-BASED QUESTIONS
Consider the following scenarios and answer the questions that follow them:
A. In this stage, the data is prepared to train the AI model by fixing missing values, setting up the data
values in suitable formats, and giving a proper structure to the data. Then the data is split into training data
and testing data. The data needs to be relevant to the problem statement and the goals set in the
initial phase of the project. The training and testing of the AI model continue until the model passes the
evaluation phase.
1. In which stage of the AI Project Cycle is the training data prepared?
2. Which stage of the AI Project Cycle is used to train the AI model?
3. Name the initial and final phases of the AI Project Cycle.
4. Training data should have a few missing values. (T/F)
5. Which phase of the AI Project Cycle has not been discussed in this scenario?
B. Raj is an AI project manager. For an eye hospital, he has to develop an AI-enabled model and a medical
expert system which scans the patient's eye through a web cam and generates a diagnosis of cataract
(blurry vision due to cloudiness of the eye lens). The hospital is not able to cope with the rising number of
patients, and several hundred patient applications are on the wait list. The AI model will examine the eye and
generate several parameters which will make up the diagnosis report. This report will be sent to an expert
system which reads the data and images of the patient's eye and suggests possible medication. The expert
system also notifies the medical officer about the expert doctors for the given diagnosis who are free to
take the patient's case. The patient is also informed through email which doctor has been allocated to him/her. All
the data required to develop the solution is in various files on the hospital's web site and in the hospital database.
The hospital has data on 2,00,000 cataract patients from the past 20 years. The data will be explored, and the cases of 60%
of the patients will be used to train the AI model. The rest of the data will be used to evaluate the model. This
will help the hospital deal with the growing number of patients quickly and efficiently.
1. Write the problem statement of this scenario.
2. What value addition will the solution do to the hospital?
3. Using which technique Raj can pull out data from the hospital web site?
4. There is some structured data in the hospital in the form of _______________________.
5. Data of how many patients will be used to test the AI model?
6. If there are any missing values in the training data then what could be the consequences of deploying
the AI model in reality?

LAB ACTIVITY
1. Visit dzone.com/articles/twelve-types-of-artificial-intelligence-ai-problem and find out what kind
of problems AI is addressing presently.
2. Visit the following links and observe how AI works with various types of real-life data.
Document your observations to discuss in class.
experiments.withgoogle.com/objectifier-spatial-programming
experiments.withgoogle.com/ai/bird-sounds/view

experiments.withgoogle.com/ai/drum-machine/view
3. Visit https://datavizcatalogue.com/ and try making a Bubble Chart, Bar Chart, Calendar, Line
Chart, Timetable and Tree Diagram with your own datasets.
Take screenshots of the charts and paste them in a Word document.
4. Visit www.toptal.com/designers/data-visualization/data-visualization-tools and find out the
features of at least 10 distinct data visualisations.

Download Dia from dia-installer.de. Use this tool to make the decision trees in this chapter.
Standard toolbar: It provides quick tools like new, open, save, print and export diagram, etc.
Diagram Canvas: It is the largest area, used for drawing.
Toolbox: All the blocks and connectors you need are available in the Toolbox. It also provides
options for line widths and arrow styles.

Click on the desired block in the Toolbox and drag with the mouse on the Canvas; the block will
be drawn.
1. Arrow tool: Move the shape.
2. Text edit tool: Type text in the block, or press F2.
3. Connectors: Connect the blocks with lines and arrows.
4. Double-click on the desired block to change its properties.
Open the File menu > Export option to export your chart as an image.
