Unit 2b AI Project Cycle
Learning about the second stage of the AI Project Cycle, Data Acquisition, gave us two
main points to think about:
1. Data gathering is a complex task due to the variety of data and data sources.
2. Data gathered from various sources cannot be used immediately, as it is, to train an AI
algorithm.
The second point is the core reason why we need to explore the data. The third stage of the AI
Project Cycle, Data Exploration, is dedicated to "tidying up" the raw data collected from
various sources and through various techniques. This "tidied up" data is finally used as
training data.
Data exploration is used to obtain a basic understanding of the data and determine its suitability
for training an AI algorithm.
Comparing Values
Simple tables (for smaller data-sets), Bar charts, Column charts, Line charts, Area charts, etc.
are common tools for comparing values.
One of the most common data charts is the Column or Bar chart. These charts have 2 axes – one axis
displays the items being compared and the other displays the values. These charts present a simple,
straightforward, comparative view of the data values. They can be used to find items which are
similar in properties, for example, finding similar furniture items in the product catalogue. Consider
the following data-set:
This data-set lists the values of 4 properties – material, size, colour and weight – of various furniture items.
The values for these properties (except weight) are taken from the property codes defined beside the data-set
(like, 100 for Material means Plastic).
Now, consider the Column chart created on this data-set. You can easily compare the similar
properties by observing data columns of similar height.
One limitation of these charts is their size: with a higher number of values, they become too
cluttered to understand.
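As a rough sketch, such a column chart can be created in Python with matplotlib. The furniture items and weights below are hypothetical stand-ins for the book's data-set:

```python
# A column chart comparing the weight of furniture items
# (hypothetical values standing in for the book's data-set).
import matplotlib.pyplot as plt

items = ["Chair", "Stool", "Table", "Shelf"]
weight = [4, 4, 12, 9]    # weight in kg; Chair and Stool are similar

plt.bar(items, weight)
plt.xlabel("Furniture item")
plt.ylabel("Weight (kg)")
plt.title("Comparing furniture items by weight")
plt.show()
```

Items with columns of similar height (here, Chair and Stool) stand out immediately.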
Establishing Relationships
Relationships are helpful in finding how various values correlate with each other. A Line chart
can be used to find relationships; the other common charts for this are the Scatter plot and the
Bubble chart.
Consider the data-set of total books issued in each of 6 months and the scatter chart created
on it. The closer the values are to each other, the more similar they are. The values of Jan, Mar
and Apr are at the same level.
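The scatter chart can be sketched in Python as well; the monthly totals below are hypothetical, chosen only so that Jan, Mar and Apr sit at the same level as in the book's figure:

```python
# Scatter chart of total books issued per month (hypothetical values).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
issued = [45, 10, 45, 45, 70, 15]   # Jan, Mar, Apr at the same level

plt.scatter(months, issued)
plt.xlabel("Month")
plt.ylabel("Books issued")
plt.title("Total books issued per month")
plt.show()
```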
Analysing Distribution and Composition
Distribution refers to how a whole is divided into parts, for example, the percentage performance
of students of a class in a subject, or the contribution to sales made by each salesman in
a team.
The Pie chart is one of the most common charts used to analyse distribution. A Scatter chart can also
be used.
The Tree Map chart is a 2-D stacked chart. It is suitable for analysing distributions where there
are two or more sets of variables, and it immediately shows the largest, smallest and similar
values. See the Tree Map created on a similar data set.
Composition charts show how individual parts make up the whole. These charts help see the
importance of each part in the whole. The Pie chart is a good tool to analyse such data: it shows
each data value as a part of a whole, in terms of percentage. Consider the pie chart on the same
data set of total issued books, month-wise.
A histogram groups values into ranges. Consider the histogram on the same data set of total
issued books. This histogram has 4 bins of range 20; the second bin has no values. A histogram
is useful for condensing the values when the data set is very large.
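Both charts can be sketched in Python on the same hypothetical monthly totals used above; the histogram's bins are 20 wide, mirroring the book's example (note the empty second bin, 20-40):

```python
# Pie chart (composition) and histogram (distribution) of the
# hypothetical month-wise totals of issued books.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
issued = [45, 10, 45, 45, 70, 15]

plt.pie(issued, labels=months, autopct="%1.1f%%")
plt.title("Share of books issued, month-wise")
plt.show()

plt.hist(issued, bins=[0, 20, 40, 60, 80])   # 4 bins of range 20
plt.xlabel("Books issued")
plt.ylabel("Number of months")
plt.title("Distribution of monthly issues")
plt.show()
```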
Other data visualisation tools are Datawrapper, Charted, Leaflet, MyHeatMap, RAWGraphs, Zoho Analytics,
Sisense, Jupyter, Infogram, Domo, Microsoft Power BI and SAP Analytics Cloud, etc.
Go to visme.co and click on any FREE chart template to use it. Click on the chart and change
the data in the data sheet to see how the chart changes. Try out the values given in the
activities and examples in this chapter.
Having moved from problem scoping through to the training data (Data Exploration), now is the time to
create models to train the machines in order to get the results (predictions) we desire.
neuron. This technique enables a machine to learn from immense volumes
of data. The data-sets are vast and often so dynamic in nature that they
grow in volume with passing time. For example, analysing the views
of several thousand users to do sentiment analysis about a product or
public figure, or locating and identifying an image in a collection of several
thousand images. So, AI is the concept of which ML is a sub-set, and DL is
an advanced level of ML.
AI Modelling Approaches
At the heart of AI is data. The algorithm logics is focussed on the data and relationship between various
data elements. We have already seen during data exploration that most of the data is represented in the
form of numbers, text and dates. At machine level, all kind of data is represented only by numbers and
translated into its binary equivalent – the sequences of 0s and 1s. In addition to abovementioned data
types, there are images, audio, video and multimedia. At machine level these are also represented as
complex sets of numbers. Computers do not understand data like human brain does. For a computer
system, every piece of data has to be represented in some form of arithmetic expression. Which is why AI
data needs to be modelled in mathematical form.
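A tiny Python illustration of this idea: text characters are stored as numeric codes, which in turn have binary equivalents:

```python
# Every character maps to a number, and every number to binary.
for ch in "AI":
    code = ord(ch)               # numeric code of the character
    print(ch, code, bin(code))   # e.g. A 65 0b1000001
```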
There are broadly two types of AI modelling approaches: Rule-based and Learning-based.
Rule-based Approach
This approach is best suited for the systems that need to work in a restricted application area. They follow
a set of rules in order to accomplish a task. Rules are defined in the system in a well-structured format of
if-else-elseif branches along with the set of facts. For any question asked, the machine picks up the key
data from the question and looks up the rules. Depending on the rules, it finds out the information from
the database and presents to the user. This way, in rule-based approach the intelligence of the machine
is only simulated as defined by the human (developer). This intelligence does not exceed beyond this.
Machine does not gain any learning experience. Its intelligence does not grow unless any new rules and
facts are added to it. Challenges with rule-based systems are:
Chances of addition of contradicting rule while adding new rules.
Upgrade and upkeep of such systems are time-consuming and expensive.
Not versatile to be used in multiple domains since all problems cannot be defined by a structured
set of rules.
The rule-based approach is widely used in industry in various domains. Expert systems in fields such
as medicine and law are examples of the rule-based approach. Automated manufacturing, gaming
and education also pursue rule-based modelling for AI.
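A minimal sketch of the rule-based idea in Python: facts are stored as data, and the "intelligence" is nothing more than hand-written if-else branches over them (the titles and rules here are hypothetical, borrowed from the chapter's library scenario):

```python
# Rule-based approach: fixed facts plus if-else rules.
facts = {"Java Bible": "Reference", "Monthly Science": "Journal"}

def answer(title):
    book_type = facts.get(title)
    if book_type is None:
        return "No fact available"            # rules cannot go beyond the facts
    elif book_type == "Journal":
        return "Can be issued, not recommended"
    else:
        return "Can be issued and recommended"

print(answer("Monthly Science"))   # Can be issued, not recommended
```

Note that the system cannot answer anything outside its stored facts and rules, which is exactly the limitation described above.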
Learning-based Approach
The limitations of the rule-based approach are addressed by the learning-based approach of AI modelling. In
this approach, the data is fed to the machine and the machine analyses it to find possible patterns and
trends in the data for making predictions. This is the reason such systems are suitable for handling
abstract, unstructured and random data. The learning-based approach needs a high level of data expertise and
modelling; it is complex in nature and needs to be dealt with great care.
The learning-based approach is suitable where step-by-step, rule-based learning cannot be applied easily. It
is useful where predictions depend on a number of factors which are difficult or impossible to analyse
manually. Predicting customer behaviour, monitoring financial transactions for fraud, medical diagnostics,
legal research and advice, etc. are application areas for learning-based approaches. Machine learning
and deep learning are both learning-based approaches. Now, let us learn about some popular AI models.
AI Models
An AI model's prime goal is to learn a way to establish a meaningful relationship between the input data
and the predictable output. Various data values in a data-set, or across multiple data-sets, are mapped and
analysed to make near-accurate, useful predictions.
There are various AI models that can be used in various scenarios. Some of the common AI models are
decision trees, regression, bagging, random forests and neural networks.
Decision Trees
Amongst all the AI models, the decision tree is the simplest yet an efficient tool based on the rule-based
approach. It is used to classify data values or predict an outcome on the basis of input values. A decision
tree looks like a classic binary tree that propagates by answering simple "Yes"/"No" questions. A decision
tree is an inverted figure of a real tree, with the root at the top (or left) and the leaves at the bottom
(or right).
Every component of a decision tree is called a node. A node can be categorised as a Root node, Decision
node or Leaf node. Each node is connected to the next nodes through branches, called either the Yes branch
or the No branch. Nodes can be drawn using rectangles, and branches are represented by straight lines
or arrows.
Root node: A decision tree has exactly one root node, which stays at the very top. It specifies the starting
condition on which the decision needs to be taken.
Decision node: There are always two or more decision nodes in a tree. Each specifies a new condition that
emerges from the previous condition, as the outcome of a decision taken on that condition in the positive
(Yes) or the negative (No).
Leaf node: There are always several leaf nodes, or leaves, which depict the end of the decision tree. That
is why the leaves constitute the bottom of the tree. Leaves specify the final decision taken after
considering all the conditions.
Branches: A node is connected to the next node through either a positive branch (Yes branch) or a negative
branch (No branch). In certain cases, multiple branches can be used.
Decision Tree by Example
Let us understand the decision tree with an example from our Library MS scenario. Consider the following
data set, on which a decision tree needs to be made. When the development team discussed the requirements
for the solution in detail with the library manager, they found out the following:
Journals can be issued but they should not be recommended by the AI algorithm.
All other types of books should be recommended.
The criteria for a book to be classified as popular is now different for each type of book (earlier it
was 95% or more for all). It is given below:
BOOK TYPE     POPULARITY CRITERIA
Reference     98%
Textbook      95%
Fiction       90%
Journal       80%
Notice the sample data given here about two types of book (Reference & Textbook). If you have to classify
the book Java Bible as popular or not, the decision tree for it is given here. This tree is created
considering only two criteria in the context, i.e. Reference book & Textbook.
You may ask: if journals are not to be recommended, then why keep them in the decision tree? We
include the journals here because their issue percentage is being calculated and they are part of
the problem scope. But for such a small scenario, you could do away with journals.
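The same tree can be sketched as nested if-else branches in Python. The criteria come from the table above; the input percentages are hypothetical:

```python
# The popularity decision tree as nested if-else branches.
CRITERIA = {"Reference": 98, "Textbook": 95, "Fiction": 90, "Journal": 80}

def classify(book_type, issue_percent):
    # Root: is the book a journal? Journals are never recommended.
    if book_type == "Journal":
        return "Do not recommend"
    # Decision node: compare against the type's popularity criteria.
    if issue_percent >= CRITERIA[book_type]:
        return "Popular - recommend"      # leaf
    return "Not popular"                  # leaf

print(classify("Reference", 99))   # Popular - recommend
print(classify("Textbook", 90))    # Not popular
print(classify("Journal", 85))     # Do not recommend
```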
Draw a decision tree to predict whether a student should opt for the Math stream or the Arts stream.
To opt for the Math stream, the student should secure above 90% in Physics and Chemistry as well as
distinction marks in Math; otherwise he/she must opt for the Arts stream.
Let us first look for redundant records. It is easier to find redundant records in a sorted data-set.
The sorted data-set looks like this, with the duplicate records in red. Remove them.
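In Python, the same idea looks like this; the records are hypothetical rows of the data-set:

```python
# Sorting brings duplicate records together, making them easy to remove.
records = [("Java Bible", "Reference"), ("C++ Basics", "Textbook"),
           ("Java Bible", "Reference")]   # hypothetical rows

records.sort()
cleaned = []
for row in records:
    if not cleaned or row != cleaned[-1]:   # skip consecutive duplicates
        cleaned.append(row)
print(cleaned)
```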
Training an AI system to recognise an image needs a rich collection of several different images of an object,
along with many images which are "not" that object. Image properties like colour depth, vectors, etc. are
used by the AI algorithm to train itself into identifying the intended object.
ACTIVITY: PIXEL IT
Learning: What you did is divide the image into 5 strips and match the set of strips to
identify the identical image. Computers match images pixel by pixel to identify the matching
image.
Variation: A similar activity can be done by replacing the images with pieces of paper on which you
write a large letter or a word in your own handwriting. This is how computers match
handwriting.
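A toy sketch of pixel-by-pixel matching in Python, with 0/1 values standing in for real pixel colours:

```python
# Two images match only if every pixel value is equal.
image_a = [[0, 1, 1],
           [1, 0, 1]]
image_b = [[0, 1, 1],
           [1, 0, 1]]

match = all(pa == pb
            for row_a, row_b in zip(image_a, image_b)
            for pa, pb in zip(row_a, row_b))
print("Identical" if match else "Different")   # Identical
```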
Evaluation and Deployment
The Evaluation stage follows the Modelling stage. By this time, the desired model has been trained with the
training data set and is ready for testing with the help of the testing data set. The testing data set is a
completely different, separate and new entity for the model being tested. The model has been trained with a
separate, relatively larger training data set compared to the testing one. In most cases the ratio of the
sizes of the training and testing data sets is 3:2, but it may vary on a case-to-case basis and with the domain.
The testing data set is prepared very carefully by trained professionals after exploring and cleaning the
raw big data. First, the results of testing are compared with those of training. If the results are satisfactory,
then the testing results are compared with actual data in the domain. If that is also satisfactory, then the
model is considered ready to be deployed.
In most cases, after passing this rigorous evaluation, the model is deployed in the real domain and
is monitored regularly for its performance over a period of time. After that, it is signed off as a completely
reliable and independent model.
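A minimal sketch of the 3:2 split in Python, on 100 hypothetical records:

```python
# Shuffle the data-set, then split it 60% / 40% (a 3:2 ratio).
import random

records = list(range(100))       # 100 hypothetical records
random.shuffle(records)

cut = int(len(records) * 0.6)
train, test = records[:cut], records[cut:]
print(len(train), len(test))     # 60 40
```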
Scenario
The problem area for which a model has been developed is called the scenario. The scenario is the reality in
which the real problem exists and in which the model has to be deployed. The model has to deal with the
scenario the way it has been trained. The scenario is the source of real data, which is fed into the model for
processing either at regular intervals (hourly, daily, weekly, etc.) or in real time. A regular-interval scenario
can be a less critical problem area with no emergent threat, such as pollution monitoring in a region (which
needs weekly or monthly data), studying a cancer patient for research, monitoring the diet and fitness of
sportspersons, or monitoring the performance of students and their study habits. Real-time scenarios represent
non-stop, critical, emergent, life-threatening or disastrous situations such as online transactions, war, floods,
monitoring a patient in critical care, or air, road, sea and rail traffic.
The nature of the scenario determines what is expected of a model in terms of robustness and reliability. The
parameters used to monitor and study the scenario determine the real performance of the model.
Two aspects are considered here – the prediction made by the machine, and the reality of the scenario
at the time the prediction is made.
There are 2 possibilities when the prediction made by the machine and the reality of the scenario are
compared. These 2 possibilities are listed here:
The prediction matches the reality.
The prediction does not match the reality.
Predictions are made in two terms – Yes and No (or True and False, or Positive and Negative).
For example, if a Flood Forecasting System (FFS) has to predict a flood, then the possibilities are Yes or
No. Another example is predicting whether the Air Quality Index of a region will be fatal next year or not.
Here also, the answer would be Yes or No.
Predictions of Yes are called Positive predictions. If such a prediction matches the reality, it is called a
True Positive prediction. For example, if the flood forecasting system predicts that there will be a flood
in a particular month in a region and that really happens, then it is a True Positive prediction.
What is a True Negative prediction then?
When the prediction is No and the event really does not occur, it is called a True Negative prediction.
For example, if the flood forecasting system predicts that there will be no flood in a particular month
in a region and the flood really does not happen, then it is a True Negative prediction.
The reverses of the above two possibilities are called False Positive and False Negative predictions.
A prediction of Yes that does not match the reality is called a False Positive prediction. For example,
if the flood forecasting system predicts that there will be a flood in a particular month in a region and
it really does not happen, then it is a False Positive prediction.
A prediction of No that does not match the reality is called a False Negative prediction. For example,
if the flood forecasting system predicts that there will be no flood in a particular month in a region
and a flood does happen, then it is a False Negative prediction.
Confusion Matrix
The above permutations can be summarised neatly in a tabular structure called the confusion matrix, also
known as the error matrix. A confusion matrix is a tabular representation used to visualise the performance
of an algorithm or model. Confusion matrices are mostly useful for supervised learning models.
The Confusion Matrix shows all 4 permutations. Let us summarise them once again.
True Positive: Prediction is yes and it is true. (E.g. flood predicted and it did occur)
True Negative: Prediction is no and it is true. (E.g. flood not predicted and it didn't occur)
False Positive: Prediction is yes and it is false. (E.g. flood predicted but it didn't occur)
False Negative: Prediction is no and it is false. (E.g. flood not predicted but it occurred)
In the confusion matrix table, the reality parameters are kept at the top and the prediction parameters on the left-hand side.
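The four counts can be computed from the predictions and the reality with a few lines of Python; the flood predictions below are hypothetical:

```python
# Counting the four permutations of a confusion matrix.
predicted = ["Yes", "No", "Yes", "No",  "Yes", "No"]
reality   = ["Yes", "No", "No",  "Yes", "Yes", "No"]

tp = tn = fp = fn = 0
for p, r in zip(predicted, reality):
    if p == "Yes" and r == "Yes":  tp += 1   # True Positive
    elif p == "No" and r == "No":  tn += 1   # True Negative
    elif p == "Yes" and r == "No": fp += 1   # False Positive
    else:                          fn += 1   # False Negative

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)   # TP: 2 TN: 2 FP: 1 FN: 1
```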
Model Deployment
The purpose of evaluation is to assess whether the model is fit for use, as desired, in the real-life
problem area. Once the model is found to be fit to perform, it is deployed in the real problem area
for all the beneficiaries to use.
Forms of model deployment: The model is deployed in various forms depending on the need. Some
common forms are:
A fully-functional software application.
Part of a software library, in the form of modules or packages.
A fully-functional web service.
An on-demand web service where users subscribe only to selected features.
A mobile application.
A combination of software and hardware, such as a driving simulator, an AI-robotics solution or a
self-driven vehicle.
Post-deployment optimisations: Over a period of time, the performance of the deployed model is
monitored against certain performance and ethical parameters. Its performance data is gathered and
analysed. Feedback is also collected from the users on whether the model is easily accessible and
usable by all. On the basis of these analyses, further improvements are suggested and made to the
features and working of the model, making it more robust and efficient. In this way, in the long term,
the model is completely accepted into the system for regular use.
A problem scope is the mutual understanding among all stakeholders about what is to be done to solve
the problem.
AI project stages include problem scoping, data acquisition, data exploration, modelling and
evaluation.
Problem scoping gives a clear vision of the problem which is otherwise very abstract and
undefined.
Problem scoping covers 4Ws – Who (stakeholders), What (the problem), Where (Context
of the problem) and Why (Rationale of the solution).
Problem scoping includes identifying and defining the problem and goals to achieve.
Logical relationship between data values generates meaningful information.
Data used to train the machine is called training data.
Data used to evaluate the performance of the machine is called testing data.
Data quality refers to the accuracy and relevance of data.
A system map is a tool to show the relationship among various elements of a problem area
in a graphical form.
Data exploration is used to obtain basic understanding of data to determine its suitability
for training AI algorithm.
Structured data is usually found in symmetrical documents like databases.
Unstructured data does not have a predefined, fixed model like images, videos, etc.
Treatment of missing values is necessary to prevent the AI system from becoming biased.
Feature engineering is the technique of extracting useful information from the data-sets.
One of the most common and efficient ways to visualise data is through graphs and charts.
Data visualisation includes comparing data, establishing relationships among data values
and finding distribution and composition of data.
ML is a sub-set of AI, and Deep Learning means learning from immense volumes of data.
AI modelling has two approaches – rule-based and learning-based.
Rule-based approach involves a set of rules and facts.
The learning-based approach is useful where predictions are based on a number of factors
which are difficult or impossible to analyse manually.
Decision tree is the simplest yet efficient means to model the data using rule-based
approach.
In evaluation stage, the AI model is tested for its accuracy and reliability.
If AI model passes the evaluation then, it is deployed in the problem area.
A model can be deployed as software, mobile app, web service, etc.
Abstract: Having no definite shape or definition, brief and unclear.
Scope: Working boundary of a problem area, premise of proposed solution.
Resource: Person, tools and infrastructure required to work on a project.
Budget: Projection of expected expenses and allocation of finances to different parts of an
organisation or project.
Context: Problem area that needs the solution.
Problem statement: A short, concise description of the problem.
Goal: A piece of problem-solving purpose to be achieved.
Data feature: Data type such as number, text, date, image, etc.
Training Data: Data used to train the machine.
Testing data: Data used to test the AI model after training.
Visualisation: To create a pictorial representation of data.
Raw data: Original form of data collected from various sources.
XML: eXtensible Markup Language, used to describe document structure and data.
Parameter: A variable identifying a category of data such as Price, Colour, Weight, etc.
c. Plan d. Organisation
c. Plan d. Goals
c. Plan d. Activities
4. Existing businesses are suitable for AI implementation due to bulk or immense ____________.
a. Scope b. Data
c. Revenue d. Information
5. Data exploration is followed by ___________________.
a. Modelling data b. Data acquisition
c. Optional d. Random
7. Problem scoping gives a clear vision of otherwise ________________ problem.
a. Distinct b. Specific
c. Large d. Undefined
8. Problem statement includes ____________________________________________________.
a. The problem description b. Proposed solution
c. Primary d. Programming
13. Data acquisition gives us _____________ data while data exploration provides us ____________
data.
a. Raw, training b. Raw, testing
14. Data exploration is necessary because data available from data ________________ cannot be
used directly for machine learning.
a. Scoping b. Acquisition
c. Analysis d. Processing
15. A data-set containing names of 200 students and their marks in 5 subjects is an example of
_______________ data.
a. Structured b. Unstructured
c. Big d. Raw
16. A data-set containing 1000 samples of handwriting images is an example of ______________ data.
a. Structured b. Unstructured
c. Big d. Visual
17. In a data-set, the amount paid is not there in some records. These values are called ____________
values and can be _____________________.
a. Numeric, calculated b. Missing, omitted
c. Efficient d. Biased
19. Extracting hidden insight from the existing data is called ______________________.
a. Artificial engineering b. Data engineering
23. To view percentage contribution of parts in a whole, ____________ chart is suitable.
a. Pie b. Waterfall
c. Line d. Bar
24. Stacked column chart is suitable to view ___________________.
a. Distribution b. Composition
c. Pixel d. Logical
27. In rule-based approach, ______________ are defined as sets of if-else.
a. Rules b. Facts
c. Relationships d. Data
28. In rule-based approach, ______________ are defined by the help of data.
a. Rules b. Facts
c. Relationships d. References
29. In rule-based approach, the intelligence of the machine is defined by the _____________.
a. Developer b. Humans
B. Fill in the blank.
Why, Goals, What, Scope, Data, Solution, Project, Bottom, Top, Data exploration, Rule, Stakeholders,
Web scraping, Spreadsheet, Bias, Complex, ANNs, AI, Prediction, ML, Learning
6. Problem scoping can be done at any point of time while developing a solution.
7. Smaller projects do not need problem scoping.
8. Problem is identified in problem scoping.
9. Vision and outcome of a problem-solving exercise are different.
10. A problem statement needs to be in fine details.
11. Line chart is useful in showing the distribution of data values as part of a whole.
9. Give one consequence of missing value in the training data.
10. What do you mean by feature engineering?
11. List any 4 types of data visualisations and their use.
12. How are ML and DL related to AI?
13. Give two examples of a rule-based learning system.
14. Give two examples of learning-based machine learning model.
15. Name the final stage of AI Project Cycle. What is the use of this stage?
COMPETENCY-BASED QUESTIONS
A. Consider the following scenario and answer the questions that follow:
An educational institute wants to enhance the learning experience of its students. For this
purpose, it has conceived of a learning mobile app which students can use to ask questions regarding
any topic in the course they are pursuing. It should also be able to learn from the questions asked
by the students and relate the topics in the course in such a way that if a student asks about one topic,
it also displays links to other related topics and lists the questions that have been asked on that
topic in past years. This will help students quickly revise the topics and learn on the move, anytime
and anywhere. This will also reduce the load on the teachers in helping students revise the lessons.
1. Create the Problem Statement Template for the given scenario.
2. What major data feature do you suggest would be most useful for the learning app?
B. Make the decision tree for the given scenario.
A furniture trader needs to predict whether the sale of furniture will be high in a region of the country,
considering the 4 parameters given in the data-set below. Sales are higher during festivals. Sales go
down during rainy seasons due to problems like transport and damage by water. A higher number of
dealers promises higher sales. Sales can be high or low depending on the reviews by the customers of
a region. Create a decision tree on this data-set. Also, check whether there are any redundant records
and eliminate them.
FESTIVAL    RAINY SEASON    NUMBER OF DEALERS    CUSTOMER REVIEWS IN THE REGION    SALE
CASE STUDY/SCENARIO-BASED QUESTIONS
Consider the following scenarios and answer the questions that follow them:
A. In this stage the data is prepared to train the AI model by fixing missing values, setting up the data
values in suitable formats, and giving a proper structure to the data. Then the data is split into training
data and testing data. The data needs to be relevant to the problem statement and goals set in the
initial phase of the project. The training and testing of the AI model continue until the model passes the
evaluation phase.
1. In which stage of the AI Project Cycle is the training data prepared?
2. Which stage of the AI Project Cycle is used to train the AI model?
3. Name the initial and final phases of the AI Project Cycle.
4. Training data should have a few missing values. (T/F)
5. Which phase of AI Project Cycle has not been discussed in this scenario?
B. Raj is an AI project manager. For an eye hospital, he has to develop an AI-enabled model and a medical
expert system which scans the patient's eye through a web cam and generates a diagnosis of cataract
(blurry vision due to cloudiness of the eye lens). The hospital is not able to cope with the rising number of
patients, and several hundred patient applications are on the wait list. The AI model will examine the eye and
generate several parameters which will make up the diagnosis report. This report will be sent to an expert
system which reads the data and images of the patient's eye and suggests possible medication. The expert
system also notifies the medical officer about the expert doctors for the given diagnosis who are free to
take the patient's case. The patient is also informed through email which doctor has been allocated to him/her.
All the data required to develop the solution is in various files on the hospital's web site and in the hospital
database. The hospital has data of 2,00,000 cataract patients from the past 20 years. The data will be explored,
and the cases of 60% of the patients will be used to train the AI model. The rest of the data will be used to
evaluate the model. This will help the hospital deal with the growing number of patients quickly and efficiently.
1. Write the problem statement of this scenario.
2. What value addition will the solution do to the hospital?
3. Which technique can Raj use to pull data from the hospital's web site?
4. There is some structured data in the hospital in the form of _______________________.
5. Data of how many patients will be used to test the AI model?
6. If there are any missing values in the training data, then what could be the consequences of deploying
the AI model in reality?
LAB ACTIVITY
1. Visit dzone.com/articles/twelve-types-of-artificial-intelligence-ai-problem and find out what kind
of problems AI is addressing presently.
2. Visit the following links and observe how AI works with various types of real-life data.
Document your observations to discuss in class.
experiments.withgoogle.com/objectifier-spatial-programming
experiments.withgoogle.com/ai/bird-sounds/view
experiments.withgoogle.com/ai/drum-machine/view
3. Visit https://datavizcatalogue.com/ and try making Bubble Chart, Bar Chart, Calendar, Line
Chart, TimeTable and Tree Diagram with your own datasets.
Take screenshots of the charts and paste them in a Word document.
4. Visit www.toptal.com/designers/data-visualization/data-visualization-tools and find out the
features of at least 10 distinct data visualisations.
Download Dia from dia-installer.de. Use this tool to make decision trees in this chapter.
Standard toolbar: It provides quick tools like new, open, save, print and export diagram, etc.
Diagram Canvas: It is the largest area, meant for drawing.
Toolbox: All the blocks and connectors you need are available in the Toolbox. It also provides
options for line widths and arrow styles.