Chapter 3 Describe Features of Computer Vision Workloads On Azure - Exam Ref AI-900 Microsoft Azure AI Fundamentals
Computer vision is the processing of still images and video streams to extract information from them. Computer vision can interpret an image and provide detail and understanding about it in a computer-readable form. Computers can take this information and perform further processing and analysis. Many applications use computer vision to enhance the user experience or to capture information about objects and people.
Cognitive Services are available as a set of REST APIs that can easily be
deployed and consumed by applications. Essentially, Cognitive Services
are off-the-shelf services that help you develop an AI-based solution more
quickly and with less specialist expertise.
Cognitive Services are a family of AI services and APIs that you can use to build intelligent solutions. Cognitive Services enable applications to see, hear, speak, search, understand, and begin to make decisions.
Decision
Language
Speech
Vision
Web search
The services in the Decision group help you make smarter decisions.
The services in the Language group extract meaning from unstructured text.
The services in the Speech group allow you to add speech processing to your apps.
The services in the Vision group help you extract information from images and videos.
The services in the Web Search group allow you to use the Bing search engine to search millions of webpages for images, news, product, and company information. These services have been moved from Cognitive Services to a separate service, Bing Web Search.
As you can see, Cognitive Services consist of a broad and growing set of AI services. A common feature of these services is that they require no training and can easily be consumed by applications with a REST API call.
We will now look at how you can deploy Cognitive Services in Azure.
Cognitive Services are easily deployed in Azure as resources. You can use
the Azure portal, the CLI, or PowerShell to create resources for Cognitive
Services. There are even ARM templates available to simplify deployment
further.
Figure 3-1 shows the service description for the Cognitive Services
multi-resource service.
After clicking on the Create button, the Create Cognitive Services pane
opens, as shown in Figure 3-2.
FIGURE 3-2 Creating a Cognitive Services resource
You will need to select the subscription, resource group, and region
where the resource is to be deployed. You will then need to create a
unique name for the service. This name will be the domain name for your
endpoint and so must be unique worldwide. You should then select your
pricing tier. There is only one pricing tier for the multi-service resource,
Standard S0.
Clicking on Review + create will validate the options. You then click on
Create to create the resource. The resource will be deployed in a few
seconds.
You can create a Cognitive Services resource using the CLI as follows:
You can create a Computer Vision resource using the CLI as follows:
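The original commands are not shown here, but a minimal sketch of each, assuming the `az cognitiveservices account create` command with placeholder resource names, resource group, and region, might look like this:

```shell
# Sketch: create a multi-service Cognitive Services resource.
# All names and the region below are placeholders.
az cognitiveservices account create \
    --name my-cognitive-services \
    --resource-group my-resource-group \
    --kind CognitiveServices \
    --sku S0 \
    --location westeurope \
    --yes

# Sketch: create a single-service Computer Vision resource.
az cognitiveservices account create \
    --name my-computer-vision \
    --resource-group my-resource-group \
    --kind ComputerVision \
    --sku S1 \
    --location westeurope \
    --yes
```

The `--kind` parameter selects between the multi-service resource (`CognitiveServices`) and a single service such as `ComputerVision`.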
Once your resource has been created, it will have a unique endpoint for the REST API and authentication keys. You will need these details to use Cognitive Services from your app.
To view the endpoint and keys in the Azure portal, navigate to the resource and click on Keys and Endpoint, as shown in Figure 3-3.
FIGURE 3-3 Keys and Endpoint
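As a sketch of how an app uses these details, the snippet below builds the URL and headers for a call to the analyze operation. The endpoint and key values are placeholders; the `Ocp-Apim-Subscription-Key` header is the standard authentication header for Cognitive Services.

```python
# Sketch: building an authenticated request to the Computer Vision
# analyze operation. The endpoint and key are placeholders; real
# values come from the Keys and Endpoint pane in the portal.

def build_analyze_request(endpoint, key, features):
    """Return the request URL and headers for the v3.1 analyze operation."""
    url = f"{endpoint}/vision/v3.1/analyze?visualFeatures={','.join(features)}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,    # Cognitive Services auth header
        "Content-Type": "application/json",  # body carries {"url": "<image url>"}
    }
    return url, headers

url, headers = build_analyze_request(
    "https://my-cognitive-services.cognitiveservices.azure.com",
    "<your-key>",
    ["Categories", "Description", "Objects"],
)
```

An HTTP library such as `requests` would then POST the image URL in the body to this endpoint with these headers.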
EXAM TIP
For more information on using Cognitive Services with containers, see https://docs.microsoft.com/azure/cognitive-services/cognitive-services-container-support.
Understand computer vision
Computer vision is the interaction with the world through visual perception. Computer vision processes still images and video streams to interpret the images, providing details and understanding about them.
Computer vision makes it easy for developers to process and label visual content in their apps. The Computer Vision service API can describe objects in images, detect the existence of people, and generate human-readable descriptions and tags, enabling developers to categorize and process visual content.
Some other key features of computer vision include the ability to:
Categorize images
Determine the image width and height
Detect common objects including people
Analyze faces
Detect adult content
Describe an image
Categorize an image
Tag an image
Object detection will also return the coordinates for a box surrounding
a tagged visual feature. Object detection is like image classification, but
object detection also returns the location of each tagged object in an
image.
Figure 3-5 shows an example of object detection. Three cats have been
identified as objects and their coordinates indicated by the boxes drawn
on the image.
Using OCR, you can extract details from invoices that have been sent electronically or scanned from paper. These details can then be validated against the expected details in your finance system.
For example, the text extracted from a printer's product label might read:
220-240V ~AC
hp
LaserJet Pro M102w
Europe - Multilingual localization
Serial No.
VNF 4C29992
Product No.
G3Q35A
Option B19
Regulatory Model Number
SHNGC-1500-01
Made in Vietnam
Facial detection can provide a series of attributes about a face it has detected, including whether the person is wearing eyeglasses or has a beard. Facial detection can also estimate the type of eye covering, including sunglasses and swimming goggles.
Detect faces
Analyze facial features
Recognize faces
Identify famous people
The facial detection identified the face, drew a box around the face, and supplied details such as wearing glasses, neutral emotion, not smiling, and other facial characteristics.
Now that you have learned about the concepts of computer vision, let’s
look at the specific Computer Vision services provided by Azure Cognitive
Services.
EXAM TIP
You will need to be able to distinguish between the Computer Vision,
Custom Vision, and Face services.
Analyze image
The analyze operation extracts visual features from the image content. The image can either be uploaded or, more commonly, you can specify a URL where the image is stored.
You specify the features that you want to extract. If you do not specify
any features, the image categories are returned.
The URL for the image is contained in the body of the request.
The visual features that you can request are the following:
The Computer Vision service only supports file sizes less than 4MB. Images must be greater than 50x50 pixels and be in JPEG, PNG, GIF, or BMP format.
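These constraints can be checked before an image is submitted. The sketch below is a simple pre-validation helper; the function name and parameters are illustrative, not part of the service API.

```python
# Sketch: pre-validating an image against the documented Computer
# Vision input limits (under 4 MB, larger than 50x50 pixels, and in
# JPEG, PNG, GIF, or BMP format).

SUPPORTED_FORMATS = {"jpeg", "png", "gif", "bmp"}
MAX_BYTES = 4 * 1024 * 1024
MIN_DIMENSION = 50

def is_acceptable(size_bytes, width, height, fmt):
    """Return True if the image meets the service's input constraints."""
    return (
        size_bytes < MAX_BYTES
        and width > MIN_DIMENSION
        and height > MIN_DIMENSION
        and fmt.lower() in SUPPORTED_FORMATS
    )
```

Rejecting oversized or unsupported files locally avoids a failed round trip to the service.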
Below is the JSON returned for the image of the three cats used earlier in this chapter when requesting the adult, color, and imageType features.
"categories": [{
Describe image
https://{endpoint}/vision/v3.1/describe[?maxCandidates][&language]
Following is the JSON returned for the image of the three cats used earlier in this chapter:
"description": {
Detect objects
https://{endpoint}/vision/v3.1/detect
Following is the JSON returned for the image of the three cats used earlier in this chapter:
"objects": [{
The detect operation identified three cats with a high level of confidence and provided the coordinates for each cat.
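A typical next step is to pull the labels, confidence scores, and bounding boxes out of the response. The sketch below mimics the shape of a detect response; the coordinate and confidence values are illustrative, not the actual figure's values.

```python
# Sketch: extracting tagged objects and their bounding boxes from a
# detect response. sample_response mimics the response shape only.

sample_response = {
    "objects": [
        {"rectangle": {"x": 10, "y": 20, "w": 100, "h": 120},
         "object": "cat", "confidence": 0.92},
        {"rectangle": {"x": 150, "y": 30, "w": 90, "h": 110},
         "object": "cat", "confidence": 0.88},
    ]
}

def extract_detections(response, min_confidence=0.5):
    """Return (label, confidence, (x, y, w, h)) for each confident detection."""
    results = []
    for obj in response.get("objects", []):
        if obj["confidence"] >= min_confidence:
            r = obj["rectangle"]
            results.append(
                (obj["object"], obj["confidence"], (r["x"], r["y"], r["w"], r["h"]))
            )
    return results
```

The rectangles returned here are what an application would use to draw the boxes shown in Figure 3-5.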
Content tags
The tag operation generates a list of tags based on the image and the objects in it. Tags are based on objects, people, and animals in the image, along with the setting of the scene in the image.
Following is the JSON returned for the image of the three cats used earlier in this chapter:
"tags": [{
Domain-specific content
There are two models in Computer Vision that have been trained on specific sets of images:
https://{endpoint}/vision/v3.1/models/{model}/analyze[?language]
The analyze operation can also detect commercial brands from images
using a database of thousands of company and product logos.
Thumbnail generation
https://{endpoint}/vision/v3.1/generateThumbnail[?width][&height][&smartCropping]
OCR is the extraction of printed or handwritten text from images. You can
extract text from images and documents.
Read The latest text recognition model that can be used with images
and PDF documents. Read works asynchronously and must be used
with the Get Read Results operation.
OCR An older text recognition model that supports only images and
can only be used synchronously.
https://{endpoint}/vision/v3.1/read/analyze[?language]
https://{endpoint}/vision/v3.1/ocr[?language][&detectOrientation]
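Because Read works asynchronously, an application submits the image and then polls Get Read Results until the operation reaches a terminal status. The sketch below shows that polling loop; `get_status` stands in for the GET call against the operation's result URL, and the status values follow the Read API convention.

```python
import time

# Sketch: polling pattern for the asynchronous Read operation. The
# POST returns an operation URL; get_status stands in for a GET
# against that URL returning a dict with a "status" field.

def poll_read_result(get_status, interval=1.0, max_tries=30):
    """Poll until the Read operation reports succeeded or failed."""
    for _ in range(max_tries):
        result = get_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval)
    raise TimeoutError("Read operation did not finish in time")
```

In a test or simulation, `get_status` can simply be a stub that returns "running" a few times before "succeeded".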
The JSON returned includes the pieces of text from the image, as
shown next:
"regions": [{
OCR only extracts the text it identifies. It does not provide any context
to the text it extracts. The results are simply pieces of text.
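Since the OCR response nests text under regions, lines, and words, a common first step is flattening it back into plain lines of text. The sketch below assumes the regions/lines/words shape shown above; the sample values echo the printer-label example.

```python
# Sketch: flattening the OCR operation's nested regions -> lines ->
# words structure back into plain strings, one per line of text.

def flatten_ocr(response):
    """Join each OCR line's words into a single string."""
    lines = []
    for region in response.get("regions", []):
        for line in region.get("lines", []):
            lines.append(" ".join(word["text"] for word in line.get("words", [])))
    return lines

sample = {"regions": [{"lines": [
    {"words": [{"text": "LaserJet"}, {"text": "Pro"}, {"text": "M102w"}]},
    {"words": [{"text": "Made"}, {"text": "in"}, {"text": "Vietnam"}]},
]}]}
```

Any interpretation of the text, such as matching a serial number, is left to the application.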
Content moderation
The analyze operation can identify images that are risky or inappropriate. The Content Moderator service, although not part of Computer Vision (it is in the Decision group of APIs), is closely related to it.
You can then deploy your model with an endpoint and key and consume this model in your apps in a similar way to the Computer Vision service.
The following steps take you through creating a custom object detection
model to identify fruit from images.
We will use the fruits dataset that you can download from
https://aka.ms/fruit-objects. Extract the image files. There are 33 images,
as shown in Figure 3-10.
You will need to use 30 of the images to train your model, so keep three
images for testing your model after you have trained it.
First, you need to create a Custom Vision service. Figure 3-11 shows the
pane in the Azure portal for creating a Custom Vision service.
FIGURE 3-11 Creating a Cognitive Services resource
Clicking on Review + create will validate the options. You then click on Create to create the resource. If you selected Both, two resources will be deployed: the Training resource using the name you provided, and the Prediction resource using that name with "-Prediction" appended.
You can create Custom Vision resources using the CLI as follows:
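The original commands are not shown here, but a sketch using the `az cognitiveservices account create` command with the Custom Vision training and prediction kinds (names, group, and region are placeholders) might look like this:

```shell
# Sketch: create the Custom Vision training resource.
az cognitiveservices account create \
    --name my-custom-vision \
    --resource-group my-resource-group \
    --kind CustomVision.Training \
    --sku F0 \
    --location westeurope \
    --yes

# Sketch: create the matching Custom Vision prediction resource.
az cognitiveservices account create \
    --name my-custom-vision-prediction \
    --resource-group my-resource-group \
    --kind CustomVision.Prediction \
    --sku F0 \
    --location westeurope \
    --yes
```

Unlike the portal's Both option, the CLI requires creating the training and prediction resources as two separate commands.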
You will need to create a new project, name it, and select your Custom Vision training resource (or you can use a multi-service Cognitive Services resource).
Next, you should select Object Detection as the Project Type and
General for the Domain, as shown in Figure 3-12.
FIGURE 3-12 New Custom Vision project
The domain is used to train the model. You should select the most relevant domain that matches your scenario. You should use the General domain if none of the domains are applicable.
The image classification domains are:
General
Food
Landmarks
Retail
General (compact)
Food (compact)
Landmarks (compact)
Retail (compact)
General [A1]
General (compact) [S1]
The object detection domains are:
General
Logo
Products on Shelves
General (compact)
General (compact) [S1]
General [A1]
Compact domains are lightweight models that are designed to run locally, for example, on mobile platforms.
Once the project is created, you should create your tags. In this exercise, you will create three tags:
Apple
Banana
Orange
Next, you should upload your training images. Figure 3-13 shows the
Custom Vision project with the images uploaded and untagged.
You now need to click on each image. Custom Vision will attempt to identify objects and highlight the object with a box. You can adjust and resize the box and then tag the objects in the image, as shown in Figure 3-14.
FIGURE 3-14 Tagging objects
You will repeat tagging the objects for all the training images.
You will need at least 10 images for each tag, but for better performance, you should have a minimum of 30 images. To train your model, you should have a variety of images with different lighting, orientation, sizes, and backgrounds.
Select the Tagged button in the left-hand pane to see your tagged
images.
You are now ready to train your model. Click on the Train button at the top of the project window. There are two choices: Quick Training and Advanced Training.
You can use the Quick Test option to check your model. You should upload one of the three images you put aside. The image will be automatically processed, as shown in Figure 3-16.
The model has identified both the apple and the banana and drawn
boxes around the pieces of fruit. The objects are tagged, and the results
have high confidence scores of 95.2% and 73.7%.
To publish your model, click on the Publish button at the top of the
Performance tab shown in Figure 3-16. You will need to name your model
and select a Custom Vision Prediction resource.
NOTE PUBLISHED ENDPOINT
You cannot use a multi-service Cognitive Services resource for the published endpoint.
Publishing will generate an endpoint URL and key so that your applications can use your custom model.
Object detection
Image classification
Content moderation
Optical character recognition (OCR)
Facial recognition
Landmark recognition
Custom Vision uses images and tags that you supply to train a custom image recognition model. Custom Vision has only two of these capabilities:
Object detection
Image classification
Facial recognition has many use cases, such as security, retail, aiding
visually challenged people, disease diagnosis, school attendance, and
safety.
Gender
Age
Emotions
Similarity matching
Identity verification
The Face service can be deployed in the Azure portal by searching for Face when creating a new resource. You must select your region and resource group, provide a unique name, and select the pricing tier: Free F0 or Standard S0.
Detection
The Face service detects the human faces in an image and returns the coordinates of a box around each face. Face detection extracts face-related attributes, such as head pose, emotion, hair, and glasses.
https://{endpoint}/face/v1.0/detect[?returnFaceId][&returnFaceLandmarks][&returnFaceAttributes][&recognitionModel][&detectionModel]
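The optional query parameters above are assembled into the request URL by the caller. The sketch below builds such a URL; the endpoint and the attribute names passed in are illustrative examples, and `returnFaceAttributes` takes a comma-separated list.

```python
# Sketch: assembling the detect request URL with optional query
# parameters. The endpoint and attribute names are examples only.

def build_detect_url(endpoint, attributes, return_face_id=True):
    """Build the Face detect URL with the chosen query parameters."""
    params = [f"returnFaceId={'true' if return_face_id else 'false'}"]
    if attributes:
        params.append("returnFaceAttributes=" + ",".join(attributes))
    return f"{endpoint}/face/v1.0/detect?" + "&".join(params)

url = build_detect_url(
    "https://my-face-resource.cognitiveservices.azure.com",
    ["age", "gender", "smile", "glasses"],
)
```

The image itself (or its URL) is then sent in the request body, as with the Computer Vision operations.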
The detection model returns a FaceId for each face it detects. This Id
can then be used by the face recognition operations described in the next
section.
The JSON returned using the detect operation on the image of the author in Figure 3-7 is shown next:
{ "faceId": "aa2c934e-c0f9-42cd-8024-33ee14ae05af",
"smile": 0.011,
"gender": "male",
"age": 53.0,
"glasses": "ReadingGlasses",
As you can see, the attributes are mainly correct except for the hair color. This level of accuracy is expected, as the image in Figure 3-7 was a professionally taken photograph with good exposure and a neutral expression.
Recognition
The Face service can recognize known faces. Recognition can compare
two different faces to determine if they are similar (Similarity matching)
or belong to the same person (Identity verification).
EXAM TIP
Ensure that you can determine the scenario for each of the four facial
recognition operations.
Computer Vision
Face
Video Analyzer for Media
EXAM TIP
Computer Vision can detect faces in images but can only provide basic information about the person from the image of the face, such as the estimated age and gender.
The Face service can detect faces in images and can also provide information about the characteristics of the face. The Face service can also perform the following:
Facial analysis
Face identification
Pose detection
The Video Analyzer for Media service can detect faces in video images
but can also perform face identification.
The Face API can detect the angle at which a head is posed. Computer Vision can detect faces but is not able to supply the angle of the head.
Video Analyzer for Media can detect faces but does not return the attributes the Face API can return.
The Face API service is concerned with the details of faces. The Video
Analyzer for Media service can detect and identify people and brands
but not landmarks.
Custom Vision allows you to specify the labels for an image. The other
services cannot.
Computer Vision can identify landmarks in an image. The other services cannot.
NOTE FORM RECOGNIZER
Form Recognizer can extract text, key-value pairs, and tabular data as
structured data that can be understood by your application.
Business cards
Invoices
Receipts
You can create Form Recognizer resources using the CLI as follows:
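The original command is not shown here, but a sketch using the `az cognitiveservices account create` command with the FormRecognizer kind (name, group, and region are placeholders) might look like this:

```shell
# Sketch: create a Form Recognizer resource. Names and the region
# below are placeholders.
az cognitiveservices account create \
    --name my-form-recognizer \
    --resource-group my-resource-group \
    --kind FormRecognizer \
    --sku F0 \
    --location westeurope \
    --yes
```

As with the other services, the resource's endpoint and keys are then available from the Keys and Endpoint pane.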
Receipt Type: Itemized
Merchant: Contoso
Phone number: +19876543210
Date: 2019-06-10
Time: 13:59:00
Subtotal: 1098.99
Tax: 104.4
Total: 1203.39
Line items:
Item Quantity: 1
Total Price: 999.00
Item Quantity: 1
Total Price: 99.99
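Extracted fields like those above lend themselves to simple consistency checks before they are accepted into a finance system. The sketch below validates that the subtotal plus tax matches the total for the receipt values shown; the dictionary shape is illustrative, not the Form Recognizer response format.

```python
# Sketch: a consistency check on extracted receipt fields, confirming
# that subtotal plus tax equals the total (within a small tolerance
# to allow for floating-point rounding).

receipt = {"subtotal": 1098.99, "tax": 104.40, "total": 1203.39}

def totals_consistent(r, tolerance=0.01):
    """Return True if subtotal + tax matches the extracted total."""
    return abs(r["subtotal"] + r["tax"] - r["total"]) <= tolerance
```

A mismatch would flag the document for manual review rather than automatic processing.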
There are three services that perform an element of text extraction from
images:
OCR
Read
Form Recognizer
The older OCR operation can only process image files. OCR can only extract simple text strings. OCR can interpret both printed and handwritten text.
The Read operation can process images as well as multi-page PDF documents. Read can interpret both printed and handwritten text.
The Form Recognizer service can extract structured text from images and multi-page PDF documents. Form Recognizer recognizes form fields; it is not just text extraction.
Chapter summary
In this chapter, you learned some of the general concepts related to computer vision. You learned about the types of computer vision, and you learned about the services in Azure Cognitive Services related to computer vision. Here are the key concepts from this chapter:
Thought experiment
Let’s apply what you have learned in this chapter. In this thought experiment, demonstrate your skills and knowledge of the topics covered in this chapter. You can find the answers in the section that follows.
The app requests that customers take a photo of the driver after an incident. The app also requests that customers take several pictures of the scene of an incident, showing any other vehicles involved and the street. Customers are able to upload dashcam videos as evidence for their claims. Customers can also upload scanned images of their claim forms that also contain a diagram explaining the incident.
Insurance adjustors have a mobile app where they can assess and document vehicle damage. Fabrikam wants the app to assess the cost of repairs based on photographs and other information about the vehicle.
This section contains the solutions to the thought experiment. Each answer explains why the answer choice is correct.