Multimedia Information Retrieval

© Stephane.Marchand-Maillet@unige.ch – University of Geneva – Multimedia Retrieval – September 2015 - 1
Multimedia Information Retrieval
Stephane Marchand-Maillet
Viper group
University of Geneva
Switzerland
http://viper.unige.ch

• What is your background?
– Computer Science
– Scientific
– Humanities
• This lecture introduces intuition from
examples
– More technical details in recommended reading
Quick get-to-know

• To introduce (recall, rehearse) some key
principles and techniques of IR
• To show how these are adapted to Multimedia
• To look at the Multimedia context
• To see how to best face it
Objectives of the lecture

Outline
• Motivation and context
• Preliminary material on IR
• Audio-visual IR
• Complements
• Evaluation
4
Note: Several illustrations from within these slides have been borrowed from the Web, including Wikipedia or teaching
material. Please do not reproduce without permission from the respective authors. When in doubt, don't.

Data Production
• Growth of Data
– 1,250B GB (=1.2EB) of data generated in
2010.
– Data generation growing at a rate of 58%
per year
• Baraniuk, R., “More is Less: Signal Processing and the
Data Deluge”, Science, V331, 2011.
1 exabyte (EB) = 1,073,741,824 gigabytes
0
2000
4000
6000
8000
10000
2010 2011 2012 2013 2014
DataSize(EB)
Data Generation Growth
http://www.intel.com/conte
nt/www/us/en/communicati
ons/internet-minute-
infographic.html
http://www.ritholtz.com/blog
/2011/12/financial-industry-
interconnectedness/
Internet
Scientific
Industry
Data
By Sverre Jarp,
By Felix'Schürmann
© Copyright attached

A digital world
[From http://1000memories.com/blog/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox]

[Picture from: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html]
Data communication

User “productivity”
[Picture from: http://www.go-gulf.com/wp-content/themes/go-gulf/blog/online-time.jpg - Feb 2012]

Motivation
• Decision making requires informed
choices
• Data production ever easier
• The information is often
overwhelming
– « Big Data » trend
• The information is often not easy to
manage and access
We need to bring a structure to the raw data
• Document (data) representation
• Similarity measurements
• Further analysis: data mining, information retrieval, learning

Components of Data Quality
• Validity
• Integrity
• Completeness
• Consistency
• Relevance
• Accuracy
and…
• Timeliness
• Interpretability
• Reliability
• …
 Applies on all media

• Searching for multimedia
– Is it like searching text?
• Information retrieval model
• Representing documents
– BoW and TF*IDF
• Indexing model
– Inverted files
– hashing
Preliminary

Searching for multimedia
Basic solution:
• Attach text to multimedia and search for text
– Tags, keywords
– Description
“Multimedia Annotation”
– Domain of “Knowledge Management”
– Requires the definition of Ontologies, Taxonomies,…
Human in the loop
– Not scalable, not feasible

[Picture from: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html]
Per minute rate (some years ago!)

• From the sensing device
– Device type, model
– Sensing conditions (flash/no flash)
– Author (device owner?)
– …
• From the environment
– Time, date, location
– Environment conditions (inside/outside, weather)
– Spoken languages
– …
• More? From the content?
– Later in this lecture
Note: Can never be exhaustive
Auto-annotation

• “Wasp on a yellow flower”
• Specific (latin) name of the flower? Of the insect?
• What is the insect exactly doing?
• Weather looks nice: “countryside scene”?
• The picture may be arty:
“2015 1st price college photo competition winner”
Inference cannot resolve all
• “The wasp who scared my daughter during our holiday in
Gruyere”
Annotation

• Exif: standard picture annotation container
• ID3-tag: standard audio information
• NTFS: window-based metadata exchange
• JPSearch, MPEG 7: Search-oriented description
standards
• Every format will have a text header to insert
“comments” (PNG, GIF, JPG, TIFF,…)
 Complementarity text-based / content-based
• Other category is recommender-based
– “social” annotation
Annotating mulitmedia

MPEG-7 Still Image Description

IBM MPEG-7 Annotation Tool (video/visual stream)
[http://www.research.ibm.com/VideoAnnEx/]

• Given a query (information need)
• Given a repository (information source)
 Infer the most relevant documents from the repository as an
answer to the query
The most used query type is Query-by-Example
“I want something like this”
• Descriptive query
• Free text (Google-like) query
• Example document-s (“ ”)
• Relevance is similarity
• Similarity is transferred to “closeness”
More like this…
Information Retrieval paradigm

• IR tells us that we do not (really) need to have a
complete absolute understanding of documents
to respond queries, we (just) need to be able to
compare 2 documents
• We merely need appropriate distance
measurements
– To infer similarity
– Based on document features (space)
– and a notion of vicinity (distance)
 Everything is about inspecting neighborhoods
– Provided the distance is semantically relevant
Information Retrieval

Information management process
Raw documents
Representation space
(visualisation)
Document features
User interaction
Feature extraction
“Appropriate”
mapping
“Decision”
process
Query

Distance-based search
Representation space
(visualisation)
Query
Relevance list
(sorted by distance
to the query)

Example: text
Text documents
Feature extraction
“Appropriate”
mapping
User interaction
“Decision”
process
“Word” occurrences
Query

Also...
• Any type of media: webpage, audio, video,
data,...
• Objects, based on their characteristics
• People in social networks
• Concepts: processes, states, ... Etc
Anything for which “characteristics” may be
measured
This is what this lecture is about

• Raw data (the documents) carries information
+ Computers essentially perform additions
We need to represent the data somehow to
provide the computer with as much as possible
faithful information
• The representation is an opportunity for us to
transfer some prior (domain) knowledge as
design assumptions
If this (data modelling) step is flawed, the computer
will work with random information
Representation spaces (intuition)

From documents to features
• Features help characterizing 1 document (summary)
• Features help comparing 2 documents (similarity)
• Features help structuring a collection (visualization)

Comparing two documents (occurrences)
2 1 1 2 1 0 0
1 2 0 1 1 1 2
1 2 0 1 1 1 2
1 1 0 1 1 0 0
“Terms” (Vocabulary)
“Documents”(Repository)

Comparing two documents (presence)
Y Y Y Y Y N N
Y Y N N Y Y Y
Y Y N N Y Y Y
Y Y N Y Y N N
Similarity: Y+Y=1 Distance: Y+N=1

Comparing two documents (presence)
Y Y Y Y Y N N
Y Y N Y Y Y Y
1 1 0 1 1 0 0
Similarity = 4
(Distance=3)

Comparing two documents
6 (0)
3 (4) 3 (4)
4 (1)
3 (2) 3 (2)
S(D)

“ “
Comparing two documents
S(D)
D1 Y Y Y Y Y N N 1(6)
D2 Y Y N N Y Y Y 2(4)
D3 Y Y N N Y Y Y 2(4)
D4 Y Y N Y Y N N 0 (7)
Q N N Y N N Y Y
Nota: The query is seen as a document (Query-by-example)

Query-based ranking
“ “
• Since similarity is computed only on (Y,Y) pairs,
only terms count for computing the similarity
Every document having no common term
with the query scores 0
(ie terms not in the query are ignored)
2(4) 2(4) 1(6) 0(7)
D4D3 D1D2Q

Computing the ranks
Naïve solution (fill the array):
• For every document, for every term, score 1 if
the term is both in the query and the
document
 N (documents) * M (terms) comparisons
Smarter solution
• For every term of the query (few), find
documents having this term and work from
there (the rest will score 0)

Information indexing
Goal:
to organize the data so as to
facilitate its access (read or write)
• Fast access for computation
• Fast access with selection criteria
• …
Baseline: exhaustive search, sequential scan (unordered list)
Strategy: enable “direct access”
Simple example:
• Address book: sorted list, with first letter shortcut
– Smith  “S”  search 19th sublist (ignore 25 other sublists)

Indexing structures
…+
M-tree
Tries
Suffix array
Suffix Tree
Inverted files
LSH…
Illustration: Wikipedia

Term Documents
D1,D2,D3,D4
D1,D2,D3,D4
D1
D1,D4
D1,D2,D3,D4
D2,D3
D2,D3
“find documents having this term”
Arrange documents by the terms they contain
 Number of comparisons ~ number of query terms
<< N*M
Inverted file
(D1),(D2,D3),(D2,D3)
D1, D2, D2, D3, D3
D4 is never considered !
(we know already that its score is 0)

Principles and lessons
• We need to be able to compare 2 documents
• Comparison is based on document features
• Similarity may be based on feature occurrence
• Only occurring features count in building
similarity
 Feature occurrence allows for the use of
inverted files to get independent from the size of
the repository
Feature occurrence loses a lot of structure in the
document

In practice (text IR)
• Document features are normalised word
(stem) occurrences (Bag of Words model)
– Vocabulary ~ 10’000 stems
– Billions of documents
• Term weighting scheme (TF*IDF) accounts for
feature statistics (better than Y/N occurrence)
• Similarity is based on cosine distance
• Inverted files include term statistics

BoW : Assumptions
• A text is reduced to the base vocabulary it uses
• Example:
• Becomes:
The ultimate goal is to define similarity as:
Two documents are similar if
they share the same vocabulary
Signals from a particle collider near Geneva suggested in
September that scientists might have sighted a long-sought
particle thought to be the source of mass itself.
A (x2), be, collide, from, geneva, have, in, itself, long-sought,
mass, may, near, of,
particle (x2), scientist, see, september, signal, source,
suggest, that, the, think, to

BoW: Limitations
• The structure is lost
– Title, sections, paragraphs, sentences, negations
• The representation is incomplete…
– If all words appear once only and there is no extra information such as
• Grammatical type
• Priors on terms (recent news, unused terms,…)
– And writing well is often about using a rich vocabulary
• Use of synonyms
• …or ambiguous
– If a word appears many times but because of polysemy
– Eg: mouse
• A part on mouse as a pet
• A part on computer mouse
• A part on Mickey mouse
– Does not make the text about “mouse” but maybe just about “kid
amusements”
… still works very well in practice

A document is represented by the occurrences of the
terms of the vocabulary
• Earlier example: Y/N occurrence
 Number of occurrence does not matter
• Counting the term occurrence
Favors longer documents (more occurrences)
Document length normalization (frequency)
• Not all frequent terms are relevant
Balance between representation (TF) and discriminative
power (IDF)
How good in my term in my query for this document?
A bit more details

• Term frequency (TF)
– Percentage of space taken by each term in the document:
“how much this term represents the document”
High TF  frequent term in the document  potential
representative term (keyword)
• Inverse document frequency (IDF)
– Inverse of the percentage of space taken by each term in
the collection:
“how much this term discriminates documents”
High DF  low IDF  term frequent in all documents  not
discriminant (all documents are relevant)
Eg: the, at, and, she, he, her… + thematic (patient, disease,…)
 Good terms are terms with high TF and high IDF
TF*IDF model: intuition

High TF
(frequent in document)
Potential keyword
Low TF
(infrequent in document)
High IDF
(infrequent in collection)
“Keyword”
Good representative
Verb, noun,..
Low representation
(automatically ignored)
Low IDF
(frequent in collection)
“Stop word”
Does not carry meaning
“structural”
The, at, and, she,…
(removed from voc / ignored)
Rare term
(automatically ignored)
TF*IDF term weight
When computing the difference
between documents, terms are
weighted by their TF*IDF factor

Inverted file construction
Docs
extraction sorting compaction
Term Doc ID
“blah” N
“new” 1
“the” 1
“tool” N-1
… ...
“tool” 1
“blah” N
“blah” 1
Term Doc ID
“the” 1
“new” 1
“tool” 1
“blah” N
… ...
“tool” N-1
“blah” N
“blah” 1
Term Doc ID
“blah” 1
“new” 1
“the” 1
“tool” N-1
… ...
“tool” 1
“blah” N
Freq
1
1
1
1
...
1
2
Term Tot. F.
“blah” 3
“the” 1
… ...
“tool” 2
“new” 1
...
Doc ID
1
1
1
N-1
1
N
Freq
1
1
1
1
...
1
2
factorisation
Misc
operations
Eg: posting file compression
Dictionary
Collection information
« idf »
Postings
Document information
« tf »

Value Key
1
2
…
20
…
33
…
Hashing / fingerprinting (intuition only)
=1
=2
=3
=4
=5
=6
=7
=1+1+2+3+4+4+5 = 20
=1+2+2+4+5+6+7+7 = 33
Hash Table
Query
 Immediate access
=1+2+2+4+5+6+7+7 = 33
• Associate a key (hash, fingerprint) to each value
• Insert the (key, value) into a hash table
Immediate retrieval of the query
Difficult constraint (Perceptual hashing) : The key does
not vary if the value changes a bit (eg due to noise)
Hash function

What is multimedia?
Basically our digital life
• Text
– Plain text (any language)
– Structured text (XML-like, code,…)
• Visual
– Images (Photo)
– Sketches (drawing, map,..)
• Audio
– Music
– Speech
– Sound
• Misc
– 3D object
– Video
– … (software, playlist,…)
+ Any combination
– PDF, Web pages and alike

Multimedia IR: Use Cases
• Image
– Find look-alike pictures, recognize landmark
• Video
– Find specific actions
• Music
– Find inspirational music, recognize music extract
• Medical
– Find similar cases
• Patent
– Find similar proposals, plagiarism
 Not always doable with text
47

Volume of multimedia data
• 2 billion persons connected to internet
– Mostly via mobile devices (phones)
• 1.8 billions active mobile phones
• 180 millions active web servers
Sources:
R. Baeze-Yates – RussIR 2010
http://news.netcraft.com/

Volume of multimedia data
• Web scale:
– Google:
• Early 2004: 4.3 billions indexed pages
• Early 2005: 8 billions indexed pages
• Today: XX billion indexed pages?
– http://www.archive.org
• Growth: 20 Tb/month
• Multimedia collection:
– Facebook: 140 billions photos
180 years at 25 fps
– YouTube (2008): 83.4 million videos
• Curated collections:
– AOL Video library
• 410 millions views in October 2011
– Institut National de l'Audiovisuel (INA-France):
• 700’000h audio (radio)
• 400’000h video (television)
• 2 millions documents

The new wave
[From http://1000memories.com/blog/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox]

Example: Information « loss »
• Annotation
– " Wasp on a yellow flower "
• Text search:
– « insect feeding
This document will be missed
• The user must know what is the annotation
limitation
• The goal is to represent the document
– Exhaustively: no specific interpretation context
– Automatically: no human intervention

• Image formation model
• Image features
• Indexing visual content
Visual Information Retrieval

Visual content has specific types
• Photo
• Medical image, satellite image
• Document: fax, scanned text…
• Drawing: sketch, blueprint, cartoon,…
• 3D, shape
Visual content

Example: Images
Images
Feature extraction
“Appropriate”
mapping
“Decision”
process
Search
Photo collage
Filtering
Color histogram
User interaction
Query

Pixels
RGB Sensor
RGB Color model
Any color = combination of RGB values
Image = 3 Arrays (RGB) of values between 0 and 255
Pixel = RGB values (color) at a position in the image

Image processing / Analysis
RGB Histograms
Gray scale (intensity) Edge image

Low-level image representation
Global Color Histogram data in color spaces:
Global Texture data from Gabor filter banks:

Corner detection / keypoints

Local features: SIFT[Lowe 1999] and SURF[Bay et al 2006]

From data to information
Interpreting the content
• Data is physically measured
• Information is interpreted semantically
Semantic gap
• Fusion is a way to reduce the semantic gap
• Examples
– Looking at a movie w/o audio
– Listening to a story w/o visual
– Voice conversation vs video conversation
– Disambiguation (eg „jaguar“)

Semantic gap:
“Discrepancy between the level of analysis of a computer and the level of
perception of the same data by a human user.”
Level Text Visual
Interpretation Story “Action”
Semantic
segmentation
Paragraph Scene
Group Sentence Object
Physical
segmentation
Word Homogeneous
region
Physical
inspection
Character Color, texture
Levelofinterpretation
Relevanceforindexing
Computerprecision
Semantic Gap

Semantic gap
• Exploited by Captchas

Region-based image understanding
Images are compared w.r.t the regions
(hopefully objects/concepts) they contain
Object detection eg, Face detection

Object detection / recognition
• Train the machine to recognize specific groups
of patterns
Faces
[Viola, Jones, 2001]
Objects
 Automated image tagging  Text search

Naming faces : Image to text
Specific characteristics of a face are extracted
(eg eigenfaces)

Deep Learning
[http://www.slideshare.net/NVIDIA/visual-computing-the-road-ahead-an-nvidia-ces-2015-presentation-deck]

Bag of Visual Words
• Once basic visual features are extracted,
– Eg: the description of the surrounding of each corner point
(eg SIFT)
• Basic features are grouped
– Eg corner of an object
• And “quantified”
– Many examples are gathered into one instance
• To create a “visual dictionary” of visual words
• An image is a composition (occurrences) of these visual
words (“terms”)
 We can use of the machinery related to text
- Inverted files, etc

Comparing two documents (occurrences)
2 1 1 2 1 0 0
1 2 0 1 1 1 2
1 2 0 1 1 1 2
1 1 0 1 1 0 0
“Terms” (Vocabulary)
“Documents”(Repository)

Hyperbrowser [S. Craver, 1999]

Digital shoebox

Collection mining

• Modeling sound
• Audio information
– Speech, music, sound
• Searching for audio
Audio IR

Example: Audio
Audio
Feature extraction
“Appropriate”
mapping
“Decision”
process
Playlist
Filtering
User interaction
Query

Sound characteristics
Sound corresponds to waves perturbing the environment
Wave characteristics
– Frequency/Period
1 Hertz (Hz) = 1 vibration/second
Related notion of pitch
– Intensity
• Human ear has logarithmic scale
Decibel (dB) scale
0dB = 10-12 W/m2 (Threshold of hearing)
10dB = 10-11 W/m2
…
x*10dB= 10-(12-x) W/m2
2
m
Watt
Area
Power
Time
Energy
Area
1
I

Sound perception
Airwave collection
Mapping to
mechanical waves
Mapping to
compressional waves
Mapping to
electrical signal
0dB = 10-12 W/m2 : Threshold of hearing
160 dB = 10-4 W/m2 : Instant perforation of eardrum

200k100k
Sound perception
0
Frequency f (Hz)
Human (20-20k)
Dog (50-45k)
Cat (50-85k)
50 100 500 1k 5k 10k 50k
Bats (50-85k)
Dolphins (-200k)
Bats (-120k)
Elephant (5-)
Infrasound Audible sound Ultrasound

Audio and Sound
50 100
80
20
500 1k 5k 10k 20k
40
60
0
120
100
Non-audible
Painful
Audible sounds
Speech
Frequency f (Hz)
Intensity I (dB)
Partly adapted from [Schäuble,1997]
Quiet library,
soft whisper
Quiet office,
living room,
Light traffic at a distance,
gentle breeze
Conversation,
Sewing machine
Busy traffic,
noisy restaurant
Subway, heavy city traffic,
alarm clock at 2 feet,
factory noise
Truck traffic, noisy home
appliances, shop tools,
lawnmower
Chain saw,
pneumatic drill
Rock concert in front of
speakers, thunderclap
(180) Rocket launching pad
(140) Gunshot blast, jet plane

Audio information
• Speech
– Male, female speech
Signal-based speech characterisation
– Automatic speech recognition (ASR)
ASR (text) retrieval
• Music
Genre recognition (Jazz, classic,…)
Excerpt retrieval
Query by humming
…
• Sound
– Car, water, people, animals,…
– Often useful in relation to visual (video)

Audio information
high
low
natural man-made
Informationcontent
Origin
Animal
sounds
Wind
Water
Speech Music
Urban noise Machines,
engines

Audio features
• Low-level-audio representation
– Spectrogram
– MFCC
• Mid-level representations
– MIDI strings
– LPC
• High-level representation
– Automatic speech recognition (ASR)

Translation
Signal-
based
Audio information indexing
Audio
SoundMusicSpeech
ASR
Wordspotting
Translation
Piece/Genre recognition Classification
Visual
A/V Processing
Transcripts,
Speaker info
Score,
MIDI

Speech
Speech production
model
Speech perception
model
Note: In vision, mostly only
perception models
Speech

Speech production
Human apparatus
Speech production scenarios
1. Voiced sound (vocal tract)
2. Fricative sound (‘S’)
3. Plosive sound (‘P’)

Speech production modeling
Mechanical model
Human apparatus
Multistage process
1. Build up
pressure
2. Release
3. Return to first
state

Speech signal modeling
Given w(t) as a speech signal
where S(w) is the STFT of w.
The model assume H(w) as the
vocal tract transfert function
and E(w) is the excitation
222
)()()( wEwHwS 

Note: MP3 coding
Advanced coding model
MP3 = MPEG 1 Audio layer 3
See previous
slides

Sony music browser

Islands of music [http://www.oefai.at/~elias/music]
Organisation d'archives musicales
• Basée sur des cartes auto-organisées
de Kohonen (Self organising maps –
SOM)

ThemeFinder [http://www.themefinder.org/]

• Design a nice perceptual hash function for
audio (based on perceptual features)
• Slice an audio document (music piece) into
small chunks (large enough to be “unique”)
• For each chunk compute a key through the
hash function and store as value the music
piece ID (title)
Music query by example
Retrieval by noisy chunk
Audio fingerprinting

• Video type
– Movies
– Amateur souvenir film
– Lectures
– Youtube (short, static)
• Combining individual media (fusion)
– each brings some information…
• Going further: social multimedia…
Multimedia IR

Interpreting the content
• Data is physically measured
• Information is interpreted semantically
Semantic gap
• Fusion is a way to reduce the semantic gap
• Examples
– Looking at a movie w/o audio
– Listening to a story w/o visual
– Voice conversation vs video conversation
– Disambiguation (eg „jaguar“)

Multimodal video features
Over temporal modalities:
• Segmentation (boundary) cues
• Categorization (region) cues
Word relatedness
measure
Visual syntactic
homogeneity
Multimodal
boundary-based cues
Semantic
region- based cues
Audio
Visual
Text

Adding context to interpretation
Video Story Segmentation:
Using fusion between the visual and audio
modalities, high-level segmentation may be
achieved
“shot”
 keyframe
Audio
transcript

Main goals of fusion
Multimodal information fusion aims at interpreting
jointly multiple sources of information
representing the same underlying „concept“
The main goal is the extraction of information
By fusing information, one aims at:
• Being more accurate in the discovery of the
„concept“
– Each individual stream may be incomplete
• Being more robust in the discovery of the
„concept“
– Each individual stream may be distorted (eg, noisy)

Multimodal fusion
• From many sources of information and
context, how to make our best to “interpret”
the data
representation
color
texture
shape
face
local features
vector space model
Port;;Naval and River;;For
Transportation;;Structure;;Architecture;;Vegetabl
e Garden;;Landscape Farmland -
Countryside);;Landscape;;Landscape
(Seascape);;Landscape;;Panorama;;View of the
City;;View 98
data
Interpretation

Levels of fusion
How to organise fusion from classical
„unimodal“ interpretation?
Feature
extraction
RecognizerRaw data
„Knowledge“
Output
Unique
source

Levels of fusion
Raw data
Output
Raw data
Multiple
sources
One unique
output
Recognizer Decision

Early fusion strategy
• Acts over features
• All modalities are „concatenated into one“
• Only one decision is taken over the concatenated input
Raw data
Output
Raw data
Multiple
Sources
Fusion
One unique
output
Recognizer Decision
Example: multimodal medical image analysis
• Sources: modalities (XRay, PET,…)

Intermediate fusion strategy
• Each source acts as an input for a specific decision
• All recognizers are coupled in some way to form a meta-
recognizer
• One decision is taken
Fusion
Raw data
Output
Raw data
Multiple
sources
One unique
output
Recognizer
Meta-
Recognizer
Recognizer
Decision
Example: Video analysis
• Source: A/V streams

Late fusion strategy
• Each source is processed individually by a specific recognizer
• Multiple independent initial decisions are taken, possibly associated with
confidence scores
• A final decision is taken based on this output
Final
decision
Fusion
Raw data
Output
Raw data
Multiple
sources
One unique
output
Recognizer
Recognizer
Decision
Decision
DecisionExample: Object recognition
• Source: image regions

Prototype-based fusion
     ,,,, SIFT)(
xxx p72π
query
document x
     ,,7,2, SIFT)(
xxx pπ

Concept “Mona Lisa”
Multimedia is complex data
1503-1506 by L. Da Vinci
EN: Mona Lisa
IT: La Gioconda
FR: La Joconde
Visible in Le Louvre, Paris
48° 51′ 34″ N 2° 19′ 54″ E
Type: Oil on poplar
Dimensions: 77 cm × 53 cm (30 in × 21 in)
Photo by Mr. X
Date, context
Reason of trip
Variations from artists
Embedding into prints
Chocolate piece
URLs about Mona Lisa
Books about Mona Lisa
Videos about Mona Lisa
Audio about Mona Lisa
“official picture”
Texts about Mona Lisa
“Factual” data
“Faithful” data “Related” data

Networked multimodal data
Data
Tags
Users

Features
Networked multimodal data
Data
Tags
Users

• Curse of dimensionality
– Sparsity
– Dimension reduction/visualisation
• Big data
– Indexing
– Visual Analytics
Complements

Statistics tell us that the more data we have, the
better we get
Simulation: exponential distribution
n=10 n=100 n=1’000 n=3’000
n=10’000 Mean Standard deviation

Imagine I want to know the age spread of a population
• I create 10 intervals (0-10, 10-20,…) and I want to know the proportion of
people in each interval
• To have a reliable estimate I want 100 persons per interval in average
 With 10*100=1000 persons, I get my estimates
Now, imagine I want to estimate the height against age (I add one dimension
to my measure)
• I create 10 height interval and question, “how many people with height X
that are aged Y?”
• I have 10*10=100 such (X,Y) pairs
• To get 100 persons for each pair in average I need 100*100 = 10’000
persons
And so on…
 The size of the data needed for a reliable estimate grows with the
exponential of the dimension (huge!!!)
 In practice our data size is limited
 Our only choice to preserve reliability is to reduce the dimension
High dimensionality

Dimension reduction: principle
• Given a set of data in a M-dimensional space, we
seek an equivalent representation of lower
dimension
– For better statistics
– For visualization (2D, 3D,..)
• Dimension reduction induces a loss. What to
sacrifice? What to preserve?
– Preserve local: neighbourhood, distances
– Preserve global: distribution of data, variance
– Sacrifice local: noise
– Sacrifice global: large distances
– Map linearly
– Unfold data

Some example techniques:
• SFC: preserve neighbourhoods
• PCA: preserve global linear structures
• MDS: preserve linear neighbourhoods
• IsoMAP: Unfold neighbourhoods
• SNE family: unfold statistically
Dimension reduction

Non-recursive Space Filling Curves
Z-Scan Curve Snake Scan Curve
[Illustrations from the lecture “SFC in Info Viz”, Jiwen Huo, Uni Waterloo, CA]

Recursive Space Filling Curves
Hilbert Curve
Gray Code Curve
Peano Curve

3D SFC
Z Curve Peano Curve Hilbert Curve
N-dimensional algorithm:
A.R. Butz (April 1971). "Alternative algorithm for Hilbert’s space filling curve.".
IEEE Trans. On Computers, 20: 424–42.

PCA
[Illustration Wikipedia]

• A local neighbourhood graph (eg 5-NN graph) is built to create
a topology and ensure continuity
• Distances are replaced by geodesics (paths on the
neighbourhood graph)
• MDS is applied on this interdistance matrix (eg with m=2)
IsoMap (non Euclidean)
[Illustration from http://isomap.stanford. edu]

• MNIST dataset
t-SNE example
[Illustration from L. van der Maaten’s website]

Traces of our everyday activities can be:
• Captured, exchanged (production,
communication)
• Aggregated, Stored
• Filtered, Mined (Processing)
The “V”’s of Big Data:
• Volume, Variety, Velocity (technical)
• and hopefully... Value
Raw data is almost worthless, the added value is in
juicing the data into information (and knowledge)
Big Data

Large-scale: Massive data volume, too large to
perform sequential scan of a small portion
Aim:
From a query, “filtering” by (fast) approximate
search) a tiny (fixed-size) portion of the data as
candidate answers. And then perform (slow) exact
search in that tiny-scale sample
Method:
Assuming the features group the document
coherently, approximately identify the vicinity of
the query and search within that neighborhood
Tools for Large Scale indexing

Approximate search
Fast approximate filtering
border (exact search zone)
Actual slow border
Final resultsNever considered

Idea: reference-based indexing
“If I am close to you, I see the same thing as you”
I am close to Geneva, Montreux and Evian, anything also close
to these places is close to me
• Fix few reference documents (landmarks)
• For each document, measure the distance to each
reference document
– Eg: define the set of the 5 closest landmarks
• For a query
– Find its 5 closest landmarks
– Find all documents whose 5 closest landmarks are the same
– Actually (slow distance) compare to these

• Fix few reference documents (landmarks)
Can be done offline
• For each document, measure the distance to each
reference document
– Eg: define the set of the 5 closest landmarks
Also done offline
• For a query
– Find its 5 closest landmarks
Fast since few landmarks
– Find all documents whose 5 closest landmarks are the
same
Can be made fast (eg Inverted File)
– Actually (slow distance) compare to these
 Fast since tiny sample

Permutation-based Indexing
L(x1, R)= (𝑟1, 𝑟2, 𝑟3, 𝑟4, 𝑟5) L(x2, R)=(𝑟1, 𝑟2, 𝑟3, 𝑟4, 𝑟5) L(x3, R)= (𝑟5, 𝑟3, 𝑟2, 𝑟4, 𝑟1)
n=5:
D={x1, . . . , x 𝑁}, N objects,
R = {𝑟1, . . . , 𝑟 𝑛} ⊂ D, n references
Each 𝑜𝑖 is identified by an ordered list:
L(x𝑖, R)= {𝑟𝑖1, . . . , 𝑟𝑖𝑛} such that
d(x𝑖, 𝑟𝑖𝑗) ≤ d(x𝑖, 𝑟𝑖𝑗+1 ) ∀j = 1, . . . , n − 1
x
y
z
1x2x
3x
4x
5x
r1
r2
r3
r4
r5
 
j
ririSFD
rank
i jj
RxLRqLxqdxqd || ),(),(),(),(

• Assess system performance on standard
challenges
– Data collection
– Annotation / groud-truth collection
– Query formulation
– Performance measures ( precision, recall,…)
Yearly event / related to scientific conference
In general, participants are academic teams
Evaluation

ImageCLEF
http://www.imageclef.org/
CLEF= Cross
Lingual Evaluation
Forum

ImageCLEF Wikipedia
Images + text (Overall: 237,434)
• English only: 70,127
• German only: 50,291
• French only: 28,461
• English and German: 26,880
• English and French: 20,747
• German and French: 9,646
• English, German and French: 22,899
• Language undetermined: 8,144
• No textual annotation: 239

TRECVid
The goal of the conference series is to encourage research in
information retrieval by providing a large test collection, uniform
scoring procedures, and a forum for organizations interested in
comparing their results.
http://trecvid.nist.gov/

MIREX
http://www.music-ir.org
• The Music Information Retrieval Evaluation
eXchange (MIREX) is an annual evaluation
campaign for Music Information Retrieval
(MIR) algorithms, coupled to the International
Society (and Conference) for Music
Information Retrieval (ISMIR)

Other
• 3D Retrieval
– SHREC: Shape Retrieval Contest
– http://www.aimatshape.net/event/SHREC
• XML retrieval
– INitiative for the Evaluation of XML retrieval
– https://inex.mmci.uni-saarland.de/
• Many dedicated corpora for specific tasks
– Generic (Image-Net)
– Medical (ImageCLEF)
– Satellite imagery
– …

Conclusions
• Multimedia Information Retrieval extends classical IR
– Word occurrence model is preserved
• Text IR is not always sufficient / feasible
– Completeness
– Scalability
• Multimedia IR requires accurate interpretation of multimodal
data
• Fusion is required for reaching such an accurate level
• Multimedia IR still faces many challenges
– Accuracy, interactivity, scalability

Challenges
• Speed and accuracy of response are the major
challenges
• IR tends to be mobile
– Low computational power, connection, memory,…
• Large scale
– Big data
• Partial search
– Searching for a part of a document
• Semantic search
– Searching with high-level description

• Baeza-Yates R. and Ribeiro-Neto B. (1999): Modern
Information Retrieval. Addison-Wesley.
http://people.ischool.berkeley.edu/~hearst/irbook
• Baeza-Yates R. and Ribeiro-Neto B. (2011): Modern
Information Retrieval. Addison-Wesley, Second
Edition
http://www.mir2ed.org/
Recommended reading

Thank you!
Questions?

Big Data and Large-scale data
– Mohammed, H., & Marchand-Maillet, S. (2015). Scalable Indexing for
Big Data Processing. Chapman & Hall.
– Marchand-Maillet, S., & Hofreiter, B. (2014). Big Data Management
and Analysis for Business Informatics. Enterprise Modelling and
Information Systems Architectures (EMISA), 9.
– M. von Wyl, H. Mohamed, E. Bruno, S. Marchand-Maillet, “A parallel
cross-modal search engine over large-scale multimedia collections
with interactive relevance feedback” in ICMR 2011 - ACM International
Conference on Multimedia Retrieval.
– H. Mohamed, M. von Wyl, E. Bruno and S. Marchand-Maillet,
“Learning-based interactive retrieval in large-scale multimedia
collections” in AMR 2011 - 9th International Workshop on Adaptive
Multimedia Retrieval.
– von Wyl, M., Hofreiter, B., & Marchand-Maillet, S. (2012).
Serendipitous Exploration of Large-scale Product Catalogs. In 14th IEEE
International Conference on Commerce and Enterprise Computing
(CEC 2012), Hangzhou, CN.
More at http://viper.unige.ch/publications
References

References
Large-scale Indexing
– Mohamed, H., & Marchand-Maillet, S. (2015). Quantized Ranking for Permutation-
Based Indexing. Information Systems.
– Mohamed, H., Osipyan, H., & Marchand-Maillet, S. (2014). Multi-Core (CPU and
GPU) For Permutation-Based Indexing. In Proceedings of the 7th Internation
Conference on Similarity Search and Applications (SISAP2014), Los Cabos, Mexico.
– H. Mohamed and S. Marchand-Maillet “Parallel Approaches to Permutation-Based
Indexing using Inverted Files” in SISAP 2012 - 5th International Conference on
Similarity Search and Applications .
– H. Mohamed and S. Marchand-Maillet “Distributed Media indexing based on MPI
and MapReduce” in CBMI 2012 - 10th Workshop on Content-Based Multimedia
Indexing.
– H. Mohamed and S. Marchand-Maillet “Enhancing MapReduce using MPI and an
optimized data exchange policy”, P2S2 2012 - Fifth International Workshop
onParallel Programming Models and Systems Software for High-End Computing.
– Mohamed, H., & Marchand-Maillet, S. (2014). Distributed media indexing based on
MPI and MapReduce. Multimedia Tools and Applications, 69(2).
– Mohamed, H., & Marchand-Maillet, S. (2013). Permutation-Based Pruning for
Approximate K-NN Search. In DEXA, Prague, CZ.

References
Large data analysis – Manifold learning
– Sun, K., Morrison, D., Bruno, E., & Marchand-Maillet, S. (2013).
Learning Representative Nodes in Social Networks. In 17th Pacific-Asia
Conference on Knowledge Discovery and Data Mining, Gold Coast, AU.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Unsupervised
Skeleton Learning for Manifold Denoising and Outlier Detection. In
International Conference on Pattern Recognition (ICPR'2012), Tsukuba,
JP.
– Sun, K., & Marchand-Maillet, S. (2014). An Information Geometry of
Statistical Manifold Learning. In Proceedings of the International
Conference on Machine Learning (ICML 2014), Beijing, China.
– Wang, J., Sun, K., Sha, F., Marchand-Maillet, S., & Kalousis, A. (2014).
Two-Stage Metric Learning. In Proceedings of the International
Conference on Machine Learning (ICML 2014), Beijing, China.
– Sun, K., Bruno, E., & Marchand-Maillet, S. (2012). Stochastic Unfolding.
In IEEE Machine Learning for Signal Processing Workshop (MLSP'2012),
Santander, Spain.

Multimedia Information Retrieval

Recommended

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Multimedia Information Retrieval (20)

Recently uploaded (20)

Multimedia Information Retrieval