Effective Searching by Dominik Kornas

Effective searching
Integrating External Search Engines with Adobe AEM
Dominik Kornaś

Who am I?
3 years in Cognifide – exactly today
Senior software engineer & technical lead
Focused on systems integration tasks
The “search guy” in Cognifide

What we won’t talk about
• Sorting
• Document structure
• Indexing
• Managed relevancy model
• Input data processing
• Highlighter
• Faceted search
• Wildcard search
• Statistics
• Autocomplete
• Spellchecking
• Lemmatization
• Sentence search
• Pagination
• Content normalization
• Metadata
• Data collections & views

The goal of searching
“What is the best British football team?”
If we ask such a question, will the search engine find the answer?

The search engine will find the question, not the answer.

Are we asking questions or issuing queries?
“What is the best British football team?” vs. “best team football UK”

Effective searching is about finding keywords:
• in the shortest possible time
• close to each other in a block of text
• that are in a desired context
and being sure the engine knows about the data we are looking for!
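
One way to make "close to each other in a block of text" concrete is to measure the shortest run of words that contains every keyword. The sketch below is only an illustration: it assumes plain whitespace tokenization and lower-casing, which is far simpler than what a real search engine's analyzers and scoring do, and `smallest_window` is a hypothetical helper name.

```python
def smallest_window(text, keywords):
    """Return the length, in tokens, of the shortest span of `text` that
    contains every keyword, or None if some keyword never occurs."""
    wanted = {k.lower() for k in keywords}
    if not wanted:
        return 0
    tokens = text.lower().split()  # assumption: whitespace tokenization
    counts = {}                    # keyword -> occurrences inside the window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in wanted:
            counts[token] = counts.get(token, 0) + 1
        # Shrink the window from the left while it still covers all keywords.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            leaving = tokens[left]
            if leaving in counts:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    del counts[leaving]
            left += 1
    return best
```

A smaller result means the keywords sit closer together; None means the text cannot match the query at all.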

Microsoft FAST
The first major external search integration with AEM (then: CQ 5.4) in Cognifide.
Push-like indexing using the CQ-FAST connector from Adobe.

Microsoft FAST
Implemented as a dedicated replication agent, triggered by content replication.
http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html

Microsoft FAST
Replication agent processing workflow: HTTP request for content
[Diagram labels: Content builder, Transport handler, MS FAST, Metadata, Markup]

Microsoft FAST
We can decide which instance the content should be read from.

Microsoft FAST
Replication agent processing workflow: metadata.ecma evaluation

Microsoft FAST
Replication agent processing workflow: data upload

Microsoft FAST
Sends content to MS FAST.
The “cq5” suffix in the URI names a document collection: a named subset of documents within the entire FAST index.
http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html

Microsoft FAST
Replication agent processing workflow: indexing

Microsoft FAST
The replication agent works well for a single site stored in a single FAST document collection.
It becomes complicated in a multi-site environment, where each site must live in a separate index area and the search results must not contain data coming from other sites.

Microsoft FAST
A complex ACL configuration was used to ensure that only the one proper agent delivers each document to FAST.
It was hard to set up and maintain without proper tools to automate the whole process.

Google Search Appliance
For the AEM & GSA integration, we considered reusing the CQ-FAST connector approach.
But, aware of its issues, we decided to develop our own micro-framework that takes care of the indexing process.
It is installed as a single OSGi bundle.
It provides a set of services and utilities to help with the indexing.

[Diagram: indexing pipeline across the Author and Publish instances: Content replication, Filtering, Push to Publish, Indexing queue(-s), Content gathering, Metadata processing, Push to external engine; underneath: Process status tracking & persistence]
The indexing process spans the author and the publish AEM instances.
All stages are tracked, so it is possible to recover from a failure and retry the indexing.

The process starts with content replication, or programmatically from the backend, e.g. triggered by the scheduler service.

Each replicated content path is filtered against a whitelist & a blacklist.
There’s an option to use a custom OSGi service that decides whether the content should be indexed, removed or ignored.
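
The filtering step could be sketched roughly as below. This is an illustration, not the framework's code: prefix matching, the blacklist taking precedence, the event names and the `decide` function are all assumptions, and the custom OSGi service is stood in for by a plain callable that may return None to defer to the lists.

```python
from enum import Enum


class Action(Enum):
    INDEX = "index"
    REMOVE = "remove"
    IGNORE = "ignore"


def decide(path, whitelist, blacklist, event="activate", custom=None):
    """Decide what to do with a replicated content path."""
    # Hypothetical custom hook: consulted first, None means "no opinion".
    if custom is not None:
        verdict = custom(path, event)
        if verdict is not None:
            return verdict
    # Assumption: lists hold path prefixes and the blacklist wins.
    if any(path.startswith(prefix) for prefix in blacklist):
        return Action.IGNORE
    if not any(path.startswith(prefix) for prefix in whitelist):
        return Action.IGNORE
    # Assumption: a deactivation removes the document from the index.
    return Action.REMOVE if event == "deactivate" else Action.INDEX
```

With a whitelist of `/content/site-a` and a blacklist of `/content/site-a/private`, a page under `/content/site-a/en` would be indexed on activation and removed on deactivation, while anything under the private subtree or on another site would be ignored.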

The indexing information is persisted in a special kind of repository node and replicated to the publish instance.
We can choose which publish instance(-s) will receive the data.

The information is received and instantly dispatched to the indexing queue(-s).
We can handle indexing in a single search engine or in multiple different ones.
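
Dispatching one piece of indexing information to several engines can be pictured as one queue per engine. A minimal sketch, assuming in-memory queues and hypothetical engine names; the real framework persists its state in the repository.

```python
import queue


def dispatch(job, queues):
    """Put a copy of the job on every engine's queue.

    `queues` maps an engine name to its queue; returns the engine names used.
    """
    for engine, engine_queue in queues.items():
        # Each engine gets its own copy, tagged with the engine name.
        engine_queue.put({**job, "engine": engine})
    return list(queues)
```

Because every engine has its own queue, a slow or failing engine does not hold back the others.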

The content is gathered using the SlingRequestProcessor OSGi service.
It’s like a request for an HTML page sent from the Java code and consumed by that same code.

Metadata is collected according to multiple different rules:
• the content resource type
• the content path
• values of the component properties
• custom rules
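
The rule kinds listed above can be modelled as pairs of "does this rule apply?" and "what does it contribute?". The sketch below is an assumption about the shape of such rules, not the framework's API; the page dictionary, the resource type, the paths and the property names are invented for illustration.

```python
def collect_metadata(page, rules):
    """Apply every matching rule and merge the metadata it contributes."""
    metadata = {}
    for applies, extract in rules:
        if applies(page):
            metadata.update(extract(page))
    return metadata


# Hypothetical rules, one per rule kind from the list above.
RULES = [
    # Rule based on the content resource type.
    (lambda p: p["resourceType"] == "site/components/page/news",
     lambda p: {"type": "news"}),
    # Rule based on the content path.
    (lambda p: p["path"].startswith("/content/site-a/"),
     lambda p: {"site": "site-a"}),
    # Rule based on values of the component properties.
    (lambda p: "author" in p["properties"],
     lambda p: {"author": p["properties"]["author"]}),
]
```

A custom rule is then just one more pair appended to the list.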

The content and metadata are combined and sent to the search engine.
Depending on the implementation, this can be done for each single document or in batches.
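
The difference between per-document and batched pushes comes down to how the documents are grouped before the send call. A sketch under assumptions: `build_document`, `batches` and `push_all` are hypothetical names, and `send` stands in for whatever client actually talks to the search engine.

```python
def build_document(path, content, metadata):
    """Combine the gathered content and the collected metadata."""
    return {"id": path, "content": content, **metadata}


def batches(documents, size):
    """Yield lists of at most `size` documents, in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for document in documents:
        batch.append(document)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # the last, possibly shorter, batch
        yield batch


def push_all(documents, send, batch_size=1):
    """Send documents in batches; return the number of send calls made."""
    calls = 0
    for batch in batches(documents, batch_size):
        send(batch)
        calls += 1
    return calls
```

With `batch_size=1` every document is pushed on its own; a larger value trades latency for fewer round trips.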

[Diagram: the same pipeline, with a “Failure or timeout” branch after Metadata processing leading to “Retry”]
In case of any failure, indexing is rescheduled and launched again, as many times as configured.
If the server goes down, indexing will restart when the machine is up again.
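
The retry behaviour can be sketched as a job record that remembers its status and attempt count. This is a simplification under assumptions: the dictionary `store` stands in for the persisted status kept in the repository, and `new_job`, `process` and `resume` are hypothetical names.

```python
def new_job(path):
    """Create a fresh job record for a content path."""
    return {"path": path, "status": "PENDING", "attempts": 0, "last_error": None}


def process(job, push, max_attempts):
    """Try to push the job until it succeeds or attempts run out."""
    while job["attempts"] < max_attempts:
        job["attempts"] += 1
        try:
            push(job["path"])
        except Exception as exc:  # any failure or timeout triggers a retry
            job["status"] = "FAILED"
            job["last_error"] = str(exc)
        else:
            job["status"] = "DONE"
            return True
    return False


def resume(store, push, max_attempts):
    """Re-run every job that is not done, e.g. after a restart."""
    for job in store.values():
        if job["status"] != "DONE":
            process(job, push, max_attempts)
```

Because the attempt count lives in the record, calling `resume` after a restart continues where the job left off and never exceeds the configured limit.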

The flexible nature of our solution saved us when some fancy requirements came along.

Apache Solr
The search engine, which is:
• free & open source
• powerful
• customizable
• scalable
And, most importantly, it is a part of Jackrabbit Oak (JCR 3), the repository engine used by AEM 6.
AEM with the integrated Solr is right there.

Apache Solr
The solution developed for GSA has been ported to work with Solr.
Changes:
• Replaced the “glue code” that does the final data push with one that uses the SolrJ Java library.
• Names of the document metadata fields have been changed to follow the Solr naming convention for dynamic fields.
Everything else remained untouched.
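
Renaming metadata fields for Solr dynamic fields usually means appending a type suffix. The sketch below assumes the suffix patterns shipped in Solr's example schema (`*_s`, `*_i`, `*_f`, `*_b`); the actual mapping used in the project is not shown in the slides, and `solr_field_name` and `to_solr_document` are hypothetical helpers.

```python
# Order matters: bool must be checked before int, since bool is a subclass of int.
SUFFIXES = [(bool, "_b"), (int, "_i"), (float, "_f"), (str, "_s")]


def solr_field_name(name, value):
    """Append the dynamic-field suffix matching the value's type."""
    for value_type, suffix in SUFFIXES:
        if isinstance(value, value_type):
            return name + suffix
    raise TypeError(f"no dynamic field suffix for {type(value).__name__}")


def to_solr_document(doc_id, metadata):
    """Build a Solr document whose metadata fields use dynamic field names."""
    document = {"id": doc_id}
    for name, value in metadata.items():
        document[solr_field_name(name, value)] = value
    return document
```

With such a convention, new metadata fields can be indexed without touching the Solr schema.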

Search driven components
No server-side processing.
Search engine used as a mini database of metadata.
Configuration via query parameters.
Pure front-end implementation.

The whole page can be read from the dispatcher cache.
An AJAX request gets the content directly from the search engine.
The response is JSON-structured, easy to parse and to display using JavaScript.
{
  "id": "223344",
  "firstName": "Michael",
  "lastName": "Johnson",
  "phone": "(123)-777-8888",
  "office": "Office UK",
  "department": "504",
  "title": "Lead Architect"
}

Search results component configured to return employee data.

User profile.
The name, mobile, email, image path etc. are all metadata values of the document.

Carousel with news.
By changing the maximum number of search results, we can control the number of slides in the carousel.
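
The "configuration via query parameters" idea can be illustrated with a Solr select request, where the standard `rows` parameter caps the number of results and therefore the number of slides. The sketch is written in Python only to show the request and response shapes; in the browser the same would be done in JavaScript. The host, core name and the `type_s:news` query are assumptions.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, query, rows):
    """Build a Solr select URL; `rows` limits the number of results."""
    return base_url + "?" + urlencode({"q": query, "rows": rows, "wt": "json"})


def docs_from_response(body):
    """Extract the matching documents from a Solr JSON response body."""
    return json.loads(body)["response"]["docs"]
```

Changing `rows` from 5 to 3 in the component configuration would then shrink the carousel to three slides without any server-side code.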
