Knowledge Search Engines
The ideal search engine would match search queries to their exact context and
return results within that context. While Google, Yahoo and Live continue to hold sway in
search, the engines described here take a semantics (meaning) based approach. The end
result is more relevant search results, based on the semantics and meaning of the query
rather than on preset keyword groupings or inbound-link measurement algorithms; the
latter make the more traditional search engines easier to game and therefore more prone
to spam-oriented results.
Popular SW Search Engines:
Semantic Web Search Engine (SWSE)
Sindice
Watson
Falcons
Swoogle
Semantic Web Search
Zitgist Search
Hakia
SWSE (Fig. 1) is a search engine for the RDF Web, and provides the equivalent of the
services a search engine currently provides for the HTML Web. The system explores and
indexes the Semantic Web and provides an easy-to-use interface through which users can
find the information they are looking for. Because of the inherent semantics of RDF and
other Semantic Web languages, the search and information retrieval capabilities of SWSE are
potentially much more powerful than those of current search engines. SWSE indexes RDF
data from many sources, including OWL, RDF and RSS files. RSS 2.0 is converted to RDF,
and GRDDL sources will be added soon. Developed by DERI Ireland.
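To give a feel for the kind of RSS-to-RDF conversion mentioned above, here is a minimal
sketch in Python using the feedparser and rdflib libraries. The feed URL is a placeholder
and the choice of the RSS 1.0 vocabulary is an assumption for illustration; SWSE's actual
conversion pipeline is not documented here.

# Minimal sketch: expose an RSS 2.0 feed as RDF triples using the
# RSS 1.0 vocabulary. The feed URL below is a placeholder.
import feedparser
from rdflib import Graph, Literal, Namespace, URIRef, RDF

RSS = Namespace("http://purl.org/rss/1.0/")

def rss2_to_rdf(feed_url):
    feed = feedparser.parse(feed_url)      # fetch and parse the RSS 2.0 feed
    g = Graph()
    g.bind("rss", RSS)
    for entry in feed.entries:
        item = URIRef(entry.link)          # use the item's link as its URI
        g.add((item, RDF.type, RSS.item))
        g.add((item, RSS.title, Literal(entry.title)))
        if "summary" in entry:
            g.add((item, RSS.description, Literal(entry.summary)))
    return g

graph = rss2_to_rdf("http://example.org/feed.rss")   # placeholder URL
print(graph.serialize(format="xml"))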
Sindice
Sindice (Fig. 2) is a lookup index for Semantic Web documents built on data intensive cluster
computing techniques. Sindice indexes the Semantic Web and can tell you which sources
mention a resource URI, an inverse functional property (IFP), or a keyword, but it does not
answer triple queries. Sindice currently indexes over 20 million RDF documents. Developed
by DERI Ireland.
Fig. 2. Sindice.
Sindice allows its users to find documents with statements about particular resources. It
is primarily not an end-user application, but a service to be used by any decentralised
Semantic Web client application to locate relevant data sources. As an application service,
Sindice can be accessed through its Web API; an HTML front-end is also offered for human
testing and debugging.
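As a sketch of how an application might call such a Web API, the following Python
fragment performs a lookup over HTTP. The endpoint path, parameter names and response
keys are assumptions for illustration; consult the Sindice documentation for the real
interface.

# Sketch of a Sindice-style lookup over HTTP. Endpoint, parameters and
# response structure are assumed for illustration only.
import json
import urllib.parse
import urllib.request

def sindice_lookup(uri, endpoint="http://api.sindice.com/v2/search"):
    # Ask which indexed documents mention the given resource URI.
    params = urllib.parse.urlencode({"q": uri, "format": "json"})
    with urllib.request.urlopen(endpoint + "?" + params) as response:
        return json.load(response)

results = sindice_lookup("http://www.w3.org/People/Berners-Lee/card#i")
for entry in results.get("entries", []):   # "entries" key is an assumption
    print(entry.get("link"))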
Thus, a search for the URI of Tim Berners-Lee, for example, can be displayed on the HTML
interface for humans. The application interface returns the same results in various
machine-processable formats such as RDF, XML, JSON and plain text; an example is shown
below. In this example, several documents are returned, each of which mentions Tim
Berners-Lee's URI. The results are ranked in order of general relevance, and some further
information is given to enable users to choose their preferred source.
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://www.w3.org/People/Berners-Lee/card#i">
    <rdfs:seeAlso rdf:resource="http://www.w3.org/People/Berners-Lee/card"/>
    <rdfs:seeAlso rdf:resource="http://danbri.org/foaf.rdf"/>
    <rdfs:seeAlso rdf:resource="http://heddley.com/edd/foaf.rdf"/>
    <rdfs:seeAlso rdf:resource="http://www.eyaloren.org/foaf.rdf"/>
    <rdfs:seeAlso rdf:resource="http://people.w3.org/simon/foaf"/>
    <rdfs:seeAlso rdf:resource="http://www.ivan-herman.net/foaf.rdf"/>
  </rdf:Description>
</rdf:RDF>
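A client receiving this response could extract the suggested sources with a few lines of
rdflib; the sketch below assumes the XML above has been saved locally as
sindice_response.rdf.

# Parse the RDF/XML response shown above and list the suggested sources.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("sindice_response.rdf", format="xml")   # the response, saved locally

person = URIRef("http://www.w3.org/People/Berners-Lee/card#i")
for source in g.objects(person, RDFS.seeAlso):
    print(source)   # each document that mentions Tim Berners-Lee's URI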
Moreover, Sindice enables Semantic Web clients such as Piggy Bank or Tabulator to find
documents with information about a given resource, identified through an explicit URI, an
inverse functional property or a keyword search. This capability fits well on top of many
existing Semantic Web clients. The immediate use for Sindice inside such clients is to enable
a “find out more” button, to be shown next to the available information about a resource.
Upon pressing that button, the client would contact Sindice for a ranked list of
documents with more information about the resource. The user would be presented with a
ranked list of these documents including a human-readable source description. The user
could then choose the sources of interest (or those considered trustworthy), after which the
client application could import the information from these documents. The user might also
choose to “always consider these domains as providers of good information”, allowing fully
automated information import during subsequent lookups.
For clients that implement the linked data principles, integration with Sindice is trivial.
Sindice behaves as a “good citizen” of the linked data Web: it provides all results as RDF
that itself follows the “linked data” principles. While Sindice supports, uses, and promotes
the linked data model (namely in its crawling and ranking), it also supports locating
information about URIs that are not URLs and cannot be dereferenced, such as telephone
numbers or ISBN numbers. Most importantly, Sindice helps locate statements about
resources made outside their “authoritative” source.
Watson
Watson (Fig. 3) allows searching through ontologies and semantic documents using
keywords. At the moment, you can enter a set of keywords (e.g. "cat dog old_lady"), and
obtain a list of URIs of semantic documents in which the keywords appear as identifiers or in
literals of classes, properties, and individuals. You can also use wildcards in the keywords
(e.g., "ca? dog*"). Developed by KMi, UK
Fig. 3. Watson.
Although the primary goal of Watson is to support semantic applications, it is important for
it to provide web interfaces that facilitate access to ontologies for human users. Users may
have different requirements and different levels of expertise concerning semantic
technologies. For this reason, Watson provides different “perspectives”, from the simplest
keyword search to sophisticated queries using SPARQL. These interfaces are implemented
in JavaScript, following the principles of AJAX, using the DWR library.
The keyword search feature of Watson is used much like an ordinary web or desktop
search system. The set of keywords entered by the user is matched against the local names,
labels, comments, or literals of entities occurring in semantic documents. A list of matching
ontologies is then displayed with, for each ontology, some information about it (language,
size, etc.) and the list of entities matching each keyword. The search can also be restricted to
consider only certain types of entities (classes, properties, individuals) or certain descriptors
(labels, comments, local names, literals).
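The matching and restriction logic can be pictured with a small Python sketch; the data
structures and sample entities below are invented for illustration and say nothing about
Watson's internals.

# Toy sketch of Watson-style keyword matching over entity descriptors.
# Sample entities and field names are invented for illustration.
entities = [
    {"uri": "http://example.org/onto#Cat", "kind": "class",
     "local_name": "Cat", "label": "cat", "comment": "a small domesticated feline"},
    {"uri": "http://example.org/onto#hasOwner", "kind": "property",
     "local_name": "hasOwner", "label": "has owner", "comment": ""},
]

def search(keywords, kinds=None, descriptors=("local_name", "label", "comment")):
    # Return entities whose selected descriptors contain every keyword,
    # optionally restricted to certain entity kinds.
    hits = []
    for entity in entities:
        if kinds and entity["kind"] not in kinds:
            continue
        text = " ".join(entity[d] for d in descriptors).lower()
        if all(k.lower() in text for k in keywords):
            hits.append(entity["uri"])
    return hits

print(search(["cat"], kinds={"class"}))   # ['http://example.org/onto#Cat']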
At a technical level, this functionality relies on the Apache Lucene indexing system.
Different indexes – concerning semantic documents, entities, and relations between entities
– are built from the metadata extracted during validation.
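For a runnable approximation of this Lucene-based indexing, the sketch below uses
Whoosh, a Lucene-like pure-Python library; the schema fields and sample document are
assumptions, and Watson itself uses Apache Lucene in Java.

# Sketch of Lucene-style indexing using Whoosh (a Lucene-like library);
# the schema and sample document are assumptions for illustration.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(uri=ID(stored=True, unique=True),
                labels=TEXT(stored=True),
                comments=TEXT)

os.makedirs("watson_index", exist_ok=True)
ix = create_in("watson_index", schema)

writer = ix.writer()
writer.add_document(uri="http://www.aktors.org/ontology/portal#Researcher",
                    labels="Researcher", comments="a person doing research")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("labels", ix.schema).parse("researcher")
    for hit in searcher.search(query):
        print(hit["uri"])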
One principle applied to the Watson interface is that every URI is clickable. A URI
displayed in the result of the search is a link to a page giving the details of either the
corresponding ontology or a particular entity. Since these descriptions also show relations to
other elements, this allows the user to navigate among entities and ontologies. For example,
with the query “university researcher student”, we obtain 19 matching semantic documents.
Among them, http://www.aktors.org/ontology/ portal.daml contains the entity
http://www.aktors.org/ontology/portal#Researcher. Clicking on this URI, we can
see that this entity is described (sometimes with different descriptors) in several ontologies.
In particular, it is shown to be a subclass of
http://www.aktors.org/ontology/portal#Working-Person. Following the link
corresponding to this URI also shows its description in each of the semantic documents it
belongs to. Then, the metadata corresponding to one of these documents can be retrieved
following the appropriate link, e.g. http://www.aktors.org/ontology/portal, to find
out about its languages, locations, etc. Finally, a page describing a semantic document
provides a link to the SPARQL interface for this semantic document, as described in the next
paragraph.
A SPARQL endpoint has been deployed on the Watson server and can be customized to the
semantic document to be queried. This endpoint is implemented with the Joseki SPARQL
server for Jena. A simple interface allows the user to enter a SPARQL query and execute it
on the selected semantic document. This feature can be seen as the last step of a chain of
selection and access tasks using the Watson web interface: keyword search and ontology
exploration allow the user to select the appropriate semantic document to be queried. The
developers plan to extend this feature so that it can query not just one semantic document
at a time, but also automatically retrieve the semantic data useful for answering the query.
This kind of feature, querying a whole repository instead of a single document, has been
implemented in the OntoSearch2 system.
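Client-side, querying such an endpoint looks roughly like the following Python sketch using
the SPARQLWrapper library; the endpoint URL is a placeholder, not Watson's actual address.

# Sketch: run a SPARQL query against a Watson-style endpoint.
# The endpoint URL below is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/watson/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?sub WHERE {
        ?sub rdfs:subClassOf <http://www.aktors.org/ontology/portal#Working-Person> .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["sub"]["value"])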
Falcons
Falcons (Fig. 4) is a keyword-based search engine for the Semantic Web, equipped with
browsing capability. Falcons provides keyword-based search for URIs identifying objects and
concepts (classes and properties) on the Semantic Web. Falcons also provides a
summarization for each entity (object, class, property) for rapid understanding. Falcons
currently indexes 7 million RDF documents and allows you to search through 34,566,728
objects. Developed by IWS China.
Fig. 4. Falcons.
In many cases, people have something in mind and want to learn more about it. For
example, to attend ISWC 2008, many researchers come to Karlsruhe for the first time. They
want to know more about this city, maybe anything or maybe something in particular.
Actually, such requirements exist widely in traditional Web search, covering over 60% of
Web queries. Traditional Web search engines return webpages that contain the keywords in
a query, e.g., “Karlsruhe”. The user obtains a series of webpages but has no way to get
them organized; in other words, it is not easy to focus on a particular dimension of
knowledge about the subject, except by resubmitting queries with different combinations
of keywords and trying one's luck.
The Semantic Web brings the possibility of classifying knowledge without third-party
algorithms. Objects on the Semantic Web generally carry typing information, which can
be naturally utilized to organize knowledge. In Falcons, after submitting the query “Karlsruhe”,
the user is served with a list of objects as well as several types such as “Event”, “Landmark”,
and “Organization”. Then the user can specify a type to focus on a particular dimension of
knowledge. For example, organizations in Karlsruhe attract the user’s interest, and thus
“Organization” is selected. Thus, the results are filtered to include only the objects of the
type “Organization”, such as “University of Karlsruhe”. Moreover, the type panel is updated
to include “Club”, “Company”, “University”, etc., which are all subclasses of “Organization”.
So actually, Falcons enables the user to navigate a class hierarchy to shift the focus from one
type of knowledge to another.
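The type-based filtering just described can be sketched in a few lines of Python; the class
hierarchy and objects below are invented for illustration.

# Toy sketch of Falcons-style type faceting: filter results by a chosen
# class and offer its direct subclasses as the next facets.
subclass_of = {                      # child class -> parent class
    "University": "Organization",
    "Company": "Organization",
    "Organization": "Thing",
    "Landmark": "Thing",
}
objects = [
    {"name": "University of Karlsruhe", "type": "University"},
    {"name": "Karlsruhe Palace", "type": "Landmark"},
]

def is_a(obj_type, target):
    # Walk up the class hierarchy to test membership.
    while obj_type is not None:
        if obj_type == target:
            return True
        obj_type = subclass_of.get(obj_type)
    return False

def facet(results, chosen_type):
    filtered = [o for o in results if is_a(o["type"], chosen_type)]
    # The next facet panel offers the direct subclasses of the chosen type.
    panel = sorted(c for c, parent in subclass_of.items() if parent == chosen_type)
    return filtered, panel

hits, panel = facet(objects, "Organization")
print([o["name"] for o in hits])   # ['University of Karlsruhe']
print(panel)                       # ['Company', 'University']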
In some cases, people seek objects with one or more particular properties, e.g., “find out
people that know Peter Mika”. The user has some knowledge about the targets but wants to
know more. In traditional Web search, it is called a resource query, which covers over 20%
of Web queries. In this case, the user knows that the target people know Peter Mika, and
wants to obtain their names. In traditional Web search engines, the user may submit a query
with the phrases “knows” and “Peter Mika”, but he/she is likely to find a lot of webpages
that contain both phrases yet do not answer the query well. Even if some useful webpages
are returned, the user still has to open those pages and mine the related knowledge from
them, which costs a lot of time.
On the Semantic Web, objects are described with RDF triples, which are naturally
property-value pairs. This creates the conditions for improving retrieval precision. In
Falcons, after submitting a query with the phrases “knows” and “Peter Mika”, the user
obtains a list of objects. The user can immediately see that Michael Hausenblas and Frank
van Harmelen know Peter Mika, and the demand is satisfied even before clicking on any
links in the results page. This is because, for each object in the results, Falcons provides a
snippet of knowledge about the object in the form of property-value pairs. Importantly, the
snippet is query-dependent, and so may directly answer the question the user has in mind.
If the user wants to know more about a resulting object, he/she can click on it to obtain
comprehensive knowledge about it, integrated from all over the Semantic Web.
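A toy sketch of such a query-dependent snippet: rank an object's property-value pairs by
their overlap with the query terms (the sample data is invented).

# Toy sketch of a query-dependent snippet: an object's property-value
# pairs are ranked by overlap with the query terms. Data is invented.
person = {
    "name": "Michael Hausenblas",
    "knows": "Peter Mika",
    "homepage": "http://example.org/~michael",
}

def snippet(obj, query_terms, max_pairs=2):
    terms = {t.lower() for t in query_terms}
    def score(pair):
        prop, value = pair
        text = (prop + " " + value).lower()
        return sum(t in text for t in terms)
    ranked = sorted(obj.items(), key=score, reverse=True)
    return ranked[:max_pairs]

print(snippet(person, ["knows", "Peter", "Mika"]))
# the pair answering the query, ('knows', 'Peter Mika'), comes first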
On traditional Web search engines, in order to seek relations between objects, e.g.,
between Peter Mika and Jim Hendler, the user submits a query with their names. A webpage
is returned merely because its content contains both names, although in many cases it is
difficult to find any relation between the two phrases in the page due to the long textual
distance between them.
Comparatively, the RDF model used by the Semantic Web exactly describes relations
between resources. The scenario in the previous subsection has already shown how Falcons
enables users to seek direct relations; it also enables them to seek indirect relations. Thus,
the user submits a query with the phrases “Peter Mika” and “Jim Hendler” and obtains a list
of objects. The first is a person that knows both Peter Mika and Jim Hendler, and the
second is a conference for which both Peter Mika and Jim Hendler are organizing
committee members. The user immediately gets answers and does not have to spend any
more time in other activities like reading webpages.
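A minimal sketch of indirect relation finding: collect the nodes that link to both query
resources (the triples are invented for illustration).

# Toy sketch of indirect relation finding: two resources are related
# through any node linked to both of them. Sample triples invented.
triples = [
    ("Frank", "knows", "Peter Mika"),
    ("Frank", "knows", "Jim Hendler"),
    ("ISWC", "organizer", "Peter Mika"),
    ("ISWC", "organizer", "Jim Hendler"),
]

def indirect_relations(a, b):
    # Collect, per subject, the set of objects it links to; any subject
    # linked to both a and b is an indirect connection between them.
    links = {}
    for s, _p, o in triples:
        links.setdefault(s, set()).add(o)
    return [s for s, objs in links.items() if a in objs and b in objs]

print(indirect_relations("Peter Mika", "Jim Hendler"))   # ['Frank', 'ISWC']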
Swoogle
Swoogle (Fig. 5) searches through over 10,000 ontologies and has 2.3 million RDF
documents indexed, currently including those written in RDF/XML, N-Triples, N3 (RDF) and
some documents that embed RDF/XML fragments. It allows you to search through
ontologies, instance data, and terms (i.e., URIs that have been defined as classes and
properties). In addition, it provides metadata for Semantic Web documents and supports
browsing the Semantic Web. Swoogle also archives different versions of Semantic Web
documents. Developed by the Ebiquity Group of UMBC.
Fig. 5. Swoogle.
The standard search engine interface is quite simple: the user can just type one or more
keywords describing the information he/she is trying to locate. This is no more
complicated than a traditional Web search engine. However, like a traditional Web search
engine, this can lead to a large number of irrelevant results. To narrow the search, users can
restrict it to the specific type of resource they are trying to locate, such as a person
(FOAF Person) or news article (RSS Item). If the search is still producing a large number of
irrelevant results, they can refine it further by specifying one or more specific property
values that the resource must have. For example, if a user is trying to locate a person with a
last name of 'Smith' and a first name of 'John', he/she would enter the search string
'[foaf:surname]~smith [foaf:firstName]~john'.
Thus, the best strategy is to narrow the search incrementally. First, enter the keywords
that best describe what you are looking for (e.g. web browser). Then select the type of
resource you are looking for from the drop-down list (e.g. RSS Item). Finally, enter one or
more property values of the resource (e.g. [rss:title]~web browser).
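Parsing such a search string into property constraints is straightforward; the following
Python sketch handles the '[prefix:property]~value' form shown above (the exact grammar
beyond this example is an assumption).

# Sketch: parse a Swoogle-style search string of the form
# '[prefix:property]~value ...' into (property, value) constraints.
import re

PATTERN = re.compile(r"\[([^\]]+)\]~(\S+)")

def parse_query(search_string):
    return PATTERN.findall(search_string)

constraints = parse_query("[foaf:surname]~smith [foaf:firstName]~john")
print(constraints)   # [('foaf:surname', 'smith'), ('foaf:firstName', 'john')]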
Zitgist Search
The Zitgist Query Service (Fig. 7) simplifies the Semantic Data Web Query construction
process with an end-user friendly interface. The user need not conceive of all relevant
characteristics in advance; appropriate options are presented based on the current shape of the query.
Search results are displayed through an interface that enables further discovery of additional
related data, information, and knowledge. Users describe characteristics of their search
target, instead of relying entirely on content keywords.
Hakia
Hakia (Fig. 8), the brainchild of Dr. Riza C. Berkan, tries to anticipate the questions that
could be asked of a document and uses them as the gateways to the content. Search
queries are mapped to the results and ranked using an algorithm that scores them on
sentence analysis and on how closely they match the concept related to the query. Hakia
performs pure analysis of content irrespective of links or clickthroughs among the
documents (its developers are opposed to statistical models for determining relevance).
The engine has also started using the Yahoo BOSS service and presents results in a
“gallery” with categories for different content matching the query. Users can also request
to try out the incremental changes being tested at Hakia’s Lab.
Fig. 8. Hakia.
Hakia is a relatively new search engine that aims to find and present search results in a
new way. In its view, the future of search is understanding information, not merely finding it.
For instance, suppose a user is curious about the Renaissance scientist Johannes Kepler. If he
searches Hakia for Johannes Kepler, he gets a presentation page from the Hakia Galleries,
containing a picture of Kepler along with search results grouped in categories like Biography
and Timeline, Awards and Accomplishments, and Speeches and Quotes. This is a very
convenient way to get the search results presented.
Currently, the Hakia Galleries answer around 600,000 popular queries on various topics
of interest, with coverage expanding every day. A few of these are: piano, Hillary Clinton,
coffee, India, breast cancer, Red Sox, Paris Hilton, Pokemon. Hakia galleries are distilled in a
semi-automated process, a mixture of meaning-based technology and editorial work.
Editors are involved in the automated gallery generation process as administrators; their
role includes checking, correcting, and removing inappropriate items. Note that humans are
not involved in acquiring search results; that is all automated.
However, the user may not need a gallery of information but something more specific,
such as learning about drugs to remedy a headache. He could ask Hakia: “Which drug treats
headache?”. In the search results, sentences that contain an answer to the question are
highlighted, such as “aspirin has been used to treat migraine and other headaches” and
“Nurofen is indicated for the relief of headache and back pain of musculo-skeletal origin”.
So even before the user clicks on the search results, he has some suggestions.
Notice that the sentences in the search results do not contain exactly the same words as
the query. Neither sentence contains the word ‘drug’, and one talks about relieving
headache instead of treating headache. This is because Hakia uses fuzzy logic to expand the
query. Fuzzy logic here means a flexible algorithm: the flexibility is used to take the original
query and create equivalent and enriched versions of it on the fly. The principles used in
this process come from ontological semantics. The reason Hakia does this is to bring in
search results from a variety of equivalent articulations of the search query and related
concepts. For example, the word headache is related to migraine, and the word treatment
is related to cure. Without such enrichment, a search algorithm will stick to the exact word
used, and will not be able to retrieve results from equally relevant material.
So if the user searches for “headache treatment”, a search on Yahoo or Google will not
return results regarding a cure for migraine unless the phrase “headache treatment” is
present on the relevant page, or someone has used it when linking to that page. Note that
Hakia is the first search engine to introduce this ability to users, although the current beta
version is not yet fully equipped with this capability.
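The enrichment step can be pictured with a toy Python sketch: each query term is
expanded into a set of related concepts. The concept map below is invented; Hakia's actual
ontological resources are proprietary.

# Toy sketch of ontology-style query expansion: each query term is
# replaced by the set of its related concepts. The concept map is
# invented for illustration.
related = {
    "headache": {"headache", "migraine"},
    "treatment": {"treatment", "cure", "relief"},
}

def expand(query):
    expanded = []
    for term in query.lower().split():
        expanded.append(related.get(term, {term}))
    return expanded   # one set of alternatives per original term

for alternatives in expand("headache treatment"):
    print(sorted(alternatives))
# ['headache', 'migraine']
# ['cure', 'relief', 'treatment']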
Natural language search is one of Hakia’s advantages. This simply means that the user
can pose a question to the search engine (e.g. “When was Abraham Lincoln born?”) instead
of breaking the question down into keywords (e.g. “abraham lincoln birth”). Natural
language search also means that the user can expect an answer to his question right on the
search results page. Hakia will present search results which contain an answer to the
question, not a list of web pages that might or might not contain an answer. This is made
possible by research in the intersection between the scientific disciplines of philosophy of
language, mathematical logic, and cognitive science. This is called ontological semantics, a
formal and comprehensive linguistic theory of meaning in natural language.
Hakia’s SemanticRank algorithm differs from popularity algorithms like Google’s
PageRank in that it determines a site’s relevancy not by its popularity (which in Google is
determined by popular vote via link referrals), but by the relevancy of the query to the
content of the page. The critical breaking point in this equation comes when user queries
start to become longer than usual, unique, complex, and personal. When this happens, the
‘popularity’ reference point disappears. Queries like these are called long-tail queries, and
there are zillions of them.
Moreover, Hakia has invented a new system called Qdexing, which is specifically
designed for meaning representation; Qdex stands for query detection and extraction. It
entails analyzing the entire content of a webpage and then extracting all possible queries
that can be asked of this content, at various lengths and in various forms. These queries
become gateways to the originating documents, paragraphs and sentences during the
retrieval mode. Note that this is done off-line, before any actual query is received from a user.
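As a toy illustration of this offline stage, the sketch below derives candidate queries from a
page's sentences and maps each query back to its source sentence; the single pattern rule is
invented, and Hakia's real extraction is far richer.

# Toy sketch of Qdex-style offline processing: extract candidate queries
# from a page and map each back to its source sentence. The one pattern
# rule here is invented; the real system is far more sophisticated.
import re

def qdex(page_text):
    gateways = {}
    for sentence in re.split(r"(?<=[.!?])\s+", page_text.strip()):
        # Crude rule: a sentence "X treats Y" yields the query "what treats Y".
        match = re.search(r"(\w+) treats (\w+)", sentence, re.IGNORECASE)
        if match:
            query = "what treats " + match.group(2).lower()
            gateways.setdefault(query, []).append(sentence)
    return gateways

index = qdex("Aspirin treats headache. It is widely available.")
print(index)   # {'what treats headache': ['Aspirin treats headache.']}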