Apache Solr Succinctly
By
Xavier Morera
This book is available for free download from www.syncfusion.com on completion of a registration form.
If you obtained this book from any other source, please register and download a free copy from
www.syncfusion.com.
This book is licensed for reading only if obtained from www.syncfusion.com.
This book is licensed strictly for personal or educational use.
Redistribution in any form is prohibited.
The authors and copyright holders provide absolutely no warranty for any information provided.
The authors and copyright holders shall not be liable for any claim, damages, or any other liability arising
from, out of, or in connection with the information in this book.
Please do not use this book if the listed terms are unacceptable.
Use shall constitute acceptance of the terms listed.
SYNCFUSION, SUCCINCTLY, DELIVER INNOVATION WITH EASE, ESSENTIAL, and .NET ESSENTIALS are the
registered trademarks of Syncfusion, Inc.
Table of Contents
About the Author
Preface
  Introduction
  My Promise to You
  Who is This Book For?
  Code Examples
  Acknowledgements
Chapter 1 Why Solr and Enterprise Search?
  Search is Everywhere
  Definition
  Why Solr?
  Solr's History and Famous Sites
Chapter 2 Architecture of an Enterprise Search Application
  Where and How
    Placing the Search Engine
    Inside the Search Engine
Chapter 3 Solr Configuration
  Getting Solr
  Starting Solr
  Configuring Solr in a Different Port
  Solr's Admin UI
    Getting Assistance
    Dashboard
    Logging
    Core Admin
    Java Properties
    Core Selector
      Analysis
      DataImport
      Documents
      Files
      Ping
      Plugins and Stats
      Query
      Replication
      Schema Browser
    Summary
Chapter 4 Your First Index
  Solr's Sample Data
  Simple Anatomy of a Query and a Response
    Query
    Response
      Other Response Sections
      Docs and Modeling Your Data
    Playing Around with Solr
      A Real Query with Facets
      Fields
      Sorting
    Summary
Chapter 5 Schema.xml: The Content
Whenever platforms or tools are shipping out of Microsoft, which seems to be about
every other week these days, we have to educate ourselves quickly.
Free forever
Syncfusion will be working to produce books on several topics. The books will always be free.
Any updates we publish will also be free.
Preface
Introduction
Search is everywhere, yet it is one of the most misunderstood functionalities of the IT industry. It is an incredibly useful feature that most people (including developers) take for granted, unless it's missing or poorly implemented, and then you frustrate and annoy your users.
Enterprise search never used to be for the faint of heart, or for those who possessed a thin
wallet; it frequently needed a lot of time and deep pockets to get it right. Apache Solr has
changed all that.
Even though Apache Solr is highly popular, getting started can sometimes be daunting. That's what motivated me to write this book. While there's a lot of information about search engines and Solr, in my opinion it's not simple enough to get some people started; information is scattered all over the place and often difficult to find. The Solr Wiki is very complete, but deeply technical, and in many cases it scares beginners away.
Of course, you also have the option of evaluating commercial search engines, but they can be hugely expensive and come with a steep learning curve. Because of this, Solr has rapidly become the number one choice. This is my personal opinion, but it is shared by thousands of developers and companies all over the world.
Most importantly, however, I have a promise for you.
My Promise to You
I promise that in the next couple of hours and hundred pages, I will teach you to build something that might take you weeks to learn on your own. Together we'll create a search experience that, if done from scratch, could cost thousands of dollars to build, and we'll have a lot of fun along the way. That's not bad for a free e-book, right? However, here's my disclaimer: it's not going to be a fully advanced and complete application, and I will be leaving lots of room for improvement and expansion. I do promise you, though, that it will be an amazing start and a very interesting journey.
Code Examples
All code examples in this book can be found on GitHub at https://github.com/xaviermorera/solr-succinctly.git.
Acknowledgements
Special thanks to Syncfusion for providing me with the opportunity to author this book, to Pluralsight for the support in creating the Getting Started with Enterprise Search using Apache Solr training, and to Search Technologies for initiating me into the wonderful world of search engines.
And of course, to my wife for being so patient with me in all my endeavors, which include a lot of 16-hour workdays and 80-hour workweeks. And to my reasons for living, my daughters Juli and Luci.
There is far more to search than meets the eye, however. Mr. Kamran Khan, CEO of Search
Technologies, says that in the majority of cases there are only two types of search: outside the
firewall, and inside the firewall. Outside the firewall is used to make money, and inside the
firewall to save money.
So I asked, why?
Outside the firewall, search is a powerful tool for selling. Think, for example, of eBay and Amazon. A good search in an e-commerce site allows a customer to find what he or she is looking for and purchase it. Ka-ching! The cash register is happy!
Inside the firewall, search helps find preexisting items, related work, or internal documents, all of which allow employees to leverage the technology to their advantage and avoid duplicating work.
People expect to find things, and fast; human nature craves simplicity and accuracy.
Definition
Let's look at the definition of search:
To make a thorough examination of, or look over carefully in order to find something.
To make a careful examination or investigation of, to probe. Or to conduct a thorough investigation, seek.
Source: American Heritage Dictionary of the English Language, Fourth Edition (or Google define:search)
As the definition points out, searching is the action of seeking something, yet the most important part of searching for something is the ability to find it. I've said it several times to multiple search engineers: instead of search engines, we should call them find engines, but this idea has gained no traction.
Semantics aside, this book will focus on Enterprise Search, and specifically on Solr. We define Enterprise Search as the practice of generating content and making it searchable to a defined audience from multiple enterprise-type data sources, like a database or a CMS.
As an example, if you use SharePoint in your organization, the search input found at the top right is an enterprise search solution. Anything that takes a large tangled mass of many different sources of internal corporate data, and allows that data to be indexed, filtered, and organized with the goal of making internal information easier to find, qualifies as an enterprise search solution.
Why Solr?
Apache Solr is open source; it offers fast and sophisticated text search, it's highly extensible and highly scalable, and it can work with dynamic content. It has great query speed when properly scaled, and there are many more reasons. Solr also has a very active development community made up of individuals and companies who contribute new features and bug fixes on a regular basis.
On an historic note, search never used to be for the faint of heart. Some of the older solutions
were very, very complex and would easily cost many tens of thousands of dollars; a fully
commercially supported solution might even cost millions of dollars. Then Solr changed the
name of the game in a very big way, and now it's here to stay.
Search engines are a totally different animal. You will either fall in love with what you can do with a search engine, or you might end up absolutely hating them if you try to tackle them head-on without the proper resources. With Solr, you're in luck: it is a proper resource for a small budget, with an army of helpers to get you started smoothly and efficiently.
It is absolutely clear that application architectures can be wildly different, but let's make a few assumptions here and generalize about some of the most common use cases, starting from the top of the diagram.
We can assume that our application will have a UI, which can be built in ASP.NET Web Forms,
MVC, AngularJS, PHP, or many other UI frameworks. Our application also has an API that
might be used for other applications to connect to, such as an iOS or Android mobile
application.
Eventually we get to the application, which may be your key source of income, and you are very proud of it. If you're like I was before I discovered Solr, you probably have something really nice, but with technical elements that just do not feel right. You may even have provided a not-so-nice user experience that frustrated a few, or even a few thousand, users.
This is where search comes in. You connect to the search engine via the search API. Solr provides a RESTful interface for your needs, or you can choose a client like SolrNet or SolrJ. This all means that your application can run a query or two, refine and provide the user with indexes to the exact resulting content, and, through the use of metadata, retrieve the required results with the appropriate levels of security.
Let's go to the bottom of the diagram for a moment to understand the multiple data sources that can provide data to your search engine. Most applications get their data from a database, like SQL Server or MySQL. However, in many cases they could also be getting it from a NoSQL database, a content source, other applications like a Content Management System, or the file system.
There are multiple ways to retrieve the data that we will be adding to the search engine. One of them is what's called a connector, which retrieves data from the store and provides it to a document-processing pipeline.
The document-processing pipeline, also known as DPMS, takes the content from a data source, performs any necessary transformations, and prepares the data to be fed to the search engine.
The first and most important point is that Lucene, a free, open-source information retrieval software library, is the actual search engine that powers Solr. This is such an important point that Solr has actually been made part of the much larger Apache Lucene project.
It really caught my attention when I first discovered Solr within Lucene, so much so that I simply
had to investigate it further, and I'm very glad I chose to do so. Lucene is written in Java, was
originally created in 1999 by Doug Cutting, and has since been ported to multiple other
languages. Solr, however, continues to use the Java version.
There are many other projects that extend and build on Lucenes capabilities. One of these is
ElasticSearch, which even makes for a good Solr contender (though arguments are accepted).
On top of Lucene, we have the Solr core, which runs an instance of a Lucene index and its logs, along with all the Solr configuration files. Queries are formatted and expanded in the way Lucene expects them, meaning you do not need to do this manually (which can be tedious and complex). These queries are configured and managed (along with how to expand them and how to configure the schema details) in the files schema.xml and solrconfig.xml. In simpler deployments, you can often get away with modifying only these two files. What follows is a very short explanation of the purpose of each one:
Schema.xml contains all of the details about which fields your documents can contain and how those fields should be handled when adding documents to the index or querying them.
Solrconfig.xml contains most of the parameters for configuring Solr itself. Stripped-down sketches of both files follow.
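The following are simplified sketches, not the full files that ship with Solr; the element names are the standard ones, but the real files contain many more types, fields, and handlers.

schema.xml (simplified):

<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>

solrconfig.xml (simplified):

<config>
  <luceneMatchVersion>4.10.2</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
</config>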
If you look within Solr Core(s) in the Solr architecture diagram in Figure 4, you can see where analysis and caching reside. Analysis is in charge of processing fields at either query or indexing time. Caching improves performance.
Initially Solr only supported a single core, but more recent versions can support multiple cores,
each one of which will have all the components shown in orange on the architecture diagram.
Solr also uses the word collection very often; in Solr-speak, a collection is a single index that
can be distributed among multiple servers. When you download and start Solr, it comes with a
sample index called collection1, which you can also call a core.
To be very clear, let's define some common Solr nomenclature:
Core: a single Lucene index plus the configuration (such as schema.xml and solrconfig.xml) needed to use it.
Collection: a single logical index, which may be distributed among multiple cores or servers; in a simple single-node setup, core and collection are often used interchangeably.
Things get a bit more complex when you introduce SolrCloud Replication and start talking about
Shards, Leaders, Replicas, Nodes, Clusters, and ZooKeeper; these, however, are advanced
concepts that would belong in a second book about the subject.
Request handlers are responsible for defining the logic executed for any request received by
Solr. This includes queries and index updates.
Once a query is received, it is processed by the query parser. There are many parsers
available, such as the Standard query parser, DisMax, and eDisMax, which are the most
commonly used. You can, however, create your own custom parser if you wish.
Before Solr 1.3, creating a custom parser was the only way forward. Since version 1.3, DisMax became the default query parser, while still maintaining the ability to customize things when needed.
Response writers are in charge of preparing the data in multiple formats to be sent back to the client, for example, as JSON or XML.
The HTTP request servlet is where you connect to Solr, and the update servlet is used to modify
your data via the update handler.
Note: If the term "servlet" is a strange one, don't worry. Think of a servlet as an endpoint
on a web server. Servlets are specific to the Java world, and are similar to controllers in
other web technologies.
Eventually we reach the admin interface servlet, which provides Solr's default administration UI, something you'll come to rely on once you have deployed your search engine.
We could easily keep peeling away layer after layer and getting into more and more complex
and advanced functionality. However, that's not the purpose of this book, so we'll keep the
details at a reasonably simple level.
Click Download so that you are redirected to the appropriate mirror site for downloading Solr's latest version. In this case, since I'm running Windows, I will be downloading the zip file, solr-4.10.2.zip. Source code is also available for download in solr-4.10.2-src.tgz, and if you need an older version of Apache Solr, you can go to the Apache archives.
The file may take some time to download due to its 150-MB size. While you wait, now is a good time to start checking your prerequisites, mainly Java. In older versions you could run on Java 1.6, but with Solr 4.8 and above, you need Java 7 (hopefully update 55 or later, as there are known bugs in previous versions). At the time of writing, 4.10.2 is Solr's latest version.
To confirm you have the correct Java installed, open the Windows command line, which can be done via the Windows key + R, or by typing cmd in the Start menu or Run screen. Now, within the command line, type java -version. The response will tell you which version of Java you are running. A "java is not recognized as an internal or external command" response means Java is not properly installed. Please go back to Java's installation instructions, making sure the environment variable PATH correctly points to the Java directory. A correct installation will show something like the following:
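The exact version and build numbers will vary with your installation; this is only an illustrative example:

java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)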
In the solr-succinctly directory, you will find several folders and files. First, there are a few text files, which include changes, license, notice, readme, and system requirements.
In the example folder, you will find a fully self-contained Solr installation. It comes complete with a sample configuration, documents to index, and a web application server called Jetty for running Solr directly out of the box. If you are a .NET developer, think of Jetty as the equivalent of IIS.
The Jetty application web server provided with this distribution is meant for development
purposes. However, there are full distributions of the same software available for production use
when you reach that point.
In the dist folder, you should find a file named Solr.war; this is the main Solr application that
you deploy to your application server in order to run Apache Solr. This folder also contains
many useful JAR files. To clarify, a JAR (Java Archive) is a package file format typically used to
aggregate many Java class files and associated metadata and resources (such as text, images,
etc.) into one file to distribute application software or libraries on the Java platform.
In the contrib folder, you should find Solr's contribution modules. As with many open source projects, what you'll find in here are extensions to Solr. The runnable Java files for each of these contrib modules are actually in the dist folder.
In the docs folder, you'll find HTML files and assets that will increase your understanding of Solr. You'll find a good, quick tutorial, and of course, Solr's core API documentation.
I've seen a few people copy only the example folder to get Solr started, especially for local deployments during development. It works, but it will present you with a number of problems, as there are dependencies that you'll almost certainly need to make things run correctly. It's always best to copy the entire contents of the downloaded zip file. Paths are relative, however, meaning you can easily rename example to something more meaningful without causing any significant issues.
For my purposes throughout this book, I'll rename my cloned folder succinctly.
Starting Solr
Now that we have Solr, let's fire it up and get the party started!
At this point you might be expecting a solrinstaller.exe. This is not how it works. It is a bit different, although not complicated at all.
We're now ready to run the Solr development environment using the included application web server, Jetty. A word of advice: Jetty is included with Solr, but it is not the only option. I also use Tomcat for production purposes, and there are other alternatives. The bundled Jetty just makes it a lot simpler to get started quickly.
I am using Windows right now, but the process is very similar in other operating systems.
The steps are extremely simple:
1. Open the command line, which can be done by typing cmd in the Windows Run dialog.
The Run dialog can be displayed with the Windows key + R.
2. Change the folder to the one you created previously, where you extracted Solr. Then, go
into the succinctly folder, which you recently cloned from example.
3. Now run java -jar start.jar. If all goes as expected, the console will start loading.
Initialization steps will be displayed in the command line; expect a large amount of text to be shown. This is normal.
4. And finally, the most important part of the setup: open a browser and navigate to
http://localhost:8983/solr. If you see the following, you should be smiling, because you
have Apache Solr running:
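As a recap, the whole startup sequence on the command line looks like this (the paths assume you extracted to C:\solr-succinctly and renamed example to succinctly, as described earlier):

cd C:\solr-succinctly\succinctly
java -jar start.jar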
If you don't see the screen in Figure 9 in your browser, or if Solr does not load, please review the text output in your console. Exceptions are visible in the messages, though sometimes they are hard to find. The most likely scenario where Solr will not load is if there are errors in the configuration files, most likely due to changes that have been made while experimenting.
After you change the port, you will see that the new port is not yet working as expected. You need to restart Solr. This is NOT a hot-swap change!
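For reference, here is one way to start the bundled Jetty on a different port from the command line. This is only a sketch: it assumes the example Jetty configuration reads the jetty.port system property, as the Solr 4.x distributions do; if you changed the port by editing etc/jetty.xml instead, simply restart with the usual command.

java -Djetty.port=8984 -jar start.jar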
To stop the current Solr instance, you need to change to the window where you started Solr, and then press Ctrl + C so the service shuts down. Then restart using the same command as before, java -jar start.jar. Solr will start. Now, if you refresh your browser, once again you'll see Apache Solr, easy as that.
At this point I will revert to using 8983, the default port, and restart Solr. These steps apply only when using Jetty as an application web container. If you use Tomcat or another container, you'll need to use different configuration instructions.
Solr's Admin UI
Solr features a web interface that makes it easy for administrators and programmers to view the Solr configuration details, run queries, analyze document fields, and fine-tune a Solr instance, as well as access online documentation and help. As shown in Figure 12, the admin section is made up of the sections Dashboard, Logging, Core Admin, Java Properties, and Thread Dump. There's also the Core Selector (a drop-down list) with multiple functions, and the main working pane to the right of the menu.
If you've already pointed your browser at http://localhost:8983/solr, then you're ready to review each section in turn.
Getting Assistance
Underneath the main work pane, you'll see a small, icon-driven menu.
The main objective of this menu is to give you quick access to the various help and assistance resources available to Solr users. It is made up of the documentation, which is hosted online, and has links pointing to the official issue tracker located on the JIRA network. There's also a link to the Solr IRC channel, the community forum, and the Solr query syntax guide, all of which will be very useful.
Dashboard
The Dashboard is the default section that is loaded when you navigate to the Admin UI. It displays information it collects on your Instance, System, and Java Virtual Machine (JVM). Depending on your configuration, the memory graph may not display information, for example, when Windows virtual memory is set to automatic or when the system is configured not to use swap memory.
Logging
The Logging section displays messages from Solr's log file. When you start Solr you only have one core, but if you have multiple cores, messages from all of them will be displayed.
Underneath the Logging menu item, you see the hierarchy of class maps and class names for your instance. Click the column at the right and select the logging level from All, Trace, Debug, Info, Warn, Error, Fatal, Off, and Unset, as shown in Figure 15.
Core Admin
As you might remember from a previous section, we mentioned Lucene cores. A core is a full
copy of a Lucene index with its own schema and configuration.
You can manage your cores in the core admin section. The buttons at the top allow you to add a
core, unload one of the existing cores, rename a core, swap a core, reload the core with any
changes made since the last reload, and optimize a core.
Tip: When you click the Reload button, you have to wait for the button to turn green, or your changes will not take effect. The commands here are the same ones available through the core admin handler, but they are provided in a way that is easy to work with. If there are problems loading the core, you will see the exceptions in the log, or if you started Solr from the console, they will also be displayed there. Restarting Solr will also load all cores, including new ones.
Java Properties
The Java Properties screen allows easy, read-only access to one of the most essential
components of a top-performing Solr system. It allows you to see all the properties of the JVM
running Solr, including the class paths, file encodings, JVM memory settings, operating system,
and more.
Thread Dump
The Thread Dump screen lets you inspect the threads currently active in your server. Each thread is listed, and access to the stack traces is available where applicable. There's also an icon that indicates state; for example, a green check mark signifies a runnable state. The available states are new, runnable, locked, waiting, time waiting, and terminated.
Core Selector
The core selector allows you to select or find a specific core. Click Core Selector, and a drop-down menu will appear. You can start typing your core's name, which comes in handy when you have many cores, or you can click the name of your desired core. Once you have selected your core, you'll be able to perform core-specific functions. When you click on the core, it will start by displaying the Overview with the statistics for this particular core.
Analysis
The Analysis screen lets you inspect how your data will be handled during either indexing or query time, according to the field, field type, and dynamic rule configurations found in the schema.xml. Ideally, you want content to be handled consistently, and this screen allows you to validate the field type and field analysis chains.
This screen is also very useful for development when selecting analyzers for debugging
purposes. Analyzers will be mentioned later in this book.
DataImport
Some of the most common data sources include XML files and relational databases. Therefore,
we need an easy way to import from databases and XML files into Solr. This is achieved using
the DIH or data import handler. It is a contrib that provides a configuration-driven way to import
data into Solr in both full builds and incremental delta imports. The DIH within the admin UI
shows you the information about the current statuses of the data import handler.
In the current instance, there are no data import handlers configured, and they will not be
covered in this book. However, if you want to learn how to configure and use data import
handlers, your current Solr download comes with a predefined example that is easy to start and
test. Please go to the example-DIH folder in C:\solr-succinctly\example\ and open
Readme.txt. Follow the instructions you find there to get started.
Documents
The Documents screen allows you to execute multiple Solr indexing commands in a variety of formats directly from the browser. It allows you to copy or upload documents in JSON, CSV, or XML and submit them to the index. You can also construct documents by selecting fields and field values. You should always start by defining the request handler to use by typing its name in the Request-Handler (qt) textbox. By default, /update will be defined.
Files
The Files screen is used to browse and view the various configuration files for a specific core
(for example, solrconfig.xml and schema.xml). It is read-only, and it is a great way to access
your files without having to actually log into the machine.
Ping
You can ping a specific core and determine whether it is active. It is very simple to use: simply click this option, and it tells you how many milliseconds it took to respond.
Query
The Query section is probably one of the most important parts of the Admin UI. It's where you submit a structured query and analyze the results. The Admin UI includes a set of options for the multiple available parameters to make the user's life simpler (a combined example appears after this list), including:
Request-Handler (qt): Specifies the request handler to use; the standard handler is used if it's not specified.
q: The query; returned results are ranked from most relevant to least relevant.
fq: The filter query, basically used to narrow down result sets. The difference with q is
that fq does not affect ranking.
sort: Tells Solr by which field you want sorting to be applied, either ascending or
descending.
start, rows: Control at which result the response starts and how many results are returned. Used mainly for paging.
fl: Specifies which fields should be returned in the response. If not specified, all are
returned. In Solr 4 and above, you can specify functions (a more advanced topic).
df: The default field; it will only take effect if the qf (Query Fields) is not defined.
wt: The response writer, which indicates how to format the response; for example, XML
or JSON.
indent: Indents the response to make it more readable.
debugQuery: Used to display debug information.
dismax: Ticking this checkbox displays the DisMax query parser parameter. DisMax is
already the default query parser in newer versions of Solr.
edismax: Displays the Extended Dismax Parameters, which is an extended query
parser used to overcome the limitations of DisMax.
hl: Enables highlighting of results.
facet: Displays faceting parameter options.
spatial: Shows options for spatial or geo-spatial search.
spellcheck: Enables spell checking of results.
If an option is not available in the Admin UI, there are always the Raw Query Parameters,
which basically just pass along the specified parameters to Solr verbatim.
The options I just mentioned will be covered more in Chapter 8.
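Here is a sketch of how several of these parameters combine into a single request URL. The field names (inStock, price, name) assume the sample data that ships with Solr; adjust them to your own schema:

http://localhost:8983/solr/collection1/select?q=drive&fq=inStock:true&sort=price%20asc&fl=id,name,price,score&rows=10&wt=json&indent=true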
When you execute a query within the Admin UI, the results will load in the right-most panel. This
makes it very simple to run queries, review results, tweak, and run queries again.
Depending on your browser and configuration, one tip that I have for you is to open the results
within the browser and use XML instead of JSON. I normally use Google Chrome, and the
browser presents the XML in such a way so that you can expand and contract each section,
making it easy to view all results. Simply click the box with a link above the results that looks like
the one shown in Figure 24:
Now let's take a quick look at a response, which is made up of several sections that can include the following:
Response header: Includes the status, the query time, and the parameters.
Results: Includes the documents returned from the search engine that match the query, in doc subsections.
Facets: Items or search results grouped into categories that allow users to refine or drill down into specific search results. Each facet also displays the number of hits within the search that match each specific category.
I encourage you to play around and experiment with the query section; this is where you learn
the most about Solr.
The following figure shows you how a typical response might look:
Replication
Replication using master and slave nodes is the old method of scaling in Solr. The Replication screen lets you enable or disable replication. It also shows you the current replication status; in Solr, replication is for the index only.
Replication has been superseded by SolrCloud, which provides the functionality required to scale a Solr solution. However, if you're still using index replication, you can use this screen to see the replication state.
Schema Browser
The Schema Browser displays schema data. It loads a specific field when opened from the Analysis screen, or, if you open it directly, you can select a field or field type. If you click Load Term Info, it will show you the top terms in the index for that field. And if you click on a term, you will be taken to the Query screen to see the results of a query for that term in that field.
You can load the term information for a field if there are terms for that specific field. A histogram
will show the number of terms with a given frequency in that field. This may be a bit confusing in
the beginning, but later on it will be pretty useful.
Summary
We have concluded the quick tour of the Admin UI. The objective was to provide you with an overview of its many different components and explain what they are used for.
The next step in this journey is to move on to understanding how we model our data according to Solr's needs, and for this purpose, we will use the sample data provided.
Before we start indexing any documents, let's first confirm that we don't have any documents in the index. One way to do so is to navigate to the Admin UI, choose collection1 from the Core Selector, and click Query. Then, at the bottom of the section, click Execute Query, or click on any of the non-multiline text boxes and press Enter.
All in all, this constitutes quite a few steps; there is, however, a quicker way. Navigate directly to Solr via its RESTful interface, querying for all documents. This will not use the Admin UI; it will just run the query. The URL looks like this:
http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
As you can see in the results, we have zero documents in our index.
Time to upload the sample data. From exampledocs in your command prompt, type:
java -jar post.jar *.xml
All the XML files supplied have been posted directly into my index and have been committed; this has all been done automatically by the post tool. It's also worth noting that a simple mistake many people make is to post data to the index but forget to commit. Data is only ever available for searching if you remember to execute the commit; however, since the post tool does this for you automatically, it's a mistake you often won't make.
Post.jar is only one way of indexing documents. Another mechanism is the data import handler, which allows connections to databases and imports data in either full or incremental crawls. You can also add XML, JSON, CSV, or other types of files via the Documents section in the Admin UI. Additionally, you can use a client library, like SolrNet or SolrJ, and there are multiple content-processing tools that post documents to the Solr index. One that I see being used all the time is Search Technologies' ASPIRE, which has a PostToSolr functionality.
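To give you an idea of what the client-library route looks like, here is a minimal SolrJ sketch for adding a single document. It is illustrative only: it uses the SolrJ 4.x HttpSolrServer client, and the field names assume the bundled sample schema.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDocument {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Point the client at the collection1 core of the local Solr instance.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Build a document using a couple of fields from the sample schema.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "solrj-demo-1");
        doc.addField("name", "A document indexed from SolrJ");

        solr.add(doc);    // post the document to Solr
        solr.commit();    // make it visible to searches
        solr.shutdown();  // release the client's resources
    }
}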
Switch back to your browser and run the default query again. You should now see 32
documents in your index. The following figure shows the output you should now get, allowing
you to become familiar with Solr responses.
At this point, you've run a couple of queries, which amounts to asking your search engine to perform a basic query. You did this in two ways: first by using the Query section in the Admin UI, and second by using Solr's RESTful interface.
Tip: If your browser supports XML formatting (like Google Chrome does), you can make a quick change for easier readability. Open the response in your browser, look for the wt=json parameter in the URL, and change it to wt=xml. The wt is the response writer, which tells Solr how to format the response. Try it.
As we've seen so far, Solr uses a fairly standard RESTful interface, which allows you to easily see the URL used to make a query; like any standard URL, it's made up of the host name, the port number, and the application name.
The request handler for queries (in this case we're using select) is the default request handler, and is the Solr equivalent of Hello World. The default query of exampledocs is made up of the following URL parts:
http://localhost:8983/solr: the host name, port number, and application name.
/collection1: the collection (core) being queried.
/select?: the request handler used for queries.
q=*%3A*&wt=json&indent=true: the parameters; q=*:* (URL-encoded) matches all documents, wt=json asks for a JSON response, and indent=true makes it readable.
Like with any technology, the best way to learn and understand is to play with it; imagine Solr's default install as your big data and enterprise search training wheels. Open the Admin UI, change the parameters, and see how your results are modified and what differences your changes make to the search. Once you've tried a few queries and gotten a feel for how they work, you're ready to move on.
Response
When you run a query, the response you get will contain two full sections:
ResponseHeader
Response
The ResponseHeader contains information about the response itself. The status tells you the outcome; 0 stands for OK. If you query a nonexistent request handler, you will get a 404 response code as the HTTP response.
The ResponseHeader also includes QTime, which is the query execution time, and an echo of the parameters.
The response section includes the documents that matched your query, in doc subsections or nodes. It includes numFound, which indicates how many documents matched your query, and start, which is used for paging.
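A trimmed sketch of what this looks like in JSON (example values only; the docs array would contain one element per matching document):

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": { "q": "*:*", "wt": "json", "indent": "true" }
  },
  "response": {
    "numFound": 32,
    "start": 0,
    "docs": []
  }
}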
Each document has a set of fields, and each field can be of a different type. In this specific
sample case for the documents we just uploaded, we can see that we have id, sku, name,
manu, cat, features, includes, weight, price, popularity, inStock, and store.
The schema also includes a series of common metadata fields, named specifically to match up
with Solr Cell metadata.
Note: Solr Cell is a functionality that allows sending rich documents, such as Word or PDF documents, directly to Solr for parsing, extraction, and indexing for search. We will not be covering Solr Cell in this book.
To use an analogy: if you are familiar with databases, then a doc would correspond to a row. The name would be the column name, and type is exactly the same thing; it indicates what type of information will be stored in this specific field. Required indicates whether it is mandatory, just like specifying NOT NULL in the structured query language.
The ID in this specific case is just like the primary key, the unique id for the document. It is not absolutely required, but highly recommended. You specify which field you want to be the primary key in the schema in <uniqueKey>.
multiValued=true|false indicates whether you want to hold multiple values within the same field. For example, if a book has multiple authors, all of them would be stored in one field.
Solr supports many different data types, which are included in the Solr runtime packages. If you want to get very technical, they are located in the org.apache.solr.schema package.
Here is the list according to Solr's wiki:
BCDIntField
BCDLongField
BCDStrField
BinaryField
BoolField
ByteField
CollationField
CurrencyField
DateField
DoubleField
EnumField
ExternalFileField
FloatField
ICUCollationField
IntField
LatLonType
LongField
PointType
PreAnalyzedField
RandomSortField
ShortField
SortableDoubleField
SortableFloatField
SortableIntField
SortableLongField
SpatialRecursivePrefixTreeFieldType
StrField
TextField
TrieDateField
TrieDoubleField
TrieField
TrieFloatField
TrieIntField
TrieLongField
UUIDField
It is worth mentioning that there is something called schemaless mode, which pretty much allows you to add data without the need to model it first, as well as dynamic fields. We will not be covering them in this book.
Two things should definitely stand out from the response. First, you will see only relevant results; in this case, three documents matched instead of 32.
More importantly, you can now see facets for manu. We will get into more detail later about facets, but for now please take a look at the list within facet_fields called manu, which holds the list of all manufacturers, sorted from highest occurrence to lowest. It includes the names and a count. Facets are also called navigators, and they allow drilling down into specific result sets. In this example, given that the list is very long, I have included an ellipsis (…) to indicate that there are many more results, mainly with 0 values; you can see this reflected in the figure below.
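For reference, a query like this can also be issued directly against the RESTful interface. This is only a sketch, and the exact query used for the figure may differ, but faceting on manu simply requires turning faceting on and naming the field:

http://localhost:8983/solr/collection1/select?q=drive&facet=true&facet.field=manu&wt=json&indent=true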
Fields
At the present moment we are returning all fields, which may or may not make sense, depending on your specific needs. If you need to provide all fields back to the application, then there is no need to use the fl (field list) input. If you want smaller responses to help with performance, especially when using large documents, just include the list of fields that you want returned in fl. Simply type them in, separated by a space or comma. This also helps with readability while querying for testing.
A very neat and useful trick is to include score as a field, which will tell you the score (or how
relevant a document is) from the result set. Try adding a query to the previous search; I will add
q=drive, include the score field, and execute and analyze the results, as you can see in the fl
field in the following figure.
Results are ranked from highest to lowest score, or most relevant to least relevant. That is, of
course, if no other sorting is applied.
Sorting
To take advantage of the ability to select which fields to display, let's try sort. Sorting a query is a very simple process: simply type in the field you want to sort on, followed by either asc (ascending) or desc (descending), as the next figures demonstrate.
You can also sort on more than one field at a time. To do this, simply specify the field name, the
sort direction, and then separate the groups with a comma. For example: name desc, id asc.
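As a sketch, the same kind of request expressed as a raw URL (field names assume the sample data; %20 is simply the URL encoding of a space):

http://localhost:8983/solr/collection1/select?q=drive&fl=id,name,price,score&sort=price%20desc&wt=json&indent=true

For the multi-field case, the sort parameter would become sort=name%20desc,id%20asc.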
It is worth mentioning deep paging: if you page with very large offsets (e.g., start=1000000), the query becomes very inefficient, as Solr needs to calculate in memory the first 999,999 results before returning yours. In such cases (for example, extracting all records of a large result set), the recommendation is to use cursors.
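A hedged sketch of a cursor-based request (cursors are available since Solr 4.7; the sort must include the uniqueKey field, and each response returns a nextCursorMark value that you pass as cursorMark on the following request):

http://localhost:8983/solr/collection1/select?q=*%3A*&rows=1000&sort=id%20asc&cursorMark=*&wt=json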
2. When you execute the query you will get three results, two of which have a category of
hard drive.
We just ran a query using the q field. Let's now run a new query using fq instead. The intention is to prove how q and fq affect queries in different ways. The bottom line is that q affects ranking, while fq does not. It is extremely important to understand this difference, as using them incorrectly will bring back results that are not as relevant as they should be.
Query
Please reload the Admin UI in both windows so that we can start from clean query pages.
In one of the windows, add the following query in the q input box: drive AND cat:hard drive. Be careful with capitalization, and remember to include the following four fields in the fl section: id name cat score. Your query should look like the following.
Filter Query
In the other window, set q=drive and add cat:hard drive within fq. As before, include the four
fields in the fl section: id name cat score. Your query should match the following:
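Expressed as raw URLs, the two requests look roughly like the following sketch. The quotes around the multi-word category value are assumed here (and URL-encoded); the Admin UI builds equivalent URLs for you:

http://localhost:8983/solr/collection1/select?q=drive%20AND%20cat%3A%22hard%20drive%22&fl=id,name,cat,score&wt=json
http://localhost:8983/solr/collection1/select?q=drive&fq=cat%3A%22hard%20drive%22&fl=id,name,cat,score&wt=json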
When you look at the response results, using fq doesn't affect the score: the documents returned with q=drive plus the fq filter have exactly the same scores as with q=drive alone. The first run took the longest, and the second was quicker. In the third run, using fq, the scores have not changed at all, and Solr simply returned the results already cached from the previous queries.
The scores compare as follows (the last column shows the query with the category added to q, which does change the scores):

Element     q=drive      q=drive & fq=cat:"hard drive"    q=drive AND cat:"hard drive"
6H500F0     0.81656027   0.81656027                       3.035773
SP2514N     0.6804669    0.6804669                        2.9439263
0579B002    0.33681393   --                               --
Summary
In this chapter, you've learned how to load Solr's sample documents and how to run a few simple queries. We've discussed the anatomy of a simple query and response, and finally, proved the difference between q and fq in terms of ranking. In the next chapter, we'll continue by learning how to create a schema for our own documents.
We will start by indexing data for only three fields, and then, over the course of the chapter, incrementally add a few more so we can perform queries with faceting, dates, multi-values, and other features that you would most likely need in your application. Let's take a quick look at our sample data to see what it contains. As you can see, we have things like book title, description, and author. We will be using a CSV file; however, for display, I am currently showing you the data using Excel.
Whenever you want to add fields to your index, you need to tell Solr the name, type, and a couple of other attributes so that it knows what to do with them. In layman's terms, you define the structure of the data of the index.
You do this using the schema.xml file. This file is usually the first one you configure when setting up a new installation. In it you declare your fields, field types, and attributes. You specify how to treat each field when documents are added to or queried from the index, whether they are required or multi-valued, and whether they need to be stored or used for searching. Even though it is not required, you can also declare which field is your primary key for each document (which needs to be unique). One very important thing to remember is that it's not advisable to change the schema after documents have been added to the index, so try to make sure you have everything you need before adding data.
If you look at the schema.xml provided in your download, you'll see it includes the following
sections:
Version
The version number tells Solr how to treat some of the attributes in the schema. The current
version is 1.5 as of Solr 4.10, and you should not change this version in your application.
Type Definitions
Logically there are two kinds of types: simple and complex. Simple types are defined by a set of attributes that describe their behavior. First you have the name, which is required, and then a class that indicates where the type is implemented. An example of a simple type is string, which is defined as:
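In the example schema that ships with Solr, the definition looks roughly like this (your copy may include additional attributes):

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />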
Complex types, besides storing data, include tokenizers and filters, grouped into analyzers, for additional processing. Let's define what each one is used for:
Tokenizer
Tokenizers are responsible for dividing the contents of a field into tokens. Wikipedia defines a
token as: a string of one or more characters that are significant as a group. The process of
forming tokens from an input stream of characters is called tokenization. A token can be a
letter, one word, or multiple words all embedded within a single phrase. How those tokens
emerge depends on the tokenizer we are currently using.
For example, the Standard Tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters. Delimiter characters are discarded, with a couple of exceptions.
Another example is the Lower Case Tokenizer that tokenizes the input stream by delimiting at
non-letters and then converting all letters to lowercase. Whitespace and non-letters are
discarded. A third one is the Letter Tokenizer, which creates tokens from strings of contiguous
letters, discarding all non-letter characters. And the list goes on and on.
Filter
A filter consumes input and produces a stream of tokens. It basically looks at each token in the
stream sequentially and decides whether to pass it along, replace it, or discard it. It can also do
more complex analysis by looking ahead and considering multiple tokens at once, even though
this is not very common.
Filters are chained; therefore, the order affects the outcome significantly. In a typical scenario,
general filters are used first, while specialized ones are left at the end of the chain.
Analyzers
Field analyzers are in charge of examining the text of fields and producing an output stream. In
simpler terms, they are a logical group of multiple operations made up of at least one (but
potentially multiple) tokenizers and filters. It is possible to specify which analyzer should be used
at query time or at index time.
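Putting the three together, here is an abbreviated version of the text_general type from the example schema, showing a tokenizer plus a chain of filters grouped into index-time and query-time analyzers. Your copy may include additional filters, such as a synonym filter at query time:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>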
Field Definitions
In this section, you specify which fields will make up your index. For example, if you wanted to index and search over the books in Syncfusion's Succinctly series or Pluralsight's online trainings, then you could specify the following fields:
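The following is a hypothetical sketch of what such field definitions might look like; the field names are invented for illustration, and the types assume the field types defined earlier in the schema:

<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="publisher" type="string" indexed="true" stored="true"/>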
A field definition has a name, a type, and multiple attributes that tell Solr how to manage each
specific field. These are known as Static Fields.
Solr first looks for static definitions, and if none are found, it tries to find a match in dynamic
fields. Dynamic fields are not covered in this book.
Copy Fields
You might want to interpret some document fields in more than one way. For this purpose, Solr has a way of performing automatic field copying. To do this, you specify in the copyField tag the source, the destination (dest), and optionally a maximum size (maxChars) for the field you wish to copy. Multiple fields can easily be copied into a single destination field using this functionality.
Copy fields can also be specified using patterns; for example, source="*_i" will copy all fields that end in _i to a single destination field.
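As a sketch (field names assume the sample schema, where text is a catch-all destination field):

<copyField source="author" dest="text"/>
<copyField source="*_i" dest="text" maxChars="3000"/>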
The way to use this table is to look for the specific scenario that you want for your field, and determine the attributes. Let's say you want a field where you can search, sort, and retrieve contents.
This means there are three scenarios: Search within field, Retrieve contents, and Sort on field. Looking for the required attributes in the columns, you would need to set indexed=true, stored=true, and multiValued=false.
Succinctly Schema.xml
It's time to make Solr our own with our data. We will take our sample data, which can be found on GitHub in the following repository: https://github.com/xaviermorera/solr-succinctly.git.
The repository contains:
The source files for the exercises, located in the assets folder. They are under 50 KB in size, so you can download them separately if required.
A finished example, which you may not need if you follow the instructions provided in this book.
Understanding the documents that we will index in this demo is easy. In the real world, it can be trickier.
Open the command line and navigate to where we unzipped Solr earlier. It should be in C:\solr-succinctly\succinctly\solr. This is where collection1 is located.
In this directory, you will find the collections that are available in the current installation. Right now, we only have collection1. We need to clone collection1, so please copy and paste it, and rename the new collection to succinctlybooks.
Now go into the succinctlybooks folder and open core.properties. Here is where you specify the name of the core, which is also called the collection. It should look like this:
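After renaming, the file should contain the new core name. This is a one-line sketch; your copy of core.properties may contain additional properties:

name=succinctlybooks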
Now restart your Solr and go to the Core Selector. Succinctlybooks should be displayed.
If you forget to change the collection name within core.properties and try to restart, you will get an error telling you that the collection already exists. The error displayed in the console will be similar to the following:
2972 [main] ERROR org.apache.solr.core.SolrCore
null:org.apache.solr.common.SolrException: Found multiple cores with the name
[collection1], with instancedirs [C:\solr-succinctly\succinctly\solr\collection1\]
and [C:\solr-succinctly\succinctly\solr\succinctlybooks\]
Quick Cleanup
It is not a requirement to clear the index and comment out the existing fields; however, given
that we have data in our index, we need to do it to avoid errors on fields we remove and types
we change.
The following two steps will show you how to ensure we clean out the redundant data.
Step 1: Clear the index
The collection that we just copied came with the sample data we indexed recently. So where does Solr store the index data? Inside the current collection, in a folder called index within the data folder. If you ever forget, just open the Overview section in the Admin UI, where you can see the current working directory (CWD), instance location, data, and index.
The next step is to clear the index, as we will be modifying the fields before creating our
new index. Please stop Solr first by pressing Ctrl+C in the console window where you started
Solr, open Windows Explorer at your Lucene index location, select all files within the index folder, and delete them.
When you restart Solr, your index will have 0 documents. We now have an empty index to start
with.
It is necessary to point out that if you do not delete the index, the old sample documents remain.
Some of the original sample keys look like MA147LL/A, so if you were to change the uniqueKey type
from string to int, you would get an error when you restart.
Soon, we will be changing our uniqueKey's name, but not its type. If you insist on using int
as the type for bookid instead of string, you will get the error I just showed you,
even if you have a clean index. Figure 65 shows the error you will run into if you do not follow
the instructions.
I'll leave it to you to play around and figure out what the elevate.xml file is used for; it is
one of the two potential culprits of this error.
First, look for the definition of id and comment it out, all the way down to store, as shown in the
following image. Do it with an XML comment, which starts with <!-- and ends with -->.
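For example, the commented-out id definition should look something like this:
<!--
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
-->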
Now let's look for the Solr Cell fields, and comment out from title all the way to links. There are a
few more fields that you should comment out: content, manu_exact, and payloads.
Notice I did not comment out text, as it is a catchall field implemented via copyFields. We will
get to it soon.
Leave dynamicFields and uniqueKey as they are; we will get to them soon.
Our succinctlybooks collection will use the following fields:
Bookid: The book ID is just a number that will serve its purpose as a unique key.
Title: The title of the book. This is the text that will be searched, stored, and retrieved.
Description: A slightly larger text, with the description of the book.
Author: The Succinctly series usually includes only one author per book; however, it is
potentially multivalued, so we will declare it as such. We will use this one as a facet.
Tags: Another multivalued field; we'll also use it as a facet.
Open the schema.xml file for the succinctlybooks collection in Notepad++ or any other text
editor. In case you forgot or skipped the previous exercises, it is located here: C:\solr-succinctly\succinctly\solr\succinctlybooks\conf.
It is time to define our static fields. The fields should be located in the same section as the
sample data fields that we just commented out. Please look for the id field definition, and add
them at the same level, starting with bookid.
Bookid will be our unique key. We declare a field with this name, and add the type, which in this
case is string. If you want, it can also be an int; it does not really make a big difference.
Given that it is a uniquekey, it needs to be indexed to retrieve a specific document; it is
required, and unique keys cannot be multivalued. Remember Field Properties by Use Case?
Also, please be mindful of capitalization; for example, multiValued has an upper-case V.
We changed the name of the unique key from id to bookid. Look for the uniqueKey tag and
change it accordingly.
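After the change, the tag should read:
<uniqueKey>bookid</uniqueKey>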
And now we define the rest of the static fields. You should end up with some entries in the
schema like this:
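A sketch of what those entries might look like, based on the descriptions above:
<field name="bookid" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>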
You may have noticed by now that title and description are of type text_general, while author
and tags are of type string. As you might have guessed, these are different data types in the
Solr landscape.
String is defined as a simple type with no tokenization. That is, it stores a word or sentence as
an exact string, as there are no analyzers. It is useful for exact matches, for example when faceting.
On the other hand, the type definition of text_general is more complex, including query-time and
index-time analyzers that perform tokenization and secondary processing like lowercasing.
It's useful for all scenarios where we want to match part of a sentence. If you defined title as a
string and then searched for jquery, you would not find jQuery Succinctly; you would need
to query for the exact string. This is definitely not what we want.
We will be creating facets for tags and authors, which means string is the correct type to use
for these. Will we be able to find them if we only type the first or last name? Let's wait and
see.
Summary
In this chapter, we started looking at the schema.xml file. We found out how important this file is
to Solr, and we started editing it to define our own collection containing information about the
Succinctly e-book series.
In the next chapter, we'll move on to the next stage in our game plan and cover the subject of
indexing.
Chapter 6 Indexing
Making Your Content Searchable
When you hear the word indexing in the context of Solr (or other search engines, for that
matter), it basically means taking content, tokenizing it, modifying it if necessary, adding it to the
index, and then making it searchable. Solr retrieves results very fast because it searches an
inverted index, instead of searching text directly.
But what exactly is an inverted index? It is a data structure that stores a mapping from content,
like words or numbers, to its location in a set of documents. Because of this, searching
becomes very fast, as the price is paid at indexing time instead of at query time. Another way of
referring to an inverted index is as a postings file or inverted file. So if you hear any of these
three terms, they mean the same thing.
During indexing, Solr inverts a page-centric data structure to a keyword-centric data structure. A
word can be found in many pages. Solr stores this index in a directory called index in the data
directory. There are many ways of indexing your content; in this chapter, I'll introduce you to a
couple of them.
Indexing is nothing new; humanity has been doing it for centuries! It is something that we do
all the time in our busy lives. The index at the back of a book, for example, or a TV guide telling
you which programs are on your TV stations, are both perfect examples of indexing in action.
You use them by quickly scanning a predefined list, looking for some meaningful keyword or
topic. Once the keyword or topic is found, the entry will contain some kind of a pointer (for
example, a page number) that allows you to go straight to the information you seek.
Indexing Techniques
We've already indexed some data using the post.jar tool, but there are many more options:
You can use the Solr Cell framework, built on Apache Tika, for binary files like PDF,
Word, Excel, and more.
It is also possible to upload XML files by sending them via HTTP requests.
The DataImportHandler allows accessing a database to retrieve data, but it is not
limited to databases. The DataImportHandler can also read from RSS feeds or many
other data sources.
You can also build your own custom Java application via Solr's Java client API, SolrJ.
And for those of you who love .NET like I do, there is SolrNet.
As I mentioned before, there are other content processing pipeline tools, like the Search
Technologies ASPIRE post-to-Solr tool.
And finally, you can build your own on top of Solr's RESTful interface.
Now from the assets folder, using Windows Explorer, please copy exercise-1-succinctly-schema.csv and exercise-1-succinctly-schema-index.bat to our exampledocs folder in C:\solr-succinctly\succinctly\exampledocs.
You might be wondering why we are copying the CSV and BAT files to exampledocs. This is
because that is where post.jar is located, and even though you can set the correct paths, it is
easier this way.
The next step is to index the files. For this purpose, we will open a command prompt
and navigate to the exampledocs folder. You can just run the exercise-1-succinctly-schema-index.bat file, which will execute the following command:
java -Durl=http://localhost:8983/solr/succinctlybooks/update -Dtype=text/csv -jar post.jar "exercise-1-succinctly-schema.csv"
Read the response in the command window. If all went well, it will report "1 files indexed."
Excellent! Let's run a query now in succinctlybooks for *:*. You can do it from the Admin UI.
If you do not get this response, please make sure that the exercise files are within the
exampledocs folder, right next to post.jar. Also, run post.jar from exampledocs to
confirm that it is able to execute.
Everything looks great. We have 50 documents and our data seems OK. Let's analyze one
record:
Something doesn't look right. Can you pinpoint what it is? Look at tags. You probably noticed it
by now, but let's make it a bit more obvious. Within tags, there is only one entry, git|source-control. It is a multivalued field, but it is treating git and source-control as part of the same
tag. For this example to be correct, they should be two separate values.
To review further, please click the link below for the response, with xml as wt (response writer):
http://localhost:8983/solr/succinctlybooks/select?q=*%3A*&wt=xml&indent=true
Note: If you have modified Solrs location, please use your current location.
You should be able to see from the previous figure that the tags field has been indexed as a
single value, not multiple values, even though we declared it as multiValued. The reason this
happened is very simple: we did not tell post.jar which field we want to split, and which character
is the separator.
We can easily fix this by running the following command:
java -Durl="http://localhost:8983/solr/succinctlybooks/update?f.tags.split=true&f.tags.separator=|" -Dtype=text/csv -jar post.jar "exercise-1-succinctly-schema.csv"
I've also included the fix in the following file in our assets folder: exercise-1-succinctly-schema-index-fixseparator.bat.
Once we've made this change, try re-running the previous queries; you should see a difference.
Whenever you provide multiple values for a multivalued field as a single delimited string, you must
tell Solr to split the input up, using the following parameters (replace tags with your field name):
f.tags.split=true&f.tags.separator=<separator character here>
Now select Documents and type the following text into the Document(s) input field:
{"bookid":"51","title":"Solr Succinctly","description":"Solr Succinctly gets you
started in the enterprise search world.","author":"Xavier
Morera","tags":"enterprise-search"}
Click Submit Document, and you should get a success status in the right-hand section. Leave
this window open, as we will use it in the upcoming two sections, and open a new tab in the
same location to continue testing.
If you were to try and run this as a singular query all on its own as follows:
http://localhost:8983/solr/succinctlybooks/select?q=Solr&wt=xml&indent=true
You may be surprised to find that you don't get any results. I'll leave the explanation of this until
a little bit later; for now, I want to show you a little more about how Solr searches its indexes.
If you look just below the document input field, you'll see an input parameter called Overwrite;
initially this is set to true. Its purpose with this default setting is to ensure that submitting a
document with an existing ID updates that document rather than inserting a new record. Set it to
false, try changing the author name again, and you should find that it now adds a new record instead:
Partial Updates
Partial updates are a feature people have been requesting for years in Solr; however, it was not
until Solr 4.0 that they became available. Put simply, a partial update involves updating a single
field within a document without the need to re-index the full document. This may not sound like
much, but if you have big documents (and a lot of them) that require a huge amount of
processing just for a simple single-field change, you can quickly see how much processing time
would be wasted. This, along with the sheer number of documents, can make a big difference.
Let me share a story that happened to me a few years ago. I was working on a project
for a patent searching application. It had a double-digit-terabyte index, made up of about
96 million patents, containing every patent application and grant filed with all patent authorities
worldwide. Document sizes ranged from a few bytes to many megabytes; we had thousands of
fields, and indexing a document consumed a lot of processing power due to field
normalization and many other required operations.
Each patent entry has one or many classification codes that basically specify the content of
the patent; these codes used USPC, ECLA, and many other schemes, depending on the owning
authority.
Starting January 1, 2013, the Cooperative Patent Classification (CPC) came into use as the official
new classification, a scheme jointly developed by the United States Patent and Trademark
Office and the European Patent Office.
This meant that all patents suddenly needed to be reclassified; the upshot was
basically to add a new field for the new CPC classification code. In technical terms, this wasn't a
huge task by any stretch of the imagination. We received a CSV file that contained the patent
canonical number and the CPC codes, so we knew exactly what needed to be matched to
which records. All patents needed to be searchable with the new CPC code, and this is where
our problems began. We did not have the ability to perform partial updates, meaning we had to
fully reprocess 80 million+ documents for every single update, a task that took weeks to
do.
A partial update could have reduced the time needed to a couple of days. The moral of
the story is simple: use partial updates where possible, and you'll quickly realize how invaluable
they are.
Let's do a partial update now. If you recall, here is what we indexed for my book document:
{"bookid":"51","title":"Solr Succinctly","description":"Solr Succinctly gets you
started in the enterprise search world.","author":"Xavier
Morera","tags":"enterprise-search"}
Leave only bookid and author, changing author to Xavier MT, and click Submit
Document.
{"bookid":"51","author":"Xavier MT"}
Now run the query to retrieve this document. What happened? Basically, when you did the
update, it replaced the whole document with only the fields you specified. A full update is not what we
need; we need a partial update.
Let's try again. Start by resetting the document to its original state. Run the query to confirm.
{"bookid":"51","title":"Solr Succinctly","description":"Solr Succinctly gets you
started in the enterprise search world.","author":"Xavier
Morera","tags":"enterprise-search"}
Once you have reset things, try to submit a partial update again. Specify which field you want to
update by using the set keyword within {}, as follows:
{"bookid":"51","author":{"set":"Xavier MT"}}
Run the query again for bookid. Now you will have a partial update on author.
One last thing you need to be aware of: for a partial update to work correctly, you must have all
your fields set to stored=true. This can be an issue if you want to manage your index size
by not storing all fields, but if a field is not stored, its value will be lost when you perform a
partial update on the document.
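Besides set, Solr's atomic updates also support add, which appends a value to a multivalued field, and inc, which increments a numeric field. For example, a sketch that appends a tag to our book:
{"bookid":"51","tags":{"add":"solr"}}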
Deleting Data
Now that we know how to insert and update documents, the next step is to learn how to delete
documents. You can delete documents by ID and by query. For example, to delete this book
from the index, you could use either of the two following ways:
The first is to delete by ID. This is the command that will tell Solr which ID it needs to delete:
<delete>
<id>51</id>
</delete>
The response obtained should look like the following. A status of 0 means no errors were
returned.
However, it does not indicate the number of records, or whether any records were actually deleted. For this
purpose, you would need to run a query to confirm. From the Admin UI, please select the
succinctlybooks core, click on the Query section, enter bookid:51 in q, and
execute.
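The second way is to delete by query. For example, a sketch that deletes the same book by querying on its ID:
<delete>
<query>bookid:51</query>
</delete>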
By this point, you should be able to see that it's possible to delete the entire index, simply by
using the following URL:
http://localhost:8983/solr/succinctlybooks/update?stream.body=<delete><query>*:*</query></delete>&commit=true
Deleting with a wildcard query is far more practical than deleting the index files manually, which
is not always possible, as Solr may have the files locked.
It is worth noting that you need to set commit to true, or else the deletion won't be committed to the index.
If you are deleting multiple documents, it is preferable not to commit on every single
operation.
Also, you can delete documents that match multiple fields. Any query that you can build for
searching can also be used for deleting. If you're using SolrNet or SolrJ, you can make a call
to their API, for example deleteByQuery("*:*"). We will not be covering the SolrJ
or SolrNet APIs in this book, but I believe it is worth mentioning.
Using one or both of the methods we learned earlier, perform a query for a book with an ID
equal to 52; you can use the following URL, or enter it into the Admin UI query input:
http://localhost:8983/solr/succinctlybooks/select?q=bookid%3A52&wt=json&indent=true
The next step is to add a document in Solr XML format, in the same way as we did with the CSV files. For ease of
use, you'll find a batch file along with the XML; if you run it, the document will be indexed.
If you're not on Windows, or cannot run batch files, the command you need is as follows:
java -Dauto -Durl=http://localhost:8983/solr/succinctlybooks/update -jar post.jar
"exercise-2-solr-xml.xml"
Notice how I used -Dauto instead of specifying the file type. The tool is able to process
multiple extensions, as shown in the command-line response.
Now run the query again for bookid 52. It will return one document.
If you get this far, give yourself a pat on the back; you're well on your way to understanding
how Solr works and creating your own search indexes.
Using cURL
cURL is a command-line tool for transferring data using various protocols. It is typically run
from a shell, and it is simple and easy to use. When it comes to
working with Solr, cURL is your friend: it is easy to use, and
you can easily post binary files. A full cURL tutorial is beyond the scope of this book, but I will
show you a quick demo of how it can be used. Also, if you are in an environment where you
can't use cURL, you can achieve similar results using tools like Chrome's Postman.
To get started, you need to download cURL, which is very simple to install.
You actually use cURL from the command line. It allows you to post information and even post
files. It lets you add, update, and delete documents.
To invoke it, type curl in the command line, followed by the location of your update handler. You
also need to include which core you're actually committing to.
Regarding parameters, I am passing commit=true, which means the information should
be committed to the index once I issue the command. Then I'm passing -H for the header, with
a content type of text/xml.
Next is the command for Solr. In this case, I'm doing an add command, which is exactly the
same as in the Solr XML format, with the fields that I want included in this document.
The cURL command to complete all of these operations is as follows:
curl http://localhost:8983/solr/succinctlybooks/update?commit=true -H "Content-Type: text/xml" --data-binary "<add><doc><field name=\"bookid\">53</field><field name=\"title\">Scrum Succinctly</field><field name=\"author\">Xavier Morera</field><field name=\"tags\">scrum</field></doc></add>"
To make your life easier, I have also included exercise-3-curl.bat; copy it to the exampledocs
folder and run it. You must ensure that your system can find and run the cURL
program for the batch file to work.
You should be able to see that the status of the previous operation is 0, which, as you now
know, means no errors. If you subsequently run a query for a book with ID = 53, you should see
one document appear within your results.
The document I indexed does not have all fields. Only bookid is required, but if you
copy-pasted the field definitions and left required=true on other fields, Solr will return an
exception message like this:
<str name="msg">[doc=53] missing required field: description</str>
If this scenario occurs, please make sure that only bookid contains a required=true attribute
within Schema.xml.
You can also issue any other command you wish. For example, a delete command would look
like this:
curl http://localhost:8983/solr/succinctlybooks/update?commit=true -H "Content-Type: text/xml" --data-binary "<delete><query>courseid:getting-started-enterprise-search-apache-solr</query></delete>"
Fiddler
If you are used to web development, you are probably aware of Fiddler. If not, Fiddler is a
debugging proxy that logs all HTTP traffic on your computer. It's an excellent tool if you have
problems, or if you want to debug requests as you are working with Solr. Use it to inspect,
reissue, and compose requests. To get Fiddler, visit http://getfiddler.com.
Once it is installed, open Fiddler. It starts monitoring all traffic on your computer, so I
recommend you set a filter so that it only picks up local requests. To do so:
Go to Filters
Select Show only Intranet Hosts
Besides monitoring, Fiddler can also issue requests. Let's learn how to issue one.
Go to the Composer tab. You have the option of specifying which verb you want to use, such
as GET or POST. In this case, I'm going to do a POST to the update handler, specifically to the
succinctlybooks core. This is the URL:
http://localhost:8983/solr/succinctlybooks/update?wt=json
The next step is to add the headers. Don't worry about the content length; Fiddler adds it
automatically.
User-Agent: Fiddler
Content-Type: application/json
Host: localhost:8983
Content-Length: 241
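The request body goes in the lower text box of the Composer. A minimal sketch of a JSON body (the document values are illustrative, and bookid 54 is hypothetical):
[{"bookid":"54","title":"Fiddler Succinctly","description":"Indexed from Fiddler.","author":"Xavier Morera","tags":"tools"}]
You may also want to add commit=true to the URL so the document becomes visible immediately.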
Your Composer tab should appear as shown in Figure 101. Click Execute.
As soon as the request is issued, Fiddler will log it in the left panel. Result 200 means all went
well. If this is your first time using Fiddler, make a mistake on purpose to see an HTTP 500
response.
Now double-click on the request, and Fiddler will open the details.
Re-indexing in Solr
When you are running Solr in either your production or development environments, at some
point you'll need to re-index. One scenario that requires re-indexing is a schema
change due to a new field being added. While it's true that you can make partial updates, there
are some cases that need full updates, and performing a re-index is the only way to go.
Depending on the type of schema change, you may need to delete all your documents and then
start re-indexing again from scratch. In this case, it's advantageous to have a full secondary set
of Solr servers so you don't lose search capabilities while re-indexing takes place. The point is
that while you are re-indexing because of a schema change, you need to point your application
to an exact copy of the original Solr index, and once re-indexing is complete, you point your
application to the new Solr index.
What exactly does re-indexing mean? Basically, it's the process of indexing every single
document again, just as you did when you originally added them to the index.
In some cases, re-indexing can be painfully slow, because accessing the original data sources
is not very efficient. If you run into a scenario like this, I suggest you set up an intermediate
store, or another Solr that serves as a cache to help you re-index in a much quicker way.
Summary
In this chapter, you've learned how to index data, which is one of the most basic operations in
Solr; it's how you insert data into the search engine. You learned how to index by using the
included post.jar, a command-line tool called cURL, and Fiddler. You also learned how to delete
and update data. Regarding updates, we learned the difference between full and partial
updates, a feature that not all search engines have.
And now it is time to learn how to configure Solr's core via Solrconfig.xml.
Chapter 7 SolrConfig.Xml
Configuring Solr
Solrconfig.xml is the main configuration file used to configure a Solr core. It contains multiple
sections of XML statements that set configuration values for a given collection, covering
important features like caching, event listeners, request handlers,
request dispatchers, highlighter plugin configuration, the data directory location, and the items
available in the Admin UI.
Request Handlers
One particularly important feature that can be configured is the request handler. A request
handler is in charge of accepting an HTTP request, performing the search, and then returning
the results to the calling client.
Request handlers are selected using the qt parameter (or by their request path), and they define
the logic executed for any request passed to them.
You can, for example, include filters or facets. You can also make these changes in two modes.
One way is to append, which adds them to the request without the user asking for them; the other
is to add an invariant. If you use an invariant, it will be added to the request and the
user cannot modify it. Invariants are very useful for scoping or even for security.
Multiple request handlers can be specified in the same Solrconfig.xml, and you can have
separate named request handlers for different scenarios.
There are three types of query parameters in a request handler:
Defaults: Provides default parameter values that will be used if a value is not specified at
request time.
Appends: Provides parameter values that will be used in addition to any values
specified at request time or as defaults.
Invariants: Provides parameter values that will be used regardless of any values provided
at request time. It is a way of letting Solr lock down the options available to Solr clients. Any
parameter values specified here are used regardless of what values may be specified
in the query, the defaults, or the appends parameters.
The default request handler in a Solr installation is /select, which should by now be very
familiar to you, as this is the one we've been using for each example so far in this book.
If you open your Solrconfig.xml file and look for the handler, you will see that it basically has
three defaults: the echoParams, rows, and df parameters. As previously mentioned, a
requestHandler can have multiple other parameters defined via appends or invariants to control
how a query is handled.
If we uncomment the included sample sections of the /select request handler, we should see
something that looks like the following:
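A sketch of what the uncommented handler might look like (abbreviated; your shipped Solrconfig.xml may differ slightly):
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
<lst name="appends">
<str name="fq">inStock:true</str>
</lst>
</requestHandler>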
As you can see, this example is explicitly stating that any result to be returned has to be in
stock. This is done by adding the filter query inStock:true. Don't make these uncommenting
changes yourself just yet; we're going to build our own handler in just a moment.
If you try to query a /books request handler before it has been created, you should receive a 404 error,
as the following figure shows; this is to be expected, and indicates that you are in fact safe to add
the new handler.
Figure 108: 404 error returned by Solr to show that the 'books' handler does not yet exist
Now open the Solrconfig.xml located in your solr/succinctlybooks/conf folder. Please look
for the /select request handler, copy it, paste the copy below it, rename it to /books, and remove
all commented-out lines. Don't make any other changes just yet. It should look something like this:
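A sketch of the resulting handler (abbreviated):
<requestHandler name="/books" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>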
The next step is to navigate to the Core Admin and click Reload. By default, collection1 will be
selected; please make sure you select succinctlybooks. Don't navigate away just yet; keep
looking at the Reload button. It needs to turn green for a few seconds to indicate that the reload was
successful.
Run the query again, but make sure you are using /books instead of the /select request
handler, as shown in red:
http://localhost:8983/solr/succinctlybooks/books?q=*%3A*&wt=json&indent=true
This time, it will most definitely work, and you have 53 results, the same 53 as before. Let's make
a couple of changes, starting with a very simple one.
Tip: Every time you make a change to Solrconfig.xml it is required that you reload the core.
The first change is to reduce the rows default from 10 to 5. If we run the query before and after,
we'll see that before we got 10 results, and afterwards we only get five:
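In the defaults section of the /books handler, the change is a single line:
<int name="rows">5</int>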
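The second change is to add an appends section with a filter query on the author field, so that only my books are returned. A minimal sketch (the value is illustrative; it mirrors the filter query for author that we remove again in the Facets section):
<lst name="appends">
<str name="fq">author:"Xavier Morera"</str>
</lst>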
Reload the core, and then run a query for all documents. You will get only three results.
If you want a more specific query, try q=description:you. In this specific query, if you use
/select, you will get two results. One of them is my book, and the other is a book by Cody
Lindley.
If you do the same using our '/books' handler, however, you should only get one result.
Response Fields
Another aspect that you might want to control is which fields are returned in the response for
your particular request handler. This is particularly useful when you have a large number of
fields. In one of my recent projects, we had about 200 fields per document, of which only about
nine needed to be returned on each query for displaying results. So why return them all?
Selecting which fields should be returned is very easy. Basically, within defaults, just add one fl
entry and enumerate which fields you want returned.
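For example, a sketch that returns only three of our fields (the list is illustrative):
<str name="fl">bookid,title,author</str>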
Now let's test. First, run a query so you have a baseline. Next, reload the core. Finally, in a
separate window, run the same query again. The difference should be clearly visible.
First:
After reloading:
Facets
Our final small modification will be to return facets. If you recall from previous chapters, faceting
is the arrangement of search results into categories based on indexed terms, along with counts
that indicate the occurrence of each term. It makes it easier for users to drill down into complex
result sets and categorize the information better.
facet.query is an arbitrary query used to generate a facet count. facet.field specifies which field
Solr should treat as a facet. facet.prefix indicates that only terms that begin with the given prefix
should be used as facet values.
Let's modify our /books request handler within Solrconfig.xml to return facets, and in the
process, we will also remove the filter query for author so that we get the entire result set. The
steps are simple:
1. Comment out the appends section.
2. Add an invariants section with facet=true to enable faceting, and then specify two different
facet fields, author and tags. The following XML code should be added to your config:
<lst name="invariants">
<str name="facet">true</str>
<str name="facet.field">author</str>
<str name="facet.field">tags</str>
</lst>
Reload the core and run a query for all records with all default values, then scroll down within the
response. Here is what you should be looking at:
The facet_counts section includes the resulting facets. In our case, we requested two
facet fields, author and tags. As you can see, they are ordered from the highest number of
occurrences to the lowest. Way to go, Ryan, with 6. My friend Peter has three books at the
time of writing, but my sources tell me he will be tied with Ryan pretty soon!
Further down, we can see the tags, which, by the way, I made up for this exercise. They
could be refined further for more realistic results.
Finally, we did not include any facet queries, facet dates, facet ranges, or facet intervals.
You can also facet on multivalued fields, like tags, and you can use facet.mincount to
avoid showing values below a certain number of hits.
Grouping values into buckets is also possible with facets. In this case we do not have the number of pages per
Succinctly series e-book, but if we did, we could dynamically create ranges by using
facets like the following:
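A sketch using a hypothetical numeric pages field (the field and values are illustrative; range facets require a numeric or date field):
<str name="facet.range">pages</str>
<str name="facet.range.start">0</str>
<str name="facet.range.end">500</str>
<str name="facet.range.gap">100</str>
With those parameters, Solr would count books in buckets of 100 pages each. To wrap up, here are a few general tips to keep in mind when working with Solrconfig.xml: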
Focus on minimalism. Just as you should not include things you don't need in schema.xml,
the same applies here: only include those things that you need or are
planning to use in the near future. Remember YAGNI (you aren't gonna need it).
Don't forget caching. Caching is a great tool to increase performance, especially under
heavy loads, but it's not always appropriate.
Avoid over-warming. When Solr starts, you may define some common warming queries.
Don't define too many; the more you define, the longer startup takes.
Don't define too many handlers. Defining a handler for each specific
scenario may over-complicate your deployment, and will make maintenance an
absolute nightmare.
Remember to review the default configuration. The out-of-the-box configuration is not
always the best thing for production, so remember to review it before
deployment.
Make sure you upgrade. Solr moves at an incredible pace, so try to keep it up to date, or
you might be missing out on some important or interesting features.
Summary
In this section we learned how Solrconfig.xml is the file used to configure a Solr core. We
learned how to create a request handler, and then to configure it using defaults, appends, and
invariants. Some of the possible configurations involved specifying facets, returned rows, and
response fields.
We also learned that every time a change is made in Solrconfig.xml, the core needs to be
reloaded from the Admin UI.
Now it's time to learn about searching and relevancy with Solr.
Relevance
Relevance is the degree to which a query result satisfies the user who is searching for
information. It means returning what the user wants or needs. There are basically two important
concepts we need to consider when talking about relevance: precision and recall.
Precision
Precision is the percentage of documents in the result set that are relevant to the initial query.
That is, how many of the documents contained the results the user was actually looking for. To
be clear, we're not talking about exact matches here, either; if you're looking for "red cars,"
matches containing "cars" may still be valid, but matches containing "red paint" would not.
Recall
Recall is the percentage of relevant results returned out of all the relevant results in the system.
That is, whether the user got all the documents that actually matched his or her query. Initially,
it is a little difficult to understand from the definition alone, but it becomes a lot simpler with an
example:
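A minimal illustration (the numbers are made up): suppose the collection contains 10 documents that are truly relevant to a query, and the engine returns 8 results, of which 6 are relevant. Precision is 6/8, or 75 percent, while recall is 6/10, or 60 percent.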
In real life the scenarios are much more complex; search engines have from thousands to
millions of documents, so returning the relevant documents can be difficult.
Obtaining perfect recall is trivial: you simply return every document in the collection for every
query, right? But that is a problem; if you return every document in the collection, the result is not
very useful for the user.
And here is where relevancy comes in. Relevancy measures how many of the documents returned by
the search engine are really relevant to your query. To use a real-life example, imagine you
run a query in Google, and the first page does not return any useful results. None of the results
are relevant to your query.
There are four scenarios that you need to consider:
True negatives: Results that have nothing at all to do with satisfying the presented query,
and that should never appear in a result set. If one of these documents does show up in your
results, your search application is not doing its job correctly at all.
False positives: A false positive is when a query matches something in the database,
but that match does not relate to the context of the search. Taking our precision example
from the previous section, "red paint" would be a false positive; the match occurred due
to the use of the term "red," but the context of "paint" does not relate to a context
describing cars.
False negatives: As the name suggests, the complete opposite of a false positive. A
false negative occurs when a document is relevant to the query, but is not returned by the search
application. In our previous example, "red car paint" might get rejected on the grounds
that its context applies only to paint, and not to a car that's painted red, which is incorrect
if our search criteria involve "red cars." When designing your search application, you
never want to produce results like this.
True positives: This is the end game, what you're aiming for every time. These are
true, context-relevant search results that either satisfy the query, or make it easy to see
how the query can be reorganized in order to do better.
Accuracy
This leads to accuracy, which is a tradeoff. In some cases, if you get high precision, you might
get very little recall. That is, you might get documents that are extremely relevant to your query,
but you might get very few of them. This ultimately results in missing documents that potentially
include relevant, but less precise, information for the end user.
At the other end of the spectrum, we have large recall, but with much lower precision. The trick
to getting accuracy right is getting the correct balance between these two ends.
Context
You need to take into account the categories for each one of the contexts. For example, say you
are providing search for a development company, and your users include IT pros and developers. The IT
pros might like to get results that are more related to servers and network technologies, while
developers might want to look into web development, yet they might be using the same
keywords.
Second Page?
It is also important to consider the relevance ordering of the documents. Users rarely go beyond the
second page of results, meaning the most relevant results need to be on the first page, with the
second page containing the not-so-relevant results.
Document Age
In some cases, document age is incredibly important. For example, if you were searching for
current news in a newspaper, you only want the most up-to-date results.
Security
A lot of search engineers never give this a second thought, but security is hugely important. I
worked on a project for Microsoft a number of years ago where, as part of a security initiative,
we had to perform an analysis of approximately 300,000 SharePoint sites. The goal here was to
find and prevent unintentional access to confidential company information that the search
engine may have returned by mistake. Document security must always be a number one
priority.
Speed
Finally, we get to the issue of speed, and the bottom line is this: people expect search results
pretty much instantly. A few milliseconds, or maybe even one second, is tolerable for most
people. Beyond that, you're going to see complaints, and lots of them.
I've seen exceptions where queries could take minutes, but in those cases the previous process used to take
hours to find the relevant information. This is generally a specialist scenario, where minutes are
a massive savings of time in the bigger scheme of things.
As you can see in the following figure, we get a very precise and exact match to our query, with
only one result.
tf (Term Frequency)
Term frequency is the frequency with which a term appears in a document or field. The higher
the term frequency, the higher the document score.
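For context, here is a sketch of how tf fits into the bigger picture. In Lucene's classic TF-IDF similarity (the scoring model Solr 4.x uses by default), the score of a document for a query is roughly the sum, over the query terms, of tf(t,d) x idf(t)^2 x boost(t) x norm(t,d), where idf rewards terms that are rare across the index and norm includes the field-length penalty (fieldNorm) that we will see in action shortly. This is a simplification; the full formula also includes coordination and query-normalization factors.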
Query Syntax
The DisMax query parser is a popular alternative to Solr's default parser, the standard Lucene
query parser. It's designed to process simple phrases entered by users, and to search for terms
across several fields using different weights or boosts. DisMax is designed to be more Google-like,
but with the advantage of working with the highly structured data that resides within Solr.
DisMax stands for Maximum Disjunction, and a DisMax query is defined as follows:
A query that generates the union of documents produced by its sub-queries, and that scores
each document with the maximum score for that document as produced by any sub-query, plus
a tie-breaking increment for any additional matching sub-queries.
That is a bit of a mouthful; just know that the DisMax query parser was designed to be easy to
use and to accept input with less chance of an error.
Let's review some of the possibilities regarding search.
For example, say I want to look for all books that have database as part of the description. I
would run a query from the Admin UI for description:database using the /books request
handler as follows:
Figure 124: Our example query as it might be viewed using the Admin UI in Solr
The query should give you four results, which you can retrieve to a page of their own using the
following URL:
http://localhost:8983/solr/succinctlybooks/books?q=description%3Adatabase&wt=json&i
ndent=true
There is something you might notice, however: take a close look at the results returned by the
URL, specifically at the score that's returned for each.
The scores returned range from 1.05 to 0.63, which would be fine for a general search using wildcards
over several fields, but in our case, we're searching for a specific word, in a specific field, that we
know occurs exactly once in each result. Shouldn't the score in this case be equal for each
result?
Let's test this on a different field and see what happens. This time, we'll search the authors'
names for occurrences of my name, using author:"Xavier Morera". Enter the following URL
into your browser, making sure to adjust the domain name and port number where needed:
http://localhost:8983/solr/succinctlybooks/books?q=author%3A%22Xavier+Morera%22&wt=
json&indent=true
This time, we can see that the score for each result is now the same.
In order to show you what's happening here, we need to repeat the "database" query, but this
time, we'll use the debugQuery option to help us. If you're running from the Admin UI, make
sure you check the debugQuery box before clicking Execute.
If you're entering the URL directly, make sure you add debugQuery=true to the end of the URL
before submitting it to your browser:
http://localhost:8983/solr/succinctlybooks/books?q=description%3Adatabase&wt=json&i
ndent=true&debugQuery=true
If you scroll down through the results to the debug section, you should see the answer in the
"explain" section; the "fieldnorm" process in Solr is the element that makes all the
difference.
Part of the analysis includes fieldNorm, which penalizes longer fields. If you look at the
following figure, you can see I've drawn a red line across the ends of the descriptions; the
results at the top (with the higher scores) have shorter descriptions.
This is just one specific case, where the keyword appeared only once in each of four documents, and
the only difference was the field length. Real-world queries are usually much more complex.
Let's try searching for my name only in the author field. This should be something like
"q=author:Xavier"; we'll use the following URL and see what happens:
http://localhost:8983/solr/succinctlybooks/select?q=author%3AXavier&wt=json&indent=
true&debugQuery=true
Initially this might seem like an odd response; after all, we know for sure that my name appears
in the author field more than once, so how could our query not find anything?
Re-open the Schema.xml file and refresh your memory on the definitions we previously created.
You'll see author is a string, but the description type is text_general.
It's the field type that makes the difference; string is a simple type, storing just a simple text
string. To find it, you need to run a query for an exact match. This is great for faceting, but not
so good for general searching.
However, text_general is a complex type, as it has analyzers, tokenizers, and filters.
Additionally, within analyzers, it has both query and index time. Its main use is for general
purpose text searching.
Once you understand the different field types, things get much easier.
We only got one exact match. Fantastic, our search works and gives us exact results, right? Not
quite; we don't really want to have to be absolutely specific when doing general searches.
Proximity
What if we wanted to find not only an exact match, but also matches in close proximity? For example,
MongoDB Succinctly had "database system" in its description, but that left out Postgres Succinctly,
which had "database management system," a very close match that could be useful for our
users.
To address this, we have something called proximity matching, otherwise known as the process
of finding words that are within a specific distance of our match word.
Change the query we just issued (the one that only returned one result) so that our q parameter
now reads 'q = description:"database system"~4'. If you are entering this via a URL in
your browser, the new query should look as follows:
http://localhost:8983/solr/succinctlybooks/books?q=description%3A%22database+system
%22~4&wt=json&indent=true&debugQuery=true
As you can see in the following figure, we now have two results, and more importantly, our
score gives us an idea of the order of importance or relevance.
As you might expect, you can also do an OR search. For example, the query term
'description:database OR description:Azure' turns into the following URL:
http://localhost:8983/solr/succinctlybooks/books?q=description%3Adatabase+OR+descri
ption%3AAzure&wt=json&indent=true&debugQuery=true
This yields four results. You can also match between fields, for example, searching for all books
with tags 'aspnet' or with 'Net' in the title. The query term would be 'tags:aspnet OR
title:net', and the following URL demonstrates this:
http://localhost:8983/solr/succinctlybooks/select?q=tags%3Aaspnet+OR+title%3Anet&wt
=json&indent=true
You can nest operators as much as you need to, but you must remember capitalization. Using
AND is different from using and; this is an important point. If you get the capitalization wrong, your
search won't work as expected.
Try altering the URL to include only 'tags:aspnet', then add 'AND -title:Mobile'.
Note that the first form gives two results, and the second only gives one, just as we might
expect.
Wildcard Matching
As you've already seen in many places in this book (*:*), we've used wildcards quite a lot so far.
There's more to wildcards than you might realize, however. Solr supports using wildcards at the
end and in the middle of a word. A ? matches a single character, while * matches
many characters.
In case you're wondering, a '*' at the beginning of a phrase (called a leading wildcard or suffix
query) was originally not supported in Solr. This has since changed, but please know
that it's an incredibly inefficient search method, and not recommended for production use.
Let's try some example wildcard searches. Create some searches (either in the Admin UI or
with a browser URL) using the following query terms:
author:"Xavier*"
author:Xavier*
author:X*a
author:*Morera
Try creating some URLs of your own to satisfy these queries, or simply just use the Admin UI.
Once you understand how the position affects the operator, scroll down to see if your results
match those in the following table.
Query: author:"Xavier*"
Notes: This query has zero results, as you are doing a phrase search.
Query: author:Xavier*
Notes: (Results shown in the original figure.)
Query: author:X*a
Notes: (Results shown in the original figure.)
Query: author:*Morera
Notes: This gives us the results we expected, but remember that placing the '*' in front of the term is inefficient.
Due to the small size of our index and search data in these examples, we don't see a great
deal of difference in the query times. However, if we had a larger data set and index, you would
easily be able to see which methods are the most efficient.
Range Searches
Range queries allow matching of documents with values within a specified range. In
our example, we haven't included any dates (or, for that matter, any range-based data); if we
had done so, in a field called createddate, for example, we could have performed a query that
looked something like:
createddate:[20120101 TO 20130101]
This would have allowed us to search the field createddate for results that were contained in
the lower and upper bounds enclosed by the square brackets. Here are a few more range
examples.
field:[* TO 100] retrieves all field values less than or equal to 100
field:[100 TO *] retrieves all field values greater than or equal to 100
field:[* TO *] matches all documents with the field
Boosts
Query time boosts allow us to define the importance of each field. For example, if you run a
query for a specific term, and you are more interested in that term appearing in the title than
in the description of the document, you might form a query term that looks like this:
title:javascript^1.5 description:javascript
In this case, you are applying a boost of "1.5" to whenever your term appears in the title,
while still remaining interested if it is also present in the description.
I recommend that you use explain so that you can see how your boosting affects scoring. In
our initial tour of the Admin UI, in the query section, we mentioned that there is a checkbox
called debugQuery that is used to display debug information. Enable it, and the response will
come with text that explains why a particular document is a match, or relevant, to your query.
In this particular case, you can see the boost being used to affect the score of your document.
Boosting does not need to be performed on every query; likewise, not every query is performed
only on the default field (df). You can specify the query fields (qf) in your Solrconfig.xml so
that the desired boosts are applied automatically on every query.
The following figure shows an example of a handler for the built-in sample collection (the
collection we used before we defined our Succinctly books collection). As you can see, the
handler pre-specifies which boosts should be applied when searching, and the fields to which they
should be applied automatically.
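A sketch of what such a handler section might contain, using illustrative fields and boost values from our books collection rather than a verbatim copy of the shipped sample:
<str name="defType">edismax</str>
<str name="qf">title^2.0 description^1.0 author^1.5</str>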
This is the essence of how you tweak your Solr application. You make small changes over time
and analyze the results, log the searches your users are performing, then try the tweaks
yourself against those searches. This process of trial and error can be tedious, but in the search
industry in particular, it's often the best way to fine-tune things to provide the expected results.
It is important to take into consideration that df is only used when qf is not specified.
Imagine, however, that we wanted to search just for a keyword, such as Succinctly. This one
should match all of the books in our collection, right? After all, every book in the series has this
word in its title.
Not quite. Run a query from the Admin UI for the term 'Succinctly' and observe the results.
Why are no results returned? It's very simple; let's take a look at our Solrconfig.xml file right
now to find the answer. Please find the /books request handler. As we can see in the following
figure, we have a df of text. df stands for default field; therefore, we are telling Solr that our
default search field is called text.
But if you look within Schema.xml for the copyField declarations for text, you will notice that they are
commented out. Therefore, text is empty, as no information is copied over to it during
indexing.
Let's give it a try and modify df so that it points to description, which can be done in
Solrconfig.xml. As you can see in the following figure, df now has a value of description instead of
text. Don't forget to reload the core or restart Solr.
If we now re-run our previous query following the changes we made to our configuration, we
should see that we get much better results.
If you wish to use the direct URL, just enter the following into your browser:
http://localhost:8983/solr/succinctlybooks/books?q=succinctly&wt=json&indent=true&d
ebugQuery=true
In this case, we are searching in a single field, so let's revert df back to text and create a
copyField for each field we would like to have copied into it. This is done in the Schema.xml file.
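A sketch of the copyField declarations you might add (which source fields to copy is up to you):
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="author" dest="text"/>
<copyField source="tags" dest="text"/>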
Reload and query again. You'll notice it did not work; but why?
The copyField operation happens when a document is indexed, before the index-time analyzer runs. It is the
same as if you had provided the same input text in two different fields. In a nutshell, you
need to reindex.
Reindex the way you did before, by running exercise-1-succinctly-schema-index.bat from exampledocs
on the command line.
Run the query again. How many results should you get?
The answer is 50. Why? We have 53 documents, but re-running the CSV only reindexes the original
50 books; the other three were added manually before the copyField existed, so nothing was copied
into their text field.
Synonyms
Synonyms are used in Solr to match words or phrases that have the same meaning. It allows
you to match strings of tokens and replace them with other strings of tokens, in order to help
increase recall. Synonym mappings can also be used to correct misspellings. Let's try a simple
test to illustrate what I mean.
Tip: To make sure that we have all of our sample data in our index, please open a
command prompt, navigate to solr-succinctly\succinctly\exampledocs, and run the
following batch file: exercise-1-succinctly-schema-index-fixseparator.bat. By doing
so, you will reload the sample books into your index.
Run a query for q=lightning on our books collection; you should see no results found.
Now open Schema.xml for the succinctlybooks collection, and go to our default field type,
text_general. You can find it within the fieldType name="text_general" node, as you can
see in the following figure. Within the analyzer node of type="query", you can see a filter of
class="solr.SynonymFilterFactory". This indicates that your Solr has synonyms configured
for any fields of type text_general, calculated at query time.
Great! That means no re-indexing is needed, although it might potentially affect performance at
a certain scale.
If you look closely at the filter for the synonyms, it has an attribute
synonyms="synonyms.txt". This means that our synonyms dictionary is this text file, which is
located in the conf directory for the succinctlybooks core.
Open the file and add an entry so that lightning is used as a synonym for bootstrap. Entries
are comma-separated values:
lightning,bootstrap
Now try running the query for lightning again, using the /books request handler in the
succinctlybooks core. You will still get no results; as with most configuration changes, you'll
need to reload the core for the change to take effect.
Now, my friend Peter Shaw's Bootstrap book is there. (I personally recommend it to every
single developer who, like me, is UI-challenged! It really makes a difference.)
Stopwords
Stopwords are how Solr deals with removing common words from a query. Common words are
defined as standard English common words such as 'a', 'an', 'and', 'are', and 'as', along with
many others. Any word that is likely to be commonly found in every sentence could be classed
as a stopword.
In some cases, a word does not have any special meaning within a specific index. In our case,
all documents have the word succinctly, so it provides no additional value when used. In a
previous project that I worked on, I had to index all patents and applications worldwide; this led
to the word patent not having any special meaning.
Let's try a query with q=succinctly. You should get the following results:
Please remember that you can construct the URL for this query from the Admin UI, by running a
query and clicking the gray box at the top right.
All results are found, as the word occurs in every single document. The way to indicate which
words should not be used in a query is via stopwords; this is done via solr.StopFilterFactory.
To add a stopword, you need to go to the conf directory, in the same location where we
modified our synonyms.
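For example, open stopwords.txt (the file referenced by the StopFilterFactory for text_general) and add the word on a line of its own:
succinctly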
If you run the query now, all documents should still be returned, which means that the
stopwords are not working. This is expected, because we just made a configuration change, but
have not yet reloaded the core as required.
Reload the core and query again. No results are found, which is the outcome we expected.
Stopwords can be added both at query time and at index time. They are very useful at index time
because removing very common words from your index helps with the index
size. Applying them only at query time is also useful, as no reindexing is required.
Summary
In this section, we have learned some of the basics of searching using Solr. This is an extremely
large subject that could span hundreds or even thousands of pages, but this kick-start puts
you in a nice position to move forward on your own.
In the next section, we will discuss user interfaces with Solr.
Chapter 9 Add a UI
Solr can be used in many different ways. In a lot of cases, you can use it as a small functionality
within your application. For example, it can be used to implement a type ahead function as an
aid to the end user. In other cases, your application might be more search-centric, for example,
a patent analysis application to find prior art.
In any case, and irrespective of your current requirements, it's highly likely that at some point
you'll need a custom user interface for Solr. In this chapter I present two well-known alternatives
that will make that task much easier.
The Velocity ResponseWriter, also known as Solritas, is a handler that renders results using the
Velocity template engine.
You can read more at http://velocity.apache.org/. It has not been updated recently, but you can
use it to learn a lot about querying, geolocation, and much more. Velocity is a very quick and
easy way to generate a UI for testing your data. To access it, simply navigate to the following:
http://localhost:8983/solr/browse
Your turn: Why not try modifying it to use the Succinctly collection? I would recommend adding a
publicationdate field in schema.xml, then adding random dates for all the books in our sample
data file, books.csv, indexing, and testing. Branch the project in GitHub and give it a shot to
learn how it works. It includes geolocation and boosting. A sketch of the field definition
follows.
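This is only an assumption about how you might declare it; the field name and the exact date type are your call, and must match how you populate books.csv:

<field name="publicationdate" type="date" indexed="true" stored="true"/>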
What is SolrNet?
As the website states:
For a .NET developer, SolrNet helps you work with Solr in a very natural way by allowing you to
represent your schema with Plain Old CLR Objects (POCOs). If you are not familiar with POCOs, a
POCO is basically a class that mirrors exactly what we have in our schema.xml, type for type,
with the exact same names.
SolrNet makes Solr feel part of your code in a way that a RESTful interface really cant.
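As a rough sketch of the idea, here is what a POCO for our books might look like. Only bookid comes from the book's schema; the other field names are placeholders for illustration and must match whatever your schema.xml actually defines:

using SolrNet.Attributes;

public class Book
{
    // Must match the uniqueKey defined in schema.xml
    [SolrUniqueKey("bookid")]
    public string BookId { get; set; }

    // Each property maps to an index field of the same name (placeholder names)
    [SolrField("title")]
    public string Title { get; set; }

    [SolrField("author")]
    public string Author { get; set; }
}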
SolrNet's History
SolrNet was created by Mauricio Scheffer, from Argentina, in 2007. I contacted him personally
and asked about the history of SolrNet. He pointed me to the original blog post where he first
introduced SolrNet, which can be found here.
He also gave me an overview of how SolrNet was born. He had a requirement to add facets to a site
he was working on at the time, but due to other work commitments he did not have time to act on
it, so he was paid an additional sum to complete the work outside normal office hours. As part of
the project, he negotiated the release of the code as open source.
He originally hosted the code on code.google.com but, as has become common with many open source
projects, it now lives on GitHub.
At the time, Solr was on version 1.2, meaning it was an early release, and one with very little
documentation. He based some of his work on SolrSharp, which by that point had fallen into an
inactive state. His main driving force, however, was the desire to add unit tests and improve the
overall build of the library.
In any case, thank you, Mauricio! Also, special thanks for responding to my messages with such
insight and information, allowing me to share the story with my readers.
Getting SolrNet
To get SolrNet, simply clone it from GitHub: https://github.com/mausch/SolrNet
If you don't know Git, there are two things you can do: just click Download to get a local copy
of the code, or get the Git Succinctly e-book. Git is an amazing tool that you should not ignore.
My Git client of choice is SourceTree, but feel free to use whichever makes you most comfortable.
There is also an old URL on code.google.com; it is the original repository, still alive but no
longer maintained, so ignore it.
SolrNet
Once you have SolrNet, there are several ways to get it up and running. You will need Visual
Studio. I have 2012, but it works with other versions as well, including Visual Studio Community.
SolrNet comes with a sample .NET application. You can use this as a base to create your own
Solr application, or just as a testing ground for your Solr configuration and development.
It comes in the form of a standard ASP.NET MVC application. If you're not familiar with
ASP.NET MVC, you may have a few other concepts to learn to get started; even so, the
SampleSolrApp is a good way to learn how to use SolrNet.
My usual workflow is to first open the main SolrNet project and build the solution, just to check
that everything is present and working OK.
Once you're happy that SolrNet is working OK, close that project and open the solution for the
sample application. As shown in the following figure, rebuild the solution as you did with
SolrNet.
At this point, you should expect to get some build errors in the solution.
If you look at the project references, you'll see that you need to re-link the newly rebuilt
SolrNet assembly.
You can fix this by re-binding the reference to SolrNet.dll, which can be found in
SolrNet\bin\Debug.
If you rebuild again after adding the reference, you'll find you still have a couple more things to
fix.
Add SolrNet.DSL from SolrNet.DSL\bin\Debug to fix the remaining issues, then rebuild and
run.
If everything has worked, you should be greeted with the following web application:
Go ahead and play around, run queries, and analyze responses. View how facets, paging, and
items per page affect queries. Put in a breakpoint or two. Compare it with what you have in the
Admin UI.
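If you want to see what the sample app is doing under the hood, the core SolrNet workflow is small enough to sketch here. This is a minimal example rather than the sample app's actual code; it assumes the Book POCO sketched earlier and our succinctlybooks core, the facet field name is a placeholder, and namespaces vary slightly between SolrNet versions:

using System;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Commands.Parameters;

public class QuerySketch
{
    public static void Run()
    {
        // Wire SolrNet up once, at application startup
        Startup.Init<Book>("http://localhost:8983/solr/succinctlybooks");

        // Resolve the typed client and run a query with one facet field
        var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Book>>();
        var results = solr.Query(new SolrQuery("lightning"), new QueryOptions
        {
            Rows = 10,
            Facet = new FacetParameters
            {
                Queries = new ISolrFacetQuery[] { new SolrFacetFieldQuery("author") }
            }
        });

        // Matching documents come back as strongly typed Book objects
        foreach (var book in results)
            Console.WriteLine(book.Title);
    }
}

This is essentially what the sample application wires up for you in its startup code and controllers.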
3. At this point, if you run the SolrNet app, you will get an application error as follows:
4. Open Solr's log, and you will see the error. It all makes sense now: you are trying to
read from succinctlybooks using the collection1 schema. How do I know? Look at the
following figure. Solr did not tell me directly, but it pointed me in the right direction by
stating "document is missing mandatory field bookid". I realized I had
documents in my index that did not have the unique key, meaning that they came from
another collection. It may seem hard at this point, but once you gain experience, you
will be able to pick out these errors much more easily.
5. Don't believe me completely? Use the best trick in the book while debugging: turn on
Break on Exceptions (quick access is via Ctrl+Alt+E in Visual Studio). By default, Visual Studio
does not tell you when an exception is raised and then caught. With Break on Exceptions turned
on, the debugger stops on the exact line where any exception occurs.
Now you can clearly see that the real exception is being masked.
If you've gotten to this point, you are on the right path. Here are a few tips on next steps:
The sample app uses a method called AddInitialDocuments to populate the index with sample
data. We don't need it for the succinctlybooks collection, so comment it out.
You need to modify your POCO to match your Solr schema.xml. It is currently defined in
Product.cs, as shown in the following figure.
You need to modify the facets to load only those that are related to succinctlybooks, not
collection1.
Make sure you use only fields from your own collection.
Summary
In this section, we have learned how to add a user interface to our Solr search engine. The
first option was the Velocity ResponseWriter (Solritas), which is built into the Solr download.
The second option was SolrNet's sample application. It is not a finished, full-blown
application, but it is an excellent start for something that might make you some money, or
save you some money.
Final Words
And with this, we have concluded this e-book, part of the amazing Succinctly series from
Syncfusion. Let's just take a few minutes to do a final review.
You have an idea or a need. This idea might make you some money or save you some money.
Search is an important piece of many of the ideas out there. If you don't do it right, you might
frustrate your users, but if you do it properly, you can entice them.
Search used to be difficult and expensive; it used to be a long, steep road. But this has all
changed. Now, Solr comes to the rescue.
To get your idea up and running, you first have to understand where your data is. There could
be multiple data sources, like databases, custom management systems, files, feeds, web
pages, or even data entered by your users. There are different ways of getting the data, for
example, with crawlers or connectors.
Get to know your data, get your Solr ready, model your data in schema.xml, configure your Solr
in solrconfig.xml, and then index your data.
Once you've stored, sorted, and indexed things, your data is searchable via REST.
If you want to take it a step further, you have SolrNet (or SolrJ, or Solritas) to help you. You
can look at the Solr sample application if you want an easy way to get started. There are other
packages and applications that can be used, but I didn't mention them here.
If you've come this far, you are on the right path to do some amazing things with your
implementation of Solr in your application.
I am Xavier Morera, and I thank you for staying with me. I hope you've enjoyed reading Solr
Succinctly and following along as much as I have enjoyed writing it.
Ping me on Twitter @xmorera if you have questions or comments, or if there is anything I can
do to help you.
This is not the end; it is the beginning of your great journey in search!