Full Text Indexing
Technical Note
P/N H10778
REV A01
June 28, 2012
EMC Confidential – Internal use only
About this document
Audience
This document is intended for eRoom Administrators, EMC Technical
Support engineers, and Triage engineers for eRoom. It is assumed that
the reader is familiar with eRoom functionality, including but not
limited to eRoom MMC navigation, the Site Settings page, and eRoom
Search, and has a basic understanding of full-text indexing.
Introduction
Q: What is Hummingbird Search Server?
A: Third-party software that eRoom uses for full-text indexing.
Q: What is the eRoom Index Server(s)?
A: Any machine(s) on which the Hummingbird Search software resides
is referred to as the Index Server.
Q: What are the operations performed by the Index Server?
A: Index Server operations take place in two steps:
Step 1 Step 2 (Index Validation)
Immediate mode indexes quickly, but at the cost of increased disk space usage and
therefore longer search times. Periodic mode is slower to index, but when it
completes you have a compact index (using less disk space) and therefore
shorter search times. This is the tradeoff between quick indexing and fast searches.
Q: Which indexing mode does eRoom employ?
A: eRoom switches between the immediate and periodic modes depending on the
circumstances. By design, the eRoom application performs regular indexing
activities in immediate mode and skips Step 2. Because Step 2 is an expensive
operation, eRoom performs it only under specific circumstances.
Q: What are the files available on the Index Server?
A: Each eRoom facility has a separate full-text table and index. These files are
located in the ~FullText folder of the Index Server machine. Each eRoom facility
has nine different file types in ~FullText.
File types that make up an Index table
Note: Temporary files such as .rup, .rupx and .dup files are created during
Step 2 (Index Validation). These files are changed to dictionary and reference
files after table validation is completed. Since the Search Server limits a single
reference file to 2GB in size, in the case of very large facilities, you may see
multiple reference files (.ref.001 .ref.002 etc.).
Q: The ~FullText folder contains several files. How do I match a file to its
corresponding facility?
A: Every file in the ~FullText folder contains a <facilityGUID> in the file name
(prefixed by an “i” or “f”) that uniquely determines the facility with which it is
associated. The <facilityGUID> occurs in the filename after the initial “i” or “f”.
Q: I only know the URL name of the facility. How do I obtain the
<facilityGUID>?
A: Run the following script using the ERSQLExec7.exe utility:
USE DATABASE eRoomSite;
SELECT UrlName, UniqueID FROM Facilities
WHERE UrlName LIKE '%<FacilityURL>%' AND Deleted = 0;
Q: For any <facilityGUID> I see some file names starting with “i” and some
starting with “f”. What is the difference between i-files and f-files?
A: For any eRoom facility, there are eight f-files (f<facilityGUID>) that
correspond to indexing of custom fields only and largely remain unused after
creation. The remaining nine i-files (i<facilityGUID>) are the bread and butter
files of the facility full-text index. In most situations, the f-files can be ignored
with the focus on the i-files only. Henceforth, we will focus on the i-files
(ignoring the f-files) and any subsequent references to the full-text table/indexes
will mean i-files only.
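As a rough illustration of the naming scheme described above, the following Python sketch groups the contents of a ~FullText folder by facility and by i-file or f-file prefix. It assumes that each file name consists of the prefix letter, a 36-character <facilityGUID>, and a dot-separated extension. The function name, the pattern, and the sample file names are hypothetical and are not taken from eRoom.

```python
import re
from collections import defaultdict

# Assumed layout (not confirmed by eRoom documentation): one prefix letter
# (i or f), a 36-character GUID whose groups are separated by "_" or "-",
# then ".<extension>" (for example .dyx or .ref.001).
FILE_PATTERN = re.compile(
    r"^([if])([0-9a-f]{8}(?:[_-][0-9a-f]{4}){3}[_-][0-9a-f]{12})\.(.+)$",
    re.IGNORECASE,
)

def group_fulltext_files(file_names):
    """Return {facility_guid: {"i": [extensions], "f": [extensions]}}."""
    grouped = defaultdict(lambda: {"i": [], "f": []})
    for name in file_names:
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # not an index file under the assumed naming scheme
        prefix, guid, extension = match.groups()
        grouped[guid.upper()][prefix.lower()].append(extension)
    return dict(grouped)

# Hypothetical directory listing used only to exercise the function.
sample = [
    "i111AEF5C_B0A7_401D_9FA0_2D91509F4529.dyx",
    "i111AEF5C_B0A7_401D_9FA0_2D91509F4529.ref.001",
    "f111AEF5C_B0A7_401D_9FA0_2D91509F4529.dyx",
    "readme.txt",
]
files_by_facility = group_fulltext_files(sample)
```

In practice the list of names could come from a directory listing of ~FullText; files that do not fit the assumed pattern are simply skipped.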
Q: How do immediate mode indexing and periodic mode indexing work?
A: In immediate mode indexing, the eRoom application reads textual data and
writes it as is to the differential index (.dyx) files. As a result, the
differential index file grows by at least the size of the text that is read.
Immediate mode indexing is space-inefficient, but the data becomes available
to eRoom searches immediately. While the .dyx file grows, the .ref and .refx
files do not, because they are not written to.
In periodic mode indexing, new data is first read and written, and then the
VALIDATE INDEX operation is performed, which digests all the changes made and
incorporates them into the periodic index (.ref and .refx) files. Index
validation is an expensive operation that exploits redundancies in the text, so
the periodic index grows by only a fraction of the size of the newly added
textual data. The periodic index is therefore compact and provides faster
search times. While the .ref and .refx files grow, the .dyx file is
re-initialized at the end of validation and is very small.
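The difference between the two modes can be illustrated with a toy model. The sketch below is not how Hummingbird SearchServer is implemented; it only assumes, for illustration, that the differential index behaves like an append-only log of raw text and that validation folds that log into a compact term-to-document map. All class and method names are hypothetical.

```python
from collections import defaultdict

class ToyIndex:
    """Toy model of a differential index plus a periodic index."""

    def __init__(self):
        self.differential = []            # raw (doc_id, text) pairs, analogous to .dyx
        self.periodic = defaultdict(set)  # term -> doc ids, analogous to .ref/.refx

    def add_immediate(self, doc_id, text):
        # Immediate mode: store the text as is, so it is searchable right away.
        self.differential.append((doc_id, text))

    def validate(self):
        # Periodic mode (validation): fold the raw text into the compact
        # term map, then re-initialize the differential log.
        for doc_id, text in self.differential:
            for term in text.lower().split():
                self.periodic[term].add(doc_id)
        self.differential.clear()

    def search(self, term):
        term = term.lower()
        hits = set(self.periodic.get(term, ()))
        # Unvalidated text must be scanned in full, which is the slow path.
        for doc_id, text in self.differential:
            if term in text.lower().split():
                hits.add(doc_id)
        return hits

index = ToyIndex()
index.add_immediate(1, "Project plan")
index.add_immediate(2, "Budget for the project")
before = index.search("project")   # served from the differential log
index.validate()
after = index.search("project")    # served from the compact term map
```

In this model a repeated term is stored once per document after validation, which is the sense in which the periodic index exploits redundancy in the text.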
eRoom Scheduler Service: The Indexer Thread
eTrace.
Indexer thread activities are broadly classified into the following types:
1. Incremental Indexing – This is a recurring activity that always
runs in the background and automatically applies the latest
changes in eRoom to the full-text tables.
2. Rebuild Indexing – This is a one-time activity and must be
triggered explicitly. It purges all existing index tables for the
specific facility and prepares fresh full-text tables.
Incremental Indexing
The indexer thread wakes up every 5 minutes (this interval is configurable) to
perform incremental indexing activities.
Step 1 Index Server eRooms: Detect all eRooms marked as not indexed and
index them.
Step 2 Process Event Logs: Iterate through all facility events and changes
that have occurred since the last run of incremental indexing, and
update full-text tables.
Incremental indexing takes into account different kinds of eRoom events,
including new and imported eRooms, new items, deleted items, edited or moved
items, or items for which access control was modified, and so on. Incremental
indexing is crucial to keep the facility full-text tables up-to-date with the latest
changes made in eRoom.
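The two steps of an incremental indexing run can be sketched as follows. This is a simplified model under stated assumptions, not the Scheduler service's actual code: the Room and Event types, the event kind strings, and the function name are all hypothetical, and timestamps are plain numbers of seconds.

```python
from dataclasses import dataclass

@dataclass
class Room:
    room_id: int
    indexed: bool = False

@dataclass
class Event:
    timestamp: float
    room_id: int
    kind: str  # hypothetical labels such as "new_item" or "deleted_item"

def run_incremental_cycle(rooms, events, last_run, now):
    """Return the list of index actions for one run and the new last-run time."""
    actions = []
    # Step 1: index every room that is still marked as not indexed.
    for room in rooms:
        if not room.indexed:
            actions.append(("index_room", room.room_id))
            room.indexed = True
    # Step 2: replay, in order, only the events recorded since the previous run.
    for event in sorted(events, key=lambda e: e.timestamp):
        if last_run < event.timestamp <= now:
            actions.append((event.kind, event.room_id))
    return actions, now

rooms = [Room(6, indexed=True), Room(11)]
events = [Event(100.0, 6, "new_item"), Event(400.0, 6, "deleted_item")]
# With the default five-minute interval, consecutive runs are 300 seconds apart.
actions, last_run = run_incremental_cycle(rooms, events, last_run=300.0, now=600.0)
```

The event at 100 seconds is skipped because an earlier run would already have processed it; only the un-indexed room and the newer event produce work.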
Q: How can the eRoom administrator configure or disable incremental
indexing activities?
A: Incremental indexing is enabled by default. However, the eRoom administrator
can disable it from the Scheduler dialog box on the Site Settings page (Index
New Files). The indexing interval (five minutes by default) can also be changed
in the eRoom Server MMC by navigating to Server Tuning, General tab, and
selecting Background Tasks (Full Text Indexing Interval). Keeping incremental
indexing disabled for extended periods while the server is in use degrades the
end-user search experience, because newly added data is not indexed and
therefore does not appear in search results.
Rebuild Indexing
You can perform the Rebuild indexing operation by navigating to the relevant
facility from the Site Settings page. When a facility index rebuild is triggered, the
facility is immediately marked for a rebuild and then the following activities are
performed (immediately if the Scheduler service is running, or if it is not running,
then whenever it starts up next):
1. Purge all existing full-text table data for this facility (all 16 files
in ~FullText) and mark the facility for rebuild.
Note: When the rebuild indexing operation is initiated, the rebuild index
event takes priority over incremental indexing activities. This is of paramount
importance in a customer environment where the large size of a single facility
could mean that a rebuild will take a long time to complete (up to several
hours or even days, depending on the size of the facility), putting the
incremental indexing on hold and impacting the search experience.
Therefore, it is generally recommended that an index rebuild for a large
facility be triggered only during scheduled downtime or during low-load
hours.
Full-text Search
When a user performs a search operation within eRoom, the scope of the
operation is limited by the permissions granted to the user. Although the
SearchServer full-text tables are set up at the facility level, the scope of
any search operation within eRoom is defined not in terms of the
facilities to search, but in terms of the eRooms to search.
The search scope depends on the following parameters:
- Type of search (room-level or site-wide)
- Room memberships of the user (for site-wide search)
- Community/site administrator privileges of the user (for the
  Search all rooms that I administer option)
- Rooms that are indexed and ready for searching
Any eRoom search operation consists of the following steps:
1. Accepting keywords from the user
2. Defining the scope of the search
3. Setting up the list of indexed rooms available for searching
(ignoring the un-indexed rooms)
4. Utilizing the keywords and scope to construct the
appropriate FIND command to issue to the Hummingbird
SearchServer
5. Building a result set of rows returned by the SearchServer
6. Mapping each row in the result set to the corresponding
eRoom item
7. Displaying the list of items to the user
Consider the following sample of the FIND command issued to the
SearchServer:
Sample eRoom eTrace snippet
08/19/2011 17:34:15 w3wp[730:780]: FullText,3:
CERSearchServer::Find - msec(0) SELECT RELEVANCE('2:1') AS REL,
FT_CID,DOCID,NAME,DISPLAYDOC,DISPLAYNAME,FOUNDIN,FOUNDINNAME,
ROOMID,DBID,MODIFYDATE,OWNERS,CANOPEN,UIFLAGS,ISROOM,FT_SFNAME
FROM IF2857508_1595_445D_889A_B01527A99E8D UNION
I111AEF5C_B0A7_401D_9FA0_2D91509F4529 WHERE (ROOMID IN (6,11))
AND ((NAME CONTAINS 'project' WEIGHT 5) OR (FINDTEXT CONTAINS
'project') OR (FT_TEXT CONTAINS 'project')) AND (ISROOM = 0)
ORDER BY REL DESC, DOCID DESC
The keyword in this snippet is ‘project’. Only two facilities are searched
(the two tables joined by UNION), and the ROOMID IN (6,11) condition limits
the scope to two rooms. When the ROOMID IN condition is absent from the
WHERE clause, all rooms in both facilities are included in the scope.
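The structure of the sample FIND command can be reproduced with a small builder. The sketch below only mirrors the shape of the traced command; it is not eRoom's own code. The function name is hypothetical, and doubling single quotes to escape the keyword is an assumption borrowed from standard SQL rather than a documented SearchServer rule.

```python
# Column list copied from the sample trace.
COLUMNS = (
    "FT_CID,DOCID,NAME,DISPLAYDOC,DISPLAYNAME,FOUNDIN,FOUNDINNAME,"
    "ROOMID,DBID,MODIFYDATE,OWNERS,CANOPEN,UIFLAGS,ISROOM,FT_SFNAME"
)

def build_find_command(keyword, facility_tables, room_ids=None):
    """Build a FIND command shaped like the one in the sample trace."""
    if not facility_tables:
        raise ValueError("at least one facility table is required")
    # Assumption: single quotes are escaped by doubling, as in standard SQL.
    kw = keyword.replace("'", "''")
    conditions = []
    if room_ids:
        # Omitting this condition leaves every room in the facilities in scope.
        ids = ",".join(str(int(room_id)) for room_id in room_ids)
        conditions.append(f"(ROOMID IN ({ids}))")
    conditions.append(
        f"((NAME CONTAINS '{kw}' WEIGHT 5) OR (FINDTEXT CONTAINS '{kw}') "
        f"OR (FT_TEXT CONTAINS '{kw}'))"
    )
    conditions.append("(ISROOM = 0)")
    return (
        f"SELECT RELEVANCE('2:1') AS REL,{COLUMNS} "
        f"FROM {' UNION '.join(facility_tables)} "
        f"WHERE {' AND '.join(conditions)} "
        "ORDER BY REL DESC, DOCID DESC"
    )

tables = [
    "IF2857508_1595_445D_889A_B01527A99E8D",
    "I111AEF5C_B0A7_401D_9FA0_2D91509F4529",
]
scoped = build_find_command("project", tables, room_ids=[6, 11])
unscoped = build_find_command("project", tables)
```

Comparing the two results shows the point made above: the only difference is whether the ROOMID IN condition appears in the WHERE clause.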
Q: How do I map the ROOMID to a room name?
A: To map the ROOMID values in the FIND command to their URL names,
run the following script via ERSqlExec7:
USE DATABASE eRoomSite;
SELECT UrlName FROM Rooms WHERE InternalID IN (<IDs seen in the
FIND command>);
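When working from an eTrace log, the table names and room IDs can be pulled out of a FIND command before running the queries above. The following Python sketch is one possible approach; the function name is hypothetical, and treating each table name as the letter I followed by the <facilityGUID> is an assumption based on the file naming described earlier.

```python
import re

def parse_find_command(trace_text):
    """Return (facility_tables, room_ids) found in a traced FIND command."""
    text = " ".join(trace_text.split())  # undo the line wrapping in the log
    from_match = re.search(r"\bFROM\s+(.+?)\s+WHERE\b", text)
    tables = re.split(r"\s+UNION\s+", from_match.group(1)) if from_match else []
    room_match = re.search(r"ROOMID\s+IN\s*\(([\d,\s]+)\)", text)
    room_ids = []
    if room_match:
        room_ids = [int(part) for part in room_match.group(1).split(",") if part.strip()]
    return tables, room_ids

trace = """SELECT RELEVANCE('2:1') AS REL,FT_CID,DOCID,NAME
FROM IF2857508_1595_445D_889A_B01527A99E8D UNION
I111AEF5C_B0A7_401D_9FA0_2D91509F4529 WHERE (ROOMID IN (6,11))
AND (ISROOM = 0) ORDER BY REL DESC, DOCID DESC"""

tables, room_ids = parse_find_command(trace)
# Assumption: dropping the leading "I" leaves the <facilityGUID>.
facility_guids = [name[1:] for name in tables]
```

The room IDs can then be substituted into the Rooms query, and the GUIDs compared against the UniqueID column returned by the Facilities query.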
Diagnosis and Troubleshooting
Configuring Indexing Activities: The Debug Registry Key
searchable. This key value will only take effect when the specified room
or facility is marked as un-indexed.