Full Text Indexing

Uploaded by

Bhanu Aleti
EMC Confidential – Internal use only

EMC® Documentum® eRoom


Full-Text Indexing

Technical Note
P/N H10778
REV A01
June 28, 2012

This technical note contains information on these topics:


• About this document
• Introduction
• eRoom Scheduler Service: The Indexer Thread
• Full-text Search
• Diagnosis and Troubleshooting
• Configuring Indexing Activities: The Debug Registry Key

About this document


This document explains how the background process of full-text
indexing takes place in the Documentum eRoom Server, how the server can
be configured for optimal performance, and how it can be monitored for
erroneous behavior.
This document can be used as a starting point for users who are
responsible for administering the eRoom Server and must understand
the indexing and search features, or for users who are responsible for
troubleshooting problems associated with the indexing and search
features. It is recommended that the reader go through the document
sequentially, without skipping any section, as some of the later sections
require knowledge of content discussed in the earlier sections.
The introduction of this document explains the third-party server that
eRoom employs for full-text indexing and provides a brief description of
the features/modes that the third-party software exposes to eRoom. The
document then moves on to describing the process of indexing,
including the various eRoom processes and threads that are responsible
for indexing. Next, it addresses the eRoom Search feature and explains
the steps and components involved in constructing an eRoom Search
query, and conversely, how to deconstruct a search query to determine
the scope. Finally, the document deals with some of the common themes
seen while diagnosing customer issues related to indexing and search,
and lists generic steps to address the issues. The last section deals with
useful debug registry keys exposed by eRoom for the purpose of tuning
and configuring the indexing process.

Audience
This document is intended for eRoom Administrators, EMC Technical
Support engineers, and Triage engineers for eRoom. It is assumed that
the reader is familiar with eRoom functionality, including but not
limited to eRoom MMC navigation, the Site Settings page, and eRoom
Search, and has a basic understanding of full-text indexing.


Introduction
Q: What is Hummingbird Search Server?
A: Third-party software that eRoom uses for full-text indexing.
Q: What is the eRoom Index Server(s)?
A: Any machine(s) on which the Hummingbird Search software resides
is referred to as the Index Server.
Q: What are the operations performed by the Index Server?
A: Index Server operations take place in two steps:
Step 1: Read the textual data from eRoom pages and files and write it to the
Hummingbird tables.
Step 2 (Index Validation): Index the tables to produce the indexes.

Q: What are the different indexing modes that Hummingbird allows?


A: Hummingbird Search Server can index data in one of the following modes:

Immediate Mode   Quick. Newly added data is available for search immediately
                 after Step 1.
Periodic Mode    Slow. Newly added data must be indexed in Step 2 before it
                 is reflected in the searches.

The immediate mode is fast to index, but at the cost of increased disk space
usage and therefore longer search times. The periodic mode is slow to index, but
when it completes, you have a compact index (using less disk space) and therefore
shorter search times. This is the tradeoff between quick indexing and fast searches.
Q: Which indexing mode does eRoom employ?
A: eRoom switches between the Immediate and Periodic modes based on the
circumstances. The eRoom application performs regular indexing activities in the
Immediate Mode and skips Step 2, by design. Since step 2 is an expensive
operation, eRoom performs that step only under specific circumstances.
Q: What are the files available on the Index Server?
A: Each eRoom facility has a separate full-text table and index. These files are
located in the ~FullText folder of the Index Server machine. Each eRoom facility
has nine different file types in ~FullText.
File types that make up an Index table

Extension  Function/Purpose                                           Importance
.dct       Dictionary File. List of unique words occurring in eRoom.  Mandatory.
.ref       Periodic Index. Contains pointers to word occurrences.     Mandatory.
.refx      Periodic Index. Contains pointers to word occurrences.     Mandatory.
.dyx       Differential Index. Contains latest changes.               Mandatory.
.log       Full-text table log                                        Mandatory. Useful to OpenText.
.stm       Stem File                                                  Not Required. May be missing.
.cix       Security File                                              Mandatory
.cat       Security File                                              Mandatory
.cfg       Configuration File                                         Mandatory

Note: Temporary files such as .rup, .rupx and .dup files are created during
Step 2 (Index Validation). These files are changed to dictionary and reference
files after table validation is completed. Since the Search Server limits a single
reference file to 2GB in size, in the case of very large facilities, you may see
multiple reference files (.ref.001 .ref.002 etc.).

Q: The ~FullText folder contains several files. How do I match a file to its
corresponding facility?
A: Every file in the ~FullText folder contains a <facilityGUID> in the file name
(prefixed by an “i” or “f”) that uniquely determines the facility with which it is
associated. The <facilityGUID> occurs in the filename after the initial “i” or “f”.
Q: I only know the URL name of the facility. How do I obtain the
<facilityGUID>?
A: Run the following script using the ERSQLExec7.exe utility:
USE DATABASE eRoomSite;
SELECT UrlName, UniqueID FROM Facilities
WHERE UrlName LIKE '%<FacilityURL>%' AND Deleted = 0;
Q: For any <facilityGUID> I see some file names starting with “I” and some
starting with “F”. What is the difference between i-files and f-files?
A: For any eRoom facility, there are eight f-files (f<facilityGUID>) that
correspond to indexing of custom fields only and largely remain unused after
creation. The remaining nine i-files (i<facilityGUID>) are the bread-and-butter
files of the facility full-text index. In most situations, the f-files can be ignored,
with the focus on the i-files only. Henceforth, we will focus on the i-files
(ignoring the f-files), and any subsequent references to the full-text table/indexes
will mean i-files only.
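The file-naming convention described above can be sketched in Python. This is only an illustration, not an eRoom utility. It assumes that each file name consists of the prefix letter, the facility GUID, and the extension (for example, i<facilityGUID>.dct), and that split reference files carry a numeric suffix such as .ref.001. The function names are hypothetical.

```python
from collections import defaultdict

# The nine file types listed in the table above.
EXPECTED_EXTENSIONS = {".dct", ".ref", ".refx", ".dyx", ".log",
                       ".stm", ".cix", ".cat", ".cfg"}


def group_fulltext_files(file_names):
    """Group ~FullText file names by (prefix, facility GUID).

    Assumes names look like '<i|f><facilityGUID><.ext>'; the exact
    on-disk format is an assumption made for this sketch.
    """
    groups = defaultdict(set)
    for name in file_names:
        prefix = name[:1].lower()
        if prefix not in ("i", "f") or "." not in name:
            continue  # not a full-text table file under this assumption
        stem, _, rest = name[1:].partition(".")
        # Split reference files such as '.ref.001' are counted as '.ref'.
        extension = "." + rest.split(".")[0].lower()
        groups[(prefix, stem.upper())].add(extension)
    return dict(groups)


def missing_extensions(found):
    """Return the mandatory extensions absent from the found set.

    The .stm file is excluded because the table marks it as not required.
    """
    return sorted(EXPECTED_EXTENSIONS - {".stm"} - set(found))
```

Calling group_fulltext_files on a directory listing and then missing_extensions on the i-file group of one facility gives a quick view of which mandatory file types are absent.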
Q: How do immediate mode indexing and periodic mode indexing work?
A: In immediate mode indexing, the eRoom application reads textual data and
writes it as is to the differential index (.dyx) file. As a result, the size of the
differential index file grows by at least the size of the text that is read. Although
immediate mode indexing is space-inefficient, the data is available in eRoom
searches immediately. The .dyx file grows, but the .ref and .refx files do not,
because they are not written to.
In periodic mode indexing, new data is first read and written, and then the
VALIDATE INDEX operation is performed, which digests all changes made and
incorporates them into the periodic index (.ref and .refx) files. The index
validation step is an expensive operation that exploits redundancies in text, so
the periodic index grows by only a fraction of the size of the newly added textual
data. Thus the periodic index is compact and provides faster search times.
The .ref and .refx files grow, but the .dyx file is re-initialized at the end and
is very small.

Immediate Mode   Read and write new textual data to the Differential Index (.dyx).
                 The growth in .dyx file size is more than the text size.
                 The Periodic Index (.ref and .refx) is not updated.

Periodic Mode    After reading/writing new textual data, Validate Index is issued.
                 Index Validation is designed to digest the changes efficiently.
                 Growth in the size of the periodic index (.ref and .refx) files
                 is a fraction of the text size.
                 The Differential Index (.dyx) is re-initialized and contains no
                 indexed data.

eRoom Scheduler Service: The Indexer Thread


The eRoom Scheduler service runs through a multithreaded process,
ERNotifier.exe. All indexing activities are performed by a single thread. To
capture indexer thread activities in eRoom eTrace, you must enable the subsystem
“FullText” at level 1 or higher. The thread name <FullTextIndexer> is listed in the
eTrace.
Indexer thread activities are broadly classified into the following types:
1. Incremental Indexing - This is a recurring activity that always
runs in the background and automatically updates the latest
changes in eRoom to the full-text tables.
2. Rebuild Indexing - This is a one-time activity and must be
triggered explicitly. It purges all existing index tables for the
specific facility and prepares fresh full-text tables.

Incremental Indexing
The indexer thread wakes up every 5 minutes (this interval is configurable) to
perform incremental indexing activities.
Step 1 Index Server eRooms: Detect all eRooms marked as not indexed and
index them.
Step 2 Process Event Logs: Iterate through all facility events and changes
that have occurred since the last run of incremental indexing, and
update full-text tables.
Incremental indexing takes into account different kinds of eRoom events,
including new and imported eRooms, new items, deleted items, edited or moved
items, or items for which access control was modified, and so on. Incremental
indexing is crucial to keep the facility full-text tables up-to-date with the latest
changes made in eRoom.
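The two-step cycle above can be modelled with a short Python sketch. It is a simplified model under stated assumptions, not the ERNotifier.exe implementation: the four callables are hypothetical stand-ins for eRoom internals, and the 300-second default mirrors the five-minute interval mentioned above.

```python
import time


def run_incremental_indexing(find_unindexed_rooms, index_room,
                             read_new_events, apply_event,
                             interval_seconds=300, max_cycles=None,
                             sleep=time.sleep):
    """Model of the indexer thread's wake-up cycle (illustrative only)."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        # Step 1: index every room still marked as not indexed.
        for room in find_unindexed_rooms():
            index_room(room)
        # Step 2: replay the events recorded since the previous run.
        for event in read_new_events():
            apply_event(event)
        cycles += 1
        sleep(interval_seconds)
    return cycles
```

Because both steps run on the same thread, a long Step 1 delays Step 2 for every facility, which is why the ordering of work matters later in this note.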
Q: How can the eRoom administrator configure or disable incremental
indexing activities?
A: Incremental indexing is enabled by default. However, the eRoom administrator
may disable it from the Scheduler dialog box on the Site Settings page (Index
New Files). Even the indexing interval (set to five minutes by default) is
configurable as per requirements using the eRoom Server MMC by navigating to
Server Tuning, General tab, and selecting Background Tasks (Full Text Indexing
Interval). Keeping incremental indexing disabled for extended periods while the
server is in use will degrade the end-user search experience, because newly added
data will not be indexed and thus will not appear in searches.

Rebuild Indexing
You can perform the Rebuild indexing operation by navigating to the relevant
facility from the Site Settings page. When a facility index rebuild is triggered, the
facility is immediately marked for a rebuild and then the following activities are
performed (immediately if the Scheduler service is running, or if it is not running,
then whenever it starts up next):
1. Purge all existing full-text table data for this facility (all 16 files
in ~FullText) and mark the facility for rebuild.
2. Mark all contained rooms as not indexed.
3. Prepare indexes for all server eRooms that are marked not
indexed.

Note: When the rebuild indexing operation is initiated, the rebuild index
event takes priority over incremental indexing activities. This is of paramount
importance in a customer environment where the large size of a single facility
could mean that a rebuild will take a long time to complete (up to several
hours or even days, depending on the size of the facility), putting the
incremental indexing on hold and impacting the search experience.
Therefore, it is generally recommended that an index rebuild for a large
facility be triggered only during a scheduled down time or during low load
hours.

Q: How often does eRoom perform Step 2 or Index Validation?


A: Incremental indexing activities are always performed in the
Immediate Mode skipping Step 2 (Index Validation). Index Validation is
performed by issuing the VALIDATE INDEX command to the full-text
table. This command is issued only under the following circumstances:
• eRoom will validate every full-text index table on Scheduler service
start-up and subsequently, once every night.
• eRoom will always execute index rebuilds in PERIODIC mode and
validate the table on completion of all inserts.
• If the can-open access of any item is modified, eRoom will follow up
the incremental indexing with a Validate command.
• If eRoom performs incremental indexing in the immediate mode and
the differential index (.dyx) is close to exceeding 90 MB, eRoom will
issue the Validate index command to digest the changes and
re-initialize the .dyx file.
The VALIDATE INDEX command takes a long time to complete
execution, and during this time, the indexer thread does not pass on any
new information to eTrace. Therefore, a lack of eTrace activity for the
indexer thread immediately following the issue of this command must
not automatically be interpreted as a hung or terminated thread. The
duration of the VALIDATE INDEX depends on the extent of changes
made since the last VALIDATE INDEX was issued on the table. The
more the changes, the longer the VALIDATE INDEX will take to
complete.
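The four circumstances listed above can be summarized as a small decision function. This is a hedged sketch: the IndexerState fields are hypothetical names, and because the text only says "close to exceeding 90 MB", the comparison against the limit itself is an assumption.

```python
from dataclasses import dataclass

# 90 MB threshold from the text; treating 1 MB as 1024 * 1024 bytes
# is an assumption of this sketch.
DEFAULT_MAX_DYX_BYTES = 90 * 1024 * 1024


@dataclass
class IndexerState:
    """Hypothetical snapshot of the conditions described above."""
    scheduler_just_started: bool = False
    nightly_validation_due: bool = False
    rebuild_inserts_finished: bool = False
    can_open_access_changed: bool = False
    dyx_size_bytes: int = 0


def validation_reasons(state, max_dyx_bytes=DEFAULT_MAX_DYX_BYTES):
    """Return the reasons (possibly none) for issuing VALIDATE INDEX."""
    reasons = []
    if state.scheduler_just_started:
        reasons.append("scheduler start-up")
    if state.nightly_validation_due:
        reasons.append("nightly validation")
    if state.rebuild_inserts_finished:
        reasons.append("rebuild completed")
    if state.can_open_access_changed:
        reasons.append("can-open access modified")
    # The exact cut-off for "close to exceeding" is not documented, so
    # this sketch simply compares against the limit itself.
    if state.dyx_size_bytes >= max_dyx_bytes:
        reasons.append("differential index too large")
    return reasons
```

An empty list corresponds to the normal case, in which incremental indexing stays in the immediate mode and skips Step 2.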

Full-text Search
When a user performs a search operation within eRoom, the scope of the
operation is limited by the permissions granted to the user. Although the
SearchServer full-text tables are set up at the facility level, the scope of
any search operation within eRoom is defined not in terms of the
facilities to search, but in terms of the eRooms to search.
The search scope depends on the following parameters:
• type of search (room-level or site-wide)
• room memberships of the user (for site-wide search)
• community/site administrator privileges of the user (for the
Search all rooms that I administer option)
• rooms that are indexed and ready for searching
Any eRoom search operation consists of the following steps:
• Accepting keywords from the user
• Defining the scope of the search
• Setting up the list of indexed rooms available for searching
(ignoring the un-indexed rooms)
• Utilizing the keywords and scope to construct the
appropriate FIND command to issue to the Hummingbird
SearchServer
• Building a result set of rows returned by the SearchServer
• Mapping each row in the result set to the corresponding
eRoom item
• Displaying the list of items to the user
Consider the following sample of the FIND command issued to the
SearchServer:
Sample eRoom eTrace snippet
08/19/2011 17:34:15 w3wp[730:780]: FullText,3:
CERSearchServer::Find - msec(0)
SELECT RELEVANCE('2:1') AS REL,FT_CID,DOCID,NAME,DISPLAYDOC,DISPLAYNAME,
FOUNDIN,FOUNDINNAME,ROOMID,DBID,MODIFYDATE,OWNERS,CANOPEN,UIFLAGS,ISROOM,
FT_SFNAME
FROM IF2857508_1595_445D_889A_B01527A99E8D UNION
I111AEF5C_B0A7_401D_9FA0_2D91509F4529
WHERE (ROOMID IN (6,11)) AND ((NAME CONTAINS 'project' WEIGHT 5) OR
(FINDTEXT CONTAINS 'project') OR (FT_TEXT CONTAINS 'project')) AND
(ISROOM = 0)
ORDER BY REL DESC, DOCID DESC

The keyword in this snippet is ‘project’. Only two facilities are searched.
The search scope is limited to two rooms (ID 6 and ID 11).

Sample eRoom eTrace snippet


08/19/2011 15:53:59 w3wp[d70:c98]: FullText,3:
CERSearchServer::Find - msec(0)
SELECT RELEVANCE('2:1') AS REL,FT_CID,DOCID,NAME,DISPLAYDOC,DISPLAYNAME,
FOUNDIN,FOUNDINNAME,ROOMID,DBID,MODIFYDATE,OWNERS,CANOPEN,UIFLAGS,ISROOM,
FT_SFNAME
FROM IF2857508_1595_445D_889A_B01527A99E8D UNION
I111AEF5C_B0A7_401D_9FA0_2D91509F4529
WHERE ((NAME CONTAINS 'project' WEIGHT 5) OR
(FINDTEXT CONTAINS 'project') OR (FT_TEXT CONTAINS 'project')) AND
(ISROOM = 0)
ORDER BY REL DESC, DOCID DESC

The keyword in this snippet is ‘project’. Only two facilities are searched.
The absence of the ROOMID IN condition in the WHERE clause
indicates that all rooms in both facilities are included in the scope.
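Deconstructing a FIND command by hand, as done for the two samples above, can also be sketched in Python. The function below is a hypothetical helper, not part of eRoom. It assumes the command has the shape shown in the samples: table names between FROM and WHERE joined by UNION, an optional ROOMID IN (...) condition, and keywords in CONTAINS '...' clauses.

```python
import re


def describe_find_scope(find_sql):
    """Extract facility tables, room IDs and keywords from a FIND command."""
    # Table names sit between FROM and WHERE, joined by UNION.
    tables_match = re.search(r"\bFROM\s+(.*?)\s+WHERE\b", find_sql,
                             re.IGNORECASE | re.DOTALL)
    tables = []
    if tables_match:
        tables = re.split(r"\s+UNION\s+", tables_match.group(1).strip(),
                          flags=re.IGNORECASE)

    # No ROOMID condition means every room in the listed facilities.
    rooms_match = re.search(r"ROOMID\s+IN\s*\(([^)]*)\)", find_sql,
                            re.IGNORECASE)
    room_ids = None
    if rooms_match:
        room_ids = [int(x) for x in rooms_match.group(1).split(",")
                    if x.strip()]

    keywords = sorted(set(re.findall(r"CONTAINS\s+'([^']*)'", find_sql,
                                     re.IGNORECASE)))
    return {"tables": tables, "room_ids": room_ids, "keywords": keywords}
```

A room_ids value of None corresponds to the second sample, where the absence of the ROOMID IN condition means that all rooms in the listed facilities are in scope.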
Q: How do I map the ROOMID to a room name?
A: To map the room IDs in the FIND command to their URL names,
run the following script via ERSqlExec7:
USE DATABASE eRoomSite;
SELECT UrlName FROM Rooms WHERE InternalID IN (<IDs seen in the
FIND command>);

Diagnosis and Troubleshooting


While troubleshooting full-text indexing/search related problems,
always ensure that you are working with the latest data, preferably from
the last 24 hours. Old trace data/error logs may exhibit problems that are
no longer present in the system. So it is imperative to work with up-to-
date data to ensure the problem you are trying to solve is the actual
problem you are facing. Since all indexing activities are performed
sequentially by a single thread, a single problem may spawn a variety of
symptoms. You must clearly differentiate between problems that are
side-effects and problems that are root causes.
Example
If the indexer thread failed to index a specific room, the search operation
can fail. So, it is recommended that you investigate the reason why the
indexer thread did not index the room.
Q: List the important tools used for diagnosis and troubleshooting and
their significance
A:
• The eRoom eTrace - To diagnose a failing search
• The eRoom error log - To examine past indexer thread activity and
errors
• SearchServer full-text table log file - To view details about a facility’s
full-text table
• Windows application event log - To view information about the
successful completion of the index rebuild operation
Q: What are some of the typical root causes that manifest as indexing
and search-related problems?
A:
1. The indexer thread encountered an unhandled exception and
terminated itself. Although the Scheduler service is running, the
indexer thread is not running, and no full-text indexing activities
are performed.
2. The full-text table of a specific facility has become corrupted and
eRoom consistently fails to index it.
3. The indexer thread is stuck on a long running VALIDATE
INDEX and no incremental indexing activities are performed in
this period.
4. An interruption occurred during the execution of a VALIDATE
INDEX and the relevant full-text table is now corrupted.
Q: How can I identify the root cause of the problem I am facing and
rectify it? (Points 1 to 4 here map to points 1 to 4 of the previous
answer.)
A:
1. No indexer thread activity in eTrace with FullText, 5 enabled
even after the incremental indexing interval has elapsed, is a
necessary but not sufficient condition for this. The
<FullTextIndexer> thread generates a stack trace that is captured
in the eRoom error log. Perform a rebuild stack to understand
the root cause and proceed with the analysis. Meanwhile, restart
the Scheduler service to bring up the thread again.
2. The consistent failure of eRoom to update, insert into, or
validate one facility's full-text table (as seen in eTrace or the
eRoom error log), while it works smoothly with the others,
strongly supports this possibility. Generally, a rebuild is the
only option. But since a rebuild is expensive for large
facilities, you must consult OpenText first.
3. The presence of .rup and .dup files supports this possibility but
does not confirm it. The occurrence of other incremental
indexing activities refutes this. You must examine the
background traces to check whether the last command executed
by the indexer thread was a VALIDATE INDEX and wait for the
command to complete execution.

4. Check the SearchServer full-text table .log file for any
discrepancies or breaks in the pattern. The presence of .rup and
.dup files for the relevant facility is a necessary but not sufficient
condition to reach this conclusion. The occurrence of other
incremental indexing activities confirms this. You must
manually move the files (refer to
http://solutions.emc.com/emcsolutionview.asp?id=esg92637).

eTrace Settings and Usage


Exceptions,1 + FullText,5 + Thin Client,1 (Trace Mode – Log all trace
data)
Do not leave the preceding settings enabled for extended periods of
time, particularly during an index rebuild, because they produce
extremely voluminous trace data, which can adversely impact
performance. In general, enable the preceding settings only for
very brief periods while reproducing search-related issues, and
disable them once the issues have been reproduced.
When a search operation fails, you can narrow down the problem to a
few rooms or facilities by capturing an eTrace and examining the FIND
command parameters closely. In case of a site-wide search, you can
perform a room level search in each room one at a time until you isolate
the offending facility.
Using the eRoom error log - Focus on the errors generated by the
<FullTextIndexer> thread. However, note that a more generic problem
with the Notifier can affect all its other threads.
Using full-text tables – You can determine the health of the facility full-
text table by checking the nine file types discussed earlier. In addition, if
all of a facility's full-text files were last modified more than 24 hours ago,
it indicates that the facility is faulty, because the Scheduler must validate
all facility full-text tables every night.
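The 24-hour check described above can be sketched as follows. This is an illustrative helper under stated assumptions, not an eRoom tool: it assumes the facility GUID appears in each file name, as described in the Introduction, and the function name is hypothetical.

```python
import os
import time


def facility_index_is_stale(folder, facility_guid, max_age_hours=24, now=None):
    """Return True if every file of the facility is older than max_age_hours.

    Files are matched by the facility GUID appearing in the file name,
    which is an assumption of this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(folder)
        if entry.is_file() and facility_guid.lower() in entry.name.lower()
    ]
    # No matching files at all is a different problem; it is not
    # reported as stale here.
    return bool(mtimes) and max(mtimes) < cutoff
```

A True result is only a hint to investigate further, for example by checking whether the Scheduler service and the indexer thread are running.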

Configuring Indexing Activities: The Debug Registry Key


• FullTextMaxDyxSizeInBytes – The DWORD value that specifies the
maximum differential index size in bytes. Limiting the size of the
differential index by setting the MAX_DYX_SIZE parameter on the
SearchServer has been discussed earlier. The configured value must
be at least the minimum value of 8192 (8 KB); otherwise this
parameter is ignored and the default value of 90 MB is used. Requires
the eRoom Server account to have Full Control permissions on the
HKLM/Software/eRoom/Debug key.
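The fallback rule for FullTextMaxDyxSizeInBytes can be expressed as a short function. This is a sketch of the rule as stated above, not eRoom code. The function name is hypothetical, and treating 90 MB as 90 * 1024 * 1024 bytes is an assumption.

```python
MIN_DYX_BYTES = 8192                   # 8 KB minimum stated above
DEFAULT_DYX_BYTES = 90 * 1024 * 1024   # 90 MB default, assuming binary megabytes


def effective_max_dyx_size(configured_bytes):
    """Return the limit that applies for a configured registry value.

    A missing value (None) or a value below the minimum is ignored and
    the default is used instead.
    """
    if configured_bytes is None or configured_bytes < MIN_DYX_BYTES:
        return DEFAULT_DYX_BYTES
    return configured_bytes
```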
• FullTextFacilitiesToIndexFirst – The STRING value that contains a
comma-separated list of facility URL names. Example: <facility url
name>,<facility url name>
• FullTextRoomsToIndexFirst – The STRING value containing a list
of facility and relevant room URL names. Example: <facility url
name>,<room url name>,<room url name>,…;<facility url
name>,<room url name>,<room url name>,…
The purpose of this key is to allow users to override the predefined order
in which rooms are processed during index rebuilds or at Step 1 of
incremental indexing, and to specify that some rooms must always be
indexed first.
Prioritizing rooms and facilities for indexing during rebuild or during
Step 1 (index server rooms) of incremental Indexing
Since all indexing activities are performed by a single thread, the order
in which rooms and facilities are indexed is important. These keys
allow the administrator to prioritize the task of indexing specific rooms
and facilities to ensure that they are indexed prior to the other rooms and
facilities, without waiting in a queue. This key value takes precedence
during index rebuilds and at Step 1 (index server rooms) of incremental
indexing. It will have no effect on Step 2 (event log processing) of
incremental indexing.
The key value will only take effect when the concerned room or facility is
marked as un-indexed.
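The value format of FullTextRoomsToIndexFirst, and its effect on processing order, can be illustrated in Python. This is a minimal sketch based only on the example format shown above. How eRoom parses the value internally is not documented here, and the function names and the (facility, room) pair representation are hypothetical.

```python
def parse_facility_room_list(value):
    """Parse '<facility>,<room>,...;<facility>,<room>,...' into a dict."""
    result = {}
    for group in value.split(";"):
        names = [part.strip() for part in group.split(",") if part.strip()]
        if not names:
            continue
        facility, rooms = names[0], names[1:]
        result.setdefault(facility, []).extend(rooms)
    return result


def order_pending_rooms(pending, rooms_first):
    """Stable sort: (facility, room) pairs named in rooms_first come first."""
    return sorted(
        pending,
        key=lambda pair: pair[1] not in rooms_first.get(pair[0], []),
    )
```

The sort is stable, so rooms that are not prioritized keep their original relative order, which matches the idea of overriding the predefined order only for the listed rooms.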

• FullTextFacilitiesToSkipIndexing – The STRING value containing a
comma-separated list of facility URL names. Example: <facility url
name>,<facility url name>
• FullTextRoomsToSkipIndexing – The STRING value containing a
list of facility and corresponding room URL names. Example:
<facility url name>,<room url name>,<room url name>,…;<facility
url name>,<room url name>,<room url name>,…
• FullTextDocIDsToSkipIndexing – The STRING value containing a
list of facility and corresponding DocID values. Example: <facility
url name>,<docid from url>,<docid from url>,…;<facility url
name>,<docid from url>,<docid from url>,…
DocIDs must be specified exactly as they appear in the URL in the
“low_high” format.
Skipping the task of indexing specific rooms and facilities during
rebuild or at Step 1 (index server rooms) of incremental indexing
If any of the specified rooms or facilities has an index in place, Step 2
(Process Event Logs) will continue to run and all new data will be
searchable. This key value will only take effect when the specified room
or facility is marked as un-indexed.

• FullTextFileChunkSizeInMB – Size of data chunks (in MB)
processed, after which validation is performed.
• FullTextObjectChunkSize – Size of object chunks (rows in Objects
table) processed, after which validation is performed.
Controlling the volume of data that must be processed
As explained earlier, when you change the can-open permissions for any
eRoom item(s), incremental indexing activities will include table
validation. In such cases, you can control the volume of data that must
be processed in the IMMEDIATE mode before each VALIDATE
command is executed to avoid extended validation time by specifying
that the validation be performed in chunks every <specified> amount of
data. The first key value applies to external files. The second key value is
applicable to all other types of items.
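The idea of validating in chunks can be illustrated with a small sketch. It models only the principle described above, under the assumption that a validation is issued as soon as the accumulated amount of data reaches the configured chunk size. The precise accounting eRoom uses is not documented here, and the function name is hypothetical.

```python
def plan_validation_points(item_sizes_mb, chunk_size_mb):
    """Return the indexes of items after which a VALIDATE would be issued."""
    points = []
    accumulated = 0.0
    for index, size in enumerate(item_sizes_mb):
        accumulated += size
        if accumulated >= chunk_size_mb:
            points.append(index)
            accumulated = 0.0  # start a new chunk after validating
    return points
```

A smaller chunk size produces more frequent but shorter validations, while a larger one produces fewer but longer validations.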

Copyright © 2009 – 2012 EMC Corporation. All rights reserved.


Published June, 2012
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires
an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks
on EMC.com.
All other trademarks used herein are the property of their respective owners.
