Search Overview
The search engine is bundled with the Sambar Server. All files being indexed
must reside under the Sambar Server document directory and be available
to the HTTP server. URLs are created by the index server for all files
found as part of the indexing task. Should files be removed or new files
added, the index must be regenerated. Indexes can be scheduled for
automatic re-indexing using the System Administration GUI;
they can be regenerated daily, weekly, or monthly.
The indexing process is initiated from the System Administration console
of the Sambar Server (WWW interface).
Search Indexer
By default, all pages under the Documents Directory are indexed
(see the configuration management section for details on indexing
specific directories). No words found in the stopword.ini
file are indexed. A hash table index is built of all alpha-numeric
strings found in the files searched. This hash index is very fast to
search, but relatively bulky from a disk usage standpoint.
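The indexing rule described above can be illustrated with a minimal Python sketch. It is not the Sambar implementation: the stop-word set, the build_index name, and the sample documents are all assumptions made for illustration, and a Python dict stands in for the hash table.

```python
import re

# Hypothetical stop words; the real list is read from stopword.ini.
STOP_WORDS = {"the", "a", "of", "and"}


def build_index(documents):
    """Map each alpha-numeric word to {document name: occurrence count}."""
    index = {}
    for name, text in documents.items():
        for word in re.findall(r"[A-Za-z0-9]+", text):
            word = word.lower()
            if word in STOP_WORDS:
                continue  # stop words are never indexed
            postings = index.setdefault(word, {})
            postings[name] = postings.get(name, 0) + 1
    return index


docs = {
    "a.htm": "The history of the cable car",
    "b.htm": "Cable television and the car radio; cable news",
}
index = build_index(docs)
print(index["cable"])  # {'a.htm': 1, 'b.htm': 2}
```

Lookups are a single hash probe per word, which is why this kind of index is fast to search; storing every distinct word with its postings is what makes it bulky on disk.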
The Search Indexer provides the ability to specify the files to be indexed.
The WWW Server must have read access to all the files being indexed.
Files may be filtered by file extension, by individual file, by directory, or by a
directory and all its sub-directories. All index files are placed in
the search sub-directory located in the installation directory of the
Sambar Server.
Documents are indexed by file name, file size and last modified date.
In addition, in the case of HTML files, the TITLE is parsed and used
as the description of the file. In this release, weighting is based only
on the number of times a word appears in a document, plus
additional weight for words appearing in the title or a heading.
Multiple indexes may be built and individually searched.
Additional indexes are defined by adding sections to the search.ini
file (via the System Administration GUI).
Each [section] entry in the search.ini file results in an index
of that name being made available.
The System Administration GUI should be used to manage these entries.
Indexes are restricted to files found within the default directory identified
in the config.ini file.
Directories associated with virtual-hosts cannot presently be indexed.
In a future release the ability to search across multiple indexes will
be supported as will the ability to index files associated with virtual-host
directories.
Indexing Microsoft Word Documents
The search engine includes a rudimentary Microsoft Word parser and
indexer. By default, no *.doc files are indexed, but this feature
can be enabled by editing the config.ini file and modifying
the Index Only parameter of the [search] section
to include .doc.
Search Spider
The Sambar Server Pro distribution includes a search engine spider.
This feature allows an index to be defined that consists of one-or-more
remote sites rather than a collection of local files. To configure the
index to use the spider, simply add the directive Search Site
with the starting URL of the site to index
(e.g. Search Site = http://www.sandbox.com/index.htm). Each
index definition can have one-or-more of these Search Site directives.
The Exclude Directories directive applies to spider searches
in the same way as it does to searches of local files.
For example, the directive Exclude Directories = samples/* prevents
the spider from following or indexing links that begin with samples/.
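As a rough illustration of how such exclude patterns might be applied, the Python sketch below uses the standard fnmatch module as a stand-in. Sambar's actual matching rules are not specified here, so the is_excluded helper and its semantics are assumptions.

```python
from fnmatch import fnmatchcase


def is_excluded(path, exclude_patterns):
    """Return True if the path matches any exclude pattern.

    Assumption: patterns behave like shell-style wildcards, where
    the star (*) matches any run of characters, including slashes.
    """
    return any(fnmatchcase(path, pattern) for pattern in exclude_patterns)


# Patterns taken from the examples in this document.
patterns = "samples/* *look*".split()

print(is_excluded("samples/demo.htm", patterns))    # True
print(is_excluded("docs/outlook/a.htm", patterns))  # True
print(is_excluded("docs/index.htm", patterns))      # False
```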
In addition, the spider builds an exclude list from the /robots.txt
file of the remote site. If the robots.txt is found, there is no
override mechanism for indexing pages that the site administrator has
excluded.
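The robots.txt behaviour can be sketched with Python's standard urllib.robotparser module. This is only an analogy to what the spider does; the robots.txt content, the agent name, and the may_index helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a remote site might publish it.
robots_lines = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)


def may_index(url, agent="example-spider"):
    """Return True if the exclusion list allows this URL to be fetched."""
    return parser.can_fetch(agent, url)


print(may_index("http://www.example.com/index.htm"))      # True
print(may_index("http://www.example.com/private/a.htm"))  # False
```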
See the search.ini format section below for additional
configuration parameters.
Stop Word List
The server administrator has the ability to specify a list of stop words
that are judged to be trivial with respect to the content of the source files.
Prior to the index server initiating the indexing of a group of files, the
stop words file is loaded in from the config/stopword.ini file.
After the files have been indexed, the administrator has access to a list of
all the words which have been processed (search/search.wrd) by the
index engine. This may be used as a guide for customizing the stop word list
for subsequent re-indexing.
META tags
The Sambar server indexes all words on a page except those words found
within an HTML tag or comment. The search engine can be configured
to index HTML META tags. META tags are often used to specify additional
keywords. Words found within the "keywords" META tag are ranked
with the same weight as words found in the TITLE. Example use:
<META name="description" content="Western hat specialty shop!">
<META name="keywords" content="wild west hat shop">
The Sambar Search engine will then index both fields as words (the
keywords field may optionally contain commas) and will return
the description with the URL rather than the file name or TITLE.
Important: The META tags must appear in the first 4096 bytes
of the HTML file.
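The following Python sketch shows one way the 4096-byte rule could work: only the leading window of the file is scanned for the description and keywords META tags. The class and function names are assumptions, and the standard html.parser module stands in for whatever parser the server actually uses.

```python
from html.parser import HTMLParser

META_WINDOW = 4096  # bytes scanned for META tags, per the note above


class MetaExtractor(HTMLParser):
    """Collect the description and keywords META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "meta":
            fields = dict(attrs)
            name = (fields.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = fields.get("content") or ""


def extract_meta(html_bytes):
    """Parse only the first META_WINDOW bytes of the file."""
    parser = MetaExtractor()
    parser.feed(html_bytes[:META_WINDOW].decode("latin-1"))
    return parser.meta


page = b'''<html><head>
<META name="description" content="Western hat specialty shop!">
<META name="keywords" content="wild west hat shop">
</head><body>...</body></html>'''

meta = extract_meta(page)
# Commas in the keywords field are optional, so treat them as spaces.
keywords = meta["keywords"].replace(",", " ").split()
print(meta["description"])
print(keywords)
```

A META tag placed after the first 4096 bytes is simply never seen by extract_meta, which mirrors the restriction stated above.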
Query String
The following rules govern search patterns:
- paris galerie louvre
- Finds documents containing as many of these words and phrases as
possible, ranked so that documents with the most matches are presented first.
- Lower-case search will find matches of capitalized words also. For example,
paris will find matches for paris, Paris, and
PARIS.
- noir +film -pinot
- Matches may be required, optional, or prohibited. Precede a required
word or phrase with + and a prohibited one with -. This query finds documents
that contain film and do not contain pinot; noir is optional, so documents that
also contain noir rank higher.
The + and - operators act as boolean conditions that each document must
satisfy. The following chart shows sample usage:
Searching... | Results in... |
cable + car | Documents with both words. |
cable car | Documents with either word. This yields the greatest number of matches. |
cable -car | Documents about cable, but not about cable cars. |
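The query rules above can be sketched in Python as follows. This is a simplified model, not the server's parser: it treats unprefixed words as optional, ignores quoted phrases, and ranks by the number of matching terms. The function names and sample documents are assumptions.

```python
def parse_query(query):
    """Split a query into required (+), optional, and prohibited (-) words."""
    required, optional, prohibited = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, optional, prohibited


def search(query, documents):
    """Return matching document names, most matches first."""
    required, optional, prohibited = parse_query(query)
    results = []
    for name, text in documents.items():
        words = set(text.lower().split())  # lower-casing gives case-insensitive matching
        if any(w in words for w in prohibited):
            continue
        if not all(w in words for w in required):
            continue
        matches = sum(1 for w in required + optional if w in words)
        if matches:
            results.append((matches, name))
    results.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in results]


docs = {
    "a.htm": "classic film noir of the 1940s",
    "b.htm": "pinot noir wine and film",
    "c.htm": "silent film history",
    "d.htm": "noir fiction",
}
print(search("noir +film -pinot", docs))  # ['a.htm', 'c.htm']
```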
Wildcard Searches
If the Allow Wildcarding flag in the config/config.ini
file is set to true, the arguments to the search engine
will be examined for wildcard characters. If any are found, the search index
is walked and each entry is compared with the pattern.
Wildcard search patterns are:
* The star (*) character performs
an expansive pattern match.
? The question-mark (?) character
matches any single character.
[] Brackets ([]) can be used to match a single
character in the string being searched with a character found within the
brackets.
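A minimal Python sketch of walking an index with a wildcard pattern is shown below. The three wildcard forms happen to coincide with those of Python's fnmatch module, which is used here as a stand-in; whether the server's semantics are identical in every edge case is an assumption. The expand_term name and the word list are hypothetical.

```python
from fnmatch import fnmatchcase

WILDCARD_CHARS = set("*?[")


def expand_term(term, index_words):
    """Return the indexed words that a search term matches."""
    if not WILDCARD_CHARS & set(term):
        # No wildcard characters: a plain lookup is enough.
        return [term] if term in index_words else []
    # Wildcards present: walk every entry and compare it with the pattern.
    return sorted(w for w in index_words if fnmatchcase(w, term))


words = {"car", "cart", "care", "cat", "cot", "cable"}
print(expand_term("car*", words))    # ['car', 'care', 'cart']
print(expand_term("c?t", words))     # ['cat', 'cot']
print(expand_term("ca[rt]", words))  # ['car', 'cat']
```

Because a wildcard query has to visit every entry rather than perform one hash lookup, it is noticeably more expensive than a plain word search.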
Phrase Searches
If the Proximity Index flag in the config/config.ini
file is set to true, the search engine will include word location
information when the indexes are built (Sambar Server Pro only).
This results in considerably larger indexes, but permits phrase searches.
Phrase searches allow users to find words that are adjacent. Multiple
phrases may be searched for in the same query. For example the query:
"health club" "New York"
will match documents containing both phrases somewhere in the document.
Phrase searches may contain up to three terms, with the adjacent terms
quoted.
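To show why word location information is needed for phrase searches, here is a small Python sketch of a positional index and an adjacency check. The data layout and function names are assumptions, not the format of the Sambar proximity index.

```python
import re


def build_positions(text):
    """Map each word to the list of positions where it occurs."""
    positions = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        positions.setdefault(word, []).append(pos)
    return positions


def contains_phrase(positions, phrase):
    """True if the phrase's words occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    for start in positions.get(terms[0], []):
        if all(start + i in positions.get(t, []) for i, t in enumerate(terms)):
            return True
    return False


def match_phrases(text, query):
    """True if every quoted phrase in the query occurs in the text."""
    phrases = re.findall(r'"([^"]+)"', query)
    positions = build_positions(text)
    return bool(phrases) and all(contains_phrase(positions, p) for p in phrases)


query = '"health club" "New York"'
print(match_phrases("A health club opened in New York last year", query))  # True
print(match_phrases("New health rules for a York club", query))            # False
```

The second document contains all four words but not as adjacent pairs, so only an index that records positions can reject it.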
Ranking Simple Queries
The Sambar Search engine ranks the results based on a scoring algorithm;
documents with a higher score appear at the head of the ranking list.
A document has a higher score if the following hold:
- the query words or phrases are found in the special sections of the
document such as the title or headings.
- the document contains multiple instances of the query word or phrase.
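The two ranking criteria can be sketched as a simple scoring function. The actual algorithm and weights are not documented here, so the TITLE_BONUS value, the score function, and the sample pages are purely illustrative assumptions.

```python
import re

TITLE_BONUS = 5  # hypothetical extra weight for a match in the title


def words_of(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query_words, title, body):
    """Occurrence count in the body plus a bonus for each title match."""
    title_words = words_of(title)
    body_words = words_of(body)
    total = 0
    for word in query_words:
        total += body_words.count(word)
        if word in title_words:
            total += TITLE_BONUS
    return total


pages = {
    "hats.htm": ("Western Hat Shop", "We sell every hat you need."),
    "boots.htm": ("Boot Barn", "A hat, a hat, and another hat with boots."),
    "misc.htm": ("Links", "One hat link."),
}
ranked = sorted(pages, key=lambda name: score(["hat"], *pages[name]), reverse=True)
print(ranked)  # ['hats.htm', 'boots.htm', 'misc.htm']
```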
Multiple Indexes
Multiple search indexes can be created and used with the Sambar
Server. Each section in the search.ini file
identifies a different search index. Initially, the search index
is empty; by using the System Administration console, one or more
directories can be indexed for use by the search engine.
Searches can be performed across multiple indexes by providing a space
separated list of the indexes to be searched with the indexname
parameter to the /session/find search request.
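As an illustration, a request URL with a space-separated indexname value could be assembled as in the Python sketch below. The host name and index names are hypothetical, and the parameter that carries the query terms is not described in this section, so it is deliberately left out.

```python
from urllib.parse import urlencode


def find_url(base, index_names):
    """Build a /session/find URL that names several indexes."""
    params = {"indexname": " ".join(index_names)}
    # urlencode escapes the separating spaces (as '+') for use in a URL.
    return base + "/session/find?" + urlencode(params)


url = find_url("http://www.example.com", ["docs", "help"])
print(url)  # http://www.example.com/session/find?indexname=docs+help
```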
search.ini format
The search.ini contains a "section" for each index. The
"section" is the search index name and may contain only
alpha-numeric characters (no spaces are allowed).
Additional search indexes can be added simply by appending
a section with the following elements:
Directive | Value | Description |
Index Directories | directories |
A space-separated list of directories to index (e.g. help samples).
To index all directories, use the star (*) character. |
Exclude Directories | directories |
A space-separated list of directories to exclude (e.g. other/* *look*).
The wild-card star (*) character must be supplied to match against the
directory path. |
Automatic Reindex | never | daily | weekly | monthly |
Indicates how often the index should automatically be re-indexed;
all rebuilding occurs at midnight. Optionally, this entry can be
a full cron expression (e.g. 0 1 * * *, meaning every night at 1 AM).
|
Use META tags | true | false |
Boolean to indicate whether META tags found at the top of the
HTML file should be indexed. |
Only index META tags | true | false |
Boolean to indicate whether only META tags found at the top of the
HTML file should be used to index the file. |
Proximity Index | true | false |
Boolean to indicate that the index should be built using proximity-based
indexing. This allows searches to specify the relative position of
one-or-more words in a search. For example, foo bar results in
all pages which contain foo and bar anywhere in the document,
whereas "foo bar" returns all pages that contain foo bar
next to each other in a document. Proximity search is only available
with Sambar Server Pro. |
Maximum Pages | ## |
The maximum number of pages to index before marking the index
as "full". This option may be useful to prevent the runaway
indexing of a remote site with more pages than anticipated. If not set
or set to 0, no maximum is enforced (internal limits still apply).
|
Cache Pages | ## |
The number of pages to cache during index building. A page is
2K in the Sambar Server and 10K in the Sambar Server Pro. Cached pages
significantly improve the time it takes to create an index. |
Search Depth | ## |
The maximum depth that the search engine spider should traverse when
indexing pages on a remote site. A value of 0 indicates that all links
should be traversed. |
Max Page Size | ## |
The maximum size (in bytes) to fetch for a single URL when spidering
a remote site. For example, 100000 would indicate that only the first
100,000 bytes of a page should be indexed. By default, this value
is set to 80K bytes. Note: A buffer of the size
specified is allocated prior to the start of spidering. |
Search Site | URL |
The URL at which to begin the remote site search, e.g. http://www.wired.com/.
The search engine spider will then traverse the site up to the maximum depth
specified. Links to sites other than the site being indexed are not followed,
nor are URLs longer than 255 bytes. In addition, any robots.txt
exclusions are automatically applied to restrict the spider's search.
There may be more than one Search Site directive in a single index
definition. |
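To make the file layout concrete, the Python sketch below parses a sample index definition. The section name, the numeric values, and the parse_search_ini helper are hypothetical; only the directive names and the "Directive = value" form come from this document. Repeated directives such as Search Site are collected into lists, which is why a hand-written parser is used here instead of a stricter INI library.

```python
def parse_search_ini(text):
    """Parse search.ini text into {index name: {directive: [values]}}."""
    indexes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Each [section] names one search index.
            current = indexes.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            # A directive may appear more than once (e.g. Search Site).
            current.setdefault(key, []).append(value)
    return indexes


# Hypothetical index definition; values are for illustration only.
SAMPLE = """\
[remote]
Search Site = http://www.sandbox.com/index.htm
Search Site = http://www.wired.com/
Exclude Directories = samples/*
Automatic Reindex = weekly
Search Depth = 3
Maximum Pages = 500
"""

indexes = parse_search_ini(SAMPLE)
print(indexes["remote"]["Search Site"])
print(indexes["remote"]["Automatic Reindex"])
```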