Search Engine

Sambar Server Documentation

Search Overview
The search engine is bundled with the Sambar Server. All files being indexed must reside under the Sambar Server document directory and be available to the HTTP server. The index server creates URLs for all files found during the indexing task. If files are removed or new files are added, the index must be regenerated. Search indexes can be scheduled for automatic re-indexing using the System Administration GUI; indexes can be regenerated daily, weekly, or monthly.

The indexing process is initiated from the System Administration console of the Sambar Server (WWW interface).

Search Indexer
By default, all pages under the Documents Directory are indexed (see the configuration management section for details on indexing specific directories). No words found in the stopword.ini file are indexed. A hash table index is built of all alpha-numeric strings found in the files searched. This hash index is very fast to search, but relatively bulky from a disk usage standpoint.
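As an illustration of the idea only (this is not the Sambar Server's actual implementation), the following Python sketch builds a hash index of alphanumeric strings and skips stop words. The function name build_index, the sample documents, and the lower-casing of words are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(documents, stop_words):
    """Map each alphanumeric word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in re.findall(r"[A-Za-z0-9]+", text):
            word = token.lower()  # assumption: the index is case-insensitive
            if word not in stop_words:
                index[word].add(name)
    return index


# Hypothetical documents; a real stop word list would come from stopword.ini.
docs = {
    "hats.htm": "The western hat shop",
    "boots.htm": "The boot shop",
}
index = build_index(docs, stop_words={"the"})
print(sorted(index["shop"]))  # ['boots.htm', 'hats.htm']
print("the" in index)         # False
```

A lookup is a single hash probe per query word, which is why this kind of index is fast to search; the cost is one entry per distinct word, which is where the disk usage comes from.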

The Search Indexer provides the ability to specify the files to be indexed. The WWW Server must have read access to all the files being indexed. Files may be filtered by file extension, individual files, directory, or by a directory and all its sub-directories. All index files are placed in the search sub-directory located in the installation directory of the Sambar Server.

Documents are indexed by file name, file size and last modified date. In addition, in the case of HTML files, the TITLE is parsed and used as the description of the file. In this release, weighting is based on the number of times a word appears in a document, with additional weight given to words appearing in the title or a heading.

Multiple indexes may be built and individually searched. Additional indexes are defined by adding entries to the search.ini file. Each [section] entry in the search.ini file results in an index of that name being made available. The System Administration GUI should be used to manage these entries.

Indexes are restricted to files found within the default directory identified in the config.ini file. Directories associated with virtual-hosts cannot presently be indexed. In a future release the ability to search across multiple indexes will be supported as will the ability to index files associated with virtual-host directories.

Indexing Microsoft Word Documents
The search engine includes a rudimentary Microsoft Word parser and indexer. By default, no *.doc files are indexed, but this feature can be enabled by editing the config.ini file and modifying the Index Only parameter of the [search] section to include .doc.

Search Spider
The Sambar Server Pro distribution includes a search engine spider. This feature allows an index to be defined that consists of one or more remote sites rather than a collection of local files. To configure the index to use the spider, add the directive Search Site with the starting URL of the site to index (e.g. Search Site = http://www.sandbox.com/index.htm). Each index definition can have one or more of these Search Site directives.

The Exclude Directories directive is applied to spider searches in the same manner as it is when local files are searched. So the directive Exclude Directories = samples/* results in the spider not following or indexing links that begin with samples/. In addition, the spider builds an exclude list from the /robots.txt file of the remote site. If a robots.txt file is found, there is no override mechanism for indexing pages that the site administrator has excluded.
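The two exclusion checks can be sketched in Python as follows. This is an approximation under stated assumptions, not the spider's own code: it uses the standard fnmatch module to stand in for the Exclude Directories matching and urllib.robotparser for the robots.txt rules. The function name should_follow, the host www.example.com, and the robots.txt content are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def should_follow(url, exclude_patterns, robots):
    """Return True only if neither the exclude list nor robots.txt blocks url."""
    path = urlsplit(url).path.lstrip("/")
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    # robots.txt exclusions always apply; there is no override.
    return robots.can_fetch("*", url)


# Hypothetical robots.txt content for the remote site.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

patterns = ["samples/*"]
print(should_follow("http://www.example.com/samples/demo.htm", patterns, robots))  # False
print(should_follow("http://www.example.com/private/a.htm", patterns, robots))     # False
print(should_follow("http://www.example.com/docs/index.htm", patterns, robots))    # True
```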

See the search.ini format section below for additional configuration parameters.

Stop Word List

The server administrator can specify a list of stop words that are judged to be trivial with respect to the content of the source files. Before the index server begins indexing a group of files, the stop word list is loaded from the config/stopword.ini file. After the files have been indexed, the administrator has access to a list of all the words processed by the index engine (search/search.wrd). This list may be used as a guide for customizing the stop word list for subsequent re-indexing.

META tags
The Sambar server indexes all words on a page except those words found within an HTML tag or comment. The search engine can be configured to index HTML META tags. META tags are often used to specify additional keywords. Words found within the "keywords" META tag are ranked with the same weight as words found in the TITLE. Example use:

<META name="description" content="Western hat specialty shop!">
<META name="keywords" content="wild west hat shop">

The Sambar Search engine will then index both fields as words (the keywords field may optionally contain commas) and will return the description with the URL rather than the file name or TITLE. Important: The META tags must appear in the first 4096 bytes of the HTML file.
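To make the 4096-byte rule concrete, here is a minimal Python sketch that reads the description and keywords META tags from the start of a page. It uses the standard html.parser module and is not the server's parser; the names MetaCollector and read_meta, and the latin-1 decoding, are assumptions for the example.

```python
from html.parser import HTMLParser

META_SCAN_LIMIT = 4096  # only the first 4096 bytes are examined for META tags


class MetaCollector(HTMLParser):
    """Collect the content of the description and keywords META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":  # html.parser lower-cases tag and attribute names
            fields = dict(attrs)
            name = (fields.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = fields.get("content") or ""


def read_meta(html_bytes):
    """Parse META tags found within the scan limit of the page."""
    head = html_bytes[:META_SCAN_LIMIT].decode("latin-1")  # assumed encoding
    collector = MetaCollector()
    collector.feed(head)
    return collector.meta


page = (b'<META name="description" content="Western hat specialty shop!">'
        b'<META name="keywords" content="wild west hat shop">')
meta = read_meta(page)
# Commas in the keywords field are optional, so treat them as separators.
keywords = meta["keywords"].replace(",", " ").split()
print(meta["description"])  # Western hat specialty shop!
print(keywords)             # ['wild', 'west', 'hat', 'shop']
```

In this sketch, tags that begin after the first 4096 bytes are simply never seen, which mirrors the restriction described above.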

Query String
The following rules govern search patterns:

paris galerie louvre

Finds documents containing as many of these words and phrases as possible, ranked so that documents with the most matches are presented first.

A lower-case search term also matches capitalized words. For example, paris will find matches for paris, Paris, and PARIS.

noir +film -pinot

Matches may be required, optional, or prohibited. Precede a required word or phrase with + and a prohibited one with -. This query finds documents containing film and noir, but not containing pinot.

These boolean operators are used to determine if a statement is true or false. The following chart illustrates the usage samples:

Searching...     Results in...

cable + car      Documents with both words.
cable car        Documents with either word. This yields the greatest number of matches.
cable -car       Documents about cable, but not about cable cars.
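The required, optional, and prohibited rules can be sketched in Python as shown here. This is an illustration, not the engine's query parser: the function names parse_query and matches are hypothetical, and the sketch assumes each operator is written directly against its word, as in noir +film -pinot (a lone + or - is ignored).

```python
def parse_query(query):
    """Split a query into required, optional, and prohibited words."""
    required, optional, prohibited = [], [], []
    for term in query.lower().split():
        if term in ("+", "-"):
            continue  # assumption: a lone operator is ignored in this sketch
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            prohibited.append(term[1:])
        else:
            optional.append(term)
    return required, optional, prohibited


def matches(doc_words, required, optional, prohibited):
    """Decide whether a document (a set of words) satisfies the query."""
    if any(word in doc_words for word in prohibited):
        return False
    if not all(word in doc_words for word in required):
        return False
    # With no required words, at least one optional word must be present.
    return bool(required) or any(word in doc_words for word in optional)


query = parse_query("noir +film -pinot")
print(query)                               # (['film'], ['noir'], ['pinot'])
print(matches({"film", "noir"}, *query))   # True
print(matches({"film", "pinot"}, *query))  # False
```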

Wildcard Searches
If the Allow Wildcarding flag in the config/config.ini file is set to true, the arguments to the search engine are examined for wildcard characters. If any are found, the search index is walked and each entry is compared with the pattern.

Wildcard search patterns are:

* The star (*) character matches any sequence of characters.
? The question-mark (?) character matches any single character.
[] Brackets ([]) match a single character in the string being searched against any one of the characters listed within the brackets.
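Python's standard fnmatch module happens to use the same three wildcard forms, so it can illustrate how walking an index against a pattern behaves. This is a stand-in, not the server's matcher (fnmatch also accepts forms such as [!abc] that are not described here); the function name expand_wildcard and the word list are hypothetical.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, indexed_words):
    """Walk the indexed words and keep those that match the pattern."""
    pattern = pattern.lower()
    return sorted(word for word in indexed_words if fnmatchcase(word, pattern))


words = ["cab", "cable", "car", "cart", "cat"]
print(expand_wildcard("ca*", words))     # ['cab', 'cable', 'car', 'cart', 'cat']
print(expand_wildcard("ca?", words))     # ['cab', 'car', 'cat']
print(expand_wildcard("ca[br]", words))  # ['cab', 'car']
```

Because every index entry is compared against the pattern, a wildcard search cannot use the direct hash lookup that an exact-word search does.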

Phrase Searches
If the Proximity Index flag in the config/config.ini file is set to true, the search engine will include word location information when the indexes are built (Sambar Server Pro only). This results in considerably larger indexes, but permits phrase searches.

Phrase searches allow users to find words that are adjacent. Multiple phrases may be searched for in the same query. For example the query:

"health club" "New York"

will match documents containing both phrases somewhere in the document. Phrase searches may contain up to three terms, with the adjacent terms quoted.
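Word location information is what makes adjacency checks possible. The following Python sketch shows one way a positional index could answer a phrase query; the on-disk format of the Sambar index is not described here, so the structure and the names build_positions and contains_phrase are assumptions.

```python
from collections import defaultdict


def build_positions(text):
    """Record the offsets at which each word occurs in the text."""
    positions = defaultdict(list)
    for offset, word in enumerate(text.lower().split()):
        positions[word].append(offset)
    return positions


def contains_phrase(positions, phrase):
    """True if the words of the phrase occur at consecutive offsets."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + i in positions.get(word, ()) for i, word in enumerate(words))
        for start in positions.get(words[0], ())
    )


doc = build_positions("Join a health club in New York today")
print(contains_phrase(doc, "health club"))  # True
print(contains_phrase(doc, "New York"))     # True
print(contains_phrase(doc, "club health"))  # False
```

Storing an offset for every occurrence of every word is also why a proximity index is considerably larger than a plain word index.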

Ranking Simple Queries

The Sambar Search engine ranks the results based on a scoring algorithm; documents with a higher score appear at the head of the ranking list. A document has a higher score if the following hold:

the query words or phrases are found in the special sections of the document such as the title or headings.
the document contains multiple instances of the query word or phrase.
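A scoring function with these two properties might look like the Python sketch below. The actual weights used by the Sambar Search engine are not documented here; the title_weight value of 5, the function names, and the sample pages are hypothetical.

```python
import re


def tokens(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query_words, title, body, title_weight=5):
    """Count occurrences of each query word, weighting title hits more."""
    title_words = tokens(title)
    body_words = tokens(body)
    total = 0
    for word in query_words:
        word = word.lower()
        total += body_words.count(word)
        total += title_weight * title_words.count(word)  # hypothetical weight
    return total


# Hypothetical pages: name -> (title, body)
pages = {
    "a.htm": ("Hat shop", "We sell every hat."),
    "b.htm": ("Boots", "A hat, a hat, and another hat."),
}
ranked = sorted(pages, key=lambda name: score(["hat"], *pages[name]), reverse=True)
print(ranked)  # ['a.htm', 'b.htm']
```

In this example a.htm scores 6 (one body match plus one weighted title match) and b.htm scores 3 (three body matches), so the title match outranks the higher raw count.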

Multiple Indexes
Multiple search indexes can be created and used with the Sambar Server. Each section in the search.ini file identifies a different search index. Initially, the search index is empty; by using the System Administration console, one or more directories can be indexed for use by the search engine.

Searches can be performed across multiple indexes by providing a space separated list of the indexes to be searched with the indexname parameter to the /session/find search request.
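As a rough illustration, a request naming two indexes could be assembled as in the Python sketch below. Only the /session/find path and the indexname parameter come from this document; the host, the index names, the query parameter name, and the use of a GET-style query string are assumptions, so check the search form shipped with the server for the actual field names.

```python
from urllib.parse import urlencode


def find_url(base, indexes, query):
    """Build a search URL that names several indexes, separated by spaces."""
    params = {
        "indexname": " ".join(indexes),  # space-separated list of indexes
        "query": query,                  # hypothetical parameter name
    }
    return base.rstrip("/") + "/session/find?" + urlencode(params)


url = find_url("http://localhost", ["docs", "help"], "hat shop")
print(url)  # http://localhost/session/find?indexname=docs+help&query=hat+shop
```

Note that urlencode writes each space as +, so the space-separated index list survives URL encoding.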

search.ini format
The search.ini contains a "section" for each index. The "section" is the search index name and may contain only alpha-numeric characters (no spaces are allowed). Additional search indexes can be added simply by appending a section with the following elements:

Directive: Index Directories
Value: directories
Description: A space-separated list of directories to index (e.g. help samples). To index all directories, the star (*) character should be used.

Directive: Exclude Directories
Value: directories
Description: A space-separated list of directories to exclude (e.g. other/* *look*). The wildcard star (*) character must be supplied to match against the directory path.

Directive: Automatic Reindex
Value: never | daily | weekly | monthly
Description: Indicates how often the index should automatically be re-indexed; all rebuilding occurs at midnight. Optionally, this entry can be a full cron specification (e.g. 0 1 * * *, indicating every night at 1AM).

Directive: Use META tags
Value: true | false
Description: Boolean to indicate whether META tags found at the top of the HTML file should be indexed.

Directive: Only index META tags
Value: true | false
Description: Boolean to indicate whether only META tags found at the top of the HTML file should be used to index the file.

Directive: Proximity Index
Value: true | false
Description: Boolean to indicate that the index should be built using proximity-based indexing. This allows searches to specify the relative position of one or more words in a search. For example, foo bar results in all pages which contain foo and bar anywhere in the document, whereas "foo bar" returns all pages that contain foo bar next to each other in a document. Proximity search is only available with Sambar Server Pro.

Directive: Maximum Pages
Value: ##
Description: The maximum number of pages to index before marking the index as "full". This option may be useful to prevent the runaway indexing of a remote site with more pages than anticipated. If not set or set to 0, no maximum is enforced (internal limits still apply).

Directive: Cache Pages
Value: ##
Description: The number of pages to cache during index building. A page is 2K in the Sambar Server and 10K in the Sambar Server Pro. Cached pages significantly reduce the time it takes to create an index.

Directive: Search Depth
Value: ##
Description: The maximum depth that the search engine spider should traverse when indexing pages on a remote site. A value of 0 indicates that all links should be traversed.

Directive: Max Page Size
Value: ##
Description: The maximum size (in bytes) to fetch for a single URL when spidering a remote site. For example, 100000 would indicate that only the first 100,000 bytes of a page should be indexed. By default, this value is set to 80K bytes. Note: A buffer of the size specified is allocated prior to the start of spidering.

Directive: Search Site
Value: URL
Description: The URL at which to begin the remote site search, e.g. http://www.wired.com/. The search engine spider will then traverse the site up to the maximum depth specified. Links to sites other than the site being indexed are not followed, nor are URLs longer than 255 bytes. In addition, any robots.txt exclusions are automatically applied to restrict the spider's search. There may be more than one Search Site directive in a single index definition.
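To show how these directives fit together, the Python sketch below embeds a sample search.ini with two index sections and reads it back with the standard configparser module. The section names docs and remote and all of the values are hypothetical, and configparser is only a convenient way to inspect the file in this example; it is not how the server itself reads search.ini.

```python
from configparser import ConfigParser

# Hypothetical search.ini content: one local index and one spidered index.
EXAMPLE = """
[docs]
Index Directories = help samples
Exclude Directories = other/* *look*
Automatic Reindex = weekly
Use META tags = true
Proximity Index = false
Maximum Pages = 0

[remote]
Search Site = http://www.sandbox.com/index.htm
Search Depth = 3
Max Page Size = 100000
Automatic Reindex = 0 1 * * *
"""

parser = ConfigParser(interpolation=None)
parser.read_string(EXAMPLE)

print(parser.sections())                            # ['docs', 'remote']
print(parser["docs"]["Index Directories"].split())  # ['help', 'samples']
print(parser.getboolean("docs", "Use META tags"))   # True
print(parser.getint("remote", "Search Depth"))      # 3
```

One caveat of this sketch: configparser rejects a section that repeats the same key, so it cannot read an index definition containing several Search Site directives, even though the server permits them.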