"Cache mode" storage
====================


Introduction
------------

  Starting with version 3.1.5, mnoGoSearch supports a new word storage
mode, "cache", which can index and search quickly through several
million documents.


Cache mode word indexes structure
---------------------------------
  The main idea of cache storage mode is that the word index is stored on
disk rather than in the SQL database. URL information (table "url"), however,
is kept in the SQL database. The word index is divided into about 1 million
files (256 x 16 x 256 = 1,048,576) using the 32-bit word_id. The index is
located in a nested structure under the /var/tree directory of the
mnoGoSearch installation. The /var/tree structure is divided into three
levels. The first level has 256 directories with names in the range 00-FF.
Each first-level directory has 16 subdirectories in the range 0-F.
Each second-level directory has 256 files with word information:

         /var/tree/00/0/00000
         ...
         /var/tree/00/0/000FF
         ...
         ...
         /var/tree/FF/F/FFF00
         ...
         /var/tree/FF/F/FFFFF

Each low-level file contains information about up to 4096 different
words. Low-level files are sorted in an order suitable for searching.
This word distribution scheme allows the search program to quickly access
the required words.
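
As a rough illustration of the layout above, the following shell sketch maps
a word_id to a file path. The path layout follows the listing above. Which
bits of the 32-bit word_id select the file is not stated in this document, so
the sketch assumes that the upper 20 bits pick the file and the lower 12 bits
distinguish the up to 4096 words inside it. The function name
word_id_to_path is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of mnoGoSearch.
# ASSUMPTION: upper 20 bits of word_id = file number (00000-FFFFF),
#             lower 12 bits = word slot inside that file (4096 slots).
word_id_to_path() {
    file=$(( ($1 >> 12) & 0xFFFFF ))
    # first level: top 8 bits of the file number (00-FF)
    # second level: next 4 bits (0-F)
    # file name: the full 20-bit file number as 5 hex digits
    printf '/var/tree/%02X/%X/%05X\n' \
        $(( file >> 12 )) $(( (file >> 8) & 0xF )) "$file"
}
```

Under this assumption, "word_id_to_path 0" prints /var/tree/00/0/00000 and
"word_id_to_path 4294967295" prints /var/tree/FF/F/FFFFF.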

We call this storage mode "cache mode" because the word index distribution
is very similar to the Squid cache structure.


Cache mode tools
----------------
There are two additional programs, "cachelogd" and "splitter", used
in "cache mode" indexing. cachelogd is a TCP daemon which collects
word information from indexers and stores it on your hard disk. splitter
is a program that creates fast word indexes from the data collected by
cachelogd. Those indexes are used later in the search process.


Starting cache mode
-------------------

  To start "cache mode" follow these steps:

1. Start "cachelogd" server:

    # the sbin directory of the base mnoGoSearch installation
    cd /usr/local/mnogosearch/sbin
    ./cachelogd 2>cachelogd.out &

  It will write some debug information into the cachelogd.out file. cachelogd
also creates a pid file in the /var directory of the base mnoGoSearch
installation.

  cachelogd listens for TCP connections and can accept several indexers from
different machines. The theoretical maximum number of indexers is about 128.
cachelogd stores the information sent by indexers in the /var/raw/ directory
of the mnoGoSearch installation.

You can specify the port for cachelogd to use without recompiling. To do
that, run
    
    ./cachelogd -p8000 
    
where 8000 is the port number you choose.

You can also specify the directory in which to store data (the /var directory
by default) with this command:

     ./cachelogd -w /path/to/var/dir

2. Configure your indexer.conf as usual and add these two lines:

DBMode   cache
LogdAddr localhost:7000

The "LogdAddr" command specifies the cachelogd location. Each indexer
will connect to cachelogd at the given address at startup.


3. Run indexers. Several indexers can be executed simultaneously.
Note that you may install indexers on different machines and then
run them against the same cachelogd server. This distributed setup
makes indexing faster.


4. Creating the word index. Once some information has been gathered by
indexers and collected in the /var/raw/ directory by cachelogd, it is
possible to create fast word indexes. The "splitter" program is responsible
for this. It is installed in the /sbin directory. Note that indexes can be
created at any time without interrupting the current indexing process.

Indexes are created in the following three steps:

   A. Sending -HUP signal to cachelogd. cachelogd will close the current
      working logs and open new ones. You can use the cachelogd pid file
      to do this:

         kill -HUP `cat /usr/local/mnogosearch/var/cachelogd.pid`

   B. Preparing cachelogd logs for creating word indexes:
        
        Run splitter with "-p" command line argument:
        
        /usr/local/mnogosearch/sbin/splitter -p

      This operation takes all available logs in the /var/raw/ directory,
      divides them into 4096 parts (one file for each low-level word index
      directory) and stores data in a form acceptable to splitter in the
      /var/splitter/ directory. All processed logs in the /var/raw/
      directory are renamed to *.done automatically after this operation.
      You can remove them or keep them for backup purposes.

      If you do not wish to use the /var directory to store data, run
      splitter with the following command:

      ./splitter -w /path/to/var/dir
      

   C. Building word index.

      Run splitter without any arguments:

      /usr/local/mnogosearch/sbin/splitter

      It will take all 4096 prepared files in the /var/splitter/ directory
      in sequence and use them to build the fast word index. Processed logs
      in the /var/splitter/ directory are removed after this operation.
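
The three steps above can be sketched as one shell function. This is only an
illustration built from the commands shown in steps A to C. The names BASE,
DRY_RUN, run and build_index are made up for this example, and the default
of printing the commands instead of executing them is a choice of the sketch,
not a feature of mnoGoSearch.

```shell
#!/bin/sh
# Hypothetical wrapper around steps A-C; not shipped with mnoGoSearch.
BASE=${BASE:-/usr/local/mnogosearch}   # installation prefix used in this document
DRY_RUN=${DRY_RUN:-1}                  # 1 = only print the commands (assumed default)

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_index() {
    pidfile=$BASE/var/cachelogd.pid
    if [ ! -r "$pidfile" ]; then
        echo "cachelogd pid file not found: $pidfile" >&2
        return 1
    fi
    run kill -HUP "$(cat "$pidfile")" || return 1   # step A: reopen logs
    run "$BASE/sbin/splitter" -p || return 1        # step B: prepare logs
    run "$BASE/sbin/splitter"                       # step C: build word index
}
```

Calling build_index prints the three commands. Set DRY_RUN=0 before the call
to execute them.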


Cleaning processed information
------------------------------

  Note that after running step C it is better to delete (or back up) the
files in the /var/splitter/ directory. splitter can detect old data, so you
may keep those files in their original place, but in that case splitter
will probably run slowly, at least after large indexing volumes.


Optional usage of several splitters
-----------------------------------

  splitter has two command line arguments, -f <first file> and
-t <second file>, which limit the range of files used. If no parameters are
specified, splitter processes all 4096 prepared files. The -f and -t
parameters are given in HEX notation. For example,
"splitter -f 000 -t A00" will create word indexes using files in the range
from 000 to A00. These keys make it possible to run several splitters at the
same time, which usually builds the indexes faster. For example, this shell
script starts four splitters in the background:

#!/bin/sh
splitter -f 000 -t 3f0 &
splitter -f 400 -t 7f0 &
splitter -f 800 -t bf0 &
splitter -f c00 -t ff0 &


Using "run-splitter" script
---------------------------

There is a "run-splitter" script in the /sbin directory of the mnoGoSearch
installation. It helps to execute all three index-building steps in
sequence.

"run-splitter" has these three command line parameters:

   run-splitter --hup --prepare --split

or short version:

   run-splitter -k -p -s

Each parameter activates the corresponding index-building step.
"run-splitter" executes the requested steps in the proper order:

   A. Sending -HUP signal to cachelogd.

         Keys to activate this are --hup (or -k).

   B. Preparing cachelogd logs for building indexes.

         Keys to activate this are --prepare (or -p).

   C. Running splitter.

         Keys to activate this are --split (or -s).


In version 3.1.9 the "run-splitter" script can't execute several splitters
at the same time. This is on the TODO list.

In most cases just run the "run-splitter" script with all three "-k -p -s"
arguments. Using the three flags separately, one per index-building step,
is rarely required.


Doing search
------------
To start using search.cgi in "cache mode", edit your search.htm template
as usual and add this line:

DBMode cache


Phrase search
-------------

Please use --enable-phrase to enable phrase search in cache mode.
 

Fast search with tag, site and category limits
----------------------------------------------

  To activate fast search with tag, site and category limits, pass the
  --enable-fast-tag, --enable-fast-cat and --enable-fast-site
  arguments to the configure script. These features are disabled by default.

  Please note that if --enable-fast-tag, --enable-fast-cat
  or --enable-fast-site is not activated during configuration, the
  corresponding restriction will not work in cache mode at all. The reason
  is that with large volumes such restrictions would otherwise work rather
  slowly. With small volumes it is better to use SQL mode rather than
  cache mode.

  Indexer will store tag, category and site values directly in the cache
  mode word index. This makes search with tag, category and site limits
  very fast, but it requires more disk space. Note that recompiling
  with or without the --enable-fast-XXX configure arguments requires
  reinstalling splitter and search.cgi. You also have to rebuild the
  word indexes, either by reindexing or by using the *.done files in the
  /var/raw/ directory. In the latter case just remove the ".done" endings
  from those files and run splitter.
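
  Removing the ".done" endings can be done with a small shell loop. The
  sketch below is an illustration only: the function name restore_done_logs
  is made up, and the directory to process is passed as an argument rather
  than assumed.

```shell
#!/bin/sh
# Hypothetical helper: strip the ".done" suffix from processed raw logs
# so that splitter can pick them up again.
restore_done_logs() {
    dir=$1
    for f in "$dir"/*.done; do
        [ -e "$f" ] || continue        # no *.done files: nothing to do
        mv -- "$f" "${f%.done}"        # name.done -> name
    done
}
```

  For example, "restore_done_logs /usr/local/mnogosearch/var/raw" renames
  the files under the installation prefix used elsewhere in this document.
  Run splitter afterwards as described above.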

  The tag is nested and should be composed as the string representation of
  a hex number (for example 3355221112), in the same way as the category.
  It has 5 levels with 128-128-64-64-64 members on the respective levels.
  Each level consists of two hex digits. The total length of a tag or
  category value is 10.

  Thus, the first and second levels should be in the range 00-7F and
  the other levels in the range 00-3F.
  The whole range of tag and category values is 0000000000 - 7F7F3F3F3F.
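
  The rules above can be checked with a short shell function before a value
  is used. This is a sketch only: the name is_valid_tag is made up for this
  example, and accepting lowercase hex digits is an assumption.

```shell
#!/bin/sh
# Hypothetical validator for tag and category values.
# Levels 1-2: 00-7F (first digit 0-7), levels 3-5: 00-3F (first digit 0-3).
# ASSUMPTION: lowercase hex digits are accepted as well as uppercase.
is_valid_tag() {
    case $1 in
        [0-7][0-9A-Fa-f][0-7][0-9A-Fa-f][0-3][0-9A-Fa-f][0-3][0-9A-Fa-f][0-3][0-9A-Fa-f])
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

  With this function, 3355221112 and 7F7F3F3F3F are accepted, while
  8000000000 (first level above 7F) and 0000400000 (third level above 3F)
  are rejected.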

  To limit search by site just pass ul=http://www.something.com/ in the
  search.cgi query string. This can be done using a HIDDEN HTML form variable:

  <INPUT TYPE=HIDDEN NAME=ul VALUE=http://www.something.com/>
  
  Please note that in cache mode only restrictions of the form http://site/
  will work. I.e. even if you enter http://site/path/, the restriction will
  still be http://site/
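
  To see which value a given URL effectively reduces to, the following shell
  sketch trims a URL to the http://site/ form. It only mimics the behaviour
  described above and is not the code used by search.cgi. The function name
  site_limit is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reduce a URL to the "http://site/" form.
site_limit() {
    case $1 in
        http://?*) ;;                  # only http:// URLs are handled here
        *) return 1 ;;
    esac
    rest=${1#http://}                  # drop the scheme
    host=${rest%%/*}                   # keep everything before the first "/"
    printf 'http://%s/\n' "$host"
}
```

  For example, "site_limit http://www.something.com/docs/index.html" prints
  http://www.something.com/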
  

3.1.9 release notes
-------------------

1. "indexer -C" WITHOUT subsection control does not do anything
with word index tree. You should delete /var/tree directory manually.

2. "indexer -C ... " WITH one or several subsection control options 
(-t, -u, -s, -c) writes to logs and you have to run splitter after it.

3. After running splitter you may delete the indexer logs located in the
/var/splitter directory or move them to another place for backup purposes.

4. IMPORTANT! Never launch several run-splitter scripts at the same time.
(We'll add simultaneous execution blocking in the next version)
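
Until such blocking exists, concurrent runs can be prevented from the outside
with a simple lock. The sketch below uses a lock directory, relying on the
fact that mkdir fails if the directory already exists. It is an illustration
only: the function name with_lock, the LOCKDIR path and the exit status 75
are choices made for this example and are not part of mnoGoSearch.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only if no other wrapped run is active.
LOCKDIR=${LOCKDIR:-/tmp/run-splitter.lock}   # assumed lock location

with_lock() {
    # mkdir is atomic: exactly one caller can create the directory.
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "lock $LOCKDIR exists, another run seems to be active" >&2
        return 75
    fi
    "$@"                      # run the wrapped command
    status=$?
    rmdir "$LOCKDIR"          # release the lock
    return $status
}
```

A typical call would be
"with_lock /usr/local/mnogosearch/sbin/run-splitter -k -p -s". If the wrapped
command is killed before the lock is released, the lock directory has to be
removed by hand.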


Things to be implemented soon
-----------------------------

1. Make it possible to distribute database between several machines.

2. Make file formats platform independent by taking byte ordering into
account. This will allow using logs created on i386, SGI and Sparc machines.