"Cache mode" storage ==================== Indroduction ----------- Beginning from version 3.1.5 mnoGoSearch supports new words "cache" storage mode able to index and search quickly through several millions of documents. Cache mode word indexes structure --------------------------------- The main idea of cache storage mode is that word index is stored on disk rather than SQL database. URL information (table "url") however is kept in SQL database. Word index is divided into 1 million files using 32 bit word_id. Index is located in nested structure under /var/tree directory of mnoGoSearch installation. /var/tree structure is divided into three subdirectories levels. First level has 256 directories with names in the range 00-FF. Each first level directory has 16 subdirectories in the range 0-F. Each second level directory has 256 files with word information: /var/tree/00/0/00000 ... /var/tree/00/0/000FF ... ... /var/tree/FF/F/FFF00 ... /var/tree/FF/F/FFFFF Each low level file contains information about up to 4096 different words. Low level files are sorted in the order suitable for search purposes. This word distribution scheme allows search program to quickly access required words. We call this storage mode "cache mode" because words index distribution is very similiar to squid cache structure. Cache mode tools ---------------- There are two additional programs "cachelogd" and "splitter" used in "cache mode" indexing. cachelogd is a TCP daemon which collects word information from indexers and stores it on your hard disk. splitter is a program to create fast word indexes using data collected by cachdlogd. Those indexes are used later in search process. Starting cache mode ------------------------------- To start "cache mode" follow these steps: 1. Start "cachelogd" server: cd /usr/local/mnogosearch/sbin (/sbin directory of base mnoGoSearch installation) ./cachelogd & 2>cachlogd.out It will write some debug information into cachelogd.out file. 
   Cachelogd also creates a pid file in the /var directory of the base
   mnoGoSearch installation.

   Cachelogd listens for TCP connections and can accept several indexers
   running on different machines. The theoretical maximum number of indexers
   is about 128. Cachelogd stores the information sent by indexers in the
   /var/raw/ directory of the mnoGoSearch installation.

   You can specify the port for cachelogd to use without recompiling. To do
   so, run:

       ./cachelogd -p8000

   where 8000 is the port number you choose.

   You can also specify the directory in which to store data (the /var
   directory by default) with this command:

       ./cachelogd -w /path/to/var/dir

2. Configure your indexer.conf as usual and add these two lines:

       DBMode cache
       LogdAddr localhost:7000

   The "LogdAddr" command specifies the cachelogd location. Each indexer
   connects to cachelogd at the given address at startup.

3. Run indexers. Several indexers can be executed simultaneously. Note that
   you may install indexers on different machines and run them against the
   same cachelogd server. Such a distributed setup makes indexing faster.

4. Create the word index. Once some information has been gathered by
   indexers and collected in the /var/raw/ directory by cachelogd, it is
   possible to create fast word indexes. The "splitter" program is
   responsible for this. It is installed in the /sbin directory. Note that
   indexes can be created at any time without interrupting the current
   indexing process.

   Indexes are created in the following three steps:

   A. Sending the -HUP signal to cachelogd. cachelogd will close the current
      working logs and open new ones. You can use the cachelogd pid file to
      do this:

          kill -HUP `cat /usr/local/mnogosearch/var/cachelogd.pid`

   B. Preparing cachelogd logs for creating word indexes. Run splitter with
      the "-p" command line argument:

          /usr/local/mnogosearch/sbin/splitter -p

      This operation takes all available logs in the /var/raw/ directory,
      divides them into 4096 parts (one file for each low-level word index
      directory) and stores the data, in a form acceptable to splitter, in
      the /var/splitter/ directory. All processed logs in the /var/raw/
      directory are automatically renamed to *.done after this operation.
      You can remove them or keep them for backup purposes.

      If you do not wish to use the /var directory to store data, run
      splitter with the following command:

          ./splitter -w /path/to/var/dir

   C. Building the word index. Run splitter without any arguments:

          /usr/local/mnogosearch/sbin/splitter

      It will take all 4096 prepared files in the /var/splitter/ directory
      in sequence and use them to build the fast word index. Processed logs
      in the /var/splitter/ directory are removed after this operation.

Cleaning processed information
------------------------------

Note that after running step C, it is better to delete (or back up) the
files in the /var/splitter/ directory. splitter can detect old data, so you
may keep those files in their original place, but in that case splitter will
probably run slowly, at least after large volumes have been indexed.

Optional usage of several splitters
-----------------------------------

splitter has two command line arguments, -f and -t, which limit the range of
files used. If neither is specified, splitter processes all 4096 prepared
files. You can limit the file range using the -f and -t keys, giving their
parameters in hex notation. For example, "splitter -f 000 -t A00" will
create word indexes using files in the range from 000 to A00. These keys
allow several splitters to be used at the same time, which usually builds
the indexes more quickly.
For example, this shell script starts four splitters in the background:

    #!/bin/sh
    splitter -f 000 -t 3f0 &
    splitter -f 400 -t 7f0 &
    splitter -f 800 -t bf0 &
    splitter -f c00 -t ff0 &

Using the "run-splitter" script
-------------------------------

There is a "run-splitter" script in the /sbin directory of the mnoGoSearch
installation. It helps to execute all three index building steps in
sequence.

"run-splitter" has these three command line parameters:

    run-splitter --hup --prepare --split

or, in the short version:

    run-splitter -k -p -s

Each parameter activates the corresponding index building step.
"run-splitter" executes the three steps of index building in the proper
order:

A. Sending the -HUP signal to cachelogd. The --hup (or -k) argument is
   responsible for this.

B. Preparing cachelogd logs for index building. The --prepare (or -p)
   argument activates this step.

C. Running splitter. The --split (or -s) argument activates this step.

In version 3.1.9 the "run-splitter" script cannot execute several splitters
at the same time. This is on the TODO list.

In most cases, just run the "run-splitter" script with all three arguments,
"-k -p -s". Using the three flags separately, one per index building step,
is rarely required.

Doing search
------------

To start using search.cgi in "cache mode", edit your search.htm template as
usual and add this line:

    DBMode cache

Phrase search
-------------

Use --enable-phrase to enable phrase search in cache mode.

Fast search with tag, site and category limits
----------------------------------------------

To activate fast search with tag, site and category limits, pass the
--enable-fast-tag, --enable-fast-cat and --enable-fast-site arguments to the
configure script. These features are disabled by default.

Please note that if --enable-fast-tag, --enable-fast-cat or
--enable-fast-site is not activated during configuration, the corresponding
restriction will not work in cache mode at all. The reason is that with
large volumes such restrictions would work rather slowly.
With small volumes it is better to use SQL mode rather than cache mode.

Indexer will store tag, category and site values directly in the cache mode
word index. This makes search with tag, category and site limits very fast,
but it requires more disk space.

Note that recompiling with or without the --enable-fast-XXX configure
arguments requires reinstalling splitter and search.cgi. You also have to
rebuild the word indexes, either by reindexing or by using the *.done files
in the /var/raw/ directory. In the latter case, just remove the ".done"
endings from those files and run splitter.

A tag is nested and should be composed as a string representation of a hex
number (for example 3355221112), in the same way as a category. It has
5 levels with 128-128-64-64-64 members on each level. Each level consists of
two hex digits, so the total length of a tag or category value is 10. Thus
the first and second levels should be in the range 00-7F and the other
levels in the range 00-3F. The whole range of tag and category values is
0000000000 - 7F7F3F3F3F.

To limit search by site, pass ul=http://www.somthing.com/ in the search.cgi
query string. This can be done using a HIDDEN HTML form variable.

Please note that in cache mode only this kind of restriction works:

    http://site/

I.e. even if you enter http://site/path/, the restriction will still be
http://site/

3.1.9 release notes
-------------------

1. "indexer -C" WITHOUT subsection control does not do anything with the
   word index tree. You should delete the /var/tree directory manually.

2. "indexer -C ..." WITH one or several subsection control options (-t, -u,
   -s, -c) writes to logs, and you have to run splitter after it.

3. After running splitter you may delete the indexer logs located in the
   /var/splitter directory or move them to another place for backup
   purposes.

4. IMPORTANT! Never launch several run-splitter scripts at the same time.
   (We'll add blocking of simultaneous execution in the next version.)

Things to be implemented soon
-----------------------------

1. Make it possible to distribute the database between several machines.

2. Make file formats platform independent, taking byte ordering into
   account. This will allow using logs created on i386, SGI and Sparc
   machines.
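
As an illustration of the /var/tree layout described in the section on the
word index structure, the following is a minimal shell sketch that prints
the path of a low-level index file from its number (0 to 1048575). The
function name "tree_path" is hypothetical and is not part of mnoGoSearch.
The sketch only reproduces the directory naming shown in the listing; how
mnoGoSearch derives the file number from a word_id is not described here and
is not assumed.

```shell
#!/bin/sh
# tree_path FILE_NUMBER
# Hypothetical helper (not shipped with mnoGoSearch): prints the path of a
# low-level index file, assuming the layout shown in this document:
#   /var/tree/<hex digits 1-2>/<hex digit 3>/<all five hex digits>
tree_path() {
    # Five upper-case hex digits, e.g. 255 -> 000FF
    name=$(printf '%05X' "$1")
    # First-level directory: first two hex digits (00-FF)
    d1=$(printf '%s' "$name" | cut -c1-2)
    # Second-level directory: third hex digit (0-F)
    d2=$(printf '%s' "$name" | cut -c3)
    printf '/var/tree/%s/%s/%s\n' "$d1" "$d2" "$name"
}
```

For instance, "tree_path 255" prints /var/tree/00/0/000FF, and
"tree_path 1048575" prints /var/tree/FF/F/FFFFF, matching the first and last
blocks of the listing.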