Using external parsers
======================

Beginning with version 2.1, indexer can use external parsers to index
various file types (mime types).

A parser is an executable program which converts one of the mime types to
text/plain or text/html. For example, if you have PostScript files, you
can use the ps2ascii parser (filter), which reads a PostScript file from
stdin and writes ASCII to stdout.

Supported parser types
======================

Indexer supports four types of parsers that can:

  * read data from stdin and send result to stdout
  * read data from file and send result to stdout
  * read data from file and send result to file
  * read data from stdin and send result to file

Setting up parsers
==================

1. Configure mime types
-----------------------

Configure your web server to send the appropriate "Content-Type" header.
For Apache, have a look at the mime.types file; most mime types are
already defined there.

If you want to index local files, use the "AddType" command in
indexer.conf to associate file name extensions with their mime types.
For example:

    AddType text/html *.html

2. Add parsers
--------------

Add lines with parser definitions. Lines have the following format with
three arguments:

    Mime <from_mime> <to_mime> <command line>

For example, the following line defines a parser for man pages:

    # Use deroff for parsing man pages ( *.man )
    Mime application/x-troff-man text/plain deroff

This parser will take data from stdin and output the result to stdout.

Many parsers cannot operate on stdin and require a file to read from.
In this case indexer creates a temporary file in /tmp and removes it
when the parser exits. Use the $1 macro in the parser command line to
substitute the file name. For example, the Mime command for the "catdoc"
MS Word to ASCII converter may look like this:

    Mime application/msword text/plain "/usr/bin/catdoc -a $1"

If your parser writes its result into an output file, use the $2 macro.
indexer will replace $2 with a temporary file name, start the parser,
read the result from this temporary file, then remove it.
For example:

    Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"

The parser above will read data from the first temporary file and write
the result to the second one. Both temporary files will be removed when
the parser exits. Note that the result of this parser is exactly the same
as that of the previous one, but they use different execution modes:
file->stdout and file->file respectively.

Pipes in parser's command line
==============================

You can use pipes in a parser's command line. For example, these lines
are useful for indexing gzipped man pages from local disk:

    AddType application/x-gzipped-man *.1.gz *.2.gz *.3.gz *.4.gz
    Mime    application/x-gzipped-man text/plain "zcat | deroff"

Charsets and parsers
====================

Some parsers can produce output in a charset other than the one given in
the LocalCharset command. Specify the charset to make indexer convert the
parser's output to the proper one. For example, if your catdoc is
configured to produce output in the windows-1251 charset but LocalCharset
is koi8-r, use this command for parsing MS Word documents:

    Mime application/msword "text/plain; charset=windows-1251" "catdoc -a $1"

UDM_URL variable
================

When executing a parser, indexer creates the UDM_URL environment variable
with the URL being processed as its value. You can use this variable in
parser scripts.

Parser examples
===============

Nice RPM parser by Mario Lang
-----------------------------

/usr/local/bin/rpminfo:

    #!/bin/bash
    /usr/bin/rpm -q --queryformat="RPM: %{NAME} %{VERSION}-%{RELEASE}(%{GROUP})%{DESCRIPTION}\n" -p "$1"

indexer.conf:

    Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"

It renders to such nice RPM information:

    3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4]
       Mysql is a SQL (Structured Query Language) database server. Mysql
       was written by Michael (monty) Widenius. See the CREDITS file in
       the distribution for more credits for mysql and related things....
       (application/x-rpm) 2088855 bytes

catdoc MS Word to text converter
--------------------------------

Home page: http://freshmeat.net/redir/catdoc/1055/url_homepage/
Also listed at: http://freshmeat.net/

indexer.conf:

    Mime application/msword text/plain "catdoc $1"

xls2csv MS Excel to text converter
----------------------------------

It is supplied together with catdoc.

indexer.conf:

    Mime application/vnd.ms-excel text/plain "xls2csv $1"

pdftotext Adobe PDF converter
-----------------------------

Supplied together with the xpdf project.

Home page: http://freshmeat.net/redir/xpdf/12080/url_homepage/
Also listed at: http://freshmeat.net/

indexer.conf:

    Mime application/pdf text/plain "pdftotext $1 -"

rthc RTF to text converter
--------------------------

Home page: http://so.dis.ulpgc.es/~a2092/rthc/index.en.html
Also listed at: http://freshmeat.net/

indexer.conf:

    Mime "text/rtf*" text/html "rthc --use-stdout $1 2>/dev/null"

rthc also writes some output to stderr, so redirect it to /dev/null.

---------------------------------------------------------------------

Please feel free to contribute your scripts and parser configurations to
general@mnogosearch.org.
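
To illustrate the file->file mode and the UDM_URL variable described
above, here is a minimal sketch of a wrapper parser for gzipped plain-text
files. It is not part of the distribution: the script name
/usr/local/bin/gztext, the function name gztext and the mime type
application/x-gzipped-text are all hypothetical, and the sketch assumes
that indexer substitutes $1 and $2 as described in "Add parsers".

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/gztext: file -> file parser for gzipped text.
# Assumed indexer.conf lines (the mime type name is invented for this sketch):
#   AddType application/x-gzipped-text *.txt.gz
#   Mime application/x-gzipped-text text/plain "/usr/local/bin/gztext $1 $2"

gztext() {
    in="$1"     # temporary input file substituted for $1 by indexer
    out="$2"    # temporary output file substituted for $2 by indexer

    {
        # UDM_URL is set by indexer; use a placeholder when run by hand.
        echo "Source: ${UDM_URL:-unknown}"
        # Read via stdin so the temporary file name needs no .gz suffix.
        gzip -dc < "$in"
    } > "$out"
}

# Act as a parser only when called with both file names.
if [ "$#" -eq 2 ]; then
    gztext "$1" "$2"
fi
```

The first line of the output records the URL being processed, so it becomes
part of the indexed text; drop the echo line if that is not wanted. The
script can also be tried by hand, for example
"UDM_URL=http://localhost/a.txt.gz /usr/local/bin/gztext a.txt.gz out.txt".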