Before any link is tested, the destination site is looked up in a
table of recently accessed sites (the definition of "recently" is set
via the configuration default CheckInterval
or via the instruction directive SitesCheck).
If the site is not found, its /robots.txt document is requested and
parsed for restrictions to be placed on MOMspider robots. Any such
restrictions are added to the user's avoid list and the site is added
to the site table, both with expiration dates indicating when the site
must be checked again. Although this opens the possibility of a
discrepancy between the restrictions applied and the contents of a
recently changed /robots.txt document, it is necessary to avoid a
condition where the site checks cause a greater load on the server
than would the maintenance requests alone.
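To make the bookkeeping concrete, the following is a minimal Python sketch of
the site-check step described above. The names (CHECK_INTERVAL, site_table,
avoid_list, check_site) are illustrative stand-ins rather than MOMspider's
actual internals, and the /robots.txt parsing is deliberately simplified.

    import time
    import urllib.request

    CHECK_INTERVAL = 15 * 24 * 3600   # stand-in for the CheckInterval default (seconds)

    site_table = {}   # host -> expiration time of its last /robots.txt check
    avoid_list = []   # (url_prefix, expiration) pairs gathered from /robots.txt files

    def robot_restrictions(host, agent="momspider"):
        """Fetch /robots.txt and return the Disallow prefixes that apply to `agent`."""
        try:
            with urllib.request.urlopen("http://%s/robots.txt" % host, timeout=10) as reply:
                text = reply.read().decode("latin-1")
        except OSError:
            return []                 # no /robots.txt means no extra restrictions
        groups, current = {}, []      # user-agent token -> list of Disallow values
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()
            if ":" not in line:
                continue
            field, value = (part.strip() for part in line.split(":", 1))
            if field.lower() == "user-agent":
                # Simplified: exact match on the user-agent token.
                current = groups.setdefault(value.lower(), [])
            elif field.lower() == "disallow" and value:
                current.append(value)
        # A record naming the agent overrides the '*' default record.
        return groups.get(agent, groups.get("*", []))

    def check_site(host):
        """Refresh the cached restrictions for a host whose site entry is missing or expired."""
        now = time.time()
        if site_table.get(host, 0) > now:
            return                    # checked recently; reuse the existing avoid entries
        expiration = now + CHECK_INTERVAL
        site_table[host] = expiration
        for prefix in robot_restrictions(host):
            avoid_list.append(("http://%s%s" % (host, prefix), expiration))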
One example of a /robots.txt file can be seen at my site. Note that
I place fewer restrictions on MOMspider than I do on other robots
(a user-agent of * represents the default for all robots
other than those specifically named elsewhere in the file). URLs
that almost all information providers should be encouraged to
"Disallow" (what MOMspider refers to as "Avoid") are those that
point to dumb scripts (scripts that don't understand the difference
between HEAD and GET requests).
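For readers who cannot reach that link, a hypothetical file in the same
spirit (not the actual contents of my site's file) might look like this:
the default record restricts all robots, while a record naming MOMspider
grants it more freedom but still keeps it away from the dumb scripts.

    # Hypothetical example only.
    # Default record: applies to every robot not named elsewhere in the file.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/
    Disallow: /tmp/

    # MOMspider gets fewer restrictions, but is still kept off the dumb scripts.
    User-agent: MOMspider
    Disallow: /cgi-bin/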
It is important to note that avoid and sites files must always be used in pairs. The entries in the avoid file should expire at the same time as their corresponding site entry in the sites file; otherwise, the restrictions being applied fall out of step with the record of when each site was last checked, and the spider will fail to act properly.
In addition to the robot exclusion standard, the avoid files can be edited by hand (when the MOMspider process is not running) so that the user can specify particular URL prefixes to be avoided or leafed. Avoided means that no request can be made on a URL beginning with that prefix. Leafed means that only HEAD requests can be made on those URLs, which has the effect of preventing their traversal.
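The effect of the two kinds of hand-edited entries can be summarized in a few
lines of Python; classify, avoid_prefixes, and leaf_prefixes are hypothetical
names used only for illustration.

    def classify(url, avoid_prefixes, leaf_prefixes):
        """Decide how a URL may be tested, given hand-edited avoid and leaf prefixes."""
        if any(url.startswith(p) for p in avoid_prefixes):
            return "avoid"       # no request of any kind may be made
        if any(url.startswith(p) for p in leaf_prefixes):
            return "leaf"        # HEAD only, so the document is never traversed
        return "traverse"        # eligible for a full GET and link extraction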
Finally, URL prefixes can be temporarily leafed (for the duration of one task) by including an Exclude directive to that effect in the task instructions.
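In terms of the sketch above, a task-level Exclude prefix simply joins the
leaf prefixes for the duration of that one task (hypothetical names and URLs):

    # Prefix taken from an Exclude directive in the task instructions.
    task_excludes = ["http://www.example.edu/archive/"]

    classify("http://www.example.edu/archive/old.html",
             avoid_prefixes=[],
             leaf_prefixes=["http://www.example.edu/cgi-bin/"] + task_excludes)
    # -> "leaf" for this task only; the prefix is not written to the avoid file.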
Examples of some real avoid and sites files are included. The system avoid file provided with the release lists many URLs that are known to cause trouble with spiders. Of particular importance is avoiding gateways to other protocols (e.g. wais and finger) that would cause a great deal of unnecessary computation if traversed by a spider.
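Purely as an illustration of the kind of entry involved (the hosts and paths
below are placeholders, and the released system avoid file has its own format
and contents), such gateway entry points end up as avoid prefixes so that no
request is ever made on them:

    avoid_prefixes = [
        "http://www.example.edu/cgi-bin/wais.pl",   # WAIS gateway: each request triggers a search
        "http://www.example.edu/cgi-bin/finger",    # finger gateway
    ]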
Before running MOMspider on your site, you should create a /robots.txt
document on your server so that the spider can read it on its first
traversal. You can then look at the resulting avoid and sites files to
see how well MOMspider parsed the file. To force MOMspider to re-read a
site's /robots.txt, delete that site's entry from the sites file before
running the spider.
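One quick way to preview how an exclusion-standard parser reads your new
/robots.txt, before comparing it against the avoid and sites files that
MOMspider writes, is Python's standard robotparser module; the host and
paths below are placeholders.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser("http://www.example.edu/robots.txt")
    rp.read()                                  # fetch and parse the exclusion file
    for url in ("http://www.example.edu/cgi-bin/test",
                "http://www.example.edu/docs/index.html"):
        verdict = "allowed" if rp.can_fetch("MOMspider", url) else "disallowed"
        print(url, "->", verdict)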