Writing a .site File
This is a step-by-step guide to writing a .site file for your favourite site. I'll use an imaginary news site, www.whizzynews.com, to demonstrate.
First, start with the URL. This is the URL that sitescooper will start crawling the site from. You might as well also add a readable name for the site, and a description of what's on offer there. The general format of a .site file is:

    # a comment
    URL: xxxx
    Parameter1: value
    Parameter2: value
    [... etc.]

So for our example site:

    URL: http://www.whizzynews.com/
    Name: Whizzy News
    Description: News about whizzy stuff, updated daily
You can leave out the http:// bit in the URL if you like, but I think it's a little more readable if it's there. The URL line must always be first in a .site file.
Finding a good URL can be tricky -- the front page isn't always the best one, as it's sometimes "optimized" for MSIE and/or Netscape. If the site has a "Palmpilot version", maybe that would be a better URL to start with.
A handy way to write a site file is to use the "AvantGo version" of many sites; quite often these are not easy to track down, but if you search AltaVista using the keywords url:avantgo -url:avantgo.com sitename, or link:.subs avantgo sitename, it may find it. Another way to do it is to search for link:avantgo.com/mydevice/autoadd.html, which is AvantGo's interface allowing site authors to add their own sites, providing the details in the URL.
The next most important thing is the number of levels to the site. A 1-level site will only have the initial page downloaded. Typically this would be something like NTK, Slashdot, Memepool, RobotWisdom etc. To indicate a site like this, use the line:
    Levels: 1

in your .site file.
A 2-level site would have an index page with links to the full stories elsewhere. This is quite common too: LinuxToday, The Register, Wired News (when you're downloading it section-by-section that is). To do this you'd use "Levels: 2", unsurprisingly.
A 3-level site is something like The Onion, New Scientist or other periodicals, which come out with "issues", each of which is a page of links to stories: "Levels: 3".
If you're writing a 1-level site, or you're using a "Palmpilot version", you could almost stop here. You may want to trim off bits from the top and bottom, in which case check the Trimming section; also, you might want to only download the differences between the current page and what you'd previously downloaded; check Diffing in that case. Otherwise carry on to...
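To make that concrete, here's a complete minimal 1-level site file (the URL, name and description are invented for illustration):

```
# a minimal 1-level site -- everything of interest is on the one page
URL: http://www.whizzynews.com/brief.html
Name: Whizzy News In Brief
Description: One-page summary of whizzy news
Levels: 1
```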
For a 2-level site, then, you could start with:

    URL: http://www.whizzynews.com/
    Name: Whizzy News
    Levels: 2

This may do the trick for you -- the default behaviour is for sitescooper to get the index page ("URL"), then follow any links that go to another page on the same host (using the same protocol and port, if present). In the example above, that means any other page under http://www.whizzynews.com/.
In sitescooper terms, in the example above, the URL http://www.whizzynews.com/ is the "contents" page; any pages linked to from that page are the "story" pages. Sitescooper is optimised to treat story pages on a multi-level site as static pages (i.e. their content does not change), while contents pages are dynamic (their content may change over time).
Regular expressions are a very powerful way to match text strings, and once you get the hang of it you'll wonder how you ever got by without them. If you're a regexp newbie, don't worry, I'll give a quick overview as we go by.
Anyway, StoryURL is a regexp, and it must fully match the URL of any links found, i.e. specifying

    StoryURL: http://www.whizzynews.com/stories/

will *only* match "http://www.whizzynews.com/stories/", which is not what you want. If you want to match any page under /stories, do this:

    StoryURL: http://www.whizzynews.com/stories/.*

(".*" means "match any number of characters"). Even better, narrow it down to avoid non-HTML pages like this:

    StoryURL: http://www.whizzynews.com/stories/.*\.html

("\." means "match a dot"; "." on its own has a special meaning in regexps). Just for convenience, StoryURL (and ContentsURL, see later) allow you to leave off the hostname and protocol if they're the same as the URL, so you could write the above line as:

    StoryURL: /stories/.*\.html

See, easy. Some other quick regexp tips: "\d" matches a digit, "+" means "one or more of the preceding thing", and "(html|htm)" matches either "html" or "htm". So a really advanced StoryURL example would be:

    StoryURL: http://www.whizzynews.com/stories/.+/\d+_\d+\.(html|htm)

which means "match any document under a subdirectory of /stories, with a filename consisting of two numbers separated by an underscore, ending in either .html or .htm".
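If you want to try out a StoryURL pattern before putting it in a .site file, you can test it with any regexp engine that understands Perl-style patterns. Here's a quick sketch in Python (whose re module is close enough to Perl for these patterns); the sample URLs are invented for illustration:

```python
import re

# The advanced StoryURL pattern from above, exactly as it appears in the
# .site file. Sitescooper requires the pattern to match the whole URL,
# so fullmatch() mimics that behaviour here.
pattern = re.compile(r"http://www.whizzynews.com/stories/.+/\d+_\d+\.(html|htm)")

urls = [
    "http://www.whizzynews.com/stories/tech/12_34.html",    # subdirectory + NN_NN name
    "http://www.whizzynews.com/stories/12_34.html",         # no subdirectory
    "http://www.whizzynews.com/stories/tech/archive.html",  # wrong filename shape
]
for url in urls:
    print(url, "->", bool(pattern.fullmatch(url)))
# prints True for the first URL, False for the other two
```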
Check the .site files in the "site_samples" directory for good examples of StoryURLs.
For more information on regular expressions in general, and the Perl variety that sitescooper uses, I recommend checking out some of these pages. First of all, Steve Litt at Troubleshooters.com has put together a great guide here: Steve Litt's PERLs of Wisdom: PERL Regular Expressions. Highly recommended. Next up, there's A Tao Of Regular Expressions, which is quite good. Also, weblogger Jorn Barger has written up a good page of links to regexp resources at Regular Expressions Resources on the Web. The "Perl" section is most relevant to sitescooper's regexp format. The About.com guide to Perl has a great tutorial on them too.
"Diffing" is when sitescooper compares the page with what it saw on a previous snarfing run, and only reports the differences. (The word "diffing" comes from the UNIX tool diff(1), BTW.)
Quite frequently sites that need this are 1-level sites, although sometimes it can be handy to use diffing on 2-level or 3-level sites' contents pages, if the text from these pages is downloaded as well as the links to stories.
Diffing is enabled for a site by specifying one of the following parameters:
    StoryDiff: 1

Enable diffing of story pages -- typically used for 1-level sites, where the URL specifies the story page.
    ContentsDiff: 1

Enable diffing of the contents page. As explained above, this is typically only needed if you will be printing the text from the contents page (see the ContentsPrint parameter, described later). Many of the parameters that affect story pages can be applied to contents pages as well, by the way -- more on this later.
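For instance, here's a sketch of a 1-level site that uses diffing, so that only text added since the last run shows up in the output (the URL and name are invented for illustration):

```
URL: http://www.whizzynews.com/ticker.html
Name: Whizzy Ticker
Levels: 1
# only report text that's new since the previous snarfing run
StoryDiff: 1
```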
Pages often carry navigation bars, ad banners and other furniture that you won't want in the output. To work around this, sitescooper supports the ContentsStart/ContentsEnd and StoryStart/StoryEnd parameters. These are Perl regular expressions (again) which should match the beginning and end of the contents or stories. E.g. if you want to trim the retrieved stories so that the text output only runs from the end of the navigation bar, which is helpfully marked in the HTML with a comment "end of nav bar", to the HTML comment "end of story text", you could use the lines:
    StoryStart: -- end of nav bar --
    StoryEnd: -- end of story text --
You can also tell sitescooper where to find a headline for each story, using the StoryHeadline parameter:

    StoryHeadline: -- HEADLINE=(.*?) --

This will cause sitescooper to look for that pattern in the HTML source of the story page and use whatever is in the brackets, namely ".*?" in the example above, as a headline for the bookmark.
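The "(.*?)" in that pattern is a non-greedy capture: it grabs as little as possible, so the headline stops at the first " --" instead of running on through the rest of the page. A quick Python sketch of the same pattern (the sample HTML and headline text are invented for illustration):

```python
import re

# A sample fragment of story HTML containing a headline marker like the
# one in the StoryHeadline example above.
html = '<!-- HEADLINE=Whizzy widget ships at last --> <p>Story text...</p>'

# Non-greedy (.*?) stops the capture at the first " --" it can.
m = re.search(r'-- HEADLINE=(.*?) --', html)
print(m.group(1))  # prints: Whizzy widget ships at last
```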
Sitescooper has built-in support for some headline mechanisms, namely PointCast headline tags in the HTML, and the headline tag used in My Netscape-style RSS files.
For 3-level sites, you can specify IssueLinksStart and IssueLinksEnd parameters, which work in a similar fashion to ContentsStart and ContentsEnd, and a ContentsURL keyword which works a la StoryURL. Most of the keywords used for 2-level sites have a parallel keyword for 3-level sites too.
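As a sketch, a 3-level site file using these keywords might look like this (the URLs and the HTML-comment markers are invented for illustration):

```
URL: http://www.whizzynews.com/
Name: Whizzy News Weekly
Levels: 3
# which links on the front page lead to issues
ContentsURL: /issues/.*\.html
# only follow issue links appearing between these two markers
IssueLinksStart: -- start of issue list --
IssueLinksEnd: -- end of issue list --
# which links inside an issue lead to stories
StoryURL: /stories/.*\.html
```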
# is the site file comment character, so anything after that character is ignored by sitescooper. Try to avoid using it if possible. In patterns, use the "." character (which matches any character) instead.

How do I handle sites with more than one possible URL pattern for stories?
Just specify more than one StoryURL line; sitescooper will know what to do.

How do I print out the contents- or issue-level pages as well as the story pages?
Set the ContentsPrint parameter to 1. Note that when you output in HTML format, or an HTML-based format like iSilo, this happens automatically -- but for text or DOC output it needs to be switched on by hand.

How do I handle sites where the story's URL stays the same, but I don't want it cached -- or where the contents page can use a cached version if its URL is the same as on a previous run?
Use the StoryCachable or ContentsCacheable keywords. A value of 0 indicates that the page should not be cached, and a value of 1 means that it can be.

My site is downloaded OK, but some of the text is lost because it's in a narrow table, and sitescooper is trimming it out automatically. How do I stop this?
Set StoryUseTableSmarts or ContentsUseTableSmarts to 0.

Is there any way to snarf a site in frames?
Yes. You can either treat it as a site with an extra level (sitescooper will follow FRAME SRC tags as if they were A HREF links), or, if the framed document has a static URL, just treat it as a site with the right number of levels and start crawling from the framed document's URL.

How do I skip some stories, even though they match the StoryURL?
Use StorySkipURL (or ContentsSkipURL if you need to skip contents pages). This should contain the regexp pattern of URLs you want to skip.

Can I add start URLs to a site, in addition to the one in the URL parameter?
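For example, to grab everything under /stories except printer-friendly duplicates, you might write something like this (the "printable" naming is invented for illustration):

```
StoryURL: /stories/.*\.html
StorySkipURL: /stories/.*printable.*\.html
```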
Yep, use the AddURL keyword in the site file. Each additional URL will start the downloading again, and the output will all be in one file.

I want to do some mangling of the output from sitescooper, after it's been converted from HTML to text, but before it gets sent to MakeDoc. Can I do this?
Sure, but you need to know some Perl. Sitescooper provides the StoryPostProcess keyword to do this.

The site I'm crawling has multi-page stories. How do I deal with this?
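As an illustrative sketch only -- assuming, as the surrounding text suggests, that the keyword's value is Perl code run over each story's converted text -- a StoryPostProcess line might look something like:

```
# hypothetical example: collapse runs of three or more newlines in the text
StoryPostProcess: s/\n{3,}/\n\n/gs;
```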
Is there an easy way to scoop sites that provide links to new content using the My Netscape or Scripting News RSS XML formats?
One of my sites produces lots and lots of output on its first run, because the contents pages have links to really old articles. How can I fix this?
Set the StoryLifetime parameter to something lower than the default, which is 60. The parameter is specified in days. Alternatively, try using the [[YYYY]], [[MM]] and [[DD]] parameters in the URLs to restrict the years, months and days that articles should be chosen from.
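For example, assuming the date parameters can appear in a StoryURL pattern (as "in the URLs" suggests) and the site keeps dated archive paths (the layout here is invented), you could write:

```
# only match stories filed under recent dates
StoryURL: /stories/[[YYYY]]/[[MM]]/[[DD]]/.*\.html
StoryLifetime: 14
```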