wgrab

a tool for automated retrieval of selected parts of websites

wgrab is a Perl script that selectively downloads parts of a remote website and stores the results in the local filesystem. Instead of the indiscriminate way in which 'wget -r' downloads and stores everything, wgrab lets you iterate over dates and numbers and use regular expressions to specify which references to follow.

Here is the help you get when you call 'wgrab -h' (I know I need to write more polished documentation):

wgrab [opts] [-a saveAs] {counters} startURL {patterns}  v1.18 of Oct. 31st 2001
         make http downloads based on dates, numbers and regular expressions
         (c) Heiko Hellweg (hellweg@snark.de), 2001; see http://snark.de/wgrab/
options include:
-h: long help, -hX prints some eXamples...
-v: verbose (multiple -v make it more verbose)
-p: don't get last pattern - just print matches to stdout
-P substPat: like -p, but apply % substitution (like in saveAs) before printing
-n: noClobber - overwrite existing files instead of renaming
-w seconds: wait between downloads (don't overrun the other host)
-A: save all (not just the files retrieved with the last pattern)
-r: rewrite saved (make links to unsaved docs absolute; links to saved relative)
-H: span hosts - auto-patterns may contain ":" (like http://)
-R: follow HREFs only
-S: follow SRC only
--user string: username for basic www-authentication
--pass string: passphrase for basic www-authentication
--userAgent string: UserAgent header transmitted to server

counters are:   type start end [-s step]   with these types:
-d: iterate over dates, relative to now (start and end are integers)
-D: iterate over absolute dates (start and end are YYYYMMDD)
-e: enumerate over integers (may occur multiple times - use %e...%h)
-E: enumerate over characters (may occur multiple times - use %E...%H)

All patterns following the startURL may contain everything that makes up a perl
regex. With multiple patterns, wgrab gets the startURL document, extracts all
quoted strings, tries to match the first pattern, interprets the results as
URLs, downloads them, applies the next pattern to their content... and saves
the documents downloaded via the last pattern.
 
If a pattern contains "(" and ")", it is left unmodified. Otherwise a new
pattern is constructed: ["']([^"']*yourpattern)["'] (details of the prefix
depend on the -H, -R and -S options). Either way, perl's $1 [the matched part
in '()'] is what wgrab goes on with.

Patterns starting with "+" are recursed into on the same level and applied to
their own result again (this way you can easily iterate through "next" links)
 
additional regexp shorthand: \I (image) expands to "[gGjJpP][iIpPnN][eE]?[fFgG]"
(i.e. gif or jpg or jpeg or png - with arbitrary mixes of upper/lower case).
 
saveAs and all the patterns may contain %[[+-]number[.fillchar]]X
('+' truncates from the right and '-' truncates from the left) where X is one of
d: Day of Month         m: Month of Year
y: Year (4 digit)       D: Day of Week [0..6]
w: Day of Week name lowercase   W: Day of Week name Uppercase
n: Month name lowercase         N: Month name Uppercase
T: That day (shorthand for %4y%2m%2d)
t: today (shorthand for %4y%2m%2d with current date)
e/f/g/h: counter 1..4 (depends on number of -e/-E parameters)
E/F/G/H: chr(counter 1..4) (depends on number of -e/-E parameters)
=: referenced filename in -a or -P: URL split at "/" (from right: %1= = filename)
i: counter for saved files in saveAs
       ...better look at the examples with "-hX" to see how %substitution works

Tricks with the saveAs pattern (-a):
%.= flattens the filename, replacing '/' in the url with a '.' in the filename.
use "-" for the saveAs pattern to print all results to stdout.
if the saveAs pattern starts with a "|", it is executed as a shell command.
(e.g. mail each document to yourself with -a '|mutt -s "%=" myself@mail.edu')
 
LICENSE: use, modify and redistribute as much as you want - as long as credit
to me (hellweg@snark.de) remains in the docs. No warranty that wgrab does or
does not perform in any specific or predictable way...
It would be nice (not a condition of the license) if you told me about bugs
you encounter (maybe even with a fix) and/or the uses you find for wgrab.
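
To make the auto-pattern construction described in the help a bit more concrete, here is a minimal Perl sketch of how a pattern without "()" could be wrapped and matched. This is only an illustration with made-up variable names and a made-up HTML snippet (the pattern itself is the one from the dilbert example further down) - it is not wgrab's own code:

    use strict;
    use warnings;

    # the \I shorthand from the help: gif, jpg, jpeg or png in any case mix
    my $I = '[gGjJpP][iIpPnN][eE]?[fFgG]';

    # a user pattern without "(" and ")"
    my $userpattern = qr{/archive/images/dilbert[^"]+$I};

    # wrap it the way the help describes, so the quoted string ends up in $1
    my $auto = qr{["']([^"']*$userpattern)["']};

    my $html = q{<a href="/archive/images/dilbert20011031.gif">today</a>};  # made-up snippet
    if ($html =~ $auto) {
        print "would follow: $1\n";   # prints /archive/images/dilbert20011031.gif
    }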

The % patterns work a bit (but only a bit) like in printf - and a bit more (but still only a bit) like the '+FORMAT' mechanism of the GNU date tool...

Here are some examples (assuming, the current month is September):

wgrab -p '%m' => 9
wgrab -p '%2m' => 09   (left side padding to length 2 - 0 is the default fillchar)
wgrab -p '%3.#m' => ##9 (padding with a specific char)
wgrab -p '%-3.#m' => 9## (padding on the right)
wgrab -p '%N' => September
wgrab -p '%3N' => Sep (truncating on the right)
wgrab -p '%-3N' => ber (truncating on the left)
wgrab -p '%15._N' => ______September (padding on the left again)
If you need a real '%' sign - and wgrab might confuse it with a substitution pattern like %2d - escape it with a backslash (\%) or a second '%' (%%).
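
If the padding and truncation rules are easier to read as code, here is a small Perl sketch that reproduces the examples above. It reflects my reading of the rules, not wgrab's implementation, and the function name fmt is invented:

    use strict;
    use warnings;

    # format a value according to a spec like "2", "3.#", "-3" or "15._"
    sub fmt {
        my ($value, $spec) = @_;
        my ($minus, $width, $fill) = $spec =~ /^([+-]?)(\d+)(?:\.(.))?$/;
        $fill = '0' unless defined $fill;               # '0' is the default fillchar
        if (length($value) > $width) {                  # too long: truncate
            return $minus eq '-' ? substr($value, -$width)      # '-': keep the right end
                                 : substr($value, 0, $width);   # default: keep the left end
        }
        my $pad = $fill x ($width - length $value);
        return $minus eq '-' ? $value . $pad            # '-': pad on the right
                             : $pad . $value;           # default: pad on the left
    }

    print fmt(9, '2'),              "\n";   # 09
    print fmt(9, '3.#'),            "\n";   # ##9
    print fmt(9, '-3.#'),           "\n";   # 9##
    print fmt('September', '3'),    "\n";   # Sep
    print fmt('September', '-3'),   "\n";   # ber
    print fmt('September', '15._'), "\n";   # ______September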

Here are some examples of what wgrab can get you (these work today [20 Dec. 2000] - if the sites reorganize or change their naming scheme, you will have to find your own patterns):

  • save the last 4 weeks of dilbert with ungarbled names:
    wgrab -v -v -d -1 -28 -a ./dilbert%T.gif http://www.dilbert.com/comics/dilbert/archive/dilbert-%T.html '/archive/images/dilbert[^"]+\I'
  • get the whole SinFest archive:
    wgrab -v -v -n http://sinfest.net/strips_page.htm '\d+.html' 'sf\d+.\I'
  • get all my german translations of Dela from this very site:
    wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/archiv.html '\d.html' '\d.\I'
  • or get the same Dela translations by iterating through the "next" buttons with a '+' pattern
    (useful for sites without a central archive page)
    wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/first.html '+\d.html' '\d.\I'
    I am sure you will find many more...
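
To illustrate the saveAs pattern '%3i.%1=' used in the Dela examples: %3i is the running counter of saved files padded to three digits, and %1= is the last '/'-separated part of the URL. A rough Perl sketch of how such a name could be assembled (the URL and counter value are invented; this is only an illustration, not wgrab's code):

    use strict;
    use warnings;

    my $url     = 'http://www.snark.de/dela/5.gif';   # hypothetical downloaded URL
    my $counter = 7;                                   # hypothetical save counter
    my @parts   = split m{/}, $url;                    # "%1=" is the last part
    my $name    = sprintf('%03d.%s', $counter, $parts[-1]);
    print "$name\n";                                   # 007.5.gif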

To use wgrab, you only need a Perl installation (for Windows machines get it from ActiveState; for other architectures see CPAN, the Comprehensive Perl Archive Network) and my script wgrab.
If you are running Windows (or your Perl installation sits somewhere other than /usr/bin/perl), you may have to type "perl wgrab ..." instead of just "wgrab ...".

Caveat: you'd better not use this script to load really enormous files - the current download is always read completely into memory before being written to disk (it's not the number of files that matters, just the size of the biggest). A few MB won't hurt, but I would not advise you to get the ISO CD-ROM images for your favourite Linux distro this way...

Have fun with it and tell me if you find it useful...
By the way: another useful (but much more specific) tool is my comix collector. And there are some other tools available too.



Author: Heiko Hellweg
last modified: 31. Oct. 2001