wgrab

a tool for automated retrieval of selected parts of websites

wgrab is a Perl script that selectively downloads parts of a remote website and stores the results in the local filesystem. Instead of the indiscriminate way in which 'wget -r' downloads and stores everything, wgrab lets you iterate over dates and numbers and use regular expressions to specify which references to follow.

Here is the help you get when you call 'wgrab -h' (I know I need to write more polished documentation):

wgrab [opts] [-a saveAs] {counters} startURL {patterns}  v1.18 of Oct. 31st 2001
         make http downloads based on dates, numbers and regular expressions
         (c) Heiko Hellweg (hellweg@snark.de), 2001; see http://snark.de/wgrab/
options include:
-h: long help, -hX prints some eXamples...
-v: verbose (multiple -v make it more verbose)
-p: don't get last pattern - just print matches to stdout
-P substPat: like -p, but apply % substitution (like in saveAs) before printing
-n: noClobber - overwrite existing files instead of renaming
-w seconds: wait between downloads (don't overrun the other host)
-A: save all (not just the files retrieved with the last pattern)
-r: rewrite saved (make links to unsaved docs absolute; links to saved relative)
-H: span hosts - auto-patterns may contain ":" (like http://)
-R: follow HREFs only
-S: follow SRC only
--user string: username for basic www-authentication
--pass string: passphrase for basic www-authentication
--userAgent string: UserAgent header transmitted to server

counters are:   type start end [-s step]   with these types:
-d: iterate over dates, relative to now (start and end are integers)
-D: iterate over absolute dates (start and end are YYYYMMDD)
-e: enumerate over integers (may occur multiple times - use %e...%h)
-E: enumerate over characters (may occur multiple times - use %E...%H)

All patterns following the startURL may contain everything that makes up a perl
regex. With multiple patterns, wgrab gets the startURL document, extracts all
quoted strings, tries to match the first pattern, interprets the results as
URLs, downloads them, applies the next pattern to their content... and saves
the documents downloaded via the last pattern.
 
If a pattern contains "(" and ")", it is left unmodified. Otherwise a new
pattern is constructed: ["']([^"']*yourpattern)["'] (details of the prefix
depend on the -H, -R and -S options). Either way, perl's $1 [the matched part
in '()'] is what wgrab goes on with.

Patterns starting with "+" are recursed into on the same level and applied to
their own result again (this way you can easily iterate through "next" links)
 
additional regexp shorthand: \I (image) expands to "[gGjJpP][iIpPnN][eE]?[fFgG]"
(i.e. gif or jpg or jpeg or png - with arbitrary mixes of upper/lower case).
 
saveAs and all the patterns may contain %[[+-]number[.fillchar]]X
('+' truncates from the right and '-' truncates from the left) where X is one of
d: Day of Month         m: Month of Year
y: Year (4 digit)       D: Day of Week [0..6]
w: Day of Week name lowercase   W: Day of Week name Uppercase
n: Month name lowercase         N: Month name Uppercase
T: That day (shorthand for %4y%2m%2d)
t: today (shorthand for %4y%2m%2d with current date)
e/f/g/h: counter 1..4 (depends on number of -e/-E parameters)
E/F/G/H: chr(counter 1..4) (depends on number of -e/-E parameters)
=: referenced filename in -a or -P: URL split at "/" (from right: %1= = filename)
i: counter for saved files in saveAs
       ...better look at the examples with "-hX" to see how %substitution works

Tricks with the saveAs pattern (-a):
%.= flattens the filename, replacing '/' in the url with a '.' in the filename.
use "-" for the saveAs pattern to print all results to stdout.
if the saveAs pattern starts with a "|", it is executed as a shell command.
(e.g. mail each document to yourself with -a '|mutt -s "%=" myself@mail.edu')
 
LICENSE: use, modify and redistribute as much as you want - as long as credit
to me (hellweg@snark.de) remains in the docs. No warranty that wgrab does or
does not perform in any specific or predictable way...
It would be nice (not a condition of the license) if you told me about bugs
you encounter (maybe even with a fix) and/or the uses you find for wgrab.
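
To make the auto-pattern construction described in the help a bit more concrete, here is a minimal Perl sketch of how a pattern without "()" could be wrapped and matched. This is only an illustration with made-up variable names and a made-up HTML snippet (the pattern itself is the one from the dilbert example further down) - it is not wgrab's own code:

    use strict;
    use warnings;

    # the \I shorthand from the help: gif, jpg, jpeg or png in any case mix
    my $I = '[gGjJpP][iIpPnN][eE]?[fFgG]';

    # a user pattern without "(" and ")"
    my $userpattern = qr{/archive/images/dilbert[^"]+$I};

    # wrap it the way the help describes, so the quoted string ends up in $1
    my $auto = qr{["']([^"']*$userpattern)["']};

    my $html = q{<a href="/archive/images/dilbert20011031.gif">today</a>};  # made-up snippet
    if ($html =~ $auto) {
        print "would follow: $1\n";   # prints /archive/images/dilbert20011031.gif
    }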

The % patterns work a bit (but only a bit) like in printf - and a bit more (but still only a bit) like the '+FORMAT' mechanism of the GNU date tool...

Here are some examples (assuming, the current month is September):

wgrab -p '%m' => 9
wgrab -p '%2m' => 09   (left side padding to length 2 - 0 is the default fillchar)
wgrab -p '%3.#m' => ##9 (padding with a specific char)
wgrab -p '%-3.#m' => 9## (padding on the right)
wgrab -p '%N' => September
wgrab -p '%3N' => Sep (truncating on the right)
wgrab -p '%-3N' => ber (truncating on the left)
wgrab -p '%15._N' => ______September (padding on the left again)
If you need a real '%' sign - and wgrab might confuse it with a substitution pattern like %2d - escape it with a backslash (\%) or a second '%' (%%).
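
If the padding and truncation rules are easier to read as code, here is a small Perl sketch that reproduces the examples above. It reflects my reading of the rules, not wgrab's implementation, and the function name fmt is invented:

    use strict;
    use warnings;

    # format a value according to a spec like "2", "3.#", "-3" or "15._"
    sub fmt {
        my ($value, $spec) = @_;
        my ($minus, $width, $fill) = $spec =~ /^([+-]?)(\d+)(?:\.(.))?$/;
        $fill = '0' unless defined $fill;               # '0' is the default fillchar
        if (length($value) > $width) {                  # too long: truncate
            return $minus eq '-' ? substr($value, -$width)      # '-': keep the right end
                                 : substr($value, 0, $width);   # default: keep the left end
        }
        my $pad = $fill x ($width - length $value);
        return $minus eq '-' ? $value . $pad            # '-': pad on the right
                             : $pad . $value;           # default: pad on the left
    }

    print fmt(9, '2'),              "\n";   # 09
    print fmt(9, '3.#'),            "\n";   # ##9
    print fmt(9, '-3.#'),           "\n";   # 9##
    print fmt('September', '3'),    "\n";   # Sep
    print fmt('September', '-3'),   "\n";   # ber
    print fmt('September', '15._'), "\n";   # ______September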

Here are some examples of what wgrab can get you (these work today [20 Dec. 2000] - if the sites reorganize or change their naming scheme, you will have to find your own patterns):

  • save the last 4 weeks of dilbert with ungarbled names:
    wgrab -v -v -d -1 -28 -a ./dilbert%T.gif http://www.dilbert.com/comics/dilbert/archive/dilbert-%T.html '/archive/images/dilbert[^"]+\I'
  • get the whole SinFest archive:
    wgrab -v -v -n http://sinfest.net/strips_page.htm '\d+.html' 'sf\d+.\I'
  • get all my german translations of Dela from this very site:
    wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/archiv.html '\d.html' '\d.\I'
  • or get the same Dela translations by iterating through the "next" buttons with a '+' pattern
    (useful for sites without a central archive page)
    wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/first.html '+\d.html' '\d.\I'
    I am sure you will find many more...
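
To illustrate the saveAs pattern '%3i.%1=' used in the Dela examples: %3i is the running counter of saved files padded to three digits, and %1= is the last '/'-separated part of the URL. A rough Perl sketch of how such a name could be assembled (the URL and counter value are invented; this is only an illustration, not wgrab's code):

    use strict;
    use warnings;

    my $url     = 'http://www.snark.de/dela/5.gif';   # hypothetical downloaded URL
    my $counter = 7;                                   # hypothetical save counter
    my @parts   = split m{/}, $url;                    # "%1=" is the last part
    my $name    = sprintf('%03d.%s', $counter, $parts[-1]);
    print "$name\n";                                   # 007.5.gif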

To use wgrab, you only need a Perl installation (for Windows machines get it from ActiveState; for other architectures see CPAN, the Comprehensive Perl Archive Network) and my script wgrab.
If you are running Windows (or your Perl installation sits somewhere other than /usr/bin/perl), you may have to type "perl wgrab ..." instead of just "wgrab ...".

Caveat: you'd better not use this script to load really enormous files - the current download is always read completely into memory before being written to disk (it's not the number of files that matters, just the size of the biggest). A few MB won't hurt, but I would not advise you to get the ISO CD-ROM images for your favourite Linux distro this way...

Have fun with it and tell me if you find it useful...
By the way: another useful (but much more specific) tool is my comix collector. And there are some other tools available too.



Author: Heiko Hellweg
last modified: 31. Oct. 2001