Webxref 0.3.5

Webxref is a Perl5 program to quickly check links in your web documents. Webxref is intended to be easy to use, without any configuration. To check links in index.html and report errors simply call:

  webxref index.html

Webxref makes cross references from an HTML document and the HTML documents linked from it. I.e. the links found in that document are checked for missing files, then the links in the linked documents are checked, and so on.
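
The link following described above can be sketched in Python as a breadth-first walk over local files. This is only an illustration under stated assumptions, not webxref's actual Perl code: the names LinkCollector and crawl are hypothetical, and the sketch only follows relative href/src links to local files.

```python
import os
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start):
    """Follow local links from `start`; return (checked, missing) sets."""
    checked, missing = set(), set()
    queue = deque([os.path.normpath(start)])
    while queue:
        path = queue.popleft()
        if path in checked or path in missing:
            continue
        if not os.path.isfile(path):
            missing.add(path)
            continue
        checked.add(path)
        if not path.endswith((".html", ".htm")):
            continue  # only HTML files are parsed for further links
        parser = LinkCollector()
        with open(path, encoding="utf-8", errors="replace") as handle:
            parser.feed(handle.read())
        for link in parser.links:
            # Assumption: skip links with a scheme (http:, mailto:),
            # root-relative links and anchor-only links.
            if ":" in link or link.startswith(("/", "#")):
                continue
            target = link.split("#", 1)[0]
            queue.append(os.path.normpath(
                os.path.join(os.path.dirname(path), target)))
    return checked, missing
```

Calling crawl("index.html") returns the set of files that were found and the set of link targets that were not.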

A development version (0.3.5) is now available, with new features and all the goodness of treesed included. Use this with caution!

Usage: webxref  -help/-h -noxref -xref/-x -fluff -htmlonly
                -nohttp -delay seconds
                -silent/-s -verbose/-v -errors/-e
                -long/-l -brief -html
                -islocal <address> -avoid/-a <regexp>
                -one/-1 -depth <depth>
                -root/-r <rootdir> -fullpath
                -date <yymmdd> -time <hhmmss> -before -after
                -find <string> -findexpr <regexp>
                -replace <string> -replaceexpr <regexp> -by <string/expr>
                [-files/-f] file1 file2
                file.html

What the parameters do:
While checking webxref prints output according to:

-silent/-s	Only list files with problems at the end of the run.
-verbose/-v	Print information while checking files.
-errors/-e	Print errors when they occur, even when -silent.

Webxref generates a report according to:

-long		List all files found
-brief		Only list files with problems
-xref/-x	List which files reference which files (cross-references).
-html		List report in HTML form

Webxref inspects files/directories according to:

-fluff		List which files/directories are never used.
-htmlonly	Only inspect files with the .html/.htm extension.
-root rootdir	The server root where cgi-bin, icons etc reside
		default: the directory where webxref is called.
		Links like <a href=/index.html> are looked for
		in the rootdir directory.
-fullpath	Print full-length filenames, e.g. /u/people/rick/www.html
-islocal url	Treat links to url (e.g. 'www.mymachine.nl') as local file references.
-avoid regexp	Do not inspect files with names matching regexp.
-depth number	The maximum directory nesting level.
		0 means: current directory only,
		1 means: also directories one level below the current directory.
		100 probably means there is no restriction in
		how deep webxref is allowed to find files.
-one/-1		Specify -one if you just want to check the links
		from the given file(s) and no further link following.
-nohttp		Do not check external URLs via the network.
-delay seconds  Wait the specified number of seconds between HTTP checks
-date -time	Date [yymm<dd>], time [hhmm<ss>].
-before -after	List files that are modified before or after
		the date/time given with -date and -time.
-files/-f files	If you want webxref to test a series of files
		use the -files parameter, else simply list the
		file to test last.

Default is: 
- While checking webxref prints only a '+' or '-' for each file checked.
- Short reports, that is only errors are listed. 
- No cross-references
- No fluff detection
- Root being the current directory webxref is called in.
- HTTP:// URLs are checked.

Which parameters to use for what purpose:
By default webxref checks the given file and follows the links in that file. While working it lets you know it's alive by printing a '+' for each file checked OK, and a '-' for each file with a problem. Like this:

$ webxref index.html
Webxref version 0.3.3, 07-MAR-97 by Rick Jansen (rja@euronet.nl)
+++++++++++++++++++++++++++++++++++++-++++++++-+++
++++++--++++++++++

A webxref run can take some time. You can, however, interrupt webxref with ctrl-c (Unix). Webxref will report on the files it has inspected up to that moment and exit. (*New!*) Note: this is not reliable on all systems.
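
The progress marks and the ctrl-c behaviour described above can be sketched in Python like this. It is a rough illustration only: run_checks and its check argument are hypothetical names, and how webxref itself handles the interrupt is not shown here.

```python
import sys


def run_checks(paths, check, out=sys.stdout):
    """Check each path, printing '+' or '-'; stop early on Ctrl-C."""
    results = {}
    try:
        for path in paths:
            ok = check(path)
            results[path] = ok
            out.write("+" if ok else "-")
            out.flush()
    except KeyboardInterrupt:
        pass  # fall through and report on what was inspected so far
    out.write("\n")
    return results
```

The dictionary that is returned holds only the files inspected up to the interrupt, so a report can still be made from it.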

When the whole site has been searched and all links have been inspected, webxref prints a report. By default only problems are reported. Specify -long to obtain a long report. Specify -html to get a report in HTML form.

If you want more information while webxref is working specify -verbose to get messages on every file or -errors to see only files with problems. With -silent webxref prints nothing at all while working.

If you need to know if there are files and/or directories in your site that are not referenced at all by any pages in your site specify -fluff
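
The idea behind fluff detection can be sketched in Python as a set difference between the files on disk and the files that were referenced. This is a simplification under assumptions: find_fluff is a hypothetical name, and the sketch walks one whole directory tree and reports files only, not directories.

```python
import os


def find_fluff(site_dir, referenced):
    """Return files under site_dir that are not in the `referenced` set."""
    referenced = {os.path.normpath(p) for p in referenced}
    fluff = []
    for dirpath, _dirnames, filenames in os.walk(site_dir):
        for name in filenames:
            path = os.path.normpath(os.path.join(dirpath, name))
            if path not in referenced:
                fluff.append(path)
    return sorted(fluff)
```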

If you want to only inspect files that really have the .html or .htm extension specify -htmlonly

References starting with a '/', like <a href=/icons/icon.gif> refer to the server "root" directory. Specify where this directory is with -root
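
How a link starting with '/' could be mapped to a file below the root directory is sketched below in Python. The function name resolve_link is hypothetical, and the example paths are only for illustration.

```python
import os


def resolve_link(link, current_file, rootdir="."):
    """Map a link found in current_file to a local path."""
    if link.startswith("/"):
        # Root-relative: look below the server root (the -root directory).
        return os.path.normpath(os.path.join(rootdir, link.lstrip("/")))
    # Otherwise relative to the directory of the referring file.
    return os.path.normpath(os.path.join(os.path.dirname(current_file), link))
```

With rootdir set to /u/webserver, the link /icons/icon.gif is looked for as /u/webserver/icons/icon.gif, wherever the referring file lives.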

If you use full URLs in your site referring to your own site, say "www.sara.nl" is your www-address and you use links like <a href=www.sara.nl/rick/index.html> then tell webxref that "www.sara.nl" actually can be found on the local machine with: -islocal 'www.sara.nl'
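
What -islocal amounts to can be sketched in Python as rewriting a full URL into a root-relative link. This is an assumption about the behaviour, not webxref's code, and localize is a hypothetical name.

```python
def localize(link, local_address):
    """Rewrite a link that points at local_address into a root-relative link."""
    for prefix in ("http://" + local_address, local_address):
        if link.startswith(prefix):
            rest = link[len(prefix):]
            # Only accept an exact host match, not e.g. 'www.sara.nl.other.org'.
            if rest == "" or rest.startswith("/"):
                return rest or "/"
    return link  # not a reference to the local site
```

The rewritten link starts with a '/' and can then be looked up below the root directory like any other root-relative link.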

If you want to avoid certain files use the -avoid parameter to specify which files to avoid.
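
Filtering with an -avoid expression can be pictured with this small Python sketch. The name filter_avoided is hypothetical, and matching anywhere in the path and case-sensitively are assumptions. Note that Python and Perl regular expressions differ in some details.

```python
import re


def filter_avoided(paths, avoid_pattern):
    """Drop paths whose name matches avoid_pattern (a regular expression)."""
    avoid = re.compile(avoid_pattern)
    return [p for p in paths if not avoid.search(p)]
```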

If you want to limit the number of files webxref inspects you may want to limit the scan to 1 or 2 directories deep in the file system. If you specify -depth 0 only files in the current directory are inspected.
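
One way to picture the depth limit is to count the directory levels between the starting directory and a file, as in this Python sketch. The names nesting_depth and within_depth are hypothetical, and files outside the starting directory are not treated specially here.

```python
import os


def nesting_depth(path, start_dir="."):
    """Number of directory levels between start_dir and the file at path."""
    rel = os.path.relpath(os.path.dirname(os.path.abspath(path)),
                          os.path.abspath(start_dir))
    if rel == os.curdir:
        return 0  # the file is in start_dir itself
    return len(rel.split(os.sep))


def within_depth(path, max_depth, start_dir="."):
    """True if the file may be inspected under the given depth limit."""
    return nesting_depth(path, start_dir) <= max_depth
```

With a limit of 0, index.html is inspected but docs/page.html is not.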

If you just want to check whether the links in a file are valid, specify -one (or -1). Only the links present in the file are tested; the links in the linked files are not followed. Combine this with -files to check just a given collection of files.

When all local files are inspected webxref goes out into the net to check if the http:// links work. This may be time-consuming. Specify -nohttp if you don't want that. To avoid overloading a webserver there is a delay of 1 second between checks. If you want longer or shorter delays specify the number of seconds with -delay. (Longer delays may be necessary if a lot of links refer to the same webserver.)
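
The delay between HTTP checks can be sketched in Python as follows. This is an illustration under assumptions: url_ok and check_urls are hypothetical names, and using a HEAD request is a choice made for the sketch, not necessarily what webxref does.

```python
import time
import urllib.error
import urllib.request


def url_ok(url, timeout=10):
    """Return True if an HTTP HEAD request for url succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def check_urls(urls, delay=1, check=url_ok, sleep=time.sleep):
    """Check each URL in turn, pausing `delay` seconds between checks."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # be gentle on the webserver
        results[url] = check(url)
    return results
```

The check and sleep arguments are there so the loop can be tried out without touching the network.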

To see if you have files or directories that were modified last before or after a certain date/time use: -before/-after -date yymmdd -time hhmmss. If -before is given files are reported that were modified before the date given, with -after files last modified after the date given are reported.
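
The date/time comparison can be sketched in Python like this. The names parse_stamp and modified_before are hypothetical, and padding a short time such as 1200 with zero seconds is an assumption.

```python
import os
from datetime import datetime


def parse_stamp(date, time="000000"):
    """Parse a yymmdd date and an hhmm or hhmmss time into a datetime."""
    time = time.ljust(6, "0")  # allow hhmm, as in '-time 1200'
    return datetime.strptime(date + time, "%y%m%d%H%M%S")


def modified_before(path, stamp):
    """True if the file was last modified before the given datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path)) < stamp
```

The -after case is the same comparison with the sign reversed.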

To tell webxref which files to inspect simply list the file or files at the end of the command, or use -files or -f

Webxref can search and even search-replace text, see below.

Find/replacement: ** EXPERT ONLY **
Webxref can scan your site for files containing certain text. To find fixed text use -find. To find text using e.g. wildcards use -findexpr. The Perl expression is matched against the text of the file under test. Take care not to have the shell interpret '*' and '/' by using appropriate quoting. Search is always case-insensitive. Webxref does search/replace beyond end-of-line, i.e. newlines are matched, and can even be inserted (use \n).
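
The difference between a fixed-text search and an expression search can be sketched in Python as below. This is a sketch with Python's re module, not webxref's Perl matching: file_matches is a hypothetical name, and letting '.' match newlines is one reading of "newlines are matched".

```python
import re


def file_matches(text, expr=None, literal=None):
    """Case-insensitive search in text; '.' also matches newlines."""
    # Fixed text is escaped so that characters like '.' have no special meaning.
    pattern = re.escape(literal) if literal is not None else expr
    return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None
```

So file_matches(text, literal="me.gif") looks for exactly that text, while file_matches(text, expr=r"<img .*\.gif") treats its argument as an expression.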

To replace text with something else use -replace and -replaceexpr and -by. The string or expression you specify with -replace or -replaceexpr is replaced by the string you specify with -by. In case of editing, a backup file with a random numeric extension is placed next to the resulting file. E.g. when index.html is edited there'll be a file "index.html.1234" or something similar. (DISCLAIMER: the author cannot be held responsible for any damage resulting from using the edit- or any other functions of webxref or indeed any software, hardware, chemical substance, imagined or real (or seeming to be real) effects or by-effects of anything, at all, whatsoever.)

-find string		report files containing the given string
-findexpr regexp	report files containing the given expression
-replace string		*REPLACE* string by the string given with -by
-replaceexpr regexp	*REPLACE* regexp by the string given with -by
-by string		replacement string (or regexp)
-nobackup		Not implemented on purpose.
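
Replacing text while keeping a backup can be sketched in Python as follows. This is an illustration only: replace_in_file is a hypothetical name, the replacement is always taken literally (an assumption), and an existing backup with the same random extension would be overwritten.

```python
import os
import random
import re
import shutil


def replace_in_file(path, old, new, is_regexp=False):
    """Replace `old` by `new` in the file; keep a backup if it changed.

    Returns the backup path, or None when nothing matched.
    """
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    pattern = old if is_regexp else re.escape(old)
    # A function replacement keeps `new` literal (no backslash escapes).
    result, count = re.subn(pattern, lambda _m: new, text,
                            flags=re.IGNORECASE | re.DOTALL)
    if count == 0:
        return None
    # Backup with a random numeric extension, e.g. index.html.1234
    backup = "%s.%04d" % (path, random.randint(0, 9999))
    shutil.copy2(path, backup)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result)
    return backup
```

Files without a match are left alone and get no backup.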

Examples

webxref file.html
	Checks file.html and files/URLs referenced from file.html
webxref index.html another.html
	checks index.html and another.html
webxref -one index.html
	just check the links in index.html, don't follow the links
webxref -one *.html
	Check only the links in the html-files in the current dir.
webxref -depth 0 index.html
	Check index.html, but don't check files in directories
	that are deeper in the file system. 
webxref -nohttp file.html
	checks file.html, but not external URLs
webxref -htmlonly file.html
	checks file.html, but only inspects files with the .html/.htm extension
webxref -avoid '.*Archive.*' file.html
	checks file.html but avoids files with names containing
	'Archive'
webxref -avoid '.*Archive.*|.*Distribution.*' file.html
	Same as above, but also avoids files with names containing
	'Distribution'
webxref -islocal www.sara.nl
	Treat things like '<a href=http://www.sara.nl/rick' as a 
	local reference, as if it would have been '<a href=/rick'
webxref -root /u/webserver/ index.html
	Links to things starting with a slash, like /cgi-bin, /icons etc
	are now looked for in /u/webserver/, the directory your webserver
	knows as the 'root'
webxref -fluff index.html
	Checks index.html and reports files in the directories 
	encountered that were not referenced by index.html or any 
	file linked to from there.
webxref -silent index.html
	Just report problems at the end of the run. This may take
	a while with a big website.
webxref -silent -errors index.html
	Prints only problems while scanning, and the final report.
webxref -verbose index.html
	Prints a message for every file under test.
webxref -long -silent index.html
	Does not print anything while scanning, but generates a
	long report, i.e. lists every file encountered.
webxref -before -date 970823 -time 1200 index.html
	Reports files last modified before August 23rd 1997, 12:00
webxref -find 'me.gif' index.html
	Reports a list of pages containing the text 'me.gif'
webxref -findexpr '<img .*\.gif' index.html
	Reports files containing links to gif files.
webxref -replace 'me' -by 'you' -one index.html
	Replace 'me' by 'you' in index.html one-ly.

Output
When ready, webxref prints a list (with direct and indirect references) of:

OK                                               Failed
html files                                       files that can't be found
directories                                      files that are not world readable
named anchors                                    directories that can't be found
mailto's                                         files that can't be found
news                                             files that are not world readable
ftp                                              directories that can't be found
telnet                                           name anchors that can't be found
gopher                                           files and directories never actually used
external URLs
cgi-bin scripts
file:'s
files older/younger than a certain date/time
files whose content matched the find parameter
files in which text was search-replaced
http:// ok references                            http:// failed references

Get Webxref

Get webxref 0.3.5 (56386 bytes)
Version 0.3.5 is available as a Perl 5 development version:
Last change: March 13th 1997

Feedback is very welcome: mail rja@euronet.nl

Other link checkers and validation tools:

WWW on Yahoo
Perl WWW utilities
Weblint - a perl script which picks fluff off html pages

Webxref written 1995 by Rick Jansen (rja@euronet.nl)