Validation Graphs

Validation Graphs are a method for visualizing a website's structure and the HTML validation status of its pages.
Based on my PageGraphGUI, I wrote a new tool that spiders a website and validates its HTML pages. The results are visualized as a graph, which is laid out by a real physics particle engine.
Again, it uses some free Java libraries and some code snippets from the website-as-graph applet by aharef.info. The HTML content is validated with the publicly available HTML validator from the W3C.

How it looks

The program's output for my blog looks like this: The color of each node represents the node's file type: JSP pages, images, plain text, office, CSS, JavaScript, ASP, PHP, PS, HTML, zip, Perl, XML, and all others.
Pages that are delivered with content-type text/html are validated by the W3C validator, and the validation result is shown as a red or green outer circle around the node, where red means that the page contains validation errors. Yellow circles denote URLs that returned an HTTP error, e.g. 404 (Not Found). This is usually the result of broken links. The console output of the program lists all invalid pages and those with HTTP errors.
Nodes can be dragged and dropped around.

How it works

Starting from the URL given by the user, a HEAD request is sent for each new URL. The response to this request contains the content-type and the HTTP status of the URL. If the content-type is text/html and the status code does not indicate an error (no 404 etc.), the URL is fetched with a GET request and the received HTML file is parsed. All outbound links found in the HTML file go through the same procedure. If the status code indicates an error (e.g. 404) or the content-type is not text/html, the page is not fetched with GET. After all links have been extracted, the current page is passed to the validator thread. By default, the link parser skips URLs whose server name differs from the server name in the start URL, to avoid spidering the whole web. The default search depth is 3. These settings can be changed in the ValidationGraph.properties file.
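
The decision rules above can be sketched in a few lines of Java. This is only an illustration and not the code from the project: the class and method names are made up, treating every status code from 400 upwards as an error is my assumption, and so is counting the start page as depth 0. It also uses newer Java than the 1.5 runtime the tool itself needs.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper; names are illustrative and not taken from the tool's sources. */
public class CrawlDecision {

    /** True if the HEAD response justifies a GET: no error status and an HTML content type. */
    static boolean shouldFetch(int statusCode, String contentType) {
        // Assumption: any status of 400 or above counts as an error.
        if (statusCode >= 400 || contentType == null) {
            return false;
        }
        // The header may carry parameters, e.g. "text/html; charset=UTF-8".
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }

    /** True if the link points to the same server name as the start URL. */
    static boolean sameHost(String startUrl, String linkUrl) {
        String startHost = URI.create(startUrl).getHost();
        String linkHost = URI.create(linkUrl).getHost();
        return startHost != null && startHost.equalsIgnoreCase(linkHost);
    }

    /** Applies the depth limit and, if enabled, the same-host restriction. */
    static boolean shouldFollow(String startUrl, String linkUrl,
                                int depth, int maxDepth, boolean stayInHost) {
        // Assumption: the start page has depth 0 and links up to maxDepth are followed.
        if (depth > maxDepth) {
            return false;
        }
        return !stayInHost || sameHost(startUrl, linkUrl);
    }
}
```

With these rules, an image or a broken link still ends up as a node in the graph (the HEAD request is enough for that), but only HTML pages on the same server are downloaded and searched for further links.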

It is not a graph, it is a TREE

Yes it is a tree, but trees are also graphs, just without cycles. ;-)
Actually it is a rooted DAG (Directed Acyclic Graph) that shows a spanning tree of the website: the tree's root is the page given by the user, and the parent of each node is the first page seen that contains a link to it.
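
To make the "first node seen" rule concrete, here is a minimal sketch that builds such a spanning tree from a link map. It is not taken from the project: the class name and the in-memory link map are invented for the example, and I assume a breadth-first visiting order here, while the real spider may discover pages in a different order.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch; the link map stands in for the links the spider would extract. */
public class SpanningTreeSketch {

    /** Returns a map from each reachable page to its parent; the root maps to null. */
    static Map<String, String> spanningTree(String root, Map<String, List<String>> links) {
        Map<String, String> parent = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        parent.put(root, null);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String target : links.getOrDefault(page, List.of())) {
                // Only the first page that links to a target becomes its parent,
                // so links back to known pages never create a cycle.
                if (!parent.containsKey(target)) {
                    parent.put(target, page);
                    queue.add(target);
                }
            }
        }
        return parent;
    }
}
```

Every page gets exactly one parent, which is why the picture is a tree even though the links of a real website are full of cycles.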

Download

I just gzipped my complete Eclipse project directory with all sources etc. It already contains the file validationgraph.jar, which can be run directly with java -jar validationgraph.jar. If you want to run it anywhere else, do not forget to put the jar files contained in the lib directory on the classpath. You need the Java 1.5 Runtime installed.
Download sources.

All in One Jar

For ease of use, I created a jar file that already contains all required libraries. It can be run anywhere with java -jar validationgraph.jar.
Download jar file.

How To ...

... run it.

When running the program, you can override several default properties that affect the program's behaviour.
These are the defaults:
validationgraph.validationEnabled=true
validationgraph.maxDepth=3
validationgraph.maxValidators=3
validationgraph.stayInHost=true
If you want to modify these, just add a system property to the java call:
java -Dvalidationgraph.maxDepth=5 -jar validationgraph.jar
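
For the curious, this is roughly how a Java program can pick up such -D settings with the defaults listed above as fallbacks. It is a sketch under my own naming (SettingsSketch and its methods are made up for this example), not a copy of the actual code.

```java
/** Hypothetical sketch of reading the documented properties with their defaults. */
public class SettingsSketch {

    static final String PREFIX = "validationgraph.";

    static boolean validationEnabled() {
        return Boolean.parseBoolean(System.getProperty(PREFIX + "validationEnabled", "true"));
    }

    static int maxDepth() {
        // Integer.getInteger falls back to the default if the property is missing or not a number.
        return Integer.getInteger(PREFIX + "maxDepth", 3);
    }

    static int maxValidators() {
        return Integer.getInteger(PREFIX + "maxValidators", 3);
    }

    static boolean stayInHost() {
        return Boolean.parseBoolean(System.getProperty(PREFIX + "stayInHost", "true"));
    }

    public static void main(String[] args) {
        System.out.println("validationEnabled=" + validationEnabled());
        System.out.println("maxDepth=" + maxDepth());
        System.out.println("maxValidators=" + maxValidators());
        System.out.println("stayInHost=" + stayInHost());
    }
}
```

Several properties can be combined in one call by repeating the -D option, as long as they all come before -jar.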

... use it.

In the GUI, just enter the URL and hit the Start button. After a few seconds the first nodes will appear. More and more nodes are added until the maximum search depth is reached or no more new links are found. To take a screenshot of the current state, just hit Save. The image will be saved as validationgraphX.png (with X >= 0) in the current directory.
Peter | 2006-06-11