Testing HTML

So, further to my previous entry about standard HTML, I wanted to talk about another approach to writing standards-compliant HTML. It’s all well and good to use tools like the W3C Validator to check that a page is standards-compliant, but that does nothing to stop the people who later change those pages from breaking that compliance.

Oh, sure, we have Important Conversations about the importance of agreeing on a particular HTML DocType, but in practice people forget, or a new team member joins and is never told.

So, taking a cue from JUnit, I created a test case that I added to my test suite to validate my HTML. My unit test finds all the .html files in my web project and checks them for violations of the HTML 4.01 Strict rules. If anyone changes the .html I’ve carefully written (and validated) and then runs the tests, the test will fail if the HTML no longer validates.

The test case works like this:

public void testHTMLValidity() throws Exception {
  File webRoot = getWebRoot();
  validateAllFiles(webRoot);
}

First, I need to find the root directory under which I put all my .html files. Since we use Eclipse and MyEclipse, that’s generally a single folder in the web project (e.g. myprojectWeb) called “WebRoot/”.
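The post does not show getWebRoot(), so here is a minimal sketch of one way it could work. The WebRootLocator class, the projectDir parameter, and the fail-fast check are all assumptions for illustration; only the “WebRoot” folder name comes from the text above.

```java
import java.io.File;
import java.io.FileNotFoundException;

/** Hypothetical helper; not the original getWebRoot() implementation. */
class WebRootLocator {
  /** Resolves projectDir/WebRoot and fails fast if it is not a directory. */
  static File getWebRoot(File projectDir) throws FileNotFoundException {
    File webRoot = new File(projectDir, "WebRoot");
    // File.listFiles() returns null for a missing directory, so without
    // this check a wrong path could make the test pass without checking
    // a single file.
    if (!webRoot.isDirectory()) {
      throw new FileNotFoundException(
          "Web root not found: " + webRoot.getAbsolutePath());
    }
    return webRoot;
  }
}
```

Failing loudly here seems worth the extra lines: a test that silently validates zero files gives the same green bar as one that validates them all.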

Next, we select all the .html files in the directory and validate them one at a time. Then we find all the subfolders and recursively apply the same process to each of them.

private void validateAllFiles(File directory) throws IOException {
  // Validate every .html file directly inside this directory.
  File[] files = directory.listFiles(new HTMLFileFilter());
  for (int i = 0, length = (files == null ? 0 : files.length);
      i < length; i++) {
    validate(files[i]);
  }
  // Then recurse into each subdirectory.
  File[] directories = directory.listFiles(new FileFilter() {
    public boolean accept(File pathname) {
      return pathname.isDirectory();
    }
  });
  for (int i = 0, length = (directories == null ? 0 : directories.length);
      i < length; i++) {
    validateAllFiles(directories[i]);
  }
}
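The HTMLFileFilter passed to listFiles() is not shown in the post. A minimal sketch, assuming it implements java.io.FileFilter and matches on the .html extension regardless of case, might look like this:

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

/**
 * Hypothetical sketch of HTMLFileFilter: accepts regular files whose
 * names end in ".html", ignoring case. The real class may differ.
 */
class HTMLFileFilter implements FileFilter {
  @Override
  public boolean accept(File pathname) {
    // Directories are handled by the separate recursive pass, so a
    // folder that happens to be named "something.html" is rejected here.
    if (!pathname.isFile()) {
      return false;
    }
    String name = pathname.getName().toLowerCase(Locale.ROOT);
    return name.endsWith(".html");
  }
}
```

Extending the check to .htm files would be a one-line change, if a project uses that extension too.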

Now, the validate() method is the trickiest thing. It requires that we be able to parse HTML and then look for validation errors. Parsing HTML is a well-known problem. There are two common open source HTML parsers available: JTidy and HTMLParser.

JTidy is, unfortunately, showing its age, since it hasn’t been updated since 2000, but I’ve used it a lot over the years, and I have a heavily customized version of that code. (It looks like the project is starting to reactivate, too.)

Here’s my validate method:

private void validate(File file) throws IOException {
  Node root = parse(file);

  assertDocType("doc type for "
    + getName(file)
    + " should be HTML 4.01 Strict", root, DocType.HTML_4_01_STRICT);
  Report report = this.tidy.getReport();

  if (report.getValidationErrorCount() > 0) {
    Message[] messages = report.getValidationErrors();
    for (int i = 0, length = (messages == null ? 0 : messages.length);
        i < length; i++) {
      // Assumed loop body (missing from the listing): print each error
      // so that the failure below can be diagnosed.
      System.err.println(getName(file) + ": " + messages[i]);
    }
    fail(getName(file) + " has validation errors");
  }
}

This test, therefore, allows me to ensure that all my HTML is HTML 4.01 Strict, and that it stays that way. That’s way more powerful than relying on developers to remember to validate their HTML.
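
For readers without a customized JTidy to hand, the doctype half of the check can be approximated without a parser. The sketch below is an assumption-laden stand-in for assertDocType(), not the method used above: DocTypeCheck and isStrict are hypothetical names, and it only inspects the first non-blank line, so it expects the public identifier to sit on the same line as <!DOCTYPE and does not skip leading comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Locale;

/** Hypothetical parser-free doctype check; a sketch, not a validator. */
class DocTypeCheck {
  // Public identifier used by the HTML 4.01 Strict DOCTYPE declaration.
  // The quotes are included so the Transitional and Frameset identifiers
  // (which continue after "4.01") do not match.
  static final String STRICT_PUBLIC_ID = "\"-//W3C//DTD HTML 4.01//EN\"";

  /** Returns true if the first non-blank line declares HTML 4.01 Strict. */
  static boolean isStrict(Reader source) throws IOException {
    BufferedReader reader = new BufferedReader(source);
    String line;
    while ((line = reader.readLine()) != null) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // skip leading blank lines
      }
      boolean isDocType =
          trimmed.toUpperCase(Locale.ROOT).startsWith("<!DOCTYPE");
      return isDocType && trimmed.contains(STRICT_PUBLIC_ID);
    }
    return false; // empty input: no doctype at all
  }
}
```

This only confirms what the page claims to be. Checking that the markup actually obeys the Strict rules still needs a real parser, which is what the JTidy report is for.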

There are some limitations, the most noteworthy of which is that I haven’t figured out a way to validate JSPs. Certainly, this same approach could be plugged into the output of an HttpUnit test, but how can we check the source file?
