It’s been a fun project. My first impulse for handling the web communication was to use httpunit. It makes things so easy. Want to click the “Send” button on the form named “mainform”? Just call webResponse.getFormWithName("mainform").getButton("Send").click(). Want to walk the DOM? Just call webResponse.getDOM() for a full DOM of the web page.
There were some annoyances, like the fact that httpunit doesn’t let you do anything a browser wouldn’t let you do (such as changing the value of hidden form parameters), but nothing a little reflection couldn’t get around.
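The reflection trick looks roughly like the following. This is only a minimal sketch: HiddenField is a made-up stand-in rather than an httpunit class, and the field name "value" is an assumption for illustration. The real httpunit internals are different.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a library class that refuses to change
// a hidden form parameter through its public API.
class HiddenField {
    private String value = "original";

    String getValue() {
        return value;
    }

    void setValue(String newValue) {
        throw new IllegalStateException("hidden parameters are read-only");
    }
}

class ReflectionSketch {
    // Writes a private field directly, bypassing the public setter.
    static void forceSet(Object target, String fieldName, Object newValue) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // lift the private access check
            field.set(target, newValue);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not set " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        HiddenField field = new HiddenField();
        forceSet(field, "value", "overridden");
        System.out.println(field.getValue());
    }
}
```

The obvious cost is that this depends on private field names, so it can break silently when the library is upgraded.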
There was unfortunately one limitation that was just too annoying. It was slow. It took an average of 4 seconds per page. After some profiling, I suspected that the slowness came from validating the DOM against an HTML DTD on every page, but I wasn’t able to confirm this.
Thankfully, I’d made the web access layer a Spring service so it was easy enough to swap out for another provider, and I turned my attention to Apache Commons httpclient.
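The reason the swap was cheap is that callers only ever see an interface. Here is a minimal sketch of that arrangement in plain Java, with the Spring wiring left out. PageFetcher, Scraper and both implementations are hypothetical names, and the implementations return placeholder strings instead of making real requests.

```java
// Hypothetical interface for the web access layer; callers see only this.
interface PageFetcher {
    String fetch(String url);
}

// Placeholder for an httpunit-backed provider (no real network call).
class HttpUnitFetcher implements PageFetcher {
    public String fetch(String url) {
        return "httpunit:" + url;
    }
}

// Placeholder for an httpclient-backed provider (no real network call).
class HttpClientFetcher implements PageFetcher {
    public String fetch(String url) {
        return "httpclient:" + url;
    }
}

// Application code depends on the interface, never on a provider.
class Scraper {
    private final PageFetcher fetcher;

    Scraper(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    String load(String url) {
        return fetcher.fetch(url);
    }
}

class SwapSketch {
    public static void main(String[] args) {
        // Swapping providers is a one-line change here; in Spring it
        // would be a change to the bean definition instead.
        Scraper scraper = new Scraper(new HttpClientFetcher());
        System.out.println(scraper.load("http://example.com/"));
    }
}
```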
Fast. Boy was it fast. My 4 second page calls were now down to sub-second, which was on par with a browser. But gone was all the nice sugar. I had to write my own wrappers for everything: finding forms, managing form parameters and so on. But it was worth it given the speed advantage. (Does anyone know if there is an httpclient wrapper out there that does this for me? I felt like I shouldn’t have had to write this stuff myself…)
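For a sense of what such wrappers involve, here is a minimal sketch of a form finder that works on a W3C DOM. It is not the code from this project: FormFinder and its method names are hypothetical, it assumes the page has already been parsed into an org.w3c.dom.Document, and it only looks at input elements (no select or textarea handling).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FormFinder {
    // Returns the first <form> whose name attribute matches, or null.
    static Element findForm(Document page, String formName) {
        NodeList forms = page.getElementsByTagName("form");
        for (int i = 0; i < forms.getLength(); i++) {
            Element form = (Element) forms.item(i);
            if (formName.equals(form.getAttribute("name"))) {
                return form;
            }
        }
        return null;
    }

    // Collects name/value pairs from the form's <input> elements,
    // hidden ones included, in document order.
    static Map<String, String> parameters(Element form) {
        Map<String, String> params = new LinkedHashMap<>();
        NodeList inputs = form.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            Element input = (Element) inputs.item(i);
            String name = input.getAttribute("name");
            if (!name.isEmpty()) { // unnamed inputs are never submitted
                params.put(name, input.getAttribute("value"));
            }
        }
        return params;
    }

    // Demo helper: parses a well-formed XHTML string into a DOM.
    static Document parse(String xhtml) {
        try {
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse page", e);
        }
    }

    public static void main(String[] args) {
        Document page = parse("<html><body><form name=\"mainform\">"
                + "<input type=\"hidden\" name=\"token\" value=\"abc\"/>"
                + "<input type=\"text\" name=\"q\" value=\"\"/>"
                + "</form></body></html>");
        Map<String, String> params = parameters(findForm(page, "mainform"));
        params.put("token", "changed"); // hidden values are just map entries here
        System.out.println(params);
    }
}
```

Because the parameters end up in an ordinary map, changing a hidden value needs no reflection. The map can then be handed to whatever code builds the POST request.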
The main challenge for working with httpclient was getting a decent (and fast) DOM out of the web page that came back. I settled on jtidy. It took some fiddling to get the parameters right so that it didn’t get bogged down in validating against a DOCTYPE. The following is what turned out to be the configuration that did the trick:
    tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    tidy.setXHTML(true);
    tidy.setDocType("omit");
    tidy.setNumEntities(true);
    tidy.setWraplen(0);
The first two settings stop it from logging to stdout. setXHTML(true) gives you the nice DOM. setDocType("omit") is the KEY to making it go fast. Omitting the doctype requires you to call setNumEntities(true) (otherwise it will trip up on entities like ™). I called setWraplen(0) because without it the line wrapping messed up my parsers of the output that came back.
From here, you just go:
    InputStream page = getMyPage();
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    tidy.parse(page, outputStream);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(outputStream.toByteArray());
    DocBuilder docBuilder = new DocBuilder();
    Document document = docBuilder.parse(inputStream);
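The DocBuilder class above is not shown in this post. As a stand-in, here is a minimal sketch of the last step using the JDK's own DocumentBuilder, fed with hand-written XHTML instead of real jtidy output. TidiedPageParser and its methods are hypothetical names. It also illustrates why numeric entities matter once the doctype is gone: with no DTD, an XML parser has no declaration for a named entity like &trade; and rejects the page, while the numeric form parses fine.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class TidiedPageParser {
    // Parses already-tidied XHTML into a W3C DOM.
    // Returns null if the input is not well-formed XML.
    static Document parseOrNull(InputStream tidied) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(tidied);
        } catch (Exception e) {
            return null;
        }
    }

    // Demo helper: wraps a string as a UTF-8 input stream.
    static InputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Numeric entity, no doctype: parses without needing any DTD.
        Document ok = parseOrNull(
                stream("<html><body><p>Acme&#8482;</p></body></html>"));
        System.out.println(ok != null);

        // Named entity, no doctype: the entity is undeclared, so parsing fails.
        Document bad = parseOrNull(
                stream("<html><body><p>Acme&trade;</p></body></html>"));
        System.out.println(bad == null);
    }
}
```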