Distributed Builds with Ant and CruiseControl

Last Friday (Feb 10, 2006) I gave a presentation on the challenges we had with the build of one of our projects. Here are some of the details, with the names changed to protect the innocent. I’ll call this project “Project X”.

Here’s the problem:
The Project X build is a monster:

  • Check out, compile and deploy the code.
  • Run JUnit tests (4000 test files).
  • Run integration and regression tests (2000 tests).

Total build time: around 7 hours on one machine.

The Goal:

  1. Continuous Integration

The Challenge:

  1. Reduce the build time to something more reasonable.

How’d we do it?

  • Used CruiseControl to drive the build, capture and post the results.
  • Designed and built a distributed build mechanism using tasks from the ant-contrib package.

We didn’t do anything too unusual with CruiseControl. The interesting part was what we did with the ant-contrib tasks.

We had to bring the build time down. To do this we distributed the test suites among six build machines. Each machine is “told” to run a build and is passed all of the parameters it needs to do so. The build machines do their thing then report back the results to the CruiseControl machine by writing a results file onto a shared directory on the CruiseControl box.

The CruiseControl thread starts the build by sending a request to each build machine. Each request kicks off the build on that machine and returns immediately. The CruiseControl machine then spins in a loop, waiting for the results files to show up.

Here are the details:

The build starts from the Distributed Cruise Control Project. CruiseControl drives the show but the crux of it is this:

The default target in the project file iterates over a list of files in the pods directory, calling run_remote_build with the settings defined in each build machine's property file.
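The iteration described above could look roughly like the following. This is only a sketch using ant-contrib's <foreach> task; the project name, the pods.dir property, the *.properties naming and the pod.file parameter are assumptions for illustration, not taken from the actual project file.

```xml
<project name="distributed-cc" default="distribute" basedir=".">
	<taskdef resource="net/sf/antcontrib/antcontrib.properties"/>

	<!-- hypothetical location of the per-machine property files -->
	<property name="pods.dir" value="${basedir}/pods"/>

	<target name="distribute">
		<!-- call run_remote_build once for every property file in pods.dir;
		     the full path of the current file is passed in as pod.file -->
		<foreach target="run_remote_build" param="pod.file">
			<path>
				<fileset dir="${pods.dir}" includes="*.properties"/>
			</path>
		</foreach>
	</target>

	<target name="run_remote_build">
		<!-- load the settings for this one build machine -->
		<property file="${pod.file}"/>
		<echo message="dispatching build using ${pod.file}"/>
		<!-- the request to the build machine would be sent from here -->
	</target>
</project>
```

Because <foreach> invokes the target in a fresh property scope on every iteration, each build machine's settings can reuse the same property names without clashing.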

Before this happens we start up the antserver.xml file on each of the build machines. The details of the antserver.xml file are shown below:

<project name="antserver" default="run" basedir=".">
	<taskdef resource="net/sf/antcontrib/antcontrib.properties"/>

	<property name="root.dir" value="/cvsroot/projectX"/>

	<!-- default target: start the listener and wait for requests -->
	<target name="run">
		<antserver/>
	</target>

	<target name="run_build">
		<echo message="Starting build on ${build.machine} at ${build.date}"/>
		<echo message=" target: ${target}"/>
		<echo message=" logfile: ${log.file}"/>
		<!-- run inside a "forget" so that we can return immediately to the client -->
		<forget>
			<trycatch property="exception">
				<try>
					<echo message="get latest build.xml out of CVS"/>
					<exec executable="cvs" dir="${root.dir}" failonerror="true" resultproperty="cvs.update.result">
						<arg value="update"/>
						<arg value="-C"/>
						<arg value="-A"/>
						<arg value="build.xml"/>
					</exec>
					<echo message="calling ant"/>
					<!-- call as a separate ant call so that we can start a new log file each time -->
					<ant antfile="build.xml" dir="${root.dir}" target="${target}" output="${log.file}">
						<property name="new.current.build" value="true"/>
						<property name="server" value="true"/>
						<property name="serverName" value="localhost"/>
						<property name="build.date" value="${build.date}"/>
						<property name="cvs.update.date" value="${build.date}"/>
						<property name="max.wait.time" value="${max.wait.time}"/>
						<property name="max.wait.unit" value="${max.wait.unit}"/>
					</ant>
					<echo message="back from ant call"/>
					<echo message="cvs.update.result ${cvs.update.result}"/>
					<property name="pod.build.result" value="${build.success}"/>
					<echo message="${line.separator}${pod.build.result}${line.separator}" file="${log.file}" append="true"/>
				</try>
				<catch>
					<echo message="${build.failure}, Exception: ${exception}"/>
					<echo message="${line.separator}${build.failure}, Exception: ${exception}${line.separator}" file="${log.file}" append="true"/>
					<property name="pod.build.result" value="${build.failure}"/>
				</catch>
				<finally>
					<!-- always report back, whatever the outcome of the build -->
					<echo message="name=${build.machine}${line.separator}build.result=${pod.build.result}${line.separator}" file="${semaphore.file}" append="true"/>
					<echo message="build.result=${pod.build.result}${line.separator}" file="${semaphore.file}" append="true"/>
					<tstamp>
						<format property="end.time" pattern="yyyy-MM-dd HH:mm:ss z"/>
					</tstamp>
					<echo message="Build complete on ${build.machine} at ${end.time}" />
					<echo message="${line.separator}Build complete on ${build.machine} at ${end.time}" file="${log.file}" append="true"/>
				</finally>
			</trycatch>
		</forget>
	</target>

	<target name="tag_build">
		<ant dir="${root.dir}" antfile="build.xml" target="tag_build">
			<property name="successful.build.tag" value="${successful.build.tag}"/>
			<property name="build.tag" value="${build.tag}"/>
			<property name="new.current.build" value="true"/>
		</ant>
		<echo message="Tagging build ${build.tag} complete"/>
	</target>
</project>


The key item in this file is the <antserver> tag. Because run is the default target, the listener starts as soon as the file is run, then waits for requests to come in from the CruiseControl machine. The CruiseControl machine sends those requests using the <remoteant> call. See the project file for an example of using <remoteant>.

The <remoteant> request defines what target we actually want to call on the remote machine via the <antserver> element in the remote ant script.
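For readers without access to the project file, here is a rough sketch of what such a request might look like with ant-contrib's <remoteant> task and its nested <runtarget> element. The property names mirror those used in antserver.xml above. The port (17000 is, as far as I know, the ant-contrib default) and the assumption that all of these properties are already set on the CruiseControl side are mine.

```xml
<target name="run_remote_build">
	<!-- build.machine, build.date, log.file and semaphore.file are assumed
	     to be set already, e.g. loaded from the machine's property file -->
	<remoteant machine="${build.machine}" port="17000">
		<!-- ask the remote antserver to run its run_build target -->
		<runtarget target="run_build">
			<property name="build.machine" value="${build.machine}"/>
			<property name="build.date" value="${build.date}"/>
			<property name="target" value="distributed_build"/>
			<property name="log.file" value="${log.file}"/>
			<property name="semaphore.file" value="${semaphore.file}"/>
			<property name="max.wait.time" value="${max.wait.time}"/>
			<property name="max.wait.unit" value="${max.wait.unit}"/>
		</runtarget>
	</remoteant>
</target>
```

Properties nested in <runtarget> are the only ones the remote target sees, which is why every parameter the build machine needs has to be passed explicitly.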

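Sequence between the CruiseControl machine and a build machine:

  1. Build machine: starts antserver.
  2. CruiseControl: remoteant call specifying run_remote_build as target.
  3. Build machine: antserver receives the request and forwards it to the run_remote_build target.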

The run_build target in the antserver.xml file makes the call to the build.xml file in the target project itself. This call is made from inside a <forget> task so that control is returned immediately back to the CruiseControl machine. The distributed_build target in the build.xml file is interesting because it must ensure that regardless of the outcome of the build it reports back to the CruiseControl machine with its build results. The target snippet is shown below:

<target name="distributed_build" depends="init, clean, cvs_update_for_distributed_build, deploy_local, clean_weblogic_log, clean_amakihi_output">
	<trycatch property="exception">
		<try>
			<!-- give up if the whole run takes longer than the allowed time -->
			<limit maxwait="${max.wait.time}" maxwaitunit="${max.wait.unit}" property="time.out" failonerror="true">
				<antcall target="loadDatabase"/>
				<if>
					<equals arg1="${run.unit.tests}" arg2="true"/>
					<then>
						<antcall target="run_microtests_localhost"/>
						<antcall target="run_macrotests_localhost"/>
					</then>
				</if>
				<antcall target="restart_mocks"/>
				<antcall target="start_weblogic"/>
				<antcall target="ensureAppServerStartedIfServer"/>
				<if>
					<equals arg1="${run.unit.tests}" arg2="true"/>
					<then>
						<antcall target="run_ejb_integration_tests_localhost"/>
					</then>
				</if>
				<ant antfile="client-integration-tests.xml" dir="${basedir}" target="run_client_iterative_integration_test">
					<property name="dbHostName" value="${env.COMPUTERNAME}"/>
					<property name="dbHostPort" value="1526"/>
					<property name="dbServerName" value="${env.INFORMIXSERVER}" />
					<property name="control.file" value="${distributed.build.tests.dir}/${distributed.build.control.file}"/>
					<property name="sleepIntervalInMillis" value="0"/>
				</ant>
			</limit>
		</try>
		<catch>
			<property name="build.failed" value="true"/>
		</catch>
		<finally>
			<!-- always shut the servers down, pass or fail -->
			<antcall target="shutdown_mocks_and_weblogic"/>
		</finally>
	</trycatch>
	<fail if="build.failed" message="${exception}"/>
</target>

Also we have to make sure that each machine involved in the build gets the same version of the code. To do this the build machines check the code out of CVS by specifying a date rather than just getting the latest code off the HEAD (or the branch tag if running a branch build). See below for details on how this is done.

<target name="cvs_update_for_date" depends="init, cvspass" if="new.current.build">
	<echo message="updating from CVS with date: ${cvs.update.date}"/>
	<cvs cvsRoot="${cvs.root}"
		 passfile="d:cvslocal.cvspass" command="update -P -d -C -A -D '${cvs.update.date}'" failonerror="true"/>
</target>
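For the date to be identical everywhere it has to be generated exactly once and then handed to every build machine. One way to do that on the CruiseControl side is Ant's core <tstamp> task, sketched below. The target name, the date pattern and the choice of GMT are assumptions; the source only tells us that a build.date value is passed along and used as cvs.update.date.

```xml
<target name="stamp_build_date">
	<!-- take one timestamp on the CruiseControl machine; every build
	     machine later receives this same value as build.date -->
	<tstamp>
		<format property="build.date" pattern="yyyy-MM-dd HH:mm:ss 'GMT'" timezone="GMT"/>
	</tstamp>
	<echo message="all build machines will update to: ${build.date}"/>
</target>
```

Spelling out the time zone in the value avoids each build machine interpreting the date in its own local zone.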

The build results are written back to the CruiseControl host as a Java properties file. An example is shown below:

end.time=2006-02-13 17:15:33 EST

We currently use six build machines per build. Each machine runs a unique set of tests. The CruiseControl machine waits to hear back (via the results files) from all six machines before posting the results of the build. The results from the machines are “anded” together to give the overall result.
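The wait-and-combine step can be sketched with core Ant tasks alone. The version below shows two machines for brevity. The results.dir property, the pod1/pod2 file names, the 90 minute limit and the target name are hypothetical; the build.result key and ${build.success} come from the antserver.xml listing above.

```xml
<target name="collect_results">
	<!-- poll until every results file exists, or give up after maxwait -->
	<waitfor maxwait="90" maxwaitunit="minute" checkevery="30" checkeveryunit="second"
	         timeoutproperty="results.timed.out">
		<and>
			<available file="${results.dir}/pod1.properties"/>
			<available file="${results.dir}/pod2.properties"/>
		</and>
	</waitfor>
	<fail if="results.timed.out" message="Timed out waiting for build machine results"/>

	<!-- load each file under its own prefix so the keys do not collide -->
	<property file="${results.dir}/pod1.properties" prefix="pod1"/>
	<property file="${results.dir}/pod2.properties" prefix="pod2"/>

	<!-- the overall build passes only if every machine reported success -->
	<condition property="all.pods.passed">
		<and>
			<equals arg1="${pod1.build.result}" arg2="${build.success}"/>
			<equals arg1="${pod2.build.result}" arg2="${build.success}"/>
		</and>
	</condition>
	<fail unless="all.pods.passed" message="At least one build machine reported a failure"/>
</target>
```

A machine that never reports is treated the same as a failure here, since the timeout fails the build before any results are combined.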

By distributing the tests amongst six machines running in parallel we’re able to get our total build time down to around 60 minutes.
