The build system at work consisted of an Ant build for Java, various node and npm scripts, and a lot of bash to bundle our software. It did the job, but it was showing its age: no parallelization and no proper dependency management. It was time to migrate to something more robust and modern.
Context
The main backend code was written in Java and Scala. Ant was used as the build tool, and Ivy for dependency management. The frontend build and packaging were done with node and npm.
A Makefile was used as the entry point for the whole build, along with a bunch of shell scripts. Calling the Makefile with parallel execution (-j) made the whole build unstable and prone to crashing.
So the build was:
- orchestrated by a Makefile
- dependent on multiple tools: make, ant, ivy, sbt, node, npm, bash, python, …
- dependent on whatever versions of those tools happened to be present: it relied on the versions installed by the operating system, or on the versions developers had installed when they set up their environment. Specific versions were not really enforced
- no real caching: even if some parts could be cached a little bit, the build process had no real knowledge of what was done previously and what needed to be redone
- sequential and slow: all those tools ran their work sequentially, so the whole build was sequential too. It took about 15 minutes to download dependencies, compile, prepare stuff…
A new tool
Someone on the team started moving the build to a new tool, and he chose Gradle, which I had never used and had only heard about.
It provides:
- parallel builds
- better task and dependency management
- a single tool that can provide the same version of Java via toolchains, or of node and npm via plugins
- potential remote cache for CI to further speed up the build
- the build scripts are written in Groovy, which I am not fond of… we just need to live with it
Due to circumstances, the person who started the work left, and I was left alone with no real external support (see below). I nonetheless decided to continue the migration in my free time, about half a day per week.
Migration steps
As I was left mostly alone on the project, I decided on the following steps to ensure a smooth migration from one build system to another. It would be more work, but for such a task, safety and confidence come first:
- do not modify (or as little as possible) the original build: everything has to be done on the side
- as a consequence, both must live side by side without interacting with or impacting each other
- add the new build little by little, without interfering with my other main daily tasks
- the main entry point for the build was a Makefile: keep it with the same targets, so that developers and third-party scripts calling it would be unchanged. To that end, a new Makefile.gradle was introduced
- ability to roll back, or switch from one system to another easily. With a new Makefile.gradle, it would just be a matter of switching files
- ensure full binary compatibility for jars and other outputs between the two build systems
- add a new CI task in parallel to build with Gradle, while keeping the old system
Reproducibility
As mentioned above, I wanted to ensure binary compatibility between the legacy and the new Gradle build.
- make the Gradle output reproducible by itself:
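    tasks.withType(AbstractArchiveTask) {
        // use constant timestamps for archive entries instead of the files' own
        preserveFileTimestamps = false
        // add files in a stable order, independent of the file system
        reproducibleFileOrder = true
    }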
- write a script to compare the jar outputs. It needed to check the bytecode itself (just compare the two .class files), so that differing dates inside the jars would not show up as differences
- do the same thing for the frontend generated files
- use the CI to build with the two systems, then call the previous script to check for differences. Run the task daily
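The comparison script itself is not shown here. As a minimal sketch of the idea, assuming both builds produce a jar at a known path, the following Python compares the content of each .class entry instead of the jar bytes, so entry timestamps and ordering are ignored. The function names `class_digests` and `diff_jars` are hypothetical, not the original script.

```python
import hashlib
import zipfile


def class_digests(jar_path):
    """Map each .class entry in the jar to the SHA-256 of its bytes."""
    digests = {}
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith(".class"):
                digests[name] = hashlib.sha256(jar.read(name)).hexdigest()
    return digests


def diff_jars(legacy_jar, gradle_jar):
    """Return the sorted .class entries that are missing on one side
    or whose bytecode differs between the two jars."""
    legacy = class_digests(legacy_jar)
    gradle = class_digests(gradle_jar)
    return sorted(
        name
        for name in legacy.keys() | gradle.keys()
        if legacy.get(name) != gradle.get(name)
    )
```

An empty list from `diff_jars` means the compiled classes are identical. A CI job could fail whenever the list is non-empty and print the entry names to inspect.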
Example of differences in outputs between the two builds:
- Scala files were compiled with debug information on the legacy build. The -g option needed to be added to the Gradle build. I had to look at the compiled output to find this out. Even though it would not lead to a change in behavior, it gave me confidence in the approach
- checking the frontend build highlighted a few differences and some non-determinism in the legacy build system. This was fixed for both build systems
By providing Java via toolchains, and node and npm via plugins, the software versions used for the build could be pinned and no longer depended on the OS-provided versions. This made migrating node and Java versions much smoother, as developers and CI did not need to do anything: Gradle took care of checking the installation and downloading the required dependencies.
Unit tests
The legacy build used ant to create a jar with all test classes, then ran junit on this jar and exported the results as xml files.
At first, I wanted to keep exactly the same output, so I decided to build the same test jar using Gradle, and keep using ant to produce the test results.
This allowed me to compare the two outputs knowing that any difference could only come from the test jar itself and not from the step generating the outputs.
In the end, all tests were green, and the same number of tests was executed.
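Checking that both runs executed the same number of tests can be automated. This is a minimal sketch, assuming each run writes Ant-style TEST-*.xml reports whose root testsuite element carries tests, failures and errors attributes. The function names and directory layout are hypothetical.

```python
import glob
import os
import xml.etree.ElementTree as ET


def summarize(results_dir):
    """Sum tests, failures and errors over every TEST-*.xml report in a directory."""
    totals = {"tests": 0, "failures": 0, "errors": 0}
    for path in glob.glob(os.path.join(results_dir, "TEST-*.xml")):
        suite = ET.parse(path).getroot()  # assumed to be a <testsuite> element
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals


def same_results(legacy_dir, gradle_dir):
    """True when both runs report identical totals."""
    return summarize(legacy_dir) == summarize(gradle_dir)
```

Comparing totals only catches coarse differences such as a test class missing from one jar. A stricter check could compare the individual test case names as well.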
Some numbers
From the CI running the legacy build, and the new Gradle build, we could get some numbers.
| System | Legacy build | Gradle build (no cache) | Gradle build (with cache) |
|---|---|---|---|
| 2 Cores VM | 16 mins | 14 mins (no real gain here, due to core count) | 6 mins |
| 8 Cores VM | 14 mins | 7 mins | 3 mins |
| Local | 15 mins | 8 mins | 4 mins |
Build times were reduced by about 50% on a multi-core system, even without the cache. The gains are mainly due to Gradle parallelization: the more cores, the faster the build. On a system with few cores, Gradle cannot parallelize enough, so there was almost no gain without the cache.
For developers, depending on what was modified and on the use of the Gradle cache, a new build could take anywhere from a few seconds to a few minutes at most.
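To make the gains concrete, this small sketch computes the relative reduction for each row of the table above. The timings are copied from the table. The helper names are only illustrative.

```python
# Build times in minutes, copied from the table above:
# (legacy, Gradle without cache, Gradle with cache)
TIMINGS = {
    "2 Cores VM": (16, 14, 6),
    "8 Cores VM": (14, 7, 3),
    "Local": (15, 8, 4),
}


def reduction_percent(legacy, new):
    """Percentage of build time saved relative to the legacy build."""
    return 100.0 * (legacy - new) / legacy


def report(timings):
    """Return one formatted line per system with both reductions."""
    lines = []
    for system, (legacy, no_cache, cached) in timings.items():
        lines.append(
            f"{system}: {reduction_percent(legacy, no_cache):.1f}% without cache, "
            f"{reduction_percent(legacy, cached):.1f}% with cache"
        )
    return lines


print("\n".join(report(TIMINGS)))
```

Without the cache, the 8-core VM saves 50% and the local machine about 47%, while the 2-core VM saves only 12.5%. With the cache, every system saves more than 60%.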
Politics
As much as I hate politics, migrating the build system was a heavy political challenge, maybe more than the technical move itself. There was a lot of friction around introducing a new tool: why change a build that currently works? The gains were not obvious.
As with everything else, I decided to go step by step:
- I worked alone for a few weeks to understand the legacy build, prepare the project, learn Gradle and set things in place on the side, little by little
- once I had a working build and confidence in the build itself, I started talking about it with colleagues and management. They knew I was working on it, but the status was always a bit fuzzy as I only worked on it a few hours max every week
- do not force adoption: that is important, as I was in the minority. Show the numbers and let them speak for themselves. I made several presentations of the work to different engineering teams
- as the new Gradle build was faster and worked, more and more colleagues started using it and helped test the migration. Of course, they also found a few bugs and things to improve, but no blockers
- and one day, once the CTO started using the build himself because the legacy build was too slow, I knew it was a win
It then became accepted that the Gradle build system was better, faster and safer, thanks to pinned versions and reproducibility. It was OK to enable the new build by default for a new release:
- how to move from one build to another? As I had created a separate Makefile.gradle with the same targets, it was as simple as:
    $ mv Makefile Makefile.legacy
    $ mv Makefile.gradle Makefile
- keep the legacy build as a backup for a few releases, in case of unforeseen problems: I wanted to be able to roll back should anything really bad happen (fortunately nothing happened)
- when we were confident enough, delete Makefile.legacy
- then remove the old build stuff little by little. It is also a tedious task, but not really urgent
Conclusion
The new build system has now been used for at least 2 or 3 years. The whole migration took almost a year of working mostly alone for 0.5 day / week, but I can say it was a success.
What is important:
- have a way to measure both builds side by side to show the gains
- both builds must always stay compatible: it really helps adoption, as there is no real blocker. If the new build does not work one day, developers can just switch back to the legacy build and we can fix the new one without (too much) pressure
- it was a tedious project, but tenacity pays. Let adoption happen by itself: it was not obvious at first that the project could be completed successfully. I communicated a bit, without enforcing anything, and once it was obvious that the new build was better, people started using it by themselves and the job was (mostly) done
- we could leverage the build cache to improve on build performance for the CI
- much faster on machines with lots of cores, so as we change hardware the build is even faster. On my new M3 Pro laptop, everything builds in 4 minutes (!)
- I hate groovy