One of the developers on a microservice I scaffolded reached out recently asking what could be done about image build times, which had ballooned to 18 minutes. With a production launch approaching and dozens of deployments into pre-production environments each day, the slow builds were stretching out deployments and slowing both delivery and troubleshooting.

The application itself is fairly straightforward: a Java application that uses Spring Boot and is compiled with Gradle. The repo is a monorepo for this team, holding two services that share code. A web admin UI is statically generated with NodeJS and included in the Java jar file.

To verify it was taking 15-20 minutes per build I reviewed the CI logs and could see the time climbing from about 5 minutes when the service was initially created to almost 4x as long. Interestingly, on my local laptop a build still took as little as 5 minutes even with no cache (to match the CI agents, which intentionally do not share any local disks).

Caching was the first thing to tackle, and since I had already introduced Gradle caching for a different monolith, that is where I started.

As we are an AWS shop I went with an S3 backend for the Gradle cache. It works well for our use case: no network configuration is needed, all CI tasks already run in AWS, and each has an existing IAM role to which we add the permissions. For the monolith, Gradle is called directly and the plugin uses the standard AWS Java SDK to pick up the role from the Docker host. These microservice builds call Gradle via the Dockerfile and needed a different way to pass along the credentials, and it had to be done securely: simply passing the credentials in as build arguments would not work, since they would be recorded in the image that would later be deployed. Luckily, we can use Docker secret mounts with an AWS credentials file to provide access. I wrote a small script that retrieves the credentials available to the CI container and writes them to disk so they can be mounted as a secret.
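
A rough sketch of that wiring is below. The ECS credentials endpoint, the jq parsing, and the file paths are assumptions for illustration; the post only says a small script writes the credentials to disk. What is grounded in the Dockerfile further down is the secret id (aws) and the fact that the file surfaces inside the build at /run/secrets/aws.

#!/usr/bin/env sh
# Sketch: fetch the CI task's temporary credentials from the ECS container
# credentials endpoint and write them out as a standard AWS credentials file.
# (Endpoint, jq, and file locations are assumptions, not the actual script.)
creds=$(curl -s "http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}")

cat > /tmp/aws-credentials <<EOF
[default]
aws_access_key_id=$(echo "$creds" | jq -r .AccessKeyId)
aws_secret_access_key=$(echo "$creds" | jq -r .SecretAccessKey)
aws_session_token=$(echo "$creds" | jq -r .Token)
EOF

# Hand the file to the build as a BuildKit secret (other flags omitted here).
# It is mounted at /run/secrets/aws during the RUN step that needs it and is
# never written into an image layer.
docker buildx build --secret id=aws,src=/tmp/aws-credentials .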

Now with the Gradle cache in place I could see that many of the Gradle build steps were coming from the cache and did not have to rerun. However, it wasn't saving much time. Going back to the CI logs I saw hundreds of lines reporting that a required jar was being downloaded; a quick find on the page counted over 1,800 of them. As those jars are version pinned, it didn't make sense to download them again on every build. They are also not stored in the Gradle cache, since that holds build artifacts only. The Gradle command used to create the jar was gradle dp:bootJar¹, which wraps all the steps needed to create the bootable jar, including the download step. My next step was to find a way to cache the downloaded jars for reuse, so I split the download step out into a new Gradle task, gradle dp:downloadDependencies. Running this did pre-download the jars, and they were picked up by the bootJar command afterwards, but the download step itself was still not cached between builds.
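
The post does not show the task itself, so here is a minimal sketch of what such a task could look like in dp/build.gradle, assuming the Groovy DSL and the standard configurations created by the java plugin; the real task may well differ.

// Hypothetical downloadDependencies task: resolving the main classpath
// configurations forces Gradle to download every declared jar, so a later
// bootJar run finds them already on disk. Run in its own Docker stage with
// only the build files copied in, the resulting layer is only invalidated
// when the dependency declarations change.
tasks.register('downloadDependencies') {
    doLast {
        configurations.compileClasspath.resolve()
        configurations.runtimeClasspath.resolve()
        configurations.testCompileClasspath.resolve()
        configurations.testRuntimeClasspath.resolve()
    }
}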

To cache the download step I turned to the Docker build cache, which supports a few different backend options. While S3 is a supported backend it is still experimental, so I went with the registry backend using AWS ECR. To enable it I added --cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=registry.host/image:build-cache --cache-from type=registry,ref=registry.host/image:build-cache to the docker build command. Right away this helped by caching the steps that download the jars; in fact, it allowed skipping that whole section and going straight to the gradle dp:bootJar command.

At this point many of the steps were cached but builds were still taking much longer than expected, so I started going over the CI build logs line by line, build by build. Docker logs each build step with a timestamp, and comparing them showed the same commands taking either two minutes or ten. This was a very odd discrepancy, so I dug further. Comparing build to build, the amd64 container builds took two minutes while the arm64 container builds took ten, most likely because the arm64 stages run emulated under QEMU on the build hosts rather than natively. The CI jobs run docker buildx build --platform linux/amd64,linux/arm64 ..., which instructs Docker to run two parallel builds, one for each platform. Both platforms are needed: amd64 for developer machines so they can use the built images for local testing, and arm64 for deployments, since AWS Graviton (ARM) gives a 30% cost savings with no code changes needed.
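
For reference, the full CI invocation ends up looking roughly like the command below. The registry and image names are the same placeholders used above, the secret path carries over from the earlier credentials sketch, and the tag and --push are assumptions about how the pipeline publishes the result.

# Build both platforms in parallel, reuse and update the registry-backed
# cache, and mount the AWS credentials file as a build secret.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --cache-from type=registry,ref=registry.host/image:build-cache \
  --cache-to mode=max,image-manifest=true,oci-mediatypes=true,type=registry,ref=registry.host/image:build-cache \
  --secret id=aws,src=/tmp/aws-credentials \
  --tag registry.host/image:latest \
  --push \
  .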

While I needed to keep support for both architectures, Java is a cross-platform language: the compiled bytecode can run on any architecture, so the jar only has to be built once and can run in both containers. I changed the builder stages from FROM --platform=$TARGETPLATFORM gradle:8.5-jdk21 AS builder to FROM --platform=linux/amd64 gradle:8.5-jdk21 AS gradle-builder, so the build now runs only on linux/amd64, while the final image that the application runs in still uses $TARGETPLATFORM.

With all these changes together, build times went from eighteen minutes down to two, with each change incrementally building on the previous ones.

Summary of changes#

While this was done for a specific Dockerfile, many of these changes are applicable to any project. Here are the changes that were made:

  • Pin all Docker images to specific versions to prevent cache invalidation
  • Use the Docker build cache to reuse steps
  • Order steps to encourage caching, including using multi-stage builds
  • Split build steps by task type
    • Tasks that change frequently (code compilation)
    • Tasks that are static (dependency downloads)
  • Use language-level caches
  • Cross-compile for multiple platforms

Original Dockerfile#

FROM node:latest AS admin-vite-builder
WORKDIR /app
COPY /dp/admin-frontend ./
RUN npm install
RUN npm run build

#  build backend, and copy admin frontend into static resources folder
FROM --platform=$TARGETPLATFORM gradle:8.5-jdk21 AS builder
COPY .. .
RUN rm -rf dp/src/main/resources/static/admin && mkdir -p dp/src/main/resources/static/admin
COPY --from=admin-vite-builder app/dist dp/src/main/resources/static/admin
RUN --mount=type=cache,id=gradle-dp,target=/home/gradle/.gradle \
    gradle dp:bootJar --info

# package everything into a single jar
FROM --platform=$TARGETPLATFORM amazoncorretto:21-alpine-jdk
LABEL team="ds"
LABEL service="dp"
VOLUME /tmp
WORKDIR /opt/ds
COPY --from=builder home/gradle/dp/build/libs/dp-*.jar app.jar

EXPOSE 9021
EXPOSE 9022
ENTRYPOINT ["java","-jar","app.jar"]

Modified Dockerfile#

The indentation isn’t important but I prefer using it to visually group steps.

FROM --platform=linux/amd64 gradle:8.5-jdk21 AS gradle-cache
    COPY common/build.gradle common/build.gradle
    COPY buildSrc/src/main/groovy/ds-spring-app.gradle buildSrc/src/main/groovy/ds-spring-app.gradle
    COPY buildSrc/src/main/groovy/ds-java.gradle buildSrc/src/main/groovy/ds-java.gradle
    COPY buildSrc/build.gradle buildSrc/build.gradle
    COPY ../gradle.properties ../settings.gradle ./
    COPY dp/build.gradle dp/build.gradle
    COPY gradle/libs.versions.toml gradle/libs.versions.toml

    RUN gradle dp:downloadDependencies

# build admin frontend
FROM --platform=linux/amd64 node:22.14 AS nodejs-cache
    WORKDIR /app
    COPY /dp/admin-frontend/package.json ./
    COPY /dp/admin-frontend/package-lock.json ./
    RUN npm install

FROM nodejs-cache AS nodejs-build
    COPY /dp/admin-frontend ./
    RUN npm run build
    
#  build backend jar, and copy admin frontend into static resources folder
FROM gradle-cache AS gradle-builder
    COPY .. .
    RUN rm -rf dp/src/main/resources/static/admin && mkdir -p dp/src/main/resources/static/admin
    COPY --from=nodejs-build app/dist dp/src/main/resources/static/admin
    #RUN --mount=type=cache,id=gradle-dp,target=/home/gradle/.gradle \
    ENV S3_CACHE=true
    RUN --mount=type=secret,id=aws AWS_SHARED_CREDENTIALS_FILE=/run/secrets/aws \
        gradle dp:bootJar --info

# Copy over the built jar file and run it
FROM --platform=$TARGETPLATFORM amazoncorretto:21-alpine-jdk
    LABEL team="ds"
    LABEL service="dp"
    VOLUME /tmp
    WORKDIR /opt/ds
    COPY --from=gradle-builder home/gradle/dp/build/libs/dp-*.jar app.jar

    EXPOSE 9021
    EXPOSE 9022
    ENTRYPOINT ["java","-jar","app.jar"]

  1. I’ve shortened the service name to dp to anonymize the data; it’s not part of my improvements, but it is used by Gradle to call a task for a specific sub-project. ↩︎