Multi-Arch Build Pipeline — AWS Batch with Docker-outside-of-Docker (DooD)

Status: IMPLEMENTED

This document describes the multi-architecture Docker image build pipeline using AWS Batch with Docker-outside-of-Docker (DooD).

Problem

  • AWS Batch simulation tasks need ARM64 containers (Graviton compute — cheaper, better perf)

  • EKS worker nodes are AMD64 only (GovCloud us-gov-west-1 limitation)

  • A single-arch build node cannot build both architectures

  • CodeBuild ARM64 is not available in us-gov-west-1

Solution

Two parallel AWS Batch jobs — one on ARM64 compute, one on AMD64 compute — each builds Docker images natively for its target architecture using Docker-outside-of-Docker (DooD).

DooD mounts the host EC2 instance’s Docker socket (/var/run/docker.sock) into the docker:cli container. No Docker daemon runs inside the container — it uses the host’s daemon directly. This avoids all DinD cgroup/storage driver issues.

Architecture

sms-api (submit_build_image_job)
│
├── AWS Batch Job (ARM64 Graviton compute)
│   └── docker:cli container (DooD via mounted socket)
│       └── Clones vEcoli, builds + pushes vecoli:{commit}
│           (ARM64 task image for Batch simulation tasks)
│
└── AWS Batch Job (AMD64 x86 compute)
    └── docker:cli container (DooD via mounted socket)
        └── Clones vEcoli, builds + pushes vecoli:{commit}-submit
            (AMD64 submit image: vEcoli + Java + Nextflow for EKS K8s Job)

What Gets Built

Image

Architecture

Used By

Built By

vecoli:{commit}

ARM64

AWS Batch tasks (Graviton)

Batch ARM64 job

vecoli:{commit}-submit

AMD64

K8s Nextflow head Job (EKS)

Batch AMD64 job

Infrastructure (CDK stack: smsvpctest-build-batch)

Resource

Details

ARM64 compute env

Graviton4 (m8g.xlarge, c8g.xlarge), on-demand, 16 max vCPUs

AMD64 compute env

x86 (m7i.xlarge, c7i.xlarge), on-demand, 16 max vCPUs

ARM64 job queue

vecoli-build-arm64

AMD64 job queue

vecoli-build-amd64

Job definition

vecoli-dind-builddocker:cli, unprivileged, host Docker socket mounted

GitHub PAT

vecoli-github-pat in Secrets Manager

ECR repository

vecoli

Build Script Flow

Each Batch job runs this sequence inside the docker:cli container:

  1. apk add aws-cli git bash — install tools (Alpine base)

  2. docker info — verify host Docker socket is accessible

  3. aws secretsmanager get-secret-value — fetch GitHub PAT

  4. git clone — clone vEcoli private repo via HTTPS + PAT

  5. git checkout {commit} — checkout specific commit

  6. aws ecr get-login-password | docker login — authenticate to ECR

  7. bash build-and-push-ecr.sh — build Docker image and push to ECR

The AMD64 job additionally builds the submit image (vEcoli + Java + Nextflow) on top of the base image.

Key Learnings

  • DinD doesn’t work in AWS Batch — Docker daemon inside containers fails due to cgroup/storage driver issues even with privileged mode on EC2-backed compute

  • DooD works perfectly — mounting host Docker socket is simple and reliable

  • docker:cli (not docker:dind) — only need the CLI, not the daemon

  • build-and-push-ecr.sh requires USER env var — set export USER=${USER:-sms-api} before calling it

  • GitHub private repos don’t support git fetch --depth 1 origin {hash} — use full single-branch clone instead

  • GitHub PAT authhttps://x-access-token:${PAT}@github.com/... is the standard convention

sms-api Code

  • simulation_service_k8s.py: _build_command() generates the shell script, _submit_batch_build() submits to Batch, _poll_batch_jobs() polls for completion, _run_build() orchestrates both jobs

  • config.py: build_arm64_queue, build_amd64_queue, build_job_definition, build_git_secret_arn

  • Build status polling via batch:DescribeJobs — both must succeed for overall build to complete