fix: use public Chainguard base images instead of private ECR #170

Merged
RoxyFarhad merged 1 commit into main from RF/fix-public-dockerfiles
Mar 18, 2026

Conversation

@RoxyFarhad
Collaborator

@RoxyFarhad RoxyFarhad commented Mar 18, 2026

Testing:

  • Built the Agentex UI image locally
  • Built the Agentex server image locally

Greptile Summary

This PR removes the dependency on private ECR golden base images by switching both the Python backend (agentex/Dockerfile) and the Node.js frontend (agentex-ui/Dockerfile) to use publicly available Chainguard images pulled directly from cgr.dev. The corresponding AWS OIDC authentication and ECR login steps are removed from the integration-test workflow. As a secondary change, the Python Dockerfile is refactored to use a standard /opt/venv virtual environment instead of installing packages directly into the system Python prefix, which simplifies the multi-stage COPY logic considerably. The UI Dockerfile also fixes a pre-existing bug where NODE_ENV=production was set before npm ci, inadvertently excluding dev dependencies needed for the Next.js build.

Key changes:

  • Both Dockerfiles now pull from cgr.dev/chainguard instead of 022465994601.dkr.ecr.us-west-2.amazonaws.com
  • Workflow removes AWS OIDC (id-token: write) permissions and the ECR login steps; job-level permissions block is dropped and the job correctly inherits workflow-level contents: read + packages: read
  • Python backend now installs into /opt/venv instead of the system Python, simplifying the production-stage COPY to a single directory
  • UI Dockerfile correctly installs all deps (including dev) before the build, then prunes dev deps — fixing the broken NODE_ENV=production-before-npm ci ordering
  • Both Dockerfiles use unpinned latest-dev tags (previously python:3.12-dev and node:20-dev), which introduces non-determinism and risks silent major-version upgrades on future builds
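
The `/opt/venv` layout described in the key changes can be sketched roughly as follows. This is a minimal illustration, not the Dockerfile from the PR: the `requirements.txt` file, the use of `pip`, and the `USER root` switch in the build stage are assumptions made for the example.

```dockerfile
# Build stage: create the virtual environment and install dependencies into it.
FROM cgr.dev/chainguard/python:latest-dev AS base
# Assumption: switch to root so that /opt/venv can be created.
USER root
WORKDIR /app
# Assumption: dependencies are listed in a requirements.txt file.
COPY requirements.txt .
RUN python -m venv /opt/venv \
    && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Production stage: copy the whole environment as a single directory.
FROM cgr.dev/chainguard/python:latest-dev AS production
COPY --from=base /opt/venv /opt/venv
# Put the venv first on PATH so `python` resolves to /opt/venv/bin/python.
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
```

A venv refers back to the interpreter of the image it was created in, so a single-directory copy like this only works when both stages provide the same Python at the same path.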

Confidence Score: 3/5

  • Functionally correct and a meaningful simplification, but the switch to unpinned latest-dev tags introduces build non-determinism that could silently break things on future runs.
  • The core goal (removing private ECR dependency) is achieved cleanly, and the workflow permission changes are correct. The main risk is that both Dockerfiles now use cgr.dev/chainguard/python:latest-dev and cgr.dev/chainguard/node:latest-dev without version pins. Chainguard's latest tag tracks the newest stable release and can advance major versions (e.g., Python 3.12 → 3.13, Node 20 → 22) automatically. The original ECR images were pinned to python:3.12 and node:20, so this is a regression in reproducibility. Additionally, using latest-dev independently in both the base and production stages of the Python Dockerfile creates a subtle risk: if Chainguard pushes an update between the two pulls within a single build, the build-time and run-time Python versions could differ.
  • Both agentex/Dockerfile and agentex-ui/Dockerfile use unpinned latest-dev image tags and should be revisited to restore version pins.
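
One possible way to keep the two Python stages on the same image is to declare the reference once and reuse it. The sketch below assumes a hypothetical build argument named `PYTHON_IMAGE`, which is not part of the PR; the stage bodies are elided.

```dockerfile
# Assumption: PYTHON_IMAGE is a hypothetical build argument, not part of the PR.
# An ARG declared before the first FROM can be referenced by every FROM line.
ARG PYTHON_IMAGE=cgr.dev/chainguard/python:latest-dev

FROM ${PYTHON_IMAGE} AS base
# ... build steps ...

FROM ${PYTHON_IMAGE} AS production
# ... runtime steps ...
```

Passing a digest-pinned reference at build time, for example `docker build --build-arg PYTHON_IMAGE=cgr.dev/chainguard/python@sha256:<digest> .` with `<digest>` left as a placeholder, would make both stages use one immutable image. Whether versioned tags such as `3.12-dev` can be pulled from the public registry is not confirmed in this thread.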

Important Files Changed

| Filename | Overview |
|----------|----------|
| agentex/Dockerfile | Migrates from private ECR to public Chainguard images. Replaces system-level Python installation with an isolated venv at /opt/venv, significantly simplifying the multi-stage COPY logic. Both the base and production stages use unpinned latest-dev tags, which introduces non-determinism and the risk of silent major-version upgrades. |
| agentex-ui/Dockerfile | Migrates from private ECR to public Chainguard node image. Fixes a pre-existing bug where NODE_ENV=production was set before npm ci, causing dev dependencies required for the Next.js build to be omitted. The fix installs all deps, builds, then prunes dev deps. Uses unpinned latest-dev tag. |
| .github/workflows/integration-tests.yml | Removes job-level permissions block and the two AWS/ECR steps from run-integration-tests. The job now correctly inherits workflow-level permissions (contents: read, packages: read), so the GHCR login and checkout steps are unaffected. Clean removal with no side effects. |
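
The install, build, prune ordering described for agentex-ui/Dockerfile could look roughly like the sketch below. It is an illustration rather than the file from the PR: the presence of `package-lock.json` and the `USER root` switch are assumptions, and the final user switch and start command are left out.

```dockerfile
FROM cgr.dev/chainguard/node:latest-dev
WORKDIR /app
# Assumption: root is used for the build steps.
USER root
# Assumption: the project ships a package-lock.json, which npm ci requires.
COPY package.json package-lock.json ./
# NODE_ENV is not yet "production", so devDependencies are installed too.
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies once the build output exists.
RUN npm prune --omit=dev
# Only now mark the runtime environment as production.
ENV NODE_ENV=production
```

Setting `NODE_ENV=production` before `npm ci` makes npm skip devDependencies, which is the ordering problem the summary describes.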

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Before
        A1[Private ECR\n022465994601.dkr.ecr.us-west-2.amazonaws.com] -->|AWS OIDC + ECR Login| B1[python:3.12-dev / node:20-dev]
        B1 --> C1[Build image]
    end

    subgraph After
        A2[Public Chainguard Registry\ncgr.dev/chainguard] -->|No auth required| B2[python:latest-dev / node:latest-dev]
        B2 --> C2[Build image]
    end

    subgraph Workflow Change
        W1[run-integration-tests job] -->|Removed| W2[AWS credentials step]
        W1 -->|Removed| W3[ECR login step]
        W1 -->|Removed| W4[Job-level permissions block]
        W1 -->|Inherits from workflow level| W5[contents: read\npackages: read]
    end
Prompt To Fix All With AI
This is a comment left during a code review.
Path: agentex/Dockerfile
Line: 1

Comment:
**Unpinned `latest-dev` tag risks non-deterministic builds**

The original ECR image used a pinned `python:3.12-dev` tag. Switching to `cgr.dev/chainguard/python:latest-dev` removes that version pin. Chainguard's `latest` tag follows the newest stable release and could silently advance to Python 3.13 or beyond on the next build, potentially introducing breaking changes (e.g., Python 3.13 removed several deprecated APIs).

Consider pinning to a specific minor version for reproducibility:

```suggestion
FROM cgr.dev/chainguard/python:3.12-dev AS base
```

The same applies to the production stage at line 55.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: agentex/Dockerfile
Line: 55

Comment:
**Unpinned `latest-dev` tag in production stage**

Same concern as the `base` stage — the production image also uses `cgr.dev/chainguard/python:latest-dev`. Because both the `base` and `production` stages use `latest-dev` independently, there is also a risk that two separate image pulls resolve to different digests if Chainguard pushes an update between pulls during a single build, resulting in a mismatched Python runtime between build-time and run-time.

```suggestion
FROM cgr.dev/chainguard/python:3.12-dev AS production
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: agentex-ui/Dockerfile
Line: 2

Comment:
**Unpinned `latest-dev` tag**

The original ECR image was pinned to `node:20-dev`. Using `cgr.dev/chainguard/node:latest-dev` could silently upgrade to Node.js 22, 24, or beyond on any future build. Node.js major-version upgrades can include breaking changes in native addons (relevant here since `sharp` is a native module) and built-in API behaviour.

```suggestion
FROM cgr.dev/chainguard/node:20-dev
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix: use public Chai..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@RoxyFarhad RoxyFarhad requested a review from a team as a code owner March 18, 2026 12:13
@RoxyFarhad RoxyFarhad merged commit 3613637 into main Mar 18, 2026
29 checks passed
@RoxyFarhad RoxyFarhad deleted the RF/fix-public-dockerfiles branch March 18, 2026 12:34
scale-ballen added a commit that referenced this pull request Mar 18, 2026
PR #170 switched to cgr.dev/chainguard/python which requires
authentication. Since scale-agentex is a public open-source repo,
keep python:3.12-slim-trixie (0 OS CVEs, no auth required).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sayakmaity added a commit that referenced this pull request Mar 18, 2026
The public Chainguard base image change (#170) causes
"no users found" CreateContainerError on the dev cluster.
Revert to the golden ECR base image that was working.
sayakmaity added a commit that referenced this pull request Mar 18, 2026
…rror (#171)

## Summary
The public Chainguard base image change (#170) uses `USER nonroot`, but
the golden base image has the user named `node` (not `nonroot`) at UID
65532. This causes `CreateContainerError: no users found` on the dev
cluster.

Switches to `USER 65532` (numeric UID) which works with both base
images.

This unblocks deployment of the SGPINF-1217 fix (#165).
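
A minimal sketch of the numeric-UID approach, assuming the application lives in `/app` and that root is used for the preceding build steps (neither detail is taken from the actual diff):

```dockerfile
FROM cgr.dev/chainguard/node:latest-dev
WORKDIR /app
# Assumption: root is used while files are copied and ownership is set.
USER root
COPY . .
# Hand the application directory to the unprivileged UID before dropping root.
RUN chown -R 65532:65532 /app
# Numeric UID: no user name has to be resolved, so the same line works
# whether the base image calls this account `node` or `nonroot`.
USER 65532
```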

## Test plan
- [ ] Image builds successfully
- [ ] Pod starts without CreateContainerError on dev cluster

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes the `CreateContainerError: no users found` regression on
the dev cluster by changing `USER nonroot` to `USER 65532` (numeric UID)
in `agentex-ui/Dockerfile`. Using a numeric UID avoids a `/etc/passwd`
lookup, which is the correct approach for minimal/distroless-style
images like Chainguard where named users may not be registered.

**Important discrepancy:** The PR title and description say this
restores the golden ECR base image
(`022465994601.dkr.ecr.us-west-2.amazonaws.com/golden/chainguard/node:20-dev`),
but the `FROM` line is not changed — the image remains
`cgr.dev/chainguard/node:latest-dev`. The only actual code change is the
`USER` directive on line 53. This should be clarified to avoid
misleading git history.

Key points:
- The `USER 65532` numeric-UID fix directly addresses the `no users
found` error and is technically sound.
- The base image (`cgr.dev/chainguard/node:latest-dev`) uses a
**floating `latest-dev` tag**, so builds remain non-reproducible — this
was already the case before and is not introduced by this PR.
- If the golden ECR image is needed for reasons beyond the user setup
(e.g., internal CA trust, private registry hardening), the `FROM` line
still needs to be updated.

<details><summary><h3>Confidence Score: 4/5</h3></summary>

- Safe to merge as a targeted fix for the pod-start error; the
discrepancy between the PR description and the actual change (FROM line
not updated) should be confirmed as intentional before landing.
- The change is minimal (one line, `USER nonroot` → `USER 65532`) and
directly resolves the described `CreateContainerError: no users found`.
Numeric UIDs are the idiomatic fix for Chainguard distroless images. The
only concern is that the PR description claims a base-image revert that
did not actually happen, which could cause confusion. Once that intent
is confirmed/clarified, the risk is very low.
- agentex-ui/Dockerfile line 2 — the `FROM` image is still the public
Chainguard image, not the golden ECR image described in the PR.
</details>

<h3>Important Files Changed</h3>

| Filename | Overview |
|----------|----------|
| agentex-ui/Dockerfile | Single-line change switching `USER nonroot` to
`USER 65532` (numeric UID) to fix `CreateContainerError: no users
found`; base image (public `cgr.dev/chainguard/node:latest-dev`) is
unchanged despite PR description claiming a revert to the golden ECR
image. |

</details>

<details><summary><h3>Sequence Diagram</h3></summary>

```mermaid
sequenceDiagram
    participant Docker as Docker Build
    participant Image as cgr.dev/chainguard/node:latest-dev
    participant App as /app (Node.js)

    Docker->>Image: FROM cgr.dev/chainguard/node:latest-dev
    Docker->>App: USER root → apk add libvips-dev, python3, etc.
    Docker->>App: npm ci (all deps incl. dev)
    Docker->>App: npm run build
    Docker->>App: npm prune --omit=dev
    Docker->>App: chown -R 65532:65532 /app
    Note over Docker,App: PR #171 change: USER nonroot → USER 65532
    Docker->>App: USER 65532 (numeric UID — no /etc/passwd lookup)
    App-->>Docker: EXPOSE 3000, CMD ["npm", "start"]
```
</details>

<details><summary>Prompt To Fix All With AI</summary>

`````markdown
This is a comment left during a code review.
Path: agentex-ui/Dockerfile
Line: 2

Comment:
**Base image still public Chainguard, not the golden ECR image**

The PR description states this revert restores the golden ECR base image (`022465994601.dkr.ecr.us-west-2.amazonaws.com/golden/chainguard/node:20-dev`), but the `FROM` line is unchanged and still points to the public image `cgr.dev/chainguard/node:latest-dev`.

The actual fix is only the `USER nonroot` → `USER 65532` change on line 53. That numeric-UID approach is the correct workaround for `CreateContainerError: no users found` in distroless/minimal images that lack a full `/etc/passwd`, and it will likely resolve the immediate pod-start failure.

However, if the golden ECR image provides additional hardening (internal CAs, registry access control, pre-vetted dependency pinning, etc.) that the team relies on, this PR does not restore that. Please confirm whether the intent is:

1. **Just fix the USER directive** (current state — acceptable) — update the PR title/description to avoid confusion in git history.
2. **Actually revert to the golden ECR image** — the `FROM` line needs to be changed to `022465994601.dkr.ecr.us-west-2.amazonaws.com/golden/chainguard/node:20-dev`.

How can I resolve this? If you propose a fix, please make it concise.
`````

</details>

<sub>Last reviewed commit: ["fix: use numeric
UID..."](https://github.com/scaleapi/scale-agentex/commit/f2a90eff90ec89f0e1a824fcbcd9608e5fb1fc6e)</sub>

<!-- /greptile_comment -->