
Four AI supply-chain attacks in 50 days exposed the release pipeline red teams aren't covering | VentureBeat

Four supply-chain attacks hit OpenAI, Anthropic, and Meta in 50 days — none inside the model. A 7-row matrix maps what AI vendor questionnaires are missing.

Overview

Four supply-chain incidents hit OpenAI, Anthropic, and Meta in 50 days: three adversary-driven attacks and one self-inflicted packaging failure. None targeted the model, and all four exposed the same gap: release pipelines, dependency hooks, CI runners, and packaging gates that no system card, AISI evaluation, or Gray Swan red-team exercise has ever scoped.

Details

On May 11, 2026, a self-propagating worm called Mini Shai-Hulud published 84 malicious package versions across 42 @tanstack/* npm packages in six minutes flat. The worm rode in on release.yml, chaining a pull_request_target misconfiguration, GitHub Actions cache poisoning, and OIDC token extraction from runner memory to hijack TanStack’s own trusted release pipeline. The packages carried valid SLSA Build Level 3 provenance because they were published from the correct repository, by the correct workflow, using a legitimately minted OIDC token. No maintainer password was phished. No 2FA prompt was intercepted.
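The first link in that chain, a pull_request_target workflow that checks out code from the pull request's head, can be flagged with a plain text scan. The sketch below is a heuristic under stated assumptions: workflows are read as raw text, and the function name, regexes, and sample workflows are illustrative rather than taken from TanStack, StepSecurity, or GitHub tooling. It will miss indirect references and can produce false positives.

```python
import re

# Heuristic only: a workflow is flagged when it BOTH triggers on
# pull_request_target AND references the pull request's head SHA or ref,
# which is the usual sign that fork code is checked out in base-repo context.
TRIGGER = re.compile(r"\bpull_request_target\b")
HEAD_REF = re.compile(r"github\.event\.pull_request\.head\.(sha|ref)")


def is_risky_workflow(workflow_text: str) -> bool:
    """Return True when both risk markers appear in the workflow text."""
    return bool(TRIGGER.search(workflow_text)) and bool(HEAD_REF.search(workflow_text))


# Hypothetical workflow snippets, for illustration only.
RISKY = """
on:
  pull_request_target:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
"""

SAFE = """
on:
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
"""

if __name__ == "__main__":
    print("risky flagged:", is_risky_workflow(RISKY))
    print("safe flagged:", is_risky_workflow(SAFE))
```

A scan like this only narrows the list of workflows for a human to read. It does not establish whether the checked-out code is actually executed with access to secrets.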

The trust model worked exactly as designed and still produced 84 malicious artifacts.

Two days later, OpenAI confirmed that two employee devices were compromised and credential material was exfiltrated from internal code repositories. OpenAI is now revoking its macOS security certificates and forcing all desktop users to update by June 12, 2026. OpenAI noted that it had already been hardening its CI/CD pipeline after an earlier supply-chain incident, but the two affected devices had not yet received the updated configurations. That is the response profile of a build-pipeline breach, not a model-safety incident.

Model red teams do not cover release pipelines. The four incidents below are evidence for a single architectural finding that belongs in every AI vendor questionnaire.

OpenAI Codex command injection (disclosed March 30, 2026). BeyondTrust Phantom Labs researcher Tyler Jespersen found that OpenAI Codex passed GitHub branch names directly into shell commands with zero sanitization. An attacker could inject a semicolon and a backtick subshell into a branch name, and the Codex container would execute it, returning the victim’s GitHub OAuth token in cleartext. The flaw affected the ChatGPT website, Codex CLI, Codex SDK, and the IDE Extension. OpenAI classified it Critical Priority 1 and completed remediation by February 2026. The Phantom Labs team used Unicode characters to make a malicious branch name visually identical to "main" in the Codex UI. One branch name. That is where the attack started.
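The class of bug described here, untrusted input interpolated into a shell command, has a well-known general fix: validate the input against a strict allowlist and pass it as a separate argument without invoking a shell. The sketch below illustrates that pattern in Python. It is not OpenAI's remediation, and the function names and the allowlist are assumptions chosen for the example.

```python
import re
import subprocess

# Conservative allowlist (an assumption for this sketch): ASCII letters,
# digits, and . _ / - only. Shell metacharacters such as ; and ` are rejected,
# and so are Unicode look-alike characters.
SAFE_BRANCH = re.compile(r"[A-Za-z0-9._/-]+")


def validate_branch(name: str) -> str:
    """Return the name unchanged if it passes the allowlist, else raise."""
    # A leading "-" is rejected so the name cannot be parsed as a git option.
    if not SAFE_BRANCH.fullmatch(name) or name.startswith("-"):
        raise ValueError(f"rejected branch name: {name!r}")
    return name


def checkout_argv(branch: str) -> list[str]:
    """Build an argument vector; no shell ever parses the branch name."""
    return ["git", "checkout", validate_branch(branch)]


def run_checkout(branch: str) -> None:
    # Passing a list (and leaving shell=False, the default) means the branch
    # name reaches git as a single argument.
    subprocess.run(checkout_argv(branch), check=True)
```

Either measure alone would stop the semicolon-and-backtick payload the article describes. Using both also covers the case where a later refactor reintroduces a shell string.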

LiteLLM supply-chain poisoning and Mercor breach (March 24–27, 2026). The threat group TeamPCP used credentials stolen in a prior compromise of Aqua Security’s Trivy vulnerability scanner to publish two poisoned versions of the LiteLLM Python package to PyPI. LiteLLM is a widely adopted open-source LLM proxy gateway used across major AI infrastructure teams. The malicious versions were live for roughly 40 minutes and received nearly 47,000 downloads before PyPI quarantined them.

The attack cascaded downstream into Mercor, the $10 billion AI data startup that supplies training data to Meta, OpenAI, and Anthropic. Four terabytes were exfiltrated, including proprietary training methodology references from Meta. Meta froze the partnership indefinitely. A class action followed within five days. One compromised open-source dependency sitting 40 minutes on PyPI created a cross-industry blast radius that no single vendor’s model red team would have caught.
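A 40-minute exposure window is the case a package-manager cooldown is meant to address: an installer that refuses versions younger than some minimum age would not have resolved either poisoned release while it was live. The sketch below shows only the age comparison. The timestamps and the 24-hour threshold are hypothetical, and how a publish time is obtained from a registry is left out.

```python
from datetime import datetime, timedelta, timezone


def old_enough(published_at: datetime, now: datetime, min_age: timedelta) -> bool:
    """Return True when a release has been public for at least min_age."""
    return now - published_at >= min_age


# Hypothetical timestamps for illustration; not taken from the incident record.
published = datetime(2026, 3, 24, 10, 0, tzinfo=timezone.utc)
forty_minutes_later = published + timedelta(minutes=40)
two_days_later = published + timedelta(days=2)
cooldown = timedelta(hours=24)

if __name__ == "__main__":
    print("installable after 40 minutes:", old_enough(published, forty_minutes_later, cooldown))
    print("installable after 2 days:", old_enough(published, two_days_later, cooldown))
```

The trade-off is that a cooldown also delays legitimate security fixes, so teams usually need an explicit override path for urgent patches.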

Anthropic Claude Code source map leak (March 31, 2026). This incident was not adversary-driven. Anthropic shipped Claude Code version 2.1.88 to the npm registry with a 59.8 MB source map file that should never have been included. The map file pointed to a zip archive on Anthropic’s own Cloudflare R2 bucket containing 513,000 lines of unobfuscated TypeScript across 1,906 files. Agent orchestration logic. 44 feature flags. System prompts. Multi-agent coordination architecture. All public. All downloadable. No authentication required. Security researcher Chaofan Shou flagged the exposure within hours, and Anthropic pulled the package. Anthropic confirmed it was a “release packaging issue caused by human error.” This was the second such leak in 13 months. The root cause was a missing line in .npmignore. No attacker was involved, but the release-surface gap is identical. No human review gate existed between the build artifact and the registry publish step.
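A packaging gate of the kind missing here can be approximated by checking the list of files that would be published before the publish step runs. The sketch below assumes the file list and sizes are already available as a mapping; how that list is produced is outside the example. The forbidden patterns, the size budget, and the file names are hypothetical, not Anthropic's configuration.

```python
from fnmatch import fnmatch

# Assumed policy for this sketch: file types that should never ship.
FORBIDDEN_PATTERNS = ["*.map", "*.zip", ".env*"]


def artifact_violations(files: dict[str, int], max_total_bytes: int) -> list[str]:
    """Return human-readable problems for a {path: size_in_bytes} listing."""
    problems = []
    for path in sorted(files):
        name = path.rsplit("/", 1)[-1]  # match on the file name only
        if any(fnmatch(name, pattern) for pattern in FORBIDDEN_PATTERNS):
            problems.append(f"forbidden file: {path}")
    total = sum(files.values())
    if total > max_total_bytes:
        problems.append(f"total size {total} exceeds budget {max_total_bytes}")
    return problems


# Hypothetical package listing; the sizes are illustrative.
listing = {
    "package/cli.js": 2_000_000,
    "package/cli.js.map": 59_800_000,
}

if __name__ == "__main__":
    for problem in artifact_violations(listing, max_total_bytes=10_000_000):
        print(problem)
```

A CI job would fail the build whenever the returned list is non-empty. That turns a missing ignore rule into a blocked release instead of a public one.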

TanStack worm and downstream propagation (May 11–14, 2026). Wiz Research attributed the Mini Shai-Hulud attack to TeamPCP with high confidence. StepSecurity detected the compromise within 20 minutes. The worm spread beyond TanStack to Mistral AI, UiPath, and 160-plus packages within hours. Mini Shai-Hulud even impersonated the Anthropic Claude GitHub App identity by authoring commits under the fabricated identity “claude <claude@users.noreply.github.com>” to bypass code review.

Four incidents. Three frontier labs. One finding. The red-team scope stops at the model boundary, and the build pipeline sits on the other side of it.

On May 10, 2026, OpenAI launched Daybreak, a cybersecurity initiative built on GPT-5.5 and a new permissive model called GPT-5.5-Cyber designed for authorized red teaming, penetration testing, and vulnerability discovery. Daybreak pairs Codex Security with partners including Cisco, CrowdStrike, Akamai, Cloudflare, and Zscaler. OpenAI positioned the launch as proof that frontier AI can tilt the balance toward defenders.

The next day, the TanStack worm compromised two OpenAI employee devices.

OpenAI’s own incident disclosure acknowledged the gap directly. The company had already been hardening its CI/CD pipeline after the earlier Axios supply-chain attack, but the two affected devices “did not have the updated configurations that would have prevented the download.” The controls existed. The deployment was in progress. The worm arrived first.

The security community saw the same gap. Security researcher @EnTr0pY_88 noted on X that the real signal was the certificate rotation, not the exfiltrated code: "The cert rotation… is what you do when the blast radius reached signing trust, not just source access." @OpenMatter_ put the SLSA provenance failure in two sentences: "If an attacker controls your CI runner, they control your attestations. Policy-based security is failing at scale." And @The_Calda compressed the disclosure's internal contradiction into a single line: "'Limited impact' but the next sentence is 'we're rotating signing certs.'"

A company that launched a cyber defense platform on Sunday and disclosed a build-pipeline breach on Tuesday is not failing at model safety. OpenAI is demonstrating the exact gap this audit grid exists to close. The model red team and the release-pipeline red team are two different disciplines; four incidents in 50 days suggest only one of them is being funded consistently.

The matrix below maps the seven release-surface classes missing from AI vendor questionnaires, with vendor hit, failure mechanism, detection gap, technical mitigation, and priority tier a security team can execute before Q2 renewals close.

For teams that need to map these rows into existing GRC tooling, rows 2, 3, and 5 align with NIST SSDF PS.1.1 (protect all forms of code from unauthorized access and tampering). Row 4 maps to SSDF PS.2.1 (provide mechanisms for verifying software release integrity). Row 6 maps partially to SLSA Source Track requirements for verified contributor identity, though no published framework directly addresses upstream dependency maintainer credential provenance. Row 7 is not yet addressed by any published framework, which is itself the finding.

Row 1: Model capability evals (jailbreak, misuse, exfiltration)
Detection gap: Covered. System cards, AISI Expert suite, Gray Swan scope this today.
Mitigation: Continue requiring the system card at every renewal.

Row 2
Vendor hit and failure mechanism: TanStack pwn-request ran fork code in base-repo context. Poisoned pnpm cache. Extracted OIDC token from runner memory. Two OpenAI employee devices compromised.
Detection gap: No system card covers CI runner isolation. No AISI eval tests fork-to-base trust boundaries.
Mitigation: Audit every repo for pull_request_target + fork SHA checkout. Block fork code from base-repo context. Pin cache keys to commit SHA.

Row 3
Vendor hit and failure mechanism: TanStack minted valid SLSA Build Level 3 provenance for all 84 malicious packages. First known npm worm with valid cryptographic attestation.
Detection gap: SLSA attestation confirms build origin, not build intent. No vendor questionnaire distinguishes the two.
Mitigation: Pin trusted publisher to branch + workflow, not just repository. Add behavioral analysis at install time.

Row 4: Release packaging review (human gate before publish)
Vendor hit and failure mechanism: Missing .npmignore shipped 59.8 MB source map in Claude Code npm package. 513K lines exposed including agent logic, 44 feature flags, system prompts. Second leak in 13 months. Self-inflicted, not adversary-driven.
Detection gap: No red-team exercise checks artifact contents before registry publish.
Mitigation: Human review between build artifact and registry publish. Enforce .npmignore in CI. Fail build on unexpected artifact size.

Row 5
Vendor hit and failure mechanism: router_init.js executes on import. tanstack_runner.js self-propagates via optionalDependencies prepare hook. Spread to Mistral AI, UiPath, 160+ packages in hours.
Detection gap: Lifecycle hooks execute before any scanner runs. Model evals never test package install behavior.
Mitigation: Disable lifecycle scripts in CI by default. Explicit allowlist for production. Flag new optionalDependencies in PR review. Set minimumReleaseAge.

Row 6
Vendor hit and failure mechanism: TeamPCP stole LiteLLM maintainer credential via prior Trivy compromise. Two poisoned PyPI versions live 40 min. Mercor cache held Meta training methodology references. 4 TB exfiltrated. Meta froze the partnership.
Detection gap: Vendor questionnaires ask about encryption and access control, not maintainer credential provenance for upstream dependencies.
Mitigation: Require hardware-key auth from every maintainer before onboarding. Add package-manager cooldown. Audit transitive dependency tree quarterly.

Row 7
Vendor hit and failure mechanism: BeyondTrust Phantom Labs injected shell commands through GitHub branch-name parameter. Stole OAuth tokens from Codex container. Scalable across shared repos. Rated Critical P1, patched Feb 2026.
Detection gap: Agent red teams test prompt injection, not input-parameter injection at the container level.
Mitigation: Sanitize all external input before shell execution. Audit OAuth token scope and lifetime per agent session. Enforce least-privilege on every container.
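The lifecycle-script mitigations listed above, in particular flagging new optionalDependencies in PR review, can be partly automated by inspecting a package manifest. The sketch below is a minimal reviewer aid, assuming the manifest is available as JSON text. The function name and the sample manifest are hypothetical, and the hook list covers only the common install-time npm scripts.

```python
import json

# Install-time npm lifecycle scripts this sketch looks for.
LIFECYCLE_HOOKS = ("preinstall", "install", "postinstall", "prepare")


def review_flags(package_json_text: str) -> list[str]:
    """List manifest entries a reviewer should look at before merging."""
    manifest = json.loads(package_json_text)
    flags = []
    scripts = manifest.get("scripts", {})
    for hook in LIFECYCLE_HOOKS:
        if hook in scripts:
            flags.append(f"lifecycle script: {hook}")
    for dependency in sorted(manifest.get("optionalDependencies", {})):
        flags.append(f"optional dependency: {dependency}")
    return flags


# Hypothetical manifest for illustration.
SAMPLE = json.dumps({
    "name": "demo",
    "scripts": {"build": "tsc", "prepare": "node setup.js"},
    "optionalDependencies": {"helper-pkg": "1.0.0"},
})

if __name__ == "__main__":
    for flag in review_flags(SAMPLE):
        print(flag)
```

This complements rather than replaces disabling scripts at install time, for example with npm's --ignore-scripts option in CI. The manifest check shows what would have run.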

Add one question to every AI vendor questionnaire. "Does your organization red-team its release pipeline, including CI runner trust boundaries, OIDC token scoping, dependency lifecycle hooks, and registry publish gates? Provide the last assessment date and scope." No date and no scope document is the finding.

Run rows 2 through 7 against your own CI pipelines this week. StepSecurity and Snyk both published detection and remediation steps for the TanStack worm patterns. Dev teams pull OpenAI SDKs, Anthropic packages, and Llama weights through npm, PyPI, and Hugging Face every week. The same patterns that got exploited are in your CI right now.

Brief the board on the provenance gap. The TanStack worm proved that valid cryptographic provenance can sit on top of a malicious package. Attestation tells the board where a package was built. Behavioral analysis tells the board what it does after install. Q2 renewal requires both. Snyk's analysis recommends pinning trusted publisher configurations to specific branches and workflows, not just repositories. That is the language the board presentation needs.
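To make "pinning to branch and workflow, not just repository" concrete, the sketch below compares the identity claims attached to a build against a pinned policy. The claim keys (repository, workflow, ref), the policy values, and the function name are assumptions for illustration. They are not the field names of SLSA, npm trusted publishing, or any specific attestation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublisherPolicy:
    """Pinned identity a release must match (all three fields, not one)."""
    repository: str
    workflow: str
    ref: str


def matches_policy(claims: dict[str, str], policy: PublisherPolicy) -> bool:
    """Return True only when repository, workflow, and ref all match."""
    return (
        claims.get("repository") == policy.repository
        and claims.get("workflow") == policy.workflow
        and claims.get("ref") == policy.ref
    )


# Hypothetical policy and claims.
POLICY = PublisherPolicy(
    repository="example-org/example-pkg",
    workflow=".github/workflows/release.yml",
    ref="refs/heads/main",
)
FROM_MAIN = {
    "repository": "example-org/example-pkg",
    "workflow": ".github/workflows/release.yml",
    "ref": "refs/heads/main",
}
FROM_PULL_REQUEST = {**FROM_MAIN, "ref": "refs/pull/42/merge"}

if __name__ == "__main__":
    print("main accepted:", matches_policy(FROM_MAIN, POLICY))
    print("pull request accepted:", matches_policy(FROM_PULL_REQUEST, POLICY))
```

A check like this narrows who can publish, but it cannot tell whether a run of the pinned workflow was itself subverted. That is the reason the paragraph above pairs attestation with behavioral analysis.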

The worm already knows where your AI credentials live

Mini Shai-Hulud does not stop at CI secrets. Datadog Security Labs documented that the payload reads ~/.claude.json and exfiltrates it. It scans for 1Password and Bitwarden vaults, Kubernetes service accounts, cloud provider tokens, and shell history files where developers paste API keys. StepSecurity's deobfuscation confirmed that Mini Shai-Hulud harvests Claude and Kiro MCP server configurations, which store API keys and auth tokens for external services. For developers using AI coding agents, the worm already knows where their credentials live.
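For incident response, a useful first step is an inventory of which credential-bearing files exist on a developer machine, so that the secrets they hold can be rotated. The sketch below checks a short candidate list relative to a home directory. Only ~/.claude.json is named in the reporting above. The other paths are common defaults added as assumptions, and the list is not a reconstruction of what the worm actually targets.

```python
from pathlib import Path

# Candidate paths relative to the home directory. ".claude.json" comes from
# the article; the rest are common default locations assumed for this sketch.
CANDIDATES = [".claude.json", ".kube/config", ".bash_history", ".zsh_history"]


def credential_files_present(home: Path) -> list[str]:
    """Return the candidate paths that exist as regular files under home."""
    return [rel for rel in CANDIDATES if (home / rel).is_file()]


if __name__ == "__main__":
    for rel in credential_files_present(Path.home()):
        print("rotate secrets referenced in:", rel)
```

The function only reports presence. It never reads file contents, which keeps the inventory itself from becoming another place where secrets are copied.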

OpenAI, Anthropic, and Meta will keep publishing system cards. They will keep funding red-team competitions. They will keep passing model evaluations. None of that stops the next worm from riding in on release.yml.

The TanStack postmortem team said it directly. Modern supply-chain defenses are important but not sufficient on their own. Teams must proactively identify and close workflow gaps rather than relying solely on the security features of their tools.

Key Takeaways

  • Four AI supply-chain attacks in 50 days exposed the release pipeline red teams aren't covering

  • Four supply-chain incidents hit OpenAI, Anthropic, and Meta in 50 days: three adversary-driven attacks and one self-inflicted packaging failure

  • On May 11, 2026, a self-propagating worm called Mini Shai-Hulud published 84 malicious package versions across 42 @tanstack/* npm packages in six minutes flat

  • The trust model worked exactly as designed and still produced 84 malicious artifacts

  • Two days later, OpenAI confirmed that two employee devices were compromised and credential material was exfiltrated from internal code repositories
