Developer | May 19, 2026

Audit ML CI Failures with Gemini 3.5 Flash on Eigent

Regina Bai

Find the Root Cause of ML CI Failures in Minutes with Gemini 3.5 Flash

Debugging a broken ML training pipeline is slow, tedious work. You pull logs from two different CI runs, diff them against golden values, dig through commit history to find the regression, and then write up a report explaining what went wrong and why, all while your team waits. This use case automates that entire investigation.

By combining the ml-failure-audit skill with Google's Gemini 3.5 Flash model and the Gemini Agent API as a remote reasoning engine, Eigent's multi-agent workforce can audit a CI failure end-to-end: fetching logs, extracting reference values, tracing evidence, delegating the heavy analysis, and producing structured deliverables, all from a single prompt.

Select Gemini 3.5 Flash as Your Model

Go to Settings → Agents → Model and select Gemini 3.5 Flash from the cloud model list. To use your own API credentials instead, enter your Gemini key under Settings → API Keys → Gemini.

Gemini 3.5 Flash is optimized for fast, cost-effective inference on long-context tasks, which is exactly what CI log analysis demands.

Enable the Gemini Agent API as a Remote Sub-Agent

Go to Settings → Agents → Remote Agents and toggle on the Gemini Agent API. This registers the Gemini Agent as a callable sub-agent inside Eigent's workforce.

Once enabled, your Developer Agent can hand off computationally intensive reasoning tasks, such as root-cause analysis across hundreds of log lines, directly to the Gemini Agent instead of processing everything in a single model call. This gives you a two-tier setup: Eigent's local agents handle orchestration and tool use, while the Gemini Agent handles deep reasoning.

Upload the ml-failure-audit Skill

Head to Settings → Agents → Skills and upload the ml-failure-audit skill package. You can also find the skill details and installation steps on the Skill Hub page for ml-failure-audit. The skill defines how Eigent should approach CI failure audits: which artifacts to collect, what comparisons to run, what evidence to gather, and how to structure the final report.

Once uploaded, any agent in the workforce can invoke this skill when handling ML audit tasks.

Send Your Task to Eigent

With everything configured, type your task prompt into Eigent's chat:

Follow the {{ml-failure-audit}} skill, and use remote sub agent to finish complex subtasks.

Please audit this Megatron-LM MIMO VLM pretraining golden metric CI failure. I am giving you a local NVIDIA/Megatron-LM checkout at commit <your-commit-sha> and the CI artifacts I attached (for example, passing and failing run logs). The failing workload is an 8-GPU frozen start convergence check using sequence packing, global batch size 32, total packed sequence length 3200, packing buffer 4, and 100 training iterations.

Please decide whether the failure is a real model convergence/correctness regression or a metric/gating policy issue. Use the repo's golden value comparison code and the CI logs as evidence. Do not rerun GPU training.

Produce answer.json in the repo root with source_refs, extracted_facts, calculations, final_answer, and validation. Also produce a concise answer.md.

Include the repository URL and your target commit, and attach the CI artifacts you want compared. Once you send the prompt, Eigent immediately begins planning the investigation.

Install the ml-failure-audit skill before running this prompt.

Bring your own inputs: replace <your-commit-sha> with the commit you want audited, check out that revision in your workspace, and attach your own CI artifacts (for example, passing vs. failing run logs, stderr captures, or exported CI job output). You can adapt the Megatron-LM example to any repo and failure you are investigating.

Coordinator Agent Plans and Assigns the Task

Eigent's Coordinator Agent reads the prompt and decomposes it into a structured audit plan. It identifies the key phases (log retrieval, data extraction, evidence tracing, and report generation) and assigns the full investigation to a Developer Agent.

The Coordinator doesn't just delegate blindly: it passes along the skill reference, the repo context, and the CI log artifacts so the Developer Agent starts with everything it needs.

Developer Agent Loads the Skill and Fetches the Logs

The Developer Agent's first action is to load the ml-failure-audit skill, reading its instructions to understand the audit methodology.

It then runs four commands in parallel to grab the CI log data, pulling the two CI run logs and any relevant metadata simultaneously. Parallel tool execution means the data collection phase completes in a fraction of the time it would take sequentially.
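To make the parallel fetch concrete, here is a minimal Python sketch of running several commands at once and collecting their output in order. It is an illustration only, not Eigent's actual tool runner: the function names `run_command` and `run_in_parallel` are hypothetical, and the four demo commands are stand-ins for whatever log-fetching commands your CI setup needs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(args: list[str]) -> tuple[int, str]:
    """Run one command and return (exit code, captured stdout)."""
    result = subprocess.run(args, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout


def run_in_parallel(commands: list[list[str]]) -> list[tuple[int, str]]:
    """Run all commands concurrently; results keep the input order."""
    if not commands:
        return []
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_command, commands))


# Hypothetical stand-ins for the four log-fetching commands.
demo_commands = [
    [sys.executable, "-c", f"print('artifact {i}')"] for i in range(4)
]
results = run_in_parallel(demo_commands)
```

Threads are enough here because each worker spends its time waiting on a subprocess, so the total wall time is close to that of the slowest command rather than the sum of all four.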

Extract Golden Values and Trace the Fix Commit

With the logs in hand, the Developer Agent runs a Python script to extract the golden reference values: the expected training metrics, loss curves, or benchmark numbers that a passing CI run should produce. It then diffs these against the values recorded in the failing logs to identify exactly where and by how much things diverged.
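The extraction and comparison step can be pictured with a short Python sketch like the one below. Treat every detail as an assumption: the log line shape in `LINE_RE`, the metric name `lm loss`, and the 5% relative tolerance are illustrative choices, and a real audit should rely on the repo's own golden value comparison code, as the prompt requests.

```python
import re

# Assumed log line shape, e.g. "iteration 10/ 100 | lm loss: 2.345678E+00 |".
LINE_RE = re.compile(
    r"iteration\s+(\d+)\s*/\s*\d+.*?lm loss:\s*"
    r"([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)"
)


def extract_losses(log_text: str) -> dict[int, float]:
    """Map iteration number -> loss for every matching log line."""
    losses = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            losses[int(match.group(1))] = float(match.group(2))
    return losses


def find_divergences(golden, observed, rel_tol=0.05):
    """Return (iteration, golden, observed, relative error) beyond rel_tol."""
    divergences = []
    for step in sorted(golden):
        if step not in observed:
            continue  # nothing to compare at this iteration
        g, o = golden[step], observed[step]
        denom = abs(g) or 1.0  # fall back to absolute error if golden is 0
        rel_err = abs(o - g) / denom
        if rel_err > rel_tol:
            divergences.append((step, g, o, rel_err))
    return divergences
```

Comparing parsed numbers against a tolerance, rather than diffing log text, is what lets the audit say where the run diverged and by how much.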

Next, the Developer Agent searches the Megatron-LM commit history to find the fix commit, the specific code change most likely responsible for the regression. This commit serves as concrete evidence in the audit report, giving reviewers a direct link between the observed failure and the underlying code change.
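A commit search of this kind can be sketched in Python around `git log`. This is a hedged illustration rather than what the skill actually runs: the helper names are hypothetical, and `path` and `since_ref` are placeholders for the file you suspect and the last known-good revision.

```python
import subprocess


def build_git_log_command(path: str, since_ref: str, max_count: int = 20) -> list[str]:
    """Build a git log call listing commits that touched `path` after `since_ref`."""
    return [
        "git", "log",
        f"--max-count={max_count}",
        "--pretty=format:%H%x09%s",  # full hash, a tab, then the subject line
        f"{since_ref}..HEAD",
        "--", path,
    ]


def parse_git_log(output: str) -> list[tuple[str, str]]:
    """Turn tab-separated git log output into (sha, subject) pairs."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, _, subject = line.partition("\t")
        commits.append((sha, subject))
    return commits


def candidate_commits(repo_dir: str, path: str, since_ref: str) -> list[tuple[str, str]]:
    """Run the search inside `repo_dir` and return the parsed commit list."""
    result = subprocess.run(
        build_git_log_command(path, since_ref),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_git_log(result.stdout)
```

Restricting the search to a path and a revision range keeps the candidate list short enough to review by hand or to pass along as evidence.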

Delegate Deep Reasoning to the Gemini Agent

Once the raw evidence is assembled (log diffs, golden value comparisons, and the traced commit), the Developer Agent calls the Gemini Agent to perform the heavy reasoning step.

The Gemini Agent analyzes the full context: what changed in the code, how that change affected training behavior, and what the most probable root cause is. Minutes later, it returns a complete, structured audit report covering the failure diagnosis, contributing factors, and recommended fix.
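The hand-off itself depends on Eigent's remote agent integration, which this article does not document, so the sketch below only shows one plausible way to bundle the evidence before delegating. The `EvidenceBundle` class, its field names, and the character budget are all hypothetical; no Gemini Agent API call is shown.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceBundle:
    """Hypothetical container for the evidence gathered before delegation."""
    log_diff_excerpt: str
    golden_comparisons: list[dict] = field(default_factory=list)
    candidate_commit: str = ""
    question: str = "Is this a real regression or a metric/gating policy issue?"


def to_payload(bundle: EvidenceBundle, max_excerpt_chars: int = 4000) -> str:
    """Serialize the bundle as JSON, capping the log excerpt at a fixed budget."""
    data = asdict(bundle)
    excerpt = data["log_diff_excerpt"]
    if len(excerpt) > max_excerpt_chars:
        data["log_diff_excerpt"] = excerpt[:max_excerpt_chars] + "\n[truncated]"
    return json.dumps(data, indent=2, sort_keys=True)
```

Sending one structured payload, rather than raw logs, keeps the reasoning request focused on the evidence that the earlier steps already narrowed down.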

Developer Agent Writes the Final Audit Reports

The Developer Agent takes the Gemini Agent's analysis and writes two deliverables into the workspace:

answer.json: a machine-readable audit record with structured fields for the failure type, root cause, affected metrics, evidence commit, and recommended resolution. Useful for automated pipelines, ticket systems, or CI dashboards.
answer.md: a concise, human-readable audit summary covering what failed, why it failed, the evidence, and what to do next. Ready to paste into a PR comment, Slack thread, or incident report.

Both files are written directly to the workspace folder and are immediately accessible.
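If you plan to feed answer.json into a pipeline, a quick structural check is a sensible first step. The Python sketch below only verifies that the five top-level keys requested in the example prompt are present; the helper names are hypothetical, and the shape of each value is not specified in this article, so it is not validated here.

```python
import json
from pathlib import Path

# Top-level keys requested in the example prompt.
REQUIRED_KEYS = (
    "source_refs",
    "extracted_facts",
    "calculations",
    "final_answer",
    "validation",
)


def missing_keys(record: dict) -> list[str]:
    """Return the required keys absent from the record, in prompt order."""
    return [key for key in REQUIRED_KEYS if key not in record]


def check_answer_file(path) -> list[str]:
    """Load answer.json from `path` and report any missing top-level keys."""
    record = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(record, dict):
        raise ValueError("answer.json should contain a JSON object at the top level")
    return missing_keys(record)
```

An empty list from `check_answer_file("answer.json")` means every requested section is present; anything else names the sections to ask the agent to fill in.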

Why This Workflow Matters

ML CI failures are notoriously hard to debug because the signal is buried in dense log output and the root cause often lives several commits back from the symptom. This workflow addresses that with three capabilities working in concert:

Parallel log retrieval eliminates the sequential bottleneck of pulling artifacts one at a time.
Python-based golden value extraction applies precise numerical comparison instead of relying on pattern matching or manual inspection.
Gemini Agent as a reasoning sub-agent offloads the most complex inference step to a model optimized for it, keeping the orchestration lightweight and the analysis deep.

The result: a root-cause audit that would take an engineer 30 to 60 minutes of focused work is delivered in a few minutes, with a structured artifact trail.

What to Try Next

Once your first audit is complete, extend the workflow with follow-up prompts like:

Run the same audit against the three most recent CI failures and compare the root causes.

After finding the fix commit, open a GitHub issue with the audit report pre-filled.

Schedule a nightly trigger to audit any new CI failures and post the answer.md to Slack.

Swap in a different model: try Gemini 3.5 Pro for deeper analysis or Gemini Flash Lite for faster turnaround.

Tips for Better Results

Attach your CI artifacts explicitly. The ml-failure-audit skill works best when you provide the commit checkout plus the logs or exports you want compared (for example, a passing run and a failing run).
Include the repo URL. The Developer Agent uses it to search commit history for the fix commit. A direct link to the repository saves a search step.
Specify your output files. Asking for both answer.json and answer.md tells the Developer Agent to produce both formats, which is useful if you need machine-readable output for a CI pipeline and human-readable output for your team.
Use the Gemini Agent for reasoning-heavy tasks. The remote sub-agent pattern works best when local agents handle data collection and the Gemini Agent handles synthesis. Avoid calling it for simple lookups that local tool use can handle faster.

All rights reserved © 2026 EIGENT UK LTD