AgentVision — Eyes for AI Agents

Trustworthy by design

Grounded findings, not guesses

Every issue is anchored to a real DOM source — no hallucinated pixel boxes.

📐

DOM Geometry

Precise element boxes via getBoundingClientRect + scroll offset. Coordinate-grounded overflow and clip detection.
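To make the idea concrete, here is a minimal sketch of coordinate-grounded overflow detection. It shifts a viewport-relative rectangle (the shape getBoundingClientRect returns) by the scroll offset to get document coordinates, then measures how far a child box sticks out of its container. The Box class and both function names are hypothetical and are not AgentVision's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: top-left corner plus size, in CSS pixels."""
    x: float
    y: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height


def to_document_coords(rect: Box, scroll_x: float, scroll_y: float) -> Box:
    """Shift a viewport-relative rect by the page scroll offset."""
    return Box(rect.x + scroll_x, rect.y + scroll_y, rect.width, rect.height)


def overflow_px(child: Box, container: Box) -> dict:
    """Return how far the child extends past each container edge (0 if inside)."""
    return {
        "left": max(0.0, container.x - child.x),
        "top": max(0.0, container.y - child.y),
        "right": max(0.0, child.right - container.right),
        "bottom": max(0.0, child.bottom - container.bottom),
    }


# Hypothetical values: a label measured at viewport (300, 40) while the page
# is scrolled 500 px down, inside a card already given in document coordinates.
label = to_document_coords(Box(300, 40, 150, 20), scroll_x=0, scroll_y=500)
card = Box(100, 520, 320, 200)
print(overflow_px(label, card))
```

Any non-zero edge value points at a concrete element and a concrete number of pixels, which is what makes the finding checkable.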

🎨

Real WCAG Contrast

Ratios from getComputedStyle — degrades honestly over gradients and images rather than lying.
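For reference, the WCAG 2.x contrast ratio can be computed from two solid sRGB colors as sketched below. The luminance and ratio formulas follow the WCAG definition. The function names, and the choice to return None when the background is not a single solid color, are assumptions made for illustration and not AgentVision's implementation.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an (r, g, b) tuple with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio, or None when the background is not a solid color.

    Passing bg=None stands in for a gradient or image background, where a
    single ratio would be misleading (hypothetical convention).
    """
    if bg is None:
        return None
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa_normal_text(ratio) -> bool:
    """WCAG AA requires at least 4.5:1 for normal-size text."""
    return ratio is not None and ratio >= 4.5


gray_on_white = contrast_ratio((119, 119, 119), (255, 255, 255))
print(round(gray_on_white, 2), passes_aa_normal_text(gray_on_white))
```

Mid-gray #777777 on white lands just under the 4.5:1 AA threshold, a common near-miss that is hard to judge by eye.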

🔤

OCR Word Boxes

Tesseract-powered text locations for clipped labels, overflowing content, and garbled rendering.

🌐

Network & Console

Catches 4xx errors and console warnings — the #1 “looks fine in code, broken live” cause.
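One way to gather this kind of signal is with Playwright's page events, sketched below. The sketch assumes Playwright and Chromium are installed (see the quick start). The function names are hypothetical, and the filter flags 5xx responses as well as 4xx. It illustrates the approach and is not AgentVision's own collector.

```python
def is_http_failure(status: int) -> bool:
    """Treat any 4xx or 5xx response as a failure worth reporting."""
    return status >= 400


def collect_page_problems(url: str) -> list:
    """Load a page in headless Chromium and record failed requests and console noise."""
    # Imported lazily so the helper above stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    problems = []

    def on_response(response):
        if is_http_failure(response.status):
            problems.append(("http", response.status, response.url))

    def on_console(message):
        if message.type in ("warning", "error"):
            problems.append(("console", message.type, message.text))

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Register listeners before navigation so early requests are not missed.
        page.on("response", on_response)
        page.on("console", on_console)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return problems
```

Calling collect_page_problems on a local file URL or a dev server returns a list of tuples that can be reported alongside visual findings.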

🤖

Vision LLM Critique

Claude / OpenAI / Gemini add semantic critique on top. Bounding boxes proposed by the vision model are advisory and flagged bbox_precise:false.

📱

Responsive Sheets

Contact sheets across 375, 768, 1280, 1920 px breakpoints to surface layout bugs at every viewport.

📸

Visual Regression

Named baseline snapshots + diff scoring so agents know exactly what regressed between iterations.
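A diff score can be as simple as the fraction of pixels that changed beyond a tolerance. The sketch below uses that metric on plain nested lists of RGB tuples. The metric, the tolerance default, and the function name are assumptions for illustration, since the scoring AgentVision actually uses is not described here.

```python
def diff_score(baseline, candidate, tolerance=8):
    """Fraction of pixels whose largest channel difference exceeds the tolerance.

    Both images are lists of rows, each row a list of (r, g, b) tuples.
    Returns 0.0 for identical images and 1.0 when every pixel changed.
    """
    same_shape = len(baseline) == len(candidate) and all(
        len(a) == len(b) for a, b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")

    total = changed = 0
    for row_a, row_b in zip(baseline, candidate):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if max(abs(a - b) for a, b in zip(px_a, px_b)) > tolerance:
                changed += 1
    return changed / total if total else 0.0


WHITE, RED = (255, 255, 255), (255, 0, 0)
baseline = [[WHITE, WHITE], [WHITE, WHITE]]
candidate = [[WHITE, WHITE], [WHITE, RED]]  # one of four pixels regressed
print(diff_score(baseline, candidate))
```

A small tolerance absorbs anti-aliasing and compression noise. That matters because screenshots are not claimed to be bit-reproducible across runs.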

🩺

Doctor Command

Launches a real Chromium instance and checks for Tesseract, poppler, and other system dependencies, reporting anything that is missing.

🔌

Provider-Agnostic API

Swap backends via --backend flag or env var. Uniform API surface across all vision providers.

🎞️

Temporal Verification

watch samples frames over time to verify playback, loading, and liveness: is the video actually playing, did the spinner clear, are captions on? It combines deterministic <video> state with pixel liveness checks. Built for streaming UIs and live dashboards.
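The two signals can be sketched separately. The first check reads <video> state sampled over time (currentTime and paused are standard HTMLMediaElement properties). The second compares consecutive frames for pixel change. Function names, the sample format, and the threshold are hypothetical and are not taken from AgentVision.

```python
def video_is_advancing(samples) -> bool:
    """True if playback position strictly increases and no sample was paused.

    samples: list of (current_time_seconds, paused) tuples read from a
    <video> element at successive moments.
    """
    if len(samples) < 2:
        return False
    if any(paused for _, paused in samples):
        return False
    times = [t for t, _ in samples]
    return all(later > earlier for earlier, later in zip(times, times[1:]))


def pixels_are_live(frames, min_mean_delta=1.0) -> bool:
    """True if any two consecutive frames differ by at least min_mean_delta.

    frames: equally sized, non-empty flat lists of grayscale values (0-255).
    """
    for prev, curr in zip(frames, frames[1:]):
        if not prev or len(prev) != len(curr):
            raise ValueError("frames must be non-empty and equally sized")
        mean_delta = sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)
        if mean_delta >= min_mean_delta:
            return True
    return False


playing = [(0.0, False), (0.5, False), (1.0, False)]
stalled = [(2.0, False), (2.0, False), (2.0, False)]
print(video_is_advancing(playing), video_is_advancing(stalled))
```

Using both signals guards against a player whose clock advances while the picture stays frozen, and against the reverse.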

🧩

Full-Coverage Vision

Large artifacts are tiled at full resolution (overview + detail), so no small text or chart data is lost to downscaling. Pixel-based & source-agnostic — HTML, images, PDFs, canvas alike.
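As an illustration of the detail pass, the sketch below computes overlapping tile rectangles that cover an image at full resolution. The overview would be a single downscaled copy of the whole image. The tile size, overlap, and function name are assumptions and not AgentVision's actual parameters.

```python
def tile_boxes(width, height, tile=1000, overlap=100):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Neighbouring tiles share at least `overlap` pixels so text on a seam
    appears whole in at least one tile.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap

    def starts(length):
        # A single tile is enough when the image fits inside one.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # last tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Hypothetical wide screenshot: 2800 x 1000 pixels.
boxes = tile_boxes(2800, 1000)
for box in boxes:
    print(box)
```

Each box can be cropped and analyzed on its own, and findings mapped back to full-image coordinates by adding the tile's left and top offsets.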

Eyes & Brain

The eyes for a brain that decides

AgentVision is the eyes. Pair it with Verel — the brain — where nothing is “done” until a grader returns a verdict. The eyes perceive and grade intent; the brain decides with attestation and compounds only verified work into memory. Then the eyes look again.


New in v0.9.1: intent conformance (does it match what you set out to build?), a generative loop for AI images/infographics, and the eyes→brain Handoff signal. Verel on GitHub →

Flexible integration

Many faces, one core

Use AgentVision from any surface that fits your workflow.

Surface	Entry point	Who it’s for
Library	import agentvision	Python apps and custom agent harnesses
CLI	agentvision analyze / loop / sheet	Any agent that can run a shell command; CI pipelines
Skill	Claude Code Skill	Claude agents; auto-invokes the loop before claiming done
MCP	agentvision-mcp	Cursor, Claude Desktop, any MCP-capable host
REST	agentvision-serve	Non-MCP / networked / CI agents via HTTP

Vision backends

Pluggable & provider-agnostic

Switch via --backend or AGENTVISION_VISION_BACKEND
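A harness that wraps AgentVision might resolve the backend along these lines. This is a sketch under assumptions: that an explicit flag should win over the environment variable, that the backend identifiers match the provider names ("local" is the only one confirmed by the CLI examples), and that the default is a caller choice. The actual precedence inside AgentVision is not documented here.

```python
import os

# Assumed identifiers; only "local" appears in the documented CLI examples.
KNOWN_BACKENDS = {"anthropic", "openai", "gemini", "local"}


def resolve_backend(cli_flag=None, environ=None, default="local"):
    """Pick a backend name: explicit flag first, then env var, then default."""
    environ = os.environ if environ is None else environ
    choice = cli_flag or environ.get("AGENTVISION_VISION_BACKEND") or default
    choice = choice.strip().lower()
    if choice not in KNOWN_BACKENDS:
        raise ValueError(f"unknown vision backend: {choice!r}")
    return choice


env = {"AGENTVISION_VISION_BACKEND": "gemini"}
print(resolve_backend(None, env))
print(resolve_backend("openai", env))
```

The resolved name could then be passed to load_settings(vision_backend=...), as in the Python API example.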

🟣

Anthropic

Default: claude-haiku-4-5
Upgradable to Sonnet / Opus

🟢

OpenAI

GPT-4o and compatible vision models

🔵

Gemini

Google Gemini vision via google-genai

⚫

Local (no key)

CV + OCR heuristics. No API key, no egress — great for CI & air-gapped envs.

Get started

Running in 60 seconds

No API key required for the demo.

Quick start

# install with rendering
pip install "agentvision[render]"

# install Chromium
playwright install chromium

# run the demo (no API key)
agentvision demo

# check system deps
agentvision doctor

CLI usage

# analyze a file / URL
agentvision analyze ./index.html --backend local

# self-correcting loop
agentvision loop ./dashboard.html --max-iter 3

# responsive contact sheet
agentvision sheet ./index.html --breakpoints 375,768,1280

# visual regression
agentvision baseline ./index.html --name home
agentvision regress  ./index.html --name home

Install extras

# everything
pip install "agentvision[all]"

# render + Claude
pip install "agentvision[render,anthropic]"

# MCP server
pip install "agentvision[render,mcp]"

# REST service
pip install "agentvision[render,serve]"

Python API

import asyncio
from agentvision import load_settings
from agentvision.core.loop import LoopSession

async def main():
    settings = load_settings(
        vision_backend="local"
    )
    session = LoopSession(
        "examples/broken.html",
        settings=settings
    )
    result = await session.iterate()
    print(result.report.verdict)

asyncio.run(main())

What we do not claim

Pixel-accurate vision-model bounding boxes (advisory only)
WCAG verdicts on rasterized non-HTML
Bit-reproducible screenshots across runs
Deterministic LLM reports
Uniform behavior across all providers
Forcing non-Claude agents into the loop

Adopt

Drop it into your workflow & agents

CI gate in one step (uses: amitpatole/agent-vision@v0.9.1), a pre-commit hook, or shell out anywhere (exit codes 0/2/3). For agents: the Claude Code Skill, MCP tools, or a drop-in agent contract.
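Shelling out from a script or agent harness can look like the sketch below. It treats exit code 0 as a pass (an assumption based on shell convention) and returns any other code unchanged, since the meanings of 2 and 3 are not spelled out here. The run_gate name is hypothetical, and the demo uses a stand-in command so it runs without AgentVision installed.

```python
import subprocess
import sys


def run_gate(command):
    """Run a command and return (exit_code, passed), where passed means exit code 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.returncode == 0


# Stand-in for a real invocation such as:
#   ["agentvision", "analyze", "./index.html", "--backend", "local"]
stand_in = [sys.executable, "-c", "raise SystemExit(2)"]
code, passed = run_gate(stand_in)
print(code, passed)
```

In CI, the caller would typically exit with the returned code so the pipeline fails when the gate does.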

Coding agents are blind.
Now they can see.

The Self-Correcting Visual Loop
