Skip to main content

Vision — vision_analyze

vision_analyze is a single-shot tool that answers a prompt about one image (PNG / JPEG / GIF / WEBP) or one PDF. The active personality's LLM — or a separately configured auxiliary vision model — returns the text answer plus token usage and dollar cost.

It is intentionally not a streaming chat surface. Each call returns one envelope. Multi-turn vision conversations layer on top by repeating calls.

Source

The tool factory lives in extensions/tools-vision/src/index.ts (createVisionTools). The capability table lives in extensions/tools-vision/src/pricing.ts. Wiring registers the tool in packages/wiring/src/index.ts, mirroring the auxiliary.compression pattern.

Opting in

Add vision_analyze to the personality's toolset.yaml:

# ~/.ethos/personalities/<id>/toolset.yaml
- read_file
- vision_analyze

The wiring registers the tool unconditionally; the personality toolset allowlist is what gates which personalities can see it. See Personality config reference.

Signature

vision_analyze({
file_path?: string,
file_url?: string,
file_base64?: string,
prompt: string,
model?: string,
format?: { type: 'json_schema', schema: { type: 'object', ... } },
}) → JSON envelope

Exactly one of file_path / file_url / file_base64 must be set.

FieldTypeRequiredDescription
file_pathabsolute pathone-ofLocal file. Must lie inside the personality's fs_reach allowlist.
file_urlHTTPS URLone-ofFetched through the @ethosagent/safety-network SSRF gate. Max 32 MB.
file_base64base64 stringone-ofRaw bytes. Optional data:<mime>;base64, prefix accepted.
promptstringyesQuestion or instruction for the model.
modelstringnoOverride; otherwise follows the model fallback chain below.
format.type'json_schema'noAsk the model for parseable JSON.
format.schemaJSON SchemanoTop-level type must be 'object' in v1.

Return envelope (success)

{
"text": "the model's answer",
"parsed": { "...": "..." },
"model": "claude-opus-4-7",
"cost_usd": 0.0034,
"input_tokens": 1287,
"output_tokens": 48
}

parsed is present only when format.json_schema was supplied. ToolResult.cost_usd matches envelope.cost_usd so the framework's per-session cost counter rendered by /usage increments correctly.

maxResultChars: 8_000 — long transcripts get clipped, with the per-call truncation marker DefaultToolRegistry appends to every over-budget result.

Model fallback chain

Per call, the resolved model is:

  1. args.model (caller override) — useful for "use gpt-5 for this one".
  2. auxiliaryVisionModel — from auxiliary.vision.model in ~/.ethos/config.yaml.
  3. defaultModel — the active personality's main model.

First non-null wins. The capability table then gates whichever model the chain returned. The LLMProvider that serves the resolved model is looked up via a resolveProvider callback wired at registration time — when wiring has no provider for the resolved id, the tool fails with VISION_NOT_SUPPORTED.

Capability table

Two flags per model — vision and pdf. Unknown models default to both flags false, which surfaces as a VISION_NOT_SUPPORTED or PDF_NOT_SUPPORTED error.

Modelvisionpdf
claude-opus-4-7yesyes
claude-sonnet-4-6yesyes
gpt-5yesyes
gpt-5-miniyesyes
gemini-2.5-proyesyes
gemini-2.5-flashyesyes

Helpers: supportsVision(model) and supportsPdf(model) (both exported from @ethosagent/tools-vision). Adding a model is a one-row edit in pricing.ts. Aliases (dated suffixes) are not inferred — every model id is listed explicitly to keep the gate deterministic.

Configuration

# ~/.ethos/config.yaml

# Primary chat config (existing) — vision_analyze defaults to this model
provider: anthropic
model: claude-opus-4-7
apiKey: sk-ant-...

# Optional auxiliary vision routing. Useful when the personality's main
# model is non-vision (e.g. a local llama) but you still want PDF/image
# Q&A available via a cloud model.
auxiliary.vision.model: claude-sonnet-4-6
auxiliary.vision.provider: anthropic # defaults to top-level provider
auxiliary.vision.apiKey: sk-ant-vision-... # defaults to top-level apiKey
auxiliary.vision.baseUrl: https://... # defaults to top-level baseUrl

Mirrors the auxiliary.compression shape. The config type is AuxiliaryVisionConfig in apps/ethos/src/config.ts.

Examples

Image:

vision_analyze({
file_path: "/Users/me/code/repo/screenshots/error.png",
prompt: "What error is shown here? Be specific about the file and line."
})

PDF + structured output:

vision_analyze({
file_path: "/Users/me/docs/invoice.pdf",
prompt: "Extract the total amount and the vendor name.",
format: {
type: "json_schema",
schema: {
type: "object",
properties: {
total: { type: "string" },
vendor: { type: "string" }
},
required: ["total", "vendor"]
}
}
})

Remote URL:

vision_analyze({
file_url: "https://example.com/diagram.png",
prompt: "Describe the architecture diagram."
})

Error codes

Tool failures carry a domain-code prefix in the error string so callers can pattern-match without parsing the framework's code.

PrefixcodeCause
INVALID_INPUTinput_invalidMissing prompt, wrong format shape, or zero / multiple file keys.
FILE_NOT_FOUNDinput_invalidfile_path is outside the fs_reach allowlist or does not exist.
URL_BLOCKEDinput_invalidfile_url is non-HTTPS, points at a private network, or fails the SSRF gate.
FILE_TOO_LARGEinput_invalidImage > 5 MB or PDF > 32 MB.
UNSUPPORTED_FILE_TYPEinput_invalidMagic-byte check did not match PNG / JPEG / GIF / WEBP / PDF.
VISION_NOT_SUPPORTEDnot_availableResolved model is not vision-capable, or no LLMProvider is configured for it.
PDF_NOT_SUPPORTEDnot_availableInput is a PDF but the resolved model does not support PDF.
PDF_TOO_MANY_PAGESexecution_failedProvider rejected the document on page-count grounds.
LLM_ERRORexecution_failedAny other provider failure (rate-limit, network, auth).
RESPONSE_NOT_JSONexecution_failedformat.json_schema requested but the model returned non-JSON or omitted a required field.

Limitations

  • One file per call. No multi-image batching, no document + image mixed prompts.
  • No auto-downscale or paging. Pre-process oversized inputs before calling.
  • No caching. Each call pays the input-token cost.
  • format.json_schema is a tiny validator. Top-level type must be 'object'; only required field presence is checked. Full JSON-schema validation is out of scope for v1.
  • No streaming. The tool buffers the full response before returning.

Video — video_analyze

Companion tool that analyses a video accessible via HTTPS URL. Same provider plumbing as vision_analyze; the model fetches the video itself (Claude / GPT-4o vision endpoints support video-via-URL today).

Source

extensions/tools-vision/src/video.ts (createVideoAnalyzeTool). Capability column lives in the same pricing table: extensions/tools-vision/src/pricing.ts — providers without video support refuse with VIDEO_NOT_SUPPORTED.

Schema

FieldTypeRequiredDescription
file_urlstringyesHTTPS URL to the video. SSRF-checked through the same safety pipeline as vision_analyze.
promptstringnoQuestion or instruction. Default: "Describe this video in detail."
modelstringnoOverride the resolved model. Must be video-capable.

Tool metadata: toolset: 'vision' (same bucket — a personality with vision_analyze typically lists video_analyze alongside), maxResultChars: 30_000, capabilities: { network: { allowedHosts: ['*'] } }, outputIsUntrusted: true.

Limitations

  • URL only. No file_path / file_base64 — the type system has no video-content block for base64 inlining, so local files aren't supported. Upload to an HTTPS-reachable host first.
  • Provider-dependent. Anthropic and OpenAI's vision-capable chat models accept videos via URL today; refuse with VIDEO_NOT_SUPPORTED on other providers.
  • No frame-by-frame extraction. The model summarises; it doesn't return timestamps or per-frame data structure.
  • Cost. Video is significantly more expensive per call than images — token accounting flows through the same usage / cost envelope.

Example

video_analyze({
file_url: "https://example.com/demo.mp4",
prompt: "What is the user trying to do in this screen recording? Identify any errors shown."
})

Returns the model's text answer plus token usage and dollar cost in the standard envelope.

See also