Vision — vision_analyze
vision_analyze is a single-shot tool that answers a prompt about one image (PNG / JPEG / GIF / WEBP) or one PDF. The active personality's LLM — or a separately configured auxiliary vision model — returns the text answer plus token usage and dollar cost.
It is intentionally not a streaming chat surface. Each call returns one envelope. Multi-turn vision conversations layer on top by repeating calls.
Source
The tool factory lives in extensions/tools-vision/src/index.ts (createVisionTools). The capability table lives in extensions/tools-vision/src/pricing.ts. Wiring registers the tool in packages/wiring/src/index.ts, mirroring the auxiliary.compression pattern.
Opting in
Add vision_analyze to the personality's toolset.yaml:
# ~/.ethos/personalities/<id>/toolset.yaml
- read_file
- vision_analyze
The wiring registers the tool unconditionally; the personality toolset allowlist is what gates which personalities can see it. See Personality config reference.
Signature
vision_analyze({
  file_path?: string,
  file_url?: string,
  file_base64?: string,
  prompt: string,
  model?: string,
  format?: { type: 'json_schema', schema: { type: 'object', ... } },
}) → JSON envelope
Exactly one of file_path / file_url / file_base64 must be set.
| Field | Type | Required | Description |
|---|---|---|---|
| file_path | absolute path | one-of | Local file. Must lie inside the personality's fs_reach allowlist. |
| file_url | HTTPS URL | one-of | Fetched through the @ethosagent/safety-network SSRF gate. Max 32 MB. |
| file_base64 | base64 string | one-of | Raw bytes. Optional data:<mime>;base64, prefix accepted. |
| prompt | string | yes | Question or instruction for the model. |
| model | string | no | Override; otherwise follows the model fallback chain below. |
| format.type | 'json_schema' | no | Ask the model for parseable JSON. |
| format.schema | JSON Schema | no | Top-level type must be 'object' in v1. |
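The exactly-one-of rule and the optional data: prefix can be sketched as two small helpers. This is an illustrative sketch only, not the tool's actual validation code: the names pickFileSource and stripDataUrlPrefix are hypothetical, and the "PREFIX: message" error layout is an assumption.

```typescript
// Hypothetical sketch of the "exactly one of" rule; not the tool's real code.
type FileKey = 'file_path' | 'file_url' | 'file_base64';

interface VisionArgs {
  file_path?: string;
  file_url?: string;
  file_base64?: string;
  prompt: string;
}

const FILE_KEYS: FileKey[] = ['file_path', 'file_url', 'file_base64'];

// Returns the single file key that is set, or throws an INVALID_INPUT-style error.
// Assumption: the error string has the shape "PREFIX: message".
function pickFileSource(args: VisionArgs): FileKey {
  const present = FILE_KEYS.filter((key) => args[key] !== undefined);
  const [only] = present;
  if (present.length !== 1 || only === undefined) {
    throw new Error(`INVALID_INPUT: expected exactly one file key, got ${present.length}`);
  }
  return only;
}

// Drops an optional "data:<mime>;base64," prefix and returns the bare payload.
function stripDataUrlPrefix(value: string): string {
  const match = /^data:[^;,]+;base64,/.exec(value);
  return match ? value.slice(match[0].length) : value;
}
```

Under these assumptions, a call with zero or several file keys fails before any file is read, which matches the INVALID_INPUT row in the error table.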
Return envelope (success)
{
  "text": "the model's answer",
  "parsed": { "...": "..." },
  "model": "claude-opus-4-7",
  "cost_usd": 0.0034,
  "input_tokens": 1287,
  "output_tokens": 48
}
parsed is present only when format.json_schema was supplied. ToolResult.cost_usd matches envelope.cost_usd so the framework's per-session cost counter rendered by /usage increments correctly.
maxResultChars: 8_000. Longer results are clipped, and DefaultToolRegistry appends the same per-call truncation marker it adds to every over-budget result.
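A caller that layers a multi-turn flow on repeated calls may want to parse each envelope and keep a running cost. The sketch below assumes the tool result arrives as a JSON string with the fields shown above; the VisionEnvelope interface and both function names are hypothetical, not exports of the package.

```typescript
// Hypothetical caller-side types; field names follow the success envelope above.
interface VisionEnvelope {
  text: string;
  parsed?: Record<string, unknown>;
  model: string;
  cost_usd: number;
  input_tokens: number;
  output_tokens: number;
}

// Parses one tool result and checks the fields this sketch relies on.
function parseEnvelope(raw: string): VisionEnvelope {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== 'object' || value === null) {
    throw new Error('envelope is not a JSON object');
  }
  const record = value as Record<string, unknown>;
  for (const key of ['text', 'model']) {
    if (typeof record[key] !== 'string') throw new Error(`envelope.${key} must be a string`);
  }
  for (const key of ['cost_usd', 'input_tokens', 'output_tokens']) {
    if (typeof record[key] !== 'number') throw new Error(`envelope.${key} must be a number`);
  }
  return record as unknown as VisionEnvelope;
}

// Sums cost across repeated calls, e.g. one call per turn of a conversation.
function totalCostUsd(envelopes: VisionEnvelope[]): number {
  return envelopes.reduce((sum, envelope) => sum + envelope.cost_usd, 0);
}
```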
Model fallback chain
Per call, the resolved model is:
1. args.model (caller override): useful for "use gpt-5 for this one".
2. auxiliaryVisionModel: from auxiliary.vision.model in ~/.ethos/config.yaml.
3. defaultModel: the active personality's main model.
First non-null wins. The capability table then gates whichever model the chain returned. The LLMProvider that serves the resolved model is looked up via a resolveProvider callback wired at registration time — when wiring has no provider for the resolved id, the tool fails with VISION_NOT_SUPPORTED.
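The chain and the provider lookup can be sketched as follows. The source confirms only that a resolveProvider callback exists and that a missing provider fails with VISION_NOT_SUPPORTED. The callback signature, the LLMProviderLike shape, and the "PREFIX: message" error layout here are assumptions made for illustration.

```typescript
// Hypothetical inputs to the chain; names mirror the three steps above.
interface ModelChain {
  argsModel?: string | null;
  auxiliaryVisionModel?: string | null;
  defaultModel?: string | null;
}

// First non-null entry wins; null means nothing in the chain was configured.
function resolveModel(chain: ModelChain): string | null {
  return chain.argsModel ?? chain.auxiliaryVisionModel ?? chain.defaultModel ?? null;
}

// Stand-in for the real LLMProvider type, whose shape is not shown in this page.
interface LLMProviderLike {
  name: string;
}

// Assumed callback signature: returns undefined when wiring has no provider.
type ResolveProvider = (modelId: string) => LLMProviderLike | undefined;

function providerFor(modelId: string, resolveProvider: ResolveProvider): LLMProviderLike {
  const provider = resolveProvider(modelId);
  if (provider === undefined) {
    throw new Error(`VISION_NOT_SUPPORTED: no provider configured for ${modelId}`);
  }
  return provider;
}
```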
Capability table
Two flags per model — vision and pdf. Unknown models default to both flags false, which surfaces as a VISION_NOT_SUPPORTED or PDF_NOT_SUPPORTED error.
| Model | vision | pdf |
|---|---|---|
| claude-opus-4-7 | yes | yes |
| claude-sonnet-4-6 | yes | yes |
| gpt-5 | yes | yes |
| gpt-5-mini | yes | yes |
| gemini-2.5-pro | yes | yes |
| gemini-2.5-flash | yes | yes |
Helpers: supportsVision(model) and supportsPdf(model) (both exported from @ethosagent/tools-vision). Adding a model is a one-row edit in pricing.ts. Aliases (dated suffixes) are not inferred — every model id is listed explicitly to keep the gate deterministic.
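A deterministic gate of this kind can be sketched with an exact-match lookup. This is not the contents of pricing.ts. The table below copies only the rows listed above, and gateModel is a hypothetical name. Which flag is checked for which input kind is also an assumption.

```typescript
interface Capability {
  vision: boolean;
  pdf: boolean;
}

// Rows copied from the table above; every id is listed explicitly.
const CAPABILITIES = new Map<string, Capability>([
  ['claude-opus-4-7', { vision: true, pdf: true }],
  ['claude-sonnet-4-6', { vision: true, pdf: true }],
  ['gpt-5', { vision: true, pdf: true }],
  ['gpt-5-mini', { vision: true, pdf: true }],
  ['gemini-2.5-pro', { vision: true, pdf: true }],
  ['gemini-2.5-flash', { vision: true, pdf: true }],
]);

// Unknown ids, including dated aliases, fall back to both flags false.
const NO_CAPABILITY: Capability = { vision: false, pdf: false };

function capabilityFor(model: string): Capability {
  return CAPABILITIES.get(model) ?? NO_CAPABILITY;
}

type GateError = 'VISION_NOT_SUPPORTED' | 'PDF_NOT_SUPPORTED';

// Assumption: PDF inputs are gated on the pdf flag, images on the vision flag.
function gateModel(model: string, kind: 'image' | 'pdf'): GateError | null {
  const capability = capabilityFor(model);
  if (kind === 'pdf') return capability.pdf ? null : 'PDF_NOT_SUPPORTED';
  return capability.vision ? null : 'VISION_NOT_SUPPORTED';
}
```

Because the lookup is an exact string match, a made-up dated id such as gpt-5-2099-01-01 is rejected until it gets its own row.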
Configuration
# ~/.ethos/config.yaml
# Primary chat config (existing) — vision_analyze defaults to this model
provider: anthropic
model: claude-opus-4-7
apiKey: sk-ant-...
# Optional auxiliary vision routing. Useful when the personality's main
# model is non-vision (e.g. a local llama) but you still want PDF/image
# Q&A available via a cloud model.
auxiliary.vision.model: claude-sonnet-4-6
auxiliary.vision.provider: anthropic # defaults to top-level provider
auxiliary.vision.apiKey: sk-ant-vision-... # defaults to top-level apiKey
auxiliary.vision.baseUrl: https://... # defaults to top-level baseUrl
Mirrors the auxiliary.compression shape. The config type is AuxiliaryVisionConfig in apps/ethos/src/config.ts.
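The defaulting described in the config comments can be sketched as a small merge. The interfaces below are hypothetical stand-ins and do not reproduce AuxiliaryVisionConfig, whose exact shape is not shown here. The sketch assumes each auxiliary field falls back independently to its top-level counterpart.

```typescript
// Hypothetical shapes for illustration only.
interface TopLevelSettings {
  provider: string;
  model: string;
  apiKey: string;
  baseUrl: string | undefined;
}

interface AuxVisionSketch {
  model?: string;
  provider?: string;
  apiKey?: string;
  baseUrl?: string;
}

// Each auxiliary field overrides the top-level value only when it is set.
function resolveVisionSettings(top: TopLevelSettings, aux?: AuxVisionSketch): TopLevelSettings {
  return {
    provider: aux?.provider ?? top.provider,
    model: aux?.model ?? top.model,
    apiKey: aux?.apiKey ?? top.apiKey,
    baseUrl: aux?.baseUrl ?? top.baseUrl,
  };
}
```

With only auxiliary.vision.model set, the sketch reuses the top-level provider and key, which is the common case of routing vision to a cheaper model from the same vendor.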
Examples
Image:
vision_analyze({
  file_path: "/Users/me/code/repo/screenshots/error.png",
  prompt: "What error is shown here? Be specific about the file and line."
})
PDF + structured output:
vision_analyze({
  file_path: "/Users/me/docs/invoice.pdf",
  prompt: "Extract the total amount and the vendor name.",
  format: {
    type: "json_schema",
    schema: {
      type: "object",
      properties: {
        total: { type: "string" },
        vendor: { type: "string" }
      },
      required: ["total", "vendor"]
    }
  }
})
Remote URL:
vision_analyze({
  file_url: "https://example.com/diagram.png",
  prompt: "Describe the architecture diagram."
})
Error codes
Tool failures carry a domain-code prefix in the error string so callers can pattern-match without parsing the framework's code.
| Prefix | code | Cause |
|---|---|---|
| INVALID_INPUT | input_invalid | Missing prompt, wrong format shape, or zero / multiple file keys. |
| FILE_NOT_FOUND | input_invalid | file_path is outside the fs_reach allowlist or does not exist. |
| URL_BLOCKED | input_invalid | file_url is non-HTTPS, points at a private network, or fails the SSRF gate. |
| FILE_TOO_LARGE | input_invalid | Image > 5 MB or PDF > 32 MB. |
| UNSUPPORTED_FILE_TYPE | input_invalid | Magic-byte check did not match PNG / JPEG / GIF / WEBP / PDF. |
| VISION_NOT_SUPPORTED | not_available | Resolved model is not vision-capable, or no LLMProvider is configured for it. |
| PDF_NOT_SUPPORTED | not_available | Input is a PDF but the resolved model does not support PDF. |
| PDF_TOO_MANY_PAGES | execution_failed | Provider rejected the document on page-count grounds. |
| LLM_ERROR | execution_failed | Any other provider failure (rate-limit, network, auth). |
| RESPONSE_NOT_JSON | execution_failed | format.json_schema requested but the model returned non-JSON or omitted a required field. |
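Callers can pattern-match on the prefix roughly as sketched below. The prefix-to-code mapping is copied from the table above. The assumption is that the prefix is the leading run of capitals and underscores in the error string, since the exact separator is not specified here. The names errorPrefix and frameworkCode are hypothetical.

```typescript
// Prefix-to-code mapping copied from the error table.
const FRAMEWORK_CODES = {
  INVALID_INPUT: 'input_invalid',
  FILE_NOT_FOUND: 'input_invalid',
  URL_BLOCKED: 'input_invalid',
  FILE_TOO_LARGE: 'input_invalid',
  UNSUPPORTED_FILE_TYPE: 'input_invalid',
  VISION_NOT_SUPPORTED: 'not_available',
  PDF_NOT_SUPPORTED: 'not_available',
  PDF_TOO_MANY_PAGES: 'execution_failed',
  LLM_ERROR: 'execution_failed',
  RESPONSE_NOT_JSON: 'execution_failed',
} as const;

type VisionErrorPrefix = keyof typeof FRAMEWORK_CODES;

function isVisionErrorPrefix(token: string): token is VisionErrorPrefix {
  return Object.prototype.hasOwnProperty.call(FRAMEWORK_CODES, token);
}

// Extracts the leading domain code, or null when the string carries none.
function errorPrefix(message: string): VisionErrorPrefix | null {
  const match = /^[A-Z_]+/.exec(message);
  if (match === null) return null;
  const token: string = match[0];
  return isVisionErrorPrefix(token) ? token : null;
}

// Maps a domain prefix to the framework code listed in the table.
function frameworkCode(prefix: VisionErrorPrefix): string {
  return FRAMEWORK_CODES[prefix];
}
```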
Limitations
- One file per call. No multi-image batching, no document + image mixed prompts.
- No auto-downscale or paging. Pre-process oversized inputs before calling.
- No caching. Each call pays the input-token cost.
- format.json_schema is a tiny validator. Top-level type must be 'object'; only required field presence is checked. Full JSON-schema validation is out of scope for v1.
- No streaming. The tool buffers the full response before returning.
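The scope of the "tiny validator" can be illustrated with a sketch that checks only what the limitation describes: the schema's top-level type and the presence of required fields. It is not the tool's actual validator. The function name and error messages are hypothetical, and using INVALID_INPUT for a non-object schema is an assumption based on the "wrong format shape" row.

```typescript
interface TinySchema {
  type: string;
  properties?: Record<string, unknown>;
  required?: string[];
}

// Parses model output and checks required-field presence only.
// Property types declared in the schema are deliberately NOT validated.
function parseStructured(text: string, schema: TinySchema): Record<string, unknown> {
  if (schema.type !== 'object') {
    throw new Error("INVALID_INPUT: top-level schema type must be 'object'");
  }
  let value: unknown;
  try {
    value = JSON.parse(text);
  } catch {
    throw new Error('RESPONSE_NOT_JSON: model output is not valid JSON');
  }
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    throw new Error('RESPONSE_NOT_JSON: expected a JSON object');
  }
  const record = value as Record<string, unknown>;
  for (const field of schema.required ?? []) {
    if (!(field in record)) {
      throw new Error(`RESPONSE_NOT_JSON: missing required field "${field}"`);
    }
  }
  return record;
}
```

Under this sketch a response like {"total": 12.5, "vendor": "Acme"} passes even though the schema declares total as a string, so callers that need type guarantees should validate the parsed object themselves.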
Video — video_analyze
Companion tool that analyses a video accessible via HTTPS URL. Same provider plumbing as vision_analyze; the model fetches the video itself (Claude / GPT-4o vision endpoints support video-via-URL today).
Source
extensions/tools-vision/src/video.ts (createVideoAnalyzeTool). Capability column lives in the same pricing table: extensions/tools-vision/src/pricing.ts — providers without video support refuse with VIDEO_NOT_SUPPORTED.
Schema
| Field | Type | Required | Description |
|---|---|---|---|
| file_url | string | yes | HTTPS URL to the video. SSRF-checked through the same safety pipeline as vision_analyze. |
| prompt | string | no | Question or instruction. Default: "Describe this video in detail." |
| model | string | no | Override the resolved model. Must be video-capable. |
Tool metadata: toolset: 'vision' (same bucket — a personality with vision_analyze typically lists video_analyze alongside), maxResultChars: 30_000, capabilities: { network: { allowedHosts: ['*'] } }, outputIsUntrusted: true.
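The schema's defaults can be sketched as a small argument normaliser. This is a hypothetical helper, not code from video.ts. The scheme check stands in for the real SSRF pipeline, which does considerably more, and reusing the URL_BLOCKED prefix for video is an assumption.

```typescript
const DEFAULT_VIDEO_PROMPT = 'Describe this video in detail.';

interface VideoArgs {
  file_url: string;
  prompt?: string;
  model?: string;
}

interface NormalizedVideoArgs {
  file_url: string;
  prompt: string;
  model: string | undefined;
}

// Applies the default prompt and rejects non-HTTPS URLs up front.
// Assumption: a scheme check only; private-network filtering is out of scope here.
function normalizeVideoArgs(args: VideoArgs): NormalizedVideoArgs {
  if (!/^https:\/\//i.test(args.file_url)) {
    throw new Error('URL_BLOCKED: file_url must be an HTTPS URL');
  }
  return {
    file_url: args.file_url,
    prompt: args.prompt ?? DEFAULT_VIDEO_PROMPT,
    model: args.model,
  };
}
```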
Limitations
- URL only. No file_path / file_base64: the type system has no video-content block for base64 inlining, so local files aren't supported. Upload to an HTTPS-reachable host first.
- Provider-dependent. Anthropic and OpenAI's vision-capable chat models accept videos via URL today; the tool refuses with VIDEO_NOT_SUPPORTED on other providers.
- No frame-by-frame extraction. The model summarises; it doesn't return timestamps or per-frame data structures.
- Cost. Video is significantly more expensive per call than images — token accounting flows through the same usage / cost envelope.
Example
video_analyze({
  file_url: "https://example.com/demo.mp4",
  prompt: "What is the user trying to do in this screen recording? Identify any errors shown."
})
Returns the model's text answer plus token usage and dollar cost in the standard envelope.
See also
- extensions/tools-vision/README.md: package-level reference with the same surface plus the file map.
- browser-tools: pair browser_screenshot with vision_analyze for vision-on-page.
- Personality config reference: how toolset.yaml gates which personalities see vision_analyze.
- config.yaml reference: every field these subcommands read, including auxiliary.*.