Vision

Image analysis and document processing for multimodal interactions

Overview

The Vision system lets agents process and analyze images, adding multimodal capabilities to agent interactions. It supports common image formats, provides three analysis modes (general analysis, description, and text extraction), and works with OpenAI, Claude, Gemini, and local Ollama providers.

Enabling Vision

Enable vision capabilities for an agent by setting the vision option to true:

import { Agent } from '@astreus-ai/astreus';

const agent = await Agent.create({
  name: 'VisionAgent',
  model: 'gpt-4o',  // Vision-capable model
  vision: true      // Enable vision capabilities (default: false)
});

Attachment System

Astreus supports an intuitive attachment system for working with images:

// Pass images alongside the prompt through the attachments option
const response = await agent.ask("What do you see in this image?", {
  attachments: [
    { type: 'image', path: '/path/to/image.jpg', name: 'My Photo' }
  ]
});

The attachment system automatically:

- Detects the file type and selects appropriate tools
- Enhances the prompt with attachment information
- Enables tool usage when attachments are present
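
When attachments are built from a list of file paths, a small helper can keep the objects consistent. The sketch below is illustrative only: `buildImageAttachments` and the `ImageAttachment` interface are hypothetical names, not part of Astreus, and the shape simply mirrors the `{ type, path, name }` objects used in the example above.

```typescript
// Shape of an image attachment, mirroring the objects in the example above.
interface ImageAttachment {
  type: 'image';
  path: string;
  name: string;
}

// Hypothetical helper (not an Astreus API): wraps each path in an attachment
// object and uses the file name as the display name.
function buildImageAttachments(paths: string[]): ImageAttachment[] {
  return paths.map((path): ImageAttachment => {
    const fileName = path.split('/').pop() || path;
    return { type: 'image', path, name: fileName };
  });
}

const attachments = buildImageAttachments([
  '/designs/mockup1.png',
  '/designs/mockup2.png'
]);
console.log(attachments);
```

The resulting array could then be passed as the `attachments` option of `agent.ask`.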

Vision Capabilities

The vision system provides three core capabilities through built-in tools:

1. General Image Analysis

Analyze images with custom prompts and configurable detail levels:

// Using attachments (recommended approach)
const response = await agent.ask("Please analyze this screenshot and describe the UI elements", {
  attachments: [
    { type: 'image', path: '/path/to/screenshot.png', name: 'UI Screenshot' }
  ]
});

// Using the analyze_image tool through conversation
const response2 = await agent.ask("Please analyze the image at /path/to/screenshot.png and describe the UI elements");

// Direct method call
const analysis = await agent.analyzeImage('/path/to/image.jpg', {
  prompt: 'What UI elements are visible in this interface?',
  detail: 'high',
  maxTokens: 1500
});

2. Image Description

Generate structured descriptions for different use cases:

// Accessibility-friendly description
const description = await agent.describeImage('/path/to/image.jpg', 'accessibility');

// Available styles:
// - 'detailed': Comprehensive description of all visual elements
// - 'concise': Brief description of main elements  
// - 'accessibility': Screen reader-friendly descriptions
// - 'technical': Technical analysis including composition and lighting

3. Text Extraction (OCR)

Extract and transcribe text from images:

// Extract text with language hint
const text = await agent.extractTextFromImage('/path/to/document.jpg', 'english');

// The system maintains original formatting and structure
console.log(text);
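
Because the extracted text comes back as a plain string with line breaks preserved, any structure has to be recovered by the caller. The following is a minimal sketch, assuming the output contains `Label: value` lines; `findField` is a hypothetical helper and the sample text is made up for illustration.

```typescript
// Hypothetical helper (not an Astreus API): returns the value of the first
// line that starts with "<label>:", or undefined if no such line exists.
function findField(text: string, label: string): string | undefined {
  const prefix = `${label}:`;
  for (const line of text.split('\n')) {
    if (line.indexOf(prefix) === 0) {
      return line.slice(prefix.length).trim();
    }
  }
  return undefined;
}

// Illustrative OCR output; real results depend on the image and the model.
const sampleText =
  'INVOICE\nDate: January 15, 2024\nInvoice #: INV-2024-001\n' +
  'Subtotal: $1,700\nTax (8%): $136\nTotal: $1,836';

console.log(findField(sampleText, 'Total'));     // prints "$1,836"
console.log(findField(sampleText, 'Invoice #')); // prints "INV-2024-001"
```

Matching on the start of the line keeps `Total` from accidentally matching the `Subtotal` line.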

Supported Formats

The vision system supports these image formats:

- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- WebP (.webp)
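
To reject unsupported files before they reach the model, a pre-flight extension check along these lines may help. This is a sketch only: `isSupportedImage` is a hypothetical helper, and Astreus may perform its own validation, possibly on file contents rather than extensions.

```typescript
// Extensions taken from the supported-formats list above.
const SUPPORTED_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'];

// Hypothetical pre-flight check (not an Astreus API): compares the
// lowercased file extension against the supported list.
function isSupportedImage(path: string): boolean {
  const dot = path.lastIndexOf('.');
  if (dot === -1) return false;
  const extension = path.slice(dot).toLowerCase();
  return SUPPORTED_EXTENSIONS.indexOf(extension) !== -1;
}

console.log(isSupportedImage('/path/to/photo.JPG'));  // prints true
console.log(isSupportedImage('/path/to/report.pdf')); // prints false
```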

Input Sources

File Paths

Analyze images from local file system:

const result = await agent.analyzeImage('/path/to/image.jpg');

Base64 Data

Analyze images from base64-encoded data:

const base64Image = 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQ...';
const result = await agent.analyzeImageFromBase64(base64Image);
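
The example above passes a full data URI. If the input comes from an untrusted source, it can be worth checking its shape first. The sketch below assumes the `data:image/...;base64,...` form shown above; `parseImageDataUri` is a hypothetical helper, and whether Astreus also accepts a bare base64 payload is not stated here.

```typescript
interface ParsedDataUri {
  mimeType: string; // e.g. "image/jpeg"
  base64: string;   // payload after the "base64," marker
}

// Hypothetical helper (not an Astreus API): splits an image data URI into
// its MIME type and base64 payload, or returns null if the shape is wrong.
function parseImageDataUri(uri: string): ParsedDataUri | null {
  const match = /^data:(image\/[a-z0-9.+-]+);base64,(.+)$/i.exec(uri);
  if (!match) return null;
  return { mimeType: match[1].toLowerCase(), base64: match[2] };
}

const parsed = parseImageDataUri('data:image/jpeg;base64,/9j/4AAQ');
console.log(parsed ? parsed.mimeType : 'invalid data URI'); // prints "image/jpeg"
```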

Configuration

Vision Model Configuration

Specify the vision model directly in the agent configuration:

const agent = await Agent.create({
  name: 'VisionAgent',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',  // Specify vision model here
  vision: true
});

Environment Variables

# API keys (auto-detected based on model)
OPENAI_API_KEY=your_openai_key               # For OpenAI models
OPENAI_VISION_API_KEY=your_openai_key        # Dedicated vision API key (takes priority)
ANTHROPIC_API_KEY=your_anthropic_key         # For Claude models
ANTHROPIC_VISION_API_KEY=your_anthropic_key  # Dedicated vision API key (takes priority)
GEMINI_API_KEY=your_gemini_key               # For Gemini models
GEMINI_VISION_API_KEY=your_gemini_key        # Dedicated vision API key (takes priority)

# Ollama configuration (local)
OLLAMA_BASE_URL=http://localhost:11434       # Default if not set

The vision system automatically selects the appropriate provider based on the visionModel specified in the agent configuration.
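
The key precedence described in the comments above (a dedicated vision key takes priority over the general key) can be pictured roughly as follows. This is an illustrative sketch, not Astreus source code: `detectProvider` and `resolveVisionApiKey` are hypothetical names, and inferring the provider from a model-name prefix is an assumption about how auto-detection might work.

```typescript
type Provider = 'openai' | 'anthropic' | 'gemini' | 'ollama';
type Env = Record<string, string | undefined>;

// Assumption: the provider is inferred from the model-name prefix and
// anything unrecognized is treated as a local Ollama model.
function detectProvider(visionModel: string): Provider {
  if (/^gpt-/.test(visionModel)) return 'openai';
  if (/^claude-/.test(visionModel)) return 'anthropic';
  if (/^gemini-/.test(visionModel)) return 'gemini';
  return 'ollama';
}

// The dedicated *_VISION_API_KEY wins over the general *_API_KEY.
// Ollama runs locally, so no key is resolved for it.
function resolveVisionApiKey(provider: Provider, env: Env): string | undefined {
  if (provider === 'ollama') return undefined;
  const prefixes = { openai: 'OPENAI', anthropic: 'ANTHROPIC', gemini: 'GEMINI' };
  const prefix = prefixes[provider];
  return env[`${prefix}_VISION_API_KEY`] || env[`${prefix}_API_KEY`];
}

const env: Env = { OPENAI_API_KEY: 'general-key', OPENAI_VISION_API_KEY: 'vision-key' };
console.log(resolveVisionApiKey(detectProvider('gpt-4o'), env)); // prints "vision-key"
```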

Analysis Options

Configure analysis behavior with these options:

interface AnalysisOptions {
  prompt?: string;                    // Custom analysis prompt
  maxTokens?: number;                 // Response length limit (default: 1000)
  detail?: 'low' | 'high' | 'auto';   // Analysis detail level (OpenAI only)
}
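
As a rough illustration of how these options behave, the sketch below applies the documented `maxTokens` default and drops `detail` for providers other than OpenAI. `normalizeOptions` is a hypothetical helper written for this example; how Astreus handles these options internally is not shown here.

```typescript
interface AnalysisOptions {
  prompt?: string;
  maxTokens?: number;
  detail?: 'low' | 'high' | 'auto';
}

// Hypothetical helper (not an Astreus API): fills in the documented
// maxTokens default (1000) and keeps `detail` only for OpenAI models,
// since the option is documented as OpenAI only.
function normalizeOptions(options: AnalysisOptions, isOpenAI: boolean): AnalysisOptions {
  const { detail, ...rest } = options;
  const normalized: AnalysisOptions = {
    ...rest,
    maxTokens: rest.maxTokens ?? 1000
  };
  if (isOpenAI && detail !== undefined) {
    normalized.detail = detail;
  }
  return normalized;
}

console.log(normalizeOptions({ prompt: 'Describe the chart', detail: 'high' }, false));
```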

Usage Examples

Screenshot Analysis

const agent = await Agent.create({
  name: 'UIAnalyzer',
  model: 'gpt-4o',
  vision: true
});

// Analyze a UI screenshot
const analysis = await agent.analyzeImage('/path/to/app-screenshot.png', {
  prompt: 'Analyze this mobile app interface. Identify key UI components, layout structure, and potential usability issues.',
  detail: 'high'
});

console.log(analysis);

Document Processing

// Extract text from scanned documents
const documentText = await agent.extractTextFromImage('/path/to/scanned-invoice.jpg', 'english');

// Generate accessible descriptions
const accessibleDesc = await agent.describeImage('/path/to/chart.png', 'accessibility');

Multimodal Conversations

// Using attachments for cleaner API
const response = await agent.ask("I'm getting an error. Can you analyze this screenshot and help me fix it?", {
  attachments: [
    { type: 'image', path: '/Users/john/Desktop/error.png', name: 'Error Screenshot' }
  ]
});

// Multiple attachments
const response2 = await agent.ask("Compare these UI mockups and suggest improvements", {
  attachments: [
    { type: 'image', path: '/designs/mockup1.png', name: 'Design A' },
    { type: 'image', path: '/designs/mockup2.png', name: 'Design B' }
  ]
});

// Traditional approach (still works)
const response3 = await agent.ask(
  "Please analyze the error screenshot at /Users/john/Desktop/error.png and suggest how to fix the issue"
);

Provider Comparison

Feature	OpenAI (gpt-4o)	Claude (claude-3-5-sonnet)	Gemini (gemini-1.5-pro)	Ollama (llava)
Analysis Quality	Excellent	Excellent	Excellent	Good
Processing Speed	Fast	Fast	Fast	Variable
Cost	Pay-per-use	Pay-per-use	Pay-per-use	Free (local)
Privacy	Cloud-based	Cloud-based	Cloud-based	Local processing
Detail Levels	Low/High/Auto	Standard	Standard	Standard
Language Support	Extensive	Extensive	Extensive	Good

OpenAI Provider

Best for: Production applications requiring high accuracy
Default Model: gpt-4o
Features: Detail level control, excellent text recognition

Claude Provider

Best for: Nuanced analysis and detailed descriptions
Default Model: claude-3-5-sonnet-20241022
Features: Strong reasoning, excellent context understanding

Gemini Provider

Best for: Multimodal tasks and document analysis
Default Model: gemini-1.5-pro
Features: Long context support, good for complex images

Ollama Provider (Local)

Best for: Privacy-sensitive applications or development
Default Model: llava
Features: Local processing, no API costs, offline capability
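
For projects that switch providers between environments, the default models listed above can be kept in one lookup table. This is a sketch under stated assumptions: `visionConfigFor` and the provider keys are hypothetical, while the `vision` and `visionModel` fields and the model names come from this page.

```typescript
type VisionProvider = 'openai' | 'claude' | 'gemini' | 'ollama';

// Default models as listed in the provider sections above.
const DEFAULT_VISION_MODELS: Record<VisionProvider, string> = {
  openai: 'gpt-4o',
  claude: 'claude-3-5-sonnet-20241022',
  gemini: 'gemini-1.5-pro',
  ollama: 'llava'
};

// Hypothetical helper (not an Astreus API): returns the vision-related
// fields to spread into an agent configuration.
function visionConfigFor(provider: VisionProvider): { vision: boolean; visionModel: string } {
  return { vision: true, visionModel: DEFAULT_VISION_MODELS[provider] };
}

console.log(visionConfigFor('ollama')); // visionModel is "llava"
```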

Batch Processing

Process multiple images efficiently:

const images = [
  '/path/to/image1.jpg',
  '/path/to/image2.png',
  '/path/to/image3.gif'
];

// Process all images in parallel
const results = await Promise.all(
  images.map(imagePath => 
    agent.describeImage(imagePath, 'concise')
  )
);

console.log('Analysis results:', results);

// Or use task attachments for batch processing
const batchTask = await agent.createTask({
  prompt: 'Analyze all these images and provide a comparative report',
  attachments: images.map(path => ({
    type: 'image',
    path,
    name: path.split('/').pop()
  }))
});

const batchResult = await agent.executeTask(batchTask.id);
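
Running every request at once with `Promise.all` may hit provider rate limits on large image sets. One common alternative is to process the images in fixed-size batches. The sketch below is not an Astreus feature: `chunk` and `describeInBatches` are hypothetical helpers, and the `describe` callback stands in for a call such as `agent.describeImage`.

```typescript
// Splits an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Runs `describe` for each path, with at most `batchSize` calls in flight.
// Batches run one after another; calls inside a batch run in parallel.
async function describeInBatches(
  describe: (imagePath: string) => Promise<string>,
  imagePaths: string[],
  batchSize: number
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(imagePaths, batchSize)) {
    const batchResults = await Promise.all(batch.map((imagePath) => describe(imagePath)));
    results.push(...batchResults);
  }
  return results;
}
```

With an agent in scope, this could be called as `describeInBatches((p) => agent.describeImage(p, 'concise'), images, 2)`. Results are returned in the same order as the input paths.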

Built-in Vision Tools

When vision is enabled, these tools are automatically available:

analyze_image

Parameters:
- image_path (string, required): Path to image file
- prompt (string, optional): Custom analysis prompt
- detail (string, optional): 'low' or 'high' detail level

describe_image

Parameters:
- image_path (string, required): Path to image file
- style (string, optional): Description style ('detailed', 'concise', 'accessibility', 'technical')

extract_text_from_image

Parameters:
- image_path (string, required): Path to image file
- language (string, optional): Language hint for better OCR accuracy
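
For callers that construct tool arguments themselves, the parameter lists above translate into TypeScript types roughly as follows. These interfaces are a reading of this page, not types exported by Astreus, and `isDescriptionStyle` is a hypothetical guard.

```typescript
// Argument shapes derived from the parameter lists above.
interface AnalyzeImageArgs {
  image_path: string;
  prompt?: string;
  detail?: 'low' | 'high';
}

type DescriptionStyle = 'detailed' | 'concise' | 'accessibility' | 'technical';

interface DescribeImageArgs {
  image_path: string;
  style?: DescriptionStyle;
}

interface ExtractTextArgs {
  image_path: string;
  language?: string;
}

// Hypothetical guard (not an Astreus API): narrows an arbitrary string to
// one of the documented description styles.
function isDescriptionStyle(value: string): value is DescriptionStyle {
  return ['detailed', 'concise', 'accessibility', 'technical'].indexOf(value) !== -1;
}

const describeArgs: DescribeImageArgs = { image_path: '/path/to/chart.png', style: 'accessibility' };
console.log(isDescriptionStyle('technical'), describeArgs);
```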

Response Types

Vision methods return string responses containing the analysis results.

Analyze Image Response

Image analysis returns a descriptive string based on your prompt:

const analysis = await agent.analyzeImage('/path/to/office.jpg', {
  prompt: "What objects are in this image and how is the space organized?",
  detail: "high"
});

// Response: string
"The image shows a modern office workspace with a MacBook Pro laptop, wireless keyboard, and mouse on a wooden desk. To the left is a coffee mug and a notebook. The desk is positioned near a window with natural lighting. The space features a minimalist organization with cable management and a small potted plant."

Describe Image Response

describeImage returns a formatted description string:

const description = await agent.describeImage('/path/to/product.jpg');

// Response: string
"A professional product photograph featuring a stainless steel water bottle with a matte black finish. The bottle has a wide mouth opening and is photographed against a white background with soft studio lighting creating subtle highlights along the curved surfaces."

Extract Text from Image Response

OCR returns the extracted text as a string:

const text = await agent.extractTextFromImage('/path/to/document.png', 'english');

// Response: string
"INVOICE\nDate: January 15, 2024\nInvoice #: INV-2024-001\n\nBill To:\nAcme Corporation\n123 Main Street\nNew York, NY 10001\n\nDescription          Quantity    Price    Total\nProfessional Services    8 hrs    $150    $1,200\nConsulting Fee           1        $500    $500\n\nSubtotal: $1,700\nTax (8%): $136\nTotal: $1,836"

Analyze Image from Base64 Response

Base64 image analysis also returns a string:

const base64Image = "data:image/png;base64,iVBORw0KG...";
const result = await agent.analyzeImageFromBase64(base64Image, {
  prompt: "Identify the main subject and mood of this image"
});

// Response: string
"The main subject is a sunset landscape with mountains in the background. The mood is serene and peaceful, with warm orange and pink tones dominating the sky. The composition creates a sense of tranquility and natural beauty."

Last updated: July 20, 2026
