Vision
Image analysis and document processing for multimodal interactions
Overview
The Vision system lets agents process and analyze images for multimodal interactions. It supports JPEG, PNG, GIF, BMP, and WebP images, provides three built-in capabilities (general image analysis, image description, and text extraction), and works with OpenAI, Claude, Gemini, and local Ollama providers.
Enabling Vision
Enable vision capabilities for an agent by setting the vision option to true:
import { Agent } from '@astreus-ai/astreus';
const agent = await Agent.create({
name: 'VisionAgent',
model: 'gpt-4o', // Vision-capable model
vision: true // Enable vision capabilities (default: false)
});
Attachment System
Astreus supports an intuitive attachment system for working with images:
// Clean, modern attachment API
const response = await agent.ask("What do you see in this image?", {
attachments: [
{ type: 'image', path: '/path/to/image.jpg', name: 'My Photo' }
]
});
The attachment system automatically:
- Detects the file type and selects appropriate tools
- Enhances the prompt with attachment information
- Enables tool usage when attachments are present
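When the images come from a list of file paths, a small helper can build the attachment objects. The sketch below assumes only the `{ type, path, name }` shape used in the example above; `toImageAttachments` and the `ImageAttachment` interface are hypothetical names for illustration, not part of Astreus.

```typescript
// Attachment shape mirrored from the example above (assumed, not imported from Astreus).
interface ImageAttachment {
  type: 'image';
  path: string;
  name: string;
}

// Hypothetical helper: wrap each path in an attachment object and
// use the last path segment as the display name.
function toImageAttachments(paths: string[]): ImageAttachment[] {
  return paths.map((path): ImageAttachment => {
    const segments = path.split(/[\\/]/);
    const fileName = segments[segments.length - 1] || path;
    return { type: 'image', path, name: fileName };
  });
}

const attachments = toImageAttachments(['/designs/mockup1.png', '/designs/mockup2.png']);
console.log(attachments);
```

The resulting array can be passed as the `attachments` option of `agent.ask`.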
Vision Capabilities
The vision system provides three core capabilities through built-in tools:
1. General Image Analysis
Analyze images with custom prompts and configurable detail levels:
// Using attachments (recommended approach)
const response = await agent.ask("Please analyze this screenshot and describe the UI elements", {
attachments: [
{ type: 'image', path: '/path/to/screenshot.png', name: 'UI Screenshot' }
]
});
// Using the analyze_image tool through conversation
const response2 = await agent.ask("Please analyze the image at /path/to/screenshot.png and describe the UI elements");
// Direct method call
const analysis = await agent.analyzeImage('/path/to/image.jpg', {
prompt: 'What UI elements are visible in this interface?',
detail: 'high',
maxTokens: 1500
});
2. Image Description
Generate structured descriptions for different use cases:
// Accessibility-friendly description
const description = await agent.describeImage('/path/to/image.jpg', 'accessibility');
// Available styles:
// - 'detailed': Comprehensive description of all visual elements
// - 'concise': Brief description of main elements
// - 'accessibility': Screen reader-friendly descriptions
// - 'technical': Technical analysis including composition and lighting
3. Text Extraction (OCR)
Extract and transcribe text from images:
// Extract text with language hint
const text = await agent.extractTextFromImage('/path/to/document.jpg', 'english');
// The system maintains original formatting and structure
console.log(text);
Supported Formats
The vision system supports these image formats:
- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- WebP (.webp)
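It can be useful to reject unsupported files before calling a vision method. The following is a minimal sketch that checks a path against the extensions listed above; `isSupportedImagePath` is a hypothetical pre-flight helper, not an Astreus API, and it inspects only the file name, not the file contents.

```typescript
// Extensions taken from the Supported Formats list above.
const SUPPORTED_IMAGE_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'];

// Hypothetical helper: true when the file name ends in a supported extension.
function isSupportedImagePath(filePath: string): boolean {
  const fileName = filePath.split(/[\\/]/).pop() ?? '';
  const dotIndex = fileName.lastIndexOf('.');
  if (dotIndex <= 0) return false; // no extension, or a dotfile such as ".png"
  const extension = fileName.slice(dotIndex).toLowerCase();
  return SUPPORTED_IMAGE_EXTENSIONS.includes(extension);
}

console.log(isSupportedImagePath('/path/to/photo.JPG'));  // true
console.log(isSupportedImagePath('/path/to/report.pdf')); // false
```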
Input Sources
File Paths
Analyze images from local file system:
const result = await agent.analyzeImage('/path/to/image.jpg');
Base64 Data
Analyze images from base64-encoded data:
const base64Image = 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQ...';
const result = await agent.analyzeImageFromBase64(base64Image);
Configuration
Vision Model Configuration
Specify the vision model directly in the agent configuration:
const agent = await Agent.create({
name: 'VisionAgent',
model: 'gpt-4o',
visionModel: 'gpt-4o', // Specify vision model here
vision: true
});
Environment Variables
# API keys (auto-detected based on model)
OPENAI_API_KEY=your_openai_key # For OpenAI models
OPENAI_VISION_API_KEY=your_openai_key # Dedicated vision API key (takes priority)
ANTHROPIC_API_KEY=your_anthropic_key # For Claude models
ANTHROPIC_VISION_API_KEY=your_anthropic_key # Dedicated vision API key (takes priority)
GEMINI_API_KEY=your_gemini_key # For Gemini models
GEMINI_VISION_API_KEY=your_gemini_key # Dedicated vision API key (takes priority)
# Ollama configuration (local)
OLLAMA_BASE_URL=http://localhost:11434 # Default if not set
The vision system automatically selects the appropriate provider based on the visionModel specified in the agent configuration.
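The priority rule in the comments above (a dedicated vision key wins over the general provider key, and Ollama falls back to a default URL) can be sketched as follows. This is an illustration of the documented behavior, not Astreus's internal code: `resolveVisionApiKey`, `resolveOllamaBaseUrl`, and the `Env` type are hypothetical, and treating an empty string as unset is an assumption.

```typescript
type CloudProvider = 'OPENAI' | 'ANTHROPIC' | 'GEMINI';
type Env = Record<string, string | undefined>;

// Hypothetical helper: the dedicated vision key takes priority,
// then the general provider key. Empty strings count as unset (assumption).
function resolveVisionApiKey(provider: CloudProvider, env: Env): string | undefined {
  const dedicated = env[`${provider}_VISION_API_KEY`];
  const general = env[`${provider}_API_KEY`];
  return dedicated || general || undefined;
}

// Hypothetical helper: use the documented default when OLLAMA_BASE_URL is not set.
function resolveOllamaBaseUrl(env: Env): string {
  return env['OLLAMA_BASE_URL'] || 'http://localhost:11434';
}

const exampleEnv: Env = {
  OPENAI_API_KEY: 'general-key',
  OPENAI_VISION_API_KEY: 'vision-key',
  ANTHROPIC_API_KEY: 'claude-key',
};

console.log(resolveVisionApiKey('OPENAI', exampleEnv));    // vision-key
console.log(resolveVisionApiKey('ANTHROPIC', exampleEnv)); // claude-key
console.log(resolveVisionApiKey('GEMINI', exampleEnv));    // undefined
console.log(resolveOllamaBaseUrl(exampleEnv));             // http://localhost:11434
```

In a real application the `env` argument would typically be `process.env`.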
Analysis Options
Configure analysis behavior with these options:
interface AnalysisOptions {
prompt?: string; // Custom analysis prompt
maxTokens?: number; // Response length limit (default: 1000)
detail?: 'low' | 'high' | 'auto'; // Analysis detail level (OpenAI only)
}
Usage Examples
Screenshot Analysis
const agent = await Agent.create({
name: 'UIAnalyzer',
model: 'gpt-4o',
vision: true
});
// Analyze a UI screenshot
const analysis = await agent.analyzeImage('/path/to/app-screenshot.png', {
prompt: 'Analyze this mobile app interface. Identify key UI components, layout structure, and potential usability issues.',
detail: 'high'
});
console.log(analysis);
Document Processing
// Extract text from scanned documents
const documentText = await agent.extractTextFromImage('/path/to/scanned-invoice.jpg', 'english');
// Generate accessible descriptions
const accessibleDesc = await agent.describeImage('/path/to/chart.png', 'accessibility');
Multimodal Conversations
// Using attachments for cleaner API
const response = await agent.ask("I'm getting an error. Can you analyze this screenshot and help me fix it?", {
attachments: [
{ type: 'image', path: '/Users/john/Desktop/error.png', name: 'Error Screenshot' }
]
});
// Multiple attachments
const response2 = await agent.ask("Compare these UI mockups and suggest improvements", {
attachments: [
{ type: 'image', path: '/designs/mockup1.png', name: 'Design A' },
{ type: 'image', path: '/designs/mockup2.png', name: 'Design B' }
]
});
// Traditional approach (still works)
const response3 = await agent.ask(
"Please analyze the error screenshot at /Users/john/Desktop/error.png and suggest how to fix the issue"
);
Provider Comparison
| Feature | OpenAI (gpt-4o) | Claude (claude-3-5-sonnet) | Gemini (gemini-1.5-pro) | Ollama (llava) |
|---|---|---|---|---|
| Analysis Quality | Excellent | Excellent | Excellent | Good |
| Processing Speed | Fast | Fast | Fast | Variable |
| Cost | Pay-per-use | Pay-per-use | Pay-per-use | Free (local) |
| Privacy | Cloud-based | Cloud-based | Cloud-based | Local processing |
| Detail Levels | Low/High/Auto | Standard | Standard | Standard |
| Language Support | Extensive | Extensive | Extensive | Good |
OpenAI Provider
- Best for: Production applications requiring high accuracy
- Default Model: gpt-4o
- Features: Detail level control, excellent text recognition
Claude Provider
- Best for: Nuanced analysis and detailed descriptions
- Default Model: claude-3-5-sonnet-20241022
- Features: Strong reasoning, excellent context understanding
Gemini Provider
- Best for: Multimodal tasks and document analysis
- Default Model: gemini-1.5-pro
- Features: Long context support, good for complex images
Ollama Provider (Local)
- Best for: Privacy-sensitive applications or development
- Default Model: llava
- Features: Local processing, no API costs, offline capability
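The default models listed above can be kept in a lookup table so that the `visionModel` value is chosen in one place. This is only a sketch: the provider keys (`openai`, `claude`, `gemini`, `ollama`) and the `pickVisionModel` helper are hypothetical names, and the rule "use Ollama whenever local processing is required" is an assumption drawn from the comparison table, not library behavior.

```typescript
type VisionProvider = 'openai' | 'claude' | 'gemini' | 'ollama';

// Default models as listed in the provider sections above.
const DEFAULT_VISION_MODELS: Record<VisionProvider, string> = {
  openai: 'gpt-4o',
  claude: 'claude-3-5-sonnet-20241022',
  gemini: 'gemini-1.5-pro',
  ollama: 'llava',
};

// Hypothetical helper: pick the local model when data must stay on the machine,
// otherwise the default model of the preferred cloud provider.
function pickVisionModel(
  requireLocalProcessing: boolean,
  preferred: Exclude<VisionProvider, 'ollama'> = 'openai'
): string {
  return requireLocalProcessing
    ? DEFAULT_VISION_MODELS.ollama
    : DEFAULT_VISION_MODELS[preferred];
}

console.log(pickVisionModel(true));            // llava
console.log(pickVisionModel(false, 'gemini')); // gemini-1.5-pro
```

The returned string can be used as the `visionModel` value in `Agent.create`.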
Batch Processing
Process multiple images efficiently:
const images = [
'/path/to/image1.jpg',
'/path/to/image2.png',
'/path/to/image3.gif'
];
// Process all images in parallel
const results = await Promise.all(
images.map(imagePath =>
agent.describeImage(imagePath, 'concise')
)
);
console.log('Analysis results:', results);
// Or use task attachments for batch processing
const batchTask = await agent.createTask({
prompt: 'Analyze all these images and provide a comparative report',
attachments: images.map(path => ({
type: 'image',
path,
name: path.split('/').pop()
}))
});
const batchResult = await agent.executeTask(batchTask.id);
Built-in Vision Tools
When vision is enabled, these tools are automatically available:
analyze_image
- Parameters:
  - image_path (string, required): Path to image file
  - prompt (string, optional): Custom analysis prompt
  - detail (string, optional): 'low' or 'high' detail level
describe_image
- Parameters:
  - image_path (string, required): Path to image file
  - style (string, optional): Description style ('detailed', 'concise', 'accessibility', 'technical')
extract_text_from_image
- Parameters:
  - image_path (string, required): Path to image file
  - language (string, optional): Language hint for better OCR accuracy
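For reference, the parameter lists above can be written as TypeScript types. The interface names and the `isDescribeImageParams` guard are hypothetical and exist only for illustration; Astreus may define its tool schemas differently.

```typescript
// Parameter shapes transcribed from the tool descriptions above.
interface AnalyzeImageParams {
  image_path: string;
  prompt?: string;
  detail?: 'low' | 'high';
}

interface DescribeImageParams {
  image_path: string;
  style?: 'detailed' | 'concise' | 'accessibility' | 'technical';
}

interface ExtractTextFromImageParams {
  image_path: string;
  language?: string;
}

const DESCRIPTION_STYLES: readonly string[] = ['detailed', 'concise', 'accessibility', 'technical'];

// Hypothetical runtime guard: checks that a value is a valid describe_image argument object.
function isDescribeImageParams(value: unknown): value is DescribeImageParams {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const imagePath = candidate['image_path'];
  if (typeof imagePath !== 'string' || imagePath === '') return false;
  const style = candidate['style'];
  return style === undefined || (typeof style === 'string' && DESCRIPTION_STYLES.includes(style));
}

console.log(isDescribeImageParams({ image_path: '/path/to/chart.png', style: 'accessibility' })); // true
console.log(isDescribeImageParams({ image_path: '/path/to/chart.png', style: 'poetic' }));        // false
```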
Response Types
Vision methods return string responses containing the analysis results.
Analyze Image Response
Image analysis returns a descriptive string based on your prompt:
const analysis = await agent.analyzeImage('/path/to/office.jpg', {
prompt: "What objects are in this image and how is the space organized?",
detail: "high"
});
// Response: string
"The image shows a modern office workspace with a MacBook Pro laptop, wireless keyboard, and mouse on a wooden desk. To the left is a coffee mug and a notebook. The desk is positioned near a window with natural lighting. The space features a minimalist organization with cable management and a small potted plant."
Describe Image Response
describeImage returns a formatted description string:
const description = await agent.describeImage('/path/to/product.jpg');
// Response: string
"A professional product photograph featuring a stainless steel water bottle with a matte black finish. The bottle has a wide mouth opening and is photographed against a white background with soft studio lighting creating subtle highlights along the curved surfaces."
Extract Text from Image Response
OCR returns the extracted text as a string:
// Language hint passed as a string, matching the earlier extractTextFromImage examples
const text = await agent.extractTextFromImage('/path/to/document.png', 'english');
// Response: string
"INVOICE\nDate: January 15, 2024\nInvoice #: INV-2024-001\n\nBill To:\nAcme Corporation\n123 Main Street\nNew York, NY 10001\n\nDescription Quantity Price Total\nProfessional Services 8 hrs $150 $1,200\nConsulting Fee 1 $500 $500\n\nSubtotal: $1,700\nTax (8%): $136\nTotal: $1,836"
Analyze Image from Base64 Response
Base64 image analysis also returns a string:
const base64Image = "data:image/png;base64,iVBORw0KG...";
const result = await agent.analyzeImageFromBase64(base64Image, {
prompt: "Identify the main subject and mood of this image"
});
// Response: string
"The main subject is a sunset landscape with mountains in the background. The mood is serene and peaceful, with warm orange and pink tones dominating the sky. The composition creates a sense of tranquility and natural beauty."
Last updated: May 26, 2026