Multimodal Perception: How an AI OS Sees, Hears, and Reads
A text-only agent is like a human who can only read — it misses the richness of images, sounds, documents, and the web. OctopusOS Multimodal Perception gives the kernel five senses: Image, Audio, Video, Document, and Link — each flowing through a governed, type-safe pipeline from upload to LLM reasoning.
1. Why Multimodal Matters for an AI OS
Traditional chat interfaces accept only text. But real-world tasks involve screenshots of errors, audio recordings of meetings, PDF contracts, video demonstrations, and web links. Without multimodal perception, users must manually describe visual content or copy-paste document text — losing context and wasting time.
OctopusOS solves this by building a full-stack multimodal pipeline that handles five modalities end-to-end: from browser file selection through server-side processing to LLM reasoning with native vision and audio capabilities.
2. Five Modalities at a Glance
3. The Upload Pipeline: 9 Stages
Every multimodal interaction follows the same governed pipeline, regardless of modality.
4. Four-Layer Architecture
5. Port Pattern: Protocol + Implementation + Fallback
OctopusOS follows a strict Port pattern for multimodal processing. The kernel defines Protocol interfaces (zero I/O), and the server layer provides real implementations with graceful fallback to stubs.
Bootstrap Wiring Logic
```python
# bootstrap.py — conditional port assembly
if llm_provider_port and llm_provider_port.is_available():
    vision_port = LLMVisionPort(llm_provider=llm_provider_port)
else:
    vision_port = StubVisionPort()

audio_port = (
    WhisperAudioPort.from_config(config)
    if config.get("OPENAI_API_KEY")
    else StubAudioPort()
)
```
This pattern ensures the kernel never crashes due to missing API keys — it simply degrades gracefully.
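To make the pattern concrete, here is a minimal sketch of what a kernel-side port and its stub might look like. The source confirms the names `StubVisionPort` and `LLMVisionPort`; the Protocol name `VisionPort` and the method name `analyze_image` are assumptions for illustration, not the actual OctopusOS API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class VisionPort(Protocol):
    """Kernel-side interface: declares the capability, performs zero I/O."""

    def analyze_image(self, image_path: str) -> str: ...


class StubVisionPort:
    """Fallback implementation used when no LLM provider is configured."""

    def analyze_image(self, image_path: str) -> str:
        # Degrade gracefully: acknowledge the file instead of crashing.
        return f"[Image received: {image_path} — vision analysis unavailable]"
```

Because the kernel depends only on the Protocol, the stub and the real implementation are interchangeable at bootstrap time.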
6. LLM Multimodal Content Blocks
The key innovation is transforming user messages from plain strings into OpenAI’s multimodal content block format.
Message Format Transformation
```python
# Text-only message (before multimodal)
{"role": "user", "content": "What is this?"}

# Multimodal message (after multimodal)
{"role": "user", "content": [
    {"type": "text", "text": "What is this?\n\n[Audio transcript]: Meeting about Q1 results..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}},
]}
```
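A helper that builds this structure from a message and its image attachments could be sketched as follows. The function name `build_content_blocks` and its signature are assumptions for illustration, not the actual OctopusOS code.

```python
import base64
from pathlib import Path
from typing import Any


def build_content_blocks(text: str, image_paths: list[str]) -> list[dict[str, Any]]:
    """Assemble OpenAI-style multimodal content blocks from text + images."""
    blocks: list[dict[str, Any]] = [{"type": "text", "text": text}]
    for path in image_paths:
        # Inline each image as a base64 data URL, as in the example above.
        b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        blocks.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return blocks
```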
7. Video Processing Pipeline
Video is the most complex modality, combining vision and audio analysis through a multi-stage pipeline.
Graceful Degradation
```python
if not _ffmpeg_available():
    return (
        f"[Video received: {video_path.name}, {size_kb:.0f} KB"
        " — ffmpeg not available for analysis]"
    )
```
8. Link Auto-Detection and Unfurling
Unlike other modalities that require explicit file upload, links are automatically detected in message text.
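A minimal sketch of such auto-detection, assuming a simple regex scan over the message body (the actual OctopusOS matcher may be stricter):

```python
import re

# Conservative URL pattern; stops at whitespace and common delimiters.
_URL_RE = re.compile(r"https?://[^\s<>\"]+")


def detect_links(message: str) -> list[str]:
    """Return all http(s) URLs found in free-form message text."""
    return _URL_RE.findall(message)
```

Each detected URL can then be handed to the link port for unfurling without the user ever uploading anything.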
9. Kernel Contracts
All multimodal data flows through frozen, immutable kernel contracts — ensuring type safety and auditability.
```python
from dataclasses import dataclass, field
from typing import Any, Literal

# ModalityType — expanded from 4 to 6
ModalityType = Literal["text", "image", "audio", "video", "document", "link"]


# New contracts for document and link modalities
@dataclass(frozen=True)
class DocumentAnalysisResult:
    analysis_id: str
    extracted_text: str
    page_count: int = 0
    doc_type: str = ""
    confidence: float = 0.0


@dataclass(frozen=True)
class LinkPreview:
    url: str
    title: str = ""
    description: str = ""
    extracted_text: str = ""
    content_type: str = ""


# ChatMessage — backward-compatible attachment extension
@dataclass(frozen=True)
class ChatMessage:
    message_id: str
    role: MessageRole
    content: str
    timestamp_ms: int
    metadata: dict[str, Any] = field(default_factory=dict)
    attachments: list[dict[str, Any]] = field(default_factory=list)  # NEW
```
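Because the contracts are frozen, any attempt to mutate them after construction raises an error. A quick demonstration with a trimmed-down `LinkPreview`:

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class LinkPreview:
    url: str
    title: str = ""


preview = LinkPreview(url="https://example.com", title="Example")
try:
    preview.title = "tampered"  # frozen dataclasses reject assignment
except FrozenInstanceError:
    print("mutation rejected")  # → mutation rejected
```

This is what makes the contracts safe to pass across layers: once constructed, a record cannot be silently altered, which keeps audit trails trustworthy.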
10. Security Architecture
Multimodal processing introduces new attack surfaces. OctopusOS addresses each one.
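For example, the SSRF protection on link unfurling can be sketched as a pre-fetch guard that resolves the hostname and rejects private, loopback, and otherwise non-routable addresses. The function name `is_safe_url` is an assumption for illustration, not the project's actual code.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """SSRF guard sketch: reject URLs that resolve to non-public addresses."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable hosts are treated as unsafe
    return not (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved)
```

Running the check before any HTTP request prevents a crafted link from steering the server toward internal services such as cloud metadata endpoints.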
11. Frontend Integration
The React frontend provides a seamless multimodal experience with three interaction patterns.
File Selection
A hidden `<input type="file">` is triggered by a paperclip button next to the message input. The `accept` attribute filters to supported MIME types. Multiple files can be selected simultaneously.
Drag and Drop
The chat area accepts file drops via `onDrop` / `onDragOver` handlers, providing a natural interaction for desktop users.
Attachment Preview
The `AttachmentPreview` component renders uploaded files by modality:
| Modality | Rendering | Compact Mode |
|---|---|---|
| Image | `<img>` with click-to-enlarge | 120×90 thumbnail |
| Audio | `<audio controls>` player | 180px width |
| Video | `<video controls>` player | 160×120 |
| Document | File icon + filename + size | Icon only |
| Link | URL card with title | Compact card |
12. Implementation Metrics
| Metric | Count |
|---|---|
| Supported MIME types | 17 |
| New frozen contracts | 2 (DocumentAnalysisResult, LinkPreview) |
| ModalityType literals | 6 (text, image, audio, video, document, link) |
| Port implementations | 5 (Vision, Audio, Video, Document, Link) |
| Pipeline stages | 9 |
| New files created | 6 |
| Files modified | 10 |
| New tests | 13 |
| Security controls | 4 layers (upload, SSRF, content limits, port isolation) |
| Gate violations introduced | 0 |
13. Design Philosophy
1. Every modality is a first-class citizen. Whether a user sends an image, an audio clip, or a PDF, the same pipeline processes it: upload, store, process, inject into LLM context, render in chat. No modality is a second-class afterthought.
2. Graceful degradation over hard failure. Missing OpenAI API key? Vision falls back to stub. No ffmpeg? Video returns file info. Link fetch fails? Error is logged, message still sent. The system never crashes due to missing dependencies.
3. Security at every boundary. Upload checks MIME types and paths. Link unfurling blocks SSRF. Content is size-limited. Kernel contracts are frozen and I/O-free. Each layer enforces its own security invariants independently.
4. Reuse kernel infrastructure. DocumentProcessor wraps the existing parse_file() domain. VideoProcessor reuses VisionPort and AudioPort. ModalityType extends the existing Literal. ChatMessage adds one field with a default — zero breaking changes.