Multimodal Perception: How an AI OS Sees, Hears, and Reads

A text-only agent is like a human who can only read — it misses the richness of images, sounds, documents, and the web. OctopusOS Multimodal Perception gives the kernel five senses: Image, Audio, Video, Document, and Link — each flowing through a governed, type-safe pipeline from upload to LLM reasoning.


1. Why Multimodal Matters for an AI OS

Traditional chat interfaces accept only text. But real-world tasks involve screenshots of errors, audio recordings of meetings, PDF contracts, video demonstrations, and web links. Without multimodal perception, users must manually describe visual content or copy-paste document text — losing context and wasting time.

OctopusOS solves this by building a full-stack multimodal pipeline that handles five modalities end-to-end: from browser file selection through server-side processing to LLM reasoning with native vision and audio capabilities.


2. Five Modalities at a Glance

Image
- PNG, JPEG, GIF, WebP formats
- Base64-encoded into OpenAI vision content blocks
- Auto model upgrade: gpt-4.1-nano → gpt-4.1-mini
- LLMVisionPort for standalone image analysis

Audio
- MP3, WAV, OGG, WebM audio formats
- OpenAI Whisper API transcription
- Transcribed text injected into LLM prompt
- WhisperAudioPort with graceful degradation

Video
- MP4, WebM, QuickTime formats
- ffmpeg key frame extraction (5s intervals, max 10)
- VisionPort frame analysis + AudioPort track transcription
- Graceful fallback when ffmpeg unavailable

Document
- PDF, DOCX, XLSX, PPTX, CSV, TXT formats
- Kernel domain parse_file() + OCR for scanned PDFs
- Extracted text (up to 5000 chars) injected into prompt
- DocumentProcessor wraps kernel parser pipeline

Link
- Auto-detected via regex in message text (max 3 URLs)
- httpx fetch with SSRF protection (blocks private IPs)
- OG tag extraction: title, description, content
- 50KB content limit, 5s timeout per URL

3. The Upload Pipeline: 9 Stages

Every multimodal interaction follows the same governed pipeline, regardless of modality.

Multimodal Upload Pipeline
1. File Select
User picks a file via paperclip button or drag-and-drop onto the chat area. Hidden <input type='file'> with MIME accept filter.
2. Upload
POST /api/upload as multipart/form-data. Server receives UploadFile + session_id. 25MB size limit enforced.
3. MIME Detection
Server checks the file against a MIME allowlist (17 types across 5 modalities). Unknown types are rejected with a 415 status.
4. Storage
File saved to workspace/uploads/{session_id}/{attachment_id}.{ext} with a .meta.json sidecar containing metadata.
5. Processing
Modality-specific: image→base64, audio→Whisper, video→ffmpeg+ports, document→parser, link→unfurl.
6. Content Block
Build the OpenAI multimodal message: image_url blocks for images, processed text appended for other modalities.
7. LLM Call
Auto model selection: gpt-4.1-mini when images are present (vision support), gpt-4.1-nano for text-only. Extended timeout for vision.
8. Response
The LLM generates a response with full multimodal context — it can describe images, answer questions about documents, and summarize audio.
9. Render
The AttachmentPreview component renders inline media: <img> for images, <audio>/<video> players, a file icon for documents.
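Stage 4 can be sketched as a small storage helper. Only the path layout (workspace/uploads/{session_id}/{attachment_id}.{ext} plus a .meta.json sidecar) comes from the pipeline above; the function name and metadata keys are assumptions for illustration.

```python
import json
from pathlib import Path

def store_upload(workspace: Path, session_id: str, attachment_id: str,
                 ext: str, data: bytes, meta: dict) -> Path:
    """Hypothetical sketch of pipeline stage 4: persist blob + sidecar."""
    target_dir = workspace / "uploads" / session_id
    target_dir.mkdir(parents=True, exist_ok=True)
    file_path = target_dir / f"{attachment_id}.{ext}"
    file_path.write_bytes(data)
    # Sidecar keeps metadata (e.g. MIME type, size, original name) next to the blob.
    sidecar = target_dir / f"{attachment_id}.{ext}.meta.json"
    sidecar.write_text(json.dumps(meta))
    return file_path
```

Keeping metadata in a sidecar rather than a database means an upload directory is self-describing and can be inspected or garbage-collected per session.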

4. Four-Layer Architecture

Multimodal Processing Layers
L4: LLM Reasoning
- OpenAI Chat Completions API
- Multimodal content blocks (text + image_url)
- Auto model selection (nano→mini)
- Extended timeout for vision (30s vs 10s)

L3: Port Processing
- LLMVisionPort — image analysis via GPT-4 Vision
- WhisperAudioPort — OpenAI Whisper transcription
- VideoProcessor — ffmpeg frames + port fusion
- DocumentProcessor — kernel parse_file() + OCR
- LinkUnfurler — httpx fetch + SSRF guard

L2: HTTP API
- POST /api/upload — multipart file upload
- GET /api/upload/{id}/content — file serving
- MessageIn.attachments — attachment references
- IntentIn.attachments — console chat chain

L1: Frontend
- React file input + drag-drop
- uploadFile() — FormData upload
- AttachmentPreview — modality-aware rendering
- Pending attachments strip with remove

5. Port Pattern: Protocol + Implementation + Fallback

OctopusOS follows a strict Port pattern for multimodal processing. The kernel defines Protocol interfaces (zero I/O), and the server layer provides real implementations with graceful fallback to stubs.

Port Wiring Architecture
- VisionPort Protocol: analyze_image(), describe_frame(), is_available()
- AudioPort Protocol: transcribe(), analyze_audio(), is_available()
- Bootstrap Wiring: API key present → real port, absent → stub
- LLMVisionPort: OpenAI GPT-4.1-mini Vision API
- WhisperAudioPort: OpenAI Whisper transcription
- Stub Ports: return 'unavailable' gracefully
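A minimal sketch of the kernel-side half of this pattern follows. The method names come from the port descriptions above; the signatures and the stub's return strings are assumptions.

```python
from typing import Protocol

class VisionPort(Protocol):
    """Kernel-side contract: a pure interface with zero I/O in the kernel layer.
    Method names follow the article; exact signatures are illustrative."""
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str: ...
    def describe_frame(self, frame_bytes: bytes) -> str: ...
    def is_available(self) -> bool: ...

class StubVisionPort:
    """Fallback wired in when no API key is configured: degrade, don't crash."""
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str:
        return "[vision unavailable]"
    def describe_frame(self, frame_bytes: bytes) -> str:
        return "[vision unavailable]"
    def is_available(self) -> bool:
        return False
```

Because Protocol uses structural typing, both LLMVisionPort and StubVisionPort satisfy the contract without importing anything from the server layer into the kernel.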

Bootstrap Wiring Logic

# bootstrap.py — conditional port assembly
if llm_provider_port and llm_provider_port.is_available():
    vision_port = LLMVisionPort(llm_provider=llm_provider_port)
else:
    vision_port = StubVisionPort()

audio_port = WhisperAudioPort.from_config(config) \
    if config.get("OPENAI_API_KEY") else StubAudioPort()

This pattern ensures the kernel never crashes due to missing API keys — it simply degrades gracefully.


6. LLM Multimodal Content Blocks

The key innovation is transforming user messages from plain strings into OpenAI’s multimodal content block format.

Content Block Construction
Text Only → (image attached) → With Images → (+ audio/doc) → Mixed Modal → (send) → LLM Call

Message Format Transformation

# Text-only message (before multimodal)
{"role": "user", "content": "What is this?"}

# Multimodal message (after multimodal)
{"role": "user", "content": [
    {"type": "text", "text": "What is this?\n\n[Audio transcript]: Meeting about Q1 results..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]}

7. Video Processing Pipeline

Video is the most complex modality, combining vision and audio analysis through a multi-stage pipeline.

Video Processing Pipeline
1. Input
Video file (MP4, WebM, QuickTime) stored on disk after upload
2. Key Frame Extraction
ffmpeg -vf fps=1/5 — extract 1 frame every 5 seconds, max 10 frames as PNG
3. Frame Analysis
Each PNG frame sent to VisionPort.describe_frame() → frame description text
4. Audio Extraction
ffmpeg -vn -acodec pcm_s16le -ar 16000 — extract mono WAV audio track
5. Audio Transcription
WAV bytes sent to AudioPort.transcribe() → full transcript text
6. Fusion
Frame descriptions + audio transcript joined as processed_text for LLM context
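Stage 2 can be sketched as a command builder. The -vf fps=1/5 recipe and the 10-frame cap come from the pipeline above; the wrapper name and the output filename pattern are assumptions.

```python
def frame_extract_cmd(video_path: str, out_dir: str,
                      interval_s: int = 5, max_frames: int = 10) -> list[str]:
    """Hypothetical sketch: build the ffmpeg key-frame extraction command."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval_s}",   # one frame per interval (fps=1/5 → every 5s)
        "-frames:v", str(max_frames),   # hard cap on the number of extracted frames
        f"{out_dir}/frame_%03d.png",    # frame_001.png, frame_002.png, ...
    ]
```

Building the command as a list (rather than a shell string) and passing it to subprocess.run with a timeout keeps filenames safe from shell interpretation and enforces the 60s ffmpeg budget mentioned in the security section.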

Graceful Degradation

if not _ffmpeg_available():
    return (
        f"[Video received: {video_path.name}, {size_kb:.0f} KB"
        " — ffmpeg not available for analysis]"
    )

8. Link Unfurling

Unlike the other modalities, which require an explicit file upload, links are automatically detected in message text.

Link Processing Pipeline
1. URL Detection
Regex https?://\S+ scans message text, max 3 URLs per message, trailing punctuation stripped
2. SSRF Protection
validate_url_no_ssrf() blocks private IPs, loopback, link-local, metadata endpoints (169.254.169.254)
3. Fetch
httpx.get() with 5s timeout, follow_redirects=True, 50KB content limit, custom User-Agent
4. Metadata Extraction
Parse <title>, <meta og:title>, <meta og:description> from HTML, strip tags for plain text
5. LinkPreview
Frozen kernel contract: url, title, description, extracted_text, content_type, metadata
6. Injection
LinkPreview content appended as virtual attachment with modality='link' for LLM context
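The SSRF guard in stage 2 can be sketched with the standard library. This is illustrative only: a production validate_url_no_ssrf() must also resolve hostnames to IPs and re-check every redirect hop, which this sketch skips.

```python
import ipaddress
from urllib.parse import urlparse

def url_passes_ssrf_check(url: str) -> bool:
    """Hypothetical sketch: scheme allowlist plus IP-literal range checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # only http:// and https:// schemes allowed
    host = parsed.hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; real code resolves DNS and re-checks
    # Reject private ranges, loopback, and link-local — the link-local check
    # covers the 169.254.169.254 cloud metadata endpoint.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

The DNS-resolution step matters: a hostname like an attacker-controlled domain can resolve to 127.0.0.1, so checking only literal IPs is not enough on its own.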

9. Kernel Contracts

All multimodal data flows through frozen, immutable kernel contracts — ensuring type safety and auditability.

from dataclasses import dataclass, field
from typing import Any, Literal

# ModalityType — expanded from 4 to 6
ModalityType = Literal["text", "image", "audio", "video", "document", "link"]

# New contracts for document and link modalities
@dataclass(frozen=True)
class DocumentAnalysisResult:
    analysis_id: str
    extracted_text: str
    page_count: int = 0
    doc_type: str = ""
    confidence: float = 0.0

@dataclass(frozen=True)
class LinkPreview:
    url: str
    title: str = ""
    description: str = ""
    extracted_text: str = ""
    content_type: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

# ChatMessage — backward-compatible attachment extension
@dataclass(frozen=True)
class ChatMessage:
    message_id: str
    role: MessageRole
    content: str
    timestamp_ms: int
    metadata: dict[str, Any] = field(default_factory=dict)
    attachments: list[dict[str, Any]] = field(default_factory=list)  # NEW
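To see why frozen matters for auditability, here is a minimal standalone demo. LinkPreview is re-declared locally (with only two of its fields) so the snippet runs on its own; it is not the kernel's definition.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class LinkPreview:  # local, trimmed-down re-declaration for the demo
    url: str
    title: str = ""

preview = LinkPreview(url="https://example.com", title="Example")
try:
    preview.title = "mutated"          # frozen contracts reject mutation
except FrozenInstanceError:
    pass  # any change after construction raises, so audit logs stay truthful
```

Once a contract instance is logged or handed across a layer boundary, no later code can silently alter it; a "changed" value must be a new object, which keeps the audit trail append-only.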

10. Security Architecture

Multimodal processing introduces new attack surfaces. OctopusOS addresses each one.

Upload Security
- MIME allowlist — only 17 known types accepted
- 25MB size limit (configurable via UPLOAD_MAX_SIZE_MB)
- Filename sanitization — strip ../, limit 255 chars
- Path traversal prevention via validate_path_within_workspace()

SSRF Protection
- Blocks 169.254.169.254 (AWS/GCP metadata)
- Blocks private IPs (10.x, 172.16-31.x, 192.168.x)
- Blocks loopback (127.0.0.1) and link-local addresses
- Only http:// and https:// schemes allowed

Content Limits
- Link fetch: 50KB body limit, 5s timeout
- Document text: truncated to 5000 chars
- Link text: truncated to 2000 chars
- Video: max 10 frames, 60s ffmpeg timeout

Port Isolation
- Kernel contracts: frozen dataclasses, zero I/O
- Port protocols: defined in kernel, implemented in server
- Gate checks: banned words prevent I/O in kernel layer
- Stub fallback: no crash when API keys missing
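The filename-sanitization bullet can be sketched as a small helper. The stripping rules (drop directory parts, remove .., cap at 255 chars) follow the list above; the function name and the conservative character set are assumptions.

```python
import re

def sanitize_filename(name: str, max_len: int = 255) -> str:
    """Hypothetical sketch of the upload-side filename sanitizer."""
    name = name.replace("\\", "/").split("/")[-1]  # drop any directory component
    name = name.replace("..", "")                  # no traversal sequences
    name = re.sub(r"[^\w.\- ]", "_", name)         # conservative character set
    return name[:max_len] or "upload"              # never return an empty name
```

This is defense in depth, not the whole story: even a sanitized name is still joined to the workspace path and re-checked with validate_path_within_workspace() before any write.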

11. Frontend Integration

The React frontend provides a seamless multimodal experience with three interaction patterns.

File Selection

A hidden <input type="file"> is triggered by a paperclip button next to the message input. The accept attribute filters to supported MIME types. Multiple files can be selected simultaneously.

Drag and Drop

The chat area accepts file drops via onDrop / onDragOver handlers, providing a natural interaction for desktop users.

Attachment Preview

The AttachmentPreview component renders uploaded files by modality:

Modality | Rendering | Compact Mode
Image | <img> with click-to-enlarge | 120x90 thumbnail
Audio | <audio controls> player | 180px width
Video | <video controls> player | 160x120
Document | File icon + filename + size | Icon only
Link | URL card with title | Compact card

12. Implementation Metrics

Metric | Count
Supported MIME types | 17
New frozen contracts | 2 (DocumentAnalysisResult, LinkPreview)
ModalityType literals | 6 (text, image, audio, video, document, link)
Port implementations | 5 (Vision, Audio, Video, Document, Link)
Pipeline stages | 9
New files created | 6
Files modified | 10
New tests | 13
Security controls | 4 layers (upload, SSRF, content limits, port isolation)
Gate violations introduced | 0

13. Design Philosophy

1. Every modality is a first-class citizen. Whether a user sends an image, an audio clip, or a PDF, the same pipeline processes it: upload, store, process, inject into LLM context, render in chat. No modality is a second-class afterthought.

2. Graceful degradation over hard failure. Missing OpenAI API key? Vision falls back to stub. No ffmpeg? Video returns file info. Link fetch fails? Error is logged, message still sent. The system never crashes due to missing dependencies.

3. Security at every boundary. Upload checks MIME types and paths. Link unfurling blocks SSRF. Content is size-limited. Kernel contracts are frozen and I/O-free. Each layer enforces its own security invariants independently.

4. Reuse kernel infrastructure. DocumentProcessor wraps the existing parse_file() domain. VideoProcessor reuses VisionPort and AudioPort. ModalityType extends the existing Literal. ChatMessage adds one field with a default — zero breaking changes.
