Multimodal Perception: How an AI OS Sees, Hears, and Reads

A text-only agent is like a human who can only read — it misses the richness of images, sounds, documents, and the web. OctopusOS Multimodal Perception gives the kernel five senses: Image, Audio, Video, Document, and Link — each flowing through a governed, type-safe pipeline from upload to LLM reasoning.


1. Why Multimodal Matters for an AI OS

Traditional chat interfaces accept only text. But real-world tasks involve screenshots of errors, audio recordings of meetings, PDF contracts, video demonstrations, and web links. Without multimodal perception, users must manually describe visual content or copy-paste document text — losing context and wasting time.

OctopusOS solves this by building a full-stack multimodal pipeline that handles five modalities end-to-end: from browser file selection through server-side processing to LLM reasoning with native vision and audio capabilities.


2. Five Modalities at a Glance

Image
- PNG, JPEG, GIF, WebP formats
- Base64-encoded into OpenAI vision content blocks
- Auto model upgrade: gpt-4.1-nano → gpt-4.1-mini
- LLMVisionPort for standalone image analysis

Audio
- MP3, WAV, OGG, WebM audio formats
- OpenAI Whisper API transcription
- Transcribed text injected into LLM prompt
- WhisperAudioPort with graceful degradation

Video
- MP4, WebM, QuickTime formats
- ffmpeg key frame extraction (5s intervals, max 10)
- VisionPort frame analysis + AudioPort track transcription
- Graceful fallback when ffmpeg unavailable

Document
- PDF, DOCX, XLSX, PPTX, CSV, TXT formats
- Kernel domain parse_file() + OCR for scanned PDFs
- Extracted text (up to 5000 chars) injected into prompt
- DocumentProcessor wraps kernel parser pipeline

Link
- Auto-detected via regex in message text (max 3 URLs)
- httpx fetch with SSRF protection (blocks private IPs)
- OG tag extraction: title, description, content
- 50KB content limit, 5s timeout per URL

3. The Upload Pipeline: 9 Stages

Every multimodal interaction follows the same governed pipeline, regardless of modality.

Multimodal Upload Pipeline
1. File Select
User picks a file via paperclip button or drag-and-drop onto the chat area. Hidden <input type='file'> with MIME accept filter.
2. Upload
POST /api/upload as multipart/form-data. Server receives UploadFile + session_id. 25MB size limit enforced.
3. MIME Detection
Server checks the file against a MIME allowlist (17 types across 5 modalities). Unknown types are rejected with a 415 status.
4. Storage
File saved to workspace/uploads/{session_id}/{attachment_id}.{ext} with a .meta.json sidecar containing metadata.
5. Processing
Modality-specific: image→base64, audio→Whisper, video→ffmpeg+ports, document→parser, link→unfurl.
6. Content Block
Build the OpenAI multimodal message: image_url blocks for images, processed text appended for other modalities.
7. LLM Call
Auto model selection: gpt-4.1-mini when images are present (vision support), gpt-4.1-nano for text-only. Extended timeout for vision.
8. Response
The LLM generates a response with full multimodal context — it can describe images, answer questions about documents, and summarize audio.
9. Render
The AttachmentPreview component renders inline media: <img> for images, <audio>/<video> players, a file icon for documents.
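Stage 4 can be sketched as a small storage helper. Only the path layout (workspace/uploads/{session_id}/{attachment_id}.{ext} plus a .meta.json sidecar) comes from the pipeline above; the function name and metadata keys are assumptions for illustration.

```python
import json
from pathlib import Path

def store_upload(workspace: Path, session_id: str, attachment_id: str,
                 ext: str, data: bytes, meta: dict) -> Path:
    """Hypothetical sketch of pipeline stage 4: persist blob + sidecar."""
    target_dir = workspace / "uploads" / session_id
    target_dir.mkdir(parents=True, exist_ok=True)
    file_path = target_dir / f"{attachment_id}.{ext}"
    file_path.write_bytes(data)
    # Sidecar keeps metadata (e.g. MIME type, size, original name) next to the blob.
    sidecar = target_dir / f"{attachment_id}.{ext}.meta.json"
    sidecar.write_text(json.dumps(meta))
    return file_path
```

Keeping metadata in a sidecar rather than a database means an upload directory is self-describing and can be inspected or garbage-collected per session.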

4. Four-Layer Architecture

Multimodal Processing Layers
L4: LLM Reasoning
- OpenAI Chat Completions API
- Multimodal content blocks (text + image_url)
- Auto model selection (nano→mini)
- Extended timeout for vision (30s vs 10s)

L3: Port Processing
- LLMVisionPort — image analysis via GPT-4 Vision
- WhisperAudioPort — OpenAI Whisper transcription
- VideoProcessor — ffmpeg frames + port fusion
- DocumentProcessor — kernel parse_file() + OCR
- LinkUnfurler — httpx fetch + SSRF guard

L2: HTTP API
- POST /api/upload — multipart file upload
- GET /api/upload/{id}/content — file serving
- MessageIn.attachments — attachment references
- IntentIn.attachments — console chat chain

L1: Frontend
- React file input + drag-drop
- uploadFile() — FormData upload
- AttachmentPreview — modality-aware rendering
- Pending attachments strip with remove

5. Port Pattern: Protocol + Implementation + Fallback

OctopusOS follows a strict Port pattern for multimodal processing. The kernel defines Protocol interfaces (zero I/O), and the server layer provides real implementations with graceful fallback to stubs.

Port Wiring Architecture
- VisionPort Protocol: analyze_image(), describe_frame(), is_available()
- AudioPort Protocol: transcribe(), analyze_audio(), is_available()
- Bootstrap Wiring: API key present → real port, absent → stub
- LLMVisionPort: OpenAI GPT-4.1-mini Vision API
- WhisperAudioPort: OpenAI Whisper transcription
- Stub Ports: return 'unavailable' gracefully
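A minimal sketch of the kernel-side half of this pattern follows. The method names come from the port descriptions above; the signatures and the stub's return strings are assumptions.

```python
from typing import Protocol

class VisionPort(Protocol):
    """Kernel-side contract: a pure interface with zero I/O in the kernel layer.
    Method names follow the article; exact signatures are illustrative."""
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str: ...
    def describe_frame(self, frame_bytes: bytes) -> str: ...
    def is_available(self) -> bool: ...

class StubVisionPort:
    """Fallback wired in when no API key is configured: degrade, don't crash."""
    def analyze_image(self, image_bytes: bytes, prompt: str) -> str:
        return "[vision unavailable]"
    def describe_frame(self, frame_bytes: bytes) -> str:
        return "[vision unavailable]"
    def is_available(self) -> bool:
        return False
```

Because Protocol uses structural typing, both LLMVisionPort and StubVisionPort satisfy the contract without importing anything from the server layer into the kernel.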

Bootstrap Wiring Logic

# bootstrap.py — conditional port assembly
if llm_provider_port and llm_provider_port.is_available():
    vision_port = LLMVisionPort(llm_provider=llm_provider_port)
else:
    vision_port = StubVisionPort()

audio_port = WhisperAudioPort.from_config(config) \
    if config.get("OPENAI_API_KEY") else StubAudioPort()

This pattern ensures the kernel never crashes due to missing API keys — it simply degrades gracefully.


6. LLM Multimodal Content Blocks

The key innovation is transforming user messages from plain strings into OpenAI’s multimodal content block format.

Content Block Construction
Text Only → (image attached) → With Images → (+ audio/doc) → Mixed Modal → (send) → LLM Call

Message Format Transformation

# Text-only message (before multimodal)
{"role": "user", "content": "What is this?"}

# Multimodal message (after multimodal)
{"role": "user", "content": [
    {"type": "text", "text": "What is this?\n\n[Audio transcript]: Meeting about Q1 results..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
]}

7. Video Processing Pipeline

Video is the most complex modality, combining vision and audio analysis through a multi-stage pipeline.

Video Processing Pipeline
1. Input
Video file (MP4, WebM, QuickTime) stored on disk after upload
2. Key Frame Extraction
ffmpeg -vf fps=1/5 — extract 1 frame every 5 seconds, max 10 frames as PNG
3. Frame Analysis
Each PNG frame sent to VisionPort.describe_frame() → frame description text
4. Audio Extraction
ffmpeg -vn -acodec pcm_s16le -ar 16000 — extract mono WAV audio track
5. Audio Transcription
WAV bytes sent to AudioPort.transcribe() → full transcript text
6. Fusion
Frame descriptions + audio transcript joined as processed_text for LLM context
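Stage 2 can be sketched as a command builder. The -vf fps=1/5 recipe and the 10-frame cap come from the pipeline above; the wrapper name and the output filename pattern are assumptions.

```python
def frame_extract_cmd(video_path: str, out_dir: str,
                      interval_s: int = 5, max_frames: int = 10) -> list[str]:
    """Hypothetical sketch: build the ffmpeg key-frame extraction command."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval_s}",   # one frame per interval (fps=1/5 → every 5s)
        "-frames:v", str(max_frames),   # hard cap on the number of extracted frames
        f"{out_dir}/frame_%03d.png",    # frame_001.png, frame_002.png, ...
    ]
```

Building the command as a list (rather than a shell string) and passing it to subprocess.run with a timeout keeps filenames safe from shell interpretation and enforces the 60s ffmpeg budget mentioned in the security section.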

Graceful Degradation

if not _ffmpeg_available():
    return (
        f"[Video received: {video_path.name}, {size_kb:.0f} KB"
        " — ffmpeg not available for analysis]"
    )

8. Link Unfurling

Unlike the other modalities, which require an explicit file upload, links are automatically detected in message text.

Link Processing Pipeline
1. URL Detection
Regex https?://\S+ scans message text, max 3 URLs per message, trailing punctuation stripped
2. SSRF Protection
validate_url_no_ssrf() blocks private IPs, loopback, link-local, metadata endpoints (169.254.169.254)
3. Fetch
httpx.get() with 5s timeout, follow_redirects=True, 50KB content limit, custom User-Agent
4. Metadata Extraction
Parse <title>, <meta og:title>, <meta og:description> from HTML, strip tags for plain text
5. LinkPreview
Frozen kernel contract: url, title, description, extracted_text, content_type, metadata
6. Injection
LinkPreview content appended as virtual attachment with modality='link' for LLM context
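The SSRF guard in stage 2 can be sketched with the standard library. This is illustrative only: a production validate_url_no_ssrf() must also resolve hostnames to IPs and re-check every redirect hop, which this sketch skips.

```python
import ipaddress
from urllib.parse import urlparse

def url_passes_ssrf_check(url: str) -> bool:
    """Hypothetical sketch: scheme allowlist plus IP-literal range checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # only http:// and https:// schemes allowed
    host = parsed.hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal; real code resolves DNS and re-checks
    # Reject private ranges, loopback, and link-local — the link-local check
    # covers the 169.254.169.254 cloud metadata endpoint.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

The DNS-resolution step matters: a hostname like an attacker-controlled domain can resolve to 127.0.0.1, so checking only literal IPs is not enough on its own.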

9. Kernel Contracts

All multimodal data flows through frozen, immutable kernel contracts — ensuring type safety and auditability.

from dataclasses import dataclass, field
from typing import Any, Literal

# ModalityType — expanded from 4 to 6
ModalityType = Literal["text", "image", "audio", "video", "document", "link"]

# New contracts for document and link modalities
@dataclass(frozen=True)
class DocumentAnalysisResult:
    analysis_id: str
    extracted_text: str
    page_count: int = 0
    doc_type: str = ""
    confidence: float = 0.0

@dataclass(frozen=True)
class LinkPreview:
    url: str
    title: str = ""
    description: str = ""
    extracted_text: str = ""
    content_type: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

# ChatMessage — backward-compatible attachment extension
@dataclass(frozen=True)
class ChatMessage:
    message_id: str
    role: MessageRole
    content: str
    timestamp_ms: int
    metadata: dict[str, Any] = field(default_factory=dict)
    attachments: list[dict[str, Any]] = field(default_factory=list)  # NEW
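To see why frozen matters for auditability, here is a minimal standalone demo. LinkPreview is re-declared locally (with only two of its fields) so the snippet runs on its own; it is not the kernel's definition.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class LinkPreview:  # local, trimmed-down re-declaration for the demo
    url: str
    title: str = ""

preview = LinkPreview(url="https://example.com", title="Example")
try:
    preview.title = "mutated"          # frozen contracts reject mutation
except FrozenInstanceError:
    pass  # any change after construction raises, so audit logs stay truthful
```

Once a contract instance is logged or handed across a layer boundary, no later code can silently alter it; a "changed" value must be a new object, which keeps the audit trail append-only.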

10. Security Architecture

Multimodal processing introduces new attack surfaces. OctopusOS addresses each one.

Upload Security
- MIME allowlist — only 17 known types accepted
- 25MB size limit (configurable via UPLOAD_MAX_SIZE_MB)
- Filename sanitization — strip ../, limit 255 chars
- Path traversal prevention via validate_path_within_workspace()

SSRF Protection
- Blocks 169.254.169.254 (AWS/GCP metadata)
- Blocks private IPs (10.x, 172.16-31.x, 192.168.x)
- Blocks loopback (127.0.0.1) and link-local addresses
- Only http:// and https:// schemes allowed

Content Limits
- Link fetch: 50KB body limit, 5s timeout
- Document text: truncated to 5000 chars
- Link text: truncated to 2000 chars
- Video: max 10 frames, 60s ffmpeg timeout

Port Isolation
- Kernel contracts: frozen dataclasses, zero I/O
- Port protocols: defined in kernel, implemented in server
- Gate checks: banned words prevent I/O in kernel layer
- Stub fallback: no crash when API keys missing
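The filename-sanitization bullet can be sketched as a small helper. The stripping rules (drop directory parts, remove .., cap at 255 chars) follow the list above; the function name and the conservative character set are assumptions.

```python
import re

def sanitize_filename(name: str, max_len: int = 255) -> str:
    """Hypothetical sketch of the upload-side filename sanitizer."""
    name = name.replace("\\", "/").split("/")[-1]  # drop any directory component
    name = name.replace("..", "")                  # no traversal sequences
    name = re.sub(r"[^\w.\- ]", "_", name)         # conservative character set
    return name[:max_len] or "upload"              # never return an empty name
```

This is defense in depth, not the whole story: even a sanitized name is still joined to the workspace path and re-checked with validate_path_within_workspace() before any write.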

11. Frontend Integration

The React frontend provides a seamless multimodal experience with three interaction patterns.

File Selection

A hidden <input type="file"> is triggered by a paperclip button next to the message input. The accept attribute filters to supported MIME types. Multiple files can be selected simultaneously.

Drag and Drop

The chat area accepts file drops via onDrop / onDragOver handlers, providing a natural interaction for desktop users.

Attachment Preview

The AttachmentPreview component renders uploaded files by modality:

Modality | Rendering | Compact Mode
Image | <img> with click-to-enlarge | 120x90 thumbnail
Audio | <audio controls> player | 180px width
Video | <video controls> player | 160x120
Document | File icon + filename + size | Icon only
Link | URL card with title | Compact card

12. Implementation Metrics

Metric | Count
Supported MIME types | 17
New frozen contracts | 2 (DocumentAnalysisResult, LinkPreview)
ModalityType literals | 6 (text, image, audio, video, document, link)
Port implementations | 5 (Vision, Audio, Video, Document, Link)
Pipeline stages | 9
New files created | 6
Files modified | 10
New tests | 13
Security controls | 4 layers (upload, SSRF, content limits, port isolation)
Gate violations introduced | 0

13. Design Philosophy

1. Every modality is a first-class citizen. Whether a user sends an image, an audio clip, or a PDF, the same pipeline processes it: upload, store, process, inject into LLM context, render in chat. No modality is a second-class afterthought.

2. Graceful degradation over hard failure. Missing OpenAI API key? Vision falls back to stub. No ffmpeg? Video returns file info. Link fetch fails? Error is logged, message still sent. The system never crashes due to missing dependencies.

3. Security at every boundary. Upload checks MIME types and paths. Link unfurling blocks SSRF. Content is size-limited. Kernel contracts are frozen and I/O-free. Each layer enforces its own security invariants independently.

4. Reuse kernel infrastructure. DocumentProcessor wraps the existing parse_file() domain. VideoProcessor reuses VisionPort and AudioPort. ModalityType extends the existing Literal. ChatMessage adds one field with a default — zero breaking changes.
