Voice Intelligence: The Conversational State Runtime
When a customer says “everything is fine” through clenched teeth, text analysis sees calm while their voice screams agitation. Voice Intelligence bridges this gap — fusing text and voice emotion signals into a unified session state machine that detects hidden frustration, tracks escalation risk, and recommends intervention policies in real time.
1. Why Cross-Modal Emotion Matters
Traditional contact center analytics operate in a single modality: either text sentiment or voice tone, never both simultaneously. This creates dangerous blind spots:
- A customer typing “no problem” while their voice pitch rises 40% — text says calm, voice says stress
- An escalation trend visible only when correlating text frustration with voice agitation over time
- A “hidden frustration” state that neither modality alone can detect, but their divergence reveals
Voice Intelligence solves this by treating text and voice as two parallel signal channels, correlating them through a timeline alignment engine, and fusing the results into a 7-state session state machine with automatic policy recommendations.
2. Architecture Overview
The Voice Intelligence runtime is built across five layers, following the OctopusOS three-tier architecture (kernel contracts, kernel domains, shared layer).
Layer placement rules:
- L1 (Ingest) lives in `server/shared/ports_impl/` — WebSocket handlers, audio stream adapters
- L2 (Emotion) lives in `kernel/domains/voice_intelligence/` — pure scoring functions, zero I/O
- L3 (Correlation) lives in `kernel/domains/signal_correlation/` — pure timeline alignment and drift detection
- L4 (State) lives in `kernel/domains/session_state/` — pure state machine, policy lookup, explainability
- L5 (Surface) lives in `server/shared/` — HTTP routes, dashboard, evidence export
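The layering rule above can be sketched in a few lines. Module names and the scoring heuristic here are hypothetical; the point is the direction of the dependency: kernel domains expose pure functions with zero I/O, and the shared layer adapts raw transport input to them.

```python
# kernel/domains/voice_intelligence/ style (L2): pure, zero I/O.
def score_frustration(features: dict) -> float:
    """Pure function: same input, same output, no network or disk access."""
    return min(1.0, features.get("exclamations", 0) * 0.2
                    + features.get("negations", 0) * 0.1)

# server/shared/ports_impl/ style (L1): I/O and parsing live here, not in the kernel.
def handle_message(raw: str) -> float:
    features = {"exclamations": raw.count("!"),
                "negations": raw.lower().count("not")}
    return score_frustration(features)  # kernel call: pure, testable in isolation
```

Because the L2 function never touches I/O, it can be exercised in kernel tests without spinning up a WebSocket handler.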
3. Text Emotion Scoring
The text emotion scorer analyzes chat messages through a multi-feature pipeline.
Output: TextEmotionScores — frustration, confusion, urgency, and calm scores, plus dominant_emotion and confidence.
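A minimal sketch of the TextEmotionScores output shape. The field names follow the text above; the dataclass layout and the margin-based confidence definition are assumptions, not the frozen contract.

```python
from dataclasses import dataclass, field

@dataclass
class TextEmotionScores:
    frustration: float
    confusion: float
    urgency: float
    calm: float
    dominant_emotion: str = field(init=False)
    confidence: float = field(init=False)

    def __post_init__(self) -> None:
        scores = {"frustration": self.frustration, "confusion": self.confusion,
                  "urgency": self.urgency, "calm": self.calm}
        self.dominant_emotion = max(scores, key=scores.get)
        # Confidence as the margin between the top score and the runner-up
        # (one plausible definition, not necessarily the runtime's).
        ranked = sorted(scores.values(), reverse=True)
        self.confidence = ranked[0] - ranked[1]

scores = TextEmotionScores(frustration=0.7, confusion=0.2, urgency=0.4, calm=0.1)
```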
4. Voice Emotion Scoring
Voice emotion analysis works on acoustic features extracted from audio frames.
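As an illustration, agitation can be scored from per-frame features such as pitch and energy. The feature names (`pitch_hz`, `energy`) and the weights are assumptions for this sketch; the real extractor and calibration live in the voice_intelligence domain.

```python
from statistics import mean

def score_agitation(frames: list[dict]) -> float:
    """Rising pitch plus high energy reads as agitation; returns 0.0-1.0."""
    if len(frames) < 2:
        return 0.0
    pitches = [f["pitch_hz"] for f in frames]
    baseline = pitches[0]
    # Relative pitch rise across the window, capped at 1.0.
    pitch_rise = max(0.0, (pitches[-1] - baseline) / baseline)
    avg_energy = mean(f["energy"] for f in frames)  # assume energy in [0, 1]
    return min(1.0, 0.6 * min(pitch_rise, 1.0) + 0.4 * avg_energy)

calm_frames = [{"pitch_hz": 120, "energy": 0.2}, {"pitch_hz": 121, "energy": 0.2}]
tense_frames = [{"pitch_hz": 120, "energy": 0.8}, {"pitch_hz": 168, "energy": 0.9}]  # 40% pitch rise
```

The 40% pitch rise in `tense_frames` mirrors the example from section 1: the same customer typing "no problem" would score near zero on the text channel.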
5. Cross-Channel Drift Correlation
The correlation engine aligns text and voice emotion timelines and detects five drift states.
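One way to picture the alignment step: pair each text sample with the nearest-in-time voice sample and measure drift as the score gap. The sample shape and the divergence threshold below are illustrative assumptions, not the signal_correlation contracts.

```python
def align_and_drift(text_samples, voice_samples, threshold=0.3):
    """Each sample is (timestamp_s, agitation_score in [0, 1])."""
    drifts = []
    for t_ts, t_score in text_samples:
        # Nearest voice sample by timestamp.
        v_ts, v_score = min(voice_samples, key=lambda v: abs(v[0] - t_ts))
        drifts.append({"t": t_ts,
                       "drift": v_score - t_score,
                       "diverged": abs(v_score - t_score) > threshold})
    return drifts

text = [(0.0, 0.1), (5.0, 0.1)]    # text reads calm throughout
voice = [(0.2, 0.15), (5.1, 0.7)]  # voice agitation climbing
result = align_and_drift(text, voice)
```

The second pair diverges: text still reads calm while voice agitation has climbed, which is exactly the pattern the conflict detector below classifies.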
Conflict Detection
When text and voice emotions diverge beyond threshold, the conflict detector identifies four contradiction patterns:
| Conflict Type | Text State | Voice State | Severity |
|---|---|---|---|
| text_calm_voice_agitated | calm | agitation/stress | high |
| text_neutral_voice_stressed | neutral | stress | medium |
| text_positive_voice_negative | positive | negative | high |
| voice_calm_text_frustrated | frustration | calm | medium |
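The table above translates directly into a lookup. The coarse text/voice labels used as keys are assumed inputs from the upstream classifiers; the pattern names and severities come from the table.

```python
CONFLICT_PATTERNS = {
    ("calm", "agitation"): ("text_calm_voice_agitated", "high"),
    ("calm", "stress"): ("text_calm_voice_agitated", "high"),
    ("neutral", "stress"): ("text_neutral_voice_stressed", "medium"),
    ("positive", "negative"): ("text_positive_voice_negative", "high"),
    ("frustration", "calm"): ("voice_calm_text_frustrated", "medium"),
}

def detect_conflict(text_state: str, voice_state: str):
    """Return (conflict_type, severity), or None when the channels agree."""
    return CONFLICT_PATTERNS.get((text_state, voice_state))

conflict = detect_conflict("calm", "stress")
```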
6. Session State Machine
The session state engine fuses three signal sources into a single escalation score:
Escalation Score = 0.3 × drift + 0.4 × conflict + 0.3 × emotion
This score drives a 7-state machine; each state and its recommended action are listed under Policy Hooks below.
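The fusion step is the weighted sum from the formula above. The score thresholds and the divergence check in `classify` are hypothetical placeholders for the calibrated cut-offs in the session_state domain, shown here only to make the hidden_frustration path concrete.

```python
WEIGHTS = {"drift": 0.3, "conflict": 0.4, "emotion": 0.3}

def escalation_score(drift: float, conflict: float, emotion: float) -> float:
    """All inputs normalized to [0, 1]; returns the fused escalation score."""
    return (WEIGHTS["drift"] * drift
            + WEIGHTS["conflict"] * conflict
            + WEIGHTS["emotion"] * emotion)

def classify(score: float, text_calm: bool, voice_agitated: bool) -> str:
    """Illustrative mapping onto a subset of the 7 states."""
    if text_calm and voice_agitated and score >= 0.4:
        return "hidden_frustration"  # the divergence neither modality alone reveals
    if score >= 0.7:
        return "active_escalation"
    if score >= 0.5:
        return "escalation_risk"
    if score >= 0.3:
        return "rising_tension"
    return "calm"

score = escalation_score(drift=0.6, conflict=0.8, emotion=0.2)
```

Note how the same score lands in different states depending on the conflict pattern: a 0.56 score with calm text and agitated voice is hidden_frustration, while the same score with aligned channels is ordinary escalation_risk.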
Policy Hooks
Each state maps to a recommended action:
| State | Action | Suggested Response |
|---|---|---|
| calm | no_action | Continue normal conversation |
| rising_tension | increase_monitoring | Monitor more frequently |
| hidden_frustration | flag_for_review | Flag: text appears calm but voice indicates distress |
| escalation_risk | recommend_intervention | Recommend supervisor review |
| active_escalation | escalate_immediately | Escalate now — active distress detected |
| deescalating | increase_monitoring | Maintain elevated monitoring |
| resolved | record_resolution | Record resolution for review |
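The policy hooks are a straight lookup from the table above; representing them as a dict is an assumption about the shape, but the state names, actions, and responses are transcribed verbatim.

```python
POLICY_HOOKS = {
    "calm": ("no_action", "Continue normal conversation"),
    "rising_tension": ("increase_monitoring", "Monitor more frequently"),
    "hidden_frustration": ("flag_for_review",
                           "Flag: text appears calm but voice indicates distress"),
    "escalation_risk": ("recommend_intervention", "Recommend supervisor review"),
    "active_escalation": ("escalate_immediately",
                          "Escalate now — active distress detected"),
    "deescalating": ("increase_monitoring", "Maintain elevated monitoring"),
    "resolved": ("record_resolution", "Record resolution for review"),
}

action, response = POLICY_HOOKS["hidden_frustration"]
```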
7. Explainability Engine
Every state decision includes a full StateBasis decomposition:
- Score Contributors: Three weighted factors (drift/conflict/emotion) with raw scores, weights, and details
- Dominant Signal: Which factor most influenced the decision
- Classification Path: Human-readable description of the decision logic
- Calibration Version: Tracks which scoring parameters were used
This ensures that every escalation decision can be audited, explained, and calibrated.
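A hedged sketch of what the StateBasis decomposition could look like; the field names mirror the bullets above, but the exact shape is part of the frozen session_state contracts, not this example.

```python
from dataclasses import dataclass

@dataclass
class Contributor:
    name: str        # "drift" | "conflict" | "emotion"
    raw_score: float
    weight: float

    @property
    def weighted(self) -> float:
        return self.raw_score * self.weight

@dataclass
class StateBasis:
    score_contributors: list      # the three weighted factors
    classification_path: str      # human-readable decision logic
    calibration_version: str      # which scoring parameters were used

    @property
    def dominant_signal(self) -> str:
        return max(self.score_contributors, key=lambda c: c.weighted).name

basis = StateBasis(
    score_contributors=[Contributor("drift", 0.6, 0.3),
                        Contributor("conflict", 0.8, 0.4),
                        Contributor("emotion", 0.2, 0.3)],
    classification_path="conflict dominated; fused score 0.56 -> escalation_risk",
    calibration_version="v1",
)
```

An auditor replaying this basis can recompute the fused score from the raw scores and weights and confirm which factor drove the decision.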
8. Implementation Metrics
| Metric | Count |
|---|---|
| Frozen contracts (voice_intelligence + signal_correlation + session_state) | 22 |
| Pure domain modules | 16 |
| Kernel tests | 296 |
| Shared layer tests | 119 |
| Total tests | 415 |
| Gate violations | 0 |