Voice Intelligence: The Conversational State Runtime

When a customer says “everything is fine” through clenched teeth, text analysis sees calm while their voice screams agitation. Voice Intelligence bridges this gap — fusing text and voice emotion signals into a unified session state machine that detects hidden frustration, tracks escalation risk, and recommends intervention policies in real time.


1. Why Cross-Modal Emotion Matters

Traditional contact center analytics operate in a single modality: either text sentiment or voice tone, never both simultaneously. This creates dangerous blind spots:

  • A customer typing “no problem” while their voice pitch rises 40% — text says calm, voice says stress
  • An escalation trend visible only when correlating text frustration with voice agitation over time
  • A “hidden frustration” state that neither modality alone can detect, but their divergence reveals

Voice Intelligence solves this by treating text and voice as two parallel signal channels, correlating them through a timeline alignment engine, and fusing the results into a 7-state session state machine with automatic policy recommendations.


2. Architecture Overview

The Voice Intelligence runtime is built across five layers, following the OctopusOS three-tier architecture (kernel contracts, kernel domains, shared layer).

Voice Intelligence Five-Layer Model:

  • L5 (Action Surface): Policy Hooks, Evidence Pack Export, Operator Dashboard, Escalation Alerts
  • L4 (Session State Engine): Escalation Score Fusion, 7-State Machine, Transition Detector, Explainability
  • L3 (Correlation Engine): Timeline Alignment, Emotion Correlation, Drift Detection, Conflict Detector
  • L2 (Emotion Analysis): Text Emotion Scorer, Voice Emotion Scorer, Emotion Tracker, AMD Classifier
  • L1 (Ingest Adapters): Twilio WebSocket, RTP Stream, SIPREC Recorder, Chat Message API

Layer placement rules:

  • L1 (Ingest) lives in server/shared/ports_impl/ — WebSocket handlers, audio stream adapters
  • L2 (Emotion) lives in kernel/domains/voice_intelligence/ — pure scoring functions, zero I/O
  • L3 (Correlation) lives in kernel/domains/signal_correlation/ — pure timeline alignment and drift detection
  • L4 (State) lives in kernel/domains/session_state/ — pure state machine, policy lookup, explainability
  • L5 (Surface) lives in server/shared/ — HTTP routes, dashboard, evidence export

3. Text Emotion Scoring

The text emotion scorer analyzes chat messages through a multi-feature pipeline:

Text Emotion Scoring Pipeline:

  1. Keyword Matching: match against weighted emotion dictionaries (frustration words, confusion words, urgency markers, calm indicators)
  2. Feature Extraction: punctuation density (!!!), capitalization ratio (ALL CAPS), word repetition, message length analysis
  3. Intensity Calculation: combine keyword weights with feature multipliers to compute raw emotion intensities
  4. Score Normalization: normalize to the [0, 1] range, identify the dominant emotion, compute a confidence score

Output: TextEmotionScores — frustration, confusion, urgency, calm scores plus dominant_emotion and confidence.
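As a concrete illustration, here is a minimal sketch of the four-step pipeline in Python. The dictionary contents, feature weights, and the confidence heuristic are illustrative assumptions, not the production scorer (which also covers confusion and urgency):

```python
import re
from dataclasses import dataclass

# Illustrative weighted emotion dictionaries (assumed; real ones are far larger).
FRUSTRATION_WORDS = {"ridiculous": 0.9, "waiting": 0.5, "again": 0.4, "unacceptable": 1.0}
CALM_WORDS = {"thanks": 0.6, "great": 0.5, "fine": 0.3}

@dataclass
class TextEmotionScores:
    frustration: float
    calm: float
    dominant_emotion: str
    confidence: float

def score_text(message: str) -> TextEmotionScores:
    words = re.findall(r"[a-z]+", message.lower())
    # Step 1: keyword matching against the weighted dictionaries.
    frustration = sum(FRUSTRATION_WORDS.get(w, 0.0) for w in words)
    calm = sum(CALM_WORDS.get(w, 0.0) for w in words)
    # Step 2: feature extraction -- punctuation density and capitalization ratio.
    exclaim_density = message.count("!") / max(len(message), 1)
    alpha = sum(c.isalpha() for c in message)
    caps_ratio = sum(c.isupper() for c in message) / max(alpha, 1)
    # Step 3: intensity = keyword weight x feature multipliers (weights assumed).
    frustration *= 1.0 + 2.0 * exclaim_density + caps_ratio
    # Step 4: normalize to [0, 1], pick the dominant emotion, derive confidence
    # from the margin between the top two scores.
    frustration, calm = min(frustration, 1.0), min(calm, 1.0)
    dominant = "frustration" if frustration >= calm else "calm"
    return TextEmotionScores(frustration, calm, dominant, abs(frustration - calm))
```

Note that the scorer is a pure function of the message text, which is what lets it live in `kernel/domains/voice_intelligence/` with zero I/O.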


4. Voice Emotion Scoring

Voice emotion analysis works on acoustic features extracted from audio frames:

Energy Analysis
  • energy_mean — average signal amplitude
  • energy_variance — amplitude stability
  • intensity_drift — energy trend over time

Pitch Analysis
  • pitch_mean_hz — average speaking pitch
  • pitch_variance — pitch stability (stress indicator)
  • pitch_contour — rising/falling patterns

Temporal Analysis
  • speaking_rate — words per second
  • silence_ratio — pause frequency
  • speech_continuity — flow vs. fragmentation

Score Computation
  • agitation — high energy + high pitch variance
  • stress — rising pitch + fast speaking rate
  • hesitation — high silence ratio + low continuity
  • calm — low variance + steady pitch

5. Cross-Channel Drift Correlation

The correlation engine aligns text and voice emotion timelines and detects five drift states:

Cross-Channel Drift States (state diagram): Aligned Positive, Aligned Negative, Diverging, Escalating Cross, and Deescalating Cross, with transitions labeled "voice agitates", "both worsen", "intervention", "resolved", and "text follows voice".
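One way to sketch the drift classifier: compute a per-window trend (delta) of each channel's negative-emotion score, then classify the pair of trend directions. The mapping from trend signs to the five state names is this sketch's assumption, as is the `eps` dead band:

```python
def classify_drift(text_trend: float, voice_trend: float, eps: float = 0.1) -> str:
    """Classify the joint direction of text and voice negative-emotion trends.

    Trends are per-window deltas of each channel's negative-emotion score;
    positive means the channel is worsening. State semantics are assumed.
    """
    text_up, voice_up = text_trend > eps, voice_trend > eps
    text_down, voice_down = text_trend < -eps, voice_trend < -eps
    if text_down and voice_down:
        return "aligned_positive"    # both channels improving together
    if text_up and voice_up:
        return "aligned_negative"    # both channels worsening together
    if voice_up:
        return "escalating_cross"    # voice agitates while text stays flat or calm
    if voice_down and text_up:
        return "diverging"           # channels crossing in opposite directions
    if voice_down:
        return "deescalating_cross"  # voice recovers ahead of text
    return "stable"                  # both flat: no drift detected this window
```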

Conflict Detection

When text and voice emotions diverge beyond threshold, the conflict detector identifies four contradiction patterns:

| Conflict Type | Text State | Voice State | Severity |
|---|---|---|---|
| text_calm_voice_agitated | calm | agitation/stress | high |
| text_neutral_voice_stressed | neutral | stress | medium |
| text_positive_voice_negative | positive | negative | high |
| voice_calm_text_frustrated | frustration | calm | medium |

6. Session State Machine

The session state engine fuses three signal sources into a single escalation score:

Escalation Score = 0.3 × drift + 0.4 × conflict + 0.3 × emotion

This score drives a 7-state machine:

Session State Machine transitions:

  1. Calm → Rising Tension (score > 0.25)
  2. Rising Tension → Hidden Frustration (conflict detected)
  3. Hidden Frustration → Escalation Risk (score > 0.55)
  4. Escalation Risk → Active Escalation (score > 0.75)
  5. Active Escalation → Deescalating (score dropping)
  6. Deescalating → Resolved (score < 0.15)
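The fusion formula and the transitions above can be sketched together. The weights and thresholds are the ones quoted in this section; the one-transition-per-update walk and the use of the previous score to detect "score dropping" are this sketch's simplifications:

```python
# Fusion weights from the escalation score formula above.
WEIGHTS = {"drift": 0.3, "conflict": 0.4, "emotion": 0.3}

def escalation_score(drift: float, conflict: float, emotion: float) -> float:
    """Weighted fusion of the three signal sources, each in [0, 1]."""
    return (WEIGHTS["drift"] * drift
            + WEIGHTS["conflict"] * conflict
            + WEIGHTS["emotion"] * emotion)

def next_state(state: str, score: float, prev_score: float,
               conflict_detected: bool) -> str:
    """Advance the 7-state machine by at most one transition per update."""
    if state == "calm" and score > 0.25:
        return "rising_tension"
    if state == "rising_tension" and conflict_detected:
        return "hidden_frustration"
    if state == "hidden_frustration" and score > 0.55:
        return "escalation_risk"
    if state == "escalation_risk" and score > 0.75:
        return "active_escalation"
    if state == "active_escalation" and score < prev_score:
        return "deescalating"
    if state == "deescalating" and score < 0.15:
        return "resolved"
    return state  # no transition condition met this update
```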

Policy Hooks

Each state maps to a recommended action:

| State | Action | Suggested Response |
|---|---|---|
| calm | no_action | Continue normal conversation |
| rising_tension | increase_monitoring | Monitor more frequently |
| hidden_frustration | flag_for_review | Flag: text appears calm but voice indicates distress |
| escalation_risk | recommend_intervention | Recommend supervisor review |
| active_escalation | escalate_immediately | Escalate now — active distress detected |
| deescalating | increase_monitoring | Maintain elevated monitoring |
| resolved | record_resolution | Record resolution for review |
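Because the state-to-action mapping is fixed, the policy hook reduces to a pure lookup, which fits the `kernel/domains/session_state/` zero-I/O constraint. The state and action names follow the table in this section; the dict shape is an assumption:

```python
# Policy table: session state -> (action, suggested response).
POLICY_HOOKS = {
    "calm": ("no_action", "Continue normal conversation"),
    "rising_tension": ("increase_monitoring", "Monitor more frequently"),
    "hidden_frustration": ("flag_for_review",
                           "Flag: text appears calm but voice indicates distress"),
    "escalation_risk": ("recommend_intervention", "Recommend supervisor review"),
    "active_escalation": ("escalate_immediately",
                          "Escalate now: active distress detected"),
    "deescalating": ("increase_monitoring", "Maintain elevated monitoring"),
    "resolved": ("record_resolution", "Record resolution for review"),
}

def recommend(state: str) -> tuple[str, str]:
    """Pure lookup: session state in, (action, suggested_response) out."""
    return POLICY_HOOKS[state]
```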

7. Explainability Engine

Every state decision includes a full StateBasis decomposition:

  • Score Contributors: Three weighted factors (drift/conflict/emotion) with raw scores, weights, and details
  • Dominant Signal: Which factor most influenced the decision
  • Classification Path: Human-readable description of the decision logic
  • Calibration Version: Tracks which scoring parameters were used

This ensures that every escalation decision can be audited, explained, and calibrated.
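A sketch of what the StateBasis decomposition might look like as a data structure. The field names follow the bullet list above; the exact shape, the example values, and deriving the dominant signal as the largest weighted contribution are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ScoreContributor:
    factor: str       # "drift" | "conflict" | "emotion"
    raw_score: float  # the factor's unweighted score
    weight: float     # its fusion weight

    @property
    def contribution(self) -> float:
        return self.raw_score * self.weight

@dataclass
class StateBasis:
    contributors: list[ScoreContributor]  # the three weighted factors
    classification_path: str              # human-readable decision logic
    calibration_version: str              # which scoring parameters were used

    @property
    def dominant_signal(self) -> str:
        # The factor with the largest weighted contribution drove the decision.
        return max(self.contributors, key=lambda c: c.contribution).factor

# Hypothetical audit record for one transition decision.
basis = StateBasis(
    contributors=[
        ScoreContributor("drift", 0.5, 0.3),     # contributes 0.15
        ScoreContributor("conflict", 0.8, 0.4),  # contributes 0.32
        ScoreContributor("emotion", 0.4, 0.3),   # contributes 0.12
    ],
    classification_path="escalation score 0.59 crossed the 0.55 threshold",
    calibration_version="v1",
)
```

Persisting the raw scores and weights alongside the calibration version is what makes a decision reproducible after the scoring parameters change.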


8. Implementation Metrics

| Metric | Count |
|---|---|
| Frozen contracts (voice_intelligence + signal_correlation + session_state) | 22 |
| Pure domain modules | 16 |
| Kernel tests | 296 |
| Shared layer tests | 119 |
| Total tests | 415 |
| Gate violations | 0 |