Voice Intelligence: The Conversational State Runtime

When a customer says “everything is fine” through clenched teeth, text analysis sees calm while their voice screams agitation. Voice Intelligence bridges this gap — fusing text and voice emotion signals into a unified session state machine that detects hidden frustration, tracks escalation risk, and recommends intervention policies in real time.


1. Why Cross-Modal Emotion Matters

Traditional contact center analytics operate in a single modality: either text sentiment or voice tone, never both simultaneously. This creates dangerous blind spots:

  • A customer typing “no problem” while their voice pitch rises 40% — text says calm, voice says stress
  • An escalation trend visible only when correlating text frustration with voice agitation over time
  • A “hidden frustration” state that neither modality alone can detect, but their divergence reveals

Voice Intelligence solves this by treating text and voice as two parallel signal channels, correlating them through a timeline alignment engine, and fusing the results into a 7-state session state machine with automatic policy recommendations.


2. Architecture Overview

The Voice Intelligence runtime is built across five layers, following the OctopusOS three-tier architecture (kernel contracts, kernel domains, shared layer).

Voice Intelligence Five-Layer Model:

  • L5 (Action Surface): Policy Hooks, Evidence Pack Export, Operator Dashboard, Escalation Alerts
  • L4 (Session State Engine): Escalation Score Fusion, 7-State Machine, Transition Detector, Explainability
  • L3 (Correlation Engine): Timeline Alignment, Emotion Correlation, Drift Detection, Conflict Detector
  • L2 (Emotion Analysis): Text Emotion Scorer, Voice Emotion Scorer, Emotion Tracker, AMD Classifier
  • L1 (Ingest Adapters): Twilio WebSocket, RTP Stream, SIPREC Recorder, Chat Message API

Layer placement rules:

  • L1 (Ingest) lives in server/shared/ports_impl/ — WebSocket handlers, audio stream adapters
  • L2 (Emotion) lives in kernel/domains/voice_intelligence/ — pure scoring functions, zero I/O
  • L3 (Correlation) lives in kernel/domains/signal_correlation/ — pure timeline alignment and drift detection
  • L4 (State) lives in kernel/domains/session_state/ — pure state machine, policy lookup, explainability
  • L5 (Surface) lives in server/shared/ — HTTP routes, dashboard, evidence export

3. Text Emotion Scoring

The text emotion scorer analyzes chat messages through a multi-feature pipeline:

Text Emotion Scoring Pipeline:

  1. Keyword Matching: match against weighted emotion dictionaries (frustration words, confusion words, urgency markers, calm indicators)
  2. Feature Extraction: punctuation density (!!!), capitalization ratio (ALL CAPS), word repetition, message length analysis
  3. Intensity Calculation: combine keyword weights with feature multipliers to compute raw emotion intensities
  4. Score Normalization: normalize to the [0, 1] range, identify the dominant emotion, compute a confidence score

Output: TextEmotionScores — frustration, confusion, urgency, calm scores plus dominant_emotion and confidence.
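As a concrete illustration, here is a minimal sketch of the four-step pipeline in Python. The dictionary contents, feature weights, and the confidence heuristic are illustrative assumptions, not the production scorer (which also covers confusion and urgency):

```python
import re
from dataclasses import dataclass

# Illustrative weighted emotion dictionaries (assumed; real ones are far larger).
FRUSTRATION_WORDS = {"ridiculous": 0.9, "waiting": 0.5, "again": 0.4, "unacceptable": 1.0}
CALM_WORDS = {"thanks": 0.6, "great": 0.5, "fine": 0.3}

@dataclass
class TextEmotionScores:
    frustration: float
    calm: float
    dominant_emotion: str
    confidence: float

def score_text(message: str) -> TextEmotionScores:
    words = re.findall(r"[a-z]+", message.lower())
    # Step 1: keyword matching against the weighted dictionaries.
    frustration = sum(FRUSTRATION_WORDS.get(w, 0.0) for w in words)
    calm = sum(CALM_WORDS.get(w, 0.0) for w in words)
    # Step 2: feature extraction -- punctuation density and capitalization ratio.
    exclaim_density = message.count("!") / max(len(message), 1)
    alpha = sum(c.isalpha() for c in message)
    caps_ratio = sum(c.isupper() for c in message) / max(alpha, 1)
    # Step 3: intensity = keyword weight x feature multipliers (weights assumed).
    frustration *= 1.0 + 2.0 * exclaim_density + caps_ratio
    # Step 4: normalize to [0, 1], pick the dominant emotion, derive confidence
    # from the margin between the top two scores.
    frustration, calm = min(frustration, 1.0), min(calm, 1.0)
    dominant = "frustration" if frustration >= calm else "calm"
    return TextEmotionScores(frustration, calm, dominant, abs(frustration - calm))
```

Note that the scorer is a pure function of the message text, which is what lets it live in `kernel/domains/voice_intelligence/` with zero I/O.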


4. Voice Emotion Scoring

Voice emotion analysis works on acoustic features extracted from audio frames:

Energy Analysis
  • energy_mean — average signal amplitude
  • energy_variance — amplitude stability
  • intensity_drift — energy trend over time

Pitch Analysis
  • pitch_mean_hz — average speaking pitch
  • pitch_variance — pitch stability (stress indicator)
  • pitch_contour — rising/falling patterns

Temporal Analysis
  • speaking_rate — words per second
  • silence_ratio — pause frequency
  • speech_continuity — flow vs. fragmentation

Score Computation
  • agitation — high energy + high pitch variance
  • stress — rising pitch + fast speaking rate
  • hesitation — high silence ratio + low continuity
  • calm — low variance + steady pitch

5. Cross-Channel Drift Correlation

The correlation engine aligns text and voice emotion timelines and detects five drift states:

Cross-Channel Drift States (state diagram): Aligned Positive, Aligned Negative, Diverging, Escalating Cross, and Deescalating Cross, with transitions labeled "voice agitates", "both worsen", "intervention", "resolved", and "text follows voice".
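One way to sketch the drift classifier: compute a per-window trend (delta) of each channel's negative-emotion score, then classify the pair of trend directions. The mapping from trend signs to the five state names is this sketch's assumption, as is the `eps` dead band:

```python
def classify_drift(text_trend: float, voice_trend: float, eps: float = 0.1) -> str:
    """Classify the joint direction of text and voice negative-emotion trends.

    Trends are per-window deltas of each channel's negative-emotion score;
    positive means the channel is worsening. State semantics are assumed.
    """
    text_up, voice_up = text_trend > eps, voice_trend > eps
    text_down, voice_down = text_trend < -eps, voice_trend < -eps
    if text_down and voice_down:
        return "aligned_positive"    # both channels improving together
    if text_up and voice_up:
        return "aligned_negative"    # both channels worsening together
    if voice_up:
        return "escalating_cross"    # voice agitates while text stays flat or calm
    if voice_down and text_up:
        return "diverging"           # channels crossing in opposite directions
    if voice_down:
        return "deescalating_cross"  # voice recovers ahead of text
    return "stable"                  # both flat: no drift detected this window
```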

Conflict Detection

When text and voice emotions diverge beyond threshold, the conflict detector identifies four contradiction patterns:

| Conflict Type | Text State | Voice State | Severity |
|---|---|---|---|
| text_calm_voice_agitated | calm | agitation/stress | high |
| text_neutral_voice_stressed | neutral | stress | medium |
| text_positive_voice_negative | positive | negative | high |
| voice_calm_text_frustrated | frustration | calm | medium |

6. Session State Machine

The session state engine fuses three signal sources into a single escalation score:

Escalation Score = 0.3 × drift + 0.4 × conflict + 0.3 × emotion

This score drives a 7-state machine:

Session State Machine transitions:

  1. Calm → Rising Tension (score > 0.25)
  2. Rising Tension → Hidden Frustration (conflict detected)
  3. Hidden Frustration → Escalation Risk (score > 0.55)
  4. Escalation Risk → Active Escalation (score > 0.75)
  5. Active Escalation → Deescalating (score dropping)
  6. Deescalating → Resolved (score < 0.15)
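The fusion formula and the transitions above can be sketched together. The weights and thresholds are the ones quoted in this section; the one-transition-per-update walk and the use of the previous score to detect "score dropping" are this sketch's simplifications:

```python
# Fusion weights from the escalation score formula above.
WEIGHTS = {"drift": 0.3, "conflict": 0.4, "emotion": 0.3}

def escalation_score(drift: float, conflict: float, emotion: float) -> float:
    """Weighted fusion of the three signal sources, each in [0, 1]."""
    return (WEIGHTS["drift"] * drift
            + WEIGHTS["conflict"] * conflict
            + WEIGHTS["emotion"] * emotion)

def next_state(state: str, score: float, prev_score: float,
               conflict_detected: bool) -> str:
    """Advance the 7-state machine by at most one transition per update."""
    if state == "calm" and score > 0.25:
        return "rising_tension"
    if state == "rising_tension" and conflict_detected:
        return "hidden_frustration"
    if state == "hidden_frustration" and score > 0.55:
        return "escalation_risk"
    if state == "escalation_risk" and score > 0.75:
        return "active_escalation"
    if state == "active_escalation" and score < prev_score:
        return "deescalating"
    if state == "deescalating" and score < 0.15:
        return "resolved"
    return state  # no transition condition met this update
```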

Policy Hooks

Each state maps to a recommended action:

| State | Action | Suggested Response |
|---|---|---|
| calm | no_action | Continue normal conversation |
| rising_tension | increase_monitoring | Monitor more frequently |
| hidden_frustration | flag_for_review | Flag: text appears calm but voice indicates distress |
| escalation_risk | recommend_intervention | Recommend supervisor review |
| active_escalation | escalate_immediately | Escalate now — active distress detected |
| deescalating | increase_monitoring | Maintain elevated monitoring |
| resolved | record_resolution | Record resolution for review |
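Because the state-to-action mapping is fixed, the policy hook reduces to a pure lookup, which fits the `kernel/domains/session_state/` zero-I/O constraint. The state and action names follow the table in this section; the dict shape is an assumption:

```python
# Policy table: session state -> (action, suggested response).
POLICY_HOOKS = {
    "calm": ("no_action", "Continue normal conversation"),
    "rising_tension": ("increase_monitoring", "Monitor more frequently"),
    "hidden_frustration": ("flag_for_review",
                           "Flag: text appears calm but voice indicates distress"),
    "escalation_risk": ("recommend_intervention", "Recommend supervisor review"),
    "active_escalation": ("escalate_immediately",
                          "Escalate now: active distress detected"),
    "deescalating": ("increase_monitoring", "Maintain elevated monitoring"),
    "resolved": ("record_resolution", "Record resolution for review"),
}

def recommend(state: str) -> tuple[str, str]:
    """Pure lookup: session state in, (action, suggested_response) out."""
    return POLICY_HOOKS[state]
```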

7. Explainability Engine

Every state decision includes a full StateBasis decomposition:

  • Score Contributors: Three weighted factors (drift/conflict/emotion) with raw scores, weights, and details
  • Dominant Signal: Which factor most influenced the decision
  • Classification Path: Human-readable description of the decision logic
  • Calibration Version: Tracks which scoring parameters were used

This ensures that every escalation decision can be audited, explained, and calibrated.
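A sketch of what the StateBasis decomposition might look like as a data structure. The field names follow the bullet list above; the exact shape, the example values, and deriving the dominant signal as the largest weighted contribution are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ScoreContributor:
    factor: str       # "drift" | "conflict" | "emotion"
    raw_score: float  # the factor's unweighted score
    weight: float     # its fusion weight

    @property
    def contribution(self) -> float:
        return self.raw_score * self.weight

@dataclass
class StateBasis:
    contributors: list[ScoreContributor]  # the three weighted factors
    classification_path: str              # human-readable decision logic
    calibration_version: str              # which scoring parameters were used

    @property
    def dominant_signal(self) -> str:
        # The factor with the largest weighted contribution drove the decision.
        return max(self.contributors, key=lambda c: c.contribution).factor

# Hypothetical audit record for one transition decision.
basis = StateBasis(
    contributors=[
        ScoreContributor("drift", 0.5, 0.3),     # contributes 0.15
        ScoreContributor("conflict", 0.8, 0.4),  # contributes 0.32
        ScoreContributor("emotion", 0.4, 0.3),   # contributes 0.12
    ],
    classification_path="escalation score 0.59 crossed the 0.55 threshold",
    calibration_version="v1",
)
```

Persisting the raw scores and weights alongside the calibration version is what makes a decision reproducible after the scoring parameters change.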


8. Implementation Metrics

| Metric | Count |
|---|---|
| Frozen contracts (voice_intelligence + signal_correlation + session_state) | 22 |
| Pure domain modules | 16 |
| Kernel tests | 296 |
| Shared layer tests | 119 |
| Total tests | 415 |
| Gate violations | 0 |