Cross-Modal Emotion Analysis Engine

Emotion is the hidden variable in every customer conversation. A frustrated customer who types politely is about to escalate. A stressed caller whose voice wavers needs immediate attention. The Emotion Analysis Engine detects what neither text nor voice alone can reveal — by analyzing both channels simultaneously and correlating their signals over time.


1. The Problem: Single-Modal Blindness

Current emotion analysis tools suffer from a fundamental limitation: they analyze one modality at a time.

  • Text-only tools (IBM Watson NLU, Google Cloud NL) detect sentiment from words but miss vocal cues — a customer saying “fine” in a strained voice reads as positive
  • Voice-only tools (Cogito, Beyond Verbal) detect vocal patterns but miss textual context — a technical question asked calmly might have frustrated text history
  • No tool currently fuses both modalities into a continuous session state with escalation tracking

The Emotion Analysis Engine is the first system to treat text and voice as parallel signal channels, correlate them in real time, and produce a unified conversation state.


2. Text Emotion Architecture

Scoring Model

Text emotion scoring uses a multi-feature approach rather than a single ML model, ensuring transparency and calibration control:

Text Emotion Scoring

1. Keyword Dictionary: weighted word lists for frustration (angry, unacceptable, ridiculous), confusion (confused, unclear, lost), urgency (immediately, ASAP, critical), and calm (okay, thanks, good).
2. Punctuation Analysis: exclamation marks (!!!) boost frustration by 1.5x; question marks (???) boost confusion; ALL CAPS boosts urgency; repeated words indicate emphasis.
3. Intensity Fusion: keyword weights * punctuation multipliers = raw intensity per dimension; apply sigmoid normalization to [0, 1].
4. Dominant Classification: the highest-scoring dimension becomes dominant_emotion; confidence = max_score - second_highest. Output: TextEmotionScores.

Output Contract

from dataclasses import dataclass

@dataclass(frozen=True)
class TextEmotionScores:
    frustration: float    # 0.0 - 1.0
    confusion: float      # 0.0 - 1.0
    urgency: float        # 0.0 - 1.0
    calm: float           # 0.0 - 1.0
    dominant_emotion: str  # highest scoring dimension
    confidence: float     # classification confidence
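The four scoring steps can be sketched end to end. This is a minimal illustration, not the production scorer: the keyword weights, the `!!`/`??`/ALL-CAPS triggers, and the shifted-sigmoid normalization (so zero raw intensity maps to 0.0) are assumptions; only the step order and the 1.5x multiplier come from the description above.

```python
import math
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class TextEmotionScores:
    frustration: float
    confusion: float
    urgency: float
    calm: float
    dominant_emotion: str
    confidence: float

# Step 1: weighted word lists (weights here are illustrative)
KEYWORDS = {
    "frustration": {"angry": 1.0, "unacceptable": 0.9, "ridiculous": 0.8},
    "confusion": {"confused": 1.0, "unclear": 0.8, "lost": 0.6},
    "urgency": {"immediately": 1.0, "asap": 1.0, "critical": 0.9},
    "calm": {"okay": 0.5, "thanks": 0.7, "good": 0.5},
}

def score_text(text: str) -> TextEmotionScores:
    words = re.findall(r"[a-z']+", text.lower())
    raw = {dim: sum(w.get(t, 0.0) for t in words) for dim, w in KEYWORDS.items()}
    # Step 2: punctuation multipliers (1.5x boost per the scoring model)
    if "!!" in text:
        raw["frustration"] *= 1.5
    if "??" in text:
        raw["confusion"] *= 1.5
    if any(t.isupper() and len(t) > 2 for t in text.split()):
        raw["urgency"] *= 1.5
    # Step 3: shifted sigmoid maps raw intensity [0, inf) onto [0, 1)
    norm = {d: 2 / (1 + math.exp(-v)) - 1 for d, v in raw.items()}
    # Step 4: dominant dimension; confidence is the margin over the runner-up
    ranked = sorted(norm, key=norm.get, reverse=True)
    return TextEmotionScores(
        frustration=norm["frustration"], confusion=norm["confusion"],
        urgency=norm["urgency"], calm=norm["calm"],
        dominant_emotion=ranked[0],
        confidence=norm[ranked[0]] - norm[ranked[1]],
    )
```

A message like "This is unacceptable!!!" would score high on frustration and classify that dimension as dominant.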

3. Voice Emotion Architecture

Acoustic Feature Extraction

Voice emotion analysis begins with extracting acoustic features from audio frames:

Energy Features
  • energy_mean — average amplitude across the frame
  • energy_variance — amplitude stability measure
  • intensity_drift — trend direction over a sliding window

Pitch Features
  • pitch_mean_hz — average speaking frequency
  • pitch_variance — frequency stability (key stress indicator)
  • pitch_range — difference between min and max pitch

Temporal Features
  • speaking_rate — estimated words per second
  • silence_ratio — proportion of silence in the frame
  • speech_continuity — flow score (1.0 = continuous, 0.0 = fragmented)

Derived Scores
  • agitation = f(energy_variance, pitch_variance)
  • stress = f(pitch_mean, speaking_rate, energy_drift)
  • calm = 1 - max(agitation, stress)
  • hesitation = f(silence_ratio, 1 - continuity)

Voice Emotion Scores

from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceEmotionScores:
    agitation: float       # 0.0 - 1.0
    stress: float          # 0.0 - 1.0
    calm: float            # 0.0 - 1.0
    hesitation: float      # 0.0 - 1.0
    dominant_emotion: str
    confidence: float
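The derived-score step can be sketched as a pure function over already-extracted, [0, 1]-normalized features. The document specifies the inputs to each f(...) and the exact calm formula; the simple averaging used for the other combining functions is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceEmotionScores:
    agitation: float
    stress: float
    calm: float
    hesitation: float
    dominant_emotion: str
    confidence: float

def derive_scores(energy_variance: float, pitch_variance: float,
                  pitch_mean_norm: float, speaking_rate_norm: float,
                  energy_drift_norm: float, silence_ratio: float,
                  continuity: float) -> VoiceEmotionScores:
    # agitation = f(energy_variance, pitch_variance): illustrative mean
    agitation = (energy_variance + pitch_variance) / 2
    # stress = f(pitch_mean, speaking_rate, energy_drift): illustrative mean
    stress = (pitch_mean_norm + speaking_rate_norm + energy_drift_norm) / 3
    # calm and hesitation follow the formulas stated above
    calm = 1 - max(agitation, stress)
    hesitation = (silence_ratio + (1 - continuity)) / 2
    scores = {"agitation": agitation, "stress": stress,
              "calm": calm, "hesitation": hesitation}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return VoiceEmotionScores(**scores, dominant_emotion=ranked[0],
                              confidence=scores[ranked[0]] - scores[ranked[1]])
```

Because calm is defined as the complement of the worst negative signal, a frame can never be simultaneously calm and highly agitated.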

4. Emotion Tracking Over Time

Individual emotion scores are snapshots. The emotion tracker converts them into a continuous timeline:

Emotion Change Detection

1. Previous State: the last known EmotionEvent with scores and type (detected/escalation/deescalation).
2. Current Scores: new TextEmotionScores or VoiceEmotionScores from the latest input.
3. Delta Calculation: compute per-dimension deltas; if any negative-emotion dimension increased by more than 0.15, flag a potential escalation.
4. Event Classification: escalation (a negative-emotion delta exceeds the threshold), deescalation (negative-emotion scores fall by more than the threshold), or detected (changes within normal range).
5. EmotionEvent Emission: emit a typed event with a correlation_id linking it to the session, the delta values, and reason codes.
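A minimal sketch of the delta-based classifier above. The 0.15 escalation threshold and the 0.2 frustration-spike threshold come from this document; the `EmotionEvent` fields, the `NEGATIVE_DIMS` set, and the exact reason-code conditions are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumption: these dimensions count as "negative" for escalation purposes
NEGATIVE_DIMS = {"frustration", "confusion", "urgency",
                 "agitation", "stress", "hesitation"}
ESCALATION_DELTA = 0.15  # per-dimension threshold from the tracker description

@dataclass(frozen=True)
class EmotionEvent:
    correlation_id: str
    event_type: str          # "detected" | "escalation" | "deescalation"
    deltas: dict
    reason_codes: tuple

def classify_change(correlation_id: str, previous: dict,
                    current: dict) -> EmotionEvent:
    # Step 3: per-dimension deltas against the last known state
    deltas = {d: current[d] - previous.get(d, 0.0) for d in current}
    neg = [d for d in deltas if d in NEGATIVE_DIMS]
    # Step 4: classify the event by the direction of negative-emotion change
    if any(deltas[d] > ESCALATION_DELTA for d in neg):
        event_type = "escalation"
    elif any(deltas[d] < -ESCALATION_DELTA for d in neg):
        event_type = "deescalation"
    else:
        event_type = "detected"
    # Step 5: attach reason codes for auditability (conditions illustrative)
    reasons = tuple(code for cond, code in [
        (deltas.get("frustration", 0.0) > 0.2, "TEXT_FRUSTRATION_SPIKE"),
        (event_type == "detected" and max(current.values()) < 0.3, "TEXT_NEUTRAL"),
    ] if cond)
    return EmotionEvent(correlation_id, event_type, deltas, reasons)
```

For example, a frustration score jumping from 0.2 to 0.5 between two snapshots would emit an escalation event carrying TEXT_FRUSTRATION_SPIKE.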

Reason Codes

Every emotion event carries reason codes for auditability:

  • TEXT_FRUSTRATION_SPIKE — frustration score jumped by > 0.2
  • VOICE_HIGH_ENERGY — voice energy exceeded agitation threshold
  • TEXT_NEUTRAL — text analysis shows no strong emotion
  • VOICE_PITCH_RISE — pitch variance indicates increasing stress

5. Evidence Pack

The evidence pack bundles all session data for post-hoc review:

Session Metadata
  • session_id
  • duration_ms
  • channel (voice/chat)
  • source adapter

Emotion Drift
  • text emotion timeline
  • voice emotion timeline
  • cross-channel drift curve
  • escalation events

AMD Card
  • verdict (human/machine/beep)
  • confidence score
  • feature values
  • decision reason

Transcript
  • timestamped segments
  • speaker labels
  • language detection
  • confidence scores

Escalation Summary
  • reason codes
  • escalation count
  • intervention points
  • resolution outcome

Session State
  • current state + basis
  • transition history
  • policy actions taken
  • calibration version
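The six sections above can be bundled into a single serializable container. This is a sketch only: the field names mirror the section contents, but the concrete types and the plain-dict JSON serialization are assumptions.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass(frozen=True)
class EvidencePack:
    # Session metadata
    session_id: str
    duration_ms: int
    channel: str                                         # "voice" or "chat"
    source_adapter: str
    # Remaining sections, kept loosely typed in this sketch
    emotion_drift: dict = field(default_factory=dict)    # timelines + drift curve
    amd_card: dict = field(default_factory=dict)         # verdict, confidence, reason
    transcript: list = field(default_factory=list)       # timestamped segments
    escalation_summary: dict = field(default_factory=dict)
    session_state: dict = field(default_factory=dict)    # state, transitions, calibration

    def to_json(self) -> str:
        # Flatten to a plain dict so the pack can be archived for post-hoc review
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps the pack immutable once a session closes, which matches its role as an audit artifact.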

6. Implementation Metrics

  • Frozen contracts in voice_intelligence.py — 10
  • Pure domain modules in voice_intelligence/ — 9
  • Emotion-related kernel tests — 85
  • Shared layer tests — 45
  • Supported ingest adapters — 3 (Twilio, RTP, SIPREC)
  • API endpoints — 12