Cross-Modal Emotion Analysis Engine

Emotion is the hidden variable in every customer conversation. A frustrated customer who types politely is about to escalate. A stressed caller whose voice wavers needs immediate attention. The Emotion Analysis Engine detects what neither text nor voice alone can reveal — by analyzing both channels simultaneously and correlating their signals over time.


1. The Problem: Single-Modal Blindness

Current emotion analysis tools suffer from a fundamental limitation: they analyze one modality at a time.

  • Text-only tools (IBM Watson NLU, Google Cloud NL) detect sentiment from words but miss vocal cues — a customer saying “fine” in a strained voice reads as positive
  • Voice-only tools (Cogito, Beyond Verbal) detect vocal patterns but miss textual context — a technical question asked calmly might have frustrated text history
  • No tool currently fuses both modalities into a continuous session state with escalation tracking

The Emotion Analysis Engine is the first system to treat text and voice as parallel signal channels, correlate them in real time, and produce a unified conversation state.


2. Text Emotion Architecture

Scoring Model

Text emotion scoring uses a multi-feature approach rather than a single ML model, ensuring transparency and calibration control:

Text Emotion Scoring

1. Keyword Dictionary: weighted word lists for frustration (angry, unacceptable, ridiculous), confusion (confused, unclear, lost), urgency (immediately, ASAP, critical), and calm (okay, thanks, good).
2. Punctuation Analysis: exclamation marks (!!!) boost frustration by 1.5x; question marks (???) boost confusion; ALL CAPS boosts urgency; repeated words indicate emphasis.
3. Intensity Fusion: keyword weights * punctuation multipliers = raw intensity per dimension; apply sigmoid normalization to [0, 1].
4. Dominant Classification: the highest-scoring dimension becomes dominant_emotion; confidence = max_score - second_highest. Output: TextEmotionScores.

Output Contract

from dataclasses import dataclass

@dataclass(frozen=True)
class TextEmotionScores:
    frustration: float    # 0.0 - 1.0
    confusion: float      # 0.0 - 1.0
    urgency: float        # 0.0 - 1.0
    calm: float           # 0.0 - 1.0
    dominant_emotion: str  # highest scoring dimension
    confidence: float     # classification confidence
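The four scoring steps can be sketched end to end. This is a minimal illustration, not the production scorer: the keyword weights, the `!!`/`??`/ALL-CAPS triggers, and the shifted-sigmoid normalization (so zero raw intensity maps to 0.0) are assumptions; only the step order and the 1.5x multiplier come from the description above.

```python
import math
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class TextEmotionScores:
    frustration: float
    confusion: float
    urgency: float
    calm: float
    dominant_emotion: str
    confidence: float

# Step 1: weighted word lists (weights here are illustrative)
KEYWORDS = {
    "frustration": {"angry": 1.0, "unacceptable": 0.9, "ridiculous": 0.8},
    "confusion": {"confused": 1.0, "unclear": 0.8, "lost": 0.6},
    "urgency": {"immediately": 1.0, "asap": 1.0, "critical": 0.9},
    "calm": {"okay": 0.5, "thanks": 0.7, "good": 0.5},
}

def score_text(text: str) -> TextEmotionScores:
    words = re.findall(r"[a-z']+", text.lower())
    raw = {dim: sum(w.get(t, 0.0) for t in words) for dim, w in KEYWORDS.items()}
    # Step 2: punctuation multipliers (1.5x boost per the scoring model)
    if "!!" in text:
        raw["frustration"] *= 1.5
    if "??" in text:
        raw["confusion"] *= 1.5
    if any(t.isupper() and len(t) > 2 for t in text.split()):
        raw["urgency"] *= 1.5
    # Step 3: shifted sigmoid maps raw intensity [0, inf) onto [0, 1)
    norm = {d: 2 / (1 + math.exp(-v)) - 1 for d, v in raw.items()}
    # Step 4: dominant dimension; confidence is the margin over the runner-up
    ranked = sorted(norm, key=norm.get, reverse=True)
    return TextEmotionScores(
        frustration=norm["frustration"], confusion=norm["confusion"],
        urgency=norm["urgency"], calm=norm["calm"],
        dominant_emotion=ranked[0],
        confidence=norm[ranked[0]] - norm[ranked[1]],
    )
```

A message like "This is unacceptable!!!" would score high on frustration and classify that dimension as dominant.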

3. Voice Emotion Architecture

Acoustic Feature Extraction

Voice emotion analysis begins with extracting acoustic features from audio frames:

Energy Features
  • energy_mean — average amplitude across the frame
  • energy_variance — amplitude stability measure
  • intensity_drift — trend direction over a sliding window

Pitch Features
  • pitch_mean_hz — average speaking frequency
  • pitch_variance — frequency stability (key stress indicator)
  • pitch_range — difference between min and max pitch

Temporal Features
  • speaking_rate — estimated words per second
  • silence_ratio — proportion of silence in the frame
  • speech_continuity — flow score (1.0 = continuous, 0.0 = fragmented)

Derived Scores
  • agitation = f(energy_variance, pitch_variance)
  • stress = f(pitch_mean, speaking_rate, energy_drift)
  • calm = 1 - max(agitation, stress)
  • hesitation = f(silence_ratio, 1 - continuity)

Voice Emotion Scores

from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceEmotionScores:
    agitation: float       # 0.0 - 1.0
    stress: float          # 0.0 - 1.0
    calm: float            # 0.0 - 1.0
    hesitation: float      # 0.0 - 1.0
    dominant_emotion: str
    confidence: float
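The derived-score step can be sketched as a pure function over already-extracted, [0, 1]-normalized features. The document specifies the inputs to each f(...) and the exact calm formula; the simple averaging used for the other combining functions is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceEmotionScores:
    agitation: float
    stress: float
    calm: float
    hesitation: float
    dominant_emotion: str
    confidence: float

def derive_scores(energy_variance: float, pitch_variance: float,
                  pitch_mean_norm: float, speaking_rate_norm: float,
                  energy_drift_norm: float, silence_ratio: float,
                  continuity: float) -> VoiceEmotionScores:
    # agitation = f(energy_variance, pitch_variance): illustrative mean
    agitation = (energy_variance + pitch_variance) / 2
    # stress = f(pitch_mean, speaking_rate, energy_drift): illustrative mean
    stress = (pitch_mean_norm + speaking_rate_norm + energy_drift_norm) / 3
    # calm and hesitation follow the formulas stated above
    calm = 1 - max(agitation, stress)
    hesitation = (silence_ratio + (1 - continuity)) / 2
    scores = {"agitation": agitation, "stress": stress,
              "calm": calm, "hesitation": hesitation}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return VoiceEmotionScores(**scores, dominant_emotion=ranked[0],
                              confidence=scores[ranked[0]] - scores[ranked[1]])
```

Because calm is defined as the complement of the worst negative signal, a frame can never be simultaneously calm and highly agitated.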

4. Emotion Tracking Over Time

Individual emotion scores are snapshots. The emotion tracker converts them into a continuous timeline:

Emotion Change Detection

1. Previous State: the last known EmotionEvent with scores and type (detected/escalation/deescalation).
2. Current Scores: new TextEmotionScores or VoiceEmotionScores from the latest input.
3. Delta Calculation: compute per-dimension deltas; if any negative-emotion dimension increased by more than 0.15, flag a potential escalation.
4. Event Classification: escalation (a negative-emotion delta exceeds the threshold), deescalation (negative-emotion scores fall by more than the threshold), or detected (changes within normal range).
5. EmotionEvent Emission: emit a typed event with a correlation_id linking it to the session, the delta values, and reason codes.
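A minimal sketch of the delta-based classifier above. The 0.15 escalation threshold and the 0.2 frustration-spike threshold come from this document; the `EmotionEvent` fields, the `NEGATIVE_DIMS` set, and the exact reason-code conditions are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumption: these dimensions count as "negative" for escalation purposes
NEGATIVE_DIMS = {"frustration", "confusion", "urgency",
                 "agitation", "stress", "hesitation"}
ESCALATION_DELTA = 0.15  # per-dimension threshold from the tracker description

@dataclass(frozen=True)
class EmotionEvent:
    correlation_id: str
    event_type: str          # "detected" | "escalation" | "deescalation"
    deltas: dict
    reason_codes: tuple

def classify_change(correlation_id: str, previous: dict,
                    current: dict) -> EmotionEvent:
    # Step 3: per-dimension deltas against the last known state
    deltas = {d: current[d] - previous.get(d, 0.0) for d in current}
    neg = [d for d in deltas if d in NEGATIVE_DIMS]
    # Step 4: classify the event by the direction of negative-emotion change
    if any(deltas[d] > ESCALATION_DELTA for d in neg):
        event_type = "escalation"
    elif any(deltas[d] < -ESCALATION_DELTA for d in neg):
        event_type = "deescalation"
    else:
        event_type = "detected"
    # Step 5: attach reason codes for auditability (conditions illustrative)
    reasons = tuple(code for cond, code in [
        (deltas.get("frustration", 0.0) > 0.2, "TEXT_FRUSTRATION_SPIKE"),
        (event_type == "detected" and max(current.values()) < 0.3, "TEXT_NEUTRAL"),
    ] if cond)
    return EmotionEvent(correlation_id, event_type, deltas, reasons)
```

For example, a frustration score jumping from 0.2 to 0.5 between two snapshots would emit an escalation event carrying TEXT_FRUSTRATION_SPIKE.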

Reason Codes

Every emotion event carries reason codes for auditability:

  • TEXT_FRUSTRATION_SPIKE — frustration score jumped by > 0.2
  • VOICE_HIGH_ENERGY — voice energy exceeded agitation threshold
  • TEXT_NEUTRAL — text analysis shows no strong emotion
  • VOICE_PITCH_RISE — pitch variance indicates increasing stress

5. Evidence Pack

The evidence pack bundles all session data for post-hoc review:

Session Metadata
  • session_id
  • duration_ms
  • channel (voice/chat)
  • source adapter

Emotion Drift
  • text emotion timeline
  • voice emotion timeline
  • cross-channel drift curve
  • escalation events

AMD Card
  • verdict (human/machine/beep)
  • confidence score
  • feature values
  • decision reason

Transcript
  • timestamped segments
  • speaker labels
  • language detection
  • confidence scores

Escalation Summary
  • reason codes
  • escalation count
  • intervention points
  • resolution outcome

Session State
  • current state + basis
  • transition history
  • policy actions taken
  • calibration version
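The six sections above can be bundled into a single serializable container. This is a sketch only: the field names mirror the section contents, but the concrete types and the plain-dict JSON serialization are assumptions.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass(frozen=True)
class EvidencePack:
    # Session metadata
    session_id: str
    duration_ms: int
    channel: str                                         # "voice" or "chat"
    source_adapter: str
    # Remaining sections, kept loosely typed in this sketch
    emotion_drift: dict = field(default_factory=dict)    # timelines + drift curve
    amd_card: dict = field(default_factory=dict)         # verdict, confidence, reason
    transcript: list = field(default_factory=list)       # timestamped segments
    escalation_summary: dict = field(default_factory=dict)
    session_state: dict = field(default_factory=dict)    # state, transitions, calibration

    def to_json(self) -> str:
        # Flatten to a plain dict so the pack can be archived for post-hoc review
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps the pack immutable once a session closes, which matches its role as an audit artifact.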

6. Implementation Metrics

  • Frozen contracts in voice_intelligence.py — 10
  • Pure domain modules in voice_intelligence/ — 9
  • Emotion-related kernel tests — 85
  • Shared layer tests — 45
  • Supported ingest adapters — 3 (Twilio, RTP, SIPREC)
  • API endpoints — 12