Arize + Coval for Enterprise Observability

May 22, 2026

Test and evaluate your voice AI agents with automated conversation simulations, production monitoring, and CI/CD integration. Catch failures before your users do.

This guide demonstrates how to use Arize and Coval together to evaluate voice AI applications, combining Arize's deep system-level observability with Coval's conversation-level simulation and evaluation capabilities.

Overview

Arize provides comprehensive observability for voice AI applications, capturing detailed traces of internal system calls, audio processing events, and performance metrics. It lets you dig into the technical implementation and troubleshoot issues at the system level.

Coval pulls traces from Arize and provides conversation-level simulation and evaluation capabilities. With just your API key, Coval can access your Arize traces and enable higher-level testing, simulation, and evaluation of entire voice conversations.

Architecture

Voice AI Application sends detailed traces to Arize
Arize captures system calls, API events, and technical metrics
Coval pulls traces from Arize for conversation-level analysis and simulation

Setting Up Arize for Voice AI Tracing

1. Instrument Your Voice AI Application

First, set up comprehensive tracing in your voice AI application to send detailed system traces to Arize.

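from opentelemetry import trace
from arize.otel import register  # provided by the arize-otel package

# Initialize Arize tracing. register() returns a tracer provider, not a tracer.
tracer_provider = register(
    space_id="your_space_id",
    api_key="your_arize_api_key",
    model_id="your_voice_ai_model",  # newer arize-otel releases may name this project_name
    model_version="v1.0",
)

# Tracer used to create the spans in the sections that follow
tracer = trace.get_tracer(__name__, tracer_provider=tracer_provider)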
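
The snippet above hard-codes credentials for readability. A minimal sketch of loading them from the environment instead is shown here; the variable names `ARIZE_SPACE_ID` and `ARIZE_API_KEY` and the `load_arize_settings` helper are assumptions made for this example, not names required by Arize.

```python
import os


def load_arize_settings(env=None):
    """Read Arize credentials from environment variables.

    The variable names are assumptions for this sketch; use whatever
    naming your deployment already follows.
    """
    env = os.environ if env is None else env
    required = ("ARIZE_SPACE_ID", "ARIZE_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early so a misconfigured service does not silently drop traces
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"space_id": env["ARIZE_SPACE_ID"], "api_key": env["ARIZE_API_KEY"]}
```

The returned dictionary can then be unpacked into the `register(...)` call alongside the model arguments, which keeps keys out of source control.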

2. Key Events for Voice AI Instrumentation

Arize captures detailed system-level events from the OpenAI Realtime API's WebSocket connection:

Session Events

session.created: New session initialization with system parameters
session.updated: Session configuration changes and system state updates

Audio Input Events

input_audio_buffer.speech_started: Speech detection algorithms triggered
input_audio_buffer.speech_stopped: End-of-speech detection completed
input_audio_buffer.committed: Audio buffer processing pipeline initiated

Conversation Events

conversation.item.created: Message processing and context management

Response Events

response.audio_transcript.delta: Real-time transcription processing
response.audio_transcript.done: Transcription pipeline completion
response.done: Complete response generation cycle
response.audio.delta: Audio synthesis and streaming

Error Events

error: System failures, API errors, and processing exceptions
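
One way to keep the instrumentation manageable is to route each incoming event by its type prefix, mirroring the groups listed above. The following is a minimal sketch under that assumption; the category labels and the `categorize_event` helper are illustrative and are not part of the Arize or OpenAI SDKs.

```python
# Prefix of the Realtime event type -> category used to pick a span handler.
# The category labels are assumptions made for this example.
EVENT_CATEGORIES = {
    "session.": "session",
    "input_audio_buffer.": "audio_input",
    "conversation.": "conversation",
    "response.": "response",
}


def categorize_event(event: dict) -> str:
    """Return the category for a Realtime API event dictionary."""
    event_type = event.get("type", "")
    if event_type == "error":
        return "error"
    for prefix, category in EVENT_CATEGORIES.items():
        if event_type.startswith(prefix):
            return category
    # Events outside the list above are reported rather than dropped silently
    return "unhandled"
```

Returning `"unhandled"` for unknown types makes it easy to log events that the tracing code does not yet cover.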

3. Detailed Span Creation for System Observability

import json

def trace_realtime_event(event, audio_buffer=b""):
    """Create a span for a single OpenAI Realtime API event."""
    event_type = event.get("type")

    # Session Management
    if event_type == "session.created":
        session = event["session"]
        with tracer.start_as_current_span("session.lifecycle") as parent_span:
            parent_span.set_attribute("session.id", session["id"])
            parent_span.set_attribute("system.model", session.get("model", "gpt-4o-realtime-preview"))
            parent_span.set_attribute("system.voice", session["voice"])
            parent_span.set_attribute("system.input_audio_format", session["input_audio_format"])

    # Audio Processing Pipeline
    elif event_type == "input_audio_buffer.speech_started":
        with tracer.start_as_current_span("audio.input.processing") as audio_span:
            audio_span.set_attribute("audio.processing.stage", "speech_detection")
            audio_span.set_attribute("audio.buffer.size", len(audio_buffer))

    # Response Generation System
    elif event_type == "response.done":
        resp = event["response"]
        usage = resp.get("usage") or {}
        details = resp.get("status_details")
        attributes = {
            "llm.token_count.prompt": usage.get("input_tokens"),
            "llm.token_count.completion": usage.get("output_tokens"),
            "system.processing_time_ms": resp.get("processing_time"),
            "system.model_temperature": resp.get("temperature", 0.8),
            # status_details is a nested object, so store it as a JSON string
            "metadata.status_details": json.dumps(details) if details is not None else None,
        }
        with tracer.start_as_current_span("response.generation") as response_span:
            # Span attribute values cannot be None, so drop missing fields
            response_span.set_attributes({k: v for k, v in attributes.items() if v is not None})
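
OpenTelemetry span attributes accept only strings, booleans, numbers, and homogeneous sequences of those types, while Realtime API payloads often contain nested objects or null fields. A small helper along these lines can normalize a dictionary before it is passed to `set_attributes`; `clean_attributes` is a hypothetical name introduced for this sketch.

```python
import json

_PRIMITIVES = (str, bool, int, float)


def clean_attributes(raw: dict) -> dict:
    """Make a dictionary safe to pass to span.set_attributes()."""
    cleaned = {}
    for key, value in raw.items():
        if value is None:
            # Missing values are skipped instead of being recorded
            continue
        if isinstance(value, _PRIMITIVES):
            cleaned[key] = value
        elif (
            isinstance(value, (list, tuple))
            and all(isinstance(item, _PRIMITIVES) for item in value)
            and len({type(item) for item in value}) <= 1
        ):
            # Sequences of a single primitive type are kept as lists
            cleaned[key] = list(value)
        else:
            # Nested or mixed structures are stored as a JSON string
            cleaned[key] = json.dumps(value, default=str, sort_keys=True)
    return cleaned
```

Serializing nested values keeps the information searchable as text in the trace without violating the attribute type rules.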

4. Audio File Management and URLs

import os
import time

from google.cloud import storage

def upload_to_gcs(file_path, bucket_name, destination_blob_name, make_public=False):
    """Uploads an audio file to Google Cloud Storage and returns its URL for Arize tracing."""
    try:
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(destination_blob_name)
        blob.upload_from_filename(file_path)
        if make_public:
            blob.make_public()
            return blob.public_url
        # Private objects are referenced by gs:// URI; readers need access to the bucket
        return f"gs://{bucket_name}/{destination_blob_name}"
    except Exception as e:
        raise RuntimeError(f"Failed to upload {file_path} to GCS: {e}") from e

def process_audio_and_upload(pcm16_audio, span):
    """Processes audio, uploads to storage, and adds URL to Arize span."""
    # save_audio_to_wav() and get_audio_duration() are helper functions
    # that must be defined separately.
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    file_name = f"audio_{timestamp}.wav"
    file_path = file_name
    bucket_name = "your-audio-bucket"

    try:
        save_audio_to_wav(pcm16_audio, file_path)
        gcs_url = upload_to_gcs(file_path, bucket_name, f"voice-ai/audio/{file_name}")
        span.set_attribute("input.audio.url", gcs_url)
        span.set_attribute("input.audio.mime_type", "audio/wav")
        span.set_attribute("input.audio.duration_seconds", get_audio_duration(file_path))
    finally:
        # Always remove the temporary local file, even if the upload fails
        if os.path.exists(file_path):
            os.remove(file_path)

    return gcs_url
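
The upload code calls `save_audio_to_wav` and `get_audio_duration` without defining them. One possible implementation using only the standard library is sketched below. It assumes the audio is raw 16-bit mono PCM at 24 kHz, which is the format the Realtime API uses for `pcm16`; adjust the constants if your pipeline resamples.

```python
import wave

SAMPLE_RATE_HZ = 24_000   # assumption: pcm16 audio sampled at 24 kHz
CHANNELS = 1              # assumption: mono audio
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples


def save_audio_to_wav(pcm16_audio: bytes, file_path: str) -> None:
    """Wrap raw PCM16 bytes in a WAV container."""
    with wave.open(file_path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm16_audio)


def get_audio_duration(file_path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(file_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

Reading the duration back from the file header, rather than computing it from the byte count, keeps the value correct even if the sample rate constant changes later.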

Setting Up Coval for Conversation-Level Evaluation

1. Add Your API Keys

Configure Coval to pull traces from your Arize instance by adding your API keys in the Coval dashboard.

2. Conversation-Level Simulation & Evaluation

Once connected, Coval can:

Pull conversation data from Arize traces
Run automated conversation simulations
Evaluate conversation quality metrics
Generate comprehensive performance reports

Arize Deep Dive Capabilities

System-Level Monitoring

Use Arize to analyze:

Technical Performance

API response times and latencies
Audio processing pipeline performance
Token usage and costs
Error rates by system component

Audio Processing Metrics

Speech-to-text accuracy
Audio quality scores
Processing buffer sizes
Compression and encoding efficiency

Model Performance

Response generation times
Context window utilization
Temperature and parameter effects
Function calling success rates
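
Several of these metrics can also be computed offline once span data has been exported. As a rough sketch, the function below derives an error rate per component from a list of span records. The record shape (a `name` and a `status` field per span) is an assumption for illustration and does not reflect a specific Arize export format.

```python
from collections import defaultdict


def error_rate_by_component(spans):
    """Return {span name: fraction of spans with status "ERROR"}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        totals[span["name"]] += 1
        if span.get("status") == "ERROR":
            errors[span["name"]] += 1
    # Every component seen gets an entry, including those with no errors
    return {name: errors[name] / totals[name] for name in totals}
```

Grouping by span name works here because each stage of the pipeline (session, audio input, response generation) is traced under its own name.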

Debugging with Arize Traces

# Example: Investigating slow response times
# In Arize dashboard, filter traces by:
# - response.generation span duration > 2000ms
# - Analyze token counts, model parameters
# - Examine audio processing pipeline bottlenecks
# - Check for API rate limiting or failures
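
The same filter can be applied to exported span data in a script. The sketch below assumes each span is a dictionary with hypothetical `name` and `duration_ms` fields; it is not tied to a particular Arize export schema.

```python
def slow_responses(spans, threshold_ms=2000):
    """Return response.generation spans slower than threshold_ms, slowest first."""
    slow = [
        span
        for span in spans
        if span["name"] == "response.generation" and span["duration_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda span: span["duration_ms"], reverse=True)
```

Sorting slowest first puts the most useful traces to inspect at the top of the list.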

Coval Conversation Analysis

Conversation Metrics

Tool call evaluation
Conversation flow analysis
User satisfaction scoring
Response quality assessment

Arize Prompt and Tool Evaluation

Unit-level testing of prompts
Tool calling accuracy
Context management evaluation

Integration Workflow

Daily Monitoring Workflow

System Monitoring in Arize

Monitor technical performance metrics
Track error rates and system health
Analyze API usage and costs
Debug technical issues in real-time

Conversation Analysis in Coval

Pull daily conversation data from Arize
Evaluate conversation quality metrics
Run automated conversation simulations
Generate conversation performance reports

Combined Insights

Correlate system performance with conversation quality
Identify technical issues affecting user experience
Optimize both system parameters and conversation flows

Continuous Improvement Process

# Weekly improvement cycle
# Illustrative pseudocode: `arize` and `coval` stand in for client objects, and the
# method names are placeholders. Check each SDK for the actual calls.
def weekly_analysis(arize, coval):
    # Pull system traces from Arize
    system_metrics = arize.get_system_metrics(period="week")

    # Pull conversation data to Coval
    conversations = coval.pull_conversations(period="week")

    # Analyze correlations
    correlation_analysis = coval.analyze_system_conversation_correlation(
        system_metrics=system_metrics,
        conversations=conversations,
    )

    # Generate improvement recommendations
    recommendations = coval.generate_recommendations(
        analysis_results=correlation_analysis,
        improvement_areas=["latency", "accuracy", "user_satisfaction"],
    )

    return recommendations
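
The correlation step itself does not depend on either SDK. As a minimal sketch, the function below computes the Pearson correlation between per-conversation latency and a quality score. The `latency_ms` and `quality_score` field names are assumptions for this example, as is the idea that both values have already been joined per conversation.

```python
import math


def latency_quality_correlation(records):
    """Pearson correlation between latency_ms and quality_score."""
    xs = [record["latency_ms"] for record in records]
    ys = [record["quality_score"] for record in records]
    if len(xs) < 2:
        raise ValueError("need at least two conversations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one series never changes
        raise ValueError("latency or quality is constant")
    return covariance / math.sqrt(var_x * var_y)
```

A value near -1 would suggest that slower responses coincide with lower conversation quality, which is a signal to prioritize latency work. A value near 0 suggests the quality problems lie elsewhere.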

Best Practices

Arize Configuration

Instrument all critical system events
Include comprehensive span attributes
Store audio files in accessible cloud storage
Set up alerting for system anomalies
Use proper error handling and logging

Coval Usage

Pull conversations regularly for fresh data
Define clear evaluation criteria
Use representative conversation samples
Set up automated evaluation pipelines
Compare performance across time periods

Data Management

Maintain consistent audio file naming conventions
Implement proper access controls for sensitive conversations
Archive old conversation data appropriately
Ensure GDPR/privacy compliance for voice data
Back up evaluation results regularly

Troubleshooting

Common Integration Issues

Coval Cannot Pull Traces from Arize

Verify API key permissions and space access
Check that traces exist in the specified time range
Ensure model IDs match between Arize and Coval
Validate network connectivity and firewall settings

Missing Conversation Data

Confirm that conversation spans are properly structured in Arize
Check that audio URLs are accessible to Coval
Verify conversation identification logic
Review trace aggregation settings
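
A quick way to check conversation identification is to group exported spans by the `session.id` attribute set during instrumentation and look for spans that lack it. The sketch below assumes each span is a dictionary with hypothetical `attributes` and `start_time` fields; the actual export format may differ.

```python
from collections import defaultdict


def group_spans_by_session(spans):
    """Group spans by session.id and collect spans that have none."""
    conversations = defaultdict(list)
    orphans = []
    for span in spans:
        session_id = span.get("attributes", {}).get("session.id")
        if session_id is None:
            # Spans without a session ID cannot be tied to a conversation
            orphans.append(span)
        else:
            conversations[session_id].append(span)
    for group in conversations.values():
        group.sort(key=lambda span: span["start_time"])
    return dict(conversations), orphans
```

A long list of orphans usually means the session ID is set only on the session span and is not propagated to the audio and response spans.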

Evaluation Failures

Validate conversation data format and completeness
Check evaluation template syntax and criteria
Ensure sufficient conversation samples for analysis
Monitor API rate limits for evaluation models

Conclusion

The combination of Arize and Coval provides a complete voice AI evaluation solution:

Arize gives you deep technical observability into your voice AI system's internal operations, allowing you to monitor performance, debug issues, and optimize system-level components
Coval leverages this detailed trace data to provide conversation-level insights, simulation capabilities, and comprehensive evaluation of user experiences

This two-tier approach ensures you can maintain both technical excellence and conversation quality in your voice AI applications. Start by implementing comprehensive Arize tracing, then use Coval to pull this data for higher-level conversation analysis and optimization.