How to Build an On-Prem Speech AI Pipeline with NVIDIA Riva

Introduction

Real-time speech applications rarely operate in ideal acoustic environments. In enterprise scenarios—sales calls, support sessions, and hybrid meetings—audio often includes echo from remote participants, background noise from laptops, and overlapping speech. These acoustic artifacts degrade speech recognition accuracy and increase downstream processing complexity.

In previous posts, AMAX engineers explored how NVIDIA® Riva enables real-time, on-prem speech recognition for enterprise workloads. As that work progressed into real customer-like scenarios, one challenge became increasingly clear: audio quality matters as much as model quality.

This is where NVIDIA Maxine™ SDK plays a critical role.

Maxine provides GPU-accelerated audio enhancement capabilities such as Acoustic Echo Cancellation (AEC) and Background Noise Removal (BNR). In this post, AMAX engineers share how we integrated Maxine SDK into our existing Riva-based voice AI pipeline, the architectural decisions involved, and what we learned while deploying it in real-world environments.

Why Audio Enhancement Matters in Enterprise Voice AI

Most speech recognition models assume reasonably clean input audio. In practice, enterprise voice workflows introduce challenges such as:

Far-end audio leaking into the microphone path
Laptop speakers causing echo loops
HVAC, keyboard, and ambient noise
Variability across devices and operating systems

During early testing, AMAX engineers observed that these issues could cause:

Increased transcription fragmentation
Delayed stabilization of interim results
Reduced accuracy on domain-specific terminology

Rather than attempting to compensate for these issues purely at the ASR layer, we explored audio pre-processing using NVIDIA Maxine SDK.

Overview of NVIDIA Maxine SDK

NVIDIA Maxine SDK is a collection of GPU-accelerated audio and video enhancement libraries designed for real-time communication applications. For voice AI workloads, two capabilities are particularly relevant:

Acoustic Echo Cancellation (AEC) – removes far-end audio from near-end microphone signals
Background Noise Removal (BNR) – suppresses ambient noise while preserving speech quality

Maxine SDK is delivered as a C++ SDK and runs on both Windows and Linux systems. Unlike cloud-based audio processing services, Maxine operates entirely on-prem and can be integrated directly into existing real-time pipelines.

Architecture: Integrating Maxine into a Riva-Based Voice AI Pipeline

AMAX engineers implemented Maxine as part of a proxy-style architecture that sits between audio capture clients and the Riva ASR service.

High-level flow:

1. The client captures two streams: near-end microphone audio and far-end loopback/system audio.
2. Audio streams are paired and time-aligned.
3. Maxine AEC removes echo artifacts.
4. Optional Maxine BNR reduces background noise.
5. Cleaned audio is forwarded to Riva ASR.
6. Transcripts are returned to the client.

This design allowed us to enhance audio quality without modifying Riva itself, keeping responsibilities cleanly separated.
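The flow above can be sketched as a small chain of stages. This is only an illustration of the separation of responsibilities: `EnhancementProxy` and the `aec`, `bnr`, and `asr` callables are hypothetical stand-ins, not Maxine or Riva APIs, and the real proxy streams audio rather than handling one frame per call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures; the real Maxine and Riva calls are not shown.
AecFn = Callable[[bytes, bytes], bytes]  # (near-end, far-end) -> echo-reduced near-end
BnrFn = Callable[[bytes], bytes]         # noisy audio -> denoised audio
AsrFn = Callable[[bytes], str]           # audio -> transcript text


@dataclass
class EnhancementProxy:
    """Sits between the capture client and the ASR service."""
    aec: AecFn
    asr: AsrFn
    bnr: Optional[BnrFn] = None  # BNR is optional in the flow above

    def process(self, mic: bytes, loopback: bytes) -> str:
        # AEC needs paired frames of equal length.
        if len(mic) != len(loopback):
            raise ValueError("mic and loopback frames must be the same length")
        cleaned = self.aec(mic, loopback)
        if self.bnr is not None:
            cleaned = self.bnr(cleaned)
        # The ASR stage only ever sees cleaned audio and needs no changes.
        return self.asr(cleaned)
```

Because each stage is injected, enhancement can be swapped or disabled without touching the ASR side, which mirrors the design goal stated above.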

Key Implementation Considerations

Time Alignment Is Critical for AEC

One of the most important lessons learned was that AEC depends on precise alignment between microphone and loopback audio. Even small timing drift reduces cancellation effectiveness.

To address this, we implemented a Metronome-based pairing mechanism in the Python client. Rather than streaming audio directly from device callbacks, both streams are buffered and released at fixed intervals, ensuring each AEC frame contains synchronized near-end and far-end audio.

This design significantly improved echo cancellation quality across different devices and operating systems.
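To get a feel for how quickly drift accumulates, the arithmetic can be worked through in a few lines. The 100 ppm clock mismatch used here is an illustrative assumption, not a figure measured in our tests, and the helper names are hypothetical.

```python
def drift_ms(ppm: float, seconds: float) -> float:
    """Offset in milliseconds between two clocks that differ by `ppm` parts per million."""
    return ppm * 1e-6 * seconds * 1000.0


def drift_samples(ppm: float, seconds: float, rate: int = 16000) -> float:
    """The same offset expressed in samples at the given sample rate."""
    return ppm * 1e-6 * seconds * rate


# Assumed example: two devices nominally at 16 kHz whose clocks differ by 100 ppm.
one_minute_ms = drift_ms(100, 60)            # about 6 ms after one minute
one_minute_samples = drift_samples(100, 60)  # about 96 samples at 16 kHz
```

Under that assumption, the two streams are more than half a 10 ms frame apart after one minute, which is why the streams need to be re-paired continuously rather than aligned once at startup.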

Metronome-Based Audio Pairing (Python Client)

The pairing mechanism described above works as follows.

Device callbacks for the microphone and loopback streams may fire at different times, so the client does not stream audio directly from them. Instead, it buffers both streams and releases paired frames on a fixed clock, so that each frame handed to Maxine AEC contains near-end and far-end audio on a consistent time base.

Key idea

Audio callbacks → ring buffers
Metronome thread → fixed cadence (e.g., 10–20 ms)
Each tick → exactly one paired frame
Missing samples → padded with silence (the simplified listing below omits this step and skips the tick instead)
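The silence-padding step can be sketched as follows. This is a minimal illustration rather than the client's actual code: `pad_with_silence` and `take_padded` are hypothetical helper names, and the sketch assumes PCM16 audio, where silence is a run of zero bytes.

```python
def pad_with_silence(chunk: bytes, n_bytes: int) -> bytes:
    """Return exactly n_bytes: truncate long chunks, zero-pad short ones."""
    if len(chunk) >= n_bytes:
        return chunk[:n_bytes]
    # In PCM16, a zero-valued sample is silence.
    return chunk + b"\x00" * (n_bytes - len(chunk))


def take_padded(buf: bytearray, n_bytes: int) -> bytes:
    """Remove up to n_bytes from the front of buf and pad the result to n_bytes."""
    chunk = bytes(buf[:n_bytes])
    del buf[:n_bytes]
    return pad_with_silence(chunk, n_bytes)
```

Padding keeps the output cadence constant when one device delivers late, at the cost of briefly feeding AEC a partly silent reference frame.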

Simplified implementation

import queue
import threading
import time

RATE   = 16000          # 16 kHz sample rate used for AEC and Riva in this pipeline
FRAME  = 160            # 10 ms @ 16 kHz
BYTES  = FRAME * 2      # PCM16 mono bytes per frame

class SyncedAudioBuffer:
    """Thread-safe dual buffer for mic + loopback audio."""
    def __init__(self, rate=RATE):
        self.rate = rate
        self.mic  = bytearray()
        self.loop = bytearray()
        self.mic_lock  = threading.Lock()
        self.loop_lock = threading.Lock()

    def add_mic(self, data: bytes):
        with self.mic_lock:
            self.mic.extend(data)

    def add_loop(self, data: bytes):
        with self.loop_lock:
            self.loop.extend(data)

    def take_pair(self, n_bytes: int):
        """Only succeed when BOTH buffers have at least n_bytes."""
        with self.mic_lock:
            if len(self.mic) < n_bytes:
                return None, None, False
        with self.loop_lock:
            if len(self.loop) < n_bytes:
                return None, None, False

        with self.mic_lock:
            mic_chunk = bytes(self.mic[:n_bytes])
            del self.mic[:n_bytes]
        with self.loop_lock:
            loop_chunk = bytes(self.loop[:n_bytes])
            del self.loop[:n_bytes]
        return mic_chunk, loop_chunk, True

    def resync_if_needed(self, max_diff_ms: float = 100.0):
        """Trim the longer buffer to match the shorter one (avoid drift buildup)."""
        with self.mic_lock:
            mic_ms = (len(self.mic) / (self.rate * 2)) * 1000
        with self.loop_lock:
            loop_ms = (len(self.loop) / (self.rate * 2)) * 1000
        if abs(mic_ms - loop_ms) <= max_diff_ms:
            return False

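        with self.mic_lock, self.loop_lock:
            target = min(len(self.mic), len(self.loop))
            # Keep only the newest `target` bytes of each buffer. Delete from the
            # front by explicit index: when target is 0, `buf[-target:]` would
            # keep the whole buffer instead of emptying it.
            del self.mic[:len(self.mic) - target]
            del self.loop[:len(self.loop) - target]
        return True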


def metronome_loop(audio_buf: SyncedAudioBuffer, out_q: queue.Queue, stop: threading.Event):
    """Emit at most one synchronized (mic, loopback) pair per 10 ms tick.

    A tick is skipped when either buffer holds less than one full frame.
    """
    seq = 0
    base_ms = int(time.time() * 1000)
    period_s = FRAME / RATE  # 10 ms
    next_t = time.perf_counter()
    last_sync_check = time.time()

    while not stop.is_set():
        next_t += period_s
        ts_ms = base_ms + seq * 10

        # Periodically resync buffers to prevent drift
        now_wall = time.time()
        if now_wall - last_sync_check > 1.0:
            audio_buf.resync_if_needed(max_diff_ms=100.0)
            last_sync_check = now_wall

        mic_chunk, loop_chunk, ok = audio_buf.take_pair(BYTES)
        if ok:
            out_q.put_nowait({
                "mic_chunk": mic_chunk,
                "loopback_chunk": loop_chunk,
                "seq": seq,
                "sample_rate": RATE,
                "num_samples": FRAME,
                "timestamp_ms": ts_ms,
            })
            seq += 1

        # Keep a stable cadence
        now = time.perf_counter()
        if next_t > now:
            time.sleep(next_t - now)
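The loop above advances `next_t` by a fixed period and sleeps until that deadline, rather than sleeping for a fixed duration after each tick. The difference can be seen in a small simulation that uses a fake clock instead of real sleeps. The `simulate` function and the 1 ms of per-tick work are illustrative assumptions.

```python
def simulate(period_s: float, work_s: float, ticks: int, absolute_deadlines: bool) -> float:
    """Return the simulated clock time after `ticks` iterations."""
    now = 0.0     # fake clock, in seconds
    next_t = 0.0  # next absolute deadline (used only in deadline mode)
    for _ in range(ticks):
        now += work_s  # time spent pairing and enqueueing a frame
        if absolute_deadlines:
            # Deadline mode: sleep only until the next fixed deadline.
            next_t += period_s
            if next_t > now:
                now = next_t
        else:
            # Naive mode: sleep a full period after the work, so the
            # work time is added to every tick and the loop runs late.
            now += period_s
    return now


naive = simulate(0.010, 0.001, 1000, absolute_deadlines=False)   # about 11 s
steady = simulate(0.010, 0.001, 1000, absolute_deadlines=True)   # about 10 s
```

With these assumed numbers, the naive loop is about one second behind after 1000 ticks, while the deadline-based loop stays on the 10 ms grid as long as the work fits inside one period.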

Why this matters for Maxine AEC

Acoustic Echo Cancellation assumes that near-end and far-end audio frames represent the same moment in time. Even small timing drift between streams can significantly reduce cancellation effectiveness.

By enforcing a single, stable metronome clock, this approach:

keeps microphone and loopback audio aligned,
prevents drift caused by OS scheduling or driver behavior,
improves echo suppression consistency across devices,
and produces cleaner input for downstream Riva ASR.

This mechanism proved essential when running Maxine AEC in real-world enterprise environments, particularly on Windows systems where audio devices operate asynchronously.

Running Maxine AEC in a Real-Time Loop

A simplified Maxine AEC initialization flow:

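// Create an AEC effect, point it at the model file, set the sample rate, then load it.
// Each call returns an NvAFX_Status; production code should check it before continuing.
NvAFX_Handle handle;
NvAFX_CreateEffect(NVAFX_EFFECT_AEC, &handle);
NvAFX_SetString(handle, NVAFX_PARAM_MODEL_PATH, model_path.c_str());
NvAFX_SetU32(handle, NVAFX_PARAM_INPUT_SAMPLE_RATE, 16000);
NvAFX_Load(handle);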

Each audio frame is converted to float format, processed by Maxine, then converted back to PCM before streaming to Riva.
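The format conversion on either side of the Maxine call can be sketched in Python using only the standard library. This is an illustration, not the daemon's code: it assumes little-endian PCM16 input and floats normalized to the range [-1.0, 1.0), which should be confirmed against the Maxine SDK documentation, and the function names are hypothetical.

```python
import sys
from array import array
from typing import Iterable, List


def pcm16_to_float(data: bytes) -> List[float]:
    """Convert little-endian PCM16 bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(data)  # raises ValueError if len(data) is odd
    if sys.byteorder == "big":
        samples.byteswap()   # array uses native byte order
    return [s / 32768.0 for s in samples]


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert floats back to little-endian PCM16, clipping out-of-range values."""
    out = array("h", (max(-32768, min(32767, round(x * 32768.0))) for x in samples))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

Clipping on the way back matters because an enhancement stage can produce values slightly outside the nominal range, and an unclipped value would overflow the 16-bit sample.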

Observed Impact on Transcription Quality

While Maxine introduces a small fixed processing overhead, AMAX engineers observed several practical benefits:

Reduced echo-induced misrecognitions
More stable interim transcripts
Fewer false utterance boundaries
Improved readability for downstream processing such as translation and RAG

In hybrid meeting scenarios, the combination of AEC + BNR + Riva produced noticeably cleaner transcripts compared to ASR alone.

In Part 2, we will examine how to integrate NVIDIA Maxine SDK on Linux with our C++ daemon that operates as the middleware between our Riva Desktop App and the Riva NIM.

Why This Matters for Enterprise Voice AI

By integrating Maxine SDK into the voice AI pipeline, AMAX demonstrated how audio enhancement and speech recognition work best as complementary layers rather than isolated components.

For enterprises deploying on-prem voice AI solutions, this architecture provides:

Better transcription quality in real-world environments
Fully private, GPU-accelerated processing
A modular design that can evolve with new models and features

AMAX's Role

AMAX engineers design and validate complete AI pipelines that combine GPU infrastructure, NVIDIA AI Enterprise software, and real application workflows. Our work with Maxine SDK builds on our existing Riva-based voice AI efforts, creating a reference architecture that addresses both audio quality and speech intelligence.

By deploying and testing these solutions internally, AMAX helps customers move from proof-of-concept voice AI to production-ready, enterprise-grade systems.


Contact AMAX to explore on-prem voice AI solutions built with NVIDIA Maxine SDK.

Enhancing Real-Time Voice AI with NVIDIA Maxine SDK: Acoustic Echo Cancellation and Noise Reduction in Practice