Building an AEC daemon on Linux with NVIDIA Maxine SDK and C++

In Part 1 of this series, AMAX engineers focused on the client-side foundations of real-time voice AI: capturing audio from heterogeneous devices, synchronizing microphone and loopback streams, and enforcing deterministic timing using a metronome-based mechanism. That work ensures every audio frame arriving at the server represents a precise moment in time.

Part 2 turns to the server-side counterpart: a Linux-based C++ daemon built with NVIDIA® Maxine™ SDK. This daemon performs Acoustic Echo Cancellation (AEC) on synchronized audio streams, using the NVIDIA-provided AEC model optimized for the specific GPU in use, before forwarding audio to NVIDIA Riva for real-time speech recognition.

Together, the client and daemon form a production-ready audio enhancement layer designed for real enterprise voice environments.

Why a Dedicated Audio Enhancement Daemon

As AMAX engineers moved from proof-of-concept voice AI demos to realistic enterprise scenarios—sales calls, customer support sessions, and hybrid meetings—it became clear that audio enhancement needed to be:

deterministic and frame-aligned,
reusable across multiple clients,
isolated from ASR logic, and
deployable on stable, GPU-enabled Linux systems.

NVIDIA Maxine SDK is delivered as a C++ library and optimized for low-latency, GPU-accelerated audio processing. Rather than embedding Maxine directly into each client or modifying Riva itself, AMAX implemented a standalone Linux daemon that acts as an audio intelligence layer between capture and recognition.

This separation keeps responsibilities clean: clients handle capture and synchronization, the daemon handles enhancement, and Riva focuses solely on speech recognition.

High-Level Architecture

At a high level, the pipeline looks like this:

Client (Windows, Python)
  ├─ Microphone audio
  ├─ Loopback (system) audio
  └─ gRPC stream (paired 10 ms frames)
        ↓
Maxine AEC Daemon (Linux, C++)
  ├─ Acoustic Echo Cancellation
  └─ Clean + raw audio streams
        ↓
NVIDIA Riva ASR
  └─ Streaming transcription

Each gRPC message contains paired microphone and loopback PCM frames, already aligned by the metronome mechanism described in Part 1. This allows the daemon to run AEC on each pair as it arrives, without additional buffering or timing correction on the AEC path.
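
Because the daemon trusts the client's pairing, a cheap size check at ingest is a reasonable guard against malformed messages. The sketch below is illustrative only: the constant names and isValidPair are hypothetical, and it assumes each payload arrives as raw PCM16 bytes (protobuf bytes fields map to std::string in C++) at the 16 kHz rate used throughout this pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Frame geometry for one 10 ms PCM16 frame at 16 kHz.
constexpr std::size_t kSampleRateHz = 16000;
constexpr std::size_t kFrameMs      = 10;
constexpr std::size_t kFrameSamples = kSampleRateHz * kFrameMs / 1000;      // 160
constexpr std::size_t kFrameBytes   = kFrameSamples * sizeof(std::int16_t); // 320

// Hypothetical ingest check: both payloads must hold exactly one frame.
inline bool isValidPair(const std::string& mic_bytes,
                        const std::string& loop_bytes) {
  return mic_bytes.size() == kFrameBytes && loop_bytes.size() == kFrameBytes;
}
```

A pair that fails this check could be dropped or logged before it reaches the AEC effect; which policy fits best depends on the deployment.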

Core Responsibilities of the Daemon

The Maxine-based daemon is intentionally narrow in scope. Its responsibilities include:

hosting a gRPC service for real-time audio ingestion,
managing the lifecycle of the Maxine AEC effect,
running AEC on fixed-size audio frames,
converting between PCM16 and floating-point formats,
streaming audio to Riva ASR,
relaying transcription results back to the client, and
supporting logging and debug capture.

By keeping the daemon focused, AMAX engineers made the system easier to tune, observe, and extend.

Integrating NVIDIA Maxine SDK

Initialization and Model Loading

On startup, the daemon initializes the Maxine AEC effect and loads the required model files. A simplified initialization sequence looks like this:

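NvAFX_Handle aec = nullptr;

// Create the AEC effect and point it at the model file for this GPU.
// Each call returns an NvAFX_Status; checks are omitted here for brevity.
NvAFX_CreateEffect(NVAFX_EFFECT_AEC, &aec);
NvAFX_SetString(aec, NVAFX_PARAM_MODEL_PATH, model_path.c_str());
NvAFX_SetU32(aec, NVAFX_PARAM_INPUT_SAMPLE_RATE, 16000);
NvAFX_SetU32(aec, NVAFX_PARAM_NUM_STREAMS, 1);

// Standardize on 10 ms frames (160 samples @ 16 kHz)
NvAFX_SetU32(aec, NVAFX_PARAM_NUM_SAMPLES_PER_INPUT_FRAME, 160);

// Load the model once all parameters are set.
NvAFX_Load(aec);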

In this pipeline, both Maxine and Riva are configured for 16 kHz audio, which simplifies the server side. Sample rate probing and resampling are handled earlier on the client, allowing the daemon to assume a fixed format.

Frame-Based Processing: 10 ms for AEC, 200 ms for ASR

One key design choice was to separate the granularity required by AEC from the granularity preferred by ASR.

AEC processing operates on 10 ms frames (160 samples), which provides stable echo cancellation and predictable latency.
Riva streaming ASR benefits from larger audio chunks to reduce gRPC overhead.

In practice, the daemon runs AEC on every 10 ms frame, then accumulates cleaned audio into ~200 ms chunks (20 frames, or 3200 samples at 16 kHz) before sending them to Riva. This balances audio quality with streaming efficiency.

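constexpr uint32_t RATE = 16000;              // samples per second
constexpr uint32_t AEC_FRAME_SAMPLES = 160;   // 10 ms
constexpr size_t   TARGET_CHUNK_MS   = 200;   // for Riva
// 16 samples per millisecond at 16 kHz, so 200 ms = 3200 samples (20 AEC frames)
constexpr size_t   TARGET_SAMPLES    = TARGET_CHUNK_MS * (RATE / 1000);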
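
To make the accumulation step concrete, here is a minimal sketch of a buffer that collects 10 ms frames and releases a chunk once the target size is reached. ChunkAccumulator is a hypothetical name, not a class from the daemon, and the real send path may well be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper: collects fixed-size AEC frames and releases one
// chunk as soon as chunk_samples samples have been buffered.
class ChunkAccumulator {
 public:
  explicit ChunkAccumulator(std::size_t chunk_samples)
      : chunk_samples_(chunk_samples) {
    assert(chunk_samples_ > 0);
    buffer_.reserve(chunk_samples_);
  }

  // Appends one frame. Returns a full chunk when the target size is reached,
  // otherwise std::nullopt. Samples beyond the target stay buffered.
  std::optional<std::vector<float>> push(const std::vector<float>& frame) {
    buffer_.insert(buffer_.end(), frame.begin(), frame.end());
    if (buffer_.size() < chunk_samples_) return std::nullopt;

    const auto split =
        buffer_.begin() + static_cast<std::ptrdiff_t>(chunk_samples_);
    std::vector<float> chunk(buffer_.begin(), split);
    buffer_.erase(buffer_.begin(), split);
    return chunk;
  }

  // Number of samples currently waiting for the next chunk.
  std::size_t buffered() const { return buffer_.size(); }

 private:
  std::size_t chunk_samples_;
  std::vector<float> buffer_;
};
```

With 160-sample frames and a 3200-sample target, every 20th call to push returns a chunk, so Riva receives one write per 200 ms of audio instead of one per 10 ms.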

Low-Latency Multi-Threading Design

Real-time audio systems must process audio continuously while also handling network I/O and ASR responses. If these responsibilities are tightly coupled, transient delays—such as ASR response bursts or network jitter—can stall the audio path and degrade echo cancellation quality.

To avoid this, the Linux daemon uses a multi-threaded design that decouples audio processing from ASR communication.

Thread Responsibilities

Main gRPC ingest loop
Continuously reads paired 10 ms audio frames from the client, runs Maxine AEC, and feeds audio into RivaClient buffers. This loop never blocks on ASR I/O.
Riva send thread (per stream)
Accumulates small AEC frames into ~200 ms chunks and writes them to Riva for efficient streaming.
Riva response thread (per stream)
Reads streaming ASR responses asynchronously and stores them in a thread-safe queue.
Server consumer thread
Polls ASR response queues and publishes transcripts back to the client over the same gRPC stream.

This structure allows AEC to run on a stable cadence while ASR communication happens in parallel.
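
The hand-off between the response thread and the consumer thread relies on a thread-safe queue. A minimal sketch of such a queue follows. ResponseQueue and AsrResponse are hypothetical names, and the tuple layout (transcript text, final flag, confidence) is an assumption inferred from how responses are unpacked in the consumer snippet later in this article.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Assumed layout: (transcript text, is_final flag, confidence).
using AsrResponse = std::tuple<std::string, bool, float>;

// Hypothetical queue: one thread pushes responses, another drains them.
class ResponseQueue {
 public:
  // Called by the response thread for each ASR result it reads.
  void push(AsrResponse resp) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.push_back(std::move(resp));
  }

  // Called by the consumer thread. Returns everything queued so far and
  // leaves the queue empty; the lock is held only for the swap.
  std::vector<AsrResponse> consumeAll() {
    std::vector<AsrResponse> out;
    std::lock_guard<std::mutex> lock(mu_);
    out.swap(pending_);
    return out;
  }

 private:
  std::mutex mu_;
  std::vector<AsrResponse> pending_;
};
```

Draining the whole queue in one swap keeps the critical section short, so a burst of ASR responses does not hold the lock while results are being written back to the client.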

Key Implementation Snippets

RivaClient thread creation:

active_ = true;
resp_thread_ = std::thread([this]{ this->readLoop(); });
send_thread_ = std::thread([this]{ this->sendLoop(); });

Server-side consumer thread publishing transcripts:

std::thread consumer_thread([&]() {
  while (consume_active) {
    // Drain responses for the AEC-cleaned microphone stream.
    auto mic_resps = riva_mic.consumeAllResponses();
    for (const auto& resp : mic_resps) {
      ProcessingResult r;
      r.set_status("AEC_MIC");
      r.set_transcription(std::get<0>(resp));
      r.set_transcript_is_final(std::get<1>(resp));
      r.set_confidence(std::get<2>(resp));
      stream->Write(r);
    }

    // Drain responses for the raw loopback stream.
    auto loop_resps = riva_loop.consumeAllResponses();
    for (const auto& resp : loop_resps) {
      ProcessingResult r;
      r.set_status("AEC_LOOPBACK");
      r.set_transcription(std::get<0>(resp));
      r.set_transcript_is_final(std::get<1>(resp));
      r.set_confidence(std::get<2>(resp));
      stream->Write(r);
    }

    // Sleep briefly between polls to avoid busy-waiting.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
});

By separating these responsibilities, the daemon maintains low and predictable latency even under sustained load.

Running AEC on Paired Audio Frames

For each incoming frame, the daemon converts PCM16 audio to floating-point format, runs Maxine AEC, and converts the output back to PCM16.

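// Convert both PCM16 frames to float; ns is the number of samples per frame.
pcm16_to_float(mic_pcm,  ns, mic_f);
pcm16_to_float(loop_pcm, ns, loop_f);

std::vector<float> cleaned(ns);
// Microphone (near-end) first, loopback (far-end reference) second.
const float* inputs[2] = { mic_f.data(), loop_f.data() };
float* out = cleaned.data();

{
  std::lock_guard<std::mutex> lock(aec_mu);
  auto st = NvAFX_Run(aec, inputs, &out, ns, 2);
  if (st != NVAFX_STATUS_SUCCESS) {
    cleaned = mic_f;  // fallback: pass the unprocessed microphone frame through
  }
}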

Although frames are processed sequentially, the mutex clearly communicates thread-safety intent and allows the design to evolve.
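
The conversion helpers are not shown in this article. A minimal sketch of what they might look like follows. The pcm16_to_float signature mirrors the call sites above, assuming mic_pcm points at int16_t samples. The body, the 1/32768 scaling to roughly [-1.0, 1.0), and the float_to_pcm16 counterpart are assumptions rather than the daemon's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts ns signed 16-bit samples to floats in [-1.0, 1.0).
inline void pcm16_to_float(const std::int16_t* pcm, std::size_t ns,
                           std::vector<float>& out) {
  out.resize(ns);
  for (std::size_t i = 0; i < ns; ++i) {
    out[i] = static_cast<float>(pcm[i]) / 32768.0f;
  }
}

// Converts floats back to PCM16. Values are clamped so that any overshoot
// saturates at the int16_t limits instead of wrapping around.
inline std::vector<std::int16_t> float_to_pcm16(const std::vector<float>& in) {
  std::vector<std::int16_t> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const float scaled = std::clamp(in[i] * 32768.0f, -32768.0f, 32767.0f);
    out[i] = static_cast<std::int16_t>(std::lround(scaled));
  }
  return out;
}
```

Clamping on the way back matters because processed audio can slightly exceed the nominal range, and an unclamped cast would turn that overshoot into a loud wrap-around click.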

Dual Riva Streams: LOCAL and REMOTE

The daemon maintains two concurrent Riva streaming sessions per client:

one for the AEC-cleaned microphone path (LOCAL), and
one for the raw loopback path (REMOTE).

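// One Riva session per path, created once per client stream.
RivaClient riva_mic (riva_addr, "MIC");
RivaClient riva_loop(riva_addr, "LOOP");

// Per frame: cleaned microphone audio to LOCAL, raw loopback to REMOTE.
riva_mic.addAudio(cleaned);
riva_loop.addAudio(loop_f);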

This approach allows downstream applications to distinguish between local and remote speakers—even when diarization is disabled—while still benefiting from echo cancellation on the microphone path.

Streaming to NVIDIA Riva

Once audio is prepared, it is streamed to Riva using Riva’s standard streaming ASR interface. No changes are required to Riva models or configurations.

Maxine operates as a transparent enhancement layer, improving audio quality without altering ASR behavior.

Operational Observations

In real-world testing, AMAX engineers observed that introducing Maxine AEC ahead of Riva led to:

fewer echo-induced misrecognitions,
more stable interim transcripts,
reduced fragmentation of final utterances, and
cleaner input for downstream NLP and RAG workflows.

While AEC introduces a small, fixed processing overhead, the overall system behaved more reliably under realistic acoustic conditions.

How Part 2 Complements the Client-Side Design

Part 1 and Part 2 are intentionally symmetrical:

Client (Part 1) enforces timing, pairing, and format correctness.
Daemon (Part 2) assumes deterministic frames and focuses on audio enhancement and orchestration.

This separation allows each side to remain simple and robust while forming a cohesive end-to-end pipeline.

AMAX's Role

AMAX engineers design and deploy end-to-end enterprise AI pipelines built on NVIDIA AI Enterprise software. Our work with Maxine SDK extends our Riva-based voice AI efforts by addressing a critical but often overlooked layer: audio quality.

By building and operating this Linux-based Maxine daemon, AMAX provides a practical reference architecture for integrating real-time audio enhancement into private, GPU-accelerated voice AI systems.