Identifying speaker in onAudio Metadata no longer works

Help!
I have a Zoom app working with @zoom/rtms 0.0.2 and just tried to upgrade to @zoom/rtms 1.0.1.
I had to add duration and frameSize to my audio params:

  const audioParams: AudioParams = {
    channel: rtms.AudioChannel.MONO,
    sampleRate: rtms.AudioSampleRate.SR_16K,
    codec: rtms.AudioCodec.L16,
    contentType: rtms.AudioContentType.RAW_AUDIO,
    dataOpt: rtms.AudioDataOption.AUDIO_MULTI_STREAMS,
    duration: 20,
    frameSize: 320, // 16000 * 0.02 * 1
  };
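For reference, my understanding of the frameSize math: it's samples per frame, not bytes. A quick sketch of the arithmetic (the helper function is just my illustration, not an SDK API):

```typescript
// Samples per frame = sampleRate (Hz) * duration (ms) / 1000 * channels.
function frameSizeFor(sampleRateHz: number, durationMs: number, channels: number): number {
  return (sampleRateHz * durationMs * channels) / 1000;
}

const samples = frameSizeFor(16000, 20, 1); // 320 samples per 20 ms frame
const bytes = samples * 2;                  // L16 is 16-bit PCM, so 2 bytes/sample = 640 bytes
```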

And everything seemed to work until I received data in my onAudio ClientEventHandler: it never makes it past the check comparing the selected participantId to metadata.userId:

import type { Metadata } from '@zoom/rtms';
import type { ServerContext } from '../../context.js';
import { type ClientEventHandler, RecordingState } from '../../services/client-registry/types.js';

export const onAudio: ClientEventHandler = async (
  context: ServerContext,
  meetingUuid: string,
  buffer: Buffer,
  _size: number,
  _timestamp: number,
  metadata: Metadata
) => {
  try {
    const selectedParticipantId = context.services.clients.getData(meetingUuid)?.participant?.participantId;

    if (
      metadata.userId == null ||
      !meetingUuid ||
      metadata.userId === -1 ||
      (selectedParticipantId != null && selectedParticipantId !== metadata.userId.toString())
    ) {
      return;
    }

    const state = context.services.clients.getState(meetingUuid);

    if (state === RecordingState.Recording_Start) {
      if (context.services.audio.hasReachRecordingMaxTime(context, meetingUuid)) {
        context.services.clients.setState(meetingUuid, RecordingState.Recording_Stop);
        context.log.info(`Max recording time reached for meeting ${meetingUuid}, stopping RTMS.`);

        // BUG: Will this cause any issues? What is the flow here?
        context.services.clients.pause(meetingUuid, 'audio');
        await context.services.audio.processStream(context, meetingUuid);
        return;
      }

      context.services.audio.pushBytes(context, meetingUuid, buffer, metadata.userId);
    }
  } catch (error: any) {
    context.log.error(`Error handling audio data for ${meetingUuid}:`, error.message);
  }
};

In debugging I’ve found that metadata.userId is -1 and metadata.userName is "" for both rtms.AudioDataOption.AUDIO_MULTI_STREAMS and rtms.AudioDataOption.AUDIO_MIXED_STREAM.

Hopefully there is a simple way to get information about which speaker is speaking, so I can isolate just their voice and the app will continue to work the way it did before.

@Canary_Speech Thank you for reaching out about this! I was able to replicate this issue and am working to identify the root cause. I’ll be sure to keep you in the loop as I work to release a fix.

2 Likes

@Canary_Speech Thanks for your patience on this. I’ve released v1.0.2 today which includes some fixes around setting audio parameters.

It seems that AUDIO_MULTI_STREAMS should have been the default but there was a bug preventing that from being set in all cases. I also ensured that calling setAudioParams after setting the OnAudioData callback configured the parameters as expected. In my testing, this resolved the edge cases preventing metadata from being displayed.

When it comes to AUDIO_MIXED_STREAM, there is only one stream sent for all participants. In this case, you’ll want to use the OnActiveSpeakerEvent function to receive information about the active speaker.

You can find the latest version here. I hope that helps! Let me know if you are still running into any issues.

1 Like

Thank you! I will pull it in and try it now.

@MaxM That unfortunately didn’t work… The code may be fine, but I now have a problem in that I can’t get any events to fire. I even downloaded your rtms-quickstart-js app and tried it, and I think I’m getting a CSRF error from ngrok.

Here are the details:

POST / HTTP/1.1
Host: 8f1d22f5efc6.ngrok.app
User-Agent: Zoom Marketplace/1.0a
Content-Length: 267
Authorization: BLABLA
Clientid: kkYCzJpZQFKbZsbHd2FOzQ
Content-Type: application/json; charset=utf-8
Traceparent: 00-ea21930aea161ac612edb563b03b268c-4f8fbe0ffe928160-00
X-Forwarded-For: 170.114.6.87
X-Forwarded-Host: 8f1d22f5efc6.ngrok.app
X-Forwarded-Proto: https
X-Zm-Request-Id: 9cc5bc35_3068_49a2_9666_2e223ca4e117
X-Zm-Request-Timestamp: 1769827344
X-Zm-Signature: v0=2976ed88baee0dc410a71c6f74468d796508e0112f5994c57a8750e6758bd161
X-Zm-Trackingid: Webhook_7c0044ad39584fbaa05439a77a2080c2
Zm-Trace-Upstream: Meeting_Web_marketplace-consumer
Accept-Encoding: gzip

{"event":"meeting.rtms_started","payload":{"meeting_uuid":"qt2W4HBAQgSzdPnhvKzYWQ==","operator_id":"yzsAxeKsRx-MXL2-EnxLfg","rtms_stream_id":"d5002d86af074c53bd7757b532d12544","server_urls":"wss://zoomsjc144-195-54-108zssgw.sjc.zoom.us:443"},"event_ts":1769827344365}
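(Side note: since the request is apparently being rejected before it reaches the handler, one thing worth ruling out is the webhook signature check. A minimal sketch of Zoom's documented X-Zm-Signature scheme — the secret token comes from the app's Marketplace settings, and the function name here is my own:)

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Zoom signs each webhook as: v0=HMAC_SHA256(secretToken, `v0:{timestamp}:{rawBody}`) in hex.
// timestamp is the X-Zm-Request-Timestamp header; rawBody is the unparsed request body.
function isValidZoomWebhook(secretToken: string, timestamp: string, rawBody: string, signature: string): boolean {
  const message = `v0:${timestamp}:${rawBody}`;
  const expected = `v0=${createHmac('sha256', secretToken).update(message).digest('hex')}`;
  // Length check first: timingSafeEqual throws on mismatched lengths.
  return expected.length === signature.length &&
    timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```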

@Canary_Speech On my end, when I test the onAudioData function out of the box on v1.0.2, I’m seeing the metadata returned. I’m using this code to obtain the metadata:

client.onAudioData((buffer, size, timestamp, metadata) => {
  console.log(`${metadata.userName} - ${metadata.userId}`);
});


// For AUDIO_MIXED_STREAM do this

const audioParams = {
  dataOpt: rtms.AudioDataOption.AUDIO_MIXED_STREAM,
};

client.setAudioParams(audioParams);

// This works for all dataOpt types
client.onActiveSpeakerEvent((timestamp, userId, userName) => {
    console.log(`Active speaker: ${userName} (${userId})`);
});

Let me know if that’s helpful. If not, I sent you a DM so that we can coordinate a time to meet.

Apologies, there was a copy-paste issue there and I’ve updated the code. :backhand_index_pointing_up:

To clarify, because that code example is a bit terse: you’ll want to use onActiveSpeakerEvent for AUDIO_MIXED_STREAM.

When you’re using AUDIO_MULTI_STREAMS, you can use the metadata object in the callback. This is the way the core SDK handles these events.

The default was supposed to be multi streams for ease of use, and that should be fixed in v1.0.2.

Thank you!

I may just be running into problems with old code. I can’t seem to find any audio samples that demonstrate this any longer. For example, this is no longer in the repo: https://github.com/zoom/rtms-samples/tree/main/audio/save_audio_sdk
What I’ve been having trouble with is getting setAudioParams to work. I’ve been setting this, but it appears that I’m still getting Opus-encoded packets.

    client.setAudioParams({
      channel: rtms.AudioChannel.MONO,
      sampleRate: rtms.AudioSampleRate.SR_16K,
      codec: rtms.AudioCodec.L16,
      contentType: rtms.AudioContentType.RAW_AUDIO,
      dataOpt: rtms.AudioDataOption.AUDIO_MULTI_STREAMS,
      duration: 20,
      frameSize: 320,
    });

I haven’t tried your exact example yet, so I will try that now.

I spotted an issue with the repo where the package-lock.json was out of date. I fixed that, so you can pull down those changes from git and run npm install to get the latest 1.0.2 version.

I just checked the buffer size and it looks to be in chunks of 640 bytes. How are you testing for PCM, just so I can replicate that on my end?

Thank you, I will pull the quickstart with the latest SDK update and try replicating my code to generate a WAV file there. In the meantime:
Here is the code I have in my current solution.

This runs on the rtms started event:

import EventEmitter from 'node:events';
import rtms, { type Client, type JoinParams } from '@zoom/rtms';
import type { ServerContext } from '../../context.js';
import { type ClientEvent, type ClientEventHandler, RecordingState } from './types.js';
import { logger } from '../../logger.js';

export class ClientRegistry<T> {
  private readonly registry = new Map<string, ClientInfo<T>>();
  private readonly eventEmitter = new EventEmitter({ captureRejections: true });

  get size() {
    return this.registry.size;
  }

  has(id: string) {
    return this.registry.has(id);
  }

  async startClient(context: ServerContext, id: string, params: JoinParams, data: T) {
    if (this.has(id)) throw new Error(`Client with meeting uuid ${id} already exists!`);

    const client = new rtms.Client();
    await context.services.audio.prepare(context, id);

    const audioParams = {
      channel: rtms.AudioChannel.MONO,
      sampleRate: rtms.AudioSampleRate.SR_16K,
      codec: rtms.AudioCodec.L16,
      contentType: rtms.AudioContentType.RAW_AUDIO,
      dataOpt: rtms.AudioDataOption.AUDIO_MULTI_STREAMS,
      duration: 20,
      frameSize: 320,
    };

    client.setAudioParams(audioParams);

    client.onActiveSpeakerEvent((event, timestamp, participants) => {
      console.log(`${event} ${timestamp} ${JSON.stringify(participants)}`);
    });

    client.onJoinConfirm(this.eventSinkFn(context, 'join-confirm', id));
    client.onAudioData(this.eventSinkFn(context, 'audio', id));
    client.onLeave(this.eventSinkFn(context, 'leave', id));
    client.onSessionUpdate(this.eventSinkFn(context, 'session-update', id));
    client.onUserUpdate(this.eventSinkFn(context, 'user-update', id));

    this.setInfo(
      id,
      {
        meetingUuid: id,
        context,
        client,
        startTime: Date.now(),
        state: RecordingState.Recording_Connecting,
        paused: new Set(),
        data,
      },
      false
    );

    try {
      client.join(params);
      this.setState(id, RecordingState.Recording_Start);

      context.log.debug(`Added client to meeting(${id})`);
    } catch (joinError) {
      context.log.error(`Failed to join meeting ${id}: ${JSON.stringify(joinError)}`);
      this.setState(id, RecordingState.Recording_Fail);
      this.removeClient(id);
    }
  }

  getClient(id: string) {
    return this.getInfo(id)?.client;
  }

  removeClient(id: string) {
    return this.registry.delete(id);
  }

  getStartTime(id: string) {
    return this.getInfo(id)?.startTime;
  }

  getState(id: string) {
    return this.getInfo(id)?.state;
  }

  setState(id: string, state: RecordingState, extraData?: unknown) {
    const info = this.getInfo(id);
    if (info == null) throw new Error('Cannot set state on non-existent client');

    if (info.state !== state) {
      const previousState = info.state;
      info.state = state;
      this.eventSink(info.context, 'state-transition', id, previousState, state, extraData);
    }
  }

  register<T extends any[]>(event: ClientEvent, listener: ClientEventHandler<T>) {
    logger.info(`Registered Client listener for event(${JSON.stringify(event)})`);

    const eventName = `client:${event}`;
    this.eventEmitter.addListener(eventName, listener);
    return {
      destroy: () => {
        this.eventEmitter.removeListener(eventName, listener);
      },
    };
  }

  pause(id: string, event: ClientEvent) {
    this.getInfo(id)?.paused.add(event);
  }

  resume(id: string, event: ClientEvent) {
    this.getInfo(id)?.paused.delete(event);
  }

  getData(id: string): T | undefined {
    return this.getInfo(id)?.data;
  }

  updateData(id: string, data: Partial<T>): boolean {
    const savedData = this.getInfo(id)?.data;
    // Nothing to merge into if the client entry does not exist.
    if (savedData == null || data == null) return false;
    Object.assign(savedData, data);
    return true;
  }

  private getInfo(id: string): ClientInfo<T> | undefined {
    return this.registry.get(id);
  }

  private setInfo(id: string, info: ClientInfo<T>, processChanges?: boolean) {
    const lastState = this.getInfo(id)?.state;
    this.registry.set(id, info);

    if ((processChanges ?? true) && (lastState == null || lastState !== info.state)) {
      this.eventSink(info.context, 'state-transition', id, lastState, info.state);
    }
  }

  private eventSink(context: ServerContext, event: ClientEvent, id: string, ...args: any[]) {
    if (this.getInfo(id)?.paused.has(event)) return;
    this.eventEmitter.emit(`client:${event}`, context, id, ...args);
  }

  private eventSinkFn(context: ServerContext, event: ClientEvent, id: string) {
    return (...args: any[]) => this.eventSink(context, event, id, ...args);
  }
}

interface ClientInfo<T> {
  readonly meetingUuid: string;
  readonly context: ServerContext;
  startTime: number;
  readonly client: Client;
  state: RecordingState;
  paused: Set<ClientEvent>;
  readonly data: T;
}

This is in the onAudio event:

import type { Metadata } from '@zoom/rtms';
import type { ServerContext } from '../../context.js';
import { type ClientEventHandler, RecordingState } from '../../services/client-registry/types.js';

// Debug: track packet count per meeting
const packetCounts = new Map<string, number>();

export const onAudio: ClientEventHandler = async (
  context: ServerContext,
  meetingUuid: string,
  buffer: Buffer,
  _size: number,
  _timestamp: number,
  metadata: Metadata
) => {
  try {
    const selectedParticipantId = context.services.clients.getData(meetingUuid)?.participant?.participantId;

    if (
      metadata.userId == null ||
      !meetingUuid ||
      metadata.userId === -1 ||
      (selectedParticipantId != null && selectedParticipantId !== metadata.userId.toString())
    ) {
      return;
    }

    // Debug: Log first few packets to understand the format
    const count = (packetCounts.get(meetingUuid) ?? 0) + 1;
    packetCounts.set(meetingUuid, count);
    if (count <= 5) {
      const firstBytes = buffer.subarray(0, Math.min(20, buffer.length));
      context.log.info(`[DEBUG] Audio packet #${count}: size=${buffer.length}, first20bytes=${firstBytes.toString('hex')}`);
    }

    const state = context.services.clients.getState(meetingUuid);

    if (state === RecordingState.Recording_Start) {
      if (context.services.audio.hasReachRecordingMaxTime(context, meetingUuid)) {
        context.services.clients.setState(meetingUuid, RecordingState.Recording_Stop);
        context.log.info(`Max recording time reached for meeting ${meetingUuid}, stopping RTMS.`);

        // BUG: Will this cause any issues? What is the flow here?
        context.services.clients.pause(meetingUuid, 'audio');
        await context.services.audio.processStream(context, meetingUuid);
        return;
      }

      context.services.audio.pushBytes(context, meetingUuid, buffer, metadata.userId);
    }
  } catch (error: any) {
    context.log.error(`Error handling audio data for ${meetingUuid}:`, error.message);
  }
};

And this writes the chunks to a file:

async pushBytes(context: ServerContext, meetingUuid: string, buffer: Buffer, userId: number) {
    // Get or create meeting-specific streams Map
    let meetingStreams = this.writeStreams.get(meetingUuid);
    if (!meetingStreams) {
      meetingStreams = new Map();
      this.writeStreams.set(meetingUuid, meetingStreams);
    }

    const filePath = this.getFilePathForUser(meetingUuid, userId);

    let stream = meetingStreams.get(userId.toString());
    if (!stream) {
      const ambient = context.services.clients.getData(meetingUuid)?.ambient ?? false;
      stream = ambient ? AudioStream.ambient(context, filePath) : AudioStream.standard(context, filePath);
      meetingStreams.set(userId.toString(), stream);
    }

    try {
      await stream.write(buffer);

      // Debug statements with low sample rate
      if (Math.random() < 0.01)
        context.log.debug(
          `Audio received for user(${userId}); Processed Seconds(${stream.bytesIn / 2 / 16000}); Saved Seconds(${stream.bytesOut / 2 / 16000})`
        );
    } catch (e) {
      context.log.error(`Failed to write to stream, ${JSON.stringify(e)}`);
    }
  }

From what I can tell, that file is still Opus-encoded.
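One quick way I test for PCM is to wrap the raw bytes in a standard 44-byte RIFF/WAVE header and listen to the result: raw L16 plays as intelligible audio, while Opus payloads come out as noise. A minimal sketch, assuming 16 kHz mono 16-bit to match my audioParams (the helper name is mine):

```typescript
// Prepend a 44-byte WAV header to raw 16-bit little-endian PCM so the
// dump can be opened in any audio player.
function pcmToWav(pcm: Buffer, sampleRate = 16000, channels = 1, bitsPerSample = 16): Buffer {
  const byteRate = (sampleRate * channels * bitsPerSample) / 8;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);     // data chunk size
  return Buffer.concat([header, pcm]);
}
```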

I think the package lock update fixed the audio issue! I would still like to meet with you to review my app but I am able to create a wav file now!

1 Like