Get captions in real time

Hey there,

I wanted to ask whether it is possible to get the live captions from a meeting. I am building a headless bot using the Linux SDK and want to retrieve the captions. Ideally, I would query every x seconds and get the captions added since the last query. Is there any way to do this?

Thanks

@noahviktorschenk it seems the Meeting SDK for Linux does not have this capability at the moment.

On the other hand, the Meeting SDK for Windows does have the IClosedCaptionControllerEvent, which provides you with captions and transcription.

@chunsiong.zoom yeah, okay. We have already built the application using the Linux SDK.

Am I correct in thinking that we would be able to get the live audio track and from that transcribe it ourselves using a third party?

If so, do you have any resources on how this could be implemented that you could point to?

@noahviktorschenk if you are looking at a live-stream type of scenario, you will need to:

  1. get the raw audio in PCM (this is provided by Zoom)
  2. convert the buffer/frames/stream into a format the 3rd party accepts
  3. send the converted audio frames/stream to the 3rd party service
  4. get back the transcribed captions from the 3rd party service

You will need to implement steps 2, 3, and 4 on your end.

Hey @noahviktorschenk ,

If you’re open to using a third party API, you could consider the Recall.ai API.

Here is the guide to get real-time transcription from Zoom with the Recall API: Real-Time Transcription

@chunsiong.zoom I will most likely need them as m4a, mp3, mp4, mpeg, mpga, wav, or webm in 10-second intervals.

Just to make sure, this is possible, correct?

Also, is the audio stream split up into the different participants or just one singular?

@noahviktorschenk , we only provide you with step 1.

You will need to convert the raw PCM audio to m4a, mp3, mp4, mpeg, etc. yourself.

There are 2 callbacks: one returns a separate audio stream for each individual participant, and the other returns a single mixed stream with everyone in it.

Okay, no problem.

Do you have some sample code or resources for how to implement step 1?

@noahviktorschenk

In this sample, there is a boolean variable GetAudioRawData in meeting_sdk_demo.cpp which shows some of the methods to call.

Once the audio has been subscribed, the callback can be found in ZoomSDKAudioRawData.cpp.

The sample code inside that class saves the audio to a PCM file:

void ZoomSDKAudioRawData::onMixedAudioRawDataReceived(AudioRawData* audioRawData)
{
	std::cout << "Received onMixedAudioRawDataReceived" << std::endl;
	//add your code here

	// Append the raw PCM frames to a file
	std::ofstream pcmFile;
	pcmFile.open("audio.pcm", std::ios::out | std::ios::binary | std::ios::app);

	if (!pcmFile.is_open()) {
		std::cout << "Failed to open PCM file" << std::endl;
		return;
	}

	// Write the audio data to the file
	pcmFile.write((char*)audioRawData->GetBuffer(), audioRawData->GetBufferLen());
	//std::cout << "buffer length: " << audioRawData->GetBufferLen() << std::endl;

	// Close the file (this also flushes it)
	pcmFile.close();
}

@chunsiong.zoom Ah, perfect. Thank you so much.

Just read through the documentation for the meeting SDK, and it has an IClosedCaptionController, which has a StartLiveTranscription function. Would these not work?

@noahviktorschenk
Yes, it will work, but only on Windows.
There is no such controller in the Linux SDK.