Need help extracting transcripts from audio files

Hey everyone, I’ve built a Linux app using the Zoom Linux SDK that joins Zoom meetings and records the audio of each participant separately.
I’m now facing three problems that I need solutions for:

  1. Each participant gets a separate audio file, so a 30-minute meeting with three participants produces three audio files with a combined length of 90 minutes, which is expensive to transcribe.
  2. I want to combine the transcripts of all users from the different audio files into one conversation, for example:
    User1: Hi
    User2: Hey, what’s up?
  3. I also want timestamps for those transcripts.

P.S. I have my own ASR model for speech-to-text, so I’m not looking for any third-party service providers. What is currently missing is speaker diarization.
The final result I’m expecting is similar to the transcripts Zoom provides:

00:00:47.000 → 00:00:52.000
David: Hey, Good Morning

00:00:53.154 → 00:00:55.025
Goggins: Very Good Morning, David
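
For readers with the same goal, here is a minimal sketch of the merging step only. It assumes your ASR already returns segments per participant with start and end times in milliseconds on a clock shared by all recordings (for example, relative to the start of the meeting). The names `segmentsBySpeaker`, `mergeTranscripts`, `renderTranscript`, and `formatTimestamp` are made up for illustration and are not part of any Zoom SDK.

```javascript
// Hypothetical input: one array of ASR segments per participant.
// Times are milliseconds on a clock shared by all recordings (assumption).
const segmentsBySpeaker = {
  David: [{ startMs: 47000, endMs: 52000, text: "Hey, Good Morning" }],
  Goggins: [{ startMs: 53154, endMs: 55025, text: "Very Good Morning, David" }],
};

// Format milliseconds as HH:MM:SS.mmm.
function formatTimestamp(ms) {
  const pad = (n, width) => String(n).padStart(width, "0");
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Tag every segment with its speaker, then sort all segments by start time.
function mergeTranscripts(bySpeaker) {
  return Object.entries(bySpeaker)
    .flatMap(([speaker, segments]) => segments.map((s) => ({ speaker, ...s })))
    .sort((a, b) => a.startMs - b.startMs || a.endMs - b.endMs);
}

// Render the merged segments as timestamped "Speaker: text" entries.
function renderTranscript(merged) {
  return merged
    .map((s) => `${formatTimestamp(s.startMs)} → ${formatTimestamp(s.endMs)}\n${s.speaker}: ${s.text}`)
    .join("\n\n");
}

console.log(renderTranscript(mergeTranscripts(segmentsBySpeaker)));
```

Because every recording belongs to exactly one participant, the speaker label comes from the file itself, so no diarization model is needed for this step. The hard part is aligning the per-participant clocks, which this sketch takes as given.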

@rakshithgowdahr if you are using raw recording, you will need to handle the timestamps on your side.

There might be a way for you to achieve this, but I do not guarantee it will work.
This is a suggestion, and it does not represent the functionality of Zoom’s SDK.

Use the onOneWayAudioRawDataReceived callback to identify who is speaking, and record the speaking times as metadata.

However, you will need to handle overlapping audio.
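
The bookkeeping behind this suggestion could look roughly like the sketch below. It is not SDK code: the Linux SDK callback is C++, and the logic is shown in JavaScript only to illustrate the idea. It assumes each callback gives you a participant ID, a frame of 16-bit PCM samples, and a timestamp on a shared clock. `SpeakingTracker`, `frameRms`, and the threshold values are hypothetical and would need tuning.

```javascript
// Root-mean-square level of one frame of 16-bit PCM samples, scaled to 0..1.
function frameRms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += (s / 32768) ** 2;
  return Math.sqrt(sumSquares / samples.length);
}

// Tracks speaking intervals per participant from a stream of audio frames.
class SpeakingTracker {
  constructor({ rmsThreshold = 0.02, maxGapMs = 500 } = {}) {
    this.rmsThreshold = rmsThreshold; // assumed level separating speech from silence
    this.maxGapMs = maxGapMs;         // pauses up to this length stay in one interval
    this.intervals = new Map();       // userId -> [{ startMs, endMs }]
  }

  // Call once per received frame; timestampMs is the frame start on a shared clock.
  addFrame(userId, timestampMs, durationMs, samples) {
    if (frameRms(samples) < this.rmsThreshold) return; // treat the frame as silence
    const list = this.intervals.get(userId) ?? [];
    const last = list[list.length - 1];
    const endMs = timestampMs + durationMs;
    if (last && timestampMs - last.endMs <= this.maxGapMs) {
      last.endMs = Math.max(last.endMs, endMs);   // extend the current interval
    } else {
      list.push({ startMs: timestampMs, endMs }); // start a new interval
    }
    this.intervals.set(userId, list);
  }

  // Pairs of intervals from different participants that overlap in time.
  findOverlaps() {
    const all = [...this.intervals].flatMap(([userId, list]) =>
      list.map((iv) => ({ userId, ...iv }))
    );
    const overlaps = [];
    for (let i = 0; i < all.length; i++) {
      for (let j = i + 1; j < all.length; j++) {
        const a = all[i];
        const b = all[j];
        if (a.userId !== b.userId && a.startMs < b.endMs && b.startMs < a.endMs) {
          overlaps.push([a, b]);
        }
      }
    }
    return overlaps;
  }
}
```

With intervals like these you only need to send the voiced parts of each recording to the ASR model, which also reduces the total amount of audio to process. The energy threshold is a crude stand-in for a proper voice activity detector.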

Thanks @chunsiong.zoom,
I have a follow-up question on this.
I’ve built a Zoom bot using the Linux SDK to access meeting transcripts, but I don’t want to record raw audio and process it into transcripts myself, so I have two solutions in mind:

  1. Can the Linux Zoom bot access closed captions or live transcripts? I could not find closed caption support in the Linux SDK, although it is present in the Mac and Windows SDKs.
  2. If I build a Zoom Marketplace app and publish it, can it access closed captions or live transcripts?

If the Linux SDK does not support ZoomSDKCloseCaptionController, when will support be added?

@rakshithgowdahr

  1. Can the Linux Zoom bot access Closed Captions or Live transcripts? Because I could not find the support for Closed Captions in Linux SDK though they’re present in Mac and Windows SDK

No, it is not available on Linux.

  2. If I build a Zoom Marketplace app and publish it, can it access closed captions or live transcripts?

Not on Linux, but it is available on other platforms.

If the Linux SDK does not support ZoomSDKCloseCaptionController, when will support be added?

There are no plans to do so at the moment. If you could elaborate on the use case or user story, I can put in a feature request for this.

Thanks for the detailed explanation @chunsiong.zoom,
I have a few more questions:

  1. If I build a Zoom bot using the Windows SDK, can I access closed captions or live transcripts with speaker identification?
  2. Does the bot have to be the host to get this data, or can any participant access it?
  3. Does the host have to enable live transcripts for every meeting?

Can the bot enable live transcripts from its own account? A bot is like any other participant, so if the bot tries to enable live transcripts and the meeting host grants permission, can I access those transcripts from the Windows SDK?

How about the Web SDK instead of the Windows SDK? Can the Web SDK do the things mentioned above, or do I still need the Windows SDK for that?

@rakshithgowdahr

  1. If I build a Zoom bot using Windows SDK, Can I access Closed Captions or Live transcripts with speaker identification?

Yes.

  2. Does the bot have to be the host to get this data, or can any participant access it?

All participants will have access to it; you do not need to be the host.

  3. Does the host have to enable live transcripts for every meeting?

Yes, it needs to be enabled. It can be enabled from the participant’s side.

@chunsiong.zoom Sorry, the post was not updated; there are a few more questions left:

Can the bot enable live transcripts from its own account? A bot is like any other participant, so if the bot tries to enable live transcripts and the meeting host grants permission, can I access those transcripts from the Windows SDK?

How about the Web SDK instead of the Windows SDK? Can the Web SDK do the things mentioned above, or do I still need the Windows SDK for that?

@rakshithgowdahr

Can the bot enable live transcripts from its own account? A bot is like any other participant, so if the bot tries to enable live transcripts and the meeting host grants permission, can I access those transcripts from the Windows SDK?

Yes, the bot can start the transcription service, and it can also receive the transcripts through the Windows Meeting SDK.

How about the Web SDK instead of the Windows SDK? Can the Web SDK do the things mentioned above, or do I still need the Windows SDK for that?

The Web SDK cannot programmatically start transcription. The Web SDK can receive the transcripts.

Hey @chunsiong.zoom ,
Based on my research, I believe the Web SDK does receive transcripts but does not provide speaker diarization. Am I wrong? If speaker identification is supported in the Web SDK, can you point me to some good docs?

ZoomMtg.onCaptions(data => {
    console.log('Closed caption data:', data);
});

Sample data

{
    "sequence": 3,
    "type": 1,
    "lang": "en-US",
    "text": "Hello, this is an example of closed caption text."
}

@rakshithgowdahr ,

Closed captions are not transcription.

Closed captions are entered manually by a person.

Transcription is a speech-to-text service, and this is most likely what you are looking for.

ZoomMtg.inMeetingServiceListener('onReceiveTranscriptionMsg', function (data) {
    console.log('onReceiveTranscriptionMsg', data);
});

The Web SDK does support receiving transcription, but you cannot programmatically start the transcription service from the Web SDK.

No, we do not provide speaker diarization details from this callback.

Hey @rakshithgowdahr!

As Chun Siong said, the support for each platform varies:

Zoom Linux SDK

❌ Does not support closed captions
❌ Does not support live transcripts

Zoom Windows SDK

✅ Supports closed captions
✅ Supports live transcripts with speaker identification
✅ Can start transcription service

Web SDK

✅ Can receive transcripts
❌ Cannot programmatically start transcription
❌ Does not provide speaker diarization detail

These are all important implementation details to consider when deciding on how to build your bot solution.

Another alternative is to use Recall.ai for your meeting bots instead. It’s a simple 3rd party API that lets you use meeting bots to get raw audio/video from meetings without you needing to spend months to build, scale and maintain these bots. Recall also provides the transcripts from Zoom out-of-the-box without having to worry about these implementation details.

Let me know if you have any questions!

Hi @chunsiong.zoom ,

In the Windows Meeting SDK documentation, the description of the StartLiveTranscription() method says:
"If the meeting allows multi-language transcription, all users can start live transcription. Otherwise only the host can start it."
Does this mean the Windows meeting bot (which is not the host) cannot start the transcription service? If so, what are the requirements or actions needed for the bot to start the transcription service?

@rakshithgowdahr

If transcription is disabled by the host, no one can start live transcription.
If transcription is enabled by the host, the bot can start the transcription service.

@chunsiong.zoom
Thanks, I’ve used the demo apps you’ve set up.
Currently, I’m using the CaptionDemo app, and I have a few questions.
Can it start live transcription even though it’s not the host?
Also, for some reason, it’s not joining the meeting.
The meeting ID, password, and jwt_token are all correct.

@rakshithgowdahr error 63 means you are trying to join an external meeting without publishing your app.

You will need to publish your meeting SDK app if you want to join a meeting created by an external host.

@chunsiong.zoom
It’s not an external meeting; the meeting belongs to the same user who created the app on the Marketplace. The app is still in development, so we haven’t published it yet, but we will in the future. One more thing: using the same credentials, I’m able to join the meeting with the SDK demo app included in the Windows SDK kit.

I’ll PM you with more details.

@chunsiong.zoom Here are the details:

    "sdk_jwt": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzZGtLZXkiOiI1UGd6bTFtTlFudXVRTHlWTEt3eW93IiwiYXBwS2V5IjoiNVBnem0xbU5RbnV1UUx5VkxLd3lvdyIsIm1uIjo4Mzc2MzI5OTczMywicm9sZSI6MCwiaWF0IjoxNzEyMDU0Mzk4LCJleHAiOjE3MTIwNjE1OTgsInRva2VuRXhwIjoxNzEyMDYxNTk4fQ.4c6lU725xS9tSYBosnoHodi0HXmHLF8IS3uLxlFkkc8",
    "meeting_number": 83763299733,
    "passcode": "redacted",
    "video_source": "Big_Buck_Bunny_1080_10s_1MB.mp4",
    "zak":""
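
When a join fails with settings like these, one thing that can be ruled out locally is the token itself. The Node.js sketch below decodes a JWT payload and checks a few claims. It is only a sanity check under assumptions: it does not verify the signature, it does not explain the join error above, and it only reads the `exp`, `iat`, and `mn` claims if the token carries them. `decodeJwtPayload` and `checkJoinToken` are hypothetical helper names, not SDK functions.

```javascript
// Decode the payload (second segment) of a JWT without verifying its signature.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("expected header.payload.signature");
  // JWTs use base64url; map it back to standard base64 before decoding.
  const base64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  return JSON.parse(Buffer.from(base64, "base64").toString("utf8"));
}

// Return a list of human-readable problems found in the token's claims.
function checkJoinToken(token, meetingNumber, nowSeconds = Math.floor(Date.now() / 1000)) {
  const claims = decodeJwtPayload(token);
  const problems = [];
  if (typeof claims.exp === "number" && claims.exp <= nowSeconds) {
    problems.push("token has expired (exp)");
  }
  if (typeof claims.iat === "number" && claims.iat > nowSeconds) {
    problems.push("token issued in the future (iat); check the system clock");
  }
  if ("mn" in claims && String(claims.mn) !== String(meetingNumber)) {
    problems.push("mn claim does not match the meeting number");
  }
  return problems;
}
```

Calling `checkJoinToken(sdkJwt, meetingNumber)` returns an empty array when none of these checks fail, which narrows the problem to something other than an expired or mismatched token.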

@rakshithgowdahr could you try with a legacy Meeting SDK app and check whether this happens as well?

@chunsiong.zoom I’ve tried all of the SDK versions below, and the result is the same for each:
5.15.7.20385
5.16.5.24346
5.17.5.31085