Detecting active speakers in real time or getting each participant's audio separately

Is your feature request related to a problem? Please describe.
Our use case is real-time, per-participant analysis of meetings.
We have two questions/feature requests:

  1. Is there an existing API (in Zoom Electron) that reports when each participant starts and stops speaking during the conversation, so that our analysis is accurate? We are aware of the post-recording TIMELINE, but it is our less preferred option because we need this in real time. Other platforms solve this today, either with a WebSocket that carries a separate audio stream per participant or with real-time speaker talking events, and it would be great to know whether Zoom plans to offer this kind of capability.
  2. Is there a way to get each participant's audio separately, WebSocket style? If not, is there a better way than SIP to get all participants' audio in a single stream? The suggested custom RTMP live streaming solution requires at least 3 phases before the audio can be passed to a WebSocket for speech-to-text purposes: set up an RTMP server to manage the streaming endpoint, run a server that converts the RTMP stream to a WebSocket stream, and then convert the video+audio WebSocket stream to an audio-only stream.
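
To make the first request more concrete, here is a minimal sketch of the kind of start/stop speaking events being asked for. Everything in it is an assumption for illustration: `LevelSample`, `SpeakerEvent`, `speaker_events`, and the 0.1 threshold are hypothetical and not part of any existing Zoom API. The per-participant audio levels it consumes are exactly the input that does not appear to be available in real time today.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LevelSample:
    """Hypothetical per-participant audio level at one instant (not a Zoom type)."""
    timestamp_ms: int
    participant_id: str
    level: float  # normalized loudness, assumed to be in [0.0, 1.0]


@dataclass(frozen=True)
class SpeakerEvent:
    """Hypothetical real-time event: a participant started or stopped speaking."""
    timestamp_ms: int
    participant_id: str
    kind: str  # "started" or "stopped"


def speaker_events(samples: Iterable[LevelSample],
                   threshold: float = 0.1) -> List[SpeakerEvent]:
    """Turn a time-ordered stream of level samples into start/stop events.

    A participant counts as speaking while their level is at or above
    `threshold`; an event is emitted only when that state changes.
    """
    speaking: Dict[str, bool] = {}  # last known state per participant
    events: List[SpeakerEvent] = []
    for sample in samples:
        is_speaking = sample.level >= threshold
        was_speaking = speaking.get(sample.participant_id, False)
        if is_speaking != was_speaking:
            kind = "started" if is_speaking else "stopped"
            events.append(
                SpeakerEvent(sample.timestamp_ms, sample.participant_id, kind)
            )
        speaking[sample.participant_id] = is_speaking
    return events
```

A simple threshold like this would flicker on short pauses, so a real implementation would likely need smoothing or proper voice activity detection. The point is only the event shape: a timestamp, a participant identifier, and a started/stopped flag delivered while the meeting is in progress.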
