Lengths of recorded audio and video are not the same, and they are out of sync when merged together

Hello,

I have successfully implemented the Zoom SDK headless bot for Linux using the sample code, which allows me to join a meeting, record, and store the video and audio as raw YUV and PCM data respectively.

However, I found that the lengths of the video and audio are not the same (see attached).

Here is the ffmpeg command I used to merge the audio and video:
ffmpeg -f rawvideo -pix_fmt yuv420p -s:v 640x360 -r 25 -i meeting-video.yuv -f s16le -ar 32000 -ac 1 -i meeting-audio.pcm -c:v copy -c:a pcm_s16le -map 0:v:0 -map 1:a:0 video-audio-output.mkv

input video size: 640x360
input frame rate: 25 fps
audio sample rate: 32 kHz
channels: 1
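A note on the command above: re-encoding both streams and adding -shortest at least forces the container durations to agree (storing raw video with -c:v copy also produces a very large file), though it cannot repair drift inside the file. A variant worth trying, as a sketch only:

ffmpeg -f rawvideo -pix_fmt yuv420p -s:v 640x360 -r 25 -i meeting-video.yuv \
    -f s16le -ar 32000 -ac 1 -i meeting-audio.pcm \
    -c:v libx264 -c:a aac -map 0:v:0 -map 1:a:0 -shortest video-audio-output.mkv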

Possible problem: the writing rates are not the same (see attached; the writing is uneven while recording audio and video), or the video has a variable frame rate.

Do you know how to fix this issue, or what might have caused it?

@chriswuyiming these demos show the capabilities of the SDK and how to access the raw audio and raw video streams. For instance, in this case, they show how to get access to the YUV420p video frames and the PCM audio.

There are further optimisations which need to be done, and they are currently not in this demo application.

I did some quick testing, and it seems that if the video frames are encoded into MKV using ffmpeg at runtime, you should not have this issue of different lengths of video and audio.
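One command-line way to approximate that runtime encoding, assuming the bot writes each raw frame and audio chunk into named pipes as the SDK delivers them (a sketch, not necessarily how it was done in this test):

mkfifo /tmp/meeting-video.fifo /tmp/meeting-audio.fifo
ffmpeg -f rawvideo -pix_fmt yuv420p -s:v 640x360 -r 25 -i /tmp/meeting-video.fifo \
    -f s16le -ar 32000 -ac 1 -i /tmp/meeting-audio.fifo \
    -c:v libx264 -c:a aac meeting-output.mkv

Note that ffmpeg opens its inputs in order, so both FIFOs need active writers or the process will block at startup.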

It is also necessary to apply a slight offset between the audio and video, as the start times of saving the audio and video files might differ at runtime.
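If you can measure that start-time difference, ffmpeg's -itsoffset flag can compensate for it at merge time. In this sketch, 0.3 is a hypothetical measured delay in seconds, applied to the audio input that follows the flag:

ffmpeg -f rawvideo -pix_fmt yuv420p -s:v 640x360 -r 25 -i meeting-video.yuv \
    -itsoffset 0.3 -f s16le -ar 32000 -ac 1 -i meeting-audio.pcm \
    -c:v libx264 -c:a aac -map 0:v:0 -map 1:a:0 video-audio-output.mkv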

Thanks for replying. Encoding in real time while recording is definitely one of the ways to solve this issue; both FFmpeg and GStreamer are capable of handling real-time encoding tasks. I am wondering what ffmpeg command you used at runtime: did you encode the video frames into MKV from YUV, or did you skip the YUV format entirely (going directly from the raw data to MKV)?

@chriswuyiming, we’ve seen GStreamer be more effective when dealing with more complicated real-time audio/video pipelines, especially those that require dynamic reconfiguration at runtime.

You’d be able to accomplish this with GStreamer using a pipeline containing: two appsrc elements to ingest the raw audio and video; videorate and videoscale to normalize the frame rate and size of the incoming video; audiorate to normalize the sample rate of the audio; x264enc and voaacenc to encode the video (H.264) and audio (AAC) respectively; followed by an mp4mux and a filesink to mux the audio and video and write them to a file.
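For illustration, here is a rough offline gst-launch-1.0 equivalent of that pipeline, as a sketch only: it substitutes filesrc plus the raw parsers for the two appsrc elements (appsrc is only usable from application code that pushes buffers from the SDK callbacks), and it assumes a GStreamer build that ships rawvideoparse, rawaudioparse and voaacenc:

gst-launch-1.0 -e mp4mux name=mux ! filesink location=meeting.mp4 \
    filesrc location=meeting-video.yuv ! rawvideoparse width=640 height=360 format=i420 framerate=25/1 \
    ! videorate ! videoscale ! videoconvert ! x264enc ! h264parse ! mux. \
    filesrc location=meeting-audio.pcm ! rawaudioparse format=pcm pcm-format=s16le sample-rate=32000 num-channels=1 \
    ! audioconvert ! audiorate ! voaacenc ! mux.

In the live bot, the two filesrc branches would be replaced by the appsrc elements, with their caps set to match the raw frame format the SDK delivers.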

If you want to capture the screen share as well, or have multiple participants’ video showing at the same time, you’ll face a significant amount of additional complexity, as you’ll need to modify the pipeline dynamically while it runs in order to add or remove the required pipeline elements.

Another alternative is to use Recall.ai for your meeting bots instead. It’s a simple third-party API that lets you use meeting bots to get raw audio/video from meetings without needing to spend months building, scaling, and maintaining these bots.

Let me know if you have any questions!

@chriswuyiming,

I’ve tried using ffmpeg to encode and then save to an MKV file at runtime. This is not using the command-line ffmpeg but the C++ libraries.

@chriswuyiming did you figure out a way to merge the raw audio and video or compensate for the difference in lengths?