Video stream is 2-3x faster than the audio stream

Hi,
So I’m using the Linux SDK to record the host’s video & audio streams.
The SDK successfully creates YUV and PCM files.

I’m trying to deal with 2 different situations (maybe related to each other) that I would like some help with:

  1. The SDK creates a .yuv file for the video. After taking the file and running an ffmpeg command to convert it to mp4, it looks like the video is playing very fast.
    Here is the command I’m running. To deal with it I played with the -framerate attribute in the command, but I don’t think that really solves the situation.
    Usually I used -framerate 25
ffmpeg -y -f rawvideo -pix_fmt yuv420p -video_size {video_file_width}x{video_file_height} -framerate 17 -i temp-{video_file} -f mp4 video.mp4
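
(As I understand it, ffmpeg simply plays back whatever frames are in the raw file at the -framerate it is given, so a chunk that happens to contain, say, 80 frames lasts only 80 / 25 = 3.2 seconds at -framerate 25, no matter how long it actually took to record those frames.)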

The bot subscribes to the stream this way:

videoHelper1->setRawDataResolution(ZoomSDKResolution_720P);
videoHelper1->subscribe(getUserObj(i)->GetUserID(), RAW_DATA_TYPE_VIDEO);

The function that creates the yuv file:

void ZoomSDKRenderer::SaveToRawYUVFile(YUVRawDataI420* data) {
    // Build the output file name from the source ID and the stream dimensions
    std::string filename = "video_output-";
    int number = data->GetSourceID();
    int width = data->GetStreamWidth();
    int height = data->GetStreamHeight();

    filename += std::to_string(number);
    filename += "---";
    filename += std::to_string(width);
    filename += "---";
    filename += std::to_string(height);
    filename += ".yuv";

    // Open the file for appending in binary mode
    std::ofstream outputFile(filename, std::ios::out | std::ios::binary | std::ios::app);
    if (!outputFile.is_open())
    {
        std::cout << "Error opening file." << std::endl;
        return;
    }

    // Calculate the sizes of the Y, U, and V planes (I420: U and V are each 1/4 of Y)
    size_t ySize = static_cast<size_t>(width) * height;
    size_t uvSize = ySize / 4;

    // Write the Y, U, and V planes to the output file
    outputFile.write(data->GetYBuffer(), ySize);
    outputFile.write(data->GetUBuffer(), uvSize);
    outputFile.write(data->GetVBuffer(), uvSize);

    // Flush and close the file
    outputFile.flush();
    outputFile.close();
}
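
Side note: to see how many frames actually arrive per second, I’m thinking of calling a small counter like this from SaveToRawYUVFile — just a rough sketch, nothing Zoom-specific (CountFrame is a name I made up):

#include <chrono>
#include <cstdio>

// Counts received frames and elapsed wall-clock time so the real frame rate
// can be passed to ffmpeg instead of a guessed value.
void CountFrame() {
    static auto start = std::chrono::steady_clock::now();  // time of the first frame
    static size_t frameCount = 0;
    ++frameCount;
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    if (elapsed > 0.0) {
        // e.g. "received 83 frames in 10.1 s (~8.2 fps)"
        std::printf("received %zu frames in %.1f s (~%.1f fps)\n",
                    frameCount, elapsed, frameCount / elapsed);
    }
}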

Is there any explanation for what could cause that?

  2. The audio and the video are not aligned. This is the flow I’m doing:
  • Make sure there are YUV & PCM files, and that the host’s camera & mic are open.
  • Delete both files at the same time.
  • The SDK recreates the files immediately.
  • After 10 seconds I copy the files to ‘temp’ YUV & PCM files.
  • Run ffmpeg commands to convert the video and combine it with the audio this way:
ffmpeg -y -f rawvideo -pix_fmt yuv420p -video_size {video_file_width}x{video_file_height} -framerate 17 -i temp-{video_file} -f mp4 video.mp4
ffmpeg -y -i video.mp4 -f s16le -ar 32000 -ac 1 -i temp-{audio_file} -c:v copy -c:a aac -strict experimental final.mp4
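
One sanity check on this flow (back-of-the-envelope numbers only): 10 seconds of PCM at 32 kHz, mono, 16-bit is 32000 × 2 × 10 = 640,000 bytes, which always plays back as exactly 10 seconds of audio, whereas the matching YUV chunk plays for (number of frames) / 17 seconds with -framerate 17. Unless the chunk really contains 170 frames, the converted video comes out shorter or longer than the audio, so the two drift apart.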

The resulting ‘final.mp4’ file isn’t perfect because the audio is not 100% aligned with the video. When the person speaks, there is a delay between their lip movements and the audio.

What am I missing here?

CC: @chunsiong.zoom I would really appreciate your help :slight_smile:

Thanks!

@gofmannir muxing is probably out of scope for this developer forum.

One way to solve this is to use ffmpeg or gstreamer at the code level to first encode the YUV frames into mkv or mp4.

Thereafter, muxing the audio and video together should keep them in sync.

What about the fact that the YUV stream file plays back relatively fast?

@gofmannir if you use gstreamer or ffmpeg at the code level to encode each frame as you receive the callback, the result will have the same length as the wav file.

Currently I’m not encoding the frames on each callback, but at ~10-second intervals.
The process is separate for video and audio.
The video output file runs at about 2x real time, which is weird. What is the frequency at which the callback is called? What FPS / frame rate?

Thanks.

@gofmannir and you are using the command line to encode the video and audio every 10 seconds? That’s likely the issue.

The solution is to encode it at runtime.

Why does it matter?
If I take the yuv file after 10 seconds (let’s leave muxing aside) and convert it to mp4, I get a very fast video.
Are you saying this approach is causing the video to be fast?

Hey @gofmannir!

Why does it matter?
If I take the yuv file after 10 seconds (let’s leave muxing aside) and convert it to mp4, I get a very fast video.
Are you saying this approach is causing the video to be fast?

When the video is running fast, this is because the frame rate you’re specifying to ffmpeg is too high.

ffmpeg receives the input frames but needs to know how long to show each frame for. In the case where your video is too fast, this means that the frame rate is too high and you should lower it accordingly.

In general, you shouldn’t use a fixed frame rate when converting the video from the Zoom SDK. The reason for this is that the frame rate can actually vary. For instance, if the network connection is bad or experiences a disruption, you could actually get a lower frame rate or drop frames.

We recommend using something like gstreamer to encode the video in real time. This will also solve the issue you’re seeing around audio and video becoming desynchronized. When you encode the audio and video simultaneously, this will keep them in sync regardless of whether you have a gap in the video due to your network, or any other reason.
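
To give a rough idea of what encoding at runtime can look like, here is a minimal sketch (not production code): it pipes each raw frame into an ffmpeg process from the raw-data callback, assumes a fixed resolution and a roughly constant callback rate, and assumes ffmpeg is on the PATH; RealtimeEncoder and pushFrame are placeholder names. For a truly variable frame rate you would want per-frame timestamps (e.g. a gstreamer appsrc with do-timestamp=true), but the idea is the same:

#include <cstdio>
#include <string>

// Sketch of a runtime encoder: raw I420 frames are written straight into the
// stdin of an ffmpeg process instead of being dumped to a .yuv file first.
class RealtimeEncoder {
public:
    RealtimeEncoder(int width, int height, int fps)
        : width_(width), height_(height) {
        std::string cmd =
            "ffmpeg -y -f rawvideo -pix_fmt yuv420p"
            " -video_size " + std::to_string(width) + "x" + std::to_string(height) +
            " -framerate " + std::to_string(fps) +
            " -i - -c:v libx264 -pix_fmt yuv420p video.mp4";
        pipe_ = popen(cmd.c_str(), "w");   // ffmpeg reads raw frames from its stdin
    }

    ~RealtimeEncoder() {
        if (pipe_) pclose(pipe_);          // closing stdin lets ffmpeg finalize the mp4
    }

    // Call this from the same callback that currently calls SaveToRawYUVFile().
    void pushFrame(YUVRawDataI420* data) {
        if (!pipe_) return;
        size_t ySize  = static_cast<size_t>(width_) * height_;
        size_t uvSize = ySize / 4;
        fwrite(data->GetYBuffer(), 1, ySize, pipe_);
        fwrite(data->GetUBuffer(), 1, uvSize, pipe_);
        fwrite(data->GetVBuffer(), 1, uvSize, pipe_);
    }

private:
    int   width_, height_;
    FILE* pipe_ = nullptr;
};

Note that with a fixed -framerate the output duration is still (frames / fps), so if the callback rate fluctuates a lot, the gstreamer route with per-frame timestamps is the more robust option; either way, encoding as frames arrive avoids the copy-every-10-seconds step and keeps the audio much closer to the video.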

Let me know if this helps and if you have other questions here!

Another alternative is to use Recall.ai for your meeting bots instead. It’s a simple 3rd party API that lets you use meeting bots to get raw audio/video from meetings without you needing to spend months to build, scale and maintain these bots. We’ve encountered all of the same issues you’ve experienced and developed a service that allows you to abstract away the complexities and implementation details of meeting bots so that you can focus on building your core product features.

@amanda-recallai Can you please share a fuller example of how to encode the frames in real time in the callback?

@gofmannir did you have any luck with encoding frames in real time? Would you mind sharing your learnings? Thanks!