Skip to content

add audio support in cap_msmf #19721

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Oct 20, 2021
Merged

add audio support in cap_msmf #19721

merged 19 commits into from
Oct 20, 2021

Conversation

MaximMilashchenko
Copy link
Contributor

@MaximMilashchenko MaximMilashchenko commented Mar 13, 2021

Merge with extra: opencv/opencv_extra#864

@alalek, @allnes, @fzhar
Issue #16394
I attach a link on opencv_extra's PR opencv/opencv_extra#864 .

@MaximMilashchenko MaximMilashchenko changed the title audio add audio support in cap_msmf Mar 13, 2021
@MaximMilashchenko
Copy link
Contributor Author

@alalek, I fixed the errors, but there is a warning because the patch size is exceeded. Also, the build failed on one of the Jenkins.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

patch size opencv_extra: 114698 KiB

Where is the link on opencv_extra's PR?

114 Mb is very large. Existed video test data is even smaller. Test data must be reduced.


@allnes do we have an issue about Audio support? Link should be added.


PR's description should not be empty. At least it should define the scope of work, how/what should be validated, prerequisites, etc.

@@ -315,6 +315,9 @@ enum
CV_CAP_PROP_XI_SENSOR_FEATURE_SELECTOR = 585, // Selects the current feature which is accessible by XI_PRM_SENSOR_FEATURE_VALUE.
CV_CAP_PROP_XI_SENSOR_FEATURE_VALUE = 586, // Allows access to sensor feature value currently selected by XI_PRM_SENSOR_FEATURE_SELECTOR.

//Properties of audio VideoIO
CV_CAP_PROP_AUDIO_ENABLE = 1000, // Select audio or video
CV_CAP_PROP_BPS = 1001, // Change bit_per_sample parametr for audio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

legacy/constants_c.h

Legacy files must not be touched without critical reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this file to use the same style of property names inside the backend (CV_CAP_PROP_...). If I don't change the file then I need to use CAP_PROP_...

@@ -185,6 +185,8 @@ enum VideoCaptureProperties {
CAP_PROP_ORIENTATION_AUTO=49, //!< if true - rotates output frames of CvCapture considering video file's metadata (applicable for FFmpeg back-end only) (https://github.com/opencv/opencv/issues/15499)
CAP_PROP_HW_ACCELERATION=50, //!< (**open-only**) Hardware acceleration type (see #VideoAccelerationType). Setting supported only via `params` parameter in cv::VideoCapture constructor / .open() method. Default value is backend-specific.
CAP_PROP_HW_DEVICE =51, //!< (**open-only**) Hardware device index (select GPU if multiple available)
CAP_PROP_AUDIO_ENABLE =1000,
CAP_PROP_BPS =1001,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAP_PROP_BPS

What does that mean in the context of VideoCapture?
"Bytes per second"? (the most incorrect case)

@@ -185,6 +185,8 @@ enum VideoCaptureProperties {
CAP_PROP_ORIENTATION_AUTO=49, //!< if true - rotates output frames of CvCapture considering video file's metadata (applicable for FFmpeg back-end only) (https://github.com/opencv/opencv/issues/15499)
CAP_PROP_HW_ACCELERATION=50, //!< (**open-only**) Hardware acceleration type (see #VideoAccelerationType). Setting supported only via `params` parameter in cv::VideoCapture constructor / .open() method. Default value is backend-specific.
CAP_PROP_HW_DEVICE =51, //!< (**open-only**) Hardware device index (select GPU if multiple available)
CAP_PROP_AUDIO_ENABLE =1000,
CAP_PROP_BPS =1001,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper documentation must be added.

@@ -185,6 +185,8 @@ enum VideoCaptureProperties {
CAP_PROP_ORIENTATION_AUTO=49, //!< if true - rotates output frames of CvCapture considering video file's metadata (applicable for FFmpeg back-end only) (https://github.com/opencv/opencv/issues/15499)
CAP_PROP_HW_ACCELERATION=50, //!< (**open-only**) Hardware acceleration type (see #VideoAccelerationType). Setting supported only via `params` parameter in cv::VideoCapture constructor / .open() method. Default value is backend-specific.
CAP_PROP_HW_DEVICE =51, //!< (**open-only**) Hardware device index (select GPU if multiple available)
CAP_PROP_AUDIO_ENABLE =1000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=1000

Why? Why is not 10000000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted the value in this file to match the values in the constants_c.h file.

@@ -0,0 +1,150 @@
#include <tuple>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mandatory requirements of OpenCV library:

  • proper license header
  • test_precomp.hpp must be included first.

@@ -185,6 +185,8 @@ enum VideoCaptureProperties {
CAP_PROP_ORIENTATION_AUTO=49, //!< if true - rotates output frames of CvCapture considering video file's metadata (applicable for FFmpeg back-end only) (https://github.com/opencv/opencv/issues/15499)
CAP_PROP_HW_ACCELERATION=50, //!< (**open-only**) Hardware acceleration type (see #VideoAccelerationType). Setting supported only via `params` parameter in cv::VideoCapture constructor / .open() method. Default value is backend-specific.
CAP_PROP_HW_DEVICE =51, //!< (**open-only**) Hardware device index (select GPU if multiple available)
CAP_PROP_AUDIO_ENABLE =1000,
Copy link
Member

@alalek alalek Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Select audio or video

We should support processing of both video and audio streams. This is why we are integrating into VideoCapture instead of adding dedicated Audio API.

The next requirement is that video and audio must be synchronized in a some way (need to define these details and write them somewhere. Other "backends" and user apps should follow there guidelines).

Also we need these properties:

  • (Priority 0) CAP_PROP_VIDEO_STREAM - (open-only, 0-based index or -1 to disable) - Default value is 0.
  • (Priority 0) CAP_PROP_AUDIO_STREAM - (open-only, 0-based index or -1 to disable) - Specify stream in multi-language media files. -1 - disable audio processing (default)
  • (Priority 1) CAP_PROP_AUDIO_POS - (read-only, in samples) accurate audio sample timestamp of previous grabbed fragment. See CAP_PROP_AUDIO_SAMPLES_PER_SECOND.
  • (Priority 1) CAP_PROP_AUDIO_DATA_DEPTH - (open-only, Mat::depth()) Default value is -1: use "native" file/codec information. Alternative definition to bits-per-sample, but with clear handling of 32F / 32S.
  • (Priority 1) CAP_PROP_AUDIO_SAMPLES_PER_SECOND - (open-only) 0 - determine from file/codec input. If not specified through parameters or input, then selected audio sample rate is 44100. Note: resampling is not performed by OpenCV.
  • (Priority 2) CAP_PROP_AUDIO_CHANNELS - (open-only?, bitset) - to properly handle stereo or 5.1 streams. ? 0 - use average of all channels. Use -1 to grab all channels. Default value is 0 / -1 ?
  • (Priority 3) CAP_PROP_AUDIO_TOTAL_SAMPLES - (read-only) Number of samples in the file. Returns zero if information is not available (live streams).
  • (Priority 3) CAP_PROP_AUDIO_TOTAL_CHANNELS - (read-only) Number of audio channels in the selected audio stream.
  • (Priority 4) CAP_PROP_AUDIO_TOTAL_STREAMS - (read-only) Number of audio stream in the used media.

Update 2021-04-30:

  • drop CAP_PROP_AUDIO_CHANNELS due to confusion with CAP_PROP_AUDIO_TOTAL_CHANNELS.
  • will be replaced by CAP_PROP_AUDIO_CHANNELS_RETRIEVE_MODE later (not in the scope of this PR)

TBD later

int apiID = cv::CAP_MSMF;
//congigurate VideoCapture for video and audio
VideoCapture cap_video;
VideoCapture cap_audio;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong way.

As said above, we are integrating into VideoCapture to handle both streams instead of adding a dedicated Audio API.

return -1;
}
// open selected micro using selected API
std::vector<int> params { CAP_PROP_AUDIO_ENABLE , static_cast<int>(1) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast(1)

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guided on creating a vector of parameters in a video writer (cap.cpp). There the bool cast to int, in my case it is not required. I will correct this.

// open selected micro using selected API
cap.open(0, apiID, params);
if (!cap.isOpened()) {
cerr << "ERROR! Can't to open file\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\n

There is std::endl in C++.

@allnes allnes linked an issue Mar 25, 2021 that may be closed by this pull request
@asenyaev
Copy link
Contributor

asenyaev commented Apr 8, 2021

jenkins cn please retry a build

Comment on lines 203 to 208
enum DeviceSatus
{
NONDEVICE = -1,
CAMERA = 0,
MICROPHONE = 1
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why users need that in public API?

@@ -315,7 +315,6 @@ enum
CV_CAP_PROP_XI_SENSOR_FEATURE_SELECTOR = 585, // Selects the current feature which is accessible by XI_PRM_SENSOR_FEATURE_VALUE.
CV_CAP_PROP_XI_SENSOR_FEATURE_VALUE = 586, // Allows access to sensor feature value currently selected by XI_PRM_SENSOR_FEATURE_SELECTOR.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert all changes in this file, including touching of empty lines.

res.majorType = MFMediaType_Audio;
res.subType = MFAudioFormat_PCM;
res.bit_per_sample = 32;
res.nChannels = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should start from handling of 'Mono' audio streams

const double thisRateDiff = absDiff(getFramerate(), ref.getFramerate());
const double otherRateDiff = absDiff(other.getFramerate(), ref.getFramerate());
if (thisRateDiff < otherRateDiff)
if (width > other.width)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename original function to handle "video"only data and create new one for audio and "generic".

Keep patches small, code history clear.

if (thisDiff < otherDiff)
return true;
if (thisDiff == otherDiff)
if(majorType == MFMediaType_Video)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code format: if<space>()
if is not a function, it is C++ statement.

The same note is about for()

const int audio_base_index = cap.get(cv::CAP_PROP_AUDIO_BASE_INDEX);
for (;;)
{
cap.read(video_frame);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't lost returned results.
Show how to handle them.

@@ -185,11 +185,28 @@ enum VideoCaptureProperties {
CAP_PROP_ORIENTATION_AUTO=49, //!< if true - rotates output frames of CvCapture considering video file's metadata (applicable for FFmpeg back-end only) (https://github.com/opencv/opencv/issues/15499)
CAP_PROP_HW_ACCELERATION=50, //!< (**open-only**) Hardware acceleration type (see #VideoAccelerationType). Setting supported only via `params` parameter in cv::VideoCapture constructor / .open() method. Default value is backend-specific.
CAP_PROP_HW_DEVICE =51, //!< (**open-only**) Hardware device index (select GPU if multiple available)
CAP_PROP_ENABLE_MICROPHONE = 52,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAP_PROP_ENABLE_MICROPHONE

Why do we need dedicated property for that?

CAP_PROP_AUDIO_SAMPLES_PER_SECOND = 57,
CAP_PROP_AUDIO_CHANNELS = 58,
CAP_PROP_AUDIO_BASE_INDEX = 59,
CAP_PROP_AUDIO_TOTAL_CHANNELS = 60,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAP_PROP_AUDIO_CHANNELS
CAP_PROP_AUDIO_TOTAL_CHANNELS

Documentation should be added.

@@ -951,15 +1139,14 @@ bool CvCapture_MSMF::open(const cv::String& _filename, const cv::VideoCapturePar
}
else
duration = 0;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read OpenCV contribution guidelines on Wiki and install necessary Git hooks.

MediaType captureFormat;
MediaType captureVideoFormat;
MediaType captureAudioFormat;
bool deviceType; // false - camera, true - audio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool deviceType; // false - camera, true - audio

... and somewhere we have recorded streams too.

This looks overcomplicated.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment with properties proposals too.

}
else
{
cv::Mat(1, cursize, CV_8UC1, ptr, pitch).copyTo(frame);
//switch(index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is cv::extractChannel() call in OpenCV which can help to extract channel from interleaved data.

@Samson-Mayeem
Copy link

Can i get a common structure of the OpenCv development here, so i can know the idea forward, relative to this. i have been busy and could not follow a lot of trends on opencv. can one just make a simple summary?

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All CI builds are failed due to merge conflict.
Squash commits and then rebase on upstream branch (in that order).

Comment on lines 5 to 6
#include <fstream>
#include <stdio.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fstream
stdio.h

Used?

}

const int audioBaseIndex = cap.get(cv::CAP_PROP_AUDIO_BASE_INDEX);
const int numberOfChannels = cap.get(cv::CAP_PROP_AUDIO_TOTAL_CHANNELS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cv::

we have using namespace cv in all samples to avoid using of cv:: prefix

Comment on lines 44 to 45
cout << "Number of samples: " << cap.get(cv::CAP_PROP_AUDIO_POS) << endl;
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How we can get here?

This is the "error" path.
.grab() should handle normal end-of-stream handling.

Comment on lines 45 to 48
if (audioFrame.empty() && videoFrame.empty())
{
cerr << "ERROR! blank frame grabbed" << endl;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why .grab() didn't handle that?

Comment on lines 29 to 30
cap.grab();
cap.retrieve(frame, audio_base_index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing check of the returned results.

VideoCapture cap;
};

class aud : public AudioTestFixture{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aud

What is the problem to use normal name?

void comparison()
{
for(int i = 0; i < validData.size(); i++)
ASSERT_TRUE(fabs(validData[i] - fileData[i]) < epsilon);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google test provides ASSERT_NEAR with accurate error reporting.

Also use (...) << i; to dump index at least.

void comparison()
{
for(int i = 0; i < validData.size(); i++)
ASSERT_TRUE(fabs(validData[i] - fileData[i]) < epsilon);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fileData[i]

missing check for buffer overflow.

Comment on lines 38 to 73
for(int j = 0; j < 44100*3; j += 44100)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is not from 0 to 3?

@ebachard
Copy link

Hello,

@MaximMilashchenko : Is it possible to attach a full diff (with master) ? I'd like to test the whole changes.

Thanks in advance,
@ebachard

@alalek
Copy link
Member

alalek commented Jul 14, 2021

@ebachard Check links at page footer: "ProTip! Add .patch or .diff to the end of URLs for Git’s plaintext views."

@ebachard
Copy link

ebachard commented Jul 14, 2021

@alalek : thanks for the tip. In fact, in meantime I created a full diff including new created files, and I'll give it a try. First, I'll do a (personal) code review, to understand how things work, and then I'll experiment.

Thanks again !

@MaximMilashchenko MaximMilashchenko force-pushed the Audio branch 2 times, most recently from 86f5c78 to a7e9d4d Compare August 26, 2021 15:25
CAP_PROP_AUDIO_BASE_INDEX = 60, //!< Number of video channels
CAP_PROP_AUDIO_TOTAL_CHANNELS = 61, //!< Number of audio channels in the selected audio stream.
CAP_PROP_AUDIO_TOTAL_STREAMS = 62, //!< Number of audio stream in the used media.
CAP_PROP_SYNC_LAST_FRAME = 63, //!< Defult value is 1 (the last audio frame is synchronized with the video frame by duration), 0 is no audio and video last frames sync(the last audio frame will contain all remaining audio data. The duration of the received audio data may be longer than the duration of the received video data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to keep AUDIO prefix:

CAP_PROP_SYNC_LAST_FRAME => CAP_PROP_AUDIO_SYNC_LAST_FRAME

CAP_PROP_AUDIO_POS = 57, //!< Audio position is measured in samples. Accurate audio sample timestamp of previous grabbed fragment. See CAP_PROP_AUDIO_SAMPLES_PER_SECOND
CAP_PROP_AUDIO_DATA_DEPTH = 58, //!< Alternative definition to bits-per-sample, but with clear handling of 32F / 32S
CAP_PROP_AUDIO_SAMPLES_PER_SECOND = 59, //!< determined from file/codec input. If not specified, then selected audio sample rate is 44100
CAP_PROP_AUDIO_BASE_INDEX = 60, //!< Number of video channels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of video channels

(read-only) Index of first audio channel. Used as parameter for .retrieve() call.


Perhaps we should have: CAP_PROP_VIDEO_TOTAL_CHANNELS (currently with the same value)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it's not clear why Audio_base_index sets number of video channels

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do both. (add total and rename to VIDEO_TOTAL_CHANNELS). Also, you need to define clearly in docs that audio channel number is not zero-indexed, it continues enumeration after video channels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of video channels

(read-only) Index of first audio channel. Used as parameter for .retrieve() call.

Perhaps we should have: CAP_PROP_VIDEO_TOTAL_CHANNELS (currently with the same value)

What do I have to do here: to rename CAP_PROP_AUDIO_BASE_INDEX=>CAP_PROP_VIDEO_TOTAL_CHANNELS or to add a new CAP_PROP_VIDEO_TOTAL_CHANNELS property and to change the CAP_PROP_AUDIO_BASE_INDEX description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • CAP_PROP_AUDIO_BASE_INDEX should be here with updated comment (see above).
  • add new property CAP_PROP_VIDEO_TOTAL_CHANNELS (for RGBD or stereo streams/cameras)

@@ -69,7 +69,7 @@ static void init_MFCreateDXGIDeviceManager()
#endif

#include <mferror.h>

#include <fstream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this?

virtual bool grabFrame() CV_OVERRIDE;
bool retrieveAudioFrame(int, cv::OutputArray);
bool retrieveVideoFrame(cv::OutputArray);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cv::

Should not be used in OpenCV code.

Exception is cv::format() (to avoid problem with std::format())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cv::

Should not be used in OpenCV code.

Exception is cv::format() (to avoid problem with std::format())

In master, cv:: is used with this argument. Is it necessary to remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove from modified/added code.

No need to change unmodified parts.

buf = NULL;
}
audioSamples.clear();
int numberOfByte = (videoStream != -1) ? (int)((double)(curVideoTime/1e7)*captureAudioFormat.nSamplesPerSec*captureAudioFormat.nChannels*(captureAudioFormat.bit_per_sample)/8) : cursize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curVideoTime/1e7

It is not accurate with integer numbers:

LONGLONG curVideoTime;

Division should be the last operation of expression (also avoid overflow - int64_t should be enough).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberOfByte is confusing. What is it - start of data chunk or chunk's length ? Please rename (using start or length).
What is magic number 1e7 - does it come from MSFT measurements in units of 0.1us ?
(double) cast is extra here, 1e7 should be double and will autocast everything else to double.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curVideoTime/1e7

It is not accurate with integer numbers:

LONGLONG curVideoTime;

Division should be the last operation of expression (also avoid overflow - int64_t should be enough).

Mat accepts arguments with the int type. If we take int64_t numberOfByte, then the conversion of int64_t types to int will occur in the Mat constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • compute as int64_t
  • avoid doubles / floating-point computations
  • check range of final result and cast to "int" if needed

Comment on lines 74 to 75
ASSERT_TRUE(cap.retrieve(videoFrame));
ASSERT_TRUE(cap.retrieve(audioFrame, audioBaseIndex));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide useful messages about failed checks. Add number of processed frame.
There are different cases with failures on frame 0 (nothing works here), and some frame 345 (something works).

Use GoogleTest SCOPED_TRACE for that.

{
ASSERT_TRUE(cap.retrieve(videoFrame));
ASSERT_TRUE(cap.retrieve(audioFrame, audioBaseIndex));
ASSERT_EQ(audioFrame.cols/44100, 1/fps); // check if the duration of the received audio data satisfies one video frame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check may fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If the duration of the received audio data does not equal to one frame of the video, the check will not be passed. This is possible if synchronization does not work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided code doesn't look valid. Looks like it always compare 0 vs 0.

Use this instead:

int samplesPerFrame = 1/fps*44100;
// 44100 should be checked and replaced to CAP_PROP_AUDIO_SAMPLES_PER_SECOND property.

int audioSamplesTolerance = samplesPerFrame / 2;

We should have these checks:

  • position: abs(CAP_PROP_AUDIO_POS - CAP_PROP_POS_MSEC / 1000 * 44100) < audioSamplesTolerance
  • duration: abs(audioFrame.cols - samplesPerFrame) < audioSamplesTolerance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided code doesn't look valid. Looks like it always compare 0 vs 0.

Use this instead:

int samplesPerFrame = 1/fps*44100;
// 44100 should be checked and replaced to CAP_PROP_AUDIO_SAMPLES_PER_SECOND property.

int audioSamplesTolerance = samplesPerFrame / 2;

We should have these checks:

  • position: abs(CAP_PROP_AUDIO_POS - CAP_PROP_POS_MSEC / 1000 * 44100) < audioSamplesTolerance
  • duration: abs(audioFrame.cols - samplesPerFrame) < audioSamplesTolerance

The CAP_PROP_POS_MSEC value is not always valid

ASSERT_EQ(audioFrame.cols/44100, 1/fps); // check if the duration of the received audio data satisfies one video frame
if (!videoFrame.empty())
videoData.push_back(videoFrame);
for (int i = 0; i < audioFrame.cols; i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add check for audioFrame.type() as code below assumes CV_16SC1.

}
ASSERT_FALSE(fileData.empty());
}
void comparison()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkAudio
validateAudio

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkAudio
validateAudio

What does this mean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion to use more clear name

@@ -246,7 +299,7 @@ struct MediaType
return wdiff + hdiff;
}
// check if 'this' is better than 'other' comparing to reference
bool isBetterThan(const MediaType& other, const MediaType& ref) const
bool VideoIsBetterThan(const MediaType& other, const MediaType& ref) const
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually video quality is compared not only by resolution, but also by bitrate and FRC mode

buf = NULL;
}
audioSamples.clear();
int numberOfByte = (videoStream != -1) ? (int)((double)(curVideoTime/1e7)*captureAudioFormat.nSamplesPerSec*captureAudioFormat.nChannels*(captureAudioFormat.bit_per_sample)/8) : cursize;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberOfByte is confusing. What is it - start of data chunk or chunk's length ? Please rename (using start or length).
What is magic number 1e7 - does it come from MSFT measurements in units of 0.1us ?
(double) cast is extra here, 1e7 should be double and will autocast everything else to double.

}
}
_ComPtr<IMFMediaBuffer> buf = NULL;
BYTE* useAudioData = NULL;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

audioDataBuffer or audioDataInUse ? useAudioData sounds like function name or boolean flag

Comment on lines 1683 to 1673
for (int i = 0; i < numberOfByte; i++)
{
useAudioData[i] = bufferAudioData[i];
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we use memcpy_s or std::copy (in case of using std::dequeue instead of vector for bufferAudioData)?

LONGLONG audioSamplePos;
DWORD numberOfAudioStreams;
Mat audioFrame;
std::vector<BYTE> bufferAudioData;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as this buffer will be used as queue for samples, I'd suggest to use ring queue or std::deque to speed it up. This will be noticeable on long queues only though.


if (!SUCCEEDED(buf->Lock(&ptr, &maxsize, &cursize)))
break;
for (unsigned int i = 0; i < cursize; i++)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do .reserve first to speed it up. Or do resize and just fill the data to the tail.

}
if (!audioFrame.empty())
{
audioData.push_back(audioFrame);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, it would be good to do something with decoded data. As long as we have live video,you may paint oscillogram of decoded chunk over the video frame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fzhar Audio writer is out of scope of this PR. It would be added later.

for (int nCh = 0; nCh < numberOfChannels; nCh++)
{
cap.retrieve(frame, audioBaseIndex+nCh);
audioData.push_back(frame);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Writing audio to the file would be nice.

{
DWORD streamIndex, flags;
HRESULT hr;
std::vector<bool> outInstalFlag;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need vector of flags here ? At the end they are just and-ed, mybe better use single boolean flag instead of vector ?

}
else if (isOpen)
{
std::vector<bool> outInstalFlag;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector is really extra here. jut use single bool

}
else
{
sampleTime += frameStep;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sampleTime += frameStep;

This may provide inaccurate values.
Use timestamp value from ReadSample. See code related to m_lastSampleTimestamp.

Copy link
Contributor Author

@MaximMilashchenko MaximMilashchenko Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sampleTime += frameStep;

This may provide inaccurate values.
Use timestamp value from ReadSample. See code related to m_lastSampleTimestamp.

Timestamp value from Read Sample does not always indicate the end of the captured sample. Example, in the test case of {"mov", "H 264", 30. f, CAP_MSMF} (videoio_synthetic, write_read_position ) the time stamp in ReadSample is set at the beginning of the captured frame and for correct operation it is necessary to calculate SampleTime= + frameStep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the end of the captured sample
end

This is not needed at all and can NOT be properly determined for VFR videos (without reading of the next frame).

Use provided value from the decoder/demuxer.
Compare results with FFmpeg backend.

Comment on lines 1850 to 1851
case CV_CAP_PROP_POS_FRAMES:
return floor(((double)sampleTime / 1e7)* captureFormat.getFramerate() + 0.5);
return floor(((double)sampleTime / 1e7)* captureVideoFormat.getFramerate() + 0.5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use direct frame counter by counting of grab() / ReadSample() calls.

Update also ::setTime() method. Seek to zero should be accurate, others are not (perhaps we should not return property value after that)

typedef std::tuple<std::string, double, int, int, int, int, int, double, std::pair<std::string, int> > paramCombination;
typedef std::tuple<std::string, double, std::pair<std::string, int> > param;

class baseAudio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

baseAudio must start with capital letter

baseAudio => AudioBaseTest

Comment on lines 164 to 177
void getValidAudioData()
{
const double step = 3.14/22050;
double value = 0;
for (int j = 0; j < 3; j++)
{
value = 0;
for (int i = 0; i < 44100; i++)
{
validAudioData.push_back(sin(value));
value += step;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code duplication

Comment on lines 222 to 228
void comparisonAudio()
{
for (unsigned int i = 0; i < audioData.size(); i++)
{
EXPECT_LE(fabs(validAudioData[i] - audioData[i]), epsilon) << "sample index " << i;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code duplication

EXPECT_LE(fabs(validAudioData[i] - audioData[i]), epsilon) << "sample index " << i;
}
}
void comparisonVideo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparisonVideo => checkVideoFrames

Comment on lines 68 to 81
void getValidAudioData()
{
const double step = 3.14/22050;
double value = 0;
for(int j = 0; j < 3; j++)
{
value = 0;
for(int i = 0; i < 44100; i++)
{
validAudioData.push_back(sin(value));
value += step;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code duplication

Comment on lines 52 to 55
int numberOfSamles = 0;
for (auto item : audioData)
numberOfSamles+=item.cols;
cout << "Number of samples: " << numberOfSamles << endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move out post processing code from the for loop.

Keep for loops readable - minimal as possible.

cap.open(0, CAP_MSMF, params);
if (!cap.isOpened())
{
cerr << "ERROR! Can't to open file" << endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file

there is no file

const double cvTickFreq = getTickFrequency();
int64 sysTimeCurr = getTickCount();
int64 sysTimePrev = sysTimeCurr;
while ((sysTimeCurr-sysTimePrev)/cvTickFreq < 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to emit message that audio would be captured for the next 10 seconds.
To avoid confusion with program hang.

Comment on lines +36 to +37
if (cap.grab())
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"else" handling is missing.

What if user disconnect microphone device?

}
if (!audioFrame.empty())
{
audioData.push_back(audioFrame);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fzhar Audio writer is out of scope of this PR. It would be added later.

Comment on lines 1666 to 1675
try
{
if (chunkLengthOfBytes < INT_MIN || chunkLengthOfBytes > INT_MAX)
throw "The chunkLengthOfBytes is out of the allowed range";
}
catch (const std::exception& e)
{
CV_LOG_WARNING(NULL, "MSMF: Exception is raised: " << e.what());
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't work properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How I can fix that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look for similar code in the library. Use CV_Check.

}
copy(bufferAudioData.begin(), bufferAudioData.begin()+chunkLengthOfBytes, std::back_inserter(audioDataInUse));
bufferAudioData.erase(bufferAudioData.begin(), bufferAudioData.begin()+chunkLengthOfBytes);
residualTime = (double)(bufferAudioData.size()/((captureAudioFormat.bit_per_sample/8)*captureAudioFormat.nChannels))/captureAudioFormat.nSamplesPerSec;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

residualTime

Wrong logic is here.
.retrive() must be idempotent implementation. It can be called many times with same parameters or not called at all.
It is wrong to modify residualTime variable.

State must be changed during .grab() call only.

Comment on lines 1505 to 1509
bool returnFlag = true;
if (videoStream != -1)
returnFlag &= grabVideoFrame();
if (audioStream != -1)
returnFlag &= grabAudioFrame();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we fail to capture video, then there is no reason to capture audio. We already failed.

0, // Flags.
&streamIndex, // Receives the actual stream index.
&flags, // Receives status flags.
&sampleTime, // Receives the time stamp.
&videoSample // Receives the sample or NULL.
NULL, // Receives the time stamp.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why audio ignores timestamp?

This is why the current logic fails on sample_960x400_ocean_with_audio.mp4

Add debug dump of values from ReadSample:

CvCapture_MSMF::initStream Init stream 1 with MediaType (960x400 @ 23.976) MFVideoFormat_RGB32
CvCapture_MSMF::initStream Init stream 0 with MediaType (0x0 @ 1) ☺
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=1668332
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=213333
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=206349
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=419682
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
...
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=2085415
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=633061
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=846441
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213152
...
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=2502498
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=1059593
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=1272972
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379

Audio stream starts 4+ frames before video. We need to capture all of this audio data and return with the first video frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why audio ignores timestamp?

This is why the current logic fails on sample_960x400_ocean_with_audio.mp4

Add debug dump of values from ReadSample:

CvCapture_MSMF::initStream Init stream 1 with MediaType (960x400 @ 23.976) MFVideoFormat_RGB32
CvCapture_MSMF::initStream Init stream 0 with MediaType (0x0 @ 1) ☺
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=1668332
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=213333
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=206349
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=419682
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
...
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=2085415
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=633061
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=846441
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213152
...
CvCapture_MSMF::grabVideoFrame ReadSample(video): stream=1 time=2502498
CvCapture_MSMF::grabVideoFrame ReadSample(video): GetSampleDuration=417083
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=1059593
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379
CvCapture_MSMF::grabAudioFrame ReadSample(audio): stream=0 time=1272972
CvCapture_MSMF::grabAudioFrame ReadSample(audio): GetSampleDuration=213379

Audio stream starts 4+ frames before video. We need to capture all of this audio data and return with the first video frame.
Audio ignores timestamp because there was no need to use it. I will fix it.

Comment on lines 1795 to 1799
bool CvCapture_MSMF::setTime(int numberFrame)
{
if(videoStream == -1)
return false;
PROPVARIANT var;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should forbid seeking in case of Video+Audio capturing, except seek to beginning of the media file.

Other values are just not accurate even for Video-only cases.

The SetCurrentPosition method does not guarantee exact seeking. The accuracy of the seek depends on the media content.

https://docs.microsoft.com/en-us/windows/win32/api/mfreadwrite/nf-mfreadwrite-imfsourcereader-setcurrentposition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should forbid seeking in case of Video+Audio capturing, except seek to beginning of the media file.

Other values are just not accurate even for Video-only cases.

The SetCurrentPosition method does not guarantee exact seeking. The accuracy of the seek depends on the media content.

https://docs.microsoft.com/en-us/windows/win32/api/mfreadwrite/nf-mfreadwrite-imfsourcereader-setcurrentposition

We can manually rewind to any frame by setting SetCurrentPosition to 0, call grab and release the frames until we reach the desired one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call grab and release the frames until we reach the desired one

Then the next step will be receiving a bug report about why seeking implementation is so slow.
If we want to implement feature, then we should it implemented in the best way. Otherwise we would get unresolvable threads like these: #9053

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should forbid seeking in case of Video+Audio capturing, except seek to beginning of the media file.

Then I can forbid setting CV_CAP_PROP_YPOS_FRAMES and CV_CAP_PROP_YPOS_MSC in case of Video+Audio capturing and allow setting CV_CAP_PROP_POS_AVI_RATIO for values 0 and 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except seek to beginning of the media file.

Seek to zero should be supported for all modes (this case is well defined and just works).

Comment on lines 200 to 201
if (!videoFrame.empty())
videoData.push_back(videoFrame);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

videoFrame.empty()

Must be always false on .grab() success.

Current last frame synchronization logic doesn't work properly.


const paramCombination mediaParams[] =
{
paramCombination("mp4", 1, 0.15, CV_8UC3, 240, 320, 90, 30, 30., {"CAP_MSMF", cv::CAP_MSMF})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"CAP_MSMF", cv::CAP_MSMF

Which other tests uses the same scheme? Why we need string here?

}
ASSERT_LT(abs(cap.get(CAP_PROP_AUDIO_POS) - (cap.get(CAP_PROP_POS_MSEC)/ 1000 + 1./fps) * samplePerSecond), audioSamplesTolerance);
ASSERT_LT(abs(audioFrame.cols - samplesPerFrame), audioSamplesTolerance);
ASSERT_EQ(CV_16SC1, audioFrame.type());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type check is placed after data usage.

}
}
ASSERT_LT(abs(cap.get(CAP_PROP_AUDIO_POS) - (cap.get(CAP_PROP_POS_MSEC)/ 1000 + 1./fps) * samplePerSecond), audioSamplesTolerance);
ASSERT_LT(abs(audioFrame.cols - samplesPerFrame), audioSamplesTolerance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

audioFrame.cols must be same across channels.

audioFrame.cols should not be checked for the first frame (or multiple no-audio frames in the beginning) and the last frame (due to the last frame synchronization, which reads audio till the end).

Copy link
Contributor Author

@MaximMilashchenko MaximMilashchenko Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

audioFrame.cols must be same across channels.

Do I need to check it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, test should check that

audioData[nCh].push_back(f);
}
}
ASSERT_LT(abs(cap.get(CAP_PROP_AUDIO_POS) - (cap.get(CAP_PROP_POS_MSEC)/ 1000 + 1./fps) * samplePerSecond), audioSamplesTolerance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this check to work we need one more read-only property which contains timestamp shift between audio and video streams (of the first captured data)

timestamp shift should be in nanoseconds.

@MaximMilashchenko
Copy link
Contributor Author

How the pipeline should behave when the audio stream is shorter than the video stream? that is, the audio has an offset at the beginning and may end earlier

@alalek
Copy link
Member

alalek commented Oct 1, 2021

Last frames should return empty audio data. Audio data for the first frame is larger that for others (due to audio starts before video)

@spazewalker spazewalker mentioned this pull request Oct 3, 2021
8 tasks
Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run test locally with

paramCombination("sample_960x400_ocean_with_audio.mp4", 2, 0.15, CV_8UC3, 400, 960, 1116, 30, 30., cv::CAP_MSMF)

Several test conditions still fail (besides of audio gold results).

for (unsigned int nCh = 0; nCh < audioData.size(); nCh++)
for (unsigned int i = 0; i < audioData[nCh].size(); i++)
{
EXPECT_LE(fabs(validAudioData[nCh][i] - audioData[nCh][i]), epsilon) << "sample index " << i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, if readFile() fails or read more data then that code tried to perform out of buffer access.


nCh < audioData.size()

Checks constraints based on input data are logically incorrect. Gold values and test parameters must be used instead.

if (nCh == 0)
audioFrameCols = audioFrame.cols;
else
ASSERT_EQ(audioFrameCols, audioFrame.cols);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<< nCh; or use SCOPED_TRACE with nCh value.

if (!audioFrame.empty())
if (frame != numberOfFrames-1)
{
ASSERT_LT(abs(cap.get(CAP_PROP_AUDIO_POS) + (cap.get(CAP_PROP_TIME_SHIFT_STREAMS)/ 1e6) * samplePerSecond - (cap.get(CAP_PROP_POS_MSEC)/ 1000 + 1./fps) * samplePerSecond), audioSamplesTolerance);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is failed with:

paramCombination("sample_960x400_ocean_with_audio.mp4", 2, 0.15, CV_8UC3, 400, 960, 1116, 30, 30., cv::CAP_MSMF)

See audioSamplePos comment above.


1./fps

This is wrong. Both positions must point to be beginning of the data. You just can't estimate the end for VFR videos.
Check must look like this:

EXPECT_NEAR(
        cap.get(CAP_PROP_AUDIO_POS),
        ((cap.get(CAP_PROP_POS_MSEC) - cap.get(CAP_PROP_TIME_SHIFT_STREAMS) / 1e6) / 1000) * samplePerSecond,
        audioSamplesTolerance)
    << cap.get(CAP_PROP_POS_MSEC)=" << cap.get(CAP_PROP_POS_MSEC);

Copy link
Contributor Author

@MaximMilashchenko MaximMilashchenko Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check must look like this:

EXPECT_NEAR(
        cap.get(CAP_PROP_AUDIO_POS),
        ((cap.get(CAP_PROP_POS_MSEC) - cap.get(CAP_PROP_TIME_SHIFT_STREAMS) / 1e6) / 1000) * samplePerSecond,
        audioSamplesTolerance)
    << cap.get(CAP_PROP_POS_MSEC)=" << cap.get(CAP_PROP_POS_MSEC);

This is how the check should look, only for the first frame. After reading the first frame, the timestamps of audio and video will not need to align and take into account the offset

if (frame == 0)
{
EXPECT_NEAR(
cap.get(CAP_PROP_AUDIO_POS),
((cap.get(CAP_PROP_POS_MSEC) + cap.get(CAP_PROP_AUDIO_SHIFT_NSEC) / 1e6) / 1000) * samplePerSecond,
audioSamplesTolerance)
<< "CAP_PROP_POS_MSEC = " << cap.get(CAP_PROP_POS_MSEC);
}
else
{
EXPECT_NEAR(cap.get(CAP_PROP_AUDIO_POS), (cap.get(CAP_PROP_POS_MSEC) / 1000) * samplePerSecond, audioSamplesTolerance) << "CAP_PROP_POS_MSEC = " << cap.get(CAP_PROP_POS_MSEC);
EXPECT_LT(abs(audioFrame.cols - samplesPerFrame), audioSamplesTolerance);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, CAP_PROP_AUDIO_SHIFT_NSEC must be applied for all frames.

frame == 0 check is EXPECT_EQ(0, cap.get(CAP_PROP_AUDIO_POS));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, CAP_PROP_AUDIO_SHIFT_NSEC must be applied for all frames.

Calculations will be erroneous. From CAP_PROP_POS_MSEC, you need to subtract the audio offset relative to the zero of the media file, and not the time difference between the start of the audio stream and the video stream

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check is updated. Please fetch the last commit.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for update!

nFrame++;
usedVideoSample->GetSampleDuration(&videoSampleDuration);
requiredAudioTime = usedVideoSampleTime + videoSampleDuration - allTime;
allTime += requiredAudioTime;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allTime

unclear name which is used for audio only.

Comment on lines 1457 to 1459
if (residualTime*1e7 > requiredAudioTime)
return true;
while ((!vEOS) ? bufferedAudioDuration <= requiredAudioTime : !aEOS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

residualTime

This is some kind of duplicate for bufferedAudioDuration (but with unclear name)

Comment on lines 1548 to 1550
if (returnFlag)
if (!audioSamples.empty() || !bufferAudioData.empty() && aEOS)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to indent such huge code blocks:

if (!returnFlag)
{
    CV_LOG_DEBUG(...);
    return false;
}

... code block ...

Comment on lines 1508 to 1509
if (!audioSamples.back())
audioSamples.pop_back();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just do not write into array directly through &audioSamples[numberOfSamples].

It is better to write into local variable and register it on success after all passed checks.

sampleTime += frameStep;
audioSamples[numberOfSamples]->GetSampleDuration(&audioSampleDuration);
CV_LOG_DEBUG(NULL, "videoio(MSMF): got audio frame with timestamp=" << audioSampleTime << " duration=" << audioSampleDuration);
bufferedAudioDuration += (LONGLONG)(audioSampleDuration + residualTime*1e7);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

residualTime*1e7

Added multiple times in the loop (unexpected shift):

while ((!vEOS) ? bufferedAudioDuration <= requiredAudioTime : !aEOS)

Comment on lines 1538 to 1539
if (videoStream == -1)
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"while()" condition should handle this?

void checkAudio()
{
for (unsigned int nCh = 0; nCh < audioData.size(); nCh++)
for (unsigned int i = 0; i < validAudioData[nCh].size(); i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

audioData[nCh].size() check is missing


No need to have 2 identical checkAudio() implementations.

Comment on lines 261 to 264
#if 0
// https://filesamples.com/samples/video/mp4/sample_960x400_ocean_with_audio.mp4
, paramCombination("sample_960x400_ocean_with_audio.mp4", 2, 0.15, CV_8UC3, 400, 960, 1116, 0, 30, 30., cv::CAP_MSMF)
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case has audio-video desync starting from frame 666 which increases.

Copy link
Contributor Author

@MaximMilashchenko MaximMilashchenko Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case has audio-video desync starting from frame 666 which increases.

Starting from frame 666, the audio position begins to outpace the video by more than (1.0 / fps) * 0.3, because for each output of audio data, it is necessary that the data size in bytes be a multiple of the dimension of the matrix.

chunkLengthOfBytes +=
((int)(captureAudioFormat.bit_per_sample)/8* (int)captureAudioFormat.nChannels) - chunkLengthOfBytes %
((int)(captureAudioFormat.bit_per_sample)/8* (int)captureAudioFormat.nChannels);

For this reason, with each iteration, the gap between the video and audio positions will increase and on sufficiently long media files, desynchronization of positions will be observed, but the amount of output audio data with a frame will be unchanged. Confirmed by verification:
EXPECT_NEAR(audio Frame.cols, samplesPerFrame, audioSamplesTolerance);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary to return chunks of audio of the same size.

but the amount of output audio data with a frame will be unchanged

This doesn't make any sense for videos with variable frame rate.

Copy link
Contributor Author

@MaximMilashchenko MaximMilashchenko Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary to return chunks of audio of the same size.

but the amount of output audio data with a frame will be unchanged

This doesn't make any sense for videos with variable frame rate.

Chunks of audio of the same size are returned only if the frame rate is constant. Chunks of audio are calculated based on the duration of the frame

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chunks of audio are calculated based on the duration of the frame

Chunks of audio should be calculated based on the timestamp of the next video frame.

Copy link
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done 👍

@alalek alalek merged commit f36c268 into opencv:master Oct 20, 2021
@alalek alalek mentioned this pull request Dec 30, 2021
@MaximMilashchenko MaximMilashchenko deleted the Audio branch January 11, 2022 14:27
@alalek alalek mentioned this pull request Feb 22, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
add audio support in cap_msmf

* audio msmf

* fixed warnings

* minor fix

* fixed SampleTime MSMF

* minor fix, fixed audio test, retrieveAudioFrame

* fixed warnings

* impelemented sync audio and video stream with start offset

* fixed error

* fixed docs

* fixed audio sample

* CAP_PROP_AUDIO_POS, minor fixed

* fixed warnings

* videoio(MSMF): update audio test checks, add debug logging

* fixed

* fixed desynchronization of time positions, warnings

* fixed warnings

* videoio(audio): tune tests checks

* videoio(audio): update properties description

* build warnings

Co-authored-by: Alexander Alekhin <alexander.a.alekhin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support audio in OpenCV
6 participants