whisper : fix VAD processing for skipped audio segments #3230
I think the end timestamp is correct. It is just the start timestamp that begins too early. But this is also what the reference OpenAI implementation produces, so I don't think it is a bug. The proposed changes in this PR don't seem to work with `make -j && ./build/bin/whisper-cli -f ./samples/gb1.wav -m models/ggml-medium.en.bin -ps`
This is what I'm seeing when I run this sample audio using OpenAI's whisper: `whisper samples/aladdin-first30.mp3 --model medium.en --word_timestamps True`
[00:20.040 --> 00:25.980] There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.980 --> 00:29.920] would do nothing but play all day long in the streets with little idle boys.
And the output in
{
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 20.039999999999996,
"end": 25.98,
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
"tokens": [
50363,
1318,
1752,
5615,
257,
3595,
35280,
11,
508,
550,
257,
3367,
1444,
978,
46782,
11,
257,
36138,
21696,
2933,
11,
508,
51661
],
"temperature": 0.0,
"avg_logprob": -0.22081296920776367,
"compression_ratio": 1.075,
"no_speech_prob": 0.03549518063664436,
"words": [
{
"word": " There",
"start": 20.039999999999996,
"end": 20.619999999999997,
"probability": 0.18882536888122559
},
{
"word": " once",
"start": 20.619999999999997,
"end": 21.2,
"probability": 0.9850126504898071
},
{
"word": " lived",
"start": 21.2,
"end": 21.5,
"probability": 0.9961034059524536
},
...
{
"word": " nothing",
"start": 26.28,
"end": 26.54,
"probability": 0.9977497458457947
},
{
"word": " but",
"start": 26.54,
"end": 26.82,
"probability": 0.9585630893707275
},
{
"word": " play",
"start": 26.82,
"end": 27.12,
"probability": 0.9835618138313293
},
{
"word": " all",
"start": 27.12,
"end": 27.32,
"probability": 0.9963666200637817
},
{
"word": " day",
"start": 27.32,
"end": 27.54,
"probability": 0.9964172840118408
},
If we only correct the start timestamp (though that seems to be an incorrect fix I've made), we would get the following timestamps in the second segment:
{
"text": " nothing",
"timestamps": {
"from": "00:00:29,330",
"to": "00:00:30,020"
},
"offsets": {
"from": 29330,
"to": 30020
},
"id": 2147,
"p": 0.99731,
"t_dtw": -1
},
{
"text": " but",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 475,
"p": 0.923541,
"t_dtw": -1
},
{
"text": " play",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 711,
"p": 0.978036,
"t_dtw": -1
},
And actually this is also the case when running without the first fix (so using the master branch).
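The repetition described here can be checked mechanically. The following is a minimal Python sketch, assuming the JSON layout quoted in this thread (`transcription[].tokens[].offsets.from`/`to`, in milliseconds). The function name `collapsed_tokens` and the inline sample are illustrative only and are not part of whisper.cpp.

```python
import json


def collapsed_tokens(result):
    """Return (segment_index, text, offset_ms) for zero-length text tokens.

    Special tokens such as [_BEG_], [_TT_...] and <|endoftext|> are skipped,
    because they also show from == to in the output quoted in this thread.
    """
    hits = []
    for seg_idx, segment in enumerate(result.get("transcription", [])):
        for token in segment.get("tokens", []):
            text = token["text"]
            if text.startswith("[_") or text.startswith("<|"):
                continue
            offsets = token["offsets"]
            if offsets["from"] == offsets["to"]:
                hits.append((seg_idx, text, offsets["from"]))
    return hits


# Trimmed sample copied from the token output quoted above.
sample = json.loads("""
{"transcription": [{"tokens": [
  {"text": " nothing", "offsets": {"from": 29330, "to": 30020}},
  {"text": " but",     "offsets": {"from": 30040, "to": 30040}},
  {"text": " play",    "offsets": {"from": 30040, "to": 30040}}
]}]}
""")

if __name__ == "__main__":
    for seg_idx, text, offset in collapsed_tokens(sample):
        print(f"segment {seg_idx}: {text!r} collapsed at {offset} ms")
```

Running the same function over a full JSON file from the master branch and from this PR would give a quick before/after count of collapsed tokens.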
The If you remove |
Ah I see. I thought this was the same thing as enabling full JSON output to get the timestamps, and tried to follow that.
This is the output when I remove the
[00:00.000 --> 00:25.960] There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.960 --> 00:30.040] would do nothing but play all day long in the streets with little idle boys.
aladdin-first30.json
{
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 25.96,
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
"tokens": [
50363,
1318,
1752,
5615,
257,
3595,
35280,
11,
508,
550,
257,
3367,
1444,
978,
46782,
11,
257,
36138,
21696,
2933,
11,
508,
51661
],
"temperature": 0.0,
"avg_logprob": -0.22081296920776367,
"compression_ratio": 1.075,
"no_speech_prob": 0.03549518063664436
},
{
"id": 1,
"seek": 2596,
"start": 25.96,
"end": 30.04,
"text": " would do nothing but play all day long in the streets with little idle boys.",
"tokens": [
50363,
561,
466,
2147,
475,
711,
477,
1110,
890,
287,
262,
6483,
351,
1310,
21696,
6510,
13,
50567
],
"temperature": 0.0,
"avg_logprob": -0.264762376484118,
"compression_ratio": 1.0857142857142856,
"no_speech_prob": 0.9776378870010376
}
],
"language": "en"
}
And this is the output from whisper.cpp (using master branch):
[00:00:00.000 --> 00:00:26.000] There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000] would do nothing but play all day long in the streets, with little idle boys
output_json: saving output to 'aladdin.json'
aladdin.json
{
"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
"model": {
"type": "medium",
"multilingual": false,
"vocab": 51864,
"audio": {
"ctx": 1500,
"state": 1024,
"head": 16,
"layer": 24
},
"text": {
"ctx": 448,
"state": 1024,
"head": 16,
"layer": 24
},
"mels": 80,
"ftype": 1
},
"params": {
"model": "./models/ggml-medium.en.bin",
"language": "en",
"translate": false
},
"result": {
"language": "en"
},
"transcription": [
{
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:26,000"
},
"offsets": {
"from": 0,
"to": 26000
},
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
"tokens": [
{
"text": "[_BEG_]",
"timestamps": {
"from": "00:00:00,000",
"to": "00:00:00,000"
},
"offsets": {
"from": 0,
"to": 0
},
"id": 50363,
"p": 0.755106,
"t_dtw": -1
},
{
"text": " There",
"timestamps": {
"from": "00:00:01,140",
"to": "00:00:21,120"
},
"offsets": {
"from": 1140,
"to": 21120
},
"id": 1318,
"p": 0.338112,
"t_dtw": -1
},
{
"text": " once",
"timestamps": {
"from": "00:00:21,120",
"to": "00:00:21,290"
},
"offsets": {
"from": 21120,
"to": 21290
},
"id": 1752,
"p": 0.937644,
"t_dtw": -1
},
{
"text": " lived",
"timestamps": {
"from": "00:00:21,290",
"to": "00:00:21,540"
},
"offsets": {
"from": 21290,
"to": 21540
},
"id": 5615,
"p": 0.980597,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:21,540",
"to": "00:00:21,600"
},
"offsets": {
"from": 21540,
"to": 21600
},
"id": 257,
"p": 0.989806,
"t_dtw": -1
},
{
"text": " poor",
"timestamps": {
"from": "00:00:21,600",
"to": "00:00:21,860"
},
"offsets": {
"from": 21600,
"to": 21860
},
"id": 3595,
"p": 0.973454,
"t_dtw": -1
},
{
"text": " tailor",
"timestamps": {
"from": "00:00:21,860",
"to": "00:00:22,250"
},
"offsets": {
"from": 21860,
"to": 22250
},
"id": 35280,
"p": 0.670927,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:22,250",
"to": "00:00:22,320"
},
"offsets": {
"from": 22250,
"to": 22320
},
"id": 11,
"p": 0.727574,
"t_dtw": -1
},
{
"text": " who",
"timestamps": {
"from": "00:00:22,570",
"to": "00:00:22,570"
},
"offsets": {
"from": 22570,
"to": 22570
},
"id": 508,
"p": 0.942266,
"t_dtw": -1
},
{
"text": " had",
"timestamps": {
"from": "00:00:22,570",
"to": "00:00:22,760"
},
"offsets": {
"from": 22570,
"to": 22760
},
"id": 550,
"p": 0.997018,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:22,760",
"to": "00:00:22,840"
},
"offsets": {
"from": 22760,
"to": 22840
},
"id": 257,
"p": 0.996136,
"t_dtw": -1
},
{
"text": " son",
"timestamps": {
"from": "00:00:22,840",
"to": "00:00:23,140"
},
"offsets": {
"from": 22840,
"to": 23140
},
"id": 3367,
"p": 0.977694,
"t_dtw": -1
},
{
"text": " called",
"timestamps": {
"from": "00:00:23,140",
"to": "00:00:23,590"
},
"offsets": {
"from": 23140,
"to": 23590
},
"id": 1444,
"p": 0.66705,
"t_dtw": -1
},
{
"text": " Al",
"timestamps": {
"from": "00:00:23,590",
"to": "00:00:23,740"
},
"offsets": {
"from": 23590,
"to": 23740
},
"id": 978,
"p": 0.95969,
"t_dtw": -1
},
{
"text": "addin",
"timestamps": {
"from": "00:00:23,740",
"to": "00:00:23,960"
},
"offsets": {
"from": 23740,
"to": 23960
},
"id": 46782,
"p": 0.964032,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:24,280",
"to": "00:00:24,280"
},
"offsets": {
"from": 24280,
"to": 24280
},
"id": 11,
"p": 0.853909,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:24,320",
"to": "00:00:24,350"
},
"offsets": {
"from": 24320,
"to": 24350
},
"id": 257,
"p": 0.958133,
"t_dtw": -1
},
{
"text": " careless",
"timestamps": {
"from": "00:00:24,350",
"to": "00:00:24,900"
},
"offsets": {
"from": 24350,
"to": 24900
},
"id": 36138,
"p": 0.943495,
"t_dtw": -1
},
{
"text": " idle",
"timestamps": {
"from": "00:00:24,970",
"to": "00:00:25,270"
},
"offsets": {
"from": 24970,
"to": 25270
},
"id": 21696,
"p": 0.909963,
"t_dtw": -1
},
{
"text": "-",
"timestamps": {
"from": "00:00:25,270",
"to": "00:00:25,340"
},
"offsets": {
"from": 25270,
"to": 25340
},
"id": 12,
"p": 0.31478,
"t_dtw": -1
},
{
"text": "boy",
"timestamps": {
"from": "00:00:25,350",
"to": "00:00:25,570"
},
"offsets": {
"from": 25350,
"to": 25570
},
"id": 7081,
"p": 0.998407,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:25,570",
"to": "00:00:25,610"
},
"offsets": {
"from": 25570,
"to": 25610
},
"id": 11,
"p": 0.841821,
"t_dtw": -1
},
{
"text": " who",
"timestamps": {
"from": "00:00:25,900",
"to": "00:00:26,000"
},
"offsets": {
"from": 25900,
"to": 26000
},
"id": 508,
"p": 0.985908,
"t_dtw": -1
},
{
"text": "[_TT_1300]",
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:26,000"
},
"offsets": {
"from": 26000,
"to": 26000
},
"id": 51663,
"p": 0.0790712,
"t_dtw": -1
}
]
},
{
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:56,000"
},
"offsets": {
"from": 26000,
"to": 56000
},
"text": " would do nothing but play all day long in the streets, with little idle boys",
"tokens": [
{
"text": "[_BEG_]",
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:26,000"
},
"offsets": {
"from": 26000,
"to": 26000
},
"id": 50363,
"p": 0.99269,
"t_dtw": -1
},
{
"text": " would",
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:28,350"
},
"offsets": {
"from": 26000,
"to": 28350
},
"id": 561,
"p": 0.731398,
"t_dtw": -1
},
{
"text": " do",
"timestamps": {
"from": "00:00:28,460",
"to": "00:00:29,330"
},
"offsets": {
"from": 28460,
"to": 29330
},
"id": 466,
"p": 0.976556,
"t_dtw": -1
},
{
"text": " nothing",
"timestamps": {
"from": "00:00:29,330",
"to": "00:00:30,020"
},
"offsets": {
"from": 29330,
"to": 30020
},
"id": 2147,
"p": 0.99731,
"t_dtw": -1
},
{
"text": " but",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 475,
"p": 0.923541,
"t_dtw": -1
},
{
"text": " play",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 711,
"p": 0.978036,
"t_dtw": -1
},
{
"text": " all",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 477,
"p": 0.993187,
"t_dtw": -1
},
{
"text": " day",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 1110,
"p": 0.991072,
"t_dtw": -1
},
{
"text": " long",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 890,
"p": 0.987279,
"t_dtw": -1
},
{
"text": " in",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 287,
"p": 0.981879,
"t_dtw": -1
},
{
"text": " the",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 262,
"p": 0.989247,
"t_dtw": -1
},
{
"text": " streets",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 6483,
"p": 0.994812,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 11,
"p": 0.619108,
"t_dtw": -1
},
{
"text": " with",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 351,
"p": 0.988525,
"t_dtw": -1
},
{
"text": " little",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 1310,
"p": 0.964262,
"t_dtw": -1
},
{
"text": " idle",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 21696,
"p": 0.964748,
"t_dtw": -1
},
{
"text": " boys",
"timestamps": {
"from": "00:00:30,040",
"to": "00:00:30,040"
},
"offsets": {
"from": 30040,
"to": 30040
},
"id": 6510,
"p": 0.961095,
"t_dtw": -1
},
{
"text": "<|endoftext|>",
"timestamps": {
"from": "00:00:56,000",
"to": "00:00:56,000"
},
"offsets": {
"from": 56000,
"to": 56000
},
"id": 50256,
"p": 0.167677,
"t_dtw": -1
}
]
}
]
}
And this is the output from whisper.cpp using this PR:
[00:00:21.120 --> 00:00:26.000] There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000] would do nothing but play all day long in the streets, with little idle boys
aladdin.json
{
"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
"model": {
"type": "medium",
"multilingual": false,
"vocab": 51864,
"audio": {
"ctx": 1500,
"state": 1024,
"head": 16,
"layer": 24
},
"text": {
"ctx": 448,
"state": 1024,
"head": 16,
"layer": 24
},
"mels": 80,
"ftype": 1
},
"params": {
"model": "./models/ggml-medium.en.bin",
"language": "en",
"translate": false
},
"result": {
"language": "en"
},
"transcription": [
{
"timestamps": {
"from": "00:00:21,120",
"to": "00:00:26,000"
},
"offsets": {
"from": 21120,
"to": 26000
},
"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
"tokens": [
{
"text": "[_BEG_]",
"timestamps": {
"from": "00:00:21,120",
"to": "00:00:21,120"
},
"offsets": {
"from": 21120,
"to": 21120
},
"id": 50363,
"p": 0.755106,
"t_dtw": -1
},
{
"text": " There",
"timestamps": {
"from": "00:00:21,280",
"to": "00:00:21,440"
},
"offsets": {
"from": 21280,
"to": 21440
},
"id": 1318,
"p": 0.338112,
"t_dtw": -1
},
{
"text": " once",
"timestamps": {
"from": "00:00:21,440",
"to": "00:00:21,700"
},
"offsets": {
"from": 21440,
"to": 21700
},
"id": 1752,
"p": 0.937644,
"t_dtw": -1
},
{
"text": " lived",
"timestamps": {
"from": "00:00:21,700",
"to": "00:00:22,020"
},
"offsets": {
"from": 21700,
"to": 22020
},
"id": 5615,
"p": 0.980597,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:22,020",
"to": "00:00:22,080"
},
"offsets": {
"from": 22020,
"to": 22080
},
"id": 257,
"p": 0.989806,
"t_dtw": -1
},
{
"text": " poor",
"timestamps": {
"from": "00:00:22,260",
"to": "00:00:22,340"
},
"offsets": {
"from": 22260,
"to": 22340
},
"id": 3595,
"p": 0.973454,
"t_dtw": -1
},
{
"text": " tailor",
"timestamps": {
"from": "00:00:22,340",
"to": "00:00:22,730"
},
"offsets": {
"from": 22340,
"to": 22730
},
"id": 35280,
"p": 0.670927,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:22,730",
"to": "00:00:22,860"
},
"offsets": {
"from": 22730,
"to": 22860
},
"id": 11,
"p": 0.727574,
"t_dtw": -1
},
{
"text": " who",
"timestamps": {
"from": "00:00:22,860",
"to": "00:00:23,050"
},
"offsets": {
"from": 22860,
"to": 23050
},
"id": 508,
"p": 0.942266,
"t_dtw": -1
},
{
"text": " had",
"timestamps": {
"from": "00:00:23,050",
"to": "00:00:23,240"
},
"offsets": {
"from": 23050,
"to": 23240
},
"id": 550,
"p": 0.997018,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:23,240",
"to": "00:00:23,250"
},
"offsets": {
"from": 23240,
"to": 23250
},
"id": 257,
"p": 0.996136,
"t_dtw": -1
},
{
"text": " son",
"timestamps": {
"from": "00:00:23,310",
"to": "00:00:23,400"
},
"offsets": {
"from": 23310,
"to": 23400
},
"id": 3367,
"p": 0.977694,
"t_dtw": -1
},
{
"text": " called",
"timestamps": {
"from": "00:00:23,500",
"to": "00:00:23,790"
},
"offsets": {
"from": 23500,
"to": 23790
},
"id": 1444,
"p": 0.66705,
"t_dtw": -1
},
{
"text": " Al",
"timestamps": {
"from": "00:00:23,910",
"to": "00:00:24,010"
},
"offsets": {
"from": 23910,
"to": 24010
},
"id": 978,
"p": 0.95969,
"t_dtw": -1
},
{
"text": "addin",
"timestamps": {
"from": "00:00:24,010",
"to": "00:00:24,330"
},
"offsets": {
"from": 24010,
"to": 24330
},
"id": 46782,
"p": 0.964032,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:24,330",
"to": "00:00:24,460"
},
"offsets": {
"from": 24330,
"to": 24460
},
"id": 11,
"p": 0.853909,
"t_dtw": -1
},
{
"text": " a",
"timestamps": {
"from": "00:00:24,460",
"to": "00:00:24,520"
},
"offsets": {
"from": 24460,
"to": 24520
},
"id": 257,
"p": 0.958133,
"t_dtw": -1
},
{
"text": " careless",
"timestamps": {
"from": "00:00:24,520",
"to": "00:00:24,690"
},
"offsets": {
"from": 24520,
"to": 24690
},
"id": 36138,
"p": 0.943495,
"t_dtw": -1
},
{
"text": " idle",
"timestamps": {
"from": "00:00:25,300",
"to": "00:00:25,300"
},
"offsets": {
"from": 25300,
"to": 25300
},
"id": 21696,
"p": 0.909963,
"t_dtw": -1
},
{
"text": "-",
"timestamps": {
"from": "00:00:25,330",
"to": "00:00:25,350"
},
"offsets": {
"from": 25330,
"to": 25350
},
"id": 12,
"p": 0.31478,
"t_dtw": -1
},
{
"text": "boy",
"timestamps": {
"from": "00:00:25,360",
"to": "00:00:25,550"
},
"offsets": {
"from": 25360,
"to": 25550
},
"id": 7081,
"p": 0.998407,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:25,550",
"to": "00:00:25,660"
},
"offsets": {
"from": 25550,
"to": 25660
},
"id": 11,
"p": 0.841821,
"t_dtw": -1
},
{
"text": " who",
"timestamps": {
"from": "00:00:25,740",
"to": "00:00:26,000"
},
"offsets": {
"from": 25740,
"to": 26000
},
"id": 508,
"p": 0.985908,
"t_dtw": -1
},
{
"text": "[_TT_1300]",
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:26,000"
},
"offsets": {
"from": 26000,
"to": 26000
},
"id": 51663,
"p": 0.0790712,
"t_dtw": -1
}
]
},
{
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:56,000"
},
"offsets": {
"from": 26000,
"to": 56000
},
"text": " would do nothing but play all day long in the streets, with little idle boys",
"tokens": [
{
"text": "[_BEG_]",
"timestamps": {
"from": "00:00:26,000",
"to": "00:00:26,000"
},
"offsets": {
"from": 26000,
"to": 26000
},
"id": 50363,
"p": 0.99269,
"t_dtw": -1
},
{
"text": " would",
"timestamps": {
"from": "00:00:27,140",
"to": "00:00:28,280"
},
"offsets": {
"from": 27140,
"to": 28280
},
"id": 561,
"p": 0.731398,
"t_dtw": -1
},
{
"text": " do",
"timestamps": {
"from": "00:00:28,380",
"to": "00:00:29,330"
},
"offsets": {
"from": 28380,
"to": 29330
},
"id": 466,
"p": 0.976556,
"t_dtw": -1
},
{
"text": " nothing",
"timestamps": {
"from": "00:00:29,330",
"to": "00:00:32,660"
},
"offsets": {
"from": 29330,
"to": 32660
},
"id": 2147,
"p": 0.99731,
"t_dtw": -1
},
{
"text": " but",
"timestamps": {
"from": "00:00:32,660",
"to": "00:00:34,060"
},
"offsets": {
"from": 32660,
"to": 34060
},
"id": 475,
"p": 0.923541,
"t_dtw": -1
},
{
"text": " play",
"timestamps": {
"from": "00:00:34,120",
"to": "00:00:35,980"
},
"offsets": {
"from": 34120,
"to": 35980
},
"id": 711,
"p": 0.978036,
"t_dtw": -1
},
{
"text": " all",
"timestamps": {
"from": "00:00:35,980",
"to": "00:00:37,400"
},
"offsets": {
"from": 35980,
"to": 37400
},
"id": 477,
"p": 0.993187,
"t_dtw": -1
},
{
"text": " day",
"timestamps": {
"from": "00:00:37,400",
"to": "00:00:38,750"
},
"offsets": {
"from": 37400,
"to": 38750
},
"id": 1110,
"p": 0.991072,
"t_dtw": -1
},
{
"text": " long",
"timestamps": {
"from": "00:00:38,830",
"to": "00:00:40,720"
},
"offsets": {
"from": 38830,
"to": 40720
},
"id": 890,
"p": 0.987279,
"t_dtw": -1
},
{
"text": " in",
"timestamps": {
"from": "00:00:40,720",
"to": "00:00:41,560"
},
"offsets": {
"from": 40720,
"to": 41560
},
"id": 287,
"p": 0.981879,
"t_dtw": -1
},
{
"text": " the",
"timestamps": {
"from": "00:00:42,340",
"to": "00:00:43,090"
},
"offsets": {
"from": 42340,
"to": 43090
},
"id": 262,
"p": 0.989247,
"t_dtw": -1
},
{
"text": " streets",
"timestamps": {
"from": "00:00:43,090",
"to": "00:00:44,840"
},
"offsets": {
"from": 43090,
"to": 44840
},
"id": 6483,
"p": 0.994812,
"t_dtw": -1
},
{
"text": ",",
"timestamps": {
"from": "00:00:46,830",
"to": "00:00:47,370"
},
"offsets": {
"from": 46830,
"to": 47370
},
"id": 11,
"p": 0.619108,
"t_dtw": -1
},
{
"text": " with",
"timestamps": {
"from": "00:00:47,370",
"to": "00:00:49,220"
},
"offsets": {
"from": 47370,
"to": 49220
},
"id": 351,
"p": 0.988525,
"t_dtw": -1
},
{
"text": " little",
"timestamps": {
"from": "00:00:49,270",
"to": "00:00:52,120"
},
"offsets": {
"from": 49270,
"to": 52120
},
"id": 1310,
"p": 0.964262,
"t_dtw": -1
},
{
"text": " idle",
"timestamps": {
"from": "00:00:52,120",
"to": "00:00:54,020"
},
"offsets": {
"from": 52120,
"to": 54020
},
"id": 21696,
"p": 0.964748,
"t_dtw": -1
},
{
"text": " boys",
"timestamps": {
"from": "00:00:54,020",
"to": "00:00:56,000"
},
"offsets": {
"from": 54020,
"to": 56000
},
"id": 6510,
"p": 0.961095,
"t_dtw": -1
},
{
"text": "<|endoftext|>",
"timestamps": {
"from": "00:00:56,000",
"to": "00:00:56,000"
},
"offsets": {
"from": 56000,
"to": 56000
},
"id": 50256,
"p": 0.167677,
"t_dtw": -1
}
]
}
]
}
So I was wrong about the start timestamp and OpenAI's whisper does the same thing in this respect. But if someone uses |
Yes, the output should not contain token-level timestamps when they are not enabled. We have some warnings in the Lines 140 to 149 in b175baa
But it would be better to prevent incorrect usage in the examples too. Ideally, the DTW algorithm should be fixed so we can provide token-level timestamps, but I'm not sure how difficult this would be. Anyway, not a very big priority atm. |
I just noticed that the whisper.cpp/examples/cli/cli.cpp Lines 1166 to 1168 in a73e240
So I have forgotten that we have 2 different token-level timestamp algorithms:
The second one is probably too difficult to fix, so no need to look into it. But the first one should give OK results. So in that sense, your observation about the timestamp repetitions is correct and this does look like a bug. |
Maybe not the right place to mention this, but I was looking for issues that reference DTW. It would definitely be great to have `-dtw` fixed, and with the new VAD I suspect that timestamps may get even more precise.
f3ee700 to c34ee30
This commit addresses an issue with token timestamps when audio segments are skipped, in `whisper_exp_compute_token_level_timestamps`, related to the VAD processing and the energy levels.

The motivation for this is that the token timestamps exceed the energy array bounds due to segment timing misalignment:

```console
            (skipped introduction)
                    ↓
Audio segment:     [2600ms → 5600ms]  (3 seconds of actual audio)
Energy array:      [0 → 480652]       (samples for 3 seconds)
Token timestamps:  [3266ms → 3408ms]  (absolute timestamps)
```

So both `s0` and `t1` get clamped to the maximum sample index (480652), which causes the start/end timestamps to be the same for all the tokens after a certain point.

This is addressed by using segment-relative timestamps in `timestamp_to_sample` and `sample_to_timestamp`.

Resolves: ggml-org#3207
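To make the clamping concrete, here is a minimal Python sketch of the arithmetic. It is not the PR's code: it assumes a 16 kHz sample rate and timestamps counted in units of 10 ms, and it uses a hypothetical energy array covering 30 seconds with a segment that starts at 26 s. The `segment_t0` parameter is an illustration of the segment-relative idea, not the actual signature of `timestamp_to_sample` or `sample_to_timestamp`.

```python
SAMPLE_RATE = 16000  # assumed samples per second


def timestamp_to_sample(t, n_samples, segment_t0=0):
    """Map a timestamp (assumed 10 ms units) to a clamped sample index.

    segment_t0 is a hypothetical parameter: the start of the segment that
    the energy array covers, in the same units as t.
    """
    sample = ((t - segment_t0) * SAMPLE_RATE) // 100
    return max(0, min(n_samples - 1, sample))


def sample_to_timestamp(i_sample, segment_t0=0):
    """Inverse mapping: sample index back to a timestamp in 10 ms units."""
    return (100 * i_sample) // SAMPLE_RATE + segment_t0


n_samples = 30 * SAMPLE_RATE       # energy array covering 30 s of audio
segment_t0 = 2600                  # segment starts at 26 s
token_t0, token_t1 = 3266, 3408    # absolute token timestamps

# Absolute timestamps index past the array, so both clamp to the last sample.
abs_s0 = timestamp_to_sample(token_t0, n_samples)
abs_s1 = timestamp_to_sample(token_t1, n_samples)

# Segment-relative timestamps stay inside the array and remain distinct.
rel_s0 = timestamp_to_sample(token_t0, n_samples, segment_t0)
rel_s1 = timestamp_to_sample(token_t1, n_samples, segment_t0)

if __name__ == "__main__":
    print("absolute:", abs_s0, abs_s1)
    print("relative:", rel_s0, rel_s1)
    print("round trip:", sample_to_timestamp(rel_s0, segment_t0))
```

With absolute timestamps both indices land on the last sample, which is the collapse seen in the token output earlier in the thread. Subtracting the segment start keeps them distinct and lets the inverse mapping recover the original timestamp.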
c34ee30 to deb1c0d
@ggerganov Is this something I could/should take a look at? |
@danbev Yes, but I don't have a sense of how much effort is needed to fix it, as I never became deeply familiar with the initial implementation. It might be something simple, but it might also be quite difficult. AFAIR it required extracting intermediate data from the encoder attention, and the original implementation was a bit hacky in this regard. Feel free to take a look, but if it turns out to be too complicated, there is no need to spend effort on it.