whisper : fix VAD processing for skipped audio segments #3230

danbev · 2025-06-05T13:06:43Z

This commit addresses an issue with token timestamps in
whisper_exp_compute_token_level_timestamps when audio segments are
skipped, related to the VAD processing and the energy levels.

The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:

                      (skipped introduction)
                        ↓
Audio segment:     [2600ms → 5600ms]  (3 seconds of actual audio)
Energy array:      [0 → 480652]       (samples for 3 seconds)
Token timestamps:  [3266ms → 3408ms]  (absolute timestamps)

So both s0 and t1 get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.

This is addressed by using segment-relative timestamps in
timestamp_to_sample and sample_to_timestamp.
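
A minimal sketch of the clamping problem and the segment-relative idea described above. This is not the actual whisper.cpp code: the `_abs`/`_rel` function names, the 16 kHz sample rate, the 10 ms timestamp unit, and the numbers in the note below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumptions for this sketch: 16 kHz audio, timestamps in units of 10 ms.
constexpr int64_t SAMPLE_RATE = 16000;

// Maps a timestamp to an index into an energy array with n_samples entries,
// clamping the result to the valid range [0, n_samples - 1].
static int timestamp_to_sample_abs(int64_t t, int n_samples) {
    assert(n_samples > 0);
    const int64_t idx = (t * SAMPLE_RATE) / 100;
    return (int) std::max<int64_t>(0, std::min<int64_t>(n_samples - 1, idx));
}

// Same mapping, but the timestamp is first made relative to the segment start,
// so it indexes into an energy array that only covers that segment.
static int timestamp_to_sample_rel(int64_t t, int64_t t_seg_start, int n_samples) {
    return timestamp_to_sample_abs(t - t_seg_start, n_samples);
}

// Inverse mapping: sample index within the segment back to an absolute timestamp.
static int64_t sample_to_timestamp_rel(int i_sample, int64_t t_seg_start) {
    return t_seg_start + (100 * (int64_t) i_sample) / SAMPLE_RATE;
}
```

With a hypothetical 3-second energy array (48000 samples) for a segment that starts at t = 260, the absolute timestamps 326 and 340 both clamp to index 47999, while the segment-relative versions map to the distinct indices 10560 and 12800.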

ggerganov · 2025-06-06T07:31:26Z

This caused the segment start timestamp to start at 0 but the end timestamp was relative to the skipped audio (so much larger and incorrect).

I think the end timestamp is correct. It is just the start timestamp that begins too early. But this is also what the reference OpenAI implementation produces, so I don't think it is a bug.

The proposed changes in this PR don't seem to work with gb1.wav:

make -j && ./build/bin/whisper-cli -f ./samples/gb1.wav -m models/ggml-medium.en.bin -ps

main: processing './samples/gb1.wav' (3179750 samples, 198.7 sec), 1 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_exp_compute_token_level_timestamps: audio samples skipped, setting segment.t0 to 2962

[00:00:29.620 --> 00:00:07.360]  [_BEG_] My fellow Americans, this day has brought terrible news and great sadness to our[_TT_368]
[00:00:07.360 --> 00:00:14.800]   country. At nine o'clock this morning, Mission Control in Houston lost contact[_TT_740]
[00:00:14.800 --> 00:00:21.400]   with our space shuttle Columbia. A short time later, debris was seen falling from[_TT_1070]
...
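
The first segment in the output above starts at 00:00:29.620 but ends at 00:00:07.360. A small check along the following lines could flag such inverted segments when comparing runs. It is only a sketch: `parse_timestamp_ms` and `segment_is_ordered` are hypothetical helpers, not part of whisper.cpp, and they assume the `HH:MM:SS.mmm` format printed by the CLI.

```cpp
#include <cstdint>
#include <cstdio>

// Parses "HH:MM:SS.mmm" into milliseconds; returns -1 if the text does not match.
static int64_t parse_timestamp_ms(const char * text) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(text, "%d:%d:%d.%d", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((int64_t) h * 3600 + m * 60 + s) * 1000 + ms;
}

// Returns true when both timestamps parse and the start is not after the end.
static bool segment_is_ordered(const char * from, const char * to) {
    const int64_t t0 = parse_timestamp_ms(from);
    const int64_t t1 = parse_timestamp_ms(to);
    return t0 >= 0 && t1 >= 0 && t0 <= t1;
}
```

For the first line above, `segment_is_ordered("00:00:29.620", "00:00:07.360")` returns false, while the second line (`00:00:07.360` to `00:00:14.800`) passes.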

danbev · 2025-06-06T07:59:05Z

I think the end timestamp is correct. It is just the start timestamp that begins too early. But this is also what the reference OpenAI implementation produces, so I don't think it is a bug.

This is what I'm seeing when I run this sample audio using OpenAI's whisper:

whisper samples/aladdin-first30.mp3 --model medium.en --word_timestamps True
[00:20.040 --> 00:25.980]  There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.980 --> 00:29.920]  would do nothing but play all day long in the streets with little idle boys.

And the output in aladdin-first30.json looks like this:

{
  "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 20.039999999999996,
      "end": 25.98,
      "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
      "tokens": [
        50363,
        1318,
        1752,
        5615,
        257,
        3595,
        35280,
        11,
        508,
        550,
        257,
        3367,
        1444,
        978,
        46782,
        11,
        257,
        36138,
        21696,
        2933,
        11,
        508,
        51661
      ],
      "temperature": 0.0,
      "avg_logprob": -0.22081296920776367,
      "compression_ratio": 1.075,
      "no_speech_prob": 0.03549518063664436,
      "words": [
        {
          "word": " There",
          "start": 20.039999999999996,
          "end": 20.619999999999997,
          "probability": 0.18882536888122559
        },
        {
          "word": " once",
          "start": 20.619999999999997,
          "end": 21.2,
          "probability": 0.9850126504898071
        },
        {
          "word": " lived",
          "start": 21.2,
          "end": 21.5,
          "probability": 0.9961034059524536
        },
       ...
      {
          "word": " nothing",
          "start": 26.28,
          "end": 26.54,
          "probability": 0.9977497458457947
        },
        {
          "word": " but",
          "start": 26.54,
          "end": 26.82,
          "probability": 0.9585630893707275
        },
        {
          "word": " play",
          "start": 26.82,
          "end": 27.12,
          "probability": 0.9835618138313293
        },
        {
          "word": " all",
          "start": 27.12,
          "end": 27.32,
          "probability": 0.9963666200637817
        },
        {
          "word": " day",
          "start": 27.32,
          "end": 27.54,
          "probability": 0.9964172840118408
        },

If we only correct the start timestamp (though the fix I've made for that seems to be incorrect), we would get the following timestamps in the second segment:

          "text": " nothing",
          "timestamps": {
            "from": "00:00:29,330",
            "to": "00:00:30,020"
          },
          "offsets": {
            "from": 29330,
            "to": 30020
          },
          "id": 2147,
          "p": 0.99731,
          "t_dtw": -1
        },
        {
          "text": " but",
          "timestamps": {
            "from": "00:00:30,040",
            "to": "00:00:30,040"
          },
          "offsets": {
            "from": 30040,
            "to": 30040
          },
          "id": 475,
          "p": 0.923541,
          "t_dtw": -1
        },
        {
          "text": " play",
          "timestamps": {
            "from": "00:00:30,040",
            "to": "00:00:30,040"
          },
          "offsets": {
            "from": 30040,
            "to": 30040
          },
          "id": 711,
          "p": 0.978036,
          "t_dtw": -1
        },

And this is actually also the case when running without the first fix (so using the master branch).
So I think there might be an issue with the initial start timestamp and these repeating timestamps, unless I'm missing something.
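
To make the repeating timestamps easier to spot across runs, a check like the one below could be used. It is a sketch under assumptions: `token_span` and `longest_collapsed_run` are hypothetical names, and the offsets are assumed to be the millisecond `offsets.from` / `offsets.to` values from the JSON output. As an aside, assuming 16 kHz audio, 480652 samples is roughly 30.04 s, which would be consistent with the repeated 30040 offset being the clamped maximum, though that is an inference.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Start and end offset of one token, in milliseconds.
struct token_span {
    int64_t from_ms;
    int64_t to_ms;
};

// Returns the length of the longest run of consecutive tokens that have zero
// duration and all sit at the same offset (timestamps collapsed to one point).
static size_t longest_collapsed_run(const std::vector<token_span> & tokens) {
    size_t best = 0;
    size_t run  = 0;
    for (size_t i = 0; i < tokens.size(); ++i) {
        const bool zero_len = tokens[i].from_ms == tokens[i].to_ms;
        // run > 0 implies i >= 1, so tokens[i - 1] is valid here
        const bool same_pos = run > 0 && tokens[i].from_ms == tokens[i - 1].from_ms;
        if (zero_len && (run == 0 || same_pos)) {
            ++run;
        } else {
            run = zero_len ? 1 : 0;
        }
        if (run > best) {
            best = run;
        }
    }
    return best;
}
```

For the three tokens shown above (" nothing", " but", " play"), the function returns 2, since " but" and " play" both collapse to 30040.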

ggerganov · 2025-06-06T12:25:18Z

The --word_timestamps True flag in OpenAI enables an additional post-processing algorithm. We have a similar implementation using the -dtw flag, but this has been broken for a long time.

If you remove --word_timestamps True does it produce the same timestamp as whisper.cpp?

danbev · 2025-06-06T13:13:15Z

The --word_timestamps True flag in OpenAI enables an additional post-processing algorithm. We have a similar implementation using the -dtw flag, but this has been broken for a long time.

Ah I see, I thought this was the same thing as enabling full JSON output to get the timestamps, and I tried to follow that.

If you remove --word_timestamps True does it produce the same timestamp as whisper.cpp?

This is the output when I remove the --word_timestamps:

[00:00.000 --> 00:25.960]  There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.960 --> 00:30.040]  would do nothing but play all day long in the streets with little idle boys.

aladdin-first30.json

{
  "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 25.96,
      "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
      "tokens": [
        50363,
        1318,
        1752,
        5615,
        257,
        3595,
        35280,
        11,
        508,
        550,
        257,
        3367,
        1444,
        978,
        46782,
        11,
        257,
        36138,
        21696,
        2933,
        11,
        508,
        51661
      ],
      "temperature": 0.0,
      "avg_logprob": -0.22081296920776367,
      "compression_ratio": 1.075,
      "no_speech_prob": 0.03549518063664436
    },
    {
      "id": 1,
      "seek": 2596,
      "start": 25.96,
      "end": 30.04,
      "text": " would do nothing but play all day long in the streets with little idle boys.",
      "tokens": [
        50363,
        561,
        466,
        2147,
        475,
        711,
        477,
        1110,
        890,
        287,
        262,
        6483,
        351,
        1310,
        21696,
        6510,
        13,
        50567
      ],
      "temperature": 0.0,
      "avg_logprob": -0.264762376484118,
      "compression_ratio": 1.0857142857142856,
      "no_speech_prob": 0.9776378870010376
    }
  ],
  "language": "en"
}

And this is the output from whisper.cpp (using master branch):

[00:00:00.000 --> 00:00:26.000]   There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000]   would do nothing but play all day long in the streets, with little idle boys
output_json: saving output to 'aladdin.json'

aladdin.json

{
	"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "./models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:26,000"
			},
			"offsets": {
				"from": 0,
				"to": 26000
			},
			"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:00,000",
						"to": "00:00:00,000"
					},
					"offsets": {
						"from": 0,
						"to": 0
					},
					"id": 50363,
					"p": 0.755106,
					"t_dtw": -1
				},
				{
					"text": " There",
					"timestamps": {
						"from": "00:00:01,140",
						"to": "00:00:21,120"
					},
					"offsets": {
						"from": 1140,
						"to": 21120
					},
					"id": 1318,
					"p": 0.338112,
					"t_dtw": -1
				},
				{
					"text": " once",
					"timestamps": {
						"from": "00:00:21,120",
						"to": "00:00:21,290"
					},
					"offsets": {
						"from": 21120,
						"to": 21290
					},
					"id": 1752,
					"p": 0.937644,
					"t_dtw": -1
				},
				{
					"text": " lived",
					"timestamps": {
						"from": "00:00:21,290",
						"to": "00:00:21,540"
					},
					"offsets": {
						"from": 21290,
						"to": 21540
					},
					"id": 5615,
					"p": 0.980597,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:21,540",
						"to": "00:00:21,600"
					},
					"offsets": {
						"from": 21540,
						"to": 21600
					},
					"id": 257,
					"p": 0.989806,
					"t_dtw": -1
				},
				{
					"text": " poor",
					"timestamps": {
						"from": "00:00:21,600",
						"to": "00:00:21,860"
					},
					"offsets": {
						"from": 21600,
						"to": 21860
					},
					"id": 3595,
					"p": 0.973454,
					"t_dtw": -1
				},
				{
					"text": " tailor",
					"timestamps": {
						"from": "00:00:21,860",
						"to": "00:00:22,250"
					},
					"offsets": {
						"from": 21860,
						"to": 22250
					},
					"id": 35280,
					"p": 0.670927,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:22,250",
						"to": "00:00:22,320"
					},
					"offsets": {
						"from": 22250,
						"to": 22320
					},
					"id": 11,
					"p": 0.727574,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:22,570",
						"to": "00:00:22,570"
					},
					"offsets": {
						"from": 22570,
						"to": 22570
					},
					"id": 508,
					"p": 0.942266,
					"t_dtw": -1
				},
				{
					"text": " had",
					"timestamps": {
						"from": "00:00:22,570",
						"to": "00:00:22,760"
					},
					"offsets": {
						"from": 22570,
						"to": 22760
					},
					"id": 550,
					"p": 0.997018,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:22,760",
						"to": "00:00:22,840"
					},
					"offsets": {
						"from": 22760,
						"to": 22840
					},
					"id": 257,
					"p": 0.996136,
					"t_dtw": -1
				},
				{
					"text": " son",
					"timestamps": {
						"from": "00:00:22,840",
						"to": "00:00:23,140"
					},
					"offsets": {
						"from": 22840,
						"to": 23140
					},
					"id": 3367,
					"p": 0.977694,
					"t_dtw": -1
				},
				{
					"text": " called",
					"timestamps": {
						"from": "00:00:23,140",
						"to": "00:00:23,590"
					},
					"offsets": {
						"from": 23140,
						"to": 23590
					},
					"id": 1444,
					"p": 0.66705,
					"t_dtw": -1
				},
				{
					"text": " Al",
					"timestamps": {
						"from": "00:00:23,590",
						"to": "00:00:23,740"
					},
					"offsets": {
						"from": 23590,
						"to": 23740
					},
					"id": 978,
					"p": 0.95969,
					"t_dtw": -1
				},
				{
					"text": "addin",
					"timestamps": {
						"from": "00:00:23,740",
						"to": "00:00:23,960"
					},
					"offsets": {
						"from": 23740,
						"to": 23960
					},
					"id": 46782,
					"p": 0.964032,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:24,280",
						"to": "00:00:24,280"
					},
					"offsets": {
						"from": 24280,
						"to": 24280
					},
					"id": 11,
					"p": 0.853909,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:24,320",
						"to": "00:00:24,350"
					},
					"offsets": {
						"from": 24320,
						"to": 24350
					},
					"id": 257,
					"p": 0.958133,
					"t_dtw": -1
				},
				{
					"text": " careless",
					"timestamps": {
						"from": "00:00:24,350",
						"to": "00:00:24,900"
					},
					"offsets": {
						"from": 24350,
						"to": 24900
					},
					"id": 36138,
					"p": 0.943495,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:24,970",
						"to": "00:00:25,270"
					},
					"offsets": {
						"from": 24970,
						"to": 25270
					},
					"id": 21696,
					"p": 0.909963,
					"t_dtw": -1
				},
				{
					"text": "-",
					"timestamps": {
						"from": "00:00:25,270",
						"to": "00:00:25,340"
					},
					"offsets": {
						"from": 25270,
						"to": 25340
					},
					"id": 12,
					"p": 0.31478,
					"t_dtw": -1
				},
				{
					"text": "boy",
					"timestamps": {
						"from": "00:00:25,350",
						"to": "00:00:25,570"
					},
					"offsets": {
						"from": 25350,
						"to": 25570
					},
					"id": 7081,
					"p": 0.998407,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:25,570",
						"to": "00:00:25,610"
					},
					"offsets": {
						"from": 25570,
						"to": 25610
					},
					"id": 11,
					"p": 0.841821,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:25,900",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 25900,
						"to": 26000
					},
					"id": 508,
					"p": 0.985908,
					"t_dtw": -1
				},
				{
					"text": "[_TT_1300]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 51663,
					"p": 0.0790712,
					"t_dtw": -1
				}
			]
		},
		{
			"timestamps": {
				"from": "00:00:26,000",
				"to": "00:00:56,000"
			},
			"offsets": {
				"from": 26000,
				"to": 56000
			},
			"text": " would do nothing but play all day long in the streets, with little idle boys",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 50363,
					"p": 0.99269,
					"t_dtw": -1
				},
				{
					"text": " would",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:28,350"
					},
					"offsets": {
						"from": 26000,
						"to": 28350
					},
					"id": 561,
					"p": 0.731398,
					"t_dtw": -1
				},
				{
					"text": " do",
					"timestamps": {
						"from": "00:00:28,460",
						"to": "00:00:29,330"
					},
					"offsets": {
						"from": 28460,
						"to": 29330
					},
					"id": 466,
					"p": 0.976556,
					"t_dtw": -1
				},
				{
					"text": " nothing",
					"timestamps": {
						"from": "00:00:29,330",
						"to": "00:00:30,020"
					},
					"offsets": {
						"from": 29330,
						"to": 30020
					},
					"id": 2147,
					"p": 0.99731,
					"t_dtw": -1
				},
				{
					"text": " but",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 475,
					"p": 0.923541,
					"t_dtw": -1
				},
				{
					"text": " play",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 711,
					"p": 0.978036,
					"t_dtw": -1
				},
				{
					"text": " all",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 477,
					"p": 0.993187,
					"t_dtw": -1
				},
				{
					"text": " day",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 1110,
					"p": 0.991072,
					"t_dtw": -1
				},
				{
					"text": " long",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 890,
					"p": 0.987279,
					"t_dtw": -1
				},
				{
					"text": " in",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 287,
					"p": 0.981879,
					"t_dtw": -1
				},
				{
					"text": " the",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 262,
					"p": 0.989247,
					"t_dtw": -1
				},
				{
					"text": " streets",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 6483,
					"p": 0.994812,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 11,
					"p": 0.619108,
					"t_dtw": -1
				},
				{
					"text": " with",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 351,
					"p": 0.988525,
					"t_dtw": -1
				},
				{
					"text": " little",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 1310,
					"p": 0.964262,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 21696,
					"p": 0.964748,
					"t_dtw": -1
				},
				{
					"text": " boys",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 6510,
					"p": 0.961095,
					"t_dtw": -1
				},
				{
					"text": "<|endoftext|>",
					"timestamps": {
						"from": "00:00:56,000",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 56000,
						"to": 56000
					},
					"id": 50256,
					"p": 0.167677,
					"t_dtw": -1
				}
			]
		}
	]
}

And this is the output from whisper.cpp using this PR:

[00:00:21.120 --> 00:00:26.000]   There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000]   would do nothing but play all day long in the streets, with little idle boys

aladdin.json

{
	"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "./models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:21,120",
				"to": "00:00:26,000"
			},
			"offsets": {
				"from": 21120,
				"to": 26000
			},
			"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:21,120",
						"to": "00:00:21,120"
					},
					"offsets": {
						"from": 21120,
						"to": 21120
					},
					"id": 50363,
					"p": 0.755106,
					"t_dtw": -1
				},
				{
					"text": " There",
					"timestamps": {
						"from": "00:00:21,280",
						"to": "00:00:21,440"
					},
					"offsets": {
						"from": 21280,
						"to": 21440
					},
					"id": 1318,
					"p": 0.338112,
					"t_dtw": -1
				},
				{
					"text": " once",
					"timestamps": {
						"from": "00:00:21,440",
						"to": "00:00:21,700"
					},
					"offsets": {
						"from": 21440,
						"to": 21700
					},
					"id": 1752,
					"p": 0.937644,
					"t_dtw": -1
				},
				{
					"text": " lived",
					"timestamps": {
						"from": "00:00:21,700",
						"to": "00:00:22,020"
					},
					"offsets": {
						"from": 21700,
						"to": 22020
					},
					"id": 5615,
					"p": 0.980597,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:22,020",
						"to": "00:00:22,080"
					},
					"offsets": {
						"from": 22020,
						"to": 22080
					},
					"id": 257,
					"p": 0.989806,
					"t_dtw": -1
				},
				{
					"text": " poor",
					"timestamps": {
						"from": "00:00:22,260",
						"to": "00:00:22,340"
					},
					"offsets": {
						"from": 22260,
						"to": 22340
					},
					"id": 3595,
					"p": 0.973454,
					"t_dtw": -1
				},
				{
					"text": " tailor",
					"timestamps": {
						"from": "00:00:22,340",
						"to": "00:00:22,730"
					},
					"offsets": {
						"from": 22340,
						"to": 22730
					},
					"id": 35280,
					"p": 0.670927,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:22,730",
						"to": "00:00:22,860"
					},
					"offsets": {
						"from": 22730,
						"to": 22860
					},
					"id": 11,
					"p": 0.727574,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:22,860",
						"to": "00:00:23,050"
					},
					"offsets": {
						"from": 22860,
						"to": 23050
					},
					"id": 508,
					"p": 0.942266,
					"t_dtw": -1
				},
				{
					"text": " had",
					"timestamps": {
						"from": "00:00:23,050",
						"to": "00:00:23,240"
					},
					"offsets": {
						"from": 23050,
						"to": 23240
					},
					"id": 550,
					"p": 0.997018,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:23,240",
						"to": "00:00:23,250"
					},
					"offsets": {
						"from": 23240,
						"to": 23250
					},
					"id": 257,
					"p": 0.996136,
					"t_dtw": -1
				},
				{
					"text": " son",
					"timestamps": {
						"from": "00:00:23,310",
						"to": "00:00:23,400"
					},
					"offsets": {
						"from": 23310,
						"to": 23400
					},
					"id": 3367,
					"p": 0.977694,
					"t_dtw": -1
				},
				{
					"text": " called",
					"timestamps": {
						"from": "00:00:23,500",
						"to": "00:00:23,790"
					},
					"offsets": {
						"from": 23500,
						"to": 23790
					},
					"id": 1444,
					"p": 0.66705,
					"t_dtw": -1
				},
				{
					"text": " Al",
					"timestamps": {
						"from": "00:00:23,910",
						"to": "00:00:24,010"
					},
					"offsets": {
						"from": 23910,
						"to": 24010
					},
					"id": 978,
					"p": 0.95969,
					"t_dtw": -1
				},
				{
					"text": "addin",
					"timestamps": {
						"from": "00:00:24,010",
						"to": "00:00:24,330"
					},
					"offsets": {
						"from": 24010,
						"to": 24330
					},
					"id": 46782,
					"p": 0.964032,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:24,330",
						"to": "00:00:24,460"
					},
					"offsets": {
						"from": 24330,
						"to": 24460
					},
					"id": 11,
					"p": 0.853909,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:24,460",
						"to": "00:00:24,520"
					},
					"offsets": {
						"from": 24460,
						"to": 24520
					},
					"id": 257,
					"p": 0.958133,
					"t_dtw": -1
				},
				{
					"text": " careless",
					"timestamps": {
						"from": "00:00:24,520",
						"to": "00:00:24,690"
					},
					"offsets": {
						"from": 24520,
						"to": 24690
					},
					"id": 36138,
					"p": 0.943495,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:25,300",
						"to": "00:00:25,300"
					},
					"offsets": {
						"from": 25300,
						"to": 25300
					},
					"id": 21696,
					"p": 0.909963,
					"t_dtw": -1
				},
				{
					"text": "-",
					"timestamps": {
						"from": "00:00:25,330",
						"to": "00:00:25,350"
					},
					"offsets": {
						"from": 25330,
						"to": 25350
					},
					"id": 12,
					"p": 0.31478,
					"t_dtw": -1
				},
				{
					"text": "boy",
					"timestamps": {
						"from": "00:00:25,360",
						"to": "00:00:25,550"
					},
					"offsets": {
						"from": 25360,
						"to": 25550
					},
					"id": 7081,
					"p": 0.998407,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:25,550",
						"to": "00:00:25,660"
					},
					"offsets": {
						"from": 25550,
						"to": 25660
					},
					"id": 11,
					"p": 0.841821,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:25,740",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 25740,
						"to": 26000
					},
					"id": 508,
					"p": 0.985908,
					"t_dtw": -1
				},
				{
					"text": "[_TT_1300]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 51663,
					"p": 0.0790712,
					"t_dtw": -1
				}
			]
		},
		{
			"timestamps": {
				"from": "00:00:26,000",
				"to": "00:00:56,000"
			},
			"offsets": {
				"from": 26000,
				"to": 56000
			},
			"text": " would do nothing but play all day long in the streets, with little idle boys",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 50363,
					"p": 0.99269,
					"t_dtw": -1
				},
				{
					"text": " would",
					"timestamps": {
						"from": "00:00:27,140",
						"to": "00:00:28,280"
					},
					"offsets": {
						"from": 27140,
						"to": 28280
					},
					"id": 561,
					"p": 0.731398,
					"t_dtw": -1
				},
				{
					"text": " do",
					"timestamps": {
						"from": "00:00:28,380",
						"to": "00:00:29,330"
					},
					"offsets": {
						"from": 28380,
						"to": 29330
					},
					"id": 466,
					"p": 0.976556,
					"t_dtw": -1
				},
				{
					"text": " nothing",
					"timestamps": {
						"from": "00:00:29,330",
						"to": "00:00:32,660"
					},
					"offsets": {
						"from": 29330,
						"to": 32660
					},
					"id": 2147,
					"p": 0.99731,
					"t_dtw": -1
				},
				{
					"text": " but",
					"timestamps": {
						"from": "00:00:32,660",
						"to": "00:00:34,060"
					},
					"offsets": {
						"from": 32660,
						"to": 34060
					},
					"id": 475,
					"p": 0.923541,
					"t_dtw": -1
				},
				{
					"text": " play",
					"timestamps": {
						"from": "00:00:34,120",
						"to": "00:00:35,980"
					},
					"offsets": {
						"from": 34120,
						"to": 35980
					},
					"id": 711,
					"p": 0.978036,
					"t_dtw": -1
				},
				{
					"text": " all",
					"timestamps": {
						"from": "00:00:35,980",
						"to": "00:00:37,400"
					},
					"offsets": {
						"from": 35980,
						"to": 37400
					},
					"id": 477,
					"p": 0.993187,
					"t_dtw": -1
				},
				{
					"text": " day",
					"timestamps": {
						"from": "00:00:37,400",
						"to": "00:00:38,750"
					},
					"offsets": {
						"from": 37400,
						"to": 38750
					},
					"id": 1110,
					"p": 0.991072,
					"t_dtw": -1
				},
				{
					"text": " long",
					"timestamps": {
						"from": "00:00:38,830",
						"to": "00:00:40,720"
					},
					"offsets": {
						"from": 38830,
						"to": 40720
					},
					"id": 890,
					"p": 0.987279,
					"t_dtw": -1
				},
				{
					"text": " in",
					"timestamps": {
						"from": "00:00:40,720",
						"to": "00:00:41,560"
					},
					"offsets": {
						"from": 40720,
						"to": 41560
					},
					"id": 287,
					"p": 0.981879,
					"t_dtw": -1
				},
				{
					"text": " the",
					"timestamps": {
						"from": "00:00:42,340",
						"to": "00:00:43,090"
					},
					"offsets": {
						"from": 42340,
						"to": 43090
					},
					"id": 262,
					"p": 0.989247,
					"t_dtw": -1
				},
				{
					"text": " streets",
					"timestamps": {
						"from": "00:00:43,090",
						"to": "00:00:44,840"
					},
					"offsets": {
						"from": 43090,
						"to": 44840
					},
					"id": 6483,
					"p": 0.994812,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:46,830",
						"to": "00:00:47,370"
					},
					"offsets": {
						"from": 46830,
						"to": 47370
					},
					"id": 11,
					"p": 0.619108,
					"t_dtw": -1
				},
				{
					"text": " with",
					"timestamps": {
						"from": "00:00:47,370",
						"to": "00:00:49,220"
					},
					"offsets": {
						"from": 47370,
						"to": 49220
					},
					"id": 351,
					"p": 0.988525,
					"t_dtw": -1
				},
				{
					"text": " little",
					"timestamps": {
						"from": "00:00:49,270",
						"to": "00:00:52,120"
					},
					"offsets": {
						"from": 49270,
						"to": 52120
					},
					"id": 1310,
					"p": 0.964262,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:52,120",
						"to": "00:00:54,020"
					},
					"offsets": {
						"from": 52120,
						"to": 54020
					},
					"id": 21696,
					"p": 0.964748,
					"t_dtw": -1
				},
				{
					"text": " boys",
					"timestamps": {
						"from": "00:00:54,020",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 54020,
						"to": 56000
					},
					"id": 6510,
					"p": 0.961095,
					"t_dtw": -1
				},
				{
					"text": "<|endoftext|>",
					"timestamps": {
						"from": "00:00:56,000",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 56000,
						"to": 56000
					},
					"id": 50256,
					"p": 0.167677,
					"t_dtw": -1
				}
			]
		}
	]
}

So I was wrong about the start timestamp and OpenAI's whisper does the same thing in this respect.

But if someone uses -ojf, the produced timestamps would perhaps not be useful, because the first text timestamp starts too early and the timestamps repeat later on.
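One way to spot the "repeating timestamps" symptom in a full JSON dump is to look for runs of consecutive tokens that share the same `offsets`. The following is only a minimal sketch: `TokenSpan` and `first_repeated_run` are hypothetical names that are not part of whisper.cpp, and the token offsets are assumed to have already been parsed out of the JSON.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the "offsets" object of one token (milliseconds).
struct TokenSpan {
    int64_t from_ms;
    int64_t to_ms;
};

// Returns the index of the first token that starts a run of at least
// `min_run` consecutive tokens with identical (from, to) offsets,
// or -1 if there is no such run.
int first_repeated_run(const std::vector<TokenSpan> & tokens, size_t min_run) {
    size_t run_start = 0;
    for (size_t i = 1; i <= tokens.size(); ++i) {
        // A run continues while the current token matches the run's first token.
        const bool same = i < tokens.size() &&
            tokens[i].from_ms == tokens[run_start].from_ms &&
            tokens[i].to_ms   == tokens[run_start].to_ms;
        if (!same) {
            if (i - run_start >= min_run) {
                return (int) run_start;
            }
            run_start = i;
        }
    }
    return -1;
}
```

A single token whose start and end coincide (such as the trailing `<|endoftext|>` above) is not flagged as long as `min_run` is 2 or more.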

ggerganov · 2025-06-06T14:15:38Z

Yes, the output should not contain token-level timestamps when they are not enabled. We have some warnings in the whisper.h header:

whisper.cpp/include/whisper.h

Lines 140 to 149 in b175baa

    
    // token-level timestamp data
    // do not use if you haven't computed token-level timestamps
    int64_t t0;        // start time of the token
    int64_t t1;        //   end time of the token

    // [EXPERIMENTAL] Token-level timestamps with DTW
    // do not use if you haven't computed token-level timestamps with dtw
    // Roughly corresponds to the moment in audio in which the token was output
    int64_t t_dtw;

But it would be better to prevent incorrect usage in the examples too.

Ideally, the DTW algorithm should be fixed so we can provide token-level timestamps, but I'm not sure how difficult this would be. Anyway, not a very big priority atm.

ggerganov · 2025-06-06T14:31:36Z

> But if someone uses -ojf, the produced timestamps would perhaps not be useful, because the first text timestamp starts too early and the timestamps repeat later on.

I just noticed that the -ojf flag enables token-level timestamps:

whisper.cpp/examples/cli/cli.cpp

Lines 1166 to 1168 in a73e240

    
    wparams.token_timestamps = params.output_wts || params.output_jsn_full || params.max_len > 0;
    wparams.thold_pt         = params.word_thold;

So I had forgotten that we have 2 different token-level timestamp algorithms:

The simple heuristic approach that is enabled by this parameter:

whisper.cpp/include/whisper.h

Lines 501 to 504 in a73e240

    
    // [EXPERIMENTAL] token-level timestamps
    bool  token_timestamps; // enable token-level timestamps
    float thold_pt;         // timestamp token probability threshold (~0.01)

The more advanced DTW approach (which should correspond to the OAI implementation):

whisper.cpp/include/whisper.h

Lines 120 to 129 in a73e240

    
        // [EXPERIMENTAL] Token-level timestamps with DTW
        bool dtw_token_timestamps;
        enum whisper_alignment_heads_preset dtw_aheads_preset;
        int dtw_n_top;
        struct whisper_aheads dtw_aheads;
        size_t dtw_mem_size; // TODO: remove
    };

The second one is probably too difficult to fix, so no need to look into it.

But the first one should give OK results. So in that sense, your observation about the timestamp repetitions is correct and this does look like a bug.

hlevring · 2025-06-08T15:09:42Z

Maybe not the right place to mention this, but I was looking for issues that reference dtw. It would definitely be great to have -dtw fixed, and with the new VAD I would suspect that timestamps may get even more precise?

src/whisper.cpp

This commit addresses an issue with token timestamps when audio segments
are skipped, in `whisper_exp_compute_token_level_timestamps` related to
the VAD processing and the energy levels.

The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:

```console
                      (skipped introduction)
                        ↓
Audio segment:     [2600ms → 5600ms]  (3 seconds of actual audio)
Energy array:      [0 → 480652]       (samples for 3 seconds)
Token timestamps:  [3266ms → 3408ms]  (absolute timestamps)
```

So both `s0` and `t1` get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.

This is addressed by using segment-relative timestamps in the
`timestamp_to_sample` and `sample_to_timestamp`.

Resolves: ggml-org#3207
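To make the clamping problem concrete, here is a minimal, self-contained sketch of the two mappings. It is not the code from `src/whisper.cpp`: the function names, the 16 kHz sample rate, and the 10 ms timestamp unit are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstdint>

// Assumptions for this sketch: 16 kHz audio, timestamps in units of 10 ms.
constexpr int64_t kSampleRate = 16000;

// Absolute mapping: treats t as an offset from the start of the energy array,
// so timestamps past the end of the array all clamp to the last sample.
int abs_timestamp_to_sample(int64_t t, int n_samples) {
    const int64_t s = (t * kSampleRate) / 100;
    return (int) std::max<int64_t>(0, std::min<int64_t>(s, n_samples - 1));
}

// Segment-relative mapping: subtract the segment start before converting,
// so the index is relative to the first sample of the energy array.
int rel_timestamp_to_sample(int64_t t, int64_t segment_t0, int n_samples) {
    const int64_t s = ((t - segment_t0) * kSampleRate) / 100;
    return (int) std::max<int64_t>(0, std::min<int64_t>(s, n_samples - 1));
}

// Inverse of the segment-relative mapping: add the segment start back.
int64_t rel_sample_to_timestamp(int i_sample, int64_t segment_t0) {
    return (100 * (int64_t) i_sample) / kSampleRate + segment_t0;
}
```

Under these assumptions, a 3-second segment has 48000 samples. With a segment start of 260 (2600 ms), tokens at 326 and 340 both map to sample 47999 with the absolute variant, while the segment-relative variant yields 10560 and 12800, and `rel_sample_to_timestamp(10560, 260)` gives back 326.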

danbev · 2025-06-13T14:10:31Z

> It would definitely be great to have -dtw fixed, and with the new VAD I would suspect that timestamps may get even more precise?

@ggerganov Is this something I could/should take a look at?

ggerganov · 2025-06-13T14:19:54Z

@danbev Yes, but I don't have a sense of how much effort is needed to fix it, as I never familiarized myself deeply with the initial implementation. It might be something simple, but it also might be quite difficult. AFAIR it required extracting intermediate data from the encoder attention, and the original implementation was a bit hacky in this regard.

Feel free to take a look, but if it turns out to be something too complicated, then no need to spend effort on it.

danbev requested review from Copilot and ggerganov and removed request for Copilot June 5, 2025 16:16

danbev changed the title ~~whisper : add support for for skipped audio~~ whisper : add support for skipped audio Jun 9, 2025

danbev force-pushed the token-timestamp-intro-issue branch 2 times, most recently from f3ee700 to c34ee30 Compare June 11, 2025 11:54

ggerganov approved these changes Jun 13, 2025

danbev force-pushed the token-timestamp-intro-issue branch from c34ee30 to deb1c0d Compare June 13, 2025 14:04

danbev changed the title ~~whisper : add support for skipped audio~~ whisper : fix VAD processing for skipped audio segments Jun 13, 2025

danbev merged commit 705db0f into ggml-org:master Jun 13, 2025
54 checks passed
