whisper : fix VAD processing for skipped audio segments #3230

Merged
merged 1 commit into from
Jun 13, 2025

Conversation

danbev
Member

@danbev danbev commented Jun 5, 2025

This commit addresses an issue with token timestamps in
whisper_exp_compute_token_level_timestamps when audio segments are
skipped. The issue is related to VAD processing and the energy levels.

The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:

                      (skipped introduction)

Audio segment:     [2600ms → 5600ms]  (3 seconds of actual audio)
Energy array:      [0 → 480652]       (samples for 3 seconds)
Token timestamps:  [3266ms → 3408ms]  (absolute timestamps)

So both s0 and t1 get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.

This is addressed by using segment-relative timestamps in
timestamp_to_sample and sample_to_timestamp.
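
The clamping described above can be illustrated with a minimal sketch. This is not the code changed in this PR: the helper names are hypothetical, timestamps are taken in milliseconds for readability, and 16 kHz audio is assumed.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: 16 kHz mono audio, so one millisecond spans 16 samples.
constexpr int64_t kSamplesPerMs = 16;

// Hypothetical helper: maps an absolute timestamp (ms) straight to a sample
// index and clamps it to the bounds of the energy array.
static int64_t sample_from_absolute_ms(int64_t t_ms, int64_t n_samples) {
    const int64_t s = t_ms * kSamplesPerMs;
    return std::max<int64_t>(0, std::min<int64_t>(n_samples - 1, s));
}

// Hypothetical helper: subtracts the segment start first, so the index is
// relative to the first sample of the energy array.
static int64_t sample_from_segment_ms(int64_t t_ms, int64_t seg_start_ms, int64_t n_samples) {
    return sample_from_absolute_ms(t_ms - seg_start_ms, n_samples);
}
```

With an illustrative energy array of 48000 samples (3 seconds at the assumed rate, not the size reported above) and a segment starting at 2600 ms, the absolute helper maps both 3266 ms and 3408 ms to index 47999, so the two become indistinguishable. The segment-relative helper yields 10656 and 12928, which stay inside the array and remain distinct.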

@danbev danbev requested review from Copilot and ggerganov and removed request for Copilot June 5, 2025 16:16

@ggerganov
Member

ggerganov commented Jun 6, 2025

This caused the segment start timestamp to start at 0 but the end timestamp was relative to the skipped audio (so much larger and incorrect).

I think the end timestamp is correct. It is just the start timestamp that begins too early. But this is also what the reference OpenAI implementation produces, so I don't think it is a bug.

The proposed changes in this PR don't seem to work with gb1.wav:

make -j && ./build/bin/whisper-cli -f ./samples/gb1.wav -m models/ggml-medium.en.bin -ps
main: processing './samples/gb1.wav' (3179750 samples, 198.7 sec), 1 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_exp_compute_token_level_timestamps: audio samples skipped, setting segment.t0 to 2962

[00:00:29.620 --> 00:00:07.360]  [_BEG_] My fellow Americans, this day has brought terrible news and great sadness to our[_TT_368]
[00:00:07.360 --> 00:00:14.800]   country. At nine o'clock this morning, Mission Control in Houston lost contact[_TT_740]
[00:00:14.800 --> 00:00:21.400]   with our space shuttle Columbia. A short time later, debris was seen falling from[_TT_1070]
...
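
The units in the log above can be checked with a small sketch. It assumes that each `[_TT_n]` step is 20 ms and that `segment.t0` is counted in 10 ms units. Both assumptions are inferred from the printed values (for example `[_TT_368]` next to `00:00:07.360`, and `2962` next to `00:00:29.620`) rather than taken from the whisper.cpp sources, and all helper names are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: one [_TT_n] step is 20 ms, as the log above suggests.
static int64_t tt_to_ms(int64_t n) { return n * 20; }

// Assumption: segment.t0 is expressed in 10 ms units.
static int64_t centis_to_ms(int64_t t) { return t * 10; }

// Formats milliseconds as HH:MM:SS.mmm, matching the log layout.
static std::string format_ms(int64_t ms) {
    const int64_t h = ms / 3600000; ms %= 3600000;
    const int64_t m = ms / 60000;   ms %= 60000;
    const int64_t s = ms / 1000;    ms %= 1000;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d.%03d",
                  (int) h, (int) m, (int) s, (int) ms);
    return buf;
}

// A segment is inverted when its start lies after its end.
static bool is_inverted(int64_t t0_ms, int64_t t1_ms) { return t0_ms > t1_ms; }
```

Under these assumptions, `centis_to_ms(2962)` gives 29620 ms and `tt_to_ms(368)` gives 7360 ms, so `is_inverted` reports the first segment as starting after it ends, which is the symptom visible in the first line of the transcript.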

@danbev
Member Author

danbev commented Jun 6, 2025

I think the end timestamp is correct. It is just the start timestamp that begins too early. But this is also what the reference OpenAI implementation produces, so I don't think it is a bug.

This is what I'm seeing when I run this sample audio using OpenAI's whisper:

whisper samples/aladdin-first30.mp3 --model medium.en --word_timestamps True
[00:20.040 --> 00:25.980]  There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.980 --> 00:29.920]  would do nothing but play all day long in the streets with little idle boys.

And the output in aladdin-first30.json looks like this:

{
  "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 20.039999999999996,
      "end": 25.98,
      "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
      "tokens": [
        50363,
        1318,
        1752,
        5615,
        257,
        3595,
        35280,
        11,
        508,
        550,
        257,
        3367,
        1444,
        978,
        46782,
        11,
        257,
        36138,
        21696,
        2933,
        11,
        508,
        51661
      ],
      "temperature": 0.0,
      "avg_logprob": -0.22081296920776367,
      "compression_ratio": 1.075,
      "no_speech_prob": 0.03549518063664436,
      "words": [
        {
          "word": " There",
          "start": 20.039999999999996,
          "end": 20.619999999999997,
          "probability": 0.18882536888122559
        },
        {
          "word": " once",
          "start": 20.619999999999997,
          "end": 21.2,
          "probability": 0.9850126504898071
        },
        {
          "word": " lived",
          "start": 21.2,
          "end": 21.5,
          "probability": 0.9961034059524536
        },
       ...
      {
          "word": " nothing",
          "start": 26.28,
          "end": 26.54,
          "probability": 0.9977497458457947
        },
        {
          "word": " but",
          "start": 26.54,
          "end": 26.82,
          "probability": 0.9585630893707275
        },
        {
          "word": " play",
          "start": 26.82,
          "end": 27.12,
          "probability": 0.9835618138313293
        },
        {
          "word": " all",
          "start": 27.12,
          "end": 27.32,
          "probability": 0.9963666200637817
        },
        {
          "word": " day",
          "start": 27.32,
          "end": 27.54,
          "probability": 0.9964172840118408
        },

If we only correct the start timestamp (though the fix I made for that seems to be incorrect), we would get the following timestamps in the second segment:

          "text": " nothing",
          "timestamps": {
            "from": "00:00:29,330",
            "to": "00:00:30,020"
          },
          "offsets": {
            "from": 29330,
            "to": 30020
          },
          "id": 2147,
          "p": 0.99731,
          "t_dtw": -1
        },
        {
          "text": " but",
          "timestamps": {
            "from": "00:00:30,040",
            "to": "00:00:30,040"
          },
          "offsets": {
            "from": 30040,
            "to": 30040
          },
          "id": 475,
          "p": 0.923541,
          "t_dtw": -1
        },
        {
          "text": " play",
          "timestamps": {
            "from": "00:00:30,040",
            "to": "00:00:30,040"
          },
          "offsets": {
            "from": 30040,
            "to": 30040
          },
          "id": 711,
          "p": 0.978036,
          "t_dtw": -1
        },

This is actually also the case when running without the first fix (that is, using the master branch).
So I think there might be an issue with both the initial start timestamp and these repeating timestamps, unless I'm missing something.
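
One way to spot the repeating timestamps mechanically is to count tokens that have zero duration and exactly the same span as the token before them. The sketch below is only an illustration: `TokenSpan` and `count_collapsed` are hypothetical names, and the spans would have to be filled in from the `offsets` fields of the JSON output.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the "offsets" of one token, in milliseconds.
struct TokenSpan {
    int64_t from_ms;
    int64_t to_ms;
};

// Counts tokens that have zero duration and repeat the previous token's span,
// which is what a run of timestamps pinned to one value looks like.
static std::size_t count_collapsed(const std::vector<TokenSpan> & toks) {
    std::size_t n = 0;
    for (std::size_t i = 1; i < toks.size(); ++i) {
        const bool zero_len  = toks[i].from_ms == toks[i].to_ms;
        const bool same_span = toks[i].from_ms == toks[i - 1].from_ms &&
                               toks[i].to_ms   == toks[i - 1].to_ms;
        if (zero_len && same_span) {
            ++n;
        }
    }
    return n;
}
```

For the three tokens shown above (" nothing", " but", " play") the count is 1, because " play" repeats the zero-length span of " but". A healthy transcript would be expected to give 0 for ordinary word tokens.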

@ggerganov
Member

The --word_timestamps True flag in OpenAI enables an additional post-processing algorithm. We have a similar implementation using the -dtw flag, but this has been broken for a long time.

If you remove --word_timestamps True does it produce the same timestamp as whisper.cpp?

@danbev
Member Author

danbev commented Jun 6, 2025

The --word_timestamps True flag in OpenAI enables an additional post-processing algorithm. We have a similar implementation using the -dtw flag, but this has been broken for a long time.

Ah I see, I thought this was the same thing as enabling full JSON output to get the timestamps, so I tried to follow that.

If you remove --word_timestamps True does it produce the same timestamp as whisper.cpp?

This is the output when I remove the --word_timestamps:

[00:00.000 --> 00:25.960]  There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who
[00:25.960 --> 00:30.040]  would do nothing but play all day long in the streets with little idle boys.
aladdin-first30.json
{
  "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who would do nothing but play all day long in the streets with little idle boys.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 25.96,
      "text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle boy, who",
      "tokens": [
        50363,
        1318,
        1752,
        5615,
        257,
        3595,
        35280,
        11,
        508,
        550,
        257,
        3367,
        1444,
        978,
        46782,
        11,
        257,
        36138,
        21696,
        2933,
        11,
        508,
        51661
      ],
      "temperature": 0.0,
      "avg_logprob": -0.22081296920776367,
      "compression_ratio": 1.075,
      "no_speech_prob": 0.03549518063664436
    },
    {
      "id": 1,
      "seek": 2596,
      "start": 25.96,
      "end": 30.04,
      "text": " would do nothing but play all day long in the streets with little idle boys.",
      "tokens": [
        50363,
        561,
        466,
        2147,
        475,
        711,
        477,
        1110,
        890,
        287,
        262,
        6483,
        351,
        1310,
        21696,
        6510,
        13,
        50567
      ],
      "temperature": 0.0,
      "avg_logprob": -0.264762376484118,
      "compression_ratio": 1.0857142857142856,
      "no_speech_prob": 0.9776378870010376
    }
  ],
  "language": "en"
}

And this is the output from whisper.cpp (using master branch):

[00:00:00.000 --> 00:00:26.000]   There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000]   would do nothing but play all day long in the streets, with little idle boys
output_json: saving output to 'aladdin.json'
aladdin.json
{
	"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "./models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:26,000"
			},
			"offsets": {
				"from": 0,
				"to": 26000
			},
			"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:00,000",
						"to": "00:00:00,000"
					},
					"offsets": {
						"from": 0,
						"to": 0
					},
					"id": 50363,
					"p": 0.755106,
					"t_dtw": -1
				},
				{
					"text": " There",
					"timestamps": {
						"from": "00:00:01,140",
						"to": "00:00:21,120"
					},
					"offsets": {
						"from": 1140,
						"to": 21120
					},
					"id": 1318,
					"p": 0.338112,
					"t_dtw": -1
				},
				{
					"text": " once",
					"timestamps": {
						"from": "00:00:21,120",
						"to": "00:00:21,290"
					},
					"offsets": {
						"from": 21120,
						"to": 21290
					},
					"id": 1752,
					"p": 0.937644,
					"t_dtw": -1
				},
				{
					"text": " lived",
					"timestamps": {
						"from": "00:00:21,290",
						"to": "00:00:21,540"
					},
					"offsets": {
						"from": 21290,
						"to": 21540
					},
					"id": 5615,
					"p": 0.980597,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:21,540",
						"to": "00:00:21,600"
					},
					"offsets": {
						"from": 21540,
						"to": 21600
					},
					"id": 257,
					"p": 0.989806,
					"t_dtw": -1
				},
				{
					"text": " poor",
					"timestamps": {
						"from": "00:00:21,600",
						"to": "00:00:21,860"
					},
					"offsets": {
						"from": 21600,
						"to": 21860
					},
					"id": 3595,
					"p": 0.973454,
					"t_dtw": -1
				},
				{
					"text": " tailor",
					"timestamps": {
						"from": "00:00:21,860",
						"to": "00:00:22,250"
					},
					"offsets": {
						"from": 21860,
						"to": 22250
					},
					"id": 35280,
					"p": 0.670927,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:22,250",
						"to": "00:00:22,320"
					},
					"offsets": {
						"from": 22250,
						"to": 22320
					},
					"id": 11,
					"p": 0.727574,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:22,570",
						"to": "00:00:22,570"
					},
					"offsets": {
						"from": 22570,
						"to": 22570
					},
					"id": 508,
					"p": 0.942266,
					"t_dtw": -1
				},
				{
					"text": " had",
					"timestamps": {
						"from": "00:00:22,570",
						"to": "00:00:22,760"
					},
					"offsets": {
						"from": 22570,
						"to": 22760
					},
					"id": 550,
					"p": 0.997018,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:22,760",
						"to": "00:00:22,840"
					},
					"offsets": {
						"from": 22760,
						"to": 22840
					},
					"id": 257,
					"p": 0.996136,
					"t_dtw": -1
				},
				{
					"text": " son",
					"timestamps": {
						"from": "00:00:22,840",
						"to": "00:00:23,140"
					},
					"offsets": {
						"from": 22840,
						"to": 23140
					},
					"id": 3367,
					"p": 0.977694,
					"t_dtw": -1
				},
				{
					"text": " called",
					"timestamps": {
						"from": "00:00:23,140",
						"to": "00:00:23,590"
					},
					"offsets": {
						"from": 23140,
						"to": 23590
					},
					"id": 1444,
					"p": 0.66705,
					"t_dtw": -1
				},
				{
					"text": " Al",
					"timestamps": {
						"from": "00:00:23,590",
						"to": "00:00:23,740"
					},
					"offsets": {
						"from": 23590,
						"to": 23740
					},
					"id": 978,
					"p": 0.95969,
					"t_dtw": -1
				},
				{
					"text": "addin",
					"timestamps": {
						"from": "00:00:23,740",
						"to": "00:00:23,960"
					},
					"offsets": {
						"from": 23740,
						"to": 23960
					},
					"id": 46782,
					"p": 0.964032,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:24,280",
						"to": "00:00:24,280"
					},
					"offsets": {
						"from": 24280,
						"to": 24280
					},
					"id": 11,
					"p": 0.853909,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:24,320",
						"to": "00:00:24,350"
					},
					"offsets": {
						"from": 24320,
						"to": 24350
					},
					"id": 257,
					"p": 0.958133,
					"t_dtw": -1
				},
				{
					"text": " careless",
					"timestamps": {
						"from": "00:00:24,350",
						"to": "00:00:24,900"
					},
					"offsets": {
						"from": 24350,
						"to": 24900
					},
					"id": 36138,
					"p": 0.943495,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:24,970",
						"to": "00:00:25,270"
					},
					"offsets": {
						"from": 24970,
						"to": 25270
					},
					"id": 21696,
					"p": 0.909963,
					"t_dtw": -1
				},
				{
					"text": "-",
					"timestamps": {
						"from": "00:00:25,270",
						"to": "00:00:25,340"
					},
					"offsets": {
						"from": 25270,
						"to": 25340
					},
					"id": 12,
					"p": 0.31478,
					"t_dtw": -1
				},
				{
					"text": "boy",
					"timestamps": {
						"from": "00:00:25,350",
						"to": "00:00:25,570"
					},
					"offsets": {
						"from": 25350,
						"to": 25570
					},
					"id": 7081,
					"p": 0.998407,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:25,570",
						"to": "00:00:25,610"
					},
					"offsets": {
						"from": 25570,
						"to": 25610
					},
					"id": 11,
					"p": 0.841821,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:25,900",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 25900,
						"to": 26000
					},
					"id": 508,
					"p": 0.985908,
					"t_dtw": -1
				},
				{
					"text": "[_TT_1300]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 51663,
					"p": 0.0790712,
					"t_dtw": -1
				}
			]
		},
		{
			"timestamps": {
				"from": "00:00:26,000",
				"to": "00:00:56,000"
			},
			"offsets": {
				"from": 26000,
				"to": 56000
			},
			"text": " would do nothing but play all day long in the streets, with little idle boys",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 50363,
					"p": 0.99269,
					"t_dtw": -1
				},
				{
					"text": " would",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:28,350"
					},
					"offsets": {
						"from": 26000,
						"to": 28350
					},
					"id": 561,
					"p": 0.731398,
					"t_dtw": -1
				},
				{
					"text": " do",
					"timestamps": {
						"from": "00:00:28,460",
						"to": "00:00:29,330"
					},
					"offsets": {
						"from": 28460,
						"to": 29330
					},
					"id": 466,
					"p": 0.976556,
					"t_dtw": -1
				},
				{
					"text": " nothing",
					"timestamps": {
						"from": "00:00:29,330",
						"to": "00:00:30,020"
					},
					"offsets": {
						"from": 29330,
						"to": 30020
					},
					"id": 2147,
					"p": 0.99731,
					"t_dtw": -1
				},
				{
					"text": " but",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 475,
					"p": 0.923541,
					"t_dtw": -1
				},
				{
					"text": " play",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 711,
					"p": 0.978036,
					"t_dtw": -1
				},
				{
					"text": " all",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 477,
					"p": 0.993187,
					"t_dtw": -1
				},
				{
					"text": " day",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 1110,
					"p": 0.991072,
					"t_dtw": -1
				},
				{
					"text": " long",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 890,
					"p": 0.987279,
					"t_dtw": -1
				},
				{
					"text": " in",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 287,
					"p": 0.981879,
					"t_dtw": -1
				},
				{
					"text": " the",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 262,
					"p": 0.989247,
					"t_dtw": -1
				},
				{
					"text": " streets",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 6483,
					"p": 0.994812,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 11,
					"p": 0.619108,
					"t_dtw": -1
				},
				{
					"text": " with",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 351,
					"p": 0.988525,
					"t_dtw": -1
				},
				{
					"text": " little",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 1310,
					"p": 0.964262,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 21696,
					"p": 0.964748,
					"t_dtw": -1
				},
				{
					"text": " boys",
					"timestamps": {
						"from": "00:00:30,040",
						"to": "00:00:30,040"
					},
					"offsets": {
						"from": 30040,
						"to": 30040
					},
					"id": 6510,
					"p": 0.961095,
					"t_dtw": -1
				},
				{
					"text": "<|endoftext|>",
					"timestamps": {
						"from": "00:00:56,000",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 56000,
						"to": 56000
					},
					"id": 50256,
					"p": 0.167677,
					"t_dtw": -1
				}
			]
		}
	]
}

And this is the output from whisper.cpp using this PR:

[00:00:21.120 --> 00:00:26.000]   There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who
[00:00:26.000 --> 00:00:56.000]   would do nothing but play all day long in the streets, with little idle boys
aladdin.json
{
	"systeminfo": "WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 890 | F16 = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | ",
	"model": {
		"type": "medium",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"text": {
			"ctx": 448,
			"state": 1024,
			"head": 16,
			"layer": 24
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "./models/ggml-medium.en.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:21,120",
				"to": "00:00:26,000"
			},
			"offsets": {
				"from": 21120,
				"to": 26000
			},
			"text": " There once lived a poor tailor, who had a son called Aladdin, a careless idle-boy, who",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:21,120",
						"to": "00:00:21,120"
					},
					"offsets": {
						"from": 21120,
						"to": 21120
					},
					"id": 50363,
					"p": 0.755106,
					"t_dtw": -1
				},
				{
					"text": " There",
					"timestamps": {
						"from": "00:00:21,280",
						"to": "00:00:21,440"
					},
					"offsets": {
						"from": 21280,
						"to": 21440
					},
					"id": 1318,
					"p": 0.338112,
					"t_dtw": -1
				},
				{
					"text": " once",
					"timestamps": {
						"from": "00:00:21,440",
						"to": "00:00:21,700"
					},
					"offsets": {
						"from": 21440,
						"to": 21700
					},
					"id": 1752,
					"p": 0.937644,
					"t_dtw": -1
				},
				{
					"text": " lived",
					"timestamps": {
						"from": "00:00:21,700",
						"to": "00:00:22,020"
					},
					"offsets": {
						"from": 21700,
						"to": 22020
					},
					"id": 5615,
					"p": 0.980597,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:22,020",
						"to": "00:00:22,080"
					},
					"offsets": {
						"from": 22020,
						"to": 22080
					},
					"id": 257,
					"p": 0.989806,
					"t_dtw": -1
				},
				{
					"text": " poor",
					"timestamps": {
						"from": "00:00:22,260",
						"to": "00:00:22,340"
					},
					"offsets": {
						"from": 22260,
						"to": 22340
					},
					"id": 3595,
					"p": 0.973454,
					"t_dtw": -1
				},
				{
					"text": " tailor",
					"timestamps": {
						"from": "00:00:22,340",
						"to": "00:00:22,730"
					},
					"offsets": {
						"from": 22340,
						"to": 22730
					},
					"id": 35280,
					"p": 0.670927,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:22,730",
						"to": "00:00:22,860"
					},
					"offsets": {
						"from": 22730,
						"to": 22860
					},
					"id": 11,
					"p": 0.727574,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:22,860",
						"to": "00:00:23,050"
					},
					"offsets": {
						"from": 22860,
						"to": 23050
					},
					"id": 508,
					"p": 0.942266,
					"t_dtw": -1
				},
				{
					"text": " had",
					"timestamps": {
						"from": "00:00:23,050",
						"to": "00:00:23,240"
					},
					"offsets": {
						"from": 23050,
						"to": 23240
					},
					"id": 550,
					"p": 0.997018,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:23,240",
						"to": "00:00:23,250"
					},
					"offsets": {
						"from": 23240,
						"to": 23250
					},
					"id": 257,
					"p": 0.996136,
					"t_dtw": -1
				},
				{
					"text": " son",
					"timestamps": {
						"from": "00:00:23,310",
						"to": "00:00:23,400"
					},
					"offsets": {
						"from": 23310,
						"to": 23400
					},
					"id": 3367,
					"p": 0.977694,
					"t_dtw": -1
				},
				{
					"text": " called",
					"timestamps": {
						"from": "00:00:23,500",
						"to": "00:00:23,790"
					},
					"offsets": {
						"from": 23500,
						"to": 23790
					},
					"id": 1444,
					"p": 0.66705,
					"t_dtw": -1
				},
				{
					"text": " Al",
					"timestamps": {
						"from": "00:00:23,910",
						"to": "00:00:24,010"
					},
					"offsets": {
						"from": 23910,
						"to": 24010
					},
					"id": 978,
					"p": 0.95969,
					"t_dtw": -1
				},
				{
					"text": "addin",
					"timestamps": {
						"from": "00:00:24,010",
						"to": "00:00:24,330"
					},
					"offsets": {
						"from": 24010,
						"to": 24330
					},
					"id": 46782,
					"p": 0.964032,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:24,330",
						"to": "00:00:24,460"
					},
					"offsets": {
						"from": 24330,
						"to": 24460
					},
					"id": 11,
					"p": 0.853909,
					"t_dtw": -1
				},
				{
					"text": " a",
					"timestamps": {
						"from": "00:00:24,460",
						"to": "00:00:24,520"
					},
					"offsets": {
						"from": 24460,
						"to": 24520
					},
					"id": 257,
					"p": 0.958133,
					"t_dtw": -1
				},
				{
					"text": " careless",
					"timestamps": {
						"from": "00:00:24,520",
						"to": "00:00:24,690"
					},
					"offsets": {
						"from": 24520,
						"to": 24690
					},
					"id": 36138,
					"p": 0.943495,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:25,300",
						"to": "00:00:25,300"
					},
					"offsets": {
						"from": 25300,
						"to": 25300
					},
					"id": 21696,
					"p": 0.909963,
					"t_dtw": -1
				},
				{
					"text": "-",
					"timestamps": {
						"from": "00:00:25,330",
						"to": "00:00:25,350"
					},
					"offsets": {
						"from": 25330,
						"to": 25350
					},
					"id": 12,
					"p": 0.31478,
					"t_dtw": -1
				},
				{
					"text": "boy",
					"timestamps": {
						"from": "00:00:25,360",
						"to": "00:00:25,550"
					},
					"offsets": {
						"from": 25360,
						"to": 25550
					},
					"id": 7081,
					"p": 0.998407,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:25,550",
						"to": "00:00:25,660"
					},
					"offsets": {
						"from": 25550,
						"to": 25660
					},
					"id": 11,
					"p": 0.841821,
					"t_dtw": -1
				},
				{
					"text": " who",
					"timestamps": {
						"from": "00:00:25,740",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 25740,
						"to": 26000
					},
					"id": 508,
					"p": 0.985908,
					"t_dtw": -1
				},
				{
					"text": "[_TT_1300]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 51663,
					"p": 0.0790712,
					"t_dtw": -1
				}
			]
		},
		{
			"timestamps": {
				"from": "00:00:26,000",
				"to": "00:00:56,000"
			},
			"offsets": {
				"from": 26000,
				"to": 56000
			},
			"text": " would do nothing but play all day long in the streets, with little idle boys",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:26,000",
						"to": "00:00:26,000"
					},
					"offsets": {
						"from": 26000,
						"to": 26000
					},
					"id": 50363,
					"p": 0.99269,
					"t_dtw": -1
				},
				{
					"text": " would",
					"timestamps": {
						"from": "00:00:27,140",
						"to": "00:00:28,280"
					},
					"offsets": {
						"from": 27140,
						"to": 28280
					},
					"id": 561,
					"p": 0.731398,
					"t_dtw": -1
				},
				{
					"text": " do",
					"timestamps": {
						"from": "00:00:28,380",
						"to": "00:00:29,330"
					},
					"offsets": {
						"from": 28380,
						"to": 29330
					},
					"id": 466,
					"p": 0.976556,
					"t_dtw": -1
				},
				{
					"text": " nothing",
					"timestamps": {
						"from": "00:00:29,330",
						"to": "00:00:32,660"
					},
					"offsets": {
						"from": 29330,
						"to": 32660
					},
					"id": 2147,
					"p": 0.99731,
					"t_dtw": -1
				},
				{
					"text": " but",
					"timestamps": {
						"from": "00:00:32,660",
						"to": "00:00:34,060"
					},
					"offsets": {
						"from": 32660,
						"to": 34060
					},
					"id": 475,
					"p": 0.923541,
					"t_dtw": -1
				},
				{
					"text": " play",
					"timestamps": {
						"from": "00:00:34,120",
						"to": "00:00:35,980"
					},
					"offsets": {
						"from": 34120,
						"to": 35980
					},
					"id": 711,
					"p": 0.978036,
					"t_dtw": -1
				},
				{
					"text": " all",
					"timestamps": {
						"from": "00:00:35,980",
						"to": "00:00:37,400"
					},
					"offsets": {
						"from": 35980,
						"to": 37400
					},
					"id": 477,
					"p": 0.993187,
					"t_dtw": -1
				},
				{
					"text": " day",
					"timestamps": {
						"from": "00:00:37,400",
						"to": "00:00:38,750"
					},
					"offsets": {
						"from": 37400,
						"to": 38750
					},
					"id": 1110,
					"p": 0.991072,
					"t_dtw": -1
				},
				{
					"text": " long",
					"timestamps": {
						"from": "00:00:38,830",
						"to": "00:00:40,720"
					},
					"offsets": {
						"from": 38830,
						"to": 40720
					},
					"id": 890,
					"p": 0.987279,
					"t_dtw": -1
				},
				{
					"text": " in",
					"timestamps": {
						"from": "00:00:40,720",
						"to": "00:00:41,560"
					},
					"offsets": {
						"from": 40720,
						"to": 41560
					},
					"id": 287,
					"p": 0.981879,
					"t_dtw": -1
				},
				{
					"text": " the",
					"timestamps": {
						"from": "00:00:42,340",
						"to": "00:00:43,090"
					},
					"offsets": {
						"from": 42340,
						"to": 43090
					},
					"id": 262,
					"p": 0.989247,
					"t_dtw": -1
				},
				{
					"text": " streets",
					"timestamps": {
						"from": "00:00:43,090",
						"to": "00:00:44,840"
					},
					"offsets": {
						"from": 43090,
						"to": 44840
					},
					"id": 6483,
					"p": 0.994812,
					"t_dtw": -1
				},
				{
					"text": ",",
					"timestamps": {
						"from": "00:00:46,830",
						"to": "00:00:47,370"
					},
					"offsets": {
						"from": 46830,
						"to": 47370
					},
					"id": 11,
					"p": 0.619108,
					"t_dtw": -1
				},
				{
					"text": " with",
					"timestamps": {
						"from": "00:00:47,370",
						"to": "00:00:49,220"
					},
					"offsets": {
						"from": 47370,
						"to": 49220
					},
					"id": 351,
					"p": 0.988525,
					"t_dtw": -1
				},
				{
					"text": " little",
					"timestamps": {
						"from": "00:00:49,270",
						"to": "00:00:52,120"
					},
					"offsets": {
						"from": 49270,
						"to": 52120
					},
					"id": 1310,
					"p": 0.964262,
					"t_dtw": -1
				},
				{
					"text": " idle",
					"timestamps": {
						"from": "00:00:52,120",
						"to": "00:00:54,020"
					},
					"offsets": {
						"from": 52120,
						"to": 54020
					},
					"id": 21696,
					"p": 0.964748,
					"t_dtw": -1
				},
				{
					"text": " boys",
					"timestamps": {
						"from": "00:00:54,020",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 54020,
						"to": 56000
					},
					"id": 6510,
					"p": 0.961095,
					"t_dtw": -1
				},
				{
					"text": "<|endoftext|>",
					"timestamps": {
						"from": "00:00:56,000",
						"to": "00:00:56,000"
					},
					"offsets": {
						"from": 56000,
						"to": 56000
					},
					"id": 50256,
					"p": 0.167677,
					"t_dtw": -1
				}
			]
		}
	]
}

So I was wrong about the start timestamp and OpenAI's whisper does the same thing in this respect.

But if someone uses -ojf the produced timestamps would perhaps not be useful due to the first text timestamp starting too early, and also the repeating timestamps later.

@ggerganov
Member

Yes, the output should not contain token-level timestamps when they are not enabled. We have some warnings in the whisper.h header:

// token-level timestamp data
// do not use if you haven't computed token-level timestamps
int64_t t0; // start time of the token
int64_t t1; // end time of the token
// [EXPERIMENTAL] Token-level timestamps with DTW
// do not use if you haven't computed token-level timestamps with dtw
// Roughly corresponds to the moment in audio in which the token was output
int64_t t_dtw;

But it would be better to prevent incorrect usage in the examples too.

Ideally, the DTW algorithm should be fixed so we can provide token-level timestamps, but I'm not sure how difficult this would be. Anyway, not a very big priority atm.

@ggerganov
Member

But if someone uses -ojf the produced timestamps would perhaps not be useful due to the first text timestamp starting too early, and also the repeating timestamps later.

I just noticed that the -ojf flag enables token-level timestamps:

wparams.token_timestamps = params.output_wts || params.output_jsn_full || params.max_len > 0;
wparams.thold_pt = params.word_thold;

So I had forgotten that we have 2 different token-level timestamp algorithms:

  • The simple heuristic approach that is enabled by this parameter:
    // [EXPERIMENTAL] token-level timestamps
    bool token_timestamps; // enable token-level timestamps
    float thold_pt; // timestamp token probability threshold (~0.01)
  • The more advanced DTW approach (which should correspond to the OAI implementation):
    // [EXPERIMENTAL] Token-level timestamps with DTW
    bool dtw_token_timestamps;
    enum whisper_alignment_heads_preset dtw_aheads_preset;
    int dtw_n_top;
    struct whisper_aheads dtw_aheads;
    size_t dtw_mem_size; // TODO: remove
    };

The second one is probably too difficult to fix, so no need to look into it.

But the first one should give OK results. So in that sense, your observation about the timestamp repetitions is correct and this does look like a bug.

@hlevring
Copy link

hlevring commented Jun 8, 2025

Maybe not the right place to mention this, but I was looking for issues that reference dtw. It would definitely be great to have -dtw fixed, and with the new VAD I would suspect that timestamps may get even more precise?

@danbev danbev changed the title from "whisper : add support for for skipped audio" to "whisper : add support for skipped audio" Jun 9, 2025
@danbev danbev force-pushed the token-timestamp-intro-issue branch 2 times, most recently from f3ee700 to c34ee30 Compare June 11, 2025 11:54
This commit addresses an issue with token timestamps when audio segments
are skipped, in `whisper_exp_compute_token_level_timestamps` related to
the VAD processing and the energy levels.

The motivation for this is that the token timestamps exceed the energy
array bounds due to segment timing misalignment:
```console
                  (skipped introduction)
                    ↓
Audio segment:     [2600ms → 5600ms]  (3 seconds of actual audio)
Energy array:      [0 → 480652]       (samples for 3 seconds)
Token timestamps:  [3266ms → 3408ms]  (absolute timestamps)
```
So both `s0` and `t1` get clamped to the maximum sample index (480652)
which causes the start/end timestamps to be the same for all the tokens
after a certain point.

This is addressed by using segment-relative timestamps in
`timestamp_to_sample` and `sample_to_timestamp`.

Resolves: ggml-org#3207
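
A minimal sketch of the segment-relative conversion described above, not the code merged in this PR. The helper names `ts_to_sample_rel` and `sample_to_ts_rel`, the `seg_t0` parameter, and the `kSampleRate` constant are hypothetical and used only for illustration. The sketch assumes timestamps are expressed in units of 10 ms and that audio is sampled at 16 kHz.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: 16 kHz audio, timestamps in units of 10 ms.
constexpr int kSampleRate = 16000;

// Hypothetical helper: map an absolute timestamp `t` to an index into the
// energy array of a segment that starts at absolute timestamp `seg_t0`.
// The result is clamped to [0, n_samples - 1].
static int ts_to_sample_rel(int64_t t, int64_t seg_t0, int n_samples) {
    const int64_t s = ((t - seg_t0) * kSampleRate) / 100;
    return (int) std::max<int64_t>(0, std::min<int64_t>(n_samples - 1, s));
}

// Hypothetical helper: map a segment-local sample index back to an
// absolute timestamp by adding the segment start.
static int64_t sample_to_ts_rel(int i_sample, int64_t seg_t0) {
    return seg_t0 + (100ll * i_sample) / kSampleRate;
}
```

With a 30 second energy array (480000 samples) and a segment starting at 2600, passing `seg_t0 = 0` (the absolute-timestamp behaviour) maps both 3266 and 3408 to the last sample index, which is the repeated-timestamp symptom. Passing `seg_t0 = 2600` maps them to distinct indices inside the array, and `sample_to_ts_rel` recovers the absolute timestamp.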
@danbev danbev force-pushed the token-timestamp-intro-issue branch from c34ee30 to deb1c0d Compare June 13, 2025 14:04
@danbev danbev changed the title from "whisper : add support for skipped audio" to "whisper : fix VAD processing for skipped audio segments" Jun 13, 2025
@danbev
Member Author

danbev commented Jun 13, 2025

It would definitely be great to have -dtw fixed, and with the new VAD I would suspect that timestamps may get even more precise?

@ggerganov Is this something I could/should take a look at?

@ggerganov
Member

@danbev Yes, but I don't have a sense of the amount of effort needed to fix it, as I never familiarized myself deeply with the initial implementation. It might be something simple, but it also might be quite difficult. AFAIR it required extracting intermediate data from the encoder attention and the original implementation was a bit hacky in this regard.

Feel free to take a look, but if it turns out to be something too complicated, then no need to spend effort on it.

@danbev danbev merged commit 705db0f into ggml-org:master Jun 13, 2025
54 checks passed
3 participants