Skip to content

Commit 31f1b20

Browse files
committed
Fix logical errors in tsquery selectivity estimation for prefix queries.
I made multiple errors in commit 97532f7, stemming mostly from failure to think about the available frequency data as being element frequencies not value frequencies (so that occurrences of different elements are not mutually exclusive). This led to sillinesses such as estimating that "word" would match more rows than "word:*". The choice to clamp to a minimum estimate of DEFAULT_TS_MATCH_SEL also seems pretty ill-considered in hindsight, as it would frequently result in an estimate much larger than the available data suggests. We do need some sort of clamp, since a pattern not matching any of the MCELEMs probably still needs a selectivity estimate of more than zero. I chose instead to clamp to at least what a non-MCELEM word would be estimated as, preserving the property that "word:*" doesn't get an estimate less than plain "word", whether or not the word appears in MCELEM. Per investigation of a gripe from Bill Martin, though I suspect that his example case actually isn't even reaching the erroneous code. Back-patch to 9.1 where this code was introduced.
1 parent 4c25df5 commit 31f1b20

File tree

1 file changed

+28
-16
lines changed

1 file changed

+28
-16
lines changed

src/backend/tsearch/ts_selfuncs.c

Lines changed: 28 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -303,23 +303,29 @@ tsquery_opr_selec(QueryItem *item, char *operand,
303303
{
304304
/* Prefix match, ie the query item is lexeme:* */
305305
Selectivity matched,
306-
allmcvs;
307-
int i;
306+
allmces;
307+
int i,
308+
n_matched;
308309

309310
/*
310-
* Our strategy is to scan through the MCV list and add up the
311-
* frequencies of the ones that match the prefix, thereby assuming
312-
* that the MCVs are representative of the whole lexeme population
313-
* in this respect. Compare histogram_selectivity().
311+
* Our strategy is to scan through the MCELEM list and combine the
312+
* frequencies of the ones that match the prefix. We then
313+
* extrapolate the fraction of matching MCELEMs to the remaining
314+
* rows, assuming that the MCELEMs are representative of the whole
315+
* lexeme population in this respect. (Compare
316+
* histogram_selectivity().) Note that these are most common
317+
* elements not most common values, so they're not mutually
318+
* exclusive. We treat occurrences as independent events.
314319
*
315320
* This is only a good plan if we have a pretty fair number of
316-
* MCVs available; we set the threshold at 100. If no stats or
321+
* MCELEMs available; we set the threshold at 100. If no stats or
317322
* insufficient stats, arbitrarily use DEFAULT_TS_MATCH_SEL*4.
318323
*/
319324
if (lookup == NULL || length < 100)
320325
return (Selectivity) (DEFAULT_TS_MATCH_SEL * 4);
321326

322-
matched = allmcvs = 0;
327+
matched = allmces = 0;
328+
n_matched = 0;
323329
for (i = 0; i < length; i++)
324330
{
325331
TextFreq *t = lookup + i;
@@ -328,20 +334,26 @@ tsquery_opr_selec(QueryItem *item, char *operand,
328334
if (tlen >= key.length &&
329335
strncmp(key.lexeme, VARDATA_ANY(t->element),
330336
key.length) == 0)
331-
matched += t->frequency;
332-
allmcvs += t->frequency;
337+
{
338+
matched += t->frequency - matched * t->frequency;
339+
n_matched++;
340+
}
341+
allmces += t->frequency - allmces * t->frequency;
333342
}
334343

335-
if (allmcvs > 0) /* paranoia about zero divide */
336-
selec = matched / allmcvs;
337-
else
338-
selec = (Selectivity) (DEFAULT_TS_MATCH_SEL * 4);
344+
/* Clamp to ensure sanity in the face of roundoff error */
345+
CLAMP_PROBABILITY(matched);
346+
CLAMP_PROBABILITY(allmces);
347+
348+
selec = matched + (1.0 - allmces) * ((double) n_matched / length);
339349

340350
/*
341351
* In any case, never believe that a prefix match has selectivity
342-
* less than DEFAULT_TS_MATCH_SEL.
352+
* less than we would assign for a non-MCELEM lexeme. This
353+
* preserves the property that "word:*" should be estimated to
354+
* match at least as many rows as "word" would be.
343355
*/
344-
selec = Max(DEFAULT_TS_MATCH_SEL, selec);
356+
selec = Max(Min(DEFAULT_TS_MATCH_SEL, minfreq / 2), selec);
345357
}
346358
else
347359
{

0 commit comments

Comments
 (0)