Skip to content

Commit 80a5cf6

Browse files
committed
Improve contrib/pg_trgm's heuristics for regexp index searches.
When extracting trigrams from a regular expression for search of a GIN or GIST trigram index, it's useful to penalize (preferentially discard) trigrams that contain whitespace, since those are typically far more common in the index than trigrams not containing whitespace. Of course, this should only be a preference not a hard rule, since we might otherwise end up with no trigrams to search for. The previous coding tended to produce fairly inefficient trigram search sets for anchored regexp patterns, as reported by Erik Rijkers. This patch penalizes whitespace-containing trigrams, and also reduces the target number of extracted trigrams, since experience suggests that the original coding tended to select too many trigrams to search for. Alexander Korotkov, reviewed by Tom Lane
1 parent 5d8117e commit 80a5cf6

File tree

1 file changed

+76
-28
lines changed

1 file changed

+76
-28
lines changed

contrib/pg_trgm/trgm_regexp.c

+76-28
Original file line numberDiff line numberDiff line change
@@ -122,9 +122,23 @@
122122
* thousands of trigrams would be slow, and would likely produce so many
123123
* false positives that we would have to traverse a large fraction of the
124124
* index, the graph is simplified further in a lossy fashion by removing
125-
* color trigrams until the number of trigrams after expansion is below
126-
* the MAX_TRGM_COUNT threshold. When a color trigram is removed, the states
127-
* connected by any arcs labelled with that trigram are merged.
125+
* color trigrams. When a color trigram is removed, the states connected by
126+
* any arcs labelled with that trigram are merged.
127+
*
128+
* Trigrams do not all have equivalent value for searching: some of them are
129+
* more frequent and some of them are less frequent. Ideally, we would like
130+
* to know the distribution of trigrams, but we don't. But because of padding
131+
* we know for sure that the empty character is more frequent than others,
132+
* so we can penalize trigrams according to presence of whitespace. The
133+
* penalty assigned to each color trigram is the number of simple trigrams
134+
* it would produce, times the penalties[] multiplier associated with its
135+
* whitespace content. (The penalties[] constants were calculated by analysis
136+
* of some real-life text.) We eliminate color trigrams starting with the
137+
* highest-penalty one, until we get to a total penalty of no more than
138+
* WISH_TRGM_PENALTY. However, we cannot remove a color trigram if that would
139+
* lead to merging the initial and final states, so we may not be able to
140+
* reach WISH_TRGM_PENALTY. It's still okay so long as we have no more than
141+
* MAX_TRGM_COUNT simple trigrams in total, otherwise we fail.
128142
*
129143
* 4) Pack the graph into a compact representation
130144
* -----------------------------------------------
@@ -199,13 +213,30 @@
199213
* MAX_EXPANDED_STATES - How many states we allow in expanded graph
200214
* MAX_EXPANDED_ARCS - How many arcs we allow in expanded graph
201215
* MAX_TRGM_COUNT - How many simple trigrams we allow to be extracted
216+
* WISH_TRGM_PENALTY - Maximum desired sum of color trigram penalties
202217
* COLOR_COUNT_LIMIT - Maximum number of characters per color
203218
*/
204219
#define MAX_EXPANDED_STATES 128
205220
#define MAX_EXPANDED_ARCS 1024
206221
#define MAX_TRGM_COUNT 256
222+
#define WISH_TRGM_PENALTY 16
207223
#define COLOR_COUNT_LIMIT 256
208224

225+
/*
226+
* Penalty multipliers for trigram counts depending on whitespace contents.
227+
* Numbers based on analysis of real-life texts.
228+
*/
229+
const float4 penalties[8] = {
230+
1.0, /* "aaa" */
231+
3.5, /* "aa " */
232+
0.0, /* "a a" (impossible) */
233+
0.0, /* "a " (impossible) */
234+
4.2, /* " aa" */
235+
2.1, /* " a " */
236+
25.0, /* " a" */
237+
0.0 /* " " (impossible) */
238+
};
239+
209240
/* Struct representing a single pg_wchar, converted back to multibyte form */
210241
typedef struct
211242
{
@@ -339,6 +370,7 @@ typedef struct
339370
ColorTrgm ctrgm;
340371
int number;
341372
int count;
373+
float4 penalty;
342374
bool expanded;
343375
List *arcs;
344376
} ColorTrgmInfo;
@@ -459,7 +491,7 @@ static TRGM *expandColorTrigrams(TrgmNFA *trgmNFA, MemoryContext rcontext);
459491
static void fillTrgm(trgm *ptrgm, trgm_mb_char s[3]);
460492
static void mergeStates(TrgmState *state1, TrgmState *state2);
461493
static int colorTrgmInfoCmp(const void *p1, const void *p2);
462-
static int colorTrgmInfoCountCmp(const void *p1, const void *p2);
494+
static int colorTrgmInfoPenaltyCmp(const void *p1, const void *p2);
463495
static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
464496
static int packArcInfoCmp(const void *a1, const void *a2);
465497

@@ -1424,6 +1456,7 @@ selectColorTrigrams(TrgmNFA *trgmNFA)
14241456
TrgmState *state;
14251457
ColorTrgmInfo *colorTrgms;
14261458
int64 totalTrgmCount;
1459+
float4 totalTrgmPenalty;
14271460
int number;
14281461

14291462
/* Collect color trigrams from all arcs */
@@ -1482,53 +1515,67 @@ selectColorTrigrams(TrgmNFA *trgmNFA)
14821515
}
14831516

14841517
/*
1485-
* Count number of simple trigrams generated by each color trigram.
1518+
* Count number of simple trigrams generated by each color trigram, and
1519+
* also compute a penalty value, which is the number of simple trigrams
1520+
* times a multiplier that depends on its whitespace content.
14861521
*
14871522
* Note: per-color-trigram counts cannot overflow an int so long as
14881523
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
14891524
* 1290. However, the grand total totalTrgmCount might conceivably
1490-
* overflow an int, so we use int64 for that within this routine.
1525+
* overflow an int, so we use int64 for that within this routine. Also,
1526+
* penalties are calculated in float4 arithmetic to avoid any overflow
1527+
* worries.
14911528
*/
14921529
totalTrgmCount = 0;
1530+
totalTrgmPenalty = 0.0f;
14931531
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
14941532
{
14951533
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
14961534
int j,
1497-
count = 1;
1535+
count = 1,
1536+
typeIndex = 0;
14981537

14991538
for (j = 0; j < 3; j++)
15001539
{
15011540
TrgmColor c = trgmInfo->ctrgm.colors[j];
15021541

1503-
if (c != COLOR_BLANK)
1542+
typeIndex *= 2;
1543+
if (c == COLOR_BLANK)
1544+
typeIndex++;
1545+
else
15041546
count *= trgmNFA->colorInfo[c].wordCharsCount;
15051547
}
15061548
trgmInfo->count = count;
15071549
totalTrgmCount += count;
1550+
trgmInfo->penalty = penalties[typeIndex] * (float4) count;
1551+
totalTrgmPenalty += trgmInfo->penalty;
15081552
}
15091553

1510-
/* Sort color trigrams in descending order of simple trigram counts */
1554+
/* Sort color trigrams in descending order of their penalties */
15111555
qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
1512-
colorTrgmInfoCountCmp);
1556+
colorTrgmInfoPenaltyCmp);
15131557

15141558
/*
1515-
* Remove color trigrams from the graph so long as total number of simple
1516-
* trigrams exceeds MAX_TRGM_COUNT. We prefer to remove color trigrams
1517-
* with the most associated simple trigrams, since those are the most
1518-
* promising for reducing the total number of simple trigrams. When
1519-
* removing a color trigram we have to merge states connected by arcs
1520-
* labeled with that trigram. It's necessary to not merge initial and
1521-
* final states, because our graph becomes useless if that happens; so we
1522-
* cannot always remove the trigram we'd prefer to.
1559+
* Remove color trigrams from the graph so long as total penalty of color
1560+
* trigrams exceeds WISH_TRGM_PENALTY. (If we fail to get down to
1561+
* WISH_TRGM_PENALTY, it's OK so long as total count is no more than
1562+
* MAX_TRGM_COUNT.) We prefer to remove color trigrams with higher
1563+
* penalty, since those are the most promising for reducing the total
1564+
* penalty. When removing a color trigram we have to merge states
1565+
* connected by arcs labeled with that trigram. It's necessary to not
1566+
* merge initial and final states, because our graph becomes useless if
1567+
* that happens; so we cannot always remove the trigram we'd prefer to.
15231568
*/
1524-
for (i = 0;
1525-
(i < trgmNFA->colorTrgmsCount) && (totalTrgmCount > MAX_TRGM_COUNT);
1526-
i++)
1569+
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
15271570
{
15281571
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
15291572
bool canRemove = true;
15301573
ListCell *cell;
15311574

1575+
/* Done if we've reached the target */
1576+
if (totalTrgmPenalty <= WISH_TRGM_PENALTY)
1577+
break;
1578+
15321579
/*
15331580
* Does any arc of this color trigram connect initial and final
15341581
* states? If so we can't remove it.
@@ -1570,9 +1617,10 @@ selectColorTrigrams(TrgmNFA *trgmNFA)
15701617
mergeStates(source, target);
15711618
}
15721619

1573-
/* Mark trigram unexpanded, and update totalTrgmCount */
1620+
/* Mark trigram unexpanded, and update totals */
15741621
trgmInfo->expanded = false;
15751622
totalTrgmCount -= trgmInfo->count;
1623+
totalTrgmPenalty -= trgmInfo->penalty;
15761624
}
15771625

15781626
/* Did we succeed in fitting into MAX_TRGM_COUNT? */
@@ -1746,17 +1794,17 @@ colorTrgmInfoCmp(const void *p1, const void *p2)
17461794

17471795
/*
17481796
* Compare function for sorting color trigrams in descending order of
1749-
* their simple trigrams counts.
1797+
* their penalty fields.
17501798
*/
17511799
static int
1752-
colorTrgmInfoCountCmp(const void *p1, const void *p2)
1800+
colorTrgmInfoPenaltyCmp(const void *p1, const void *p2)
17531801
{
1754-
const ColorTrgmInfo *c1 = (const ColorTrgmInfo *) p1;
1755-
const ColorTrgmInfo *c2 = (const ColorTrgmInfo *) p2;
1802+
float4 penalty1 = ((const ColorTrgmInfo *) p1)->penalty;
1803+
float4 penalty2 = ((const ColorTrgmInfo *) p2)->penalty;
17561804

1757-
if (c1->count < c2->count)
1805+
if (penalty1 < penalty2)
17581806
return 1;
1759-
else if (c1->count == c2->count)
1807+
else if (penalty1 == penalty2)
17601808
return 0;
17611809
else
17621810
return -1;

0 commit comments

Comments
 (0)