Commit 3838fa2

Build de-escaped JSON strings in larger chunks during lexing

During COPY BINARY with large JSONB blobs, it was found that half the time was spent parsing JSON, with much of that spent in separate appendStringInfoChar() calls for each input byte.

Add lookahead loop to json_lex_string() to allow batching multiple bytes via appendBinaryStringInfo(). Also use this same logic when de-escaping is not done, to avoid code duplication.

Report and proof of concept patch by Jelte Fennema, reworked by Andres Freund and John Naylor

Discussion: https://www.postgresql.org/message-id/CAGECzQQuXbies_nKgSiYifZUjBk6nOf2%3DTSXqRjj2BhUh8CTeA%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/PR3PR83MB0476F098CBCF68AF7A1CA89FF7B49@PR3PR83MB0476.EURPRD83.prod.outlook.com

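
To illustrate why batching helps, here is a minimal sketch contrasting a per-byte append with a batched append. It does not use PostgreSQL's StringInfo; the `Buf` type and the `buf_init`, `buf_reserve`, `buf_append_char`, and `buf_append_bytes` helpers are hypothetical stand-ins written for this example only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal growable buffer; NOT PostgreSQL's StringInfo. */
typedef struct
{
	char	   *data;
	size_t		len;
	size_t		cap;
} Buf;

static void
buf_init(Buf *b)
{
	b->cap = 16;
	b->len = 0;
	b->data = malloc(b->cap);
	if (b->data == NULL)
		abort();
	b->data[0] = '\0';
}

/* Make room for "extra" more bytes plus a trailing NUL. */
static void
buf_reserve(Buf *b, size_t extra)
{
	if (b->len + extra + 1 <= b->cap)
		return;
	while (b->cap < b->len + extra + 1)
		b->cap *= 2;
	b->data = realloc(b->data, b->cap);
	if (b->data == NULL)
		abort();
}

/* Per-byte append: one call and one capacity check per input byte. */
static void
buf_append_char(Buf *b, char c)
{
	buf_reserve(b, 1);
	b->data[b->len++] = c;
	b->data[b->len] = '\0';
}

/* Batched append: one call, one capacity check, one memcpy per run. */
static void
buf_append_bytes(Buf *b, const char *src, size_t n)
{
	buf_reserve(b, n);
	memcpy(b->data + b->len, src, n);
	b->len += n;
	b->data[b->len] = '\0';
}
```

Both paths produce the same buffer contents; the batched form simply pays the call and bookkeeping overhead once per run of ordinary bytes instead of once per byte, which is the effect the commit message describes.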
1 parent a6434b9 commit 3838fa2

1 file changed: +39 -19 lines changed

src/common/jsonapi.c

Lines changed: 39 additions & 19 deletions
@@ -686,15 +686,6 @@ json_lex_string(JsonLexContext *lex)
 			lex->token_terminator = s;
 			return JSON_INVALID_TOKEN;
 		}
-		else if (*s == '"')
-			break;
-		else if ((unsigned char) *s < 32)
-		{
-			/* Per RFC4627, these characters MUST be escaped. */
-			/* Since *s isn't printable, exclude it from the context string */
-			lex->token_terminator = s;
-			return JSON_ESCAPING_REQUIRED;
-		}
 		else if (*s == '\\')
 		{
 			/* OK, we have an escape character. */
@@ -849,22 +840,51 @@ json_lex_string(JsonLexContext *lex)
 				return JSON_ESCAPING_INVALID;
 			}
 		}
-		else if (lex->strval != NULL)
+		else
 		{
+			char *p;
+
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
-		}
-	}
+			/*
+			 * Skip to the first byte that requires special handling, so we
+			 * can batch calls to appendBinaryStringInfo.
+			 */
+			for (p = s; p < end; p++)
+			{
+				if (*p == '\\' || *p == '"')
+					break;
+				else if ((unsigned char) *p < 32)
+				{
+					/* Per RFC4627, these characters MUST be escaped. */
+					/*
+					 * Since *p isn't printable, exclude it from the context
+					 * string
+					 */
+					lex->token_terminator = p;
+					return JSON_ESCAPING_REQUIRED;
+				}
+			}
 
-	if (hi_surrogate != -1)
-		return JSON_UNICODE_LOW_SURROGATE;
+			if (lex->strval != NULL)
+				appendBinaryStringInfo(lex->strval, s, p - s);
 
-	/* Hooray, we found the end of the string! */
-	lex->prev_token_terminator = lex->token_terminator;
-	lex->token_terminator = s + 1;
-	return JSON_SUCCESS;
+			if (*p == '"')
+			{
+				/* Hooray, we found the end of the string! */
+				lex->prev_token_terminator = lex->token_terminator;
+				lex->token_terminator = p + 1;
+				return JSON_SUCCESS;
+			}
+
+			/*
+			 * s will be incremented at the top of the loop, so set it to just
+			 * behind our lookahead position
+			 */
+			s = p - 1;
+		}
+	}
 }
 
 /*
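
For readers who want to try the lookahead idea outside the lexer, the scan can be sketched as a standalone function. This is an illustration under assumptions, not the committed code: the `RunStop` enum and `scan_plain_run` function are hypothetical names, and the sketch checks `p` against `end` before dereferencing it so that it stays safe on input that is not NUL-terminated.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; not PostgreSQL's error enum. */
typedef enum
{
	RUN_END_QUOTE,				/* stopped at the closing '"' */
	RUN_BACKSLASH,				/* stopped at a backslash (escape follows) */
	RUN_CONTROL_CHAR,			/* stopped at an unescaped byte below 32 */
	RUN_END_OF_INPUT			/* reached end without a special byte */
} RunStop;

/*
 * Scan forward from s to the first byte that needs special handling.
 * On return, *run_len is the number of ordinary bytes starting at s,
 * which a caller could append to its output buffer in a single call.
 */
static RunStop
scan_plain_run(const char *s, const char *end, size_t *run_len)
{
	const char *p;

	for (p = s; p < end; p++)
	{
		if (*p == '"' || *p == '\\' || (unsigned char) *p < 32)
			break;
	}

	*run_len = (size_t) (p - s);

	/* Assumption of this sketch: never dereference p when p == end. */
	if (p >= end)
		return RUN_END_OF_INPUT;
	if (*p == '"')
		return RUN_END_QUOTE;
	if (*p == '\\')
		return RUN_BACKSLASH;
	return RUN_CONTROL_CHAR;
}
```

A caller would append `run_len` bytes starting at `s` in one operation, then branch on the returned code, mirroring how the hunk above appends `p - s` bytes and then either finishes the string or resumes the outer loop at `p`.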
