Skip to content

Actions: Improve Bash parsing performance on command and string interpolations #19701

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

adityasharad
Copy link
Collaborator

See linked internal issues for details. This set of changes should improve the performance of the Actions Bash script parsing and make it slightly more readable.

The bottlenecks:

  • In the Bash parser, we compute a mostly-unique ID for each command substitution and each quoted string within a shell script block.
  • Commands and quoted strings are then ranked and referred to individually by their rank. This allows us to reference command/string expressions individually within a shell script block, and so compute an ordered set of Bash statements in the block.
  • However, ranking by their original text ends up evaluating multiple possible orderings, which is slow on workflows that contain multiple complex command substitutions. Customers provided examples of repeated commands containing JSON payloads. For command substitutions this didn't affect performance too badly, but the quoted string computation blew up the size of the database string pool and led to out-of-disk errors during query evaluation.

Key changes:

  • Refactor the logic around identifying Bash command substitutions, to use more intuitive predicate and variable names.
  • Change the logic around ranking Bash commands. Rank commands by their ID, not by their source text. I think this was the original intent of the code.
  • Refactor the logic around identifying Bash quoted strings, to use more intuitive predicate and variable names.
  • Change the logic around ranking Bash quoted strings. Rank quoted strings by their ID, not by their source text. I think this was the original intent of the code.
  • Add a test case (with no expected results) similar to the customer reports, to stress-test this in future.

Not changed:

  • I have not attempted a more thorough rewrite, although I think we could greatly simplify this by using datatypes for the parsed commands and strings, rather than constructing string IDs for them.

Tagging @alexet @asgerf for having provided advice in this area, and @cklin to consider if we want this in the upcoming release.

Extract helper predicates for `$(...)` command interpolation
and backtick-quoted commands.
Add some doc comments and meaningful variable names.
In the Bash parser, we compute a mostly-unique ID for each
command substitution within a shell script block.
Commands are then ranked and referred to individually.

Avoid a performance bottleneck by ranking commands by their
ID, not by their source text.
I think this was the original intent of the code.
Ranking by their original text ends up evaluating multiple
possible orderings, which is slow on workflows that contain
multiple complex command substitutions.
Add some doc comments and meaningful variable names.
In the Bash parser, we compute a mostly-unique ID for each
quoted string within a shell script block.
Quoted strings are then ranked and referred to individually.

Avoid a performance bottleneck by ranking quoted strings by their
ID, not by their source text.
I think this was the original intent of the code.

Ranking by their original text ends up evaluating multiple
possible orderings, which is slow on workflows that contain
multiple complex quoted strings, such as JSON payloads.
Anonymised version of a customer report that led to
performance bottlenecks in Bash parsing.
No results are expected from both query and library tests.
@Copilot Copilot AI review requested due to automatic review settings June 9, 2025 16:40
@adityasharad adityasharad requested a review from a team as a code owner June 9, 2025 16:40
@github-actions github-actions bot added the Actions Analysis of GitHub Actions label Jun 9, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the Bash parser in CodeQL to improve performance by using unique IDs for command substitutions and quoted strings, and adds a stress-test workflow.

  • Refactor command substitution predicates for clearer naming and ID-based ranking
  • Refactor quoted string predicates similarly and rank by ID
  • Add a new workflow test (interpolation.yml) to exercise complex interpolations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
actions/ql/test/query-tests/Security/CWE-094/.github/workflows/interpolation.yml New workflow to stress-test Bash interpolation parsing
actions/ql/lib/codeql/actions/Bash.qll Refactored predicates and ranking logic for commands and strings
Comments suppressed due to low confidence (3)

actions/ql/lib/codeql/actions/Bash.qll:38

  • [nitpick] The TODO comment indicates trimming could be folded into the regex. Consider updating the regex to capture only the inner command or removing the extra trim step to simplify maintenance.
.regexpReplaceAll("^\\$\\(", "")

actions/ql/lib/codeql/actions/Bash.qll:140

  • [nitpick] The parameter name i is ambiguous here. Renaming it to something like rankIndex would make its role clearer in the ranking logic.
private predicate rankedQuotedStringReplacements(int i, string quotedString, string quotedStringId) {

actions/ql/lib/codeql/actions/Bash.qll:104

  • [nitpick] This complex regex is used directly in the predicate. Extracting it into a well-named regex constant would improve readability and reuse.
quotedStr = this.cmdSubstitutedLineProducer(lineIndex).regexpFind("\"((?:[^\"\\]|\\.)*)\"", occurrenceIndex, occurrenceOffset) and

@adityasharad adityasharad force-pushed the actions/bash-parsing-ranking-performance branch from 3ec0624 to e48a7da Compare June 9, 2025 16:56
@adityasharad
Copy link
Collaborator Author

DCA results look fine. No changes to alerts, and timings are comparable. Note the sample size is small (just 5).

Copy link
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adityasharad adityasharad merged commit d659d40 into github:main Jun 10, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Actions Analysis of GitHub Actions documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants