Skip to content

Bulk MAD generator: Support databases from DCA runs #19627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented May 29, 2025

This PR generalizes the bulk MAD generator script that was added in #19499. In particular, this PR:

  • Moves the locations of the generated models for C++ so that it aligns with where Rust stores them. This should have no impact on C/C++ analysis.
  • Moves the Rust specific parts over to a JSON file and puts that in the rust folder.
  • It adds a "Download databases from DCA" feature to the script so that the script can be pointed to a DCA run and generate models based on those databases.
  • Finally, it adds C++ support for bulk building by adding a C++ configuration file.

After these changes Rust can invoke the bulk builder script as:

python3 misc/scripts/models-as-data/bulk_generate_mad.py --lang rust --with-summaries --with-sources --with-sinks --config rust/misc/bulk_generation_targets.json

and C++ can invoke the script as:

python3 misc/scripts/models-as-data/bulk_generate_mad.py --dca NAME_OF_DCA_RUN --pat PAT --lang cpp --with-summaries --config cpp/misc/bulk_generation_targets.json

The personal access token (PAT) required for the C++ run is necessary to download the databases from DCA. It is the same PAT as the one you use for DCA.

(Note: The name fields in the C++ configuration JSON file requires naming the relevant DCA projects in the DCA source suite. I'll do that as a follow-up when I create a suite specifically for MaD generation)

@paldepind would you mind taking a look at this?

Commit-by-commit review recommended.

@MathiasVP MathiasVP requested review from paldepind and Copilot May 29, 2025 16:54
@MathiasVP MathiasVP requested review from a team as code owners May 29, 2025 16:54
@github-actions github-actions bot added C++ Rust Pull requests that update Rust code labels May 29, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Generalizes the bulk MaD generator by externalizing per-language targets into JSON files, enabling DCA-based database downloads, and adding C++ support.

  • Extract Rust project list into a JSON config and remove the old Rust-only script.
  • Introduce a C++ config file that uses DCA strategy for database downloads.
  • Update the bulk generator to read JSON configs and handle both Rust and C++.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

File Description
rust/misc/bulk_generation_targets.json New JSON config listing Rust crates for bulk MaD generation
misc/scripts/models-as-data/rust_bulk_generate_mad.py Removed legacy Rust-specific bulk generation script
cpp/misc/bulk_generation_targets.json New JSON config for C++ bulk MaD generation using DCA
Comments suppressed due to low confidence (3)

rust/misc/bulk_generation_targets.json:22

  • The git_tag values use inconsistent 'v' prefixes. Standardize tag formats to match the repository’s official tag naming or remove the 'v' prefix for consistency.
"git_tag": "v1.21.3"

cpp/misc/bulk_generation_targets.json:2

  • [nitpick] Indentation in this JSON file uses two spaces but the Rust config uses four; consider aligning formatting across config files for consistency.
  "strategy": "dca",

cpp/misc/bulk_generation_targets.json:8

  • [nitpick] Consider adding an 'extractor_options' field (similar to the Rust config) if any custom CodeQL extractor options are required for C++ analysis.
}

@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label May 29, 2025
@MathiasVP MathiasVP force-pushed the generalize-bulk-generation branch from a9067a6 to cb93870 Compare May 29, 2025 17:14
Copy link
Contributor

@paldepind paldepind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work! 😄

A few high-level remaks:

  • I think it would be very useful if the strategy was not specified for the whole list of projects, but could be changed on a per-project basis.

    We could add a "dca": true field to DCA projects, or simply say that those without a git_repo are taken to mean DCA projects. Then we'd split the list of projects in two and run build_databases_from_projects one list and download_dca_databases on the other.

  • I've been formatting the script with the Black Python formatter. I think it would be great to keep doing that going forward, as syntactical inconsistencies otherwise really easily sneaks in. For instance in download_dca_databases the indentation shifts from begin 4 spaces to 2 spaces at one point.

]
return database_results

def github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a slightly more descriptive name?

Suggested change
def github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict:
def get_json_from_github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Fixed in 566bf43

database_results = download_dca_databases(experiment_name, pat, projects)

# Phase 3: Generate models for all projects
print("\n=== Phase 3: Generating models ===")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this "Phase 3" doesn't really make sense anymore as it depends on the strategy.

Suggested change
print("\n=== Phase 3: Generating models ===")
print("\n=== Generating models ===")

Same for the two other "Phase"s being printed above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b640474

parser.add_argument("--lang", type=str, help="The language to generate models for", required=True)
parser.add_argument("--with-sources", action="store_true", help="Generate sources", required=False)
parser.add_argument("--with-sinks", action="store_true", help="Generate sinks", required=False)
parser.add_argument("--with-summaries", action="store_true", help="Generate sinks", required=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great if running the script is as simple as possible and if the output is fully determined by the data in the JSON file.

Would it work for your use case to move this into the configuration file? Perhaps with language on the root-level and the --with-* options on a per-project basis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. I've fixed the --with-* parts in fc165db and the language part in 1228080

# A project to generate models for
class Project(TypedDict):
"""
Type definition for projects (acquired via a GitHub repo) to model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use this type for all project, including those from DCA. I think we'd just have to make git_repo into NotRequired[str].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Fixed in 7121f5c

# Sort the database results based on the order in the projects file
return sorted(database_results, key=cmp_to_key(compare))

def get_destination_for_project(config, name: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "destination" by itself is a bit ambiguous.

Suggested change
def get_destination_for_project(config, name: str) -> str:
def get_mad_destination_for_project(config, name: str) -> str:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c89d6d

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, help="Path to the configuration file.", required=True)
parser.add_argument("--dca", type=str, help="Name of a DCA run that built all the projects", required=False)
parser.add_argument("--pat", type=str, help="PAT token to grab DCA databases (the same as the one you use for DCA)", required=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a path to a PAT instead of the PAT itself? I don't think it's super nice to paste tokens on the command line (not that having it in a plaintext file is much better in terms of security).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, that's an excellent point 😅 Fixed in 7c2612a

return a_index - b_index

# Sort the database results based on the order in the projects file
return sorted(database_results, key=cmp_to_key(compare))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making database_results a map and picking the results out of the map based on projects. Similar to what's done in clone_projects:

    project_dirs = [project_dirs_map[project["name"]] for project in projects]

That would avoid the gnarly compare function.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented May 30, 2025

Thanks, Simon!

I've been formatting the script with the Black Python formatter. I think it would be great to keep doing that going forward, as syntactical inconsistencies otherwise really easily sneaks in. For instance in download_dca_databases the indentation shifts from begin 4 spaces to 2 spaces at one point.

I've done this in 7ecf8c8. I'll work on the other requests and keep it formatted as I push changes.

@MathiasVP MathiasVP force-pushed the generalize-bulk-generation branch from d480b8b to 1228080 Compare May 30, 2025 11:40
@MathiasVP
Copy link
Contributor Author

I think I've addressed all of your comments now, @paldepind. Thanks a lot for the detailed feedback! ❤️

I think it would be very useful if the strategy was not specified for the whole list of projects, but could be changed on a per-project basis.

We could add a "dca": true field to DCA projects, or simply say that those without a git_repo are taken to mean DCA projects. Then we'd split the list of projects in two and run build_databases_from_projects one list and download_dca_databases on the other.

I think that's a really good idea. I'm happy to do this, but because I'm going on vacation next week (and half of the week following that) I'd like to delay that change until I'm back so that I don't have to keep this PR open for when I'm back.

Would that be okay with you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note Rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants