Bulk MAD generator: Support databases from DCA runs #19627

MathiasVP · 2025-05-29T16:54:26Z

This PR generalizes the bulk MAD generator script that was added in #19499. In particular, this PR:

Moves the generated models for C++ so that their location aligns with where Rust stores them. This should have no impact on C/C++ analysis.
Moves the Rust-specific parts over to a JSON file and puts that in the rust folder.
Adds a "Download databases from DCA" feature to the script, so that the script can be pointed at a DCA run and generate models based on those databases.
Finally, adds C++ support for bulk building by adding a C++ configuration file.

After these changes Rust can invoke the bulk builder script as:

python3 misc/scripts/models-as-data/bulk_generate_mad.py --lang rust --with-summaries --with-sources --with-sinks --config rust/misc/bulk_generation_targets.json

and C++ can invoke the script as:

python3 misc/scripts/models-as-data/bulk_generate_mad.py --dca NAME_OF_DCA_RUN --pat PAT --lang cpp --with-summaries --config cpp/misc/bulk_generation_targets.json

The C++ run needs a personal access token (PAT) to download the databases from DCA. It is the same PAT as the one you use for DCA.
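
For readers who want to see how the flags in the two invocations fit together, here is a minimal argparse sketch. The flag names are taken from the commands above; everything else (the help texts, the function names, and the rule that `--dca` must be accompanied by `--pat`) is an assumption for illustration and may not match how bulk_generate_mad.py actually parses its arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the PR description (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Sketch of a bulk MaD generation command line"
    )
    parser.add_argument("--lang", required=True, help="language, e.g. rust or cpp")
    parser.add_argument("--config", required=True, help="path to the targets JSON file")
    parser.add_argument("--dca", help="name of the DCA run to download databases from")
    parser.add_argument("--pat", help="personal access token used for DCA")
    parser.add_argument("--with-summaries", action="store_true")
    parser.add_argument("--with-sources", action="store_true")
    parser.add_argument("--with-sinks", action="store_true")
    return parser


def parse_args(argv):
    """Parse argv and apply one assumed consistency check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a DCA download cannot work without a token, so fail early.
    if args.dca and not args.pat:
        parser.error("--pat is required when --dca is given")
    return args
```

With this sketch, the C++ command line above would parse to `lang == "cpp"`, `with_summaries == True`, and the two other `--with-*` flags left at `False`.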

(Note: The name fields in the C++ configuration JSON file require naming the relevant DCA projects in the DCA source suite. I'll do that as a follow-up when I create a suite specifically for MaD generation.)

@paldepind would you mind taking a look at this?

Commit-by-commit review recommended.

Copilot

Pull Request Overview

Generalizes the bulk MaD generator by externalizing per-language targets into JSON files, enabling DCA-based database downloads, and adding C++ support.

Extract Rust project list into a JSON config and remove the old Rust-only script.
Introduce a C++ config file that uses DCA strategy for database downloads.
Update the bulk generator to read JSON configs and handle both Rust and C++.
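
As a rough illustration of reading a per-language JSON config, the sketch below loads a file and performs basic checks. Only the "strategy" key, its "dca" value, and the per-project "name" field appear in this PR discussion; the top-level "targets" key, the "repo" strategy name, and the `load_config` helper are hypothetical and may differ from the real files and script.

```python
import json

# "dca" comes from the PR; "repo" is a hypothetical name for the
# clone-and-build strategy used by the Rust config.
KNOWN_STRATEGIES = {"dca", "repo"}


def load_config(path):
    """Load a bulk-generation config and check the fields this sketch relies on."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)

    strategy = config.get("strategy")
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Assumption: projects live under a top-level "targets" list.
    for target in config.get("targets", []):
        if "name" not in target:
            raise ValueError(f"target without a name: {target!r}")
    return config
```

Failing fast on an unknown strategy or a nameless target keeps configuration mistakes from surfacing later, in the middle of a long database build or download.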

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

File	Description
rust/misc/bulk_generation_targets.json	New JSON config listing Rust crates for bulk MaD generation
misc/scripts/models-as-data/rust_bulk_generate_mad.py	Removed legacy Rust-specific bulk generation script
cpp/misc/bulk_generation_targets.json	New JSON config for C++ bulk MaD generation using DCA

Comments suppressed due to low confidence (3)

rust/misc/bulk_generation_targets.json:22

The git_tag values use inconsistent 'v' prefixes. Standardize tag formats to match the repository’s official tag naming or remove the 'v' prefix for consistency.

"git_tag": "v1.21.3"

cpp/misc/bulk_generation_targets.json:2

[nitpick] Indentation in this JSON file uses two spaces but the Rust config uses four; consider aligning formatting across config files for consistency.

  "strategy": "dca",

cpp/misc/bulk_generation_targets.json:8

[nitpick] Consider adding an 'extractor_options' field (similar to the Rust config) if any custom CodeQL extractor options are required for C++ analysis.

…o a configuration file.

…eralizes the existing functionality to be independent of Rust and instead depend on the configuration file and the command-line arguments.

paldepind

Really great work! 😄

A few high-level remarks:

I think it would be very useful if the strategy was not specified for the whole list of projects, but could be changed on a per-project basis.

We could add a "dca": true field to DCA projects, or simply say that those without a git_repo are taken to mean DCA projects. Then we'd split the list of projects in two and run build_databases_from_projects on one list and download_dca_databases on the other.

I've been formatting the script with the Black Python formatter. I think it would be great to keep doing that going forward, as syntactical inconsistencies otherwise sneak in really easily. For instance, in download_dca_databases the indentation shifts from being 4 spaces to 2 spaces at one point.
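
To make the per-project strategy suggestion concrete, here is a minimal sketch of the split. It treats a project as a DCA project if it has "dca": true or lacks a git_repo, covering both options mentioned above. The `split_by_strategy` helper and the sample project entries are hypothetical, and the signatures of build_databases_from_projects and download_dca_databases are not assumed here.

```python
def split_by_strategy(projects):
    """Split projects into (repo_projects, dca_projects).

    A project counts as a DCA project if it sets "dca": true or has no
    "git_repo" field; everything else is built from its repository.
    """
    repo_projects = []
    dca_projects = []
    for project in projects:
        is_dca = bool(project.get("dca", False)) or "git_repo" not in project
        if is_dca:
            dca_projects.append(project)
        else:
            repo_projects.append(project)
    return repo_projects, dca_projects


# Hypothetical entries, for illustration only.
example_projects = [
    {"name": "some-crate", "git_repo": "https://example.com/some-crate.git"},
    {"name": "some-dca-project"},
    {"name": "flagged-project", "git_repo": "https://example.com/p.git", "dca": True},
]
repo_projects, dca_projects = split_by_strategy(example_projects)
```

The script would then hand the first list to build_databases_from_projects and the second to download_dca_databases, so a single config file could mix both kinds of project.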

paldepind · 2025-05-30T08:21:57Z