-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Bulk MAD generator: Support databases from DCA runs #19627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Generalizes the bulk MaD generator by externalizing per-language targets into JSON files, enabling DCA-based database downloads, and adding C++ support.
- Extract Rust project list into a JSON config and remove the old Rust-only script.
- Introduce a C++ config file that uses DCA strategy for database downloads.
- Update the bulk generator to read JSON configs and handle both Rust and C++.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
File | Description |
---|---|
rust/misc/bulk_generation_targets.json | New JSON config listing Rust crates for bulk MaD generation |
misc/scripts/models-as-data/rust_bulk_generate_mad.py | Removed legacy Rust-specific bulk generation script |
cpp/misc/bulk_generation_targets.json | New JSON config for C++ bulk MaD generation using DCA |
Comments suppressed due to low confidence (3)
rust/misc/bulk_generation_targets.json:22
- The git_tag values use inconsistent 'v' prefixes. Standardize tag formats to match the repository’s official tag naming or remove the 'v' prefix for consistency.
"git_tag": "v1.21.3"
cpp/misc/bulk_generation_targets.json:2
- [nitpick] Indentation in this JSON file uses two spaces but the Rust config uses four; consider aligning formatting across config files for consistency.
"strategy": "dca",
cpp/misc/bulk_generation_targets.json:8
- [nitpick] Consider adding an 'extractor_options' field (similar to the Rust config) if any custom CodeQL extractor options are required for C++ analysis.
}
…o a configuration file.
…eralizes the existing functionality to be independent of Rust and instead depend on the configuration file and the command-line arguments.
a9067a6
to
cb93870
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work! 😄
A few high-level remaks:
-
I think it would be very useful if the strategy was not specified for the whole list of projects, but could be changed on a per-project basis.
We could add a
"dca": true
field to DCA projects, or simply say that those without agit_repo
are taken to mean DCA projects. Then we'd split the list of projects in two and runbuild_databases_from_projects
one list anddownload_dca_databases
on the other. -
I've been formatting the script with the Black Python formatter. I think it would be great to keep doing that going forward, as syntactical inconsistencies otherwise really easily sneaks in. For instance in
download_dca_databases
the indentation shifts from begin 4 spaces to 2 spaces at one point.
] | ||
return database_results | ||
|
||
def github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a slightly more descriptive name?
def github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict: | |
def get_json_from_github(url: str, pat: str, extra_headers: dict[str, str] = {}) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Fixed in 566bf43
database_results = download_dca_databases(experiment_name, pat, projects) | ||
|
||
# Phase 3: Generate models for all projects | ||
print("\n=== Phase 3: Generating models ===") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this "Phase 3" doesn't really make sense anymore as it depends on the strategy.
print("\n=== Phase 3: Generating models ===") | |
print("\n=== Generating models ===") |
Same for the two other "Phase"s being printed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in b640474
parser.add_argument("--lang", type=str, help="The language to generate models for", required=True) | ||
parser.add_argument("--with-sources", action="store_true", help="Generate sources", required=False) | ||
parser.add_argument("--with-sinks", action="store_true", help="Generate sinks", required=False) | ||
parser.add_argument("--with-summaries", action="store_true", help="Generate sinks", required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be great if running the script is as simple as possible and if the output is fully determined by the data in the JSON file.
Would it work for your use case to move this into the configuration file? Perhaps with language on the root-level and the --with-*
options on a per-project basis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# A project to generate models for | ||
class Project(TypedDict): | ||
""" | ||
Type definition for projects (acquired via a GitHub repo) to model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use this type for all project, including those from DCA. I think we'd just have to make git_repo
into NotRequired[str]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! Fixed in 7121f5c
# Sort the database results based on the order in the projects file | ||
return sorted(database_results, key=cmp_to_key(compare)) | ||
|
||
def get_destination_for_project(config, name: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "destination" by itself is a bit ambiguous.
def get_destination_for_project(config, name: str) -> str: | |
def get_mad_destination_for_project(config, name: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 7c89d6d
parser = argparse.ArgumentParser() | ||
parser.add_argument("--config", type=str, help="Path to the configuration file.", required=True) | ||
parser.add_argument("--dca", type=str, help="Name of a DCA run that built all the projects", required=False) | ||
parser.add_argument("--pat", type=str, help="PAT token to grab DCA databases (the same as the one you use for DCA)", required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a path to a PAT instead of the PAT itself? I don't think it's super nice to paste tokens on the command line (not that having it in a plaintext file is much better in terms of security).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, that's an excellent point 😅 Fixed in 7c2612a
return a_index - b_index | ||
|
||
# Sort the database results based on the order in the projects file | ||
return sorted(database_results, key=cmp_to_key(compare)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about making database_results
a map and picking the results out of the map based on projects
. Similar to what's done in clone_projects
:
project_dirs = [project_dirs_map[project["name"]] for project in projects]
That would avoid the gnarly compare
function.
Thanks, Simon!
I've done this in 7ecf8c8. I'll work on the other requests and keep it formatted as I push changes. |
…cifying the formatting requirements.
Co-authored-by: Simon Friis Vindum <paldepind@github.com>
…ination_for_project'.
…inks' in the config file.
d480b8b
to
1228080
Compare
Co-authored-by: Simon Friis Vindum <paldepind@github.com>
I think I've addressed all of your comments now, @paldepind. Thanks a lot for the detailed feedback! ❤️
I think that's a really good idea. I'm happy to do this, but because I'm going on vacation next week (and half of the week following that) I'd like to delay that change until I'm back so that I don't have to keep this PR open for when I'm back. Would that be okay with you? |
This PR generalizes the bulk MAD generator script that was added in #19499. In particular, this PR:
After these changes Rust can invoke the bulk builder script as:
and C++ can invoke the script as:
The personal access token (PAT) required for the C++ run is necessary to download the databases from DCA. It is the same PAT as the one you use for DCA.
(Note: The
name
fields in the C++ configuration JSON file requires naming the relevant DCA projects in the DCA source suite. I'll do that as a follow-up when I create a suite specifically for MaD generation)@paldepind would you mind taking a look at this?
Commit-by-commit review recommended.