Skip to content

tools/verifygitlog.py: Apply stricter rules on git subject line. #15823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dpgeorge
Copy link
Member

Summary

There is a bit of ambiguity as to how the prefix of the git subject line should look like. Eg py/vm: ... vs py/vm.c: ... (whether the extension should be there or not).

This commit makes the existing CI check of the git commit message stricter, by applying extra rules to the prefix, the bit before the : in the subject line. Full error messages are given when a rule does not pass.

This helps to reduce maintainer burden by applying stricter rules, to keep the git commit history consistent.

Testing

Tested by making a dummy commit with various errors in the subject prefix.

@dpgeorge dpgeorge added the tools Relates to tools/ directory in source, or other tooling label Sep 10, 2024
@dpgeorge dpgeorge requested a review from projectgus September 10, 2024 02:52
Copy link

codecov bot commented Sep 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (3f1df4b) to head (50e9378).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15823   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files         169      169           
  Lines       21890    21890           
=======================================
  Hits        21572    21572           
  Misses        318      318           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dpgeorge dpgeorge force-pushed the tools-verifygitlog-stricter-subject-prefix branch from 423269c to 37ec6a4 Compare September 10, 2024 02:57
Copy link

github-actions bot commented Sep 10, 2024

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@projectgus
Copy link
Contributor

projectgus commented Sep 10, 2024

I support whatever tooling makes it easier for you to generate changelogs(*), but I am concerned that the "Subject prefix must correspond to a path" requirement will make certain kinds of commit messages less informative generally, and might feel pretty punitive for contributors to bounce off in CI.

Based on a quick survey of recent history(^), the commit message I'm most concerned about are:

Commit title is summarising multiple things with a common prefix

The prefix is meaningful, but the the actual prefix name doesn't exist. Some examples:

  • "py/asm: Add ASM_NOT_REG and ASM_NEG_REG macros for unary ops." - error: commit f52b0d0: Subject prefix must correspond to a path in the repository.
  • "stm32/boards/stm32f4x9_af.csv: Fix DCMI_VSYNC." - error: commit c028f95: Subject prefix must correspond to a path in the repository.
  • "extmod/modtls: Move the native ssl module to tls." - error: commit b802f0f: Subject prefix must correspond to a path in the repository."

For any of these examples, changing them to confirm to the rule would seemingly make the title a significantly worse summary of the change.

Commit makes some small change in multiple places

In particular, multiple ports. Example:

  • "esp32,mimxrt,stm32: Implement ipconfig() for more network interfaces." - error: commit 1f23ab1: Subject prefix must correspond to a path in the repository.

You could argue that this should be split into N commits, but in a lot of cases I'd argue that makes the commit history much noisier for understanding what a particular change is (because need to figure out retrospectively which changes were applied atomically, rather than having them grouped in a commit).


What if the prefix requirement was:

  1. Relaxed to the first component (before the /, if any). That part required to be either a top-level directory, a directory in ports, etc?
  2. Expanded to also accept a comma-delimited list of these, provided each element in the list meets the other requirements.

This should still allow programmatic extraction of which section a commit applies to in the changelog, and I am guessing that is the main goal here?


Separate suggestion, but after they're decided please enumerate all of these rules very clearly in the CODECONVENTIONS doc, and maybe with a few more examples. MicroPython is already quite strict regarding commit message formats, and having this kind of detail arbitrarily rejected (from a new contributor's point of view) can be very dispiriting - especially if it's not obvious what exactly will be accepted.


(*) Note there also some third party tools we could lean on for this stuff, i.e. https://git-cliff.org/

(^) My survey method was for commit in $(./tools/verifygitlog.py | head -n100 | cut -d' ' -f3 | sed s/://); do git show $commit; done - but in retrospect I should also have also grepped out the signed-off-by errors and maybe provided a range where the newest commits come first. Note there are also multiple false positives for renamed files.

@projectgus projectgus requested a review from jimmo September 10, 2024 06:19
@projectgus
Copy link
Contributor

projectgus commented Sep 10, 2024

makes it easier for you to generate changelogs(*),

I realised this was my assumption rather than stated as the reason. Is there a different reason?

@jimmo
Copy link
Member

jimmo commented Sep 10, 2024

Angus' examples above of historical commits that don't match real paths is quite relevant to my experience -- I often find it hard to know which path to use for a given commit, and e.g. using a prefix was a good escape hatch.

But I'm not sure I'm entirely convinced that the filename is useful here -- when I'm looking through the history I want to know the conceptual area that was changed, rather than the exact file (we're just duplicating information that the tools already know).

As an example, let's say I was changing the code in py/vm.c that invokes the scheduler. From a software archaeology perspective, I think I'd rather the commit was py/scheduler: ....

If the purpose of the subject prefix is to make it easy e.g. to generate the release notes then maybe we should come up with a finite, curated set of valid areas/tags, and then the validation is just against that list. (Similar to what we did to get the board json features somewhat under control). We might even be able to generate that set from the existing git log.

Worth mentioning while we're here -- I did spend considerable time a while back trying to make https://github.com/jorisroovers/gitlint work for this repo (and some other tools too), but could not find a way to capture our rules. The change in this PR would make it even less likely to be able to move to a third-party tool.

@stinos
Copy link
Contributor

stinos commented Sep 10, 2024

by applying extra rules to the prefix

Can you add these rules, or examples thereof, to the commit message so that it's clear what this actually does?

In any case, more often than not when I'm writing a commit message I look at the files' history to figure out what the proper prefix is. So it would definitely not be bad to have a clear and fixed set of rules there.

Commit makes some small change in multiple places

Indeed; I assume all: is still allowed?

From a software archaeology perspective, I think I'd rather the commit was py/scheduler: ....

I could see that being clear, but since none of the previous commits modifying the scheduler do that breaking consistency would also not be ideal.

@robert-hh
Copy link
Contributor

The arguments Angus brought up are valid in my opinion. There are quite a few commits which are not limited to a single file. And sometimes even if it is a single file it may be inconvenient to use when the path name is long. Making a good and informative commit message is a challenge, like making a good headline in the newspaper business. The length is limited, it shall carry the core message and still be acceptable language. More restrictions make it even harder. Having a clear message in the commit message seems to me more important than calling a single file that has been affected. However, if it is a single file, it can be named.

Other examples where you would break the single file rule would be a change to a xyz.c file and the related xxx.h header file(s). So one would mention the xyz.c file in the commit message, but not the other changes. Or consider changes in the configuration split between a mpconfigport.h and mpconfigboard.h pair. Which file to mention? Or recently the change to the .mk files handling variants.

@dpgeorge
Copy link
Member Author

Hmm, I really didn't expect this to be so controversial!

Most of the rules here are already followed implicitly and my aim was to simply encode them so that they are properly enforced. Eg sometimes a contributor starts a commit message with ports/stm32 and that's not consistent with the existing commits which omit the ports/ bit. And other times people add .c/.h (myself included) and that's also not consistent. So I was just trying to enforce the existing status quo.

Doing a quick check of the 1000 most recent commits, about 98% of them conform to the rules here (would need to do a proper count to get a more accurate number, because some files were renamed and then the check fails on them after the rename). That's pretty consistent, and IMO worth making the remaining 2% also conform.

Expanded to also accept a comma-delimited list of these, provided each element in the list meets the other requirements.

Yes, I missed this, it should accept a comma-separated list of valid prefixes.

Separate suggestion, but after they're decided please enumerate all of these rules very clearly in the CODECONVENTIONS doc,

Yes, good idea.

@agatti
Copy link
Contributor

agatti commented Sep 10, 2024

Apologies for derailing the discussion a bit but since an issue for this topic is already here...

Can this be also updated to not discard the commit message if there's a typo in there? It's really frustrating to type a 10-12 lines commit message only for it to disappear due to a typo, it should probably be saved somewhere else. The typo doesn't even have to be in the commit message - even names of files that are not scheduled for commit trigger the rejection. Ideally the checks should occur between line 1 and wherever the sign-off should be.

If needed I can submit a new issue specifically for those two bits.

@projectgus
Copy link
Contributor

Can this be also updated to not discard the commit message if there's a typo in there?

This one is a git pre-commit hook thing, rather than a verifygitlog.py thing (although the more strict the rules then the more often it will happen). There are ways to get the old message back - I don't use the git command line much but you can tell it to reopen the last commit message there - if you use another client then it may have an even easier way.

If you can see a better way to retain the message when using pre-commit then we could also add that. Maybe echoing the old message to the console, or even just a clue about "Existing message saved in {path}" would help a lot?

even names of files that are not scheduled for commit trigger the rejection.

This seems like a bug in the pre-commit hook for sure, could you open an issue for it please?

@projectgus
Copy link
Contributor

Most of the rules here are already followed implicitly and my aim was to simply encode them so that they are properly enforced. Eg sometimes a contributor starts a commit message with ports/stm32 and that's not consistent with the existing commits which omit the ports/ bit. And other times people add .c/.h (myself included) and that's also not consistent. So I was just trying to enforce the existing status quo.

I don't want this to sound overly argumentative, but what problem is caused by these occasional inconsistencies? Is there some task that will become a lot easier with stricter enforcement here? If there is, then absolutely in favour of adding rules to make that thing easier (assuming that's the best way to improve it).

As @jimmo said, the git commit already has information about paths, etc. in the actual commit data. There's no end of options like git log --stat, or git log <path>, or git log -S<thing> to query that data in a reliable way and get consistent output that's guaranteed to be comprehensive.

On the other hand, like @robert-hh said, commit messages have a human component - they're subjective metadata by their nature. They give the author a chance to communicate intent, context, and similar things which aren't in the actual commit data. Those things are often slightly fuzzy.

Moreover, there's no set of mechanistic rules that will force someone to write a clear commit message if they're not inclined to do that. On the other hand, as discussed above, there might be excessively strict rules that force someone to write a less communicative commit message in order to get it past the checks. Should also not rule out the increased frustration that strict and arbitrary-seeming hook & CI failures create for contributors.

To be clear, I'm not arguing in favour of messy git commit logs or that commit messages don't matter. I'm mostly saying that it seems to me MicroPython already has a very good commit history! So if we're going to make the bots stricter, it should be in return for significant other benefit(s).

@stinos
Copy link
Contributor

stinos commented Sep 11, 2024

it should be in return for significant other benefit(s).

Imo the benefit of consistency alone is worth pursuing and significant enough. Makes reading easier, and writing as well.

And since the vast majortity of the messages are in fact correct already, it is unlikely that this is going to lead to substantial frustration?

@agatti
Copy link
Contributor

agatti commented Sep 11, 2024

@projectgus Thanks for the info about recovering the commit message - I didn't know that was possible.

If you can see a better way to retain the message when using pre-commit then we could also add that. Maybe echoing the old message to the console, or even just a clue about "Existing message saved in {path}" would help a lot?

Something like that or the git command line to recover the commit would help me for sure, but maybe I'm the exception here.

even names of files that are not scheduled for commit trigger the rejection.

This seems like a bug in the pre-commit hook for sure, could you open an issue for it please?

Done.

@andrewleech
Copy link
Contributor

but maybe I'm the exception here.

No I second your pain, I hate losing a carefully scripted commit message because of a pre-commit fail.

That's one of the things I love about the smartgit gui app I use, it remembers the message and reuses it by default if the commit doesn't go through.

I didn't realise there's possible ways to do the same on command line, I'll have to look them up too

@Gadgetoid
Copy link
Contributor

From the perspective of someone who's run afoul of the commit message formatting a number of times:

  1. I am all for stricter (or shall we say, less ambiguous) rules which help me to write consistent commit messages in the format wanted by the maintainers- there are definitely still some ambiguities but I cannot for love nor money force my brain to read more than a cursory amount of documentation so a "computer says no" message certainly helps me.

  2. I think strict rules and automated requirements are a good filter for contributors who are likely to be motivated to discuss, amend and stand by their pull-request. This might seem disingenuous and perhaps even a little gatekeepy, but if someone can't negotiate some strict commit message rules in good faith then are they going to get far as a contributor at all?

  3. People actually use the pre-commit hooks? 😆

@UnexpectedMaker
Copy link
Contributor

Well I'll chime in for the other side and say the rules were already too strict (and the error messages WAY too vague) and making them stricter is going to definitely push more people away from contributing... even with clearer error messages.

For example... leaving a . off the end of a commit subject line - really? That should not derail the commit. If it's not there and you really insist it should be there, add it automatically.

Also, I feel if the initial commit message used to raise a PR follows the correct formatting, then subsequent commit messages in a PR should not have to, if it's something minor like "fixing a code formatting fail" - like when I get told I can't use 'big' and MUST use "big" even though both are valid and acceptable in Python ;)

@dpgeorge
Copy link
Member Author

Let me give some recent examples of PRs and their commit messages (I'm not intending to single out people here, it's just that these were recent PRs I processed that are good examples of why I wanted to make the rules stricter):

#15822

This is a very simple, small PR, and it touches only docs/library/machine.Pin.rst. The title of the commit was:

doc: Document machine.Pin.toggle() method.

That should be docs: or docs/library:, for consistency with all other docs commits. A check for a valid path would have caught this.

This may seem like a very small inconsistency but when you scan through the commit history and some things start with docs: and others start with doc: then it adds a little bit of friction because you wonder why there's a difference and if it was just a typo or if there was some meaning to the difference. It also makes it harder to automatically categorise the commits for the release notes.

(Note that I rewrote this commit message to start with docs/library: during merge.)

#13572

This PR has two commits. Both of them touch exactly the same set of files (extmod/vfs_blockdev.c and an associated test). The commits are titled:

extmod/vfs_blockdev: Check block device function positive results.
vfs: Implement common helper for vfs_blockdev read and write.

I'm really not sure why they are different like that, but IMO they should both be the same and both start with extmod/vfs_blockdev: (for the same reasons above, consistency, less friction reading the commit history, and easier to collate).

#15798

This was titled:

ports/stm32/uart.c: Add UART RX/CTS pin pull config options.

I rewrote it to start with stm32/uart:, so it didn't start with ports/ or have a .c.

#15854

This was titled:

ports/rp2: Include check for CYW43_GPIO in machine_pin_cyw43.c.

I rewrote it to start with rp2/machine_pin_cyw43:, to have the file name before the : instead of part of the sentence. Again for consistency.

#12283

The two commits in this PR were:

ports/rp2: Additions for rp2 init hooks.   
ports/rp2: Increased ext pins to 32.

In addition to starting with ports/, the grammar/tense here doesn't match the grammar/tense of other commits. So I rewrote these messages as:

rp2: Add board-level hooks to main, and MICROPY_SOURCE_BOARD cmake var.
rp2: Increase ext pins from 10 to 32.

All the above examples pass the current checks but would have failed the checks added in this PR. Of course this PR doesn't check grammar/tense, that would be very hard to do (maybe with an LLM). So there's always going to be some level of human checking needed, if we want to keep consistency.

Maybe I'm being very picky here? I actually don't mind tweaking commit messages during merge, and I do like the consistency it brings. But moving forward, if other maintainers are going to be merging commits, it would be good to have a set of rules in place around what is an acceptable commit message.

This PR tries to do that in a simple way using path names (although still needs some exceptions to the rule). Maybe there's a better way to force consistency without being as strict.

@stinos
Copy link
Contributor

stinos commented Sep 25, 2024

Maybe I'm being very picky here?

In a sense, yes, but in the end it's just different semantics to say 'we want consistency' :)

As mentioned earlier: imo what is lacking from this commit is spelling out more clearly what is and what is not allowed. For example codeconventions.md still says "It's also ok to drop file extensions." whereas this commit enforces a slightly different rule.

@agatti
Copy link
Contributor

agatti commented Sep 25, 2024

Of course this PR doesn't check grammar/tense, that would be very hard to do (maybe with an LLM).

Hrm, would something like vale together with languagetool help here?

Given that it won't work 100% of the time, maybe having extra checks in the precommit hooks using those tools can still be helpful. Just run them only if the appropriate tools are installed on the system and report a warning instead of an error. It's still better than nothing, especially for people who aren't native English speakers :)

Vale already has an integration for GitHub CI, not sure about languagetool.

There is a bit of ambiguity as to how the prefix of the git subject line
should look like.  Eg `py/vm: ...` vs `py/vm.c: ...` (whether the extension
should be there or not).

This commit makes the existing CI check of the git commit message stricter,
by applying extra rules to the prefix, the bit before the : in the subject
line.  Full error messages are given when a rule does not pass.

This helps to reduce maintainer burden by applying stricter rules, to keep
the git commit history consistent.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge dpgeorge force-pushed the tools-verifygitlog-stricter-subject-prefix branch from c077308 to 50e9378 Compare May 1, 2025 02:10
@dpgeorge
Copy link
Member Author

dpgeorge commented May 1, 2025

I've updated this to not be as strict. It no longer checks that the prefix of the subject line is a file in the repo. So now it just checks:

  • the subject line doesn't start with unwanted bits: ., /, ports/
  • the subject prefix doesn't have an extension: .c, .h, .cpp, .js, .rst or .md

@dpgeorge dpgeorge added this to the release-1.26.0 milestone May 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tools Relates to tools/ directory in source, or other tooling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants