Skip to content

Include comment text in token #4426

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 7, 2023
Merged

Conversation

harupy
Copy link
Contributor

@harupy harupy commented Jan 6, 2023

Include comment text in the token for a linter like Ruff which checks comments to ignore errors.

@DimitrisJim
Copy link
Member

You use the comment text in some way in Ruff? IIRC CPython doesn't keep the text around but I could be remembering wrong.

@fanninpm
Copy link
Contributor

fanninpm commented Jan 6, 2023

I'm assuming @charliermarsh wants something like the # noqa: E404 annotations in Flake8. It turns out that Flake8 grabs the physical line and regexes it (which is rather inefficient).

@harupy
Copy link
Contributor Author

harupy commented Jan 6, 2023

@DimitrisJim

CPython doesn't keep the text around but I could be remembering wrong.

Looks like it does:

> cat a.py
# foo
print(1)

> python -m tokenize a.py
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,5:            COMMENT        '# foo'        
1,5-1,6:            NL             '\n'           
2,0-2,5:            NAME           'print'        
2,5-2,6:            OP             '('            
2,6-2,7:            NUMBER         '1'            
2,7-2,8:            OP             ')'            
2,8-2,9:            NEWLINE        ''             
3,0-3,0:            ENDMARKER      ''

@harupy
Copy link
Contributor Author

harupy commented Jan 6, 2023

@DimitrisJim I found this.

https://github.com/python/cpython/blob/15c44789bb125b93e96815a336ec73423c47508e/Parser/tokenizer.c#L1603

    /* Skip comment, unless it's a type comment */
    if (c == '#') {
        const char *prefix, *p, *type_start;
        int current_starting_col_offset;

        while (c != EOF && c != '\n') {
            c = tok_nextc(tok);
        }

        if (tok->type_comments) {
            p = tok->start;
            current_starting_col_offset = tok->starting_col_offset;
            prefix = type_comment_prefix;
            while (*prefix && p < tok->cur) {
                if (*prefix == ' ') {
                    while (*p == ' ' || *p == '\t') {
                        p++;
                        current_starting_col_offset++;
                    }
                } else if (*prefix == *p) {
                    p++;
                    current_starting_col_offset++;
                } else {
                    break;
                }

                prefix++;
            }

It looks like cpython only produces tokens for type comments.

@charliermarsh
Copy link
Collaborator

Right now, we use the locations on the token and extract the comments from the source code. It works fine. It'd be convenient to have the comment text directly, and likely more performant for Ruff, but if it's a CPython incompatibility and the RustPython team would prefer not to include it, I'd understand that too.

@harupy
Copy link
Contributor Author

harupy commented Jan 6, 2023

@charliermarsh Thanks for the comment!

@harupy
Copy link
Contributor Author

harupy commented Jan 6, 2023

https://github.com/python/cpython/blob/72263f2a20002ceff443e3a231c713f2e14fe3fe/Lib/tokenize.py#L17

It is designed to match the working of the Python tokenizer exactly, except
that it produces COMMENT tokens for comments and gives type OP for all
operators. ...

It's unclear why they made this decision.

@charliermarsh
Copy link
Collaborator

@harupy - Always grateful for all the work you're doing to improve Ruff and RustPython :)

@DimitrisJim
Copy link
Member

Right now, we use the locations on the token and extract the comments from the source code.

Oh, that's rough, pun intended. I think its fine if the comment contents are caught since it doesn't change the semantics in any way, its a relatively small change that's easily maintainable and it helps the Ruff team simplify things.

@harupy
Copy link
Contributor Author

harupy commented Jan 6, 2023

@DimitrisJim Awesome! I love the pun.

Copy link
Member

@DimitrisJim DimitrisJim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, let's let @youknowone also give the ok and merge it.

@charliermarsh
Copy link
Collaborator

Thanks @DimitrisJim, grateful for all the collaboration that's happening between these two projects!

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This incompatibility is not a big deal. Let's go.

@youknowone youknowone merged commit ddc2e1b into RustPython:main Jan 7, 2023
charliermarsh pushed a commit to astral-sh/ruff that referenced this pull request Jan 7, 2023
RustPython/RustPython#4426 has been merged. We
can simplify code using text in comment tokens.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants