Skip to content

Changed dict key-matching regex to capture any valid dict key #920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Oct 6, 2021
Merged

Changed dict key-matching regex to capture any valid dict key #920

merged 13 commits into from
Oct 6, 2021

Conversation

arian-deimling
Copy link
Contributor

Fixes #917

@arian-deimling arian-deimling marked this pull request as ready for review September 27, 2021 23:06
@thomasballinger
Copy link
Member

I tried this out and found that this more general pattern meant the key matching too much of the line, so tab completion of the second dictionary on a line no longer works. I think .* is too broad a pattern, could we restrict it further? Maybe we could allow anything if it started with a ' or " and once that quote is matched, not allow more?

2021-09-27 18 34 40

It would be nice to add some tests, both of the new functionality you're adding (backslashes in strings work!) and this regression (second dictionary on a line should still work) like

def test_simple(self):
self.assertAccess("asdf|")
self.assertAccess("asdf|")
self.assertAccess("asdf[<>|")
self.assertAccess("asdf[<>|]")
self.assertAccess("object.dict[<abc|>")
self.assertAccess("asdf|")
self.assertAccess("asdf[<(>|]")
self.assertAccess("asdf[<(1>|]")
self.assertAccess("asdf[<(1,>|]")
self.assertAccess("asdf[<(1, >|]")
self.assertAccess("asdf[<(1, 2)>|]")
# TODO self.assertAccess('d[d[<12|>')
self.assertAccess("d[<'a>|")

These tests use a DSL to describe the expected match when calling one of these functions:
def assertAccess(self, s):
r"""Asserts that self.func matches as described
by s, which uses a little language to describe matches:
abcd<efg>hijklmnopqrstuvwx|yz
/|\ /|\ /|\
| | |
the function should the current cursor position
match this "efg" is between the x and y
"""

so a test for this regression might look like

    self.assertAccess("(asdf['abc'], asdf[<'ab>|") 

Thanks for looking into this!

@thomasballinger
Copy link
Member

@arian-deimling could you add tests to keep track of the different cases we've talked about?

This pattern doesn't match

>>> d = {'asdf': 123, 'a\'b"\\c': 234}
>>> d['a\'

but that didn't work before either, so I think your code is a winner!

You can run the tests by running pytest in the repo.

@arian-deimling
Copy link
Contributor Author

@thomasballinger If you don't mind, I'd like like to work on this a little bit more. It's more nuanced than I originally thought, but I think with a few more minor modifications, we could have the tab completion working for most different types of potential dict keys (a.k.a. hashable types - https://stackoverflow.com/questions/14535730/what-does-hashable-mean-in-python/44880799#44880799). I agree that there will probably have to be a few edge cases that we cannot handle.

And to answer your first question, yes, I will add some tests!

@arian-deimling
Copy link
Contributor Author

arian-deimling commented Sep 29, 2021

@thomasballinger Ok - ready for you to have another look with these new commits. This newest version should be able to tab-complete all str, bytes, and int keys, float keys besides special values (i.e. nan, inf, -inf), and tuple keys for tuples containing any combination of the aforementioned key possibilities.

Also, that testing stuff is pretty neat and straightforward. The changes pass all tests as well as 9 new ones.

This may be off topic, but I think this test case is irrelevant:

self.assertAccess("object.dict[<abc|>")

because in the case where you have typed some_dict[abc, the program is not trying to match abc as a key, but rather as the beginning of a variable or function name.

@thomasballinger
Copy link
Member

This may be off topic, but I think this test case is irrelevant because in the case where you have typed some_dict[abc, the program is not trying to match abc as a key, but rather as the beginning of a variable or function name.

I agree that that code path isn't useful when running bpython, and I am ambivalent about inverting the test and changing the behavior. I wrote out the following to think through this.


If we implemented fuzzy completion for dict keys (which we do have implemented in AttrCompletion and GlobalCompletion) then it would useful to match this, assuming we still want DictKeyCompletion to take precedence over expression completion in that situation. I don't plan to do that though, and I'm not sure we'd want to make it so fuzzy that quotes will be added for you. If the completer were every changed to return matches that included global variables that referred to dictionary keys we would need it... but no plans for that. If we decided to show all matches instead of filtering on prefix we'd want this, but that doesn't sound very useful.

The only reason to leave this test case is that always matching the item in the dict key position makes it easier to describe what DictKeyCompletion.locate() does since its more in the spirit of what locate() is supposed to do according to the BaseCompletion class:

* `locate(cur, line)` their initial target word to replace given a
line and cursor

A reason to change the behavior and explicitly not match anything other than hashable literals is that it would be a more efficient solution in some cases than what I implemented for deciding which completer to use: if DictKeyCompletion.locate() returned None, we could bail out of DictKeyCompletion.matches() before needing to evaluate the dictionary expression, get keys, and see if the expression is a prefix of any keys (and it never is, since the only candidate keys are literals). However if we want to bail out early we could just check manually in DictKeyCompletion.matches instead of changing what locate() means.

for completer in completers:
try:
matches = completer.matches(cursor_offset, line, **kwargs)
except Exception as e:
# Instead of crashing the UI, log exceptions from autocompleters.
logger = logging.getLogger(__name__)
logger.debug(
"Completer {} failed with unhandled exception: {}".format(
completer, e
)
)
continue
if matches is not None:
return sorted(matches), (completer if matches else None)
return [], None

For now, starting to type an identifier in the dict key position means we will not be able to complete a dictionary key. So returning None for something that could not possibly be a prefix of a valid key literal does make sense: there is a dict key, but it's not a completable one and this code is all about completion. So I'd be ok with this change, but I think it's ok as-is too.

Copy link
Member

@thomasballinger thomasballinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to get this right, this looks much more precise than we had it. I learned a bunch from reading it.

Does matching more types like this expand what works for substitutions on tab autocomplete? I'd like to understand the behavior change to figure out what the more complex code is buying us.

bpython/line.py Outdated
# pieces of regex to match repr() of several hashable built-in types
_match_identifiers_and_functions = r"""[a-zA-Z_][a-zA-Z0-9_.(),\[\] ]*"""
_match_single_quote_str_bytes = \
r"""b?'(?:\\(?:\\\\)*['"nabfrtvxuU]|[^'\\])*(?:\\\\)*\\?'?"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to figure out what was going on here, maybe it'd be helpful to link to the grammar for strings as justification for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I will take care of that.

bpython/line.py Outdated
@@ -34,7 +34,33 @@ def current_word(cursor_offset: int, line: str) -> Optional[LinePart]:
return LinePart(start, end, word)


_current_dict_key_re = LazyReCompile(r"""[\w_][\w0-9._]*\[([\w0-9._(), '"]*)""")
# pieces of regex to match repr() of several hashable built-in types
_match_identifiers_and_functions = r"""[a-zA-Z_][a-zA-Z0-9_.(),\[\] ]*"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this matching; identifiers and property lookup and indexing and function calls?

bpython/line.py Outdated
_match_identifiers_and_functions = r"""[a-zA-Z_][a-zA-Z0-9_.(),\[\] ]*"""
_match_single_quote_str_bytes = \
r"""b?'(?:\\(?:\\\\)*['"nabfrtvxuU]|[^'\\])*(?:\\\\)*\\?'?"""
_match_double_quote_str_bytes = r"""b?"[^"]*"?"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that double quotes simpler because repr() never surrounds a string with double quotes if it contains one? Playing around seems to be, cool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, based on some research and just testing different strings, repr() will only surround a string in double-quotes if it contains at least one single quote char (') and exactly zero double quote chars (").

bpython/line.py Outdated

_current_dict_key_re_base = r"""[\w_][\w0-9._]*\["""
_current_dict_key_re = LazyReCompile(
f'{_current_dict_key_re_base}('
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "_base" part of _current_dict_key_re_base confuses me a little, _base because it's the beginning of the pattern? because the dict is the base of the expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I could use some direction on the name if you don't like that, but basically, base is the part that I did not change which is everything before the [ character after which 'key matching mode' is entered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe _current_dict_key_dict_re, or maybe since all of these substrings are not indented to be used separately, a different name like _current_dict_key_dict_part — OH or to match your current convention of _match_* maybe just _dict or _dict_before_match or something. 🤞 there'll be a beautiful was to do this with re.VERBOSE, but if not just changing to something shorter is fine. I just don't want to force readers to know the history of the code in order to understand the variable name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to _match_dict_before_key.

@arian-deimling
Copy link
Contributor Author

Thanks for taking the time to get this right, this looks much more precise than we had it. I learned a bunch from reading it.

Does matching more types like this expand what works for substitutions on tab autocomplete? I'd like to understand the behavior change to figure out what the more complex code is buying us.

Yes, now you can tab-complete any of the types that I added a regex for, and the way I tested for this was to make a dict with those kinds of keys and then type my_dict[ and then make sure that for every new character I typed in, the completion options were still displayed to the user.

Tuples are a special case because you cannot tab-complete them unless you type my_dict[|] with the pipe character representing the cursor location. I'm not sure exactly what the cause is, but it matches the behavior prior to my changes, so I think it's okay. One thing I noticed in these cases where you type the closing brace [ is that when you use the tab completion to finish your key, bpython adds a closing brace ], so you are left with two ]. We may want to open another issue for that; however, this behavior is also consistent with the main branch on the bpython repo.

@arian-deimling
Copy link
Contributor Author

@thomasballinger Okay.. so I think I lost sight of what issue I was trying to solve but I think this final commit is the one!

This comment is what I'm basing this last change on:

The only reason to leave this test case is that always matching the item in the dict key position makes it easier to describe what DictKeyCompletion.locate() does since its more in the spirit of what locate() is supposed to do according to the BaseCompletion class

So the original issue was that we weren't matching everything that can possibly be in the key position, and the primary issue is strings because they can contain the ] character which is the natural way to detect the end of the dict key. To fix this I realized, we should first capture any string in the key position, and outside of strings, we should capture anything but the ] character. If we capture any number or combination of these using something like:
( (?: regex_to_match_any_string | regex_to_match_anything_but_close_square_bracket )* )
then we will capture the entire key. This has the added benefit of allowing text completion of for a multitude of other possible dict keys, such as ranges, complex numbers among others that my older version did not capture.

There is one regression here, but I think it is worth the additional functionality. Current in bpython, if you test
abcd[<'>|] the test will pass, but with my new version the version of the test that would pass is abcd[<'|]>. This is because the ' initiates string pattern matching and therefore the ] is considered part of the string that is being typed in. This means that you cannot tab-complete a string if you've typed my_dict[|] and then start typing the string at the cursor position; however, this regression is offset by the fact that you can tab-complete a string that contains the ] character.

Let me know what you think, but I think the solution to the regression is by making a change in the locate or match function down the line.

@arian-deimling
Copy link
Contributor Author

@thomasballinger Can you re-run the test? I fixed the code to pass the black linter test.

@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2021

Codecov Report

Merging #920 (a0f7fae) into main (b66a29f) will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   67.93%   67.98%   +0.05%     
==========================================
  Files          61       61              
  Lines        9150     9165      +15     
==========================================
+ Hits         6216     6231      +15     
  Misses       2934     2934              
Impacted Files Coverage Δ
bpython/line.py 96.18% <100.00%> (+0.18%) ⬆️
bpython/test/test_line_properties.py 98.89% <100.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b66a29f...a0f7fae. Read the comment docs.

@arian-deimling
Copy link
Contributor Author

I didn’t mean to do that. Just clicked on the button to see what is does. Sorry

@thomasballinger
Copy link
Member

No problem! I want Sebastian to take a look, and he's busy today so it'll be another day at least FYI.

@arian-deimling
Copy link
Contributor Author

@sebastinas I think it's good to go now.

@arian-deimling
Copy link
Contributor Author

Turned off black globally because I don't like double quotes and forgot to reenable it for this @sebastinas. Good to go now. Sorry

@thomasballinger
Copy link
Member

thomasballinger commented Oct 6, 2021

Thanks for the summary! Looks good to me.

This means that you cannot tab-complete a string if you've typed my_dict[|] and then start typing the string at the cursor position; however, this regression is offset by the fact that you can tab-complete a string that contains the ] character.

That is unfortunate, I bet it's more common to autocomplete inside [] than to want to complete keys with ] in them; but the way you have done it is more correct and if we want to fix this we can do it explicitly with a workaround instead of an accident of the way the code works. There are other similar edge cases here, it would be neat if e.g. my_dict["abc|"] completed the key "abcdef" but prefix matching isn't going to do that. We can fix those sometime separately.

@thomasballinger thomasballinger merged commit 18f6901 into bpython:main Oct 6, 2021
@sebastinas sebastinas added this to the release-0.22 milestone Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tab completion for dict key causes crash
4 participants