Don't completely dump the regex cache when full #72480

rhettinger · 2016-09-28T04:49:30Z

BPO	28293
Nosy	@warsaw, @rhettinger, @vstinner, @ezio-melotti, @methane, @serhiy-storchaka, @zhangyangyu
PRs	bpo-28293: The regex cache no longer completely dump when full. #3768
Files	re_cache_pop.diff re_cache_del_first.patch re_cache_ordered_dict_del_first.patch re_cache_ordered_dict_popitem.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2017-09-26.16:58:08.958>
created_at = <Date 2016-09-28.04:49:29.801>
labels = ['expert-regex', '3.7', 'library', 'performance']
title = "Don't completely dump the regex cache when full"
updated_at = <Date 2017-09-26.16:58:08.957>
user = 'https://github.com/rhettinger'

bugs.python.org fields:

activity = <Date 2017-09-26.16:58:08.957>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2017-09-26.16:58:08.958>
closer = 'serhiy.storchaka'
components = ['Library (Lib)', 'Regular Expressions']
creation = <Date 2016-09-28.04:49:29.801>
creator = 'rhettinger'
dependencies = []
files = ['44852', '44853', '44854', '44855']
hgrepos = []
issue_num = 28293
keywords = ['patch']
message_count = 20.0
messages = ['277580', '277582', '277583', '277585', '277586', '277587', '277588', '277589', '277590', '277591', '277620', '277667', '277677', '277682', '278885', '278920', '278937', '283983', '303054', '303055']
nosy_count = 9.0
nosy_names = ['barry', 'rhettinger', 'vstinner', 'ezio.melotti', 'mrabarnett', 'methane', 'SilentGhost', 'serhiy.storchaka', 'xiang.zhang']
pr_nums = ['3768']
priority = 'low'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue28293'
versions = ['Python 3.7']

rhettinger · 2016-09-28T04:49:29Z

When the regex cache gets filled, it is cleared in its entirety. Instead, it only needs to evict one arbitrary entry. This will save the us from having to rebuild and recache frequently used regexes.

serhiy-storchaka · 2016-09-28T05:00:01Z

LGTM.

zhangyangyu · 2016-09-28T05:06:30Z

But with the compact dict implementation, popitem is not going to evict an arbitrary entry but always the last one. Will this cause two interchangeably used regexes always need to recompile?

serhiy-storchaka · 2016-09-28T05:30:29Z

Nice catch! Here is a patch that deletes the first key.

rhettinger · 2016-09-28T05:46:46Z

Perhaps: _cache.pop(next(iter(_cache)))

The for-loop version indirect about what it is trying to do and relies on an obscure quirk of exactly when it is an error to mutate while iterating.

I do like that the side-effect of the compact dict is that is lets us choose the oldest entry to evict.

serhiy-storchaka · 2016-09-28T06:01:35Z

Perhaps: _cache.pop(next(iter(_cache)))

This can raise KeyError if the cache is cleared in other thread. And it is a little slower.

I remember why I didn't propose this idea earlier. This depends on the ordering of dict. But this is implementation detail, and in other Python implementation this can cause recompiling two interchangeably used regexes. No, this no longer LGTM.

Maybe use OrderedDict? But this adds heavy dependency to the re module.

Yet one idea: use two caches. Look first in the one cache, then in the other. When the first cache full, clear the second cache and swap caches.

serhiy-storchaka · 2016-09-28T06:14:01Z

Maybe use OrderedDict? But this adds heavy dependency to the re module.

We can use the same trick as in the enum module. Here is a patch.

zhangyangyu · 2016-09-28T06:20:01Z

Why not OrderedDict.popitem(last=False)?

rhettinger · 2016-09-28T06:25:49Z

This has gotten crazy. I withdraw the suggestion.

serhiy-storchaka · 2016-09-28T06:30:40Z

Why not OrderedDict.popitem(last=False)?

Yes, this should work. Here is a patch.

SilentGhost · 2016-09-28T14:04:29Z

Serhiy, your patch doesn't seem to apply cleanly. Also, it's perhaps worth opening the issue if you'd like to continue working on it.

serhiy-storchaka · 2016-09-28T21:59:02Z

It is applied cleanly on the default branch. I don't think there is a need to open a new issue. The last patch implements the original Raymond's intention, but taking into account all comments: drops the oldest item, doesn't depend on implementation details of dict, thread-safe.

rhettinger · 2016-09-29T01:23:00Z

I would rather not introduce an OrderedDict dependency to the re module while other implementations are using the pure python version or where the OD implementation is in flux. This is a barely worth it proposal, so we should either make a modest, minimalistic change or just drop it. Really, its too bad that the regex got contorted in a way that precluded the use of the lru_cache (DEBUG isn't the central feature of regex and is only minimally useful).

serhiy-storchaka · 2016-09-29T04:56:26Z

The re module now depends on the enum module, and the latter already depends on OrderedDict. Thus the patch doesn't introduce new dependency. Python implementation of OrderedDict doesn't override __getitem__ and therefore don't slow down the common case. I considered all this.

In contrary, Python implementation of lru_cache slows down the code (that is why using it was dropped in the past). But I did not object to this because re.sub() is less used than re.match() or re.search() and already is much slower.

vstinner · 2016-10-18T14:42:56Z

Can't we redesign the function to reuse @functools.lru_cache(_MAXCACHE) but take care of the debug flag? For example, split the function in smaller parts.

serhiy-storchaka · 2016-10-18T18:43:58Z

This is unlikely possible with current API of lru_cache. Testing flags before looking up in a cache adds an overhead.

vstinner · 2016-10-18T21:12:48Z

re_cache_ordered_dict_popitem.patch: LGTM, it can be pushed to the default branch.

--

"This is unlikely possible with current API of lru_cache. Testing flags before looking up in a cache adds an overhead."

Oh wait, I looked at the code and this case is complex because the function depends on the current locale if the LOCALE flag is set... A lookup in the the cache requires to get the current locale, but not if the LOCALE flag is not set... It seems too complex for @lru_cache().

methane · 2016-12-25T03:07:29Z

LGTM for re_cache_ordered_dict_popitem.patch

serhiy-storchaka · 2017-09-26T16:47:39Z

New changeset 114454e by Serhiy Storchaka in branch 'master':
bpo-28293: Don't completely dump the regex cache when full. (bpo-3768)
114454e

serhiy-storchaka · 2017-09-26T16:58:09Z

Thank you for your reviews.

Currently the caching logic is simpler, it no longer depends on the LOCALE flag and the current locale. But it still is too complex for lru_cache. It needs a way to bypass caching based on arguments or result.

rhettinger assigned serhiy-storchaka Sep 28, 2016

rhettinger added 3.7 (EOL) end of life stdlib Python modules in the Lib dir performance Performance or resource usage labels Sep 28, 2016

serhiy-storchaka assigned rhettinger and unassigned serhiy-storchaka Sep 28, 2016

rhettinger closed this as completed Sep 28, 2016

serhiy-storchaka reopened this Sep 28, 2016

serhiy-storchaka added the topic-regex label Oct 16, 2016

rhettinger assigned serhiy-storchaka and unassigned rhettinger Oct 16, 2016

serhiy-storchaka closed this as completed Sep 26, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022

xmo-odoo mentioned this issue Apr 27, 2022

ParseUserAgent, ParseOS and ParseDevice are not using the cache ua-parser/uap-python#97

Closed

serhiy-storchaka mentioned this issue Aug 27, 2022

Use double caching for re._compile() #96346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't completely dump the regex cache when full #72480

Don't completely dump the regex cache when full #72480

rhettinger commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

zhangyangyu commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

zhangyangyu commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

SilentGhost mannequin commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

rhettinger commented Sep 29, 2016

serhiy-storchaka commented Sep 29, 2016

vstinner commented Oct 18, 2016

serhiy-storchaka commented Oct 18, 2016

vstinner commented Oct 18, 2016

methane commented Dec 25, 2016

serhiy-storchaka commented Sep 26, 2017

serhiy-storchaka commented Sep 26, 2017

Don't completely dump the regex cache when full #72480

Don't completely dump the regex cache when full #72480

Comments

rhettinger commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

zhangyangyu commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

zhangyangyu commented Sep 28, 2016

rhettinger commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

SilentGhost mannequin commented Sep 28, 2016

serhiy-storchaka commented Sep 28, 2016

rhettinger commented Sep 29, 2016

serhiy-storchaka commented Sep 29, 2016

vstinner commented Oct 18, 2016

serhiy-storchaka commented Oct 18, 2016

vstinner commented Oct 18, 2016

methane commented Dec 25, 2016

serhiy-storchaka commented Sep 26, 2017

serhiy-storchaka commented Sep 26, 2017