Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-96346: Use double caching for re._compile() #96347

Merged
merged 5 commits into from
Oct 7, 2022

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Aug 27, 2022

@serhiy-storchaka serhiy-storchaka added the performance Performance or resource usage label Aug 27, 2022
@rhettinger
Copy link
Contributor

rhettinger commented Aug 27, 2022

This looks like a reasonable change.

Consider whether OrderedDict should be used in the cache where the order changes. We've shown that next(iter(d)) is quadratic while od.popitem(False) is linear. For the cache sizes in the re module, this might not matter or might be outweighed by OrderedDict's slower __getitem__ and __setitem__.

It is fine to go ahead with this PR since it is improves on what we have now.

@serhiy-storchaka
Copy link
Member Author

It was discussed in #76519, performance difference is negligible. On other hand, OrderedDict adds a dependence. And it would be much worse if it is implemented in Python.

@serhiy-storchaka
Copy link
Member Author

I added some data in the issue.

Lib/re/__init__.py Outdated Show resolved Hide resolved
# _cache2 uses the simple FIFO policy which has better latency.
# _cache uses the LRU policy which has better hit rate.
# OrderedDict is not used because it adds a new dependence, and
# performance difference is negligible.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would omit the OrderedDict part of the comment. It is debatable and doesn't need to be in the code. The important part is the two lines before that explain the two caches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an explanation why OrderedDict is not used at first place, no?

Also, I was afraid that some new contributor passing through the code and not aware about the history of this code can submit a PR with an "obvious" improvement, and it can be merged while I am not here. The history of this code contains many changes and reversions.

I removed this, and hope the new comment about next(iter(_cache)) will be enough.

@rhettinger
Copy link
Contributor

Side discussion: I was looking at the code in tomllib and its cache has maxsize=None. I wondered whether its use pattern is substantially different or whether the policy should be the same as for re. Perhaps the tomllib module should be bounded, or perhaps the re module didn't really need a limit. Another alternative is to set a limit that is very high so that most users never hit the limit and never trigger an eviction. After all, the objects being cached are very small, so setting a size limit is arguably a premature optimization. The code would be much simpler and a little faster without a size limit.

@hukkin
Copy link
Contributor

hukkin commented Aug 30, 2022

Side discussion: I was looking at the code in tomllib and its cache has maxsize=None. I wondered whether its use pattern is substantially different or whether the policy should be the same as for re

Hi there, it's tomllib author here! Tomllib's cache does not have an explicit bound, but is implicitly bound to 2880 items. This stems from the fact the the function's input always has to go through a regex before the lru_cached function is called.

The number 2880 comes from 24 hours * 60 minutes * 2 (offset direction).

@rhettinger
Copy link
Contributor

@hukkin I suggest adding a comment to that effect.

@serhiy-storchaka
Copy link
Member Author

Thank you Raymond for looking at this yet one time.

As for the limit, I think it is needed here, because RE patterns can be created on-fly, depending on the user data, and some programs may use millions of different patterns, but every pattern is only used once.

p = _compiler.compile(pattern, flags)
if not (flags & DEBUG):

key = (type(pattern), pattern, flags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be defined before the try/except above?

I also wonder if an if key in _cache2: return _cache2[key] would be more efficient than the try/except.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this would add a small but measurable overhead in the common case. I tested this when I wrote the current implementation.

if not (flags & DEBUG):

key = (type(pattern), pattern, flags)
p = _cache.pop(key, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it remove it from the _cache?
I think it would be better to add a comment to elaborate a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we need to move it to the end.

@ambv ambv merged commit c11b667 into python:main Oct 7, 2022
carljm added a commit to carljm/cpython that referenced this pull request Oct 8, 2022
* main: (38 commits)
  pythongh-92886: make test_ast pass with -O (assertions off) (pythonGH-98058)
  pythongh-92886: make test_coroutines pass with -O (assertions off) (pythonGH-98060)
  pythongh-57179: Add note on symlinks for os.walk (python#94799)
  pythongh-94808: Fix regex on exotic platforms (python#98036)
  pythongh-90085: Remove vestigial -t and -c timeit options (python#94941)
  pythonGH-83901: Improve Signature.bind error message for missing keyword-only params (python#95347)
  pythongh-61105: Add default param, note on using cookiejar subclass (python#95427)
  pythongh-96288: Add a sentence to `os.mkdir`'s docstring. (python#96271)
  pythongh-96073: fix backticks in NEWS entry (pythonGH-98056)
  pythongh-92886: [clinic.py] raise exception on invalid input instead of assertion (pythonGH-98051)
  pythongh-97997: Add col_offset field to tokenizer and use that for AST nodes (python#98000)
  pythonGH-88968: Reject socket that is already used as a transport (python#98010)
  pythongh-96346: Use double caching for re._compile() (python#96347)
  pythongh-91708: Revert params note in urllib.parse.urlparse table (python#96699)
  pythongh-96265: Fix some formatting in faq/design.rst (python#96924)
  pythongh-73196: Add namespace/scope clarification for inheritance section (python#92840)
  pythongh-97646: Change `.js` and `.mjs` files mimetype to conform to RFC 9239 (python#97934)
  pythongh-97923: Always run Ubuntu SSL tests with others in CI (python#97940)
  pythongh-97956: Mention `generate_global_objects.py` in `AC How-To` (python#97957)
  pythongh-96959: Update HTTP links which are redirected to HTTPS (python#98039)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants