tokenize.generate_tokens() performance regression in 3.12 #119118
Comments
Suspecting it could be related to the work done on PEP 701, which landed in 3.12.
Yeah, we are now using a completely different backend (the actual parser), so we need to investigate a bit what's going on here.
A data point that might help: when calling |
Oh wow, that's very intriguing. I may be able to investigate this next week after PyCon, but thanks for the hint!
OK, I think I have a guess (I need time to test my theory), but I am 99% sure I know the problem. The problem is this line: cpython/Python/Python-tokenize.c, line 213 in 31a28cb
We are creating unique strings per token with the entire line. If the dict spans the entire line, this is a very expensive operation that we are doing over and over and over with the same result. The fix is to keep reusing the same result until there is a new line.
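To illustrate the idea in the comment above, here is a minimal Python sketch. It is not the CPython code (the real tokenizer is written in C); the function names and the whitespace-splitting "tokenizer" are made up purely for illustration. It contrasts rebuilding the line string for every token with reusing one cached string until the line changes.

```python
from typing import Iterator, List, Tuple

# Hypothetical token record: (token_text, line_number, line_text).
Token = Tuple[str, int, str]


def tokens_naive(lines: List[bytes]) -> Iterator[Token]:
    """Build a fresh line string for every token (the slow pattern)."""
    for lineno, raw in enumerate(lines, start=1):
        for word in raw.split():
            # Repeated O(len(line)) work for each token on the line.
            line_text = raw.decode("utf-8")
            yield word.decode("utf-8"), lineno, line_text


def tokens_cached(lines: List[bytes]) -> Iterator[Token]:
    """Build the line string once and reuse it until a new line starts."""
    cached_lineno = -1
    cached_text = ""
    for lineno, raw in enumerate(lines, start=1):
        for word in raw.split():
            if lineno != cached_lineno:
                # Only decode when we move on to a different line.
                cached_text = raw.decode("utf-8")
                cached_lineno = lineno
            yield word.decode("utf-8"), lineno, cached_text
```

For a line of length L holding n tokens, the first version does roughly n times L work just to produce line strings, while the second does it once per line. That difference only becomes visible when a single line holds a very large number of tokens, as in the report.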
Well, yeah, it's not that simple. Doing some scrappy caching seems to have brought down the performance penalty, but it's still much worse than 3.11.
So we get about 0.1-0.5 seconds per 500 tokens, instead of the original ~9 seconds, but that's still very far from 3.11.
Oh found it, it's not just that line, but the calls to
I'll push a PR shortly. |
@lysnikolaou any news on this? We have other people reporting a problem like this in 3.12. |
Thanks for the reminder @nedbat! Turns out that this was even trickier than I thought, and the fix I showed in #119118 (comment) was buggy. My patch is almost there, but I didn't get the time to fix some final details. I'll have a PR up by early next week.
Tell me if you want me to help or pick it up if you don't have time, @lysnikolaou.
gh-119118: Fix performance regression in tokenize module
- Cache line object to avoid creating a Unicode object for all of the tokens in the same line.
- Speed up byte offset to column offset conversion by using the smallest buffer possible to measure the difference.
Co-authored-by: Pablo Galindo
The same change (GH-119615) was cherry picked from commit d87b015 in #119682 and #119683. Co-authored-by: Lysandros Nikolaou, Pablo Galindo
Fixed in #119615. |
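The second part of the fix, converting byte offsets to column offsets using the smallest possible buffer, can be pictured roughly like this. This is only a Python sketch of the idea, not the C implementation; both function names are hypothetical, and it assumes the byte offsets fall on UTF-8 character boundaries, as token boundaries do.

```python
from typing import Iterable, List


def col_offsets_naive(line: bytes, byte_offsets: Iterable[int]) -> List[int]:
    """Decode from the start of the line for every offset."""
    return [len(line[:b].decode("utf-8")) for b in byte_offsets]


def col_offsets_incremental(line: bytes, byte_offsets: Iterable[int]) -> List[int]:
    """Decode only the bytes between the previous offset and the current one."""
    cols = []
    prev_byte = 0  # last byte offset that was converted
    prev_col = 0   # column (code point count) matching prev_byte
    for b in byte_offsets:
        if b < prev_byte:
            # Offset moved backwards: fall back to measuring from the start.
            prev_byte, prev_col = 0, 0
        # Measure just the difference since the previous offset.
        prev_col += len(line[prev_byte:b].decode("utf-8"))
        prev_byte = b
        cols.append(prev_col)
    return cols
```

With increasing offsets, the incremental version touches each byte of the line about once in total, instead of once per token.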
Bug report
Bug description:
There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.
To reproduce, rename the file largedict.py.txt to largedict.py in the same directory as the script below, then run the script. That file comes from nedbat/coveragepy#1785.
For Python 3.12, this results in:
For Python 3.11, this results in:
That is, every 500 tokens in Python 3.12 takes over 9 seconds to process, while all 352500 tokens in Python 3.11 take a bit over 2 seconds to process.
I can reproduce this on Linux (WSL) and Windows. Also seems to affect 3.13.
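The reporter's script and largedict.py are not reproduced in this text. As a rough stand-in, something like the following sketch should show the same pattern: it builds a large single-line dict literal in memory and times tokenize.generate_tokens() in chunks of tokens. The helper names, the generated source, and the chunk size of 500 are assumptions for illustration, not the original reproducer.

```python
import io
import time
import tokenize
from typing import List, Tuple


def make_large_dict_source(n_items: int) -> str:
    """Return a one-line assignment of a dict literal with n_items entries."""
    items = ", ".join(f"'key{i}': {i}" for i in range(n_items))
    return f"data = {{{items}}}\n"


def time_tokenize(source: str, report_every: int = 500) -> Tuple[int, List[float]]:
    """Tokenize source; return the token count and seconds per chunk of tokens."""
    readline = io.StringIO(source).readline
    count = 0
    chunk_times = []
    start = time.perf_counter()
    for _ in tokenize.generate_tokens(readline):
        count += 1
        if count % report_every == 0:
            now = time.perf_counter()
            chunk_times.append(now - start)
            start = now
    return count, chunk_times


if __name__ == "__main__":
    # Kept small so it finishes quickly even on an affected version;
    # raise n_items to make the difference between versions obvious.
    total, times = time_tokenize(make_large_dict_source(2000))
    slowest = max(times) if times else 0.0
    print(f"{total} tokens, slowest chunk of 500: {slowest:.4f}s")
```

Running the same file under 3.11 and 3.12 and comparing the per-chunk times should make it clear whether the slowdown scales with the length of the line.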
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows
Linked PRs