-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create Lib/_pydatetime.py file to optimize "import datetime" when _datetime is available #84976
Comments
Currently, "import datetime" starts by importing time, math, sys and operator modules, and then execute 2500 lines of Python code, define 7 classes, etc. For what? Just to remove all classes, functions, etc. to replace them with symbols from _decimal module:
I would prefer to use the same approach than the decimal module which also has large C and Python implementation. Lib/decimal.py is just:
Advantages:
IMO it also better separate the C and the Python implementations. Attached PR implements this idea. Currently, "import datetime" imports 4 modules:
With the PR, "import datetime" imports only 1 module:
Import performance:
Measured by:
Note: I noticed that "import datetime" imports the math module while working on minimizing "import test.support" imports, bpo-40275. |
What do decimals have to datetime? |
Oops. Sorry, I was confused between "datetime" and "decimal" when I created this issue. I fixed the issue title. My idea is to mimick Lib/decimal.py design for Lib/datetime.py. |
I basically agree with this — this is one of the reasons I structured the zoneinfo module the way I did rather than mimicking the pattern in datetime. I believe that there are other modules that have similar situations like heapq, but datetime is probably the worst offender. I am inclined to say that we should restructure datetime into a folder, containing __init__.py, _datetime.py and possibly _strptime.py (which I think is also only used in datetime), but I think that sort of restructuring is way more sensitive to weird import bugs than this one. As it is now, I would be shocked if this didn't break *someone*, because people are always relying on weird implementation details (knowingly or unknowingly), but I think it's worth doing; it's good to tackle it this early in the cycle. @vstinner What do you think about restructuring into a folder-based submodule rather than _pydatetime.py? It's way more likely to break someone, but I think it might be the better way to organize the code, and I don't want to have to go through *two* refactors of this sort. |
heapq seems to be a little bit different. _heapq is not a drop-in replacement of heapq.py. For example, nlargest() seems to only be implemented in pure Python.
I have no idea what are the side effects of converting datetime.py file into a package. A single file _pydatetime.py seems more convenient to me. I'm aware of _strptime.py but I don't see it as a datetime submodule and I don't see the value of moving it as a datetime submodule. I'm fine with _datetime accessing _strptime module. It sounds more complex to me if _datetime would be imported by datetime which contains datetime._strptime. I see a higher risk of subtle import issues, since datetime has two implementations (C and Python). But it may be wrong :-) Also, all other stdlib modules which have a C implementation are designed with files, not folders: io.py (_io and _pyio) and decimal.py (_decimal and _pydecimal) are good examples. I mostly case about reducing the number of indirect imports and import performance. I don't have a strong opinion about file vs folder.
I'm fine with breaking applications relying on implementation details. Also, we can adjust the code to fix such corner cases later if it's needed, possible and justified :-) |
About _strptime, I see that the time.strptime() imports internally the _strptime module. If we move _strptime inside datetime: does it mean that calling time.strptime() would have to import the datetime module? It doesn't sound right to me. I see the time as the low-level interface: it should not rely on the high-level interface. I prefer to separate _strptime module from the datetime module. |
Ah, sorry, my remark about including |
As for deciding between moving to The biggest advantage I see to moving Right now the C implementation of |
I don't have the bandwidth to work on this issue, so I just close it. |
I don't think we need to close this just because you won't be fixing it yourself. It is still a valid issue |
Well, I made a first attempt in 2020: PR #20472 which it went nowhere. Let me re-create the change: PR #99090. It's basically the same one, but I made tiny doc changes (ex: use GitHub issue number, not bpo number) and datetimetester.py tolerates that _pydatetime.py doesn't exist (use datetime in this case). I ran two benchmarks. Startup time:
Import time:
Results:
IMO the speedup is quite interesting. |
You need to add |
I don't think that this constraint (convert datetime into a package) should only this optimization. For me, it can be done later. Honestly, I'm not convinced that converting datetime into a module right now brings any benefits. I prefer a simple If tomorrow, there is strong pressure for large datetime feature which requires creating a package, sure, we can consider that. Today, Python stdlib is mostly flat and IMO it's a good thing and also a deliberate choice. |
Ok, I wrote a more elaborated script to measure import pyperf
import sys
def bench_import_datetime(modules):
import datetime
del modules["datetime"]
del modules["_datetime"]
modules.pop("time", None)
modules.pop("math", None)
modules.pop("operator", None)
modules.pop("_operator", None)
modules.pop("_pydatetime", None)
del datetime
runner = pyperf.Runner()
del pyperf
runner.bench_func('import', bench_import_datetime, sys.modules) Result (main => PR): Mean +- std dev: [bench_ref] 1.07 ms +- 0.02 ms -> [bench_change] 116 us +- 5 us: 9.20x faster. |
So I think on the mailing list we decided that switching to just I think I vaguely recall it being the case that I'm not sure how to test the miss-islington thing because I'm not sure how easy it is to point miss islington at a different repo and say, "Treat this like it's CPython main, and this other branch like it's 3.11" or whatever. We can try using the same git commands miss-islington does as a first pass check, but before merge I'd be way more comfortable if we could see an actual miss-islington instance creating a backport. |
By default,
|
For miss-islington, I did a test locally. I create a local bugfix from my PR
Backport the bugfix from main to the 3.11 branch:
Well. It doesn't work as expected. Maybe the problem is that in my PR, Example if I recreated my PR quickly (first steps) from my existing
Good, at this stage, Git understand that the file was renamed. Create a new datetime.py file:
Not good: _pydatetime.py is now seen as a "new file" :-( |
Another test where the rename is a dedicated commit. Spoiler: not good, I get the same conflict on backport. First commit, only to rename:
Second commit, to recreate datetime.py:
Now I created a bugfix (in the "main" branch);
What if I backport it to the 3.11 branch?
Well, I get the exact same conflict. Note:
|
For me, the quick conclusion is that we have 2 choices:
About manual backports, in 2 years (since 2021-01-01), Lib/datetime.py got 7 commits in the main branch. For comparison, it only got 2 commits in the 3.10 branch since 2021-01-01.
IMO the low number of changes make it manageable to be done manually. What do you think? Maybe a Git guru can help us here? |
In clear, I'm in favor of this option. @pganssle: Are you ok with that? |
I think we should explore the option of a merge strategy that doesn't require manual backports, but if we can't find one, or if someone with git expertise can explain why there's no way to accomplish this without breaking backports, I'm fine with breaking backports. I wonder if we could even special-case datetime in miss islington to preserve automated backports even without clean cherry picks. |
This issue is open for 2 years. I did many tests with Git. So far, I didn't find anything obvious. Git doesn't give much freedom for that. Git doesn't store renames: it "recomputes" them in all operations, check if two files look similar or not. |
FWIW I did some additional tests, without finding an actual solution. In the first test I tried to rename Then I tried another strategy. I started from In the third and final attempt I was able to duplicate the file while preserving the history of both copies, by doing: $ git switch -c B
Switched to a new branch 'B'
$ git mv Lib/datetime.py Lib/_pydatetime.py
$ git commit -m 'Rename datetime.py to _pydatetime.py.'
[B a932519123] Rename datetime.py to _pydatetime.py.
1 file changed, 0 insertions(+), 0 deletions(-)
rename Lib/{datetime.py => _pydatetime.py} (100%)
$ git show A:Lib/datetime.py > Lib/datetime.py
$ git add Lib/datetime.py
$ git commit -m 'Add back datetime.py.'
[B e1e1c92a5b] Add back datetime.py.
1 file changed, 2642 insertions(+)
create mode 100644 Lib/datetime.py
$ git switch A
Switched to branch 'A'
$ git merge --no-ff B # apparently the --no-ff is important here
Merge made by the 'ort' strategy.
Lib/_pydatetime.py | 2642 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 2642 insertions(+)
create mode 100644 Lib/_pydatetime.py
$ vim Lib/datetime.py # edit/remove stuff
$ git add Lib/datetime.py
$ git commit -m 'Update datetime.py'
[A 1d46a8cd30] Update datetime.py
1 file changed, 21 insertions(+), 2642 deletions(-)
rewrite Lib/datetime.py (99%) Now |
This issue is open for 2 years. @ezio-melotti and me investigated options for backports and no solution was found. @pganssle: So are you ok to create Lib/_pydateime.py to make "import datetime" faster? I have no idea of miss-islington. The bot just runs |
Sounds like it's not very urgent, then. Seriously, Victor, I appreciate the work you've put in on this, but I don't appreciate the rush you are imparting here. I didn't ask you to revive your PR when you closed the issue, I just said that it's still an open issue. I think it is worth taking a little bit of time investigating, but no one has provided a compelling explanation as to why this cannot work, nor have I had time to investigate this sufficiently to rule out a solution that doesn't require several years of manual backports. The feature freeze isn't for ~half a year. |
What I mean is that people had 2 years to look into the Git issue and so far nobody was able to provide a solution to the Git issue. So for me, there is no solution, and there are only 2 choices: do nothing (keep datetime.py as it is), or create Lib/_pydatetime.py and handle backports manually. You need more time, ok. In this case, I abandon my PR and I leave the issue for someone else. I consider that @pganssle is the datetime maintainer, and I prefer to have his bless for this change. But I don't have the bandwidth to spend more time on this topic (investigate the Git question). I care about "import datetime" performance, but not enough to spend more time. For me, I don't think that it's worth it. Anyone is free to clone my PR #99090 and restart from that. By the way, there is now a conflict, I guess that datetime.py was modified since I created my PR. That's one of the reason why I don't like to keep a PR open for too long. The other reason is that PR that I created are my TODO list and I want to keep this list as short as possible. Otherwise, I lost track. I closed this issue 1 month ago because the issues that I created are my second "TODO list" (and I want to keep it short). Here it seems like the consensus is towards "do nothing". But well, I will not close again the issue. I will just learn to skip it from my "TODO list" :-) Or maybe there is a way in GitHub to transfer the "ownership" of an issue to someone else. I don't know. I don't care much in fact :-) |
This breaks the tests, but we are keeping it as a separate commit so that the move operation and editing of the moved files are separate, for a cleaner history.
Without the change to the reprs, pure-python classes would have a repr of `datetime._pydatetime.time`, etc.
This breaks the tests, but we are keeping it as a separate commit so that the move operation and editing of the moved files are separate, for a cleaner history.
Without the change to the reprs, pure-python classes would have a repr of `datetime._pydatetime.time`, etc.
This breaks the tests, but we are keeping it as a separate commit so that the move operation and editing of the moved files are separate, for a cleaner history.
Without the change to the reprs, pure-python classes would have a repr of `datetime._pydatetime.time`, etc.
pythongh-84796: Add back UTC to datetime.__all__ This was mistakenly dropped in pythonGH-103637 Noticed when updating typeshed for Python 3.12 (cherry picked from commit 076b620) Co-authored-by: Shantanu <[email protected]>
Co-authored-by: Shantanu <[email protected]>
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: