-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call design for Tier 2 (uops) interpreter #106581
Comments
You can have as many guards as you need. As long as they do not perturb the state of the VM, there can more than one. We have been assuming that there would be one action, but in fact there can be no action ( In summary, a straight line instruction should be made up of zero or more guards followed optionally by an action. |
I specifically said "straight line instruction" above, because some conditional branching instructions, like |
Sure, but any individual guard cannot use a cache entry and |
Off-line @markshannon proposed a simpler solution, that doesn't require changing all (There's also gh-105848 for reference.) |
Irrespective of the stack format, there's a bigger problem with projecting a trace through a call. Suppose we are projecting a trace and we encounter def add(x, y):
return x + y from e.g. total = 0
for i in range(1000):
total = add(total, i) we can add its translation to the trace, rather than ending the trace, and end up with a closed loop:
Where the commented-out lines are redundant because we really inline the call. Apart from the problem of eliding those call-related instructions, isn't it a problem that when we start projecting we don't (necessarily) have access to the function object being called? The cache associated with CALL just has a 32-bit "func_version" field, which is not enough to recover the actual function object or the code object (which we are really after). Working our way back from the CALL specialization to the place where the function being called is loaded (here LOAD_GLOBAL, but could be anything, really) does not seem feasible. I guess this is where we stop projecting and come back to improve the projected executor once we are about to execute the specialized call -- at that point the function object should be exactly @markshannon Does this seem feasible? |
Perhaps when we end a trace with a call to Python code we can insert a special instruction that causes a new executor to be created using the callable (which is now easily found on the stack) as additional input. |
Yes, that should work. The (slight) downside is that it forces us to execute the superblock in the tier 2 interpreter, which is likely slower than the tier 1 interpreter, as we can't optimize or JIT compile the superblock until it is complete. We could insert a special instruction in at the start of the function, to potentially resume superblock creation. FTR, the approach I origially had in mind, was some sort of LRU cache mapping version numbers to functions. |
Presumably we would execute the first version of the superblock in the Tier 2 interpreter only once. We could also throw away the first version away when the projection reaches a specialized call, and instead insert a special executor type that records the information needed to reconstruct that superblock quickly but incorporating the callable. Then when we reach that executor it finishes the superblock, inserts it where it should go, and it will be reached on the next iteration around the enclosing loop. |
I'm still pondering the logistics -- the optimizer receives neither the frame nor the jump origin, so it wouldn't know the location where to plug in the new executor. Updating the existing executor seems iffy, another thread could already have started executing it, and I hope to make uop executors variable-length (see gh-106608). I guess I could search for the current executor in co_executors to recover its index, but that feels expensive (O(N^2), except N is bounded). Logistics for the cache idea don't look entirely straightforward either. We could specialize a lot of calls before we get to incorporate one into a superblock, so the cache would have to be pretty large. But maybe I should play with this first. (The logic around (Also, there are two calls to |
I'm looking into an initial step for this -- just adding CALL_PY_EXACT_ARGS to the superblock, assuming that the Tier 2 interpreter will create, initialize and then return the new frame. For this we need to split CALL_PY_EXACT_ARGS into two uops, a guard and an action. But now we run into another yak to shave -- the generator doesn't yet handle variable-sized stack effects (in this case, args[oparg]) in uops: gh-106812. |
…106707) By turning `assert(kwnames == NULL)` into a macro that is not in the "forbidden" list, many instructions that formerly were skipped because they contained such an assert (but no other mention of `kwnames`) are now supported in Tier 2. This covers 10 instructions in total (all specializations of `CALL` that invoke some C code): - `CALL_NO_KW_TYPE_1` - `CALL_NO_KW_STR_1` - `CALL_NO_KW_TUPLE_1` - `CALL_NO_KW_BUILTIN_O` - `CALL_NO_KW_BUILTIN_FAST` - `CALL_NO_KW_LEN` - `CALL_NO_KW_ISINSTANCE` - `CALL_NO_KW_METHOD_DESCRIPTOR_O` - `CALL_NO_KW_METHOD_DESCRIPTOR_NOARGS` - `CALL_NO_KW_METHOD_DESCRIPTOR_FAST`
(This is mostly a brain dump after getting thoroughly confused and slowly getting my bearings back.) Now that I've shaved the yak, I'm free to think about how to do calls in Tier 2 again. Previously I wrote:
It's not that simple though. I can now split
And then we're back at the top of the instruction dispatch loop, starting with the first instruction of the called function's code object (which is where the new frame points after initialization). The idea is that at this point we can happily continue adding to the superblock from the called function's code object until we run out of space or find an instruction we can't handle. Apart from the issue of finding the code object (a tentative solution suggested above is to use some kind of cache with the function version as key), the Maybe the trick is to split the We may have to special-case Interesting times ahead. |
So I think (but haven't confirmed yet) that we can do something like this. Special-case
We make
and
The We'll have to introduce something like |
I got this working. I turned it into a PR, gh-107760, to take a break before starting on the cache for code objects indexed by |
This made me think. If we look at
it seems attractive to replace the former with a similar macro, basically equivalent to Note that such a macro would bulk up the tier 1 interpreter, but it would not change tier 2. |
The reason for the "zero or more guards followed by zero or one actions" rule, is to be sure that the VM is always in a consistent state. Bounds methods are special though. As far as Apart from the instrumented instructions, |
We don't have a good mechanism for doing something different in Tier 1 (bytecode) than in Tier 2 (microcode). Given that it's only one instruction I'll just use a macro like I suggested. Maybe the true constraint is a bit different -- what we care about is being able to jump back from the microcode stream to the bytecode stream when a micro instruction does one of the following:
The most important case is deoptimize -- it must be able to RE_EXECUTE the bytecode instruction from the start. This requires that the stack is as expected. In the example of bound method, if we deoptimize after the stack fiddling, the specialized bytecode instruction (CALL_BOUND_METHOD_EXACT_ARGS) will also deopt, to CALL, and all should be well, assuming that CALL is okay with the 'self' slot being already filled in. |
Hello, I am trying to learn quickly and follow along in this conversation, but I am having a little trouble to find enough information to do that. I look at https://docs.python.org/3/search.html?q=tier+1 and https://docs.python.org/3/search.html?q=tier+2 and https://docs.python.org/3/search.html?q=tier and https://docs.python.org/3/search.html?q=uops but the results do not seem to be the same as conversation here. I also look at https://docs.python.org/3/search.html?q=interpreter, which seems to be a little more useful, but I still not 100% sure since there are lots of results and not one seems to be the obvious corresponding explanation for this conversation. Where should I go to learn more about this and follow in this conversation? I have a little more questions, but I think I might know some of the answers if I can just find the information and read to it first. |
The Tier-1 and Tier-2 terms are our invention. The place to start is not docs.python.org but https://github.com/faster-cpython/ideas. FWIW Tier-1 refers to standard CPython bytecode, including specialized instructions (PEP 659), whereas Tier-2 refers to micro-instructions or micro-ops (often abbreviated as uops), which is a new thing we're introducing for Python 3.13 (and we don't plan to expose it to users, not even users of the C API). |
I found an interesting issue through #107927. This traces through calls and returns. Now consider the following:
When converting the outer (bottom) loop to a superblock, we trace into foo(). The bytecode for foo() contains a JUMP_BACKWARD instruction to the top of the inner loop. The superblock creation sees that this doesn't jump to the start of the superblock (which would be the top of the outer loop), so it ends the superblock. This is fine, except now we have a superblock that traces into a function and then stops. The inner loop is then also optimized into a superblock, which is fine, so it's not the end of the world. But the question that's nagging me is, should we perhaps just not create the outer superblock? |
* Split `CALL_PY_EXACT_ARGS` into uops This is only the first step for doing `CALL` in Tier 2. The next step involves tracing into the called code object and back. After that we'll have to do the remaining `CALL` specialization. Finally we'll have to deal with `KW_NAMES`. Note: this moves setting `frame->return_offset` directly in front of `DISPATCH_INLINED()`, to make it easier to move it into `_PUSH_FRAME`.
This finishes the work begun in gh-107760. When, while projecting a superblock, we encounter a call to a short, simple function, the superblock will now enter the function using `_PUSH_FRAME`, continue through it, and leave it using `_POP_FRAME`, and then continue through the original code. Multiple frame pushes and pops are even possible. It is also possible to stop appending to the superblock in the middle of a called function, when running out of space or encountering an unsupported bytecode.
…08380) I was comparing the last preceding poke with the *last* peek, rather than the *first* peek. Unfortunately this bug obscured another bug: When the last preceding poke is UNUSED, the first peek disappears, leaving the variable unassigned. This is how I fixed it: - Rename CopyEffect to CopyItem. - Change CopyItem to contain StackItems instead of StackEffects. - Update those StackItems when adjusting the manager higher or lower. - Assert that those StackItems' offsets are equivalent. - Other clever things. --------- Co-authored-by: Irit Katriel <[email protected]>
Instead of using `GO_TO_INSTRUCTION(CALL_PY_EXACT_ARGS)` we just add the macro elements of the latter to the macro for the former. This requires lengthening the uops array in struct opcode_macro_expansion. (It also required changes to stacking.py that were merged already.)
FWIW, this is what I did in my initial tracing implementation (only trace closed inner loops). Note that this probably would have avoided the issue with There are upsides and downsides to both approaches, though. It seems like we're moving in the direction of stitching many traces and side-exits together into trees, so I honestly don't think tracing the outer loop is too much of an issue. One can imagine a near future where we jump directly between the outer loop trace an inner loop trace without kicking it back into tier one each time. |
Yeah, when the trace execution bumps right into an ENTER_EXECUTOR we could do something to just go directly to that trace on the next iteration. Does your generated machine code have something equivalent to the refcounts that are keeping the executors alive while the Tier 2 interpreter is running? |
Also avoid the need for the awkward .clone() call in the argument to mgr.adjust_inverse() and mgr.adjust().
I must have overlooked this when refactoring the code generator. The Tier 1 interpreter contained a few silly things like ``` goto resume_frame; STACK_SHRINK(1); ``` (and other variations, some where the unconditional `goto` was hidden in a macro).
…09338) I must have overlooked this when refactoring the code generator. The Tier 1 interpreter contained a few silly things like ``` goto resume_frame; STACK_SHRINK(1); ``` (and other variations, some where the unconditional `goto` was hidden in a macro).
I have a question for
I am currently blocked. |
Might want to ping @markshannon about this on Discord (I just saw him mention in another context that he doesn't always act on GitHub pings). |
What's the question exactly?
It would be hard to optimize in that form though. As you said it needs to track two frames. What is inconsistent with tier 1? |
I think this issue has run its course. We've settled on the following stack layout:
|
(Maybe this is tentative enough that it still belongs in the faster-cpython/ideas tracker, but I hope we're close enough that we can hash it out here. CC @markshannon, @brandtbucher)
(This is a WIP until I have looked a bit deeper into this.)
First order of business is splitting some of the CALL specializations into multiple ops satisfying the uop requirement: either use oparg and no cache entries, or don't use oparg and use at most one cache entry. For example, one of the more important ones, CALL_PY_EXACT_ARGS, uses both
oparg
(the number of arguments) and a cache entry (func_version
). Splitting it into a guard and an action op is problematic: even discounting the possibility of encountering a bound method (i.e., assumingmethod
isNULL
), it contains the followingDEOPT
calls:If we wanted to combine all this in a single guard op, that guard would require access to both
oparg
(to dig outcallable
) andfunc_version
. The fundamental problem is that the callable, which needs to be prodded and poked for the guard to pass, is buried under the arguments, and we need to useoparg
to know how deep it is buried.What if we somehow reversed this so that the callable is on top of the stack, after the arguments? We could arrange for this by adding a
COPY n+1
opcode just before theCALL
opcode (or its specializations). In fact, this could even be a blessing in disguise, since now we would no longer need to push aNULL
before the callable to reserve space forself
-- instead, if the callable is found to be a bound method, itsself
can overwrite the original callable (below the arguments) and the function extracted from the bound method can overwrite the copy of the callable above the arguments. This has the advantage of no longer needing to have a "pushNULL
" bit in several other opcodes (theLOAD_GLOBAL
andLOAD_ATTR
families -- we'll have to review the logic inLOAD_ATTR
a bit more to make sure this can work).(Note that the key reason why the callable is buried below the arguments is a requirement about evaluation order in expressions -- the language reference requires that in the expression
F(X)
whereF
andX
themselves are possibly complex expressions,F
is evaluated beforeX
.)Comparing before and after, currently we have the following arrangement on the stack when
CALL n
or any of its specializations is reached:This is obtained by e.g.
or
or
Under my proposal the arrangement would change to
and it would be obtained by
It would (perhaps) even be permissible for the guard to overwrite both copies of the callable if a method is detected, since it would change from
to
where we would be assured that
func
has typePyFunctionObject *
. (However, I think we ought to have separate specializations for the two cases, since the transformation would also require bumpingoparg
.)The runtime cost would be an extra
COPY
instruction before eachCALL
; however I think this might actually be simpler than the dynamic check for bound methods, at least when using copy-and-patch.Another cost would be requiring extra specializations for some cases that currently dynamically decide between function and method; but again I think that with copy-and-patch that is probably worth it, given that we expect that dynamic check to always go the same way for a specific location.
Linked PRs
assert(kwnames == NULL)
#106707CALL_PY_EXACT_ARGS
into uops #107760The text was updated successfully, but these errors were encountered: