Ken Jin’s Blog

Python 3.15’s Ultra-Low Overhead Interpreter Profiling Mode

2026-07-01T00:00:00+00:00

1 July 2026

Python 3.15’s JIT shows modest speedups over the interpreter. Key to that is what is, to the best of my knowledge, a new form of interpreter profiling, developed to record execution through the interpreter for JIT compilation. It adds very little overhead on top of the interpreter, which keeps JIT compilation cheap. I wouldn’t be surprised if someone has already come up with this, but as far as I could find, it’s not documented anywhere.

Brief Background

In an earlier post, I explained trace recording in the CPython 3.15 JIT. The key idea is that we record an instruction stream through actual program execution, until we hit some terminating condition, then we send that for JIT compilation. Naturally, this requires an instrumented/profiled interpreter.

Profiling an Interpreter

There are normally two avenues of profiling an interpreter:

Have two bespoke interpreters (one for execution and one just for profiling).
Implement a “profiling mode”.

Approach 1: Two Interpreters

The normal interpreter will have points calling into the profiling interpreter, and a way for the profiling interpreter to return into the normal interpreter. This allows switching back and forth between interpreters. This was the initial implementation of trace recording in CPython. I found it to be alright for the tail-calling interpreter but too slow for the computed goto interpreter (roughly a 6% slowdown on pyperformance!). I suspect the main reason was that, for computed gotos, we effectively doubled the size of the interpreter function in C, as that feature is function-scoped. Doubling the size of your C binary is usually not good for performance for various reasons.

Approach 2: Profiling Mode

The other common way to implement a profiling interpreter is to have a “profiling mode” which conditions some profiling logic on a boolean, say bool profile. While this minimizes the code bloat, we deemed it too disruptive to the normal interpreter and it would also slow down execution despite the branch being almost always correctly predicted by CPUs.

Dual Dispatch

To get the best out of both worlds, there is one last alternative—swap out the dispatch tables.

Interpreters often dispatch to the next instruction by mapping the opcode (instruction ID) to a function pointer/label address. This tells the interpreter where to go for each instruction.

The idea here is instead to have two dispatch tables: one for normal execution, and one for the profiling interpreter. At runtime, we just need to assign a local variable dispatch_table_var to decide what “mode” we are in. No branches needed! I’m quite sure I’m not the first person to do this. However, the naive implementation of this produces code that is equivalent to two interpreters in one, which, as I wrote above, is very slow.

The key improvement is to map all the instructions in the second table to a single recording/profiling instruction. This instruction does all the profiling we need, then dispatches through the fixed first table to the actual next instruction in the normal interpreter for execution. You can think of this as a fan-in (to a single instruction), fan-out model.

Entering profiling mode is just initializing our data structures; interpreter-wise, it’s just swapping out the dispatch tables! Leaving profiling mode is, once again, finalizing our data structures and swapping the dispatch tables back. This is the actual code in CPython 3.15:

#  define ENTER_TRACING() \
    DISPATCH_TABLE_VAR = TRACING_DISPATCH_TABLE;
#  define LEAVE_TRACING() \
    DISPATCH_TABLE_VAR = DISPATCH_TABLE;

The macros are just there to handle the different tables used by the computed goto and tail-calling interpreters.
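To make the fan-in/fan-out idea concrete, here is a toy model in Python. It is only a sketch: the opcodes, the ToyVM class, and the TRACE_ON/TRACE_OFF instructions are all hypothetical and are not how CPython names or triggers things. What it does show is the shape of the technique: the tracing table points every opcode at one recording handler, and that handler dispatches through the fixed normal table.

```python
# Toy model of dual dispatch; every name here is hypothetical, not CPython's.
PUSH, ADD, TRACE_ON, TRACE_OFF, HALT = range(5)


class ToyVM:
    def __init__(self):
        self.stack = []
        self.trace = []  # (opcode, arg) pairs recorded while tracing
        self.running = True
        # Table 1: one real handler per opcode.
        self.normal_table = [
            self.op_push, self.op_add, self.op_trace_on,
            self.op_trace_off, self.op_halt,
        ]
        # Table 2: every opcode fans in to the single recording handler.
        self.tracing_table = [self.op_record] * len(self.normal_table)
        # Plays the role of DISPATCH_TABLE_VAR.
        self.table = self.normal_table

    def op_push(self, opcode, arg):
        self.stack.append(arg)

    def op_add(self, opcode, arg):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)

    def op_trace_on(self, opcode, arg):
        self.table = self.tracing_table  # analogous to ENTER_TRACING()

    def op_trace_off(self, opcode, arg):
        self.table = self.normal_table  # analogous to LEAVE_TRACING()

    def op_halt(self, opcode, arg):
        self.running = False

    def op_record(self, opcode, arg):
        # Record the instruction, then fan out through the fixed normal table.
        self.trace.append((opcode, arg))
        self.normal_table[opcode](opcode, arg)

    def run(self, program):
        pc = 0
        while self.running:
            opcode, arg = program[pc]
            pc += 1
            # No "if tracing:" check here; the mode is whichever table is live.
            self.table[opcode](opcode, arg)
        return self.stack


program = [
    (PUSH, 1), (TRACE_ON, None), (PUSH, 2), (ADD, None),
    (TRACE_OFF, None), (PUSH, 10), (ADD, None), (HALT, None),
]
vm = ToyVM()
result = vm.run(program)
print(result)    # [13]
print(vm.trace)  # [(0, 2), (1, None), (3, None)]
```

Only the three instructions executed between TRACE_ON and the table swap in TRACE_OFF are recorded, and the normal handlers are never duplicated.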

Results

Turning off dynamic frequency scaling on my system and running a test script (found in the Appendix), these are the medians of 40 runs measuring the overhead of profiling the interpreter:

# No profiling (just interpreter)
1.72e-06s
# Interpreter + Profiling + JIT compilation
7.47e-06s

This essentially means profiling the interpreter in CPython 3.15 is at most about 4.5x slower for our toy benchmark! Other tracing systems like PyPy have slowdowns in the range of 900x-1000x! This is, of course, not a fair comparison, as PyPy is meta-tracing and thus naturally traces a lot more code than we do for the same program (it has to trace the interpreter itself). I only include it to give an example of how slow tracing can actually be.
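For reference, the slowdown factor follows directly from the two medians above. The variable names are just for illustration:

```python
baseline = 1.72e-06  # seconds, no profiling (median reported above)
profiled = 7.47e-06  # seconds, interpreter + profiling + JIT compilation

slowdown = profiled / baseline
print(f"{slowdown:.2f}x")  # prints 4.34x
```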

Reflections

I’m very proud of what we managed to come up with for the profiling interpreter. This approach is not restricted to just trace recording. Other applications might be to introduce low-overhead profiling of an interpreter without radical rewrites, or recording an interpreter’s type profile seen during runtime, etc. Part of why I’m writing this blog post is that I believe in documenting technical knowledge and sharing it in case someone finds it useful. However, I do ask myself occasionally: is this magical system we’ve come up with in CPython worth the complexity? I like systems that are elegant and simple, and this, while elegant, is definitely not that simple to reason about. I’ll pen down my thoughts on tracing more in the future.

Appendix

The benchmarking script to measure the overhead. To trigger JIT compilation, I use PYTHON_JIT_RESUME_INITIAL_VALUE=1:

def foo(x):
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x
    x + x




foo(1)
foo(1)
foo(1)
import sys
import time
start = time.time()
foo(1)
end = time.time()
print(end - start)
# sys._dump_tracelets("hello.gvz")
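If you would rather not paste 148 copies of the same line, the body can be generated instead. This is a sketch of an equivalent script, not the one used for the numbers above: make_foo is a hypothetical helper, and time.perf_counter is swapped in for time.time because it is the higher-resolution clock.

```python
import time


def make_foo(repeats=148):
    """Build a function whose body is `x + x` repeated `repeats` times."""
    body = "\n".join("    x + x" for _ in range(repeats))
    namespace = {}
    exec(f"def foo(x):\n{body}\n", namespace)
    return namespace["foo"]


foo = make_foo()
for _ in range(3):  # warm-up calls, as in the script above
    foo(1)

start = time.perf_counter()
foo(1)
elapsed = time.perf_counter() - start
print(elapsed)
```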

Python 3.15’s JIT is now back on track

2026-03-17T00:00:00+00:00

17 Mar 2026

(JIT performance as of 17 March (PST). Lower is better versus interpreter. Image credits to https://doesjitgobrrr.com/).

Great news—we’ve hit our (very modest) performance goals for the CPython JIT over a year early for macOS AArch64, and a few months early for x86_64 Linux. The 3.15 alpha JIT is about 11-12% faster on macOS AArch64 than the tail calling interpreter, and 5-6% faster than the standard interpreter on x86_64 Linux. These numbers are geometric means and are preliminary. The actual range is something like a 20% slowdown to over 100% speedup (ignoring the unpack_sequence microbenchmark). We don’t have proper free-threading support yet, but we’re aiming for that in 3.15/3.16. The JIT is now back on track.

I cannot overstate how tough this was. There was a point where I was seriously wondering if the JIT project would ever produce meaningful speedups. To recap, the original CPython JIT had practically no speedups: 8 months ago I posted a JIT reflections article on how the original CPython JIT in 3.13 and 3.14 was often slower than the interpreter. That was also around the time when the Faster CPython team lost funding from its main sponsor. I’m a volunteer so this didn’t affect me, but more importantly it did affect my friends working there, and at one point it seemed the JIT’s future was uncertain.

So what changed from 3.13 and 3.14? I’m not going to give some heroic tale of how we rescued the JIT from the jaws of failure through our acumen. I honestly attribute a lot of our current success to luck—right time, right place, right people, right bets. I seriously don’t think this would’ve been possible if a single one of the core JIT contributors: Savannah Ostrowski, Mark Shannon, Diego Russo, Brandt Bucher, and me were not in the picture. To not exclude the other active JIT contributors, I will also name a few more people: Hai Zhu, Zheaoli, Tomas Roun, Reiden Ong, Donghee Na, and I am probably missing a few more.

I’m going to cover a less-talked-about part of a JIT: the people, and a bit of luck. If you want the technical details of how we did it, it’s here.

Part 1: A community-led JIT

The Faster CPython team lost its main sponsor in 2025. I immediately raised the idea of community stewardship. At the time, I was pretty uncertain this would work. JIT projects are not known to be good for new contributors. It historically requires a lot of prior expertise.

At the CPython core sprint in Cambridge, the JIT core team met, and we wrote a plan for a 5% faster JIT by 3.15 and a 10% faster JIT by 3.16, with free-threading support. A side goal, less headline-grabbing but vital to the health of the project, was to decrease the bus factor. We wanted 2 active maintainers in all 3 stages of the JIT: frontend (region selector), middle-end (optimizer), and backend (code generator).

Previously, the JIT only had 2 active recurrent contributors to the middle-end. Today, the JIT has 4 active recurrent contributors to the middle-end, and I would consider the 2 non-core developers (Hai Zhu and Reiden) capable and valued members.

What worked in attracting people were the usual software engineering practices: breaking complex problems down into manageable parts. Brandt started this earlier in 3.14, where he opened multiple mega-issues that split optimizing the JIT into simple tasks. E.g. we would say “try optimizing a single instruction in the JIT”. I took Brandt’s idea and did this for 3.15. Luckily, I had an easier job as my issue involved converting the interpreter instructions to an easily optimizable form. To encourage new contributors, I also laid out very detailed instructions that were immediately actionable. I also clearly demarcated units of work. I suspect that did help, as we had 11 contributors (including me) working on that issue, converting nearly the whole of the interpreter to something more JIT-optimizer friendly. The core idea was that the JIT could be broken down from an opaque blob to something that a C programmer with no JIT experience could contribute to.

Other things that worked: encouraging people, celebrating achievements big or small. Every JIT PR had a clear outcome, which I suspect gave people a sense of direction.

The community optimization efforts paid off. The JIT went from 1% faster on x86_64 Linux to 3-4% faster (see the blue line below) over that time period:

(Image credits to https://doesjitgobrrr.com/).

Part 2: Lucky bets

Trace recording

Again, I attribute a lot of this to luck, but during the CPython core sprints in Cambridge, Brandt nerd-sniped me to rewrite the JIT frontend to a tracing one. I initially didn’t like the idea, but as a friendly form of spite-driven-development, I thought I’d rewrite it just to prove to him it didn’t work.

The initial prototype worked in 3 days; however, it took a month to get it JITting properly without failing the test suite. The initial results were dismal: about 6% slower on x86_64 Linux. I was about to ditch the idea, until a lucky accident happened: I misinterpreted a suggestion given by Mark.

Mark had suggested earlier to thread the dispatch table through the interpreter, thus having two dispatch tables in the interpreter (one for the normal interpreter, and one for tracing). Mark suggested we should have the tracing table be tracing versions of normal instructions. However, I misunderstood and came up with an even more extreme version: instead of tracing versions of normal instructions, I had only one instruction responsible for tracing, and all instructions in the second table point to that. Yes, I know this part is confusing; I’ll hopefully try to explain better one day. This turned out to be a really, really good choice. I found that the initial dual table approach was so much slower due to a doubling of the size of the interpreter, causing huge compiled code bloat, and naturally a slowdown. By using only a single instruction and two tables, we only increase the interpreter by a size of 1 instruction, and also keep the base interpreter ultra fast. I affectionately call this mechanism dual dispatch.

There’s a lot more that went into the design of the trace recording interpreter. I’m tooting my own horn here, but I truly think it’s a mini work of art. It took me 1 week to iterate on the interpreter until it was overall faster. It went from 6% slower to roughly no speedup after using dual dispatch. After that, I stamped out a bunch of slow edge cases in the tracing interpreter to eventually make it 1.x% faster. Tracing the interpreter itself is only 3-5x slower by my own estimations than the specializing interpreter. Key to this is that it respects all normal behavior of the specializing interpreter and mostly doesn’t interfere with it.

Just to give you an idea of how much trace recording mattered: it increased the JIT code coverage by 50%. This means all future optimizations would likely have been around 50% less effective (assuming all code executes the same, which of course isn’t true, just bear with me please :).

So I have to thank Brandt and Mark for leading me to stumble upon such a nice solution.

Reference count elimination

The other lucky bet we made early on was to try reference count elimination. This, again, was work originally done by Matt Page in the CPython bytecode optimizer (more details in a previous blog post on optimization). I noticed that there was still a branch left in the JITted code per reference count decrement even with the bytecode optimizer work. I thought: “why not try eliminating the branch”, and had no clue how much it would help. It turns out a single branch is actually quite expensive and these add up over time. Especially if it’s >=1 branch for every single Python instruction!

The other lucky part is how easy this was to parallelize and how great a tool it was for teaching people about the interpreter and JIT. This was the main optimization that we directed people to work on in the Python 3.15 JIT. Although it was a mostly manual refactoring process, it taught people the key parts they needed to learn about the JIT without overwhelming them.

Part 3: A great team

We have a great infrastructure team. I say this partly in jest, because it’s one person. In reality, our “team” is currently 4 machines running in Savannah’s closet. Nevertheless, Savannah has done the work equivalent of an entire infrastructure team for the JIT. The JIT could not have progressed so quickly if we had no way of reporting our performance numbers. Daily JIT runs have been a game changer in the feedback loop. They help us catch regressions in JIT performance, and let us know our optimizations actually work.

Mark is technically excellent, and I think he knows the Internet gives him too much praise already so I’m not going to say anything more here :).

Diego is also great. He’s responsible for the JIT on ARM hardware, and also has recently started work on making the JIT friendly to profilers. I cannot overstate how hard of a problem this is.

Brandt laid the original foundation for our machine code backend, without which we’d have new contributors writing assembler, which probably would’ve put more people off.

Part 4: Talking to people

I also want to encourage the idea of talking to people and sharing ideas.

A shoutout to CF Bolz-Tereick, who taught me a lot about PyPy. I spent a few months looking at PyPy’s source code, and I believe this made me a better JIT developer overall. CF was very helpful when I needed help.

I’m also part of a friendly compiler chat with Max Bernstein, without which I’d likely have lost motivation for this a long time ago. Max is a prolific writer, and a friendly compiler person.

Ideas don’t exist in a silo. I suspect I became better at writing JITs thanks to hanging out with a bunch of compiler people for some time. At the very least, looking at PyPy has broadened my view!

Conclusion

People are important, and with some luck, JIT go brrr.

Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster

2025-12-24T00:00:00+00:00

24 December 2025

Some time ago I posted an apology piece for Python’s tail calling results. I apologized for communicating performance results without noticing a compiler bug had occurred.

I can proudly say today that I am partially retracting that apology, but only for two platforms—macOS AArch64 (XCode Clang) and Windows x86-64 (MSVC).

In our own experiments, the tail calling interpreter for CPython was found to beat the computed goto interpreter by 5% on pyperformance on AArch64 macOS using XCode Clang, and roughly 15% on pyperformance on Windows on an experimental internal version of MSVC. The Windows build is against a switch-case interpreter, but this in theory shouldn’t matter too much, more on that in the next section.

This is of course, a hopefully accurate result. I tried to be more diligent here, but I am of course not infallible. However, I have found that sharing early and making a fool of myself often works well, as it has led to people catching bugs in my code, so I shall continue doing so :).

Also this assumes the change doesn’t get reverted later in Python 3.15’s development cycle.

Brief background on interpreters

Just a recap. There are two popular current ways of writing C-based interpreters.

Switch-cases:

switch (opcode) {
    case INST_1: ...
    case INST_2: ...
}

Where we just switch-case to the correct instruction handler.

And the other popular way is a GCC/Clang extension called labels-as-values/computed gotos.

goto *dispatch_table[opcode];
INST_1: ...
INST_2: ...

This is basically the same idea, but instead we jump to the address of the next label. Traditionally, the key optimization here is that it needs only one jump to go to the next instruction, while in the switch-case interpreter, a naive compiler would need two jumps.

With modern compilers, however, the benefit of computed gotos is a lot smaller, mainly because modern compilers have gotten better and modern hardware has also gotten better. In Nelson Elhage’s excellent investigation on the next kind of interpreter, the speedup of computed gotos over switch-case on modern Clang was only in the low single digits on pyperformance.

A third way, suggested decades ago but until recently not really feasible, is call/tail-call threaded interpreters. In this scheme, each bytecode handler is its own function, and we tail-call from one handler to the next in the instruction stream:

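PyObject *INST_1(...) {
    ...
    // Tail-call the handler for the next instruction.
    return dispatch_table[opcode](...);
}

PyObject *INST_2(...) {
    ...
    return dispatch_table[opcode](...);
}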

This wasn’t too feasible in C for one main reason—tail call optimization was merely an optimization. It’s something the C compiler might do, or might not do. This means if you’re unlucky and the C compiler chooses not to perform the tail call, your interpreter might stack overflow!
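Python itself never eliminates tail calls, so it makes a convenient stand-in for the “unlucky” C compiler. The sketch below is a toy (the opcodes and handlers are hypothetical): each handler calls the next one in tail position, and a long enough program exhausts the stack because every handler’s frame stays alive.

```python
import sys

# Hypothetical toy bytecode: each instruction is just an opcode number.
INCR, HALT = 0, 1


def op_incr(program, pc, acc):
    acc += 1
    pc += 1
    # "Tail call" to the next handler. Python keeps this frame alive anyway.
    return DISPATCH_TABLE[program[pc]](program, pc, acc)


def op_halt(program, pc, acc):
    return acc


DISPATCH_TABLE = [op_incr, op_halt]


def run(program):
    return DISPATCH_TABLE[program[0]](program, 0, 0)


short_program = [INCR] * 10 + [HALT]
print(run(short_program))  # prints 10

# One stack frame per executed instruction: a long program overflows.
long_program = [INCR] * (2 * sys.getrecursionlimit()) + [HALT]
try:
    run(long_program)
    overflowed = False
except RecursionError:
    overflowed = True
print(overflowed)  # prints True
```

In C the failure would be a real stack overflow rather than a catchable exception, which is why a guarantee from the compiler matters.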

Some time ago, Clang introduced __attribute__((musttail)), which allowed for mandating that a call must be tail-called. Otherwise, the compilation will fail. To my knowledge, the first time this was popularized for use in a mainstream interpreter was in Josh Haberman’s Protobuf blog post.

Later on, Haoran Xu noticed that the GHC calling convention combined with tail calls produced efficient code. They used this for their baseline JIT in a paper and termed the technique Copy-and-Patch.

So where are we now?

After using a fixed XCode Clang, our performance numbers on CPython 3.14/3.15 suggest that the tail calling interpreter does provide a modest speedup over computed gotos. Around the 5% geomean range on pyperformance.

To my understanding, uv already ships Python 3.14 on macOS with tail calling, which might be responsible for some of the speedups you see on there. We’re planning to ship the official 3.15 macOS binaries on python.org with tail calling as well.

However, you’re not here for that. The title of this blog post is clearly about MSVC Windows x86-64. So what about that?

Tail-calling for Windows

[!CAUTION] The features for MSVC discussed below are to my knowledge, experimental. They are not guaranteed to always be around unless the MSVC team decide to keep them. Use at your own risk!

These are the preliminary pyperformance results for CPython on MSVC with tail-calling vs switch-case. Any number above 1.00x is a speedup (e.g. 1.01x == 1% speedup), anything below 1.00x is a slowdown. The speedup is a geometric mean of around 15-16%, with a range of ~60% slowdown (one or two outliers) to 78% speedup. However, the key thing is that the vast majority of benchmarks sped up!

Chart credits to Michael Droettboom

[!WARNING] These results are on an experimental internal MSVC compiler, public results below.

To verify this and make sure I wasn’t wrong yet again, I checked the results on my machine with Visual Studio 2026. These are the results from this issue.

Mean +- std dev: [spectralnorm_tc_no] 146 ms +- 1 ms -> [spectralnorm_tc] 98.3 ms +- 1.1 ms: 1.48x faster
Mean +- std dev: [nbody_tc_no] 145 ms +- 2 ms -> [nbody_tc] 107 ms +- 2 ms: 1.35x faster
Mean +- std dev: [bm_django_template_tc_no] 26.9 ms +- 0.5 ms -> [bm_django_template_tc] 22.8 ms +- 0.4 ms: 1.18x faster
Mean +- std dev: [xdsl_tc_no] 64.2 ms +- 1.6 ms -> [xdsl_tc] 56.1 ms +- 1.5 ms: 1.14x faster

So yeah, the speedups are real! For a large-ish library like xDSL, we see a 14% speedup, while for smaller microbenchmarks like nbody and spectralnorm, the speedups are greater.

Thanks to Chris Eibl and Brandt Bucher, we managed to get the PR for this on MSVC over the finish line. I also want to sincerely thank the MSVC team. I can’t say this enough: they have been a joy to work with and I’m very impressed by what they’ve done, and I want to congratulate them on releasing Visual Studio 2026. This feature was made possible thanks to new features in Visual Studio 2026, and would not have been achievable with prior Visual Studio versions.

This is now listed in the What’s New for 3.15 notes:

Builds using Visual Studio 2026 (MSVC 18) may now use the new tail-calling interpreter. Results on Visual Studio 18.1.1 report between 15-20% speedup on the geometric mean of pyperformance on Windows x86-64 over the switch-case interpreter on an AMD Ryzen 7 5800X. We have observed speedups ranging from 14% for large pure-Python libraries to 40% for long-running small pure-Python scripts on Windows. This was made possible by a new feature introduced in MSVC 18. (Contributed by Chris Eibl, Ken Jin, and Brandt Bucher in gh-143068. Special thanks to the MSVC team including Hulon Jenkins.)

This is the documentation for [[msvc::musttail]].

Where exactly do the speedups come from?

I used to believe that tail calling interpreters get their speedup from better register use. While I still believe that now, I suspect that is not the main reason for speedups in CPython.

My main guess now is that tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.

Let me show an example. At the time of writing, CPython 3.15’s interpreter loop is around 12k lines of C code. That’s 12k lines in a single function for the switch-case and computed goto interpreter.

This has caused many issues for compilers in the past, too many to list in fact. I have a EuroPython 2025 talk about this. In short, this overly large function breaks a lot of compiler heuristics.

One of the most beneficial optimisations is inlining. In the past, we’ve found that compilers sometimes straight up refuse to inline even the simplest of functions in that 12k loc eval loop. I want to stress that this is not the fault of the compiler. It’s actually doing the correct thing: you usually don’t want to increase the code size of something already super large. Unfortunately, this doesn’t bode well for our interpreter.

You might say just write the interpreter in assembly! However, the whole point of this exercise is to not do that.

Ok enough talk, let’s take a look at the code now. Taking a real example, we examine BINARY_OP_ADD_INT which adds two Python integers. Cleaning up the code so it’s readable, things look like this:

TARGET(BINARY_OP_ADD_INT) {
    // Increment the instruction pointer.
    _Py_CODEUNIT* const this_instr = next_instr;
    frame->instr_ptr = next_instr;
    next_instr += 6;
    _PyStackRef right = stack_pointer[-1];
    _PyStackRef left = stack_pointer[-2];

    // Check that LHS is an int.
    PyObject *value_o = PyStackRef_AsPyObjectBorrow(left);
    if (!_PyLong_CheckExactAndCompact(value_o)) {
        JUMP_TO_PREDICTED(BINARY_OP);
    }

    // Check that RHS is an int.
    // ... (same code as above for LHS)

    // Add them together.
    PyObject *left_o = PyStackRef_AsPyObjectBorrow(left);
    PyObject *right_o = PyStackRef_AsPyObjectBorrow(right);
    res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);

    // If the addition fails, fall back to the generic instruction.
    if (PyStackRef_IsNull(res)) {
        JUMP_TO_PREDICTED(BINARY_OP);
    }

    // Close the references.
    PyStackRef_CLOSE_SPECIALIZED(left, _PyLong_ExactDealloc);
    PyStackRef_CLOSE_SPECIALIZED(right, _PyLong_ExactDealloc);

    // Write to the stack, and dispatch.
    stack_pointer[-2] = res;
    stack_pointer += -1;
    DISPATCH();
}

Seems simple enough, let’s take a look at the assembly for switch-case on VS 2026. Note again, this is a non-PGO build for easy source information, PGO generally makes some of these problems go away, but not all of them:

                if (!_PyLong_CheckExactAndCompact(value_o)) {
00007FFC4DE24DCE  mov         rcx,rbx  
00007FFC4DE24DD1  mov         qword ptr [rsp+58h],rax  
00007FFC4DE24DD6  call        _PyLong_CheckExactAndCompact (07FFC4DE227F0h)  
00007FFC4DE24DDB  test        eax,eax  
00007FFC4DE24DDD  je          _PyEval_EvalFrameDefault+10EFh (07FFC4DE258FFh)
...
                res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
00007FFC4DE24DFF  mov         rdx,rbx  
00007FFC4DE24E02  mov         rcx,r15  
00007FFC4DE24E05  call        _PyCompactLong_Add (07FFC4DD34150h)  
00007FFC4DE24E0A  mov         rbx,rax  
...
                PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
00007FFC4DE24E17  lea         rdx,[_PyLong_ExactDealloc (07FFC4DD33BD0h)]  
00007FFC4DE24E1E  mov         rcx,rsi  
00007FFC4DE24E21  call        PyStackRef_CLOSE_SPECIALIZED (07FFC4DE222A0h)

Huh… none of our functions were inlined. Surely that must mean they were too big or something, right? Let’s look at PyStackRef_CLOSE_SPECIALIZED:

static inline void
PyStackRef_CLOSE_SPECIALIZED(_PyStackRef ref, destructor destruct)
{
    assert(!PyStackRef_IsNull(ref));
    if (PyStackRef_RefcountOnObject(ref)) {
        Py_DECREF_MORTAL_SPECIALIZED(BITS_TO_PTR(ref), destruct);
    }
}

That looks … inlineable?

Here’s how BINARY_OP_ADD_INT looks with tail calling on VS 2026 (again, no PGO):

                if (!_PyLong_CheckExactAndCompact(left_o)) {
00007FFC67164785  cmp         qword ptr [rax+8],rdx  
00007FFC67164789  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h)  
00007FFC6716478F  mov         r9,qword ptr [rax+10h]  
00007FFC67164793  cmp         r9,10h  
00007FFC67164797  jae         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+149h (07FFC67164879h) 
...
                res = _PyCompactLong_Add((PyLongObject *)left_o, (PyLongObject *)right_o);
00007FFC6716479D  mov         eax,dword ptr [rax+18h]  
00007FFC671647A0  and         r9d,3  
00007FFC671647A4  and         r8d,3  
00007FFC671647A8  mov         edx,1  
00007FFC671647AD  sub         rdx,r9  
00007FFC671647B0  mov         ecx,1  
00007FFC671647B5  imul        rdx,rax  
00007FFC671647B9  mov         eax,dword ptr [rbx+18h]  
00007FFC671647BC  sub         rcx,r8  
00007FFC671647BF  imul        rcx,rax  
00007FFC671647C3  add         rcx,rdx  
00007FFC671647C6  call        medium_from_stwodigits (07FFC6706E9E0h)  
00007FFC671647CB  mov         rbx,rax  
...
                PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
00007FFC671647EB  test        bpl,1  
00007FFC671647EF  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)  
00007FFC671647F1  add         dword ptr [rbp],0FFFFFFFFh  
00007FFC671647F5  jne         _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0ECh (07FFC6716481Ch)  
00007FFC671647F7  mov         rax,qword ptr [_PyRuntime+25F8h (07FFC675C45F8h)]  
00007FFC671647FE  test        rax,rax  
00007FFC67164801  je          _TAIL_CALL_BINARY_OP_ADD_INT@@_A+0E4h (07FFC67164814h)  
00007FFC67164803  mov         r8,qword ptr [_PyRuntime+2600h (07FFC675C4600h)]  
00007FFC6716480A  mov         edx,1  
00007FFC6716480F  mov         rcx,rbp  
00007FFC67164812  call        rax  
00007FFC67164814  mov         rcx,rbp  
00007FFC67164817  call        _PyLong_ExactDealloc (07FFC67073DA0h)

Would you look at that, suddenly our trivial functions get inlined :).

You might also say, surely this does not happen on PGO builds? Well the issue I linked above actually says it does! So yeah happy days.

Once again I want to stress, this is not the compiler’s fault! It’s just that the CPython interpreter loop is not the best thing to optimize.

How do I try this out?

Unfortunately, for now, you will have to build from source.

With VS 2026, after cloning CPython, for a release build with PGO:

$env:PlatformToolset = "v145"
./PCbuild/build.bat -p x64 --tail-call-interp --pgo

Hopefully, we can distribute this in an easier binary form in the future once Python 3.15’s development matures!

Addendum & Edits

I was asked for a cross-compiler test. So here’s a quick and dirty toy benchmark of pystones. The last row is the tail call enabled build. All configurations have PGO. On this toy benchmark, we get roughly a 30% uplift. Note that this is unscientific as it was only a sample size of 1 and I cannot disable Turbo Boost on my laptop on Windows for some reason.

Compiler	PlatformToolSet	Pystones/second (higher is better)
VS2019	142	677544
VS2022	143	710773
VS2026	145	682089
VS2026+TC	145	970306

Chris Eibl has done excellent work benchmarking tail calling on various configurations and processors. The results on Windows suggest between 15-20% improved performance on an AMD Ryzen 7 5800X with Visual Studio 2026, on CPython main branch, with individual benchmarks ranging from an 11% slowdown to a 55% speedup.

A Plan for 5-10%* Faster Free-Threaded JIT by Python 3.16

2025-11-08T00:00:00+00:00

08-Nov-2025

During the Python Core Dev Sprint in Cambridge hosted by ARM, we planned to make the JIT in CPython 5% faster by 3.15 and 10% faster by 3.16. The planners present were Savannah Ostrowski, Mark Shannon, Ken Jin (me), Diego Russo and Brandt Bucher. We were accompanied by other CPython core team members as well.

You might wonder: 5% seems awfully conservative. However, note that this figure is the geometric mean. The number can range from slower to significantly faster. All numbers are pyperformance figures.

In my previous blog post, I talked about the Python 3.13 and 3.14 JIT’s state. We’re planning to change that for 3.15 and 3.16.

The Plan for 3.15

This is a paraphrase of what Savannah laid out here. The difference is that I’m listing things in chronological order of what I expect will be merged into CPython.

Profiling support via LLVM 21.
Trace recording JIT.
Better machine code.
Register allocation/Top-of-Stack Caching.
Reference count elimination.
More constant promotion.
Basic Free-Threading support.

Profiling support via LLVM 21

Profiling and debugger support is a must-have if we want the JIT to be production-ready. The JIT uses Copy-and-patch compilation to create its templates/stencils. Thanks to Savannah, we have support for LLVM 20 and soon LLVM 21. LLVM 21 in theory should allow us to support stack unwinding through the JIT frames. This would allow debuggers and other tools to see the JIT code as a single frame. Currently the debugger I use gets lost when it tries to introspect JIT code.

I can’t explain more, because I don’t know anything about debuggers and profilers :(.

Trace recording JIT

Our current JIT region selection algorithm could be improved. Here’s the current pipeline:

The region selector, aka. the JIT frontend, uses trace projection. In short, we guess where the traces will go, and use historical data from the interpreter’s inline caches to feed type information into our IR.

There are two problems with the above:

CPython’s inline caches are monomorphic to save space. We thus have little concept of distributions or historical data. The only information we have is “cache hit” or “cache miss”. This causes historical data to be stale/contradictory in our IR very often.
We need a lot of interpreter involvement to record where our code will execute next. For example, if we saw a call, we would inline the call into the call site based on the inline cache entry information. However, this is a best-effort guess due to the previous point. Generators are completely unhandled. Custom dunders are punted. You get the point.
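To make the first problem concrete, here is a minimal sketch of a monomorphic (single-entry) cache. It is a simplified model for illustration only, not CPython's actual inline cache layout or specialization policy; the class name `MonomorphicCache` and its fields are hypothetical.

```python
class MonomorphicCache:
    """Toy single-entry type cache: it remembers only the last type seen."""

    def __init__(self):
        self.cached_type = None
        self.hits = 0
        self.misses = 0

    def observe(self, value):
        seen = type(value)
        if seen is self.cached_type:
            self.hits += 1
        else:
            self.misses += 1
            # Only one entry is kept, so the old type is simply overwritten.
            self.cached_type = seen


cache = MonomorphicCache()
for value in [1, 1, 1, 2.0]:
    cache.observe(value)

# Three of the four values were ints, yet the cache now claims "float".
print(cache.cached_type.__name__, cache.hits, cache.misses)
```

A trace projected from this cache would assume `float`, even though `int` dominated the history. That is the kind of stale or contradictory information described above.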

Other tracing JIT compilers like PyPy and TorchDynamo (torch.compile) use some form of trace recording. This is not entirely true for TorchDynamo, as that seems to introspect values then do a symbolic interpretation over the bytecode. However, the key point is that live up-to-date information is present in both these systems.

At the core dev sprint, Brandt nerd-sniped me to rewrite the entire JIT frontend. Using my free time in the past 2 months, I have done so. The preliminary results are: 1k more loc, roughly 1.5% faster geometric mean average on pyperformance. 100% faster (!!! hopefully not a bug) on the most improved benchmark (richards), and 15% slower on the slowest benchmark. The new JIT frontend now also supports generators (partially), custom dunders, object initialization, etc.

(Image credits to Meta’s Free-Threading Benchmarking Runner). Anything below 1.00x on the graph is a slowdown.

The details of the implementation are quite interesting to me, so you might want to give the PR a read. The key idea is to maintain two dispatch tables in a mechanism I call “dual dispatch”. One table is the standard interpreter, the other is the tracing interpreter. We (ab)use computed gotos/tail calling to dispatch from one table to the other.
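Here is a rough Python sketch of the dual dispatch idea, assuming a toy two-opcode stack machine. The real implementation is in C and dispatches from inside each instruction handler via computed gotos or tail calls; the names `NORMAL`, `TRACING` and `VM` are made up for this illustration.

```python
# Toy bytecode: a list of (opcode, argument) pairs for a tiny stack machine.
def op_push(vm, arg):
    vm.stack.append(arg)


def op_add(vm, arg):
    right = vm.stack.pop()
    left = vm.stack.pop()
    vm.stack.append(left + right)


# Table 1: the standard interpreter's handlers.
NORMAL = {"PUSH": op_push, "ADD": op_add}


def make_tracing(opcode, handler):
    # A tracing handler records the instruction, then does the normal work.
    def tracing_handler(vm, arg):
        vm.trace.append((opcode, arg))
        handler(vm, arg)
    return tracing_handler


# Table 2: the same opcodes, but every handler also records.
TRACING = {op: make_tracing(op, handler) for op, handler in NORMAL.items()}


class VM:
    def __init__(self):
        self.stack = []
        self.trace = []
        self.table = NORMAL  # the currently active dispatch table

    def run(self, code):
        for opcode, arg in code:
            # No "if profiling:" branch here, just a table lookup.
            self.table[opcode](self, arg)


vm = VM()
vm.run([("PUSH", 1)])                   # normal execution
vm.table = TRACING                      # switch tables to start recording
vm.run([("PUSH", 2), ("ADD", None)])
vm.table = NORMAL                       # switch back to stop recording
vm.run([("PUSH", 10), ("ADD", None)])
```

The normal handlers never test a profiling flag. Turning recording on or off is just a matter of which table the next dispatch goes through.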

Better machine code generation

Copy-and-Patch allows us to generate machine code templates with little effort. However, the base implementation of it without many optimizations does not yield much speedup on CPython. In the original paper, the Copy-and-Patch authors implemented other transformations in LLVM to produce better code for the JIT. We will be doing something similar.

Mark and Diego are working on better codegen for AArch64. For example, see #140683, #139757.

Brandt is also working on better code generation in general. The most interesting one (in my opinion) comes from the Copy-and-Patch paper, which is to rearrange assembly control-flow to optimize the chances of things falling through. In Brandt’s issue, he inverts branches in assembly code to increase the chance of that. The idea is pretty simple, but produces a 1% geometric mean speedup on pyperformance. Here’s what it looks like. If you have the following assembly (example taken from the issue):

cmpq    $0x1, -0x10(%r13)
je      _JIT_CONTINUE
jmp     _JIT_JUMP_TARGET

_JIT_CONTINUE points to the next micro-operation to execute. _JIT_JUMP_TARGET points to a deoptimization target. The hot main path is _JIT_CONTINUE and the cold path in the bad case is _JIT_JUMP_TARGET. You can optimize it to look like this instead:

cmpq    $0x1, -0x10(%r13)
jne     _JIT_JUMP_TARGET

This has the effect of causing the JIT to “fall-through” to the next instruction without any jumps. The jump is only taken in the uncommon deoptimization path! This means in the hot path, no jumps are taken. Modern branch predictors are quite complex, but it seems this performs better (at least on our benchmarks) on our hardware versus the old code.
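The rewrite above can be sketched as a tiny peephole pass over a list of instructions. This is only an illustration in Python, not the actual transformation in CPython's JIT build tooling. It assumes instructions are `(mnemonic, operand)` pairs and that the `_JIT_CONTINUE` label sits directly after the stencil, so falling through reaches it. The function name `invert_branches` is hypothetical.

```python
# Opposite condition codes for the few jumps this sketch understands.
INVERSE = {"je": "jne", "jne": "je"}


def invert_branches(instructions, fallthrough="_JIT_CONTINUE"):
    """Rewrite `jcc fallthrough; jmp target` into `j<inverse cc> target`."""
    out = []
    i = 0
    while i < len(instructions):
        op, operand = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (op in INVERSE and operand == fallthrough
                and nxt is not None and nxt[0] == "jmp"):
            # Jump only on the cold path; the hot path falls through.
            out.append((INVERSE[op], nxt[1]))
            i += 2
        else:
            out.append((op, operand))
            i += 1
    return out


before = [
    ("cmpq", "$0x1, -0x10(%r13)"),
    ("je", "_JIT_CONTINUE"),
    ("jmp", "_JIT_JUMP_TARGET"),
]
after = invert_branches(before)
for op, operand in after:
    print(f"{op:<8}{operand}")
```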

Another idea is hot-cold splitting, but we don’t have a PR up for that yet, so you’ll have to read the issue above!

Register allocation/Top-of-stack caching

Register allocation is likely one of the most worthwhile optimizations a compiler can do. The CPython bytecode interpreter is a stack machine. This means it pushes and pops from an operand stack instead of registers when performing computation. Think of an infinite register machine, but everything lives on the stack.

For obvious reasons, the stack is slower than registers. Luckily in 1995, Anton Ertl proposed a solution to cache the stack in registers. The key idea is to maintain a state machine of the stack. State transitions are loads/spills from registers to the stack and vice versa. Each state represents what is in the stack and what is in registers. Note that this state machine need only be maintained by the JIT optimizer. After the analysis pass, it need not be kept around.
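To illustrate the state machine, here is a small simulation, assuming a cache of two registers. It tracks the state at run time purely for demonstration; as noted above, a real JIT only tracks it during analysis. The class `CachedStack` and its counters are hypothetical names for this sketch.

```python
class CachedStack:
    """Operand stack whose top `num_regs` items live in 'registers'."""

    def __init__(self, num_regs=2):
        self.num_regs = num_regs
        self.memory = []   # the in-memory part of the stack
        self.regs = []     # cached top-of-stack items; the top is regs[-1]
        self.spills = 0    # register -> memory transitions
        self.reloads = 0   # memory -> register transitions

    @property
    def state(self):
        # The cache state here is simply "how many items are in registers".
        return len(self.regs)

    def push(self, value):
        if len(self.regs) == self.num_regs:
            # Registers are full: spill the deepest cached item to memory.
            self.memory.append(self.regs.pop(0))
            self.spills += 1
        self.regs.append(value)

    def pop(self):
        if not self.regs:
            # Nothing is cached: reload the top item from memory.
            self.regs.append(self.memory.pop())
            self.reloads += 1
        return self.regs.pop()


s = CachedStack(num_regs=2)
s.push(1)                   # state 0 -> 1
s.push(2)                   # state 1 -> 2
s.push(s.pop() + s.pop())   # ADD: state 2 -> 0 -> 1, no memory traffic
print(s.state, s.spills, s.reloads)
```

With only two registers, a third push would force a spill. That is the kind of transition the state machine has to account for.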

Mark is working on this. We aren’t as advanced as what Ertl proposes in the paper. However it’s a good start. The preliminary results are roughly a 0.5% geometric mean speedup on pyperformance, with the highest speedup on nbody at 16%. You might be surprised that this number isn’t higher. More on that in the next paragraph.

The main problem is actually due to CPython’s reference count semantics. Simply put, CPython tracks objects’ liveness using reference counting and tracing garbage collection. The problem with this is that Python supports arbitrary finalizers (__del__). This means anywhere that decrements a reference count could call arbitrary Python code, mandating a register spill as the garbage collector treats the stack as one of its roots.

For example, in the seemingly innocuous bytecode sequence:

LOAD x
LOAD y
ADD (+)

ADD normally needs to decrement the reference count of all its input operands, thus we are forced to spill there. There is a solution however, and that’s the next section!

Reference count elimination

Thanks to work on Free-Threading (nogil) by Matt Page @ Meta, CPython’s bytecode optimizer in 3.14 does a simple pass to avoid reference counting local variables. The idea is that if a local variable lives in a CPython function object, then it has a reference to it that will outlive its temporary stack lifetime. In that case, skip the reference counting altogether. More details in the PR here. Interestingly, this PR uses tagged pointers which I was paid to implement by Quansight Labs (thanks Quansight and Meta!) in CPython.

To summarise, the previous

LOAD x
LOAD y
ADD (+)

now becomes

LOAD_BORROW x
LOAD_BORROW y
ADD (+)

Note that thanks to the base bytecode compiler, we now have enough information to perform a data-flow analysis pass to do simple lifetime analysis of objects on the stack! We can trivially observe that anything coming from a _BORROW must have a strong reference somewhere keeping it alive. Therefore, we can convert the ADD instruction to a form which does no reference counting:

LOAD_BORROW x
LOAD_BORROW y
ADD_NO_REFCOUNT (+)
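A minimal sketch of such a pass, written in Python for illustration, could look like the following. It uses the simplified instruction names from this post rather than CPython's real micro-ops, and `eliminate_refcounts` is a hypothetical name. It walks the trace once while keeping an abstract stack of "is this slot borrowed?" flags.

```python
def eliminate_refcounts(trace):
    """Rewrite ADD to ADD_NO_REFCOUNT when both operands are borrowed."""
    borrowed = []  # abstract stack: True if the slot holds a borrowed reference
    out = []
    for op, arg in trace:
        if op == "LOAD_BORROW":
            borrowed.append(True)
        elif op == "LOAD":
            borrowed.append(False)
        elif op == "ADD":
            right = borrowed.pop()
            left = borrowed.pop()
            if left and right:
                # Something else keeps both inputs alive, so no decref is needed.
                op = "ADD_NO_REFCOUNT"
            borrowed.append(False)  # the result is a fresh, owned reference
        else:
            raise ValueError(f"unknown instruction: {op}")
        out.append((op, arg))
    return out


trace = [("LOAD_BORROW", "x"), ("LOAD_BORROW", "y"), ("ADD", None)]
optimized = eliminate_refcounts(trace)
for op, arg in optimized:
    print(op, arg if arg is not None else "")
```

If either operand comes from a plain `LOAD`, the `ADD` is left untouched, since its reference count still has to be decremented.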

I implemented the pass to do the lifetime analysis in CPython’s JIT optimizer earlier this year. However, as almost no bytecodes are converted yet, we don’t see an overall speedup. We do see a speedup of about 6% in microbenchmarks such as nbody. The key idea, however, is that this now unblocks the register allocator, allowing it to do this:

LOAD_BORROW_REG_0_1 x
LOAD_BORROW_REG_1_2 y
ADD_NO_REFCOUNT_REG_2_0 (+)

With zero spills (if we are lucky and don’t run out of registers)! This optimization is thus in some sense, a canonicalization pass—it unlocks optimizations for other passes, while optimizing a little on its own.

More constant promotion

One key optimization in JIT compilers is constant propagation. In effect, something like

x = 1
y = 1
z = x + y

becomes:

x = 1
y = 1
z = 2

Note that x and y can be optimized away, but that is usually another associated optimization called copy propagation.
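As a rough illustration of the transformation, here is a sketch that does it at the Python source level with the standard `ast` module. The real JIT works on its own IR, not on source code. This sketch only handles top-level, straight-line assignments and integer addition, and it assumes expressions have no side effects. The function name `propagate_constants` is made up.

```python
import ast


def propagate_constants(source):
    """Fold integer additions over straight-line, top-level assignments."""
    tree = ast.parse(source)
    known = {}  # variable name -> constant value known at this point

    class Folder(ast.NodeTransformer):
        def visit_Name(self, node):
            # Replace a read of a known variable with its constant value.
            if isinstance(node.ctx, ast.Load) and node.id in known:
                return ast.copy_location(ast.Constant(known[node.id]), node)
            return node

        def visit_BinOp(self, node):
            self.generic_visit(node)  # fold the operands first
            left, right = node.left, node.right
            if (isinstance(node.op, ast.Add)
                    and isinstance(left, ast.Constant)
                    and isinstance(right, ast.Constant)
                    and isinstance(left.value, int)
                    and isinstance(right.value, int)):
                folded = ast.Constant(left.value + right.value)
                return ast.copy_location(folded, node)
            return node

    folder = Folder()
    for stmt in tree.body:
        simple = (isinstance(stmt, ast.Assign)
                  and len(stmt.targets) == 1
                  and isinstance(stmt.targets[0], ast.Name))
        if not simple:
            known.clear()  # be conservative about anything we do not model
            continue
        stmt.value = folder.visit(stmt.value)
        name = stmt.targets[0].id
        if isinstance(stmt.value, ast.Constant):
            known[name] = stmt.value.value
        else:
            known.pop(name, None)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(propagate_constants("x = 1\ny = 1\nz = x + y"))
```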

The JIT currently has a limited form of constant propagation. However, to really perform more, it needs to maintain a pool of constants like in Java or PyPy.

The planned syntax takes ideas from PyPy. In RPython, you can promote a value to a trace-level constant:

x = hint(x, promote=True)

This causes x’s value at that time to be embedded into the trace itself, allowing the optimizer to go ham!

I plan to add that in this PR.

Basic Free-Threading support

Free-threading is taking off, but the current JIT doesn’t work with it yet.

The ideas for this are still a little nebulous, but over the summer I had the fortune of contributing a little to ZJIT, Ruby’s new JIT compiler, with help from Max Bernstein. It was a load of fun and I learnt a lot. Perhaps the most interesting thing is the idea of PatchPoint and Ractors. I’m basically borrowing Ruby’s ideas here!

Ruby has had the nogil problem for their JIT compilers for ages. One way of making sure single-threaded-assumption optimizations still work is to add a watcher in the code (CPython’s implementation was upstreamed from Cinder, Instagram’s JIT compiler for Python). This watcher is essentially a callback to invalidate something once an assumption no longer holds true. This may not sound very concrete, but consider the following micro-operation trace:

_CHECK_VALIDITY
# ... optimized code that assumes single-threaded mode

We then insert the following into thread_create in CPython:

thread_create()
{
    invalidate_all_jit_code();
    ...
}

invalidate_all_jit_code contains a callback that invalidates _CHECK_VALIDITY, which our JIT code checks every time it’s run. This is just a cheap boolean flag check. This means when a thread is created, we throw all our JIT code away. Single-threaded code runs at JIT speed, while multi-threaded code runs slower but at least has the benefit of the GIL being off.

This seems wasteful for now, but I did say basic free-threading support. There are more advanced schemes available, such as discarding and then recompiling in a multi-thread optimization mode where we only turn on safe optimizations. That’s only planned for 3.16 though. We’re taking small steps.
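The basic invalidation scheme described above can be modelled in a few lines of Python. This is a conceptual sketch only: the class `JITState` and the function `run_trace` are hypothetical, and the real check lives in generated machine code rather than in Python.

```python
class JITState:
    """Holds the single flag that every compiled trace checks on entry."""

    def __init__(self):
        self.jit_code_valid = True

    def invalidate_all_jit_code(self):
        self.jit_code_valid = False


def run_trace(state, fast_path, slow_path):
    # Plays the role of _CHECK_VALIDITY: one cheap boolean check
    # guards the optimized code.
    if state.jit_code_valid:
        return fast_path()
    # The assumption was broken: fall back to the interpreter.
    return slow_path()


def thread_create(state):
    # Creating a thread breaks the single-threaded assumption.
    state.invalidate_all_jit_code()


state = JITState()
first = run_trace(state, lambda: "jit", lambda: "interpreter")
thread_create(state)
second = run_trace(state, lambda: "jit", lambda: "interpreter")
print(first, second)
```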

Conclusion

If you like this sort of thing, or even if you don’t, consider contributing to CPython! For now, we don’t have easy contributor issues yet as we are waiting to land the initial stages of the trace recording JIT and the register allocator. After that, probably the start of next year, there will be tons of things for people to contribute to! Contribution doesn’t have to be code either; good reviews are always appreciated.

Acknowledgements

I thank Mark and Savannah for always reviewing my PRs :).

Reflections on 2 years of CPython’s JIT Compiler—The good, the bad, the ugly

2025-07-05T00:00:00+00:00

Reflections on 2 years of CPython’s JIT Compiler: The good, the bad, the ugly

5 July 2025

This blog post includes my honest opinions on the CPython JIT: what I think we did well, and what I think we could have done better. I’ll also do some brief qualitative analysis.

I’ve been working on CPython’s JIT compiler since before the very start. I don’t know how long that is at this point … 2.5, maybe almost 3 years? Anyways, I’m primarily responsible for Python’s JIT compiler’s optimizer.

Note that at this point in time, the JIT is still experimental. This means it’s not ready for prime time yet: this blog post may go out of date fairly quickly!

Here’s a short summary:

The good:

I think we’re starting to build a community around the JIT, which is great.
The JIT is also teachable. We have newcomers coming in and contributing.

Could use improvement:

Performance
Inaccurate coverage of the JIT

Good: A community-driven JIT

CPython’s JIT is community-driven at this point. You may have heard of the layoffs at Microsoft affecting the Faster CPython team. However, my understanding is that from the very start, the JIT was meant to be a community project.

This wasn’t always the case. When the JIT started out, it was practically only Brandt working on the machine code generator. I had help from Mark (Shannon) and Guido in landing the initial optimizer, but after that it was mostly me. Later I got busier with school work and Brandt became the sole contributor to the optimizer for a few months or so. Those were dark times.

I’m really happy to say that we have more contributors today though:

Savannah works on the machine code generator, reproducible JIT stencils, and sometimes the optimizer.
Tomáš works on the optimizer and is a codeowner of it!
Diego works on the machine code generator to improve it on ARM, and sometimes the optimizer.
We also have various drive-by contributors. Zheaoli, Noam and Donghee are names that I remember. Though I’m definitely missing a few names here.

This community building was somewhat organic, but also very intentional. We actively tried to make the JIT easier to work on. If you dig up past discussions, one of Mark’s arguments for a tracing JIT was easier static analysis. This easiness isn’t just that it doesn’t require a meet or join in general or that it requires only a single pass, but more that static analysis of a single basic block is easier to teach than a whole control-flow graph.

We also actively welcome people to work on the JIT with us. CPython doesn’t have much optimizing compiler expertise interested in working on the JIT. We have some compiler people, but the subset of those interested in working on the JIT is even smaller. So we aim to train up people even if they don’t have any background in compilers.

Good: A teachable JIT

As I mentioned earlier, tracing was one decision to make the JIT easier to teach. There are a few other design decisions too, but those will be their own blog post. So I’m not talking about them here.

Could be improved: Performance

CPython 3.13’s JIT ranges from slower than the interpreter to roughly equivalent to the interpreter. Calling a spade a spade: CPython 3.13’s JIT is slow. It hurts me to say this considering I work on it, but I don’t want to sugarcoat my words here.

The argument at the time was that it was a new feature and we needed to lay the foundations and test the waters. You might think that surely, CPython 3.14’s JIT is a lot faster, right? In some ways, the JIT has become faster, but only in select scenarios. The answer is again… complicated. When using a modern compiler like Clang 20 to build CPython 3.14, I often found the interpreter outperforms the JIT. The JIT only really starts reaching parity or outperforming the interpreter if we use an old compiler like GCC 11 to build the interpreter. However, IMO that’s not entirely fair to the interpreter, as we’re purposely limiting it by using a compiler we know is worse for it. You can see this effect very clearly in Thomas Wouters’s analysis here. Note that this is the geometric mean. So there are select workloads where the JIT does show a real speedup!

(Image credits to Thomas Wouters). Anything below 1.00x on the graph is a slowdown.

In short, the JIT is almost always slower than the interpreter if you use a modern compiler. This also assumes the interpreter doesn’t get hit by random performance bugs on the side (which has happened many times now). Note: this result only applies to our x64 benchmarks. I cannot conclude anything about AArch64, which has been improving over time.

In some cases, we do see significant speedups (up to ~20%) in certain benchmarks, indicating that some progress has been made on 3.14, which is a good thing! What we’re tackling is that the performance is a mixed bag and often not very predictable. In the richards benchmark, we see a ~20% speedup, but on the nbody benchmark, we see a ~10% slowdown on my system, and a smaller slowdown for the spectral_norm benchmark. All of these are known to be loop-heavy artificial benchmarks (which V8 has since ditched), so in theory they all should see a speedup. They don’t, which is strange.

3.14 JIT Off:
richards: Mean +- std dev: 44.5 ms +- 0.5 ms
nbody: Mean +- std dev: 91.8 ms +- 3.5 ms
spectral_norm: Mean +- std dev: 90.6 ms +- 0.7 ms

3.14 JIT On:
richards: Mean +- std dev: 37.8 ms +- 2.4 ms
nbody: Mean +- std dev: 104 ms +- 2 ms
spectral_norm: Mean +- std dev: 96.0 ms +- 0.7 ms

System/Build configuration: Ubuntu 22.04, Clang 20.1.7, PGO=true, LTO=thin, tailcall=false. Tuned with pyperf system tune.
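To relate these timings to the geometric mean figures used elsewhere in this post, here is a small calculation over the three mean timings above. It ignores the standard deviations, and three benchmarks are far too few to stand in for pyperformance, so treat it as a worked example of the arithmetic only.

```python
import math

# Mean timings in milliseconds, copied from the results above.
jit_off = {"richards": 44.5, "nbody": 91.8, "spectral_norm": 90.6}
jit_on = {"richards": 37.8, "nbody": 104.0, "spectral_norm": 96.0}

# A ratio above 1.0 means the JIT build is faster on that benchmark.
speedups = {name: jit_off[name] / jit_on[name] for name in jit_off}

# Geometric mean: the n-th root of the product of the per-benchmark ratios.
geomean = math.prod(speedups.values()) ** (1 / len(speedups))

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean: {geomean:.2f}x")
```

On these three numbers the geometric mean comes out at roughly 0.99x: one clear win and two losses average out to about parity. That is why a single geometric mean figure can hide both real speedups and real slowdowns.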

You might ask: why is the 3.14 JIT not much faster? The real answer, which again hurts me to say, is that the 3.14 JIT has almost no major optimizer* features over 3.13. In 3.14, we were mostly expanding the existing types analysis pass to cover more bytecodes. We were also using that as a way to teach new contributors about the JIT and encourage contribution. In short, we were building up new talent. I personally think we were quite low on contributors at the start. I also had other commitments which made features that were supposed to go into the JIT not go in, which I’m sorry for. Personally, I think building up more talent over prioritizing immediate performance is the right choice for long-term sustainability of the JIT.

*optimizer = JIT optimizer, separate from the code generator. The code generator for x64 and AArch64 has seen improvements.

Could be improved: Inaccurate coverage

The initial media coverage of the 3.13 JIT got the numbers wrong by misinterpreting our results. There was this number of “2-9%” faster being spread around. I think the first major blog post that covered this was this one. Note that I’m friends with the author of that post and I’m not trying to say that they did a bad job. Conveying performance is a really hard job. One that I’m still struggling with myself. However, in good conscience, and as an aspiring scientist, I can’t stand by and watch people say the 3.13 JIT is “2-9%” faster than the interpreter. It’s really more nuanced than that (see section above). Oftentimes, the CPython 3.13 JIT is a lot slower than the interpreter. Furthermore, the linked comment says that the 3.13 JIT is 2-9% faster than the tier 2 interpreter. That’s the interpreter that executes our JIT intermediate representation by interpreting it, which is super slow. It’s not comparing to the actual CPython interpreter.

I’ve seen other sources repeat this number too. It frustrates me a lot. The problem with saying the 3.13 JIT is faster is that it sets the wrong expectations. Again, users on the Python Discourse forum and privately have shared performance numbers where the JIT is a significant regression for them. This goes against the grain of what’s reported online. We do not have control over the numbers, but I still would like to clear the air on what the real expectation should be.

Ugly: None

If I had thought there was really ugly stuff, I wouldn’t be working on the JIT anymore :-).

Conclusion and looking forward

I’m still hopeful for the JIT. As I mentioned above, we’ve built a significant community around it. We’re now starting to pick up momentum on issues and new optimizations that could bring single-digit percentage speedups to the JIT in 3.15 (note: this is the geometric mean of our benchmarks, so real speedups might be greater or lesser). Brandt has already merged some optimizations for the JIT’s machine code. I don’t want to bring unwanted attention to the other efforts for the moment. Just know this: there are multiple parallel efforts to improve the JIT now that we have a bigger community around it that can enable such work. The road getting here has been tough, but there’s promise in our future. We also really need help testing the JIT and getting more data for it. Please try it out!

Correction notice

In a previous version of this blog post, I pointed out there were no major performance additions to the JIT in 3.14. When I said this, I was thinking of the JIT optimizer only, not the machine code generator. I am frankly underqualified to talk about the machine code generator. I have since updated the post to specify the optimizer. Furthermore, when I say major, I don’t mean to denigrate the efforts of our contributors. I had planned for certain major features to enter the CPython JIT in 3.14, but missed them due to my own lack of time. So I’m not blaming anyone here other than myself.

The (lack-of) performance gains for the JIT are for architectures that I observed (mostly a range of x64 processors). It is possible that some architectures have real gains that I’m not aware of.

I also added some benchmarks run on my system, where I show a speedup in some workloads, but a slowdown in others.

I’m Sorry for Python’s tail-calling Interpreter’s Results

2025-03-08T00:00:00+00:00

08-Mar-2025

This is my first blog post ever. I want to use it to say I’m truly sorry for communicating inaccurate results for Python’s tail-calling interpreter. I take full personal responsibility for the oversight that led to it.

What happened?

About a month ago, I merged a new tail-calling interpreter into Python. That interpreter reported a 9-15% performance boost on Python 3.14’s What’s New page.

These figures turned out to be inaccurate. Long story short, the compiler we were using (Clang 19), had a bug that worsened our baseline performance. We (the CPython developers) were completely unaware of this bug.

The real performance uplift one can expect by upgrading to the tail-calling interpreter is in the 3-5% range. We are not too sure about this figure either, because we had to compare across different compilers.

Thanks to Nelson Elhage for their excellent investigation into this issue and bringing it up. For more information, you can read their blog post here.

What I’m doing to fix the situation

Upon receiving news from Nelson confirming that the Clang 19 bug caused a 10% performance regression on our baseline, I did the following:

Immediately pushed a PR to Python 3.14’s What’s New Page to correct the record. I put a big attention markup in reStructuredText to signal to the reader that a correction has been made. This also gives credit to Nelson.
Updated all my Reddit posts to add a disclaimer and link to the updated What’s New.
For Twitter/X: I don’t have premium so I can’t edit my post. I’m thinking of posting a link to this blog post to let people know.

If you feel there’s more I could do, please let me know.

What I’ve learnt from this

This completely blindsided me and I’ve learnt to never trust the compiler when the performance results are too good to be true. That, and to carefully investigate our baselines.

At the time of writing, the Clang 19 bug I talked about is not yet fixed, and it exists in Clang 19, 20, maybe 21-beta. I do not want to blame the LLVM developers for this. Like me, they are probably volunteer contributors as well. Sometimes we make mistakes.

Summary

In short, a compiler bug in Clang 19 that we were unaware of resulted in worse baselines. I reported these figures believing they were true. I should have done more investigation into the compiler before reporting these figures. I’m deeply sorry for mistakenly reporting inaccurate numbers.