Matt Pocock had an interesting suggestion that I believe might be inspired by an episode of The Office.
The whole idea is to end a Claude Code prompt with this:
Be extremely concise. Sacrifice grammar for the sake of concision.
So I couldn't help but try some prompts with this idea. Here are some results.
Normal prompt
explain huberregression from scikit-learn.
Huber regression in scikit-learn is a **robust linear regression** method that behaves like ordinary least squares for small residuals but becomes linear (less sensitive) for large residuals. It uses the Huber loss:
- If |r| ≤ ε: loss = 0.5 * r² (quadratic, like least squares)
- If |r| > ε: loss = ε * (|r| − 0.5 * ε) (linear, like absolute loss)
Here `r = y − ŷ` and ε is the threshold that controls when the loss switches from quadratic to linear.
Why it’s useful:
- Reduces the influence of outliers compared to plain least squares.
- Still differentiable everywhere, which makes optimization stable.
Key points in scikit-learn:
- Class: `sklearn.linear_model.HuberRegressor`
- Important params: `epsilon` (switch point), `alpha` (L2 regularization), `max_iter`, `tol`
- It **scales features internally**; coefficients are returned in original scale.
- If `epsilon` is large, it behaves more like standard regression; smaller values increase robustness but can down-weight legitimate high-residual points.
When to use:
- You suspect outliers or heavy-tailed noise.
- You want a linear model that’s more robust than `LinearRegression` but smoother than `RANSAC`/`TheilSen`.
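As an aside, if you want to try the class the output describes yourself, here is a minimal sketch using `HuberRegressor` on synthetic data with one injected outlier. The data and parameter values are my own illustrative choices, not taken from the output above.

```python
# Minimal sketch: HuberRegressor vs LinearRegression on data with one outlier.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.3, size=100)
y[0] += 50.0  # inject a single large outlier

huber = HuberRegressor(epsilon=1.35, alpha=1e-4).fit(X, y)
ols = LinearRegression().fit(X, y)

print("Huber coef/intercept:", huber.coef_, huber.intercept_)
print("OLS   coef/intercept:", ols.coef_, ols.intercept_)
print("Samples flagged as outliers:", huber.outliers_.sum())
```

The Huber fit should stay close to the true slope and intercept, while the ordinary least squares fit gets pulled toward the outlier.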
Prompt + concise
explain huberregression from scikit-learn. Be extremely concise.
Huber regression (sklearn.linear_model.HuberRegressor) is a robust
linear model: it uses squared loss for small residuals and linear
loss for large ones, reducing outlier influence while staying close
to least squares.
Prompt + concise + sacrifice grammar
explain huberregression from scikit-learn. Be extremely concise. Sacrifice grammar for the sake of concision.
HuberRegression (sklearn): linear reg w/ Huber loss; quadratic for
small residuals, linear for large → robust to outliers. Tuned by
`epsilon`; solves via iterative reweighted least squares; supports
`alpha` (L2) + `fit_intercept`, `max_iter`, `tol`.
Notice the difference?
Is it just me, or does the variant without grammar give many extra details that the plain "be concise" prompt leaves out? It feels like dropping grammar might indeed work as a forcing mechanism.