The Factorio benchmark
Jack Clark's newsletter has a very interesting fact this week: Factorio (the immensely popular factory game) is now an AI benchmark.
This page has all the details of the work, which include:
- A Python library that can interact with the game, which is the main entrypoint for the agents that compete in tasks.
- A leaderboard with the results of the agents that have competed so far (Claude seems to be the winner, but the fact that one of the authors is from Anthropic might help there).
One thing that's particularly interesting is that there's a mode for specific tasks, but there is also an unbounded task where the agents are simply instructed to "build the biggest factory". This is a very interesting task because it's not clear what the best strategy is, but there is a "score" that can be associated with the production of the factory. Some items are higher up in the value chain because they require more complex production chains, so the score is not just a matter of counting the number of items produced.
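To make the scoring idea concrete, here is a minimal sketch of a value-weighted production score. The item values and the `production_score` function are made up for illustration; they are not the benchmark's actual numbers or API, just one way such a score could work.

```python
# Hypothetical per-item values (NOT taken from the benchmark):
# items with longer production chains are assumed to be worth more.
ITEM_VALUES = {
    "iron-plate": 1.0,
    "iron-gear-wheel": 3.0,
    "electronic-circuit": 8.0,
}


def production_score(produced, values=ITEM_VALUES):
    """Sum quantity * value over all produced items; unknown items count as 0."""
    return sum(qty * values.get(item, 0.0) for item, qty in produced.items())


# Two hypothetical factories: one mass-produces a basic item,
# the other makes fewer units of a more complex one.
many_plates = {"iron-plate": 100}
few_circuits = {"electronic-circuit": 20}

print(production_score(many_plates))
print(production_score(few_circuits))
```

With these assumed values, the factory that makes 20 circuits outscores the one that makes 100 plates, which is the point: raw item count alone doesn't decide the ranking.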
It turns out to be a hard benchmark. Spatial reasoning, especially when combined with planning, is still pretty hard for LLMs, and for this setup to really work an agent also needs to be great at programming. Interesting stuff though!