The starcraft pdf benchmark

I have a pdf on my laptop that is about starcraft bots.

[Screenshot: the top of the paper.]

The paper itself is a fun read, but the most interesting thing about the PDF to me is how many OCR tools fail to grok the order in which it should be read. This is the order that many of them fall for:

[Screenshot: the reading order many tools produce. Fail.]

However, this is the correct one:

[Screenshot: the correct reading order. Win!]

I've tried Mistral OCR (demo), marker from datalab, and some LLMs, but they all seem to fall for the same trap.

If you mostly want to summarize the entire pdf, or merely index its text, then this may not matter as much. But for a lot of use-cases the order really does matter! And if it does ... maybe check if the vendor passes the starcraft pdf benchmark beforehand.
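
If you want to automate that check, here is a rough sketch of how it could look. It assumes you have already run the tool and have its output as a plain string, and that you have picked a few short phrases from the pdf by hand, listed in the order a human would read them. The function name `reading_order_ok` and the example phrases are made up for illustration; they are not taken from the paper or from any vendor's API.

```python
def reading_order_ok(ocr_text: str, anchors: list[str]) -> bool:
    """Return True if every anchor phrase occurs in ocr_text, in the given order.

    Whitespace and case are normalized first, since OCR tools differ in how
    they break lines.
    """
    normalized = " ".join(ocr_text.lower().split())
    position = 0
    for anchor in anchors:
        needle = " ".join(anchor.lower().split())
        found = normalized.find(needle, position)
        if found == -1:
            # Phrase is missing, or it only shows up before the previous one.
            return False
        position = found + len(needle)
    return True


# Hypothetical anchors: replace these with phrases from your own pdf.
anchors = ["abstract", "left column", "right column"]

good_output = "Abstract ... left column text ... right column text"
bad_output = "Abstract ... right column text ... left column text"

print(reading_order_ok(good_output, anchors))
print(reading_order_ok(bad_output, anchors))
```

This only checks relative order, so it works best with phrases that appear exactly once in the document. A phrase that repeats can make a scrambled output look fine.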