The starcraft pdf benchmark
I have a pdf on my laptop that is about starcraft bots.
The paper itself is a fun read, but the most interesting thing about the PDF to me is how many OCR tools fail to grok the order in which it should be read. This is the order that many of them fall for:
However, this is the correct one:
I've tried Mistral OCR (demo), marker from datalab, and some LLMs, but they all seem to fall for the same trap.
If you are more interested in summarizing the entire pdf, or merely in indexing its text, then this may not matter as much. But for a lot of use-cases the order really does matter! And if it does ... maybe check if the vendor passes the starcraft pdf benchmark beforehand.
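
Such a check could be automated in a few lines of Python. This is only a rough sketch under my own assumptions: `reading_order_ok` is a made-up helper name, and the sample strings are placeholders rather than text from the actual paper. The idea is to pick a handful of phrases from the pdf, list them in the order a human would read them, and verify that they show up in that same order in whatever text the tool returns.

```python
def normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks in OCR output don't matter."""
    return " ".join(text.split())


def reading_order_ok(extracted_text: str, expected_snippets: list[str]) -> bool:
    """Return True if every snippet occurs in the text, in the given order.

    Hypothetical helper: each snippet must be found after the end of the
    previous one, so a tool that scrambles the reading order fails.
    """
    haystack = normalize(extracted_text)
    position = 0
    for snippet in expected_snippets:
        found = haystack.find(normalize(snippet), position)
        if found == -1:
            return False
        position = found + len(normalize(snippet))
    return True


# Placeholder strings standing in for OCR output; not taken from the paper.
correct = "Left column first paragraph. Left column second paragraph. Right column."
scrambled = "Left column first paragraph. Right column. Left column second paragraph."
snippets = ["first paragraph", "second paragraph", "Right column"]

print(reading_order_ok(correct, snippets))    # True
print(reading_order_ok(scrambled, snippets))  # False
```

Exact substring matching is a deliberately blunt choice here. Real OCR output may contain small character errors, so a fuzzier comparison might be needed in practice.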