Finding Text in Comic Books

Today I learned, via this paper, that people are doing their best to parse the text in speech bubbles in comic books.

It turns out to be an interesting OCR challenge, but it also seems like the existing data has quality issues. To quote the paper:

This led the authors to create a new dataset, called COMICS Text+. It comes in two variants: one for detecting the speech bubbles that contain text, and another for recognizing the text inside each bubble.
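To make the detection/recognition split concrete for myself, here is a rough sketch of how the two kinds of annotation could relate: a detection box marks where a bubble is on the page, and a recognition sample pairs the cropped bubble with its transcript. Everything here (`BubbleBox`, `RecognitionSample`, the field names, the page as a nested list of pixel values) is my own assumption for illustration, not the actual format of the dataset.

```python
from dataclasses import dataclass
from typing import List

Page = List[List[int]]  # assumed: a grayscale page as rows of pixel values


@dataclass
class BubbleBox:
    """Hypothetical detection label: pixel box around one speech bubble."""
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


@dataclass
class RecognitionSample:
    """Hypothetical recognition label: a cropped bubble plus its transcript."""
    crop: Page
    text: str


def crop_bubble(page: Page, box: BubbleBox) -> Page:
    # Keep only the rows and columns that fall inside the box.
    return [row[box.left:box.right] for row in page[box.top:box.bottom]]


def build_recognition_samples(
    page: Page, boxes: List[BubbleBox], transcripts: List[str]
) -> List[RecognitionSample]:
    # One transcript per detected bubble; refuse mismatched inputs.
    if len(boxes) != len(transcripts):
        raise ValueError("need exactly one transcript per bubble box")
    return [
        RecognitionSample(crop=crop_bubble(page, box), text=text)
        for box, text in zip(boxes, transcripts)
    ]
```

In a real pipeline the boxes would come from a trained detector and the transcripts from a recognizer; here they are passed in by hand just to show how the two tasks connect.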

The paper also describes how the speech bubbles for the ground-truth (GT) data were selected.

Our policy for selecting speech bubbles for the GT data is to ensure that all the text in the image is clearly visible and readable. The selected images come from different comic books and have various artistic styles. In some cases, there may be residual characters on the margins of the speech bubbles, but this is acceptable as long as the characters are not fully visible. This allows us to see how the text detector performs in the presence of residual text and ensures that it does not incorrectly label these as text regions. However, we do not include examples with half-words or full characters in the GT data because this would introduce irregular text and speech bubble detection problems that would skew the evaluation of our models’ performance.

The paper also describes the models the authors trained and compares them against previous attempts.

It was a fun read! The data and source code for all this can be found on GitHub.