Translation Overlay 4: Tesseract as C# OCR provider

Posted by Programming Is Moe on Saturday, November 14, 2020


This time I will attempt to use Tesseract as an alternative to Google Cloud OCR.

Getting Tesseract Up and running in C#

Luckily there exists a C# wrapper for Tesseract by charlesw (no work is best work). So we install the Tesseract 4.1.0-beta NuGet package, yoink the C# example from his examples folder, and slightly modify it for my use case:

public async Task<IEnumerable<ISpatialText>> Ocr(Stream image, string? inputLanguage = null)
{
    using var engine = new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default);

    using var memoryStream = new MemoryStream();
    await image.CopyToAsync(memoryStream);

    using var img = Pix.LoadFromMemory(memoryStream.ToArray());
    using var page = engine.Process(img, PageSegMode.Auto);
    
    var completeText = page.GetText();
    _logger.LogInformation("Mean confidence: {0}", page.GetMeanConfidence());
    
    var results = new List<ISpatialText>();
    using var iter = page.GetIterator();
    iter.Begin();
    do
    {
        do
        {
            do
            {
                do
                {
                    var boundingWorked = iter.TryGetBoundingBox(PageIteratorLevel.Word, out var boundingBox);
                    var text = iter.GetText(PageIteratorLevel.Word);
                    if (boundingWorked && text != null)
                        results.Add(new SpatialText(text, new Rectangle(boundingBox.X1, boundingBox.Y1, boundingBox.Width, boundingBox.Height)));
                    
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
    } while (iter.Next(PageIteratorLevel.Block));
    return results;
}
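
The ISpatialText/SpatialText types used above aren't part of the wrapper — they're my own. For completeness, a minimal sketch of what they might look like (names and shape are assumptions based on how they're used in this post, not library types):

```csharp
using System.Drawing;

// Hypothetical contract for a piece of recognized text plus its location.
public interface ISpatialText
{
    string Text { get; }
    Rectangle Area { get; }
}

// Simple immutable implementation used by the Ocr method above.
public class SpatialText : ISpatialText
{
    public string Text { get; }
    public Rectangle Area { get; }

    public SpatialText(string text, Rectangle area)
    {
        Text = text;
        Area = area;
    }
}
```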

Babyshark goes do do do do. Now run that bad boy aaaaand: missing data files. The C# wrapper doesn’t come prepacked with any of the language data files that Tesseract needs to work. Looking at the Tesseract documentation, it describes three different sets. They basically are:

  • tessdata (a speed/accuracy compromise)
  • tessdata-best (Slowest, most accurate)
  • tessdata-fast (Fastest, least accurate)

Because I have no idea yet how slow “slow” is, I go with whatever has “best” in the name. From there we grab the two files that are relevant to my example image: jpn_vert.traineddata and jpn.traineddata. Create a folder called tessdata in my assembly, as referenced in the new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default) ctor call, and set both files to Copy Always. Now it should work.
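
For reference, the Copy Always setting corresponds to entries in the .csproj, roughly like this (file and folder names taken from this post; the exact entry depends on your project style):

```xml
<ItemGroup>
  <None Update="tessdata\jpn.traineddata">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
  <None Update="tessdata\jpn_vert.traineddata">
    <CopyToOutputDirectory>Always</CopyToOutputDirectory>
  </None>
</ItemGroup>
```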

[OCR] 👈 boop

NOTHING GETS RECOGNIZED

YEAH NOTHING. Just as a reminder, this is what should have happened:

reference

Troubleshooting Tesseract not finding any text

Just FYI, Tesseract has a nice guide on how to improve OCR results.
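
One knob from that guide worth knowing for screenshots: Tesseract is tuned for scanned-document resolutions (around 300 DPI), and you can tell it the actual resolution via the stock user_defined_dpi config variable. A sketch, assuming the same wrapper as above (whether it helps in this particular case is another question):

```csharp
using var engine = new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default);

// Screen captures are typically ~96 DPI; without a hint Tesseract has to guess
// the resolution, which can hurt recognition.
engine.SetVariable("user_defined_dpi", "96");
```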

Because I’ve made that mistake five times during this project by now, I made sure I didn’t just fuck up and checked the completeText variable in my method. And it’s an empty string. Good. Now to check: what is Tesseract seeing?

What is Tesseract attempting to translate

For that, Tesseract has a configuration variable, tessedit_write_images, which will output the image right before Tesseract’s OCR step. According to the docs, Tesseract does a bunch of image processing by itself. (Or maybe I managed to transmit a blank screen again.)

public async Task<IEnumerable<ISpatialText>> Ocr(Stream image, string? inputLanguage = null)
{
    using var engine = new TesseractEngine(@"./tessdata", "jpn", EngineMode.Default);
    engine.SetVariable("tessedit_write_images", true);
    ...

Run it again and take a look at the .tiff output by Tesseract. Tesseract tosses the image into whatever the current working directory of the process is, which while debugging is wherever the main .dll lives: bin/Debug/ and so on.
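
If you can’t find the .tiff, logging the working directory takes the guesswork out of it (a trivial sketch; _logger is the same injected logger as in the Ocr method above):

```csharp
using System.IO;

// Tesseract writes its debug image relative to the process working directory,
// which while debugging is usually the build output folder.
_logger.LogInformation("Working directory: {0}", Directory.GetCurrentDirectory());
```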

full-capture

Looks pretty unintelligible to me.

How can we pre-process this to be more OCR-friendly

We don’t. No matter what kind of filter combination I try in GIMP, whatever I end up with is pretty much the same thing Tesseract did, only “worse”. And a library written by hundreds of people smarter than me surely does the most optimal image processing it can for this almost ideal image.

Is size/resolution the issue?

Let’s zoom way in and try it on only one panel:

zoomed-in

[OCR] 👈 boop

zoomed-in-ocr

Hey, it was resolution all along! Or not. HMMMMMM, why is only one panel OCR’d? Highly, HIGHLY suspicious. Lemme zoom out to the original and limit the area to an individual panel:

[OCR] 👈 boop

single-panel

Awww crap. I smell work. So my shotgun approach to OCR that worked with Google is a no-go? This is what Tesseract processed:

single-panel-see
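
Incidentally, limiting the OCR area doesn’t have to mean cropping the screenshot by hand: the wrapper’s TesseractEngine.Process has an overload that takes a region. A sketch of restricting recognition to one panel (the coordinates are made up, and Rect here is the wrapper’s own rectangle type, not System.Drawing.Rectangle):

```csharp
using var img = Pix.LoadFromMemory(memoryStream.ToArray());

// Hypothetical panel coordinates: x, y, width, height.
var panel = new Rect(40, 60, 480, 700);

// Only the given region of the image is fed to the OCR step.
using var page = engine.Process(img, panel, PageSegMode.Auto);
var panelText = page.GetText();
```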

Let’s see how that compares to the full capture:

partial-vs-full

And another example

partial-vs-full2

It is apparent that the more non-text Tesseract has to process in the image, the more aggressively it tries to pre-process it. When it’s only text, it does almost no processing at all.

Improving the OCR area for tesseract

I see 2 options:

  1. Train a neural network to detect speech bubbles like this dude. (Why didn’t you release what you had ;_;)
  2. Find magic settings in tesseract by reading the docs.

Both mean work, but one of these options involves manually selecting bubbles in 4000 images and having to learn new skills. And somehow I don’t want my application to focus 100% on manga-type detection. Sure, it could be an optional helper mode, but… ugh, I don’t know… I do get the feeling that I probably won’t get around it anyway. But yeah, let’s take a look at the docs.

The only solution is not to be lazy

Well, after sleeping on it a bit (and reading the docs) I came to the conclusion that there is no way around bubble detection. Tesseract just isn’t meant for finding text in noisy images. So up next: how 2 yolo.