A Cross-Platform C# OCR Translation Overlay Part 2

Posted by Programming Is Moe on Saturday, September 5, 2020

TOC

Now that the program targeted coordinates that actually existed, it was time to just let it run.

More text boxes than expected: 97, skipping translation

🙃

More Text Than Expected

I put in a hard limit of 60 detected text areas before sending anything to the Google translation API, for one reason: what if the OCR fails to see the characters in the image as cohesive text? That would result in 2 issues:

  1. The Google Translate v3 API has a rate limit of (I think) 6000 translate requests per minute, which would be really hard to hit unless the OCR sees every single character as a separate “text”. Depending on the type of media that is OCR’d, that could result in a ridiculous number of requests. And even if it might stay below the rate limit, I really don’t want to send 100+ requests at the same time, every single time.

  2. But most importantly: The translation would be useless without whole sentences.

This is the test image that I was using:

by bkup

This time: fewer assumptions. First I will look into the list to see what the text parts actually are, and also add a debug visualization for them to the overlay.

watch result

The first annotation element has at least half of the text of the whole page, and the other 96 have one to two characters each? No wait. The first element is the combination of all the other elements. So the first Annotation element is the whole text, and the rest are the individual subtexts/characters. I totally expected that…

annotateimageresponse documentation

text_annotations[] | EntityAnnotation | If present, text (OCR) detection has completed successfully.

Yes, very informative google.
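In code, that means the whole-page text is element zero and the per-fragment annotations follow it. A minimal sketch of splitting them apart, assuming the Google.Cloud.Vision.V1 response shape where `TextAnnotations` is the list of `EntityAnnotation`s (variable names are mine):

```
// Sketch: separate the full-page annotation from the fragments.
// Assumes TextAnnotations[0] is the combined text, as observed above.
var fullPageText = response.TextAnnotations.FirstOrDefault();
var fragments = response.TextAnnotations.Skip(1).ToList();
```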

OCR Debug Overlay

First I will have to convert the boundingPoly vertices of each annotation to the C#-native Rectangle struct, as I’m too lazy to deal with arbitrary quadrilaterals. (Yes, I assumed things again.) But this time I bothered to check:

"vertices": [ { x: 525, y: 65 }, { x: 663, y: 65 }, { x: 663, y: 169 }, { x: 525, y: 169 } ]
"vertices": [ { x: 551, y: 186 }, { x: 658, y: 184 }, { x: 660, y: 286 }, { x: 553, y: 288 } ]

As you can see, the second set of vertices is not a rectangle. Also, the Vision API returns the vertices in clockwise order, starting at the top left. So to keep it simple, I will just grab the top-left vertex [0] and the bottom-right vertex [2] to create a rectangle.

private SpatialText ConvertToSpatialText(EntityAnnotation annotation)
{
    // Vertices come in clockwise order starting at the top left,
    // so [0] is the top-left and [2] the bottom-right corner
    var topLeft = annotation.BoundingPoly.Vertices[0];
    var bottomRight = annotation.BoundingPoly.Vertices[2];

    var position = new Point(topLeft.X, topLeft.Y);
    var size = new Size(bottomRight.X - topLeft.X, bottomRight.Y - topLeft.Y);

    var rectangle = new Rectangle(position, size);

    return new SpatialText(annotation.Description, rectangle);
}

So let’s give the new debug view a go:

I didn’t offset the coordinates by the capture region…

private SpatialText ConvertToSpatialText(EntityAnnotation annotation)
{
    ...
    var rectangle = new Rectangle(position, size);
    rectangle.Offset(_captureOffset);

    return new SpatialText(annotation.Description, rectangle);
}

Thanks, ImGui for not crashing when handling characters you don’t have a texture for

Nice. Now I will have to consolidate the elements into coherent sentences.

Consolidating Neighbors

Consolidating neighboring rectangles is straightforward enough… I hope.

As far as I can tell, the order of the elements is top to bottom, right to left. So if I iterate through them in order and calculate the current element’s distance to all previously accumulated elements, it belongs to a prior text if the distance is below a threshold relative to the font size.

Is It In Distance?

Take for example the following scenario: I have already consolidated the text A and want to figure out whether B and C are within the distance threshold (half a font size) of it. The simplest approach would be to calculate the distance using the ‘Rectangle’ position (the top-left corner), producing the green lines. That would work for A and B, as the corners are close enough, but as you can see, the corners of A and C are too far apart.

So what I actually want is the “shortest distance” between two rectangles; “closest point of approach” is probably the more mathematically correct term for that (blue lines).


I tried thinking of an easy way to calculate it, but every solution I came up with involved figuring out the closest line between the rectangles, which would mean differentiating between 8 cases and calculating line segments. So: a lot of error-prone code to write and test. Also, my time on earth is limited.

So I started to look online for existing solutions, but most algorithms for that involved almost exactly what I was thinking of, and I couldn’t be bothered to port over partial Python implementations.

A 3D Python implementation of finding the closest point of approach between 2 lines.

Another approach I saw online involved using the vector between the geometric centers of the rectangles and applying the Pythagorean theorem a few times. That isn’t 100% optimal, but it would absolutely suffice for the problem described.
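For reference, a rough sketch of that center-vector idea (my own reconstruction, not code from that answer): take the per-axis distance between the centers, subtract the half-extents, and apply Pythagoras to whatever gap remains. For axis-aligned rectangles this clamped-gap version actually yields the exact edge-to-edge distance:

```
public static double ApproximateRectangleDistance(Rectangle a, Rectangle b)
{
    // Distance between the geometric centers, per axis
    var dx = Math.Abs((a.Left + a.Right) - (b.Left + b.Right)) / 2.0;
    var dy = Math.Abs((a.Top + a.Bottom) - (b.Top + b.Bottom)) / 2.0;

    // Subtract the half-sizes so only the gap between the edges remains
    var gapX = Math.Max(0, dx - (a.Width + b.Width) / 2.0);
    var gapY = Math.Max(0, dy - (a.Height + b.Height) / 2.0);

    // Pythagoras on the remaining gap
    return Math.Sqrt(gapX * gapX + gapY * gapY);
}
```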


So I went to sleep to throw a temper tantrum in my dreams about why I only want to write glue code, and woke up with an idea: just enlarge the rectangles by the distance threshold and see if they overlap.


And why is that easier? Because the native Rectangle struct has methods for both:

public static bool IsInDistance(this Rectangle a, Rectangle b, int distance)
{
    // Grow both rectangles by the threshold; if the inflated copies overlap,
    // the originals are in range (inflating both allows a gap of up to 2 * distance)
    var aCopy = Rectangle.Inflate(a, distance, distance);
    var bCopy = Rectangle.Inflate(b, distance, distance);

    return aCopy.IntersectsWith(bCopy);
}
private IEnumerable<ISpatialText> ConsolidateInternal(IList<ISpatialText> unconsolidatedTexts)
{
    var consolidatedTexts = new List<ISpatialText>();

    foreach (var text in unconsolidatedTexts)
    {
        var consolidatedTextItBelongsTo = consolidatedTexts.FirstOrDefault(x => BelongToSameText(x, text));

        if (consolidatedTextItBelongsTo == null)
            consolidatedTexts.Add(new AccumulatedSpatialText(text));
        else
            consolidatedTextItBelongsTo.Combine(text);
    }
    
    return consolidatedTexts;
}
        
private static bool BelongToSameText(ISpatialText textA, ISpatialText textB)
{
    var fontSizeCompareTolerance = textA.Metrics.AverageLetterHeight * 0.9;
    var xTolerance = textA.Metrics.AverageLetterWidth;
    var yTolerance = textA.Metrics.AverageLetterHeight;
    var maxDistance = xTolerance * yTolerance / 2;
    
    var areSimilarFontSize = Math.Abs(textA.Metrics.AverageLetterHeight - textB.Metrics.AverageLetterHeight) < fontSizeCompareTolerance;

    if (!areSimilarFontSize)
        return false;

    var areCloseEnough = textA.Area.IsInDistance(textB.Area, maxDistance);

    return areCloseEnough;
}

Let’s hit that OCR button and see what will have gone terribly wrong this time:

hey, at least the consolidation worked without crashing

private static bool BelongToSameText(ISpatialText textA, ISpatialText textB)
{
    ...
    // var maxDistance = xTolerance * yTolerance / 2;
    var maxDistance = (xTolerance + yTolerance) / 2;
    ...
}

And a few rendering tweaks

Now, how to deal with that upper-right panel issue?

Matching The Same Text Bubble

Currently I’m using proximity and font size as a simple solution for clustering the elements. But it does not quite work with the giant blob of text in the upper-right corner: because the font is so big, my “enlarge the rectangle by font size” solution causes it to overlap with the speech bubble to the left.

Also, in the middle-left panel, two text blocks didn’t get merged properly. So if I lower the distance threshold, that might fix the big text but cause problems for smaller fonts.

Well what to do?

I could of course try a visual approach that uses information from the source image to help, but that would mean work. Time to tweak the numbers / the algorithm.

private static bool BelongToSameText(ISpatialText textA, ISpatialText textB)
{
    var maxDistanceA = textA.Metrics.AverageLetterWidth + textA.Metrics.AverageLetterHeight;
    var maxDistanceB = textB.Metrics.AverageLetterWidth + textB.Metrics.AverageLetterHeight;

    var areCloseEnoughForA = textA.Area.IsInDistance(textB.Area, maxDistanceA);
    var areCloseEnoughForB = textA.Area.IsInDistance(textB.Area, maxDistanceB);

    return areCloseEnoughForA && areCloseEnoughForB;
}

Actually asking both rectangles if the other is in range helps a lot.

Are 2 calculations an algorithm?

It’s Translation Time (●ˇ∀ˇ●)

Now I can finally plug the OCR’d text into the translation API.

Not bad but there are multiple issues:

  • Characters are getting cut off at the end
  • The size of the rectangles has increased even though they are supposed to be the same as in the plain OCR version. Though that does give the text enough space to display properly.
  • “'” characters are getting mangled. Encoding issue?

Mangled Character

“&#39;” is the HTML-encoded representation of “'”.

public string MimeType { get; set; }
    in class TranslateTextRequest
Summary
    Optional. The format of the source text, for example, "text/html", "text/plain". If left blank, the MIME type defaults to "text/html".

Actually informative google, thanks.
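So the fix is to tell the API the source text is plain text. A sketch, assuming the Google.Cloud.Translate.V3 request object documented above (other fields elided):

```
var request = new TranslateTextRequest
{
    // The default is "text/html", which HTML-encodes characters like ' into &#39;
    MimeType = "text/plain",
    // ... contents, source/target language codes, parent, etc.
};
```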

Improving The Rendering

So the cutoff of characters in the text boxes is due to the ImGui window having an inner padding of 8px while I was using the OCR bounding rectangles directly for the size, plus a dumb mistake of mine forcing the inner content to be smaller than needed.

And the growth issue is due to random code I put in as a quick fix to better see what the text is during the debugging of the OCR consolidation process. And I’m too embarrassed to paste it here. Look in the git history if you are curious 😐

There are 2 issues at hand:

  • Text that does not have enough space in the original bounding box
  • Text that is too short for the original bounding box.

For text that is too long, the first solution that came to mind was to scale the boxes up depending on how much render area the text is missing. And conversely, to scale the font size up if there is additional space available.

private static (Rectangle compensatedArea, float compensatedFontScale) CompensateForRender(ISpatialText text)
{
    const float defaultFontScale = 0.5f;
    const int imGuiPadding = 16;

    var calculatedSize = ImGui.CalcTextSize(text.Text);

    var areaHeight = Math.Abs(text.Area.Height);
    var areaWidth = Math.Abs(text.Area.Width);

    var originalTextArea = calculatedSize.X * calculatedSize.Y * defaultFontScale;
    var additionalPaddingArea = originalTextArea + areaWidth * imGuiPadding + areaHeight * imGuiPadding;

    var neededArea = additionalPaddingArea;
    var availableArea = areaHeight * areaWidth;

    var missingArea = neededArea - availableArea;
    
    if (missingArea < 0)
    {
        var extraAreaFactor = availableArea / neededArea;

        return (text.Area, extraAreaFactor * defaultFontScale);
    }

    var scaleFactor = Math.Sqrt(neededArea / availableArea);

    var missingHeight = scaleFactor * text.Area.Height - text.Area.Height;
    var missingWidth = scaleFactor * text.Area.Width - text.Area.Width;
    
    var compensatedRectangle = Rectangle.Inflate(text.Area, (int) (missingWidth/2), (int) (missingHeight/2));
    
    return (compensatedRectangle, defaultFontScale);
}

The result may look stupid but at least it is faithful to the original :v

Next up will probably be a GUI to make the usage easier.