YOLO’ing that speech bubble detection.
As far as I understand how this works, I need a sufficiently large data set of images that I have to pre-process myself so that they include data on what should be detected. In my case, I would need to mark what a bubble is in each image and make that available to YOLO.
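To make "data on what should be detected" a bit more concrete: Darknet reads one text file per image, with one line per object in the form `class x_center y_center width height`, all four box values normalized to the image size. Below is a minimal sketch of that conversion. The function name, the example numbers, and the idea that class `0` stands for a speech bubble are assumptions for illustration only.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box (corner coordinates) into one Darknet label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: a bubble spanning (100, 50) to (300, 250) on an 800x1200 page.
# Class 0 is assumed to mean "speech bubble" here.
print(to_yolo_line(0, 100, 50, 300, 250, 800, 1200))
# prints: 0 0.250000 0.125000 0.250000 0.166667
```

Normalizing the coordinates means the label stays valid when the image is resized for training.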
Where Do I even start?
Ok, now that I have read into this for more than 5 seconds, I understand that YOLO is implemented by many machine learning frameworks, but I’m sticking to the OG Darknet implementation as that sounds really edgy and their logo is a pentagram. Nice.
- Setup Darknet
- Setup tagging software
- Get 2000 Images to train
- Label 100 Images
- Train Yolo on the 100 Images
- Use okaish-100-image-model to help label the rest so I only have to sit there and adjust the bounding boxes
- Train Yolo on the full 2000 images
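For the training steps in the plan above, Darknet wants a list of training images, a list of validation images, a names file, and a small `.data` file that ties them together. Here is a rough sketch of generating those. The file names (`obj.data`, `obj.names`), the 10% validation split, and the helper name are assumptions, not something taken from the actual setup used here.

```python
import random
from pathlib import Path


def write_darknet_dataset(image_paths, out_dir, class_names, valid_fraction=0.1, seed=0):
    """Write train.txt, valid.txt, obj.names and obj.data; return (train, valid) lists."""
    out_dir = Path(out_dir)
    (out_dir / "backup").mkdir(parents=True, exist_ok=True)  # Darknet saves weights here

    paths = sorted(str(p) for p in image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(paths) * valid_fraction))
    valid, train = paths[:n_valid], paths[n_valid:]

    (out_dir / "train.txt").write_text("\n".join(train) + "\n")
    (out_dir / "valid.txt").write_text("\n".join(valid) + "\n")
    (out_dir / "obj.names").write_text("\n".join(class_names) + "\n")
    (out_dir / "obj.data").write_text(
        f"classes = {len(class_names)}\n"
        f"train = {out_dir / 'train.txt'}\n"
        f"valid = {out_dir / 'valid.txt'}\n"
        f"names = {out_dir / 'obj.names'}\n"
        f"backup = {out_dir / 'backup'}\n"
    )
    return train, valid
```

Calling it with the labeled images and three hypothetical class names such as `["speech_bubble", "text_box", "text"]` produces a `.data` file that can be handed to Darknet's `detector train` command together with a `.cfg` file. Darknet looks for each image's label file next to the image, with the same base name and a `.txt` extension.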
Spoiler: If you can, spend money on getting someone else to do the tagging for you. There are multiple services that do that (Amazon Mechanical Turk, for example). Every second I sat there labeling the 100 image batch, I thought to myself: Is this what my life has come to? Why am I not doing literally anything else right now?
Setup Darknet
- CMake >= 3.12
- CUDA >= 10.0
- cuDNN >= 7.0
Just 3 dependencies? Easy peasy. Not really, because Nvidia makes it a pain to get a hold of the cuDNN dependencies without signing up. Also, a colleague of mine told me that even though you may think you have the right cuDNN version, most ML frameworks will not indicate whether you are actually using the right one. That results in longer training times while you believe this is the fastest you can go.
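One way to at least see which cuDNN version is installed is to read the version macros out of the cuDNN header. A small sketch follows; the function name is made up, and where the header lives on a given system is an assumption (cuDNN 7 defines the macros in `cudnn.h`, newer releases in `cudnn_version.h`). Note that this only shows what is installed, not what a framework was actually built or linked against.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the text of a cuDNN header, or None if missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)
```

Usage would be something like `parse_cudnn_version(open(path_to_header).read())`, with the result compared against the `>= 7.0` requirement listed above.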
Get 2000 Images to train
Done. I’m a data hoarder and have around 1300 comic-like books, most already in English and some in Japanese, so I will simply move 2000 images out of my zip files with a little helper program. I tried to achieve a mix of 1000 Japanese and 1000 non-Japanese images so the detection gets more varied.
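The original helper program is not shown, so the following is only a sketch of how such a tool could look: it collects every image inside a set of zip archives and copies a random sample into a folder. All names, the list of image extensions, and the `prefix` parameter are assumptions.

```python
import random
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_archive_images(archive_paths):
    """Return (archive, member) pairs for every image found in the given zip files."""
    found = []
    for archive in archive_paths:
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                if Path(name).suffix.lower() in IMAGE_SUFFIXES:
                    found.append((str(archive), name))
    return found


def extract_sample(archive_paths, out_dir, count, prefix="img", seed=0):
    """Copy a random sample of up to `count` images from the archives into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    candidates = list_archive_images(archive_paths)
    chosen = random.Random(seed).sample(candidates, min(count, len(candidates)))
    written = []
    for index, (archive, member) in enumerate(chosen):
        # Rename on the way out so files from different archives cannot collide.
        target = out_dir / f"{prefix}_{index:05d}{Path(member).suffix.lower()}"
        with zipfile.ZipFile(archive) as zf:
            target.write_bytes(zf.read(member))
        written.append(target)
    return written
```

To get the 1000/1000 mix, it could be called twice with different archive lists and prefixes, for example `extract_sample(japanese_zips, "dataset", 1000, prefix="jp")` and `extract_sample(other_zips, "dataset", 1000, prefix="en")`.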
Sudden Summary
I forgot how the rest went
I worked on the speech bubble detection and wrote half of this blog post in November of 2020. It is now January 2021 and I cannot remember half of what I did to get my speech bubble detection going. Due to the end of the year crunch I was too stressed out to work on this in my free time, and I also got myself a new desktop at home and managed to forget to back up one itsy bitsy tiny set of data from my old PC: all the training temp folders I made. I blindly trusted that I had everything related to my projects under version control, but with ML there’s a huge caveat: ML models are too big for any free VC like GitHub. Well, at least I can speak about the issue of how to host the ML model :v
The result
I remember that even with just the 100 image model the accuracy of the detection wasn’t half bad. The detected regions were only about 10% worse than Google’s, which was good enough for me at the time. I resolved to label about 100 images a day and did that for about 4 days, but then the work crunch hit and I lost any motivation.
What it can detect: speech bubbles, text boxes, and text. (They overlap most of the time, but not always.)
If anybody is interested in the “finished” scuffed YOLO models:
https://cdn.aris.moe/yolo/v1/yolo-manga.cfg
https://cdn.aris.moe/yolo/v1/yolo-manga.names
https://cdn.aris.moe/yolo/v1/yolo-manga.weights
The detection rate isn’t great, and I might revisit this in the future now that I have a non-shit GPU, but yeah.
Labeling tools I tried:
After looking around, I decided that the labeling tool I wanted should fulfill the following requirements:
- Good Keyboard/Mouse Workflow
When you are labeling 3 classes per image, occurring around 15 times, and are going for 2000 images, you want the workflow to be as efficient as possible. Any unnecessary click you save is a neuron more saved from insanity.
- “Reinforced training”/ Auto Labeling
What would be cool is to instantly use the existing 100 image model to pre-select everything in future unlabeled images, so that I only have to adjust the classes or bounding boxes
- Not be a pain in the ass to set up or use with Darknet
THAT’S IT. There’s nothing more I wanted, and for some reason nothing I could find would fulfill them without either compromises or additional scripting work.
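As for the "additional scripting work": the auto labeling requirement above roughly boils down to turning a model's detections into Darknet label files that a labeling tool can then load for correction. Here is a minimal sketch of that step. The detection tuple layout, the 0.5 confidence cutoff, and the function name are assumptions, and how the detections are produced (Darknet itself, or some other runtime) is left open.

```python
from pathlib import Path


def write_prelabels(detections, img_w, img_h, label_path, min_confidence=0.5):
    """Write confident detections as a Darknet label file; return how many were kept.

    `detections` is an iterable of (class_id, confidence, x, y, w, h), with the box
    given in pixels as top-left corner plus width and height.
    """
    lines = []
    for class_id, confidence, x, y, w, h in detections:
        if confidence < min_confidence:
            continue  # weak guesses cost more time to fix than to draw by hand
        # Clamp the box to the image so the normalized values stay within [0, 1].
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(img_w, x + w), min(img_h, y + h)
        if x1 <= x0 or y1 <= y0:
            continue  # box lies completely outside the image
        x_center = (x0 + x1) / 2 / img_w
        y_center = (y0 + y1) / 2 / img_h
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} "
            f"{(x1 - x0) / img_w:.6f} {(y1 - y0) / img_h:.6f}"
        )
    Path(label_path).write_text("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```

The confidence cutoff is the knob to play with: set it too low and the pre-labels are full of junk boxes to delete, set it too high and most boxes still have to be drawn by hand.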
Pro
- REALLY great UX/Workflow, hands down best there is
- Easy to set up and manage data
- Supports auto labeling with trained model
Cons
- Doesn’t support YOLO v3 or v4 export. Would need to write my own converter
- Couldn’t figure out how to set up the auto labeling
Pro
- Good Workflow
- Easy to set up
- Supports auto labeling with trained model and works with yolo out of the box
Cons
- A bit peculiar in how it wants its data organized
- Can’t resize bounding boxes
The following I looked into and dismissed either due to Workflow issues:
Final Choice
DeepLabel
Really the only viable solution out of them all.