ML Files are too big for github but my AWS S3 cost paranoia is even bigger

Posted by Programming Is Moe on Monday, February 15, 2021


After training my own ML model for the first time, I came to realise something horrible when trying to push my commit: the 240MB weights file is too big for GitHub's 100MB file limit.

So what to do?

tl;dr

I treated the ML files as a download-on-demand resource, fetching them at runtime from a dedicated paid CDN instead of properly embedding them in the repository under version control, resulting in a smaller download size (40MB instead of 280MB) for about 50% of users.

  • The semi-variable pricing model of “CDNs” is a better fit for my non-existent user base.
  • GitLab would have been a viable free solution but I prefer staying on GitHub for e-bragging non-reasons.
  • AWS can be risky as there is a chance of malicious usage against me, and configuring AWS to auto-stop the S3 bucket is beyond what I want to learn.
  • There are free, no-strings-attached hosters like https://pony.icu/ but… I… just dunno man. It feels like being at the mercy of the whims of some random person. Also, as a weeb I'm contractually obligated to reject bronies.
  • https://bunny.net/ is a nice compromise with easy-to-set bandwidth caps but effectively costs “$10 per year”.
  • Also, here is a nice listing of free-ish, single-dev-scale solutions: https://free-for.dev/#/?id=web-hosting

Potential Git Solutions

Github LFS

I REALLY REALLY want to distribute my source code WITH the models, as they are an integral part of the Tesseract OCR I put in, so of course my first instinct was to read the GitHub documentation for a way. And right there was GitHub's proposed solution: Git Large File Storage (LFS)

On a technical level, LFS makes Git handle large files differently from a normally committed blob: the repository only stores a small pointer file, while the actual content lives on a separate LFS server. But that's not the reason I'm using it.

GitHub allows files of ANY size as long as they are stored as LFS objects. Neat, but why even make that distinction, GitHub?

[Image: GitHub's LFS bandwidth cap]
Ah. Bandwidth billing. That's why

Account-associated LFS bandwidth limit

GitHub imposes a limit of 1GB per month per account for ALL outbound LFS traffic. Meaning that whenever anyone checks out or downloads your repo, each LFS object transferred counts against your limit. Once the limit is reached, the LFS objects can no longer be downloaded.

To increase the limit you need to buy additional bandwidth in 50GB increments for $5 ($0.10/GB).
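To get a feel for what that quota means for a repo with a ~240MB weights file, here is a quick back-of-envelope calculation in Python. Only the file size and the prices quoted above go in; the rest is arithmetic:

    # Back-of-envelope: how far does GitHub's LFS bandwidth quota stretch
    # when every clone/checkout pulls a ~240 MB weights file?
    WEIGHTS_GB = 240 / 1024      # ~0.234 GB per clone
    FREE_QUOTA_GB = 1            # free LFS bandwidth per account per month
    PACK_GB, PACK_USD = 50, 5.0  # one purchasable data pack ($0.10/GB)

    free_clones = FREE_QUOTA_GB // WEIGHTS_GB
    cost_per_clone = WEIGHTS_GB * (PACK_USD / PACK_GB)
    clones_per_pack = PACK_GB / WEIGHTS_GB

    print(f"free clones per month:     {free_clones:.0f}")      # ~4
    print(f"cost per clone after that: ${cost_per_clone:.3f}")  # ~$0.023
    print(f"clones per $5 data pack:   {clones_per_pack:.0f}")  # ~213

So roughly four clones per month before the free quota is gone.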

In a way that's understandable, bandwidth is not free. But, like, I pay for GitHub Pro and that won't even give me more bandwidth than the free tier… why do I have GitHub Pro again?

Moving to Gitlab

GitLab has the more sensible approach of limiting a free repository to 10GB in total, rather than GitHub's cumbersome 100MB per-file limit. That would fix my problem, though I can see how their approach might be a problem for other people.

I briefly considered hosting my own git server but I kinda like having everything on the most popular site for bragging rights.

Treating it as an on-demand resource

So after thinking about it for a bit I decided on a compromise: download the weights on demand when the feature is used. It makes development a bit less convenient, but at least the initial download is smaller for the end user (~40MB instead of ~280MB), which is especially useful for those who would rather use the Google Cloud OCR instead of Tesseract and wouldn't need to download the weights anyway.
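The on-demand logic itself is tiny. Roughly this, as a minimal Python sketch — the URL, file name and cache directory are placeholders for wherever the weights end up being hosted, and real code would also want error handling and a checksum:

    import os
    import urllib.request

    # Placeholders: the real URL points at whatever CDN/bucket hosts the weights.
    WEIGHTS_URL = "https://cdn.example.com/jpn.traineddata"
    WEIGHTS_PATH = os.path.join("tessdata", "jpn.traineddata")

    def ensure_weights() -> str:
        """Download the Tesseract weights on first use and cache them locally."""
        if not os.path.exists(WEIGHTS_PATH):
            os.makedirs(os.path.dirname(WEIGHTS_PATH), exist_ok=True)
            print("Downloading OCR weights (~240 MB), this only happens once…")
            urllib.request.urlretrieve(WEIGHTS_URL, WEIGHTS_PATH)
        return WEIGHTS_PATH

Users who pick the Google Cloud OCR path simply never call this, so they never pay the 240MB.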

On a Server

This blog is served by a DigitalOcean droplet, so I did consider just hosting the weights on that existing droplet.

But then I would need to pay attention to how I handle the droplet so I don't suddenly make my weights vanish. Especially in the future, when I might drop the project and stop caring about this blog, I'd want my source code and exe to still be usable even if I forget about or don't want to deal with any of this anymore. Too often I have seen promising projects or tools that I needed, only to be greeted by a 404 after clicking the download link.

Alternatively, provisioning a second droplet for $5/month seems like a bit of a waste considering that I have exactly one user. It would come with 1TB of free bandwidth, though.

AWS S3

I'm a big boy programmer who really needs AWS for a throwaway project! Setting up an S3 bucket to just publicly host the files was really easy: open the bucket in the AWS console, enable public access and upload the weights.
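The same setup can also be scripted. A rough sketch with boto3 — the bucket name is made up, and depending on the bucket's “Block Public Access” settings you may need a public-read bucket policy instead of the object ACL:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-ocr-weights"  # made-up bucket name

    # Upload the weights and mark the object as publicly readable.
    s3.upload_file(
        Filename="jpn.traineddata",
        Bucket=BUCKET,
        Key="jpn.traineddata",
        ExtraArgs={"ACL": "public-read"},
    )

    print(f"https://{BUCKET}.s3.amazonaws.com/jpn.traineddata")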

I was already done implementing my application with the AWS download working and was talking to a friend about the outbound traffic cost of roughly $0.09/GB. I jokingly said to him that, if he really wanted to annoy me, he could just keep downloading the files over and over again to make me lose money. Wait… WAIT. How bad could that be?

Let’s assume I wanted to annoy someone without going full botnet on them:

  • https://calculator.aws
  • 100Mbit connection
  • My contract has no data caps
  • My IP resets every 24 hours because that's how German internet works
  • 24/7 for a month
> ((30 × 24 × 60 × 60 s) × 100 Mbit/s) / 8 / 1024 = 31640.625 GB
> ≈ 30 TB of traffic
Pricing calculation (AWS calculator output):

    Outbound:
    Internet: Tiered pricing for 30720 GB:
    1 GB x 0 USD per GB = 0.00 USD
    10239 GB x 0.09 USD per GB = 921.51 USD
    20480 GB x 0.085 USD per GB = 1740.80 USD 

    Data Transfer cost (monthly): 2,662.31 USD
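The same numbers fall out of a few lines of Python, using the tier boundaries and prices shown in the calculator output above (the calculator rounds my 31,640 GB down to a flat 30 TB = 30,720 GB):

    # Reproduce the back-of-envelope traffic figure and the tiered egress cost.
    seconds = 30 * 24 * 60 * 60              # a month of 24/7 downloading
    traffic_gb = seconds * 100 / 8 / 1024    # 100 Mbit/s -> MB/s -> GB
    print(f"traffic: {traffic_gb:.1f} GB")   # 31640.6 GB, i.e. ~30 TB

    # Tiered pricing as shown by the calculator: first GB free,
    # next 10,239 GB at $0.09/GB, the rest at $0.085/GB.
    billed_gb = 30 * 1024                    # the calculator's rounded 30 TB
    tiers = [(1, 0.0), (10239, 0.09), (float("inf"), 0.085)]

    cost, remaining = 0.0, billed_gb
    for size, price in tiers:
        used = min(remaining, size)
        cost += used * price
        remaining -= used

    print(f"monthly egress bill: ${cost:,.2f}")  # ~$2,662.31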

I know, I know. A lot of assumptions about how AWS would handle a single IP hammering the same bucket for 24 hours. And I'm pretty sure my ISP, even without data caps, would make use of their right to throttle once they detect this very abnormal private internet usage. BUT 2600 dollars! Yeah… I gotta mitigate that. Just in case.

Protecting against malicious S3 bandwidth usage

I did a quick google on what you can do in theory without hosting any additional service or API gateway, and AWS has a feature for budget threshold notifications that can also trigger automated IAM actions. These actions mostly consist of applying policies to an IAM-managed resource (S3 in this case). I tried setting it up, but without a degree in AWS I can't tell whether what I configured would actually work (and AWS wouldn't tell me either); the effect I was actually after is sketched below the list.

  • I could have configured the policy in a way that does not have the desired effect.
  • Or the user used by the budget action might not have sufficient rights to apply the policy.
  • And a plain manual e-mail notification doesn't feel safe enough.
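For reference, the effect I wanted to trigger is basically “make the bucket private again once a budget alarm fires”. Done by hand, that is a few lines of boto3 (bucket name made up; wiring it up so a budget action runs this automatically is exactly the part I couldn't verify):

    import boto3

    s3 = boto3.client("s3")

    # Emergency brake: re-enable "Block Public Access" on the bucket so the
    # weights can no longer be downloaded anonymously.
    s3.put_public_access_block(
        Bucket="my-ocr-weights",  # made-up bucket name
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )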

Without being willing to properly learn how any of this works, I should probably look for a simpler alternative.

Joining the brony dark side

While searching for “free” solutions to my problem I found this neat little write-up of free-ish, single-dev-scale solutions: https://free-for.dev/#/?id=web-hosting. Listed on it is https://pony.icu/, a free, as in free free free, hoster. They advertise themselves as having unlimited disk space, unlimited bandwidth and a free subdomain included.

I… I just don't know, man. When has anything good in life ever happened? Never. I briefly read through the TOS, and as far as I can tell my hosting use case would be allowed. I also couldn't find anything that stood out as malicious or a major downside.

I didn't bother signing up to try it out, but it's kinda neat to see and talk about.

Also, why are there always so many weebs, bronies or furries in the programming community? Oh wait, I'm part of the problem.

My random final solution: bunnyCDN

https://bunny.net/

I literally googled “cheapest cdn” and found out about KeyCDN. Researching them a bit I found some not-so-nice reddit posts, and their UI doesn't appeal to me either. Someone somewhere mentioned BunnyCDN being the new hot replacement for them after they went to shit, so I checked it out.

I'm 99% sure any other CDN would have sufficed, but they are nice enough. Something I was missing from AWS was an easy way to set a threshold that stops the bucket; bunny offers exactly that in the UI, and I set it to 1TB per month. They also offer two different CDN pricing modes, Standard (~$0.01/GB) and High Volume (~$0.005/GB). But they have this weird thing where you need to top up your balance with $10 each year, essentially making this a $10+/year service.
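With the 1TB cap in place, the worst case is easy to bound. Rough numbers using the rates above (ignoring regional price differences and free tiers):

    # Worst-case monthly bill with a hard 1 TB bandwidth cap.
    cap_gb = 1024

    print(f"bunny Standard:    ${cap_gb * 0.010:6.2f}")  # ~$10.24
    print(f"bunny High Volume: ${cap_gb * 0.005:6.2f}")  # ~$5.12
    print(f"AWS S3 egress:     ${cap_gb * 0.090:6.2f}")  # ~$92.16, with no built-in cap

That bounded worst case, plus the cap switch in the UI, is what finally let my cost paranoia rest.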