• sugar_in_your_tea@sh.itjust.works · 18 hours ago

    Why is it that only the larger cloud providers are acceptable? What’s wrong with one of the smaller providers like Linode/Akamai? There are a lot of crappy options, but also plenty of decent ones. If you build your infrastructure over a few different providers, you’ll pay more upfront in engineering time, but you’ll get a lot more flexibility.

    For something like Signal, it should be pretty easy to build this type of redundancy since data storage is minimal and sending messages probably doesn’t need to use that data storage.

      • sugar_in_your_tea@sh.itjust.works · 16 hours ago

        It is, compared to AWS, Azure, and Google Cloud. Here’s 2024 revenue to give an idea of scale:

        • Akamai - $4B, Linode itself is ~$100M
        • AWS - $107B
        • Azure - ~$75B
        • Google Cloud - ~$43B

        The smallest on this list has 10x the revenue of Akamai.

        Here are a few other providers for reference:

        • Hetzner (what I use) - €367M
        • Digital Ocean - $692.9M
        • Vultr (my old host) - not public, but estimates are ~$37M

        I’m arguing they could put together a solution with these smaller providers. That takes more work, but you’re rewarded with more resilience and probably lower hosting costs. Once you have two providers in your infra, it’s easier to add another. Maybe start with using them for disaster recovery, then slowly diversify the hosting portfolio.
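
        The "start with disaster recovery" step can be sketched as a dumb health check that prefers the primary provider and falls back to the standby. This is just an illustration, not anyone's actual setup; the hostnames are made up.

```python
# Hypothetical sketch: prefer the primary provider, fall back to a
# standby at a second provider if the primary is unreachable.
# Hostnames are invented for illustration.
import socket

ENDPOINTS = [
    ("primary.hetzner.example.net", 443),   # main provider
    ("standby.vultr.example.net", 443),     # disaster-recovery provider
]

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_endpoint(endpoints=ENDPOINTS):
    """Return the first healthy endpoint in priority order, or None."""
    for host, port in endpoints:
        if is_reachable(host, port):
            return (host, port)
    return None
```

        Once this shape exists for two providers, adding a third is just another list entry, which is the point about diversifying gradually.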

        • Squizzy@lemmy.world · 7 hours ago

          10% the size of Google is decent. If I had ten percent of a tech giant’s reach in any particular sector, I would consider myself significant, but I get where you are coming from.

    • Encrypt-Keeper@lemmy.world · 18 hours ago

      Also, you know… building your own data centers / co-locating. Even with the added man-hours required, it ends up being far cheaper.

      • sugar_in_your_tea@sh.itjust.works · 16 hours ago

        But far less reliable. If your data center has a power outage or internet disruption, you’re screwed. Signal isn’t big enough to have several data centers for geographic diversity and redundancy; they’re maybe a few racks total.

        Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever? If there’s an outage, you’re talking hours to days to get another server up, vs minutes for rented hosting.

        For the scale that Signal operates at and the relatively small processing needs, I think you’d want lots of small instances. To route messages, you need very little info, and messages don’t need to be stored. I’d rather have 50 small replicas than 5 big instances for that workload.

        For something like Lemmy, colo makes a ton of sense though.
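
        To sketch what "lots of small replicas" could look like: if replicas are interchangeable, a client or router can pick one by hashing the recipient ID, with no shared routing state at all. This is illustrative only, not Signal’s actual design; the hosts and replica count are invented.

```python
# Illustrative stateless routing: hash the recipient ID to choose one of
# many small, interchangeable replicas. No coordination or shared state
# is needed, and the replica list can grow or shrink.
import hashlib

REPLICAS = [f"relay-{i:02d}.example.net" for i in range(50)]  # hypothetical hosts

def pick_replica(recipient_id: str, replicas=REPLICAS) -> str:
    """Deterministically map a recipient to a replica."""
    digest = hashlib.sha256(recipient_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```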

        • Encrypt-Keeper@lemmy.world · 15 hours ago

          It’s plenty reliable. AWS is just somebody else’s datacenter.

          Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever?

          Most Colo DCs offer ad hoc remote hands, but that’s beside the point. What do you mean here by “Various parts of the world”? In Signal’s case even Amazon didn’t need anyone in “various parts of the world” because the Signal infra on AWS was evidently in exactly one part of the world.

          If there’s an outage, you’re talking hours to days to get another server up, vs minutes for rented hosting.

          You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?

          • sugar_in_your_tea@sh.itjust.works · 13 hours ago

            the Signal infra on AWS was evidently in exactly one part of the world.

            We don’t necessarily know that. All I know is that AWS’s load balancers had issues in one region. It could be that they use that region for a critical load balancer, but they have local instances in other parts of the world to reduce latency.

            I’m not talking about how Signal is currently set up (maybe it is that fragile), I’m talking about how it could be set up. If their issue is merely w/ the load balancer, they could have a bit of redundancy in the load balancer w/o making their config that much more complex.

            You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?

            No, I mean if they had a proper distributed network of servers across the globe and were able to reroute traffic to other regions when one has issues, there could be minimal disruption to the service overall, with mostly local latency spikes for the impacted region.

            My company uses AWS, and we had a disaster recovery mechanism almost trigger that would move our workload to a different region. The only reason we didn’t trigger it is because we only need the app to be responsive during specific work hours, and AWS recovered by the time we needed our production services available. A normal disaster recovery takes well under an hour.

            With a self-hosted datacenter/server room, if there’s a disruption, there is usually no backup, so you’re out until the outage is resolved. I don’t know if Signal has disaster recovery or if they used it, I didn’t follow their end of things very closely, but it’s not difficult to do when you’re using cloud services, whereas it is difficult to do when you’re self-hosting. Colo is a bit easier since you can have hot spares in different regions/overbuild your infra so any node can go down.

            • Encrypt-Keeper@lemmy.world · 11 hours ago

              It was a DNS issue with DynamoDB; the load balancer issue was a knock-on effect after the DNS issue was resolved. But the problem is it was a ~15-hour outage, and a big reason behind that was the fact that the load in that region is massive. Signal could very well have had their infrastructure in more than one availability zone, but since the outage affected the entire region, they were screwed anyway.

              You’re right that this can be somewhat mitigated by having infrastructure in multiple regions, but if they don’t, the reason is cost. Multi-region redundancy costs an arm and a leg. You can accomplish that same redundancy via Colo DCs for a fraction of the cost, and when you do fix the root issue, you won’t then have your load balancers fail on you because in addition to your own systems you have half the internet all trying to pass its backlog of traffic at once.

              • sugar_in_your_tea@sh.itjust.works · 8 hours ago

                Multi-region redundancy costs an arm and a leg

                Yes, if you buy an off-the-shelf solution, it’ll be expensive.

                I’m suggesting treating VPS instances like you would a colo setup. Let cloud providers manage the hardware, and keep the load balancing in house. For Signal, this can be as simple as client-side latency/load checks. You can still colo in locations with heavier load; that’s how some Linux distros handle repo mirrors, and it works well. Signal’s data needs should be so low that simple DB replicas should be sufficient.