cross-posted from: https://lemmy.zip/post/51866711
Signal was just one of many services brought down by the AWS outage.
Just read through the Bluesky thread and it’s obvious that she’s a CEO who has no idea how to code or design infrastructure
> It’s leasing access to a whole sprawling, capital-intensive, technically-capable system that must be just as available in Cairo as in Cape Town, just as functional in Bangkok as in Berlin.
Yeah, then why was Signal completely down when a single region (us-east-1) failed and all the others were working perfectly?
Did it ever come to your brilliant mind that your system design might be the problem?
Swallow your pride, admit that you screwed up, tell people that you are no longer going to rely on a single S3 bucket in us-east-1, and stop the finger-pointing.
But you don’t even manage to host a properly working status page or give a technical explanation of your outages, so I guess that train left the station long ago…
Way to shoot the messenger there. Or are you also taking that pitchfork after Jassy?
Didn’t only one AWS region go down? Maybe before even thinking about anything else, they should focus on redundancy within AWS
us-east-1 went down. The problem is that IAM’s control plane runs through that region. Any code relying on an IAM role would not be able to authenticate. Think of it like a username in a Windows domain: IAM encompasses everything you are allowed to view, change, launch, etc.
I hardly touched AWS at my last job, but listening to my teammates and seeing their code led me to believe IAM is used everywhere.
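For illustration, this is roughly what that dependency looks like in code (boto3, with a made-up role ARN, nothing to do with Signal’s actual setup): the moment STS/IAM can’t answer, nothing downstream can get credentials, even if the service you actually wanted is healthy.

```python
# Sketch only: hypothetical role ARN, not anyone's real configuration.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def get_temp_credentials(role_arn: str):
    """Ask STS for temporary credentials for an IAM role."""
    sts = boto3.client("sts", region_name="us-east-1")
    try:
        resp = sts.assume_role(
            RoleArn=role_arn,
            RoleSessionName="example-session",
            DurationSeconds=900,
        )
        return resp["Credentials"]
    except (ClientError, EndpointConnectionError) as err:
        # If IAM/STS is impaired, everything that needs these credentials
        # fails downstream, regardless of how healthy that service is.
        raise RuntimeError(f"could not assume role: {err}")

creds = get_temp_credentials("arn:aws:iam::123456789012:role/example-app")
```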
How is that even legal? I thought there were data export laws in the EU
Apparently, even if you are fully redundant, there are a lot of core services in us-east-1 that you rely on
No, there aren’t. Not if you design your infrastructure correctly, of course…
Wrong. Stuff that wasn’t even in us-east-1 went down too. DNS is global
This has been my biggest pet peeve in the wake of the AWS outage. If you’d built for high availability and continuity, this event would at most have been a minor blip in your services.
Yeah, but if you want real redundancy, you pay double. My team looked into it. Even our CEO, no tightwad, just laughed and shook his head when we told him.
Why is it that only the larger cloud providers are acceptable? What’s wrong with one of the smaller providers like Linode/Akamai? There are a lot of crappy options, but also plenty of decent ones. If you build your infrastructure over a few different providers, you’ll pay more upfront in engineering time, but you’ll get a lot more flexibility.
For something like Signal, it should be pretty easy to build this type of redundancy since data storage is minimal and sending messages probably doesn’t need to use that data storage.
Akamai isn’t small hehe
It is, compared to AWS, Azure, and Google Cloud. Here’s 2024 revenue to give an idea of scale:
- Akamai - $4B, Linode itself is ~$100M
- AWS - $107B
- Azure - ~$75B
- Google Cloud - ~$43B
The smallest on this list has 10x the revenue of Akamai.
Here are a few other providers for reference:
- Hetzner (what I use) - €367M
- Digital Ocean - $692.9M
- Vultr (my old host) - not public, but estimates are ~$37M
I’m arguing they could put together a solution with these smaller providers. That takes more work, but you’re rewarded with more resilience and probably lower hosting costs. Once you have two providers in your infra, it’s easier to add another. Maybe start with using them for disaster recovery, then slowly diversify the hosting portfolio.
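As a sketch of what “start with disaster recovery” could look like: a dumb health checker that repoints DNS at a warm standby on the second provider once the primary stops answering. Everything here (endpoints, the `update_dns` hook) is hypothetical; in practice you’d call your DNS provider’s API or use a managed health check.

```python
# Toy failover sketch: names, endpoints, and addresses are made up.
import time
import urllib.request

PRIMARY = "https://chat.example.com/healthz"   # e.g. main provider
STANDBY_IP = "203.0.113.10"                    # warm standby at a second provider

def healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:              # DNS failure, refused connection, timeout, ...
        return False

def update_dns(ip: str) -> None:
    """Placeholder: call your DNS provider's API to repoint the record."""
    print(f"would repoint chat.example.com -> {ip}")

failures = 0
while True:
    failures = 0 if healthy(PRIMARY) else failures + 1
    if failures >= 3:            # require a few consecutive failures
        update_dns(STANDBY_IP)   # fail over to the second provider
        break
    time.sleep(30)
```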
Also, you know… building your own data centers / co-locating. Even with the added man-hours required, it ends up being far cheaper.
But far less reliable. If your data center has a power outage or internet disruption, you’re screwed. Signal isn’t big enough to have several data centers for geographic diversity and redundancy; they’re maybe a few racks total.
Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever? If there’s an outage, you’re talking hours to days to get another server up, vs minutes for rented hosting.
For the scale that signal operates at and the relatively small processing needs, I think you’d want lots of small instances. To route messages, you need very little info, and messages don’t need to be stored. I’d rather have 50 small replicas than 5 big instances for that workload.
For something like Lemmy, colo makes a ton of sense though.
It’s plenty reliable. AWS is just somebody else’s datacenter.
> Colo is more feasible, but who is going to travel to the various parts of the world to swap drives or whatever?
Most Colo DCs offer ad hoc remote hands, but that’s beside the point. What do you mean here by “Various parts of the world”? In Signal’s case even Amazon didn’t need anyone in “various parts of the world” because the Signal infra on AWS was evidently in exactly one part of the world.
> If there’s an outage, you’re talking hours to days to get another server up, vs minutes for rented hosting.
You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?
> the Signal infra on AWS was evidently in exactly one part of the world.
We don’t necessarily know that. All I know is that AWS’s load balancers had issues in one region. It could be that they use that region for a critical load balancer, but they have local instances in other parts of the world to reduce latency.
I’m not talking about how Signal is currently set up (maybe it is that fragile), I’m talking about how it could be set up. If their issue is merely w/ the load balancer, they could have a bit of redundancy in the load balancer w/o making their config that much more complex.
> You mean like the hours it took for Signal to recover on AWS, meanwhile it would have been minutes if it was their own infrastructure?
No, I mean if they had a proper distributed network of servers across the globe and were able to reroute traffic to other regions when one has issues, there could be minimal disruption to the service overall, with mostly local latency spikes for the impacted region.
My company uses AWS, and we had a disaster recovery mechanism almost trigger that would move our workload to a different region. The only reason we didn’t trigger it is because we only need the app to be responsive during specific work hours, and AWS recovered by the time we needed our production services available. A normal disaster recovery takes well under an hour.
With a self-hosted datacenter/server room, if there’s a disruption, there is usually no backup, so you’re out until the outage is resolved. I don’t know if Signal has disaster recovery or if they used it, I didn’t follow their end of things very closely, but it’s not difficult to do when you’re using cloud services, whereas it is difficult to do when you’re self-hosting. Colo is a bit easier since you can have hot spares in different regions/overbuild your infra so any node can go down.
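For reference, on AWS the “reroute to another region” piece can be as small as a pair of failover DNS records tied to a health check. A rough boto3 sketch; the hosted zone ID, domain, and addresses are placeholders, not anyone’s real setup:

```python
# Rough Route 53 failover-record sketch; IDs and addresses are placeholders.
import boto3

r53 = boto3.client("route53")

def failover_record(identifier: str, role: str, ip: str, health_check_id=None):
    record = {
        "Name": "api.example.com.",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,              # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

r53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",
    ChangeBatch={
        "Changes": [
            # Primary region answers while its health check passes...
            failover_record("use1", "PRIMARY", "198.51.100.10", "hc-primary-id"),
            # ...otherwise DNS answers with the standby region instead.
            failover_record("usw2", "SECONDARY", "192.0.2.20"),
        ]
    },
)
```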
It was a DNS issue with DynamoDB; the load balancer issue was a knock-on effect after the DNS issue was resolved. But the problem is it was a ~15-hour outage, and a big reason for that was the fact that the load in that region is massive. Signal could very well have had their infrastructure in more than one availability zone, but since the outage affected the entire region, they were screwed.
You’re right that this can be somewhat mitigated by having infrastructure in multiple regions, but if they don’t, the reason is cost. Multi-region redundancy costs an arm and a leg. You can accomplish the same redundancy via colo DCs for a fraction of the cost, and when you do fix the root issue, you won’t then have your load balancers fail on you because, in addition to your own systems, half the internet is trying to push its backlog of traffic through at once.
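And if the root cause is one regional endpoint (as with DynamoDB here), even a crude client-side fallback buys you something, provided the data is replicated across regions (e.g. via global tables). A hedged sketch assuming boto3 and a made-up table name:

```python
# Sketch assuming the table is replicated to both regions (e.g. a global table).
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

def get_item(table_name: str, key: dict):
    last_err = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(table_name)
            return table.get_item(Key=key).get("Item")
        except (ClientError, EndpointConnectionError) as err:
            last_err = err   # endpoint unreachable or erroring: try the next region
    raise RuntimeError(f"all regions failed: {last_err}")

item = get_item("messages", {"conversation_id": "abc123"})
```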
Matrix solved this with decentralization and federation. Don’t tell me it’s not possible.
Alright, but then at the very least have a fallback implemented. Right?
Her real comment was that there are only 3 major cloud providers they can consider: AWS, GCP, and Azure. They chose AWS and AWS only. So there are a few options for them going forward:

1. Keep doing what they’re doing and hope a single cloud provider can improve reliability.
2. Move to a multi-cloud architecture, given that the odds of more than one major provider going down simultaneously are much lower.
3. Build their own datacenters or use colos, which have a learning curve but are still viable alternatives. Those who are serious about software own their own hardware, after all.
Each choice has its strengths and drawbacks. The economics are tough with any choice. Comes down to priorities, ability to differentiate, and value in differentiation :)
Meredith mentioned in a reply to her posts that they do leverage multi-cloud and were able to fall back onto GCP (Google Cloud Platform), which enabled Signal to recover more quickly than just waiting on AWS. I’d link to the source but I’m on my phone; it’s somewhere in this thread: https://mastodon.world/@Mer__edith/115445701583902092
What reason do they give for only wanting to use those three cloud providers? There are many others.
Those are the only 3 that matter at the top tier/enterprise class of infrastructure. Oracle could be considered as well for nuanced/specialized deployments that are (largely) Oracle DB heavy; but AWS is so far ahead of Azure and GCP from a tooling standpoint it’s not even worth considering the other two if AWS is on the table.
It’s so bad with other cloud providers that ones like Azure offer insane discounts on their MSSQL DB licensing (basically “free”) just to get you to use them over AWS. Sometimes the cost savings are worth it, but you take a usability and infrastructure hit by using anything other than AWS.
I honestly, legitimately, wish there was some other cloud provider out there that could do what AWS can do, but they don’t exist. Everyone else is a pale imitation from a devops perspective. It sucks. There should be other real competitors, especially to the US-based cloud companies since the US cannot be trusted anymore, but they just don’t exist without taking a huge hit in tools, APIs, and reliability options compared to AWS.
I always wondered why people don’t implement a multi-cloud infrastructure if they want/need extra HA. And I know Oracle offers a solution with Azure and GCP, with AWS on the horizon. Not to advertise for Oracle, because they’re terrible otherwise, but I can’t imagine wanting a multi-cloud option and not considering them.
Multi-cloud is very difficult to do well.
Multi-region is already hard enough, with transactional management not being easy to split between regions, and multi-cloud is another order of magnitude more difficult than multi-region.
With that said, us-east-2 and others were still up, so if they were just multi-region and had failed over to us-east-2, they would have been fine.
Scale; they need worldwide coverage.
The big 3 also offer disgustingly fast interconnection. Google, Amazon and Microsoft lay their own undersea fiber for better performance.
If willing to sacrifice a bit of everything, OVH has North-American and European locations, as well as one in India, one in Singapore and one in Australia. They’re building a few more in India, one in Dubai, two in Africa, one in NZ and 3 in South America. Once they add a few more on top of those, that’s damn near worldwide coverage too. And OVH is a French company, so the US government has less leverage over it than Amazon.
And yet a single AWS region going down caused an outage?
Yes, because scale is not the same as redundancy.
They are serving 1-on-1 chats and group chats. That practically partitions itself. There are many server lease options all over the world. My assumption is that they use some AWS service and now can’t migrate off it. But you need an on-call team anyway, so you aren’t buying that much convenience.
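To illustrate “partitions itself”: routing a conversation to one of many small relays is a few lines of consistent hashing. Purely a toy sketch, nothing to do with Signal’s actual design:

```python
# Toy consistent-hash ring: route each conversation to one of many small relays.
import bisect
import hashlib

def _h(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points so load spreads evenly.
        self._points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [p[0] for p in self._points]

    def node_for(self, conversation_id: str) -> str:
        i = bisect.bisect(self._keys, _h(conversation_id)) % len(self._keys)
        return self._points[i][1]

ring = Ring([f"relay-{i}" for i in range(50)])   # 50 small replicas
print(ring.node_for("group:family-chat"))        # same chat always hits the same relay
```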
> There are many server lease options all over the world
It increases complexity a lot to go with a bunch of separate server leases. There’s a reason global companies use hyperscalers instead of getting VPSes in 30 or 40 different countries.
I hate the centralization as much as everyone else, but for some things it’s just not feasible to go on-prem. I do know of an exception: I used to work at a company with a pretty large and widely spread-out customer base (big corps on multiple continents) that had its own k8s cluster in a super secure colocation space. But our backend was always slow to some degree (in multiple cases I optimized multi-second API endpoints down to 10-200 ms), we used asynchronous processing for the truly slow things instead of letting the user wait for a multi-minute API request, and it just wasn’t the sort of application that needs to be super fast anyway, so the extra milliseconds of latency didn’t matter much, whether it was 50 or 500.
But with a chat app, users want it to be fast. They expect their messages to be sent as soon as they hit the send button. It might take longer to actually reach the other people in the conversation, but it needs to be fast enough that if the user hits send and then immediately closes the app, it’s sent already. Otherwise it’s bad UX.
It’s weird for Signal not to be able to do what Telegram does. Yes, for this particular purpose they are no different.
Telegram is basically not even encrypted. They are not offering the same service.
For the purpose of “shoot a message, go offline and be certain it’s sent” it’s the same service.
Session is a decentralized alternative to Signal. It doesn’t require a phone number, and all traffic is routed through a Tor-like onion network. Relays are run by the community, and relay operators are rewarded with some crypto token for their troubles. To prevent bad actors from attacking the network, you have to stake some of those tokens before you can run a relay, and if your node misbehaves they will get slashed.
I would not recommend it. Session is a Signal fork that deliberately removes forward secrecy from the protocol and uses weaker keys. The removal of forward secrecy means that if your private key is ever exposed, all your past messages could be decrypted.
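To make the forward-secrecy point concrete: with ephemeral per-session keys, the encryption key is derived, used, and thrown away, so compromising a long-term key later can’t unlock old traffic. A minimal sketch using X25519 from the Python `cryptography` package; this is illustrative only, not Signal’s or Session’s actual protocol:

```python
# Minimal ephemeral key exchange; illustrative only, not a real wire protocol.
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

def derive_key(own_private, peer_public):
    shared = own_private.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"demo-session").derive(shared)

alice_eph = X25519PrivateKey.generate()   # fresh key pair for this session only
bob_eph = X25519PrivateKey.generate()

alice_key = derive_key(alice_eph, bob_eph.public_key())
bob_key = derive_key(bob_eph, alice_eph.public_key())
assert alice_key == bob_key               # both sides share the same message key

# Once the ephemeral keys are deleted, a later compromise of long-term
# identity keys cannot re-derive this key: that's forward secrecy.
del alice_eph, bob_eph
```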
Shame their entire node system relies on cryptobro tech.
Tor doesn’t need a currency to back it up. I2P doesn’t need a currency to back it up. Why the hell does Lokinet?
Tor relays only relay traffic; they don’t store anything (other than HSDirs, but that’s minuscule). Session relays have to store all the messages, pictures, and files until the user comes online and retrieves them. Obviously all that data would be too much to store on every single node, so instead it is spread across only 5-7 nodes at a time. If all of those nodes were to go offline at the same time, messages would be lost, so there has to be some mechanism that discourages taking nodes offline without giving the network a notice period.

Without the staking mechanism, an attacker could spin up a bunch of nodes and then take them all down relatively cheaply, leaving users’ messages undelivered. It also incentivizes honest operators to ensure their node’s reliability and rewards them for it, which, even if you run your node purely for altruistic reasons, is always a nice bonus. So I don’t really see any downside to it, especially since the end user doesn’t need to interact with it at all.
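As a toy model of that incentive (the numbers and rules here are invented for illustration, nothing to do with the network’s real parameters):

```python
# Toy stake/slash model; amounts and rules are made up for illustration.
class RelayNode:
    def __init__(self, operator: str, stake: float):
        self.operator = operator
        self.stake = stake          # locked up before the node may join

    def heartbeat(self, online: bool, reward: float = 1.0, penalty: float = 25.0):
        if online:
            self.stake += reward    # steady reward for staying reachable
        else:
            self.stake -= penalty   # slashed for dropping stored messages

node = RelayNode("alice", stake=1000.0)
for online in [True, True, False, True]:   # one missed heartbeat
    node.heartbeat(online)
print(node.stake)  # 1000 + 3*1.0 - 25.0 = 978.0
```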
I2P already did that with their DHT network (remember DHT?). I2P Bote uses that for messaging.
Eh, no. A DHT doesn’t solve offline storage of data when the source node is already offline and the target node is not yet online.
It does temporarily, on the order of hours to days. It’s not designed to use the network for long-term storage, just message passing.
Yet they could’ve done this with volunteer nodes, or even their own, because not even the server knows the content, right?
Can you think of another way for people across the world to easily pay each other directly?
Lokinet is for data transfer, like a message from your phone to mine, not a currency. That’s why it’s odd that it uses staking instead of just accepting any nodes.
Sounds like the staking is a way to incentivize individual node uptime. Also, you need to pay into the stake to get going, so there is some financial pain involved in neglecting, or worse, manipulating a node. Though it sounds like it’s around €1000 per node, so it’s not really going to slow down governments or billion-dollar commercial competitors.
Exactly.
It’s also a way that people can contribute to the network without needing third-party payment services. I don’t need to find some node operator’s socials and look up a Patreon to use a credit card.
If I already have an account with a crypto exchange then it’s easy to pay the operators.
Just use Briar or SimpleX instead of these clowns’ service with no perfect forward secrecy

Gifs you can hear ❤️
Excuse me, but I don’t believe this BS.
Wrong. It is actually quite easy to use multiple clouds with the help of OpenTofu, so that is just a cheap excuse.
I’m going to call bullshit, in that there are several networks that might be capable of doing this, such as various blockchain networks or IPFS.
I’m going to call bullshit on the underlying assertion that Signal is using Amazon services for the sake of lining Jeff’s pocket instead of considering the “several” alternatives. As if they don’t have staff to consider such a thing and just hit buy now on the Amazon smile.
In any monopoly, there are going to be smaller, less versatile, less reliable options. Fine and dandy for Mr Joe Technology to hop on the niche wagon and save a few bucks, but that’s not going to work for anyone casting a net encompassing the world.