We run migrations across 2,800 microservices

by willsewell on 8/27/2024, 10:02 AM with 12 comments

by HenryBemis on 8/27/2024, 12:56 PM

Or as I call it "death by a thousand microcuts".

I always wonder why some (most) banks are proud of being reckless.. oh well, it keeps me well paid.

Also, Monzo decided to remove the "dark mode" option back in the day. When I wrote to them asking "please return it as optional, as it already was", they responded with a polite "nope, suck it up". My next message to them was to close my account. Well.. "nope, suck it up" right back at you.

by gsck on 8/27/2024, 1:34 PM

I like Monzo as a bank, I think what they are doing is pretty cool.

But it all still feels very amateur-ish, especially for a bank. Something as simple as being able to generate a proof of payment receipt for a bank transfer, why is this not possible? It feels incredibly unprofessional to send a screenshot of a mobile app to a company because your bank doesn't allow you to properly export a PDF for one single transaction.

by jjice on 8/27/2024, 12:42 PM

What constitutes a micro service when you have 2800? Are these individual lambdas for each endpoint and background task or something?

by willsewell on 8/27/2024, 1:24 PM

There was previous discussion related to our microservices architecture here: https://news.ycombinator.com/item?id=22725989.

by lucianbr on 8/27/2024, 1:41 PM

> it would require a lot of effort to update all call sites, and in some cases the benefit of the new API was minimal. By wrapping the old library it meant we could choose to keep the interface similar to the old library in these cases, making it easier to update call sites.

Doesn't wrapping the old library require a lot of effort to update all call sites?

If this is supposed to be general advice about libraries... does this mean wrap all libraries? Does not sound like a good idea to me.

by 0xbadcafebee on 8/27/2024, 1:33 PM

The whole idea of (reliably) deploying and rolling back without downtime I don't think gets nearly enough meme-worthy attention on HN. It's quite complicated and depends entirely on a number of variables (specifically how you do everything). I wrote an internal paper once which was probably 30 pages just to explain why we couldn't do automatic rollbacks.

The most important parts of such a system (the ones mentioned in this post, anyway) don't get nearly enough attention:

- "centrally driven migrations": In any distributed service architecture, there are always too many interdependent pieces. You can't reliably touch thing A without also touching things B, C, D, etc. If you want any chance of automation or responding to failure without downtime, you must have a system which is aware of the changing state of everything and can change all the parts at a whim.

- "database migrations": This is again very complicated and depends on how your code and database are architected. You literally can't do migrations if your code and schema aren't set up right, and if you don't make the right kind of changes. How do you do this? Time to write a book...

- "wrap the old library": I can't remember what this is called, but it has a name. Anyway, the idea is that hiding any change behind what is effectively a feature flag wrapper allows you to deploy the change without it being enabled, use the feature flag to test the change in production (on only one request, on a percentage of requests, on one whole node/pod, etc), and then delete the old code eventually. This isn't just for features; you can replace entire interfaces, software stacks, whole systems this way, either piecemeal or entirely. Very powerful, but again, requires a specific approach not only in implementation but in use.

- "use automated rollback checks": What kind of checks? Checking what? In what way? At what time/stage? What happens when one fails? Do you do them in series or parallel? Can you do them in series or parallel? etc

- "deploy least critical services first": With enough interdependent services, you're going to hit cases where you have to upgrade parts B and C effectively simultaneously before you can upgrade A, etc. So for "no downtime", it will take a lot of coordination, and very explicit linkage and checking of specific new services, etc. There are ways to do this, but it's specific to your implementation and services, so this is another example of how you have to know exactly what's going on, and then set up the deployment to account for your specific dependency tree and how they react when they're run.
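
To make the "feature flag wrapper" point a bit more concrete, here is a rough sketch of routing a percentage of requests to a new implementation while the old one stays in place. This is only an illustration, not how Monzo does it: `old_send`, `new_send`, `in_rollout`, and `make_flagged` are all hypothetical names, and the rollout percentage is read from a plain callable where a real system would presumably consult some config service.

```python
import hashlib
from typing import Callable


def old_send(payload: str) -> str:
    # Hypothetical stand-in for a call into the old library.
    return "old:" + payload


def new_send(payload: str) -> str:
    # Hypothetical stand-in for a call into the new library.
    return "new:" + payload


def in_rollout(request_id: str, percent: int) -> bool:
    """Deterministically map a request ID to one of 100 buckets.

    Hashing (rather than random choice) means the same request ID
    always lands on the same side of the flag.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def make_flagged(
    old_impl: Callable[[str], str],
    new_impl: Callable[[str], str],
    percent_fn: Callable[[], int],
) -> Callable[[str, str], str]:
    """Wrap two implementations behind one call site.

    percent_fn is read on every call, so the rollout can be changed
    (or set back to 0) without redeploying the service.
    """
    def call(request_id: str, payload: str) -> str:
        if in_rollout(request_id, percent_fn()):
            return new_impl(payload)
        return old_impl(payload)

    return call


# Assumed usage: a mutable dict stands in for a remote flag store.
rollout = {"percent": 0}
send = make_flagged(old_send, new_send, lambda: rollout["percent"])
```

With `rollout["percent"]` at 0 every call goes to the old code, at 100 every call goes to the new code, and dropping it back to 0 is the "rollback". Once the new path has run at 100 for long enough, the wrapper and the old implementation can be deleted.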

So many people I've run into don't think about any of these things. They literally say things like "automated rollbacks are easy, we did it at XYZ place", as if none of the above matter at all. They literally stick their head in the sand because they want to believe that it should be easy. But any engineer worth their salt will tell you that to do it correctly and reliably is bloody complicated.