S3 remains one of the most amazing pieces of software I've ever seen. That thing a few years ago where they just added strong read-after-write consistency to the whole system? Incredible software engineering. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
Wow. I used to correspond with Leslie Lamport years ago (about his Buridan's Principle papers, etc.)
Today I went to his website and discovered a lot about TLA+ and PlusCal. He still maintains it: https://lamport.azurewebsites.net/tla/peterson.html?back-lin...
I must say ... it would make total sense for a guy like that, who brought mathematics to programming and was an OG of concurrent systems, to create a systems design language that's used at AWS and other industrial places that need to serve people.
I wish more people who build distributed systems would use what he made. Proving correctness is crucial in large systems.
Amazing article! Using state machines is a must if you are building infrastructure control planes. Was P a must, though? I am not sure. We have been building infrastructure control planes for over 13 years now, and we have built every iteration with Ruby. It has worked wonders for us: https://www.ubicloud.com/blog/building-infrastructure-contro...
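For the curious, the kind of state machine we mean is roughly this (a toy Python sketch with invented VM states, not our actual code): every resource carries an explicit state, and only whitelisted transitions are allowed, so a stray retry can't silently drive a resource from "stopped" back to "running".

    # Illustrative states and transitions only, not from a real system.
    ALLOWED = {
        "requested":    {"provisioning"},
        "provisioning": {"running", "failed"},
        "running":      {"stopping", "failed"},
        "stopping":     {"stopped"},
        "stopped":      {"provisioning"},   # restart
        "failed":       {"provisioning"},   # retry
    }

    class VmStateMachine:
        def __init__(self):
            self.state = "requested"

        def transition(self, new_state):
            # Reject anything not explicitly allowed rather than
            # silently accepting it; illegal transitions are bugs.
            if new_state not in ALLOWED[self.state]:
                raise ValueError(f"illegal transition {self.state} -> {new_state}")
            self.state = new_state

Whether you check that machine with P or just enforce it in Ruby is, I suspect, the real question the article raises.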
The 92% stat looks really interesting! It’s rarely the spectacular crash that knocks a cluster over. Instead, the “harmless” retry leaks state until everything breaks at 2 a.m. on one fateful Friday. Evidently, we should budget more engineering hours for mediocre, silent failures than for outright disasters. That’s where the bodies are buried.
One thing I wondered about the P language: It seems like in the early days, it was used at Microsoft to generate C code that’s actually used at runtime in the Windows USB stack?
But now it is no longer used to generate production code?
I asked that question here, which I think was the same question as in a talk: https://news.ycombinator.com/item?id=34284557
It seems like if the generated code is used in a kernel, it could also be used in a cloud, which is less resource-constrained.
This sounds interesting, but as someone who hasn't worked at AWS and isn't already familiar with TLA+ or P, I would have liked to see even a hello-world example of either of them. Without that, it sounds like a lot of extra pain for things that a good design and testing process should catch anyway. A basic example in the article itself would give me better insight into what these actually do.
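To illustrate what I mean by a basic example: as I understand it, the core idea behind both tools is exhaustively exploring every reachable state of a design and checking an invariant in each one. A hand-rolled toy version of that idea in Python (not TLA+ or P syntax, just an illustration) might look like:

    from collections import deque

    # Toy model: two processes grab a "lock" with a non-atomic
    # test-then-set. A state is (pc of p1, pc of p2, lock flag).
    def steps(pc, flag):
        # Possible next moves for one process.
        if pc == "idle" and flag == 0:
            yield ("checked", flag)      # observed the lock as free
        elif pc == "checked":
            yield ("critical", 1)        # takes the lock (too late!)
        elif pc == "critical":
            yield ("done", 0)            # releases the lock

    def successors(state):
        pc1, pc2, flag = state
        for npc, nflag in steps(pc1, flag):
            yield (npc, pc2, nflag)
        for npc, nflag in steps(pc2, flag):
            yield (pc1, npc, nflag)

    # Breadth-first search over every reachable state, checking the
    # mutual-exclusion invariant in each one. This is the essence of
    # what TLA+'s TLC or P's checker automates for real designs.
    seen, frontier = set(), deque([("idle", "idle", 0)])
    while frontier:
        state = frontier.popleft()
        if state in seen:
            continue
        seen.add(state)
        if state[0] == "critical" and state[1] == "critical":
            print("invariant violated:", state)
        frontier.extend(successors(state))

Even this toy finds the interleaving where both processes pass the check before either sets the flag, which is exactly the class of bug that ordinary example-based tests rarely hit.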
It is interesting how the industry ended up with things like TDD when it doesn't work for something as simple as a function that adds two numbers together. While not completely useless in some edge cases, its complete lack of any kind of formal underpinnings should have given us a clue. So many bad, unexamined ideas in the Uncle Bob era. Far closer to a religion than anything else (complete with process "rituals", even).
> to more lightweight semi-formal approaches (such as property-based testing, fuzzing, and runtime monitoring)
Ok, I get how property-based testing and fuzzing relate to formal methods (the thing being checked looks like part of a formal specification, and in some sense these are a subset of the checks that a model checker performs), but calling runtime monitoring a "semi-formal approach" seems like a real stretch.
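To make the property-based case concrete, here's a sketch using Python's Hypothesis (just one example of such a library): the property plays the role of a fragment of a specification, and the framework searches for counterexamples instead of you picking inputs by hand.

    from hypothesis import given, strategies as st

    def dedupe(xs):
        # Function under test: drop duplicates, keep first occurrences.
        seen, out = set(), []
        for x in xs:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    # The "spec": properties that must hold for *every* list of ints.
    @given(st.lists(st.integers()))
    def test_dedupe_properties(xs):
        ys = dedupe(xs)
        assert len(set(ys)) == len(ys)   # no duplicates remain
        assert set(ys) == set(xs)        # nothing lost, nothing invented
        assert ys == dedupe(ys)          # idempotent

That's clearly weaker than model checking the whole design, but it is checking a universally quantified claim, which is why people file it near formal methods.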
Would I be right in saying Promela and SPIN are at a higher level than what is being described in the article?
Just curious, has anyone used FIS in their own distributed services? I'm considering using it but don't have any real-world experience handling those kinds of experiments.
Hopefully they now have a responsive security team, because I have lost count of the hacking bots/IP scanners they are protecting on their "cloud".
It's very interesting (and I applaud it) that one of the main goals seems to be making it more approachable than TLA+, but then they went and wrote it in C#, which I consider an incredibly unapproachable language and community.
I'm not trying to draw the ire of the Microsoft fanboys, and there are certainly smart people working on it, but it's just not going to happen for most people.
Had this been in golang, or maybe Java, I'm sure many more hands would be digging in! Having said that, I hope this helps bring correctness and validation more into the mainstream. I've been casually following the project for a while now.
My long-term goal is to integrate model validation into https://github.com/purpleidea/mgmt/ so if this is an area of interest to you, please let me know!
>Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.
This is fucking amazing
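The shape of it is simpler than it sounds, too. A toy sketch (all names invented; nothing like AWS's actual framework): the whole system runs on one thread, and the only nondeterminism is a seeded scheduler picking which in-flight message to deliver next, so any failing interleaving replays exactly from its seed.

    import random

    class Simulator:
        def __init__(self, seed):
            self.rng = random.Random(seed)
            self.inflight = []           # pending (dest, message) pairs

        def send(self, dest, msg):
            self.inflight.append((dest, msg))

        def run(self):
            while self.inflight:
                # The scheduler owns ordering. Swap this line for an
                # exhaustive or fuzzing strategy to explore more orders.
                i = self.rng.randrange(len(self.inflight))
                dest, msg = self.inflight.pop(i)
                dest.on_message(msg, self)

    class EchoNode:
        # Trivial node for demonstration: it just records deliveries.
        def __init__(self):
            self.log = []
        def on_message(self, msg, sim):
            self.log.append(msg)

    sim, node = Simulator(seed=42), EchoNode()
    sim.send(node, "hello")
    sim.send(node, "world")
    sim.run()
    print(node.log)   # identical order on every run with the same seed

A test then seeds the simulator, injects something like a "participant crashed at step N" message at the interesting point, and asserts on the final state.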
> 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors
This. If you take nothing else away from the article (and it has a lot), take this: fail well, don’t fail poorly.
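To make it concrete, a sketch of both sides (store and TransientError are invented stand-ins, not any real API): the bug behind that stat is rarely the error itself, it's the catch block that limps onward with half-applied state.

    class TransientError(Exception):
        # Invented placeholder for any retryable, "nonfatal" failure.
        pass

    def apply_update_poorly(store, update):
        try:
            store.begin()
            store.write(update)      # may raise TransientError midway
            store.commit()
        except TransientError:
            # Failing poorly: log and keep going. The store may now
            # hold a half-applied update that everything later builds on.
            print("warning: write failed, continuing")

    def apply_update_well(store, update):
        try:
            store.begin()
            store.write(update)
            store.commit()
        except TransientError:
            # Failing well: restore a known-good state first, then
            # surface the error so retries stay bounded and visible.
            store.rollback()
            raise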
What’s been disappointing to me is how easily formal methods are dismissed in industry.
TLA+ takes some time to learn to wield effectively, but it pays off in spades.
It's a comment of assessment.
> Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.
Any good open-source libraries that do this in a language-agnostic way? Seems doable: spin up a container with some tools inside it. Said tools require some middleware to know when a test is going to run; when the test runs, the tools make certain things (networking, storage, etc.) "deterministic" in the context of that test run.
This is more or less what Antithesis does, but I haven't seen anything open source yet.
You could, of course, write your tests well, such that you can stub out I/O, but that's work, and not everyone will write their tests well anyway (you should do this regardless, but it's nicer IMO if this determinism lives on a layer higher than the application).
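Roughly the application-level version of that layering (interface invented for illustration): funnel every source of nondeterminism the code consumes through one injected object, and the deterministic stub becomes trivial whether it lives in the test or in a container-level fake.

    import random, time

    class Env:
        # The one choke point for nondeterminism the app is allowed to use.
        def now(self):
            return time.time()
        def rand(self):
            return random.random()

    class FakeEnv(Env):
        # Deterministic stand-in for tests: time moves only when the
        # test advances it, and randomness comes from a fixed seed.
        def __init__(self, seed=0):
            self._t = 0.0
            self._rng = random.Random(seed)
        def now(self):
            return self._t
        def rand(self):
            return self._rng.random()
        def advance(self, seconds):
            self._t += seconds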
As a slight sidebar: I'm not really bullish on AI, but I think testing is one of the areas where it will hopefully shine, because the feedback loop during prompting can be driven by your actual application requirements. The test implementation (driven by the AI), the requirements (driven by you, as the prompt), and the "world" (driven by the actual code being tested) can hopefully push all three toward some theoretical ideal. If AI gives us anything, I'm hoping it makes software a more rigorous discipline by making formal verification more doable.