So yeah, this is only really relevant for collecting logs from ClickHouse itself, not for logs from anything else. Good for them, and I really love ClickHouse, but it's not that relevant to the rest of us.
Noteworthy point:
> If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to stdout and stderr, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.
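For anyone unfamiliar with what "passive" means here: roughly, the collector just tails whatever the runtime already wrote to disk for the container's stdout/stderr, so it doesn't care whether the service can still answer queries. A minimal sketch of the idea (the log path and output fields are made up, and real collectors handle rotation, backfill, etc.):

```python
import json
import time
from pathlib import Path

# Hypothetical path where the runtime persists the container's stdout/stderr.
CONTAINER_LOG = Path("/var/log/containers/clickhouse-server.log")

def tail_lines(path: Path, poll_interval: float = 1.0):
    """Yield new lines appended to `path`, surviving the writer's crashes.

    The reader only depends on the file existing, not on the service being
    healthy -- that's the 'passive' property of stdout/stderr collection,
    versus actively scraping system tables from a live server.
    """
    with path.open("r") as f:
        f.seek(0, 2)  # start at end of file; backfill is omitted for brevity
        while True:
            line = f.readline()
            if not line:
                time.sleep(poll_interval)
                continue
            yield line.rstrip("\n")

if __name__ == "__main__":
    for raw in tail_lines(CONTAINER_LOG):
        # Forward the raw line; parsing/enrichment would happen downstream.
        print(json.dumps({"body": raw, "source": str(CONTAINER_LOG)}))
```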
I haven't worked at ClickHouse-level scale.
Can you actually search log data at this volume? Elasticsearch has query capabilities for small-scale log data, I think.
Why would I use ClickHouse instead of storing historical log data as JSON files?
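Concretely, what I'm comparing is something like the sketch below (hypothetical `logs` table, clickhouse-connect client). My understanding is that the JSON files get fully re-read and re-parsed on every query, while the columnar store only reads the columns, and roughly the row ranges, that the query touches:

```python
import json
from pathlib import Path

# Option A: "historical logs as JSON files" -- every query is a full scan + parse.
def count_errors_in_ndjson(path: Path) -> int:
    errors = 0
    with path.open() as f:
        for line in f:                   # reads every byte of every event
            event = json.loads(line)     # parses every field, even unused ones
            if event.get("level") == "error":
                errors += 1
    return errors

# Option B: the same question against a columnar store (table/columns made up).
# Only the `level` and `timestamp` columns are read, and only the parts whose
# primary-key range can match, instead of the whole dataset.
def count_errors_in_clickhouse() -> int:
    import clickhouse_connect  # pip install clickhouse-connect
    client = clickhouse_connect.get_client(host="localhost")
    result = client.query(
        "SELECT count() FROM logs "
        "WHERE level = 'error' AND timestamp >= now() - INTERVAL 1 DAY"
    )
    return result.result_rows[0][0]
```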
Observability maximalism is a cult. A very rich one.
Do wide events really have to take up this much space? I mean, observability is to a large degree a sampling problem: the goal is to maximize your ability to reconstruct the state of the environment at a given time using a minimal amount of storage. You can accomplish that either by reducing the number of samples taken or by improving your compression.
For the latter, I have a very hard time believing we've squeezed most of the juice out of compression already. Surely there's an absolutely massive amount of low-rank structure in all that redundant data. Yeah, I know these companies already use inverted indices and various sorts of trees, but I would have thought there are more research-y approaches (e.g. low-rank tensor decomposition) that, if we could figure out how to perform them efficiently, would blow the existing methods out of the water. But IDK, I'm not in that industry, so maybe I'm overlooking something.
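Just to ground that a bit: even plain stdlib compression already eats a lot of that redundancy on synthetic, repetitive log lines (all fields below are made up), and columnar per-column encodings go further on top of it; the open question is how much more the research-y approaches could add.

```python
import json
import random
import zlib

# Synthetic wide-ish events: mostly repeated keys and low-cardinality values,
# which is exactly the kind of redundancy the comment is talking about.
random.seed(0)
events = [
    json.dumps({
        "service": "checkout",
        "region": random.choice(["us-east-1", "eu-west-1"]),
        "level": random.choice(["info", "info", "info", "warn", "error"]),
        "status": random.choice([200, 200, 200, 500]),
        "latency_ms": random.randint(1, 500),
    })
    for _ in range(100_000)
]
raw = ("\n".join(events)).encode()
compressed = zlib.compress(raw, level=6)

print(f"raw:   {len(raw) / 1e6:6.1f} MB")
print(f"zlib:  {len(compressed) / 1e6:6.1f} MB")
print(f"ratio: {len(raw) / len(compressed):.1f}x")
```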
There isn't much information about correlation. What are the state-of-the-art tools and techniques for observability in stateful use cases?
Let's take the example of an SFU-based video conferencing app, where user devices go through multiple API calls to join a session. Now imagine a user reports that they cannot see video from another participant. How can such problems be effectively traced?
Of course, I can manually filter logs and traces by the first user, then by the second user, and look at the signaling exchange and frontend/backend errors. But are there better approaches?
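The best I've seen, in the absence of purpose-built tooling, is to make every service stamp its events with the same correlation keys (session/room id plus participant ids), so "A can't see B's video" turns into one ordered timeline rather than two manual searches. Something like this sketch, with made-up event fields:

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float          # epoch seconds
    service: str       # e.g. "signaling", "sfu", "web-client"
    session_id: str    # shared correlation key stamped by every service
    user_id: str
    message: str

def timeline_for_pair(events: list[Event], session_id: str,
                      viewer: str, publisher: str) -> list[Event]:
    """All events for one session that involve either side of the broken
    viewer <- publisher media path, in time order."""
    relevant = [
        e for e in events
        if e.session_id == session_id and e.user_id in (viewer, publisher)
    ]
    return sorted(relevant, key=lambda e: e.ts)

# Usage: fetch events by session_id from the log store, then read the merged
# timeline top to bottom (join -> offer/answer -> subscribe -> RTP stats).
```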
Whenever I come back from ClickHouse to Postgres, I am always shocked. Like, what is it doing for several minutes importing this 20 GB dump? Shouldn't it take seconds?
Yes, this is what the people who will curse you out and judge you for not using wide events leave out: it greatly increases storage costs compared to the conventional setup of metrics + traces + sampled logging. There is both a benefit and a cost, and the cost part is always omitted.
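For a sense of scale, here is a back-of-envelope comparison with entirely made-up numbers (none of these come from the article):

```python
# Back-of-envelope only; every number here is an assumption.
requests_per_sec = 50_000
wide_event_bytes = 2_000          # one fat, uncompressed event per request
sample_rate = 0.01                # conventional sampled logging
seconds_per_day = 86_400

wide_events_per_day = requests_per_sec * seconds_per_day * wide_event_bytes
sampled_logs_per_day = wide_events_per_day * sample_rate

print(f"wide events: {wide_events_per_day / 1e12:.1f} TB/day (before compression)")
print(f"1% sampling: {sampled_logs_per_day / 1e12:.2f} TB/day (before compression)")
# Compression shrinks both, but the ~100x ratio between them stays the same.
```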
I didn't see how long logs are kept, i.e. the retention time. After x months you may only need summary/aggregated data, but I'm not sure about the raw data.
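For what it's worth, ClickHouse-style stores can express a "raw for a while, then drop (or roll up)" policy with table TTLs. A sketch of what that might look like, with a hypothetical table, and worth double-checking against the current TTL docs:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# Hypothetical raw-log table: rows older than 3 months are dropped automatically.
client.command("""
    CREATE TABLE IF NOT EXISTS logs_raw
    (
        timestamp DateTime,
        service   LowCardinality(String),
        level     LowCardinality(String),
        body      String
    )
    ENGINE = MergeTree
    ORDER BY (service, timestamp)
    TTL timestamp + INTERVAL 3 MONTH DELETE
""")

# Longer-lived summaries (e.g. hourly error counts) would typically live in a
# separate rollup table fed by a materialized view, with a much longer or no TTL.
```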
This industry is mostly filled with half-baked or in-progress standards, which leads to segmentation of the ecosystems. From GraphQL to OpenAPI to MCP to everything else, nothing is perfect, and that's fine.
The problem is that the people who create these specs just follow a trial-and-error approach, which is insane.
What is the trick that this and Dynamo use?
Are they basically just large hash tables?
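As far as I understand it, the trick is closer to the opposite of a big hash table: writes land in sorted, immutable parts that get merged in the background (LSM-style), and reads use a sparse index over the sort order to skip most of each part. A toy sketch of those two ideas (the real granule size is 8192 rows; merging and persistence are omitted):

```python
import bisect

GRANULE = 4  # tiny here for readability; ClickHouse's default is 8192 rows

class SortedPart:
    """One immutable part: rows sorted by key, plus a sparse index with one
    entry per granule instead of one entry per row."""

    def __init__(self, rows):  # rows: list[(key, value)]
        self.rows = sorted(rows)
        self.marks = [self.rows[i][0] for i in range(0, len(self.rows), GRANULE)]

    def scan(self, lo, hi):
        """Read only the granules whose key range can overlap [lo, hi]."""
        first = max(bisect.bisect_right(self.marks, lo) - 1, 0) * GRANULE
        for key, value in self.rows[first:]:
            if key > hi:
                break
            if key >= lo:
                yield key, value

# New data becomes small sorted parts; background merges fold them into bigger
# sorted parts, so range reads stay cheap without any per-row index.
parts = [SortedPart([(i, f"row{i}") for i in batch])
         for batch in ([1, 5, 9, 13, 2], [3, 7, 11, 4, 8])]
merged = SortedPart([r for p in parts for r in p.rows])   # a "merge" in miniature
print(list(merged.scan(4, 8)))
```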
I mean, if you don't get the logs when the service is down, the entire solution is useless.
Whenever I read things like this I think: you are doing it wrong. I guess it is an amazing engineering feat for ClickHouse, but I think we (as in IT, or all people) should really reduce the amount of data we create. It is wasteful.
tl;dr: they now do a zero(?)-copy pass-through of the raw bytes instead of marshaling and unmarshaling JSON.
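A minimal stand-in for what that buys (pure Python, not the collector's actual code): if the payload only needs to be routed or batched, forwarding the raw bytes skips materializing and re-serializing the whole object graph.

```python
import json

payload = json.dumps(
    {"resourceLogs": [{"logRecords": [{"body": "x" * 64}] * 1000}]}
).encode()

# Decode/encode round-trip: builds the full object tree, then rebuilds the bytes.
def forward_with_marshal(raw: bytes) -> bytes:
    obj = json.loads(raw)            # allocate every dict/list/str in the payload
    return json.dumps(obj).encode()  # ...and serialize it all over again

# Pass-through: hand the same buffer on (memoryview avoids even copying the
# bytes); parsing happens only if something actually needs to inspect them.
def forward_passthrough(raw: bytes) -> memoryview:
    return memoryview(raw)

assert bytes(forward_passthrough(payload)) == payload
```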
tbh that's not the flex. storing 100PB of logs just means we haven't figured out what's actually worth logging. metrics + structured events can usually tell 90% of the story. the rest? trace-level chaos no one reads unless prod's on fire. what could've been done better: auto-pruning logs that no alert ever looked at, or logs that never hit a search query in 3 months. call it attention-weighted retention. until then this is just a high-end digital landfill with compression