Microsoft Office is using an artificially complex XML schema as a lock-in tool

by firexcy on 7/19/2025, 4:22 AM with 128 comments

by jonathaneunice on 7/19/2025, 11:18 AM

I wish this article had shown side-by-side examples. Back when I built document transformation tools as part of a publishing pipeline, the simplicity and clarity benefit of OpenDocument's XML over Microsoft's OOXML were *staggering* in practice. A beautiful, clean, logical approach vs beyond-Byzantine cruft and complexity at every turn.

I don't remember every element enough to render from memory, but ChatGPT's example feels about right:

OpenDocument

<text:p text:style-name="Para"> This is some <text:span text:style-name="Bold">bold text</text:span> in a paragraph. </text:p>

OOXML

  <w:p>
    <w:pPr>
      <w:pStyle w:val="Para"/>
    </w:pPr>
    <w:r>
      <w:t>This is some </w:t>
    </w:r>
    <w:r>
      <w:rPr>
        <w:b/>
      </w:rPr>
      <w:t>bold text</w:t>
    </w:r>
    <w:r>
      <w:t> in a paragraph.</w:t>
    </w:r>
  </w:p>

OpenDocument is not always 100% "simple," but it's logical and direct. Comprehensible on sight. OOXML is...something else entirely. Keep in mind the above are the simplest possible examples, not including named styles, footnotes, comments, change markup, and 247 other features commonly seen in commercial documents. The OpenDocument advantage increases at scale. In every way except breadth of adoption.
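
To make the comparison concrete, here is a rough Python sketch that pulls (text, bold?) runs out of both snippets with the standard library. The namespace URIs are the real ones for ODF text and WordprocessingML; everything else is an assumption for illustration. In particular, treating a style literally named "Bold" as bold is a shortcut (a real ODF reader would resolve the style definition), and the OOXML side ignores `w:val` on `<w:b/>` and style inheritance.

```python
import xml.etree.ElementTree as ET

ODF_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The two snippets from the comment, with namespace declarations added
# so they parse standalone (outer whitespace in the ODF one trimmed).
ODF_SNIPPET = (
    f'<text:p xmlns:text="{ODF_TEXT_NS}" text:style-name="Para">'
    'This is some <text:span text:style-name="Bold">bold text</text:span>'
    ' in a paragraph.</text:p>'
)

OOXML_SNIPPET = f"""<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="Para"/></w:pPr>
  <w:r><w:t>This is some </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>bold text</w:t></w:r>
  <w:r><w:t> in a paragraph.</w:t></w:r>
</w:p>"""


def odf_runs(p):
    """ODF uses mixed content: plain text sits directly in the paragraph."""
    runs = []
    if p.text:
        runs.append((p.text, False))
    for child in p:
        if child.tag == f"{{{ODF_TEXT_NS}}}span":
            style = child.get(f"{{{ODF_TEXT_NS}}}style-name")
            # Assumption: a style named "Bold" means bold text.
            runs.append((child.text or "", style == "Bold"))
        if child.tail:
            runs.append((child.tail, False))
    return runs


def ooxml_runs(p):
    """OOXML wraps every stretch of text in a run (w:r) with optional properties."""
    ns = {"w": W_NS}
    runs = []
    for r in p.findall("w:r", ns):
        text = "".join(t.text or "" for t in r.findall("w:t", ns))
        # Simplification: any <w:b/> counts as bold; w:val="false" is ignored.
        bold = r.find("w:rPr/w:b", ns) is not None
        runs.append((text, bold))
    return runs


if __name__ == "__main__":
    print(odf_runs(ET.fromstring(ODF_SNIPPET)))
    print(ooxml_runs(ET.fromstring(OOXML_SNIPPET)))
```

Both functions return the same three runs here. At this size the reading code is about equally short either way; the difference being described shows up in the markup itself and in the features this sketch skips.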

by cranberryturkey on 7/19/2025, 7:03 AM

The post is essentially reminding people that XML doesn’t magically equal openness. A schema can be “unnecessarily complex, bloated, convoluted and difficult to implement”, and in the case of Office 365 the spec runs to “over 8 000 pages” and uses deeply nested tags, overloaded elements and wildcards. The result is that only the vendor can feasibly implement it, which eliminates third‑party implementations and lets the vendor dictate terms. The rail‑control analogy in the article makes the point well.

What isn’t acknowledged is that a lot of that complexity isn’t purely malicious. OOXML had to capture decades of WordPerfect/Office binary formats, include every oddball feature ever shipped, and satisfy both backwards‑compatibility and ISO standardisation. A comprehensive schema will inevitably have “dozens or even hundreds of optional or overloaded elements” and long type hierarchies. That’s one reason why the spec is huge. Likewise, there’s a difference between a complicated but documented standard and a closed format—OOXML is published (you can go and download those 8 000 pages), and the parts of it that matter for basic interoperability are quite small compared with the full kitchen‑sink spec.

That doesn’t mean the criticism is wrong. The sheer size and complexity of OOXML mean that few free‑software developers can afford to implement more than a tiny subset. When the bar is that high, the practical effect is the same as lock‑in. For simple document exchange, OpenDocument is significantly leaner and easier to work with, and interoperability bodies like the EU have been encouraging governments to use it for years. The takeaway for anyone designing document formats today should be the same as the article’s closing line: complexity imprisons people; simplicity and clarity set them free.

by jpalomaki on 7/19/2025, 8:17 AM

What we should really do is abandon the WYSIWYG approach to document editing. It inevitably leads to vendor lock-in.

Instead of perfect looks, we should focus on the content. Formats like markdown are nice, because they force you to do this. The old way made sense 30 years ago when information was consumed on paper.

by flohofwoe on 7/19/2025, 9:42 AM

I don't even think it's intentional, they had to come up with a file format which supports all the weird historical artefacts in the various Office tools. They didn't have the luxury to first come up with a clean file format and then write the tools around it.

And I bet they didn't switch to XML because it was superior to their old file formats, but simply because of the unbelievable XML hype that existed for a short time in the late 1990s and early 2000s.

by wvenable on 7/19/2025, 5:37 AM

> Unfortunately, while an XML schema can be simple, it can also be unnecessarily complex, bloated, convoluted and difficult to implement without specific knowledge of its features.

One could now use that exact sentence to describe the most popular open document format of all: HTML and CSS.

by bob1029 on 7/19/2025, 7:16 AM

This is a comical perspective to me. I've been ass-deep in core banking APIs where we generate service references from WSDL/XSDs. Some of the resulting codegen measures in the tens of megabytes for some files. I wouldn't even attempt to quantify the number of pages of documentation. And this is just for the mid-size US banking domain. Microsoft Office has to work literally everywhere for everything. The fact that it's only 8000 pages of documentation is likely a miracle.

If you're working with an XML schema that is served up in XSD format, using code gen is the best (only) path. I understand it's old and confusing to the new generation, but if you just do it the boomer way you can have the whole job done in like 15 minutes. Hand-coding to an XML interface would be like cutting a board with an unplugged circular saw.
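
For anyone who hasn't seen the code gen workflow, here is a toy Python sketch of the idea: read an XSD, emit typed classes, then load documents into them. It is not how any real binding generator works internally. The `Account` schema, the `generated_bindings` module name, and the `generate`/`load` helpers are all made up for this example, and it only handles flat sequences of a few built-in types written with the `xs:` prefix.

```python
import sys
import types
import xml.etree.ElementTree as ET
from dataclasses import fields
from decimal import Decimal

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical schema, invented for this example.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Account">
    <xs:sequence>
      <xs:element name="id" type="xs:string"/>
      <xs:element name="balance" type="xs:decimal"/>
      <xs:element name="active" type="xs:boolean"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

# Only a handful of built-in types; a real generator covers far more.
PY_TYPES = {"xs:string": "str", "xs:decimal": "Decimal",
            "xs:boolean": "bool", "xs:int": "int"}


def generate(xsd_text):
    """Emit Python dataclass source for each top-level complexType."""
    schema = ET.fromstring(xsd_text)
    lines = ["from dataclasses import dataclass",
             "from decimal import Decimal", ""]
    for ctype in schema.findall("xs:complexType", NS):
        lines += ["@dataclass", f"class {ctype.get('name')}:"]
        for el in ctype.findall("xs:sequence/xs:element", NS):
            lines.append(f"    {el.get('name')}: {PY_TYPES[el.get('type')]}")
        lines.append("")
    return "\n".join(lines)


def load(cls, xml_text):
    """Fill a generated dataclass from a flat XML document."""
    root = ET.fromstring(xml_text)
    kwargs = {}
    for f in fields(cls):
        text = root.findtext(f.name)
        # xs:boolean allows "true"/"1"; other types convert via their constructor.
        kwargs[f.name] = (text in ("true", "1")) if f.type is bool else f.type(text)
    return cls(**kwargs)


# "Run the generator": turn the emitted source into an importable module.
source = generate(XSD)
module = types.ModuleType("generated_bindings")
sys.modules["generated_bindings"] = module
exec(source, module.__dict__)
Account = module.Account

account = load(
    Account,
    "<Account><id>A-1</id><balance>10.50</balance><active>true</active></Account>",
)

if __name__ == "__main__":
    print(source)
    print(account)
```

The point is that nobody reads the schema by hand: the generated classes carry the structure, and application code only touches typed fields.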

by jiggawatts on 7/19/2025, 7:11 AM

The opinion in the article misses something fundamental.

The complexity is not artificial, it is completely organic and natural.

It is incidental complexity born of decades of history, backwards compatibility, lip-service to openness, and regulatory compliance checkbox ticking. It wasn't purposefully added, it just happened.

Every large document-based application's file format is like this, no exceptions.

As a random example, Adobe Photoshop PSD files are famously horrific to parse, let alone interpret in any useful way. There are many, many other examples, I don't aim to single out any particular vendor.

All of this boils down to the simple fact that these file formats have no independent existence apart from their editor programs.

They're simply serialised application state, little better than memory-dumps. They encode every single feature the application has, directly. They must! Otherwise the feature states couldn't be saved. It's tautological. If it's in Word, Excel, PowerPoint, or any other Office app somewhere, it has to go into the files too.
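
A toy Python sketch of what "serialised application state" means in practice. The `EditorState` and `CompatFlags` classes and all their fields are invented for illustration and have nothing to do with Office's real internals. The serialiser just walks the object graph, so every field the application grows becomes an element in the file.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class CompatFlags:
    # Hypothetical legacy switches, invented for this example.
    wrap_like_v6: bool = False
    legacy_printer_metrics: bool = False


@dataclass
class EditorState:
    # Hypothetical in-memory state of an editor.
    text: str = ""
    zoom_percent: int = 100
    track_changes: bool = False
    compat: CompatFlags = field(default_factory=CompatFlags)


def to_xml(obj, tag=None):
    """Dump a dataclass instance field by field; the 'schema' is whatever the class has."""
    el = ET.Element(tag or type(obj).__name__)
    for f in fields(obj):
        value = getattr(obj, f.name)
        if is_dataclass(value):
            el.append(to_xml(value, f.name))
        else:
            ET.SubElement(el, f.name).text = str(value)
    return el


if __name__ == "__main__":
    state = EditorState(text="Hello", track_changes=True)
    print(ET.tostring(to_xml(state), encoding="unicode"))
```

Add a field to the class and the file format changes with it, with no separate design step in between.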

There are layers and layers of this history and complex internal state that have to be represented in the file. Everything from compatibility flags, OLE embedding, macros, external data sources, incremental saves, support for the quirks of legacy printers that no longer exist, CMYK, document signing, document review notes, and on and on.

No extra complexity had to be added to the OOXML file formats, that's just a reflection of the complexity of Microsoft Office applications.

Simplicity was never engineered into these file formats. If it had been, it would have been a tremendous extra effort for zero gain to Microsoft.

Don't blame Microsoft for this either, because other vendors did the exact same thing, for the exact same pragmatic reasons.

by markus_zhang on 7/19/2025, 11:58 AM

I think the lock-in is more about MSFT's contracts with schools, governments and corporations. I wish they would break large corporations into pieces.

by mcswell on 7/19/2025, 7:52 PM

I guess this is not directly related, but I worked on a modified DocBook XML schema for some years. We were writing grammars of natural languages, so we had no use for some of the DocBook constructs, and needed to add others. That wasn't hard. And there are WYSIWYM (What You See Is What You Mean) editors, like XMLmind, which read the schema and helped you create conforming documents.

There are at least two ways to get from such an XML document to a PDF; we used pdfLaTeX, modified to handle our extra constructs, and then XeLaTeX.

I won't say it was a simple toolpath, but it allowed us to do at least two things that would have been difficult with Word or OpenOffice:

(1) It gave us an archival XML format, which will probably be readable and understandable for centuries. For grammars of endangered languages, that's important, because the languages won't be around more than a couple decades.

(2) It gave us the ability to cleanly typeset documents that had multiple scripts (including both Roman and various right-to-left scripts, like Arabic and Thaana).

by donatj on 7/19/2025, 3:01 PM

Mind you, Microsoft already had an earlier very capable XML spreadsheet format that was much easier to parse, SpreadsheetML.

Back in the early 2000s I wrote readers and writers for it and made pretty heavy use of the format at my job at the time.

The biggest problem with SpreadsheetML was that it expected the extension to be .XML - Microsoft had some sort of magic that would still associate the files with Excel on Windows but it wasn't super reliable. We started using .xls but after an update Excel started barking about files with the wrong extension.

https://en.wikipedia.org/wiki/SpreadsheetML
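
For a sense of how small a SpreadsheetML writer can be, here is a minimal Python sketch. The `to_spreadsheetml` helper is made up for this example and covers only strings and numbers in a single sheet, with no styles, formulas or dates. The `mso-application` processing instruction is presumably the "magic" that let Windows route a .xml file to Excel, but treat that as an assumption rather than a verified detail.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"


def to_spreadsheetml(rows, sheet_name="Sheet1"):
    """Render rows (lists of str/int/float) as a SpreadsheetML 2003 document."""
    out = [
        '<?xml version="1.0"?>',
        # Hint that the file belongs to Excel even though it ends in .xml.
        '<?mso-application progid="Excel.Sheet"?>',
        f'<Workbook xmlns="{SS_NS}" xmlns:ss="{SS_NS}">',
        f' <Worksheet ss:Name={quoteattr(sheet_name)}>',
        '  <Table>',
    ]
    for row in rows:
        out.append('   <Row>')
        for value in row:
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            kind = "Number" if is_number else "String"
            out.append(
                f'    <Cell><Data ss:Type="{kind}">{escape(str(value))}</Data></Cell>'
            )
        out.append('   </Row>')
    out += ['  </Table>', ' </Worksheet>', '</Workbook>']
    return "\n".join(out)


if __name__ == "__main__":
    print(to_spreadsheetml([["item", "qty"], ["widgets", 3]]))
```

Building the string by hand keeps the default namespace on the elements and the `ss:` prefix on the attributes, which is the shape of the hand-written examples I remember; whether a given Excel version accepts other equivalent serialisations is not something this sketch checks.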

by khelavastr on 7/19/2025, 5:15 AM

Does this person not understand XML serializers..?

by piker on 7/19/2025, 7:54 AM

This is a dupe from: https://news.ycombinator.com/item?id=44606646 but I'll repeat what I said over there.

I feel qualified to opine on this as both a former power user of Word and someone building a word processor for lawyers from scratch[1]. I've spent hours poring over both the .doc and OOXML specs and implementing them. There's a pretty obvious journey visible in those specs from 1984 when computers were underpowered with RAM rounding to zero through the 00's when XML was the hot idea to today when MSFT wants everyone on the cloud for life. Unlike say an IDE or generic text editor where developers are excited to work on and dogfood the product via self-hosting, word processors are kind of boring and require separate testing/QA.

It's not "artificial", it's just complex.

MSFT has the deep pockets to fund that development and testing/QA. LibreOffice doesn't.

The business model is just screaming that GPL'd LibreOffice is toast.

[1] Plug: https://tritium.legal

by skywhopper on 7/19/2025, 11:31 AM

While the critique is correct, the complexity is probably not “artificial”. Rather, it directly reflects the internal decades-old complex architecture of Office applications rather than making any attempt to be an actually useful schema for sharing between applications.

It only exists because Microsoft was desperate to avoid antitrust consequences for the dominance of Office 25 years ago.

by djohnston on 7/20/2025, 2:26 AM

I spent some time reverse engineering bits of WordML for presentation to LLMs. Not a fun time. Things like ordered list formatting end up having a lot of unspecified behaviour and precedence, and the same doc in gdocs and Word will end up rendering differently because they decided on their own precedences.

by pessimizer on 7/19/2025, 7:18 AM

Strange that this is getting traction again, and good on the people getting it out there. Saw something about "OOXML" make Google News the other day.

Having a debate about the quality of OOXML feels like a waste of time, though. This was all debated in public when Microsoft was making its proprietary products into national standards, and nobody on Microsoft's side debated the formats on the merits because there obviously weren't any, except a dubious backwards compatibility promise that was already being broken because MS Office couldn't even render OOXML properly. People trying to open old MS Office documents were advised to try OpenOffice.

They instead did the wise thing and just named themselves after their enemy ("Open Office? Well we have Office Open!"), offered massive discounts and giveaways to budget-strapped European countries for support, and directly suborned individual politicians.

Which means to me that it's potentially a winnable battle at some point in the future, but I don't know why now would be a better outcome than then. Maybe if you could trick MS into fighting with Google about it. Or just maybe, this latest media push is some submarine attempt by Google to start a new fight about file formats?

by drewcoo on 7/19/2025, 11:37 AM

As opposed to the original binary format, designed to copy directly to the heap on restore?

by scarface_74 on 7/19/2025, 10:30 AM

There has been third-party support for importing and exporting Office documents for as long as I can remember. It was part of Apple’s File Exchange extension in 1994. No one is locked into Office because of file formats.

by catmanjan on 7/19/2025, 10:33 AM

Does software that produces files have an obligation to provide interoperability?

by fithisux on 7/19/2025, 7:57 AM

I have seen in the past the same claim for Bluetooth.

I think this needs to end and it is up to ordinary people to seek alternatives.

Apart from LibreOffice, we still have many other alternatives.

by another_twist on 7/19/2025, 7:27 AM

How hard would it be to generate a parser for this spec with AI code gen?

by danjc on 7/19/2025, 7:53 AM

So, basically the same as Adobe with PDF

by kaleidawave on 7/19/2025, 12:01 PM

HTML?

by ddtaylor on 7/19/2025, 6:06 AM

Again?

by jasonm23 on 7/20/2025, 6:51 AM

In other news, water is wet.

by jongjong on 7/19/2025, 8:32 AM

Microsoft is using an artificially complex everything as a lock-in tool. I learned this many years ago when I learned how to create a window in C++ and it took around 100 lines of over-engineered code just to create an empty window on Windows.

Even TypeScript encourages artificial complexity of interfaces and creates lock-in, that's why Microsoft loves it. That's why they made it Turing complete and why they don't want TypeScript to be made backwards compatible with JavaScript via the type annotations ECMAScript proposal. They want complex interfaces and they want all these complex interfaces to be locked into their tsc compiler which they control.

They love it when junior devs use obscure 'cutting edge' or 'enterprise grade' features of their APIs and disregard the benefits of simplicity and backwards compatibility.