Data accidentally exposed by Microsoft AI researchers

by deepersprout on 9/18/2023, 2:30 PM with 226 comments

by saurik on 9/18/2023, 3:22 PM

A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...

...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models, given how the Python serialization format was never exactly intended for untrusted data distribution and so is kind of effectively code... but stored in a way where both what that code says as well as that it is there at all is extremely obfuscated to people who download it.

> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
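
To make that concrete, here is a minimal sketch of why unpickling untrusted data amounts to code execution (the class name and command are made up for illustration):

    import os
    import pickle

    # pickle lets any object define __reduce__, which returns a callable
    # plus its arguments; pickle.loads() invokes it during deserialization.
    class NotAModel:
        def __reduce__(self):
            # Harmless here, but an attacker could return any command or callable.
            return (os.system, ("echo this ran during unpickling",))

    payload = pickle.dumps(NotAModel())
    pickle.loads(payload)  # runs os.system(...) as a side effect of loading

Nothing about the resulting bytes tells a downloader that loading them will run a command.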

by sillysaurusx on 9/18/2023, 3:10 PM

The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.

by hdesh on 9/18/2023, 3:01 PM

On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.

by quickthrower2 on 9/18/2023, 3:51 PM

Two of the things that make me cringe are both mentioned: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys IMO.

SOC 2-type auditing should have been done here, so I am surprised at the reach: a SAS with no expiry, and then the deep level of access it gave, including machine backups holding their own tokens. A lot of missing defence in depth going on there.

My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles, with the important humans getting access via username, password, and other factors.

If you are working in one cloud you don't, in theory, need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
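
For Azure storage specifically, the RBAC approach looks roughly like this; a sketch assuming the azure-identity and azure-storage-blob packages, a role assignment such as Storage Blob Data Reader, and placeholder account/container names:

    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # No SAS token or account key in code: the caller's identity (user,
    # managed identity, CLI login, etc.) is resolved at runtime and must
    # hold an RBAC role on the storage account.
    service = BlobServiceClient(
        account_url="https://examplestorageacct.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    blob = service.get_blob_client(container="models", blob="model.ckpt")
    data = blob.download_blob().readall()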

by stevanl on 9/18/2023, 3:14 PM

Looks like it was up for 2 years with that old link[1]. Fixed two months ago.

[1] https://github.com/microsoft/robust-models-transfer/blame/a9...

by jl6 on 9/18/2023, 4:22 PM

Kind of incredible that someone managed to export Teams messages out from Teams…

by pradn on 9/18/2023, 6:14 PM

It's not reasonable to expect humans to generate security tokens perfectly every time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally, blanket access tokens should be opt-in, not opt-out.

Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't grant access to Google data/code. This is enforced at the highest level by OrgPolicy. There are a bunch more restrictions, too.

by mola on 9/18/2023, 4:05 PM

It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was the highest-ranking person in charge of cyber security at Microsoft in his previous job.

by anon1199022 on 9/18/2023, 2:34 PM

Just proves how hard cloud security is now. 1-2 mistakes and you expose TBs. Insane.

by formerly_proven on 9/18/2023, 3:43 PM

This stands out

> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.

Not even Microsoft has functioning corporate IT any more, with employees not just making their own image-based backups, but also storing them in some random Azure bucket that they're using for work files.

by bkm on 9/18/2023, 3:00 PM

Would be insane if the GPT-4 model is in there somewhere (as it's served by Azure).

by wodenokoto on 9/18/2023, 3:46 PM

I really dislike how Azure makes you juggle keys in order to make any two Azure things talk to each other.

Even more so, you only have two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys for each container.
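
The closest workaround I know of is minting narrow, short-lived SAS tokens per container rather than handing out a key. A sketch with placeholder names follows; note the token is still derived from one of those two account keys (a user delegation SAS backed by Azure AD would avoid the key entirely):

    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import ContainerSasPermissions, generate_container_sas

    # Read-only access to a single container, expiring in one hour:
    # the opposite of the full-account, long-lived token in the article.
    sas = generate_container_sas(
        account_name="examplestorageacct",
        container_name="models",
        account_key="<account-key>",  # placeholder
        permission=ContainerSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=1),
    )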

by kevinsundar on 9/18/2023, 9:13 PM

This is very similar to how some security researchers got access to TikTok's S3 bucket: https://medium.com/berkeleyischool/cloudsquatting-taking-ove...

They used the same mechanism of mining Common Crawl or other publicly available web crawler data to source DNS records for S3 buckets.
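
The query side of that technique is simple; here is a sketch against the public Common Crawl CDX index (the crawl name is a placeholder, current names are listed at index.commoncrawl.org):

    import requests

    # Ask one crawl's index for every captured URL under a bucket-hosting
    # domain; each line of the response is a JSON record for one capture.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-23-index",
        params={
            "url": "s3.amazonaws.com",
            "matchType": "domain",  # include all subdomains, i.e. bucket hosts
            "output": "json",
            "limit": "20",
        },
        timeout=30,
    )
    for line in resp.text.splitlines():
        print(line)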

by EGreg on 9/18/2023, 4:07 PM

This seems to be a common occurrence with Big Tech and Big Government, so we better get used to it:

https://qbix.com/blog/2023/06/12/no-way-to-prevent-this-says...

https://qbix.com/blog/2021/01/25/no-way-to-prevent-this-says...

by rickette on 9/18/2023, 3:25 PM

At this point MS might as well acquire Wiz, given the number of Azure security issues they have found.

by lijok on 9/18/2023, 7:18 PM

I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass

by gumballindie on 9/18/2023, 5:18 PM

Would be cool if someone analysed it. I am fairly certain it has proprietary code and data lying around. Would be useful for future lawsuits against Microsoft and others that steal people’s IP for “training” purposes.

by madelyn-goodman on 9/18/2023, 5:37 PM

This is so unfortunate, but it's a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI: it seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts. Disclosure: I do work for Tonic.ai, and we are working on a way to automatically redact any information you send to an LLM - https://www.tonic.ai/solar

by naikrovek on 9/18/2023, 4:27 PM

Amazing how ingrained it is in some people to just go around security controls.

Someone chose to give that SAS a long expiry, and someone chose to make it read-write.

by baz00 on 9/18/2023, 6:14 PM

What's that, the second major data loss/leak event from MSFT recently?

Is your data really safe there?

by h1fra on 9/18/2023, 5:07 PM

The article is focusing on AI and Teams messages for some reason, but the exposed bucket had passwords, SSH keys, credentials, .env files, and most probably a lot of proprietary code. I can't even imagine the nightmare it has created internally.

by svaha1728 on 9/18/2023, 3:57 PM

Embrace, extend, and extinguish cybersecurity with AI. It's the Microsoft way.

by fithisux on 9/19/2023, 4:27 AM

My opinion is that it was not an "accident", but that they are preparing us for the era where powerful companies will "own" our data in the name of security.

Someone should have been sent to prison.

by riwsky on 9/18/2023, 3:45 PM

If only Microsoft hadn’t named the project “robust” models transfer, they could have dodged this Hubrisbleed attack.

by bt1a on 9/18/2023, 3:06 PM

Don't get pickled, friends!

by 34679 on 9/18/2023, 5:28 PM

@4mm character width:

4e-6 km/char * 3.8e+13 chars = 152 million kilometers of text.

Nearly 200 round trips to the moon.
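
Checking the arithmetic (assuming one byte per character):

    chars = 3.8e13           # 38 TB of text at one byte per character
    char_width_km = 4e-6     # 4 mm expressed in kilometers
    length_km = chars * char_width_km
    print(f"{length_km:.3g} km")        # ~1.52e+08 km
    print(length_km / (2 * 384_400))    # ~198 round trips to the moon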

by avereveard on 9/18/2023, 3:21 PM

Oof. Does that contain code from GitHub private repos?

by endisneigh on 9/18/2023, 2:59 PM

How is this sort of stuff not at least encrypted at rest?

by mymac on 9/18/2023, 3:32 PM

Fortunately it's not a whole lot of data, and surely with a little bit like that there wasn't anything important, confidential, or embarrassing in there. Looking forward to Microsoft's itemised list of what was taken, as well as their GDPR-related filing.

by Nischalj10 on 9/18/2023, 3:16 PM

zsh, any way to download the stuff?

by EMCymatics on 9/18/2023, 4:35 PM

That's a lot of data.

by munchler on 9/18/2023, 3:05 PM

> This case is an example of the new risks organizations face when starting to leverage the power of AI more broadly, as more of their engineers now work with massive amounts of training data.

It seems like a stretch to associate this risk with AI specifically. The era of "big data" started several years before the current AI boom.

by buro9 on 9/18/2023, 2:56 PM

Part of me thought "this is fine as very few could actually download 38TB".

But that's not true as it's just so cheap to spin up a machine and some storage on a Cloud provider and deal with it later.

It's also not true as I've got a 1Gbps internet connection and 112TB usable in my local NAS.

All of a sudden (over a decade) all the numbers got big, and massive data exfiltration now just looks trivial.

I mean, obviously that's the sales pitch... you need this vendor's monitoring and security. But it's not a bad sales pitch: you have to be able to imagine a risk in order to monitor for it, and most engineers aren't thinking that way.

by anyoneamous on 9/18/2023, 4:20 PM

Straight to jail.

by HumblyTossed on 9/18/2023, 3:40 PM

Microsoft, too big to fa.. care.