As the article says: "I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages, the common factor being this is a vulnerability in the interprocess communication of the components (either between git and external processes, or within the components of git itself). It is possible to draw a parallel with CRLF injection as seen in HTTP (or even SMTP smuggling)."
You can write this in any language. None of them will stop you. I'm on the cutting edge of "stop using C", but this isn't C's fault.
You can, but in languages like python/java/go/rust/... you wouldn't, because you wouldn't write serialization/de-serialization code by hand but call out to a battle hardened library.
This vulnerability is the fault of the C ecosystem, where there is no reasonable project-level package manager, so everyone writes everything from scratch. It's exacerbated by the combination of a lack of generics (rust/java's solution), introspection (java/python's solution), and a poor preprocessor in C (go's solution), so it wouldn't even be easy to make an ergonomic general-purpose parser.
Python's pathlib wouldn't help you here, it can encode the necessary bits. Especially with configparser - it's a 20-year-old configuration reader. Java's story is worse.
What part of this would be prevented by another language?
You'd need to switch your data format to something like json, toml, etc. to prevent this from the outset. But JSON was first standardised 25 years ago, and AJAX wasn't invented when this was written. JSON was a fledgling and not widely used yet.
I guess we had netrc - but that's not standardised and everyone implements it differently. Same story for INI.
There was XML - at a time when it was full of RCEs, and everyone was acknowledging that its parser would be 90% of your program. Would you have joined the people disparaging json at the time as reinventing xml?
This vulnerability is the fault of suitable data formats not having been invented or widely adopted yet.
> What part of this would be prevented by another language?
> You'd need to switch your data format to something like json, toml, etc.
The part where, if you wrote this in any modern language's ecosystem, you would do this.
Yes, modern languages and their ecosystems likely did not exist back then. The lesson going forwards is that we shouldn't keep doing new things like we did back then.
Saying that smithing metal with a pair of hand-driven bellows is inefficient isn't to say that the blacksmiths of ages ago, who had no better option, were doing something wrong.
What an absurdly bad faith interpretation. I never said anything to even suggest abandoning old code.
As demonstrated by vulnerabilities like the one in the article, C (and its ecosystem) doesn't "work", so I'm glad to hear that you won't be sticking with that for new projects going forwards.
I have a feeling that this code was developed before any of those languages were widely popular and before their package managers or packages were mature.
Sure, I'm not trying to assign blame to Linus for deciding to write git in C, I'm saying that modern tooling (not C) would prevent the bug with reasonably high probability and that that's a factor when deciding what to do going forwards.
> I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages, the common factor being this is a vulnerability in the interprocess communication of the components (either between git and external processes, or within the components of git itself).
Whilst true, there's a swathe of modern tooling that will aid in marshalling data for IPC. Would you not agree that if protobuf, JSON or YAML were used, it'd be far less likely for this bug to have slipped in?
In isolation, for any one particular bug, yes, but if you start applying this logic to everything, even problems as simple as reading some bytes from a file, you end up with a heap of dependencies for the most mundane things. We've tried that, it's bad.
I don't believe we must apply any guideline ad absurdum. Using a battle tested marshalling/serialization library is clearly the way to go most often. Of course, one can still construct difficult to parse XML and JSON or any other blob for any given format, but the chances that bad input will result in an RCE are lower.
No I would not agree that YAML or JSON parsers in any language are far less likely to have logic errors, and I'm not sure why protobuf (a binary format) would be a good choice for a human readable file.
INI is not a particularly complex format (less complex than YAML for example), and there are existing open source parsers written in C that could have been used.
You can dig in all you want, but this is not an issue with C strings or the INI format.
This isn't even a parser error at all - the INI format comes from DOS/Windows where a trailing carriage return would not be considered part of the value either.
Using other languages would likely fix the issue, but as a side effect. Most people would expect a C-vs-Rust comparison, so I'll take Go as an example.
Nobody would write the configuration parsing code by hand, and just use whatever TOML library available at hand for Go. No INI shenanigans, and people would just use any available stricter format (strings must be quoted in TOML).
So yeah, Rust and Go and Python and Java and Node and Ruby and whatnot would not have the bug just by virtue of having a package manager. The actual language is irrelevant.
However, whatever the language, the same hand implementation would have had the exact same bug.
Ah yes, yet ANOTHER vulnerability caused because Linux and most Unixes allow control characters in filenames. This ability's primary purpose appears to be to enable attacks and to make it significantly more difficult to write correct code. For example, you're not supposed to exchange filenames a line at a time, since filenames can contain newlines.
One piece of good news: POSIX recently added xargs -0 and find -print0, making it a little easier to portably handle such filenames. Still, it's a pain.
I plan to complete my "safename" Linux module I started years ago. When enabled, it prevents creating filenames in certain cases such as those with control characters. It won't prevent all problems, but it's a decent hardening mechanism that prevents problems in many cases.
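To illustrate the -print0/-0 convention: NUL is the only byte besides '/' that cannot appear in a Unix filename, so it is the only safe delimiter for exchanging arbitrary names. Here is a small, Unix-only sketch (purely illustrative, nothing to do with git itself) of consuming NUL-delimited names:

```rust
use std::ffi::OsStr;
use std::io::{self, Read};
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

fn main() -> io::Result<()> {
    // Read, for example, the output of `find . -print0` from stdin.
    let mut buf = Vec::new();
    io::stdin().read_to_end(&mut buf)?;

    // NUL cannot occur inside a Unix pathname, so splitting on it is safe
    // even for names that contain '\n', '\r' or other control characters.
    for raw in buf.split(|&b| b == 0).filter(|name| !name.is_empty()) {
        let path = Path::new(OsStr::from_bytes(raw));
        // Debug formatting escapes control characters instead of letting
        // them corrupt the output stream.
        println!("{:?}", path);
    }
    Ok(())
}
```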
You can get similar vulnerabilities with Unicode normalization, with mismatched code pages/character encodings, or, as the article points out, with a case-insensitive file system. That's not to say that control characters should be allowed in file names, but there's an inherent risk whenever byte sequences are being decoded or normalized into something else.
Not to the same degree, though, and the arguments for status quo are especially weak. There are reasonable arguments pro and con case-insensitive filenames. Character encoding issues are dwindling, since most systems just use utf-8 for filename encoding (as there is no mechanism for indicating the encoding of each specific filename), and using utf-8 consistently in filenames supports filenames in arbitrary languages.
Control characters in filenames have no obviously valuable use case, they appear to be allowed only because "it's always been allowed". That is not a strong argument for them. Some systems do not allow them, with no obvious ill effects.
I think a better idea is to make git use user namespaces and sandbox itself to the clone directory so it literally cannot write/read outside of it. This prevents path traversal attacks and limits the amount of damage an RCE could do. Filenames really aren't the problem.
The idea of Defence in Depth is to handle vulnerabilities at several levels, instead of relying on a single technique that becomes a single point of failure.
I'm not saying not to do that. But it seems sandboxing should be the first thing to think of, especially in the context of git, which allows you to execute all sorts of custom scripts. File name sanitization is not that; in fact, on the contrary, file name sanitization has been known to cause security vulnerabilities and other annoying issues in the past.
I completely disagree with the author's (oft-quoted here in the comments) statement:
> I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages
For Christ's sake, Turing taught us that any error in one language is possible in any other language. You can even get double free in Rust if you take the time to build an entire machine emulator and then run something that uses Malloc in the ensuing VM. Rust and similar memory safe languages can emulate literally any problem C can make a mine field out of.. but logic errors being "possible" to perform are significantly different from logic errors being the first tool available to pull out of one's toolbox.
Other comments have cited that in non-C languages a person would be more likely to reach for a security-hardened library first, which I agree might be helpful.. but replies to those comments also correctly point out that this trades one problem for another with dependency hell, and I would add on top of that the issue that a widely relied upon library can also increase the surface area of attack when a novel exploit gets found in it. Libraries can be a very powerful tool but neither are they a panacea.
I would argue that the real value in a more data-safe language (be that Rust or Haskell or LISP et al) is in offering the built-in abstractions which lend themselves to more carefully modeling data than as a firehose of octets which a person then assumes they need to state-switch over like some kind of raw Turing machine.
"Parse, don't validate" is a lot easier to stick to when you're coding in a language designed with a precept like that in mind vs a language designed to be only slightly more abstract than machine code where one can merely be grateful that they aren't forced to use jump instructions for every control flow action.
> You can even get double free in Rust if you take the time to build an entire machine emulator and then run something that uses Malloc in the ensuing VM.
No, this wouldn't be a double free in Rust, it'd be a double free in whatever language you used to write the emulated code.
The distinction is meaningful, because the logic error he's talking about is possible in actual rust (even without unsafe), not just theoretically in some virtual system that you can use Rust to write a simulation for.
I can easily see this bug happening in Rust. At some level you need to transform your data model into text to write out, and to parse incoming text. If you want to parse linewise you might use BufRead::lines(), and then write a parser for those lines. That parser won't touch CRs at all, which means when you do the opposite and write the code that serializes your data model back to lines, it's easy to forget that you need to avoid having a trailing CR, since CR appears nowhere else in your code.
And - having dealt with parser construction in university for a few months - the only real way to deal with this is fuzzing and round trip tests.
It sounds defeatist, but non-trivial parsers end up with a huge state space very quickly - and entirely strange error situations and problematic inputs. And "non-trivial" starts a lot sooner than one would assume. As the article shows, even "one element per line" ends up non-trivial once you support two platforms. "foo\r\n" could be tokenized/parsed in 3 or even 4 different ways or so.
It just becomes worse from there. And then Unicode happened.
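Concretely, the round-trip test being described is cheap to write even for a toy format. A sketch with a deliberately buggy writer/reader pair standing in for the real thing (all names hypothetical); run under `cargo test`, it fails on the trailing-CR input, which is exactly the class of bug it exists to catch:

```rust
// Toy format: "key = <value>\n". The writer is buggy on purpose: it does
// not quote a trailing CR, mirroring the class of bug under discussion.
pub fn serialize(value: &str) -> String {
    format!("key = {}\n", value)
}

// A tolerant reader that strips trailing CR/LF, as many parsers do.
pub fn parse(line: &str) -> String {
    line.strip_prefix("key = ")
        .unwrap_or(line)
        .trim_end_matches(|c: char| c == '\r' || c == '\n')
        .to_string()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn round_trip_preserves_awkward_values() {
        // Hand-picked nasty inputs; a fuzzer would generate these instead.
        for input in ["plain", "trailing-space ", "cr-at-end\r", "tab\tinside"] {
            assert_eq!(parse(&serialize(input)), input, "round trip changed the value");
        }
    }
}
```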
Well the question then becomes "how do you identify the quoting that needs to happen on the line" and tactics common in Rust enabled by features available in Rust will still lead a person away from this pattern of error.
One tool I'd have probably reached for (long before having heard of this particular corner case to avoid) would have been whitespace trimming, and CR counts as whitespace. Plus folk outside of C are also more likely to aim a regex at a line they want to parse, and anyone who's been writing regex for more than 5 minutes gets into the habit of adding `\s*` adjacent to beginning of line and end of line markers (and outside of capture groups) which in this case achieves the same end.
You're describing a different format entirely then if you're doing generic whitespace trimming without any consideration for the definition of "whitespace". The Git config format explicitly defines ignorable whitespace as spaces and horizontal tabs, and says that these whitespace characters are trimmed from values, which means nothing else gets trimmed from values. If you try to write a parser for this using a regular expression and `\s*` then you'd better look up what `\s` means to your regex engine because it almost certainly includes more than just SP and HT.
I can't think of any features in Rust that will lead someone away from this pattern of error, where this pattern of error is not realizing that round-tripping the serialized output back through the deserializer can change the boundaries of line endings. It's really easy to think "if I have a bunch of single-line strings and I join them with newlines I now have multiline text, and I can split that back up into individual lines and get back what I started with". This is doubly true if you start with a parser that splits on newline characters and then change it after the fact to use BufRead::lines() in response to someone telling you it doesn't work on Windows.
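For anyone who hasn't hit it, here is the `BufRead::lines()` behaviour in isolation (a std-only sketch, nothing git-specific):

```rust
use std::io::BufRead;

fn main() {
    // &[u8] implements BufRead, so .lines() works directly on byte slices.
    let read = |s: &str| -> Vec<String> {
        s.as_bytes().lines().map(|l| l.unwrap()).collect()
    };

    // Two different byte streams produce identical lines, because
    // BufRead::lines() strips both "\n" and "\r\n" terminators.
    assert_eq!(read("a\nb\n"), read("a\r\nb\r\n"));

    // A value with a lone trailing CR keeps it: only the final "\r\n"
    // counts as the terminator.
    let lone_cr = "path = x\r\r\n";
    let parsed = read(lone_cr);
    assert_eq!(parsed, vec!["path = x\r".to_string()]);

    // So joining with "\n" and re-reading through anything that strips a
    // trailing CR shifts the boundary and silently changes the value:
    // the round-trip trap described above.
    let reserialized = parsed.join("\n") + "\n";
    assert_ne!(reserialized, lone_cr);
}
```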
> You can even get double free in Rust if you take the time to build an entire machine emulator and then run something that uses Malloc in the ensuing VM. Rust and similar memory safe languages can emulate literally any problem C can make a mine field out of..
That doesn't have any relevance to a discussion about memory safety in C vs rust. Invalid memory access in the emulated machine won't be able to access memory from other processes on the host system. Two languages being turing complete does not make them the same language. And it definitely does not make them bug for bug compatible. Rust _really_ does enable you to write memory safe programs.
> "Parse, don't validate" is a lot easier to stick to when you're coding in a language designed with a precept like that in mind vs a language designed to be only slightly more abstract than machine code
As you point out, the most serious way to undermine the "safety" features in a "safe" language like Rust is to implement a VM, programming language, serdes framework, etc, because these operate outside of Rust's type system and memory safety.
And that's exactly what the Git developers did here: They made an in-house configuration file format. If implemented in Rust, it would bypass most of Rust's safety features, particularly, type-safety.
It is mind-blowing the things people come up with when it comes to Rust vs C conversations. The same convoluted crap for years at this point.
No, just no. I'm sorry, I've implemented countless custom formats in Rust and have NEVER had to side-step safe/unsafe or otherwise sacrifice type safety. Just what an absurd claim.
Maybe for some binary (de)serialization you get fancy (lol and are still likely to be far better off than with C) but goodness, I cannot imagine a single reason why a config file parser would need to be (type)-unsafe.
The person you replied to didn't say that you had to bypass safe. This bug is orthogonal to type and memory safety, its a different issue.
The git bug in question could be written in 100% safe rust using as much or as little of the type system[1] as you want. It's a logic error when parsing a string.
I dev rust full-time, and I've spent a lot of time writing protocol parsers. It's easy to forget to check this or that byte/string for every possible edge case as you're parsing it into some rust type, and happens all the time in rust, just like it did in C or python or go when I used those languages. This bug (if anything) is the type of thing that is solved with good tokenizer design and testing, and using more small, independently tested functions - again not at all related to the type system.
[1] Although in rust you can arrange your types so that this sort of bug is harder to implement or easier to catch than in most languages... but doing that requires an up-front understanding that logic bugs are just as possible in rust as in other languages, as well as some experience to avoid awkwardness when setting the types up.
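As a concrete (and entirely hypothetical) example of arranging the types that way, in the "parse, don't validate" style mentioned upthread: a wrapper that refuses values the writer cannot round-trip, so the decision is made once at the boundary instead of hoping every later component strips whitespace the same way. The type and its rules are invented for illustration; this is not Git's API.

```rust
/// A config value guaranteed to survive a write/read round trip.
/// Hypothetical illustration of "parse, don't validate"; not Git's API.
#[derive(Debug)]
pub struct ConfigValue(String);

#[derive(Debug)]
pub enum ValueError {
    TrailingWhitespace,
    EmbeddedNewline,
}

impl ConfigValue {
    pub fn new(raw: &str) -> Result<Self, ValueError> {
        if raw.ends_with(|c: char| c.is_whitespace()) {
            // A naive writer would emit this unquoted and a tolerant reader
            // would strip it, yielding a different value. Reject it up front
            // (or quote it; the point is that the decision happens here).
            return Err(ValueError::TrailingWhitespace);
        }
        if raw.contains('\n') {
            return Err(ValueError::EmbeddedNewline);
        }
        Ok(ConfigValue(raw.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    assert!(ConfigValue::new("submodule/path").is_ok());
    assert!(ConfigValue::new("hooks\r").is_err()); // the CVE's shape of input
}
```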
In practice I think a Rust project would have used toml which parses safely. The limitation there would be that toml requires strings to be utf8, so it couldn't represent all possible unix paths.
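For what it's worth, a quick check with the `toml` crate bears that out (this assumes `toml = "0.8"` in Cargo.toml; an illustration of the point, not a suggestion that Git switch formats): the serializer escapes the troublesome byte instead of writing it raw, so the value survives the round trip.

```rust
// Assumes the widely used `toml` crate (e.g. toml = "0.8") is available.
use toml::{Table, Value};

fn main() {
    let mut table = Table::new();
    // The shape of input behind the CVE: a value with a trailing CR.
    table.insert("path".to_string(), Value::String("hooks\r".to_string()));

    // TOML basic strings must escape control characters, so the CR is
    // written out as an explicit \r escape rather than a raw byte...
    let text = toml::to_string(&table).expect("serialize");
    assert!(text.contains("\\r"));

    // ...and the round trip hands back exactly the original value.
    let parsed: Table = toml::from_str(&text).expect("parse");
    assert_eq!(parsed.get("path").and_then(Value::as_str), Some("hooks\r"));
}
```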
I wrote a CLI program to determine and detect the end-of-line format, tabs, BOM, and NUL characters. It can be installed via Homebrew, or you can download standalone binaries for all platforms:
> The result of all this, is when a submodule clone is performed, it might read one location from path = ..., but write out a different path that doesn't end with the ^M.
How does this achieve "remote code execution" as the article states? How serious is it from a security perspective?
> I'm not sharing a PoC yet, but it is an almost trivial modification of an exploit for CVE-2024-32002. There is also a test in the commit fixing it that should give large hints.
EDIT: from the CVE-2024-32002
> Repositories with submodules can be crafted in a way that exploits a bug in Git whereby it can be fooled into writing files not into the submodule's worktree but into a .git/ directory. This allows writing a hook that will be executed while the clone operation is still running, giving the user no opportunity to inspect the code that is being executed.
So a repository can contain a malicious git hook. Normally git hooks aren't installed by "git clone", but this exploit allows one to be installed, and a git hook can run during the clone operation.
More information here (https://github.blog/open-source/git/git-security-vulnerabili...) on the new CVE:
> When reading a configuration value, Git will strip any trailing carriage return (CR) and line feed (LF) characters. When writing a configuration value, however, Git does not quote trailing CR characters, causing them to be lost when they are read later on. When initializing a submodule whose path contains a trailing CR character, the stripped path is used, causing the submodule to be checked out in the wrong place.
> If a symlink already exists between the stripped path and the submodule's hooks directory, an attacker can execute arbitrary code through the submodule's post-checkout hook.
Along with a bunch of other git CVEs that are worth a look.
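The asymmetry is easy to model outside of git. A toy writer/reader pair in the spirit of the description above (purely illustrative; not Git's actual config code):

```rust
fn main() {
    // Attacker-chosen submodule path with a trailing carriage return.
    let recorded_path = "some/dir\r";

    // Writer: emits the value verbatim, without quoting the trailing CR.
    let config = format!("[submodule \"x\"]\n\tpath = {}\n", recorded_path);

    // Reader: a tolerant line-oriented parser. str::lines() treats the
    // accidental "\r\n" as a single terminator, so the CR vanishes.
    let read_back = config
        .lines()
        .find_map(|line| line.trim_start().strip_prefix("path = "))
        .expect("path entry");

    assert_eq!(read_back, "some/dir");
    assert_ne!(read_back, recorded_path);
    // The value that was recorded ("some/dir\r") and the value that gets
    // acted on ("some/dir") are now different paths: the same shape of
    // mismatch the advisory describes for submodule checkout.
}
```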
This seems easy for GitHub to block
It's not sufficient for GitHub to block it; plenty of Git repositories don't have anything to do with GitHub.
Submodules can be any URL (and recursive), so for GitHub to block this totally would require them to crawl other forges (and some URLs could be private URLs, but GitHub likely can't tell that apart from an attacker who is just blocking GitHub). So the risk is GitHub could say they are blocking this and give a false sense of security.
Some previous bugs have resulted in validation added to git fsck, but because clone URLs can't change after the submodules are initialised that's not going to have any benefit here. (There were some defence-in-depth measures discussed, there's definitely a few things that can be improved here.)
Yes, unfortunately it's pretty trivial. Any time arbitrary file write is possible, RCE is usually possible too.
Could this be mitigated by moving .git out of the work tree directory and using an unprivileged process that only has access to the work tree directory to do all the file manipulation?
I've adjusted that paragraph to make it more clear how writing a file can lead to code execution.
In this case, there is more than enough information given to make an exploit trivial for anyone actually investigating the issue. I don't see a reason to distribute PoC exploit code here; it's fairly clear what the problem is, as well as at least one possible RCE implementation.
The exploit being trivial also removes any reason not to show a complete PoC.
There is no security impact for most people anyway as the next step after a git clone is usually to run some kind of build command which executes arbitrary code from the repo.
A well-written article should still have spelled out what armchairhacker did. The article spent many paragraphs working through what seemed to me like a very obvious chain of reasoning that didn't need so much hand-holding, and then left me completely bewildered at the end with the last step. And even with the explanation, I'm still not sure why writing the files to a different path allows the hook to be written. (Surely Git knows that it's still in the middle of a recursive clone operation and that it therefore shouldn't accept any hooks?)
Git probably doesn't allow the submodule path to be .git but does allow .git\r which would then actually be written to .git due to the broken escaping.
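That is speculative about Git's internals, but the general shape of the problem it describes (checking a value in one form and then acting on it in another) is easy to sketch outside of Git; all names here are invented for illustration:

```rust
fn is_forbidden_path(p: &str) -> bool {
    // Naive check: compares the raw string before any normalisation.
    p == ".git"
}

fn normalise(p: &str) -> &str {
    // What a tolerant reader effectively does: strip a trailing CR/LF.
    p.trim_end_matches(|c: char| c == '\r' || c == '\n')
}

fn main() {
    let recorded = ".git\r";
    // The validation sees ".git\r" and lets it through...
    assert!(!is_forbidden_path(recorded));
    // ...but the path actually acted on after normalisation is ".git".
    assert_eq!(normalise(recorded), ".git");
    // Lesson: validate the value in the form it will actually be used.
}
```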
There are hundreds of files on your system that would cause an RCE if you could write to them. For example, `AUTOEXEC.BAT`. While the intermediate step (how do you go from writing a filename without an ^M to writing to an arbitrary file on the target OS?) is also somewhat surprising (and maybe points at some potential deeper fixes within git or potential TOCTOU bugs?), ultimately it's the exact same as several other vulnerabilities in this class from the same author (see the 2024 case-sensitivity bug they linked), so I forgive them for glossing over it.
Git does "know" that, but it doesn't know it's writing a hook.
I think what they're saying is it shouldn't execute any hooks during a first clone operation.
This is a big problem with using ad-hoc DSLs for config - there's often no formal specification for the grammar, and so the source of truth for parsing is spread between the home-grown serialization implementation and the home-grown deserialization implementation. If they get out of sync (e.g. someone adds new grammar to the parser but forgets to update the writer), you end up with a parser differential, and tick goes the time bomb. The lesson: have one source of truth, and generate everything that relies on it from that.
The problem here isn't that the parser was updated. The parser and writer did what they did for a reason, that made sense historically but wasn't what the submodule system expected. The submodule system is a bit "tacked on" to the git design and it's not the first time that particular abstraction cracks.
Every file format is underspecified in some way when you use it on enough platforms and protocols, unless the format is actually a reference implementation, and we've had enough problems with those. There's a reason IETF usually demands two independent implementations.
Similar problems can affect case insensitive filesystems, or when moving data between different locales which affect UTF-8 normalization. It's not surprising that an almost identical CVE was just one year ago.
Be careful what you wish for. They could have used yaml instead of ini for the config format, and we would have had more security issues over the years, not less.
No, the writer encodes values in a way that they will be read back as different values. This underlying issue is absolutely an encoder/decoder mismatch and has nothing to do with submodules.
This bug is orthogonal to having one source of truth. It is a pure logic bug that could have existed in a standard system library for config files, if such a thing existed on Unix.
And consider that the consequences of such a bug would be much worse if it were in a standard system library. At least here it is limited mostly to developers, whose machines tend to be updated.
Nitpick: the DSL here ("ini file format") is arguably ad-hoc, but it's extremely common and well-understood, and simple enough to make a common law specification work well enough in practice. The bug here wasn't due to the format. What you're actually complaining about is the hand-coded parser[1] sitting in the middle of a C program like a bomb waiting to go off. And, yes, that nonsense should have died decades ago.
There are places for clever hand code, even in C, even in the modern world. Data interchange is very much not one of them. Just don't do this. If you want .ini, use toml. Use JSON if you don't. Even YAML is OK. Those with a penchant for pain like XML. And if you have convinced yourself your format must be binary (you're wrong, it doesn't have to be), protobufs are there for you.
But absolutely, positively, never write a parser unless your job title is "programming language author". Use a library for this, even if you don't use libraries for anything else.
[1] Fine fine, lexer. We are nitpicking, after all.
Since we are nitpicking, git is considerably older than TOML, and nearly as old as YAML and JSON. In fact JSON wasn't even standardised until after git's first stable release.
Back then there wasn't a whole lot of options besides rolling their own.
And when you also consider that git had to be somewhat rushed (due to Linux kernel developers having their licenses revoked for BitKeeper) and the fact that git was originally written and maintained by Linus himself, it's a little more understandable that he wrote his own config file parser.
Under the same circumstances, I definitely would have done that too. In fact back then, I literally did write my own config file parsers.
The parser is fine, the bug is in the encoder which writes a value ending in \r without quotes even though \r is a special character at the end of a line.
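The shape of the fix is accordingly on the encoder side: quote anything a tolerant reader would otherwise mangle. A rough sketch with invented quoting rules (not the actual Git patch):

```rust
/// Defensive encoder sketch: quote values that a line-based reader could
/// otherwise mangle. The rules here are invented for illustration and are
/// not Git's actual quoting logic.
fn encode_value(value: &str) -> String {
    let specials = ['\r', '\n', '"', '\\', '#', ';'];
    let needs_quoting = value.starts_with(char::is_whitespace)
        || value.ends_with(char::is_whitespace)
        || value.contains(&specials[..]);

    if !needs_quoting {
        return value.to_string();
    }

    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\r' => out.push_str("\\r"),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}

fn main() {
    assert_eq!(encode_value("some/dir"), "some/dir");
    // A trailing CR is now quoted and escaped instead of being written raw,
    // so a reader that strips trailing CR/LF can no longer change the value.
    assert_eq!(encode_value("some/dir\r"), "\"some/dir\\r\"");
}
```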
How many hand-crafted lexers dealing with LF vs. CRLF encodings exist? My guess is n > (number of people who have coded > 10 KSLOC).
I've written a fair few lexers in my time. My general approach for CR is to simply ignore the character entirely.
If CR is used correctly in Windows, then its behaviour is already covered by the LF case (as required for POSIX systems), and if CR is used incorrectly then you end up with all kinds of weird edge cases. So you're much better off just jumping over that character entirely.
Ignoring CR is often how two systems end up parsing the same file differently: one as two lines, the other as a single line.
If the format is not sensitive to additional empty lines then converting all CRs to LF in-place is likely a safer approach, or a tokenizer that coalesces all sequential CR/LF characters into a single EOL token.
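A sketch of such a coalescing tokenizer (illustrative only; as noted, it also collapses blank lines, so it only suits formats that don't care about them):

```rust
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Eol, // any run of '\r' and/or '\n' collapses into one end-of-line token
}

fn tokenize(input: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        match rest.find(|c: char| c == '\r' || c == '\n') {
            Some(end) => {
                if end > 0 {
                    tokens.push(Token::Text(&rest[..end]));
                }
                // Skip the entire run of line-ending characters.
                let after = rest[end..]
                    .find(|c: char| c != '\r' && c != '\n')
                    .map(|i| end + i)
                    .unwrap_or(rest.len());
                tokens.push(Token::Eol);
                rest = &rest[after..];
            }
            None => {
                tokens.push(Token::Text(rest));
                rest = "";
            }
        }
    }
    tokens
}

fn main() {
    // "\n", "\r\n", a lone "\r" and even mixed runs all become one Eol,
    // so both platforms see an identical token stream.
    assert_eq!(
        tokenize("a\r\nb\nc\r"),
        vec![
            Token::Text("a"),
            Token::Eol,
            Token::Text("b"),
            Token::Eol,
            Token::Text("c"),
            Token::Eol,
        ]
    );
}
```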
I write a lot of software that parses control protocols, the differences between the firmware from a single manufacturer on different devices is astonishing! I find it shocking the number that actually have no delimiters or packet length.
Why would ignoring CR lead to problems? It has nothing to do with line feeding on any system released in the last quarter of a century.
If you're targeting iMacs or the Commodore 64, then sure, it's something to be mindful of. But I'd wager you'd have bigger compatibility problems before you even get to line endings.
Are there some other edge cases regarding CR that I've missed? Or are you thinking ultra defensively (from a security standpoint)?
That said, I do like your suggestion of treating CR like LF where the schema isn't sensitive to line numbering. Unfortunately for my use case, line numbering does matter somewhat. So it would be good to understand if I have a ticking time bomb.
More generally, any textual file format where whitespace is significant at the end of a line is calling for trouble.
Maybe. But expecting people to remember a ; (or similar) at the end of lines is going to cause more frequent problems from a UX perspective.
So you're better off accepting the edge-case problems that whitespace introduces, considering the benefits outweigh the pain.
That's not what I meant. It's okay for the line break itself to be significant. But whitespace immediately preceding the line break shouldn't be significant, due to its general invisibility.
Is CR considered whitespace? I always thought that was classed as a non-printable control character. But maybe I'm wrong?
Or are you talking about SP preceding CR and/or LF?
Old Macs (pre-OS X I think) used CR only as line terminators.
Yes, you're right. I completely forgot about them.
I can't imagine anyone targeting macOS 9 (or earlier) systems these days, but you're right that it's an edge case people should be aware of.
If you're using protobuf you can still use text as it turns out [0]
[0] https://protobuf.dev/reference/protobuf/textformat-spec/
Reproduced the issue after a bit: https://github.com/acheong08/CVE-2025-48384
Then immediately went to update my git version. Still not up on Arch yet. Will refrain from pulling anything, but I bet it'll take quite a while for most people to upgrade. Putting it in any reasonably popular repo where there are perhaps automated pulls will be interesting.
2.50.1 is up on pacman this morning.
So this was disclosed before patching? With all of the alarming "here's how we can pwn your machine" posts turning out to be months after the fact, I figured by now that these blog posts all happen after all the distros have long patched it.
It seems like it would be appropriate to make it clear in the headline whether it's "this is important now" or "don't worry, you probably already patched this", to save time for those who aren't just reading these posts out of interest.
Commits fixing the bug date back around 3 or 4 weeks. The patched release came 3 weeks ago. Perhaps some parties weren't informed that it's security critical (Homebrew, Arch, etc.) and are now scrambling.
I'm not privy to the exact communications that happened, but per the Ubuntu changelog they prepared a patch a week ago[1] (which is about the normal timeline for notification per[2]). Homebrew is not on the distros list, so likely wouldn't have got an early notification. Arch is, but remember "The Arch Security Team is a group of volunteers"[3].
[1]: https://launchpad.net/ubuntu/+source/git/1:2.43.0-1ubuntu7.3
[2]: https://oss-security.openwall.org/wiki/mailing-lists/distros
[3]: https://wiki.archlinux.org/title/Arch_Security_Team
Am I reading this wrong? As of this writing it all says "vulnerable".
https://security-tracker.debian.org/tracker/CVE-2025-48384
Just went and checked and the latest version on macOS is over a year old..
>git version 2.39.5 (Apple Git-154)
I'm confused about the timeline here, the tag for 2.50.1 is from 2025-06-16 ( https://github.com/git/git/tags ). And the date for 2.50.1 on https://git-scm.com is also 2025-06-16.
But it seems like almost no distributions have patched it yet
https://security-tracker.debian.org/tracker/CVE-2025-48386 (debian as an example)
And the security advisory is from yesterday: https://github.com/git/git/security/advisories/GHSA-4v56-3xv...
Did git backdate the release?
Reading someone quote Jon Postel in the context of CR+LF brings back memories.
* https://jdebp.uk/FGA/qmail-myths-dispelled.html#MythAboutBar...
"that may not be the most sensible advice now", says M. Leadbeater today. We were saying that a lot more unequivocally, back in 2003. (-:
As Mark Crispin said then, the interpretations that people put on it are not what M. Postel would have agreed with.
Back in the late 1990s, Daniel J. Bernstein did the famous analysis that noted that parsing and quoting when converting between human-readable and machine-readable is a source of problems. And here we are, over a quarter of a century later, with a quoter that doesn't quote CRs (and even after the fix does not look for all whitespace characters).
Amusingly, git blame says that the offending code was written 19 years ago, around the time that Daniel J. Bernstein was doing the 10 year retrospective on the dicta about parsing and quoting.
* https://github.com/git/git/commit/cdd4fb15cf06ec1de588bee457...
* https://cr.yp.to/qmail/qmailsec-20071101.pdf
I suppose that we just have to keep repeating the lessons that were already hard learned in the 20th century, and still apply in the 21st.
> As Mark Crispin said then, the interpretations that people put on it are not what M. Postel would have agreed with.
Absolutely, in particular the "Be conservative in what you do" would have prevented this bug.
Would homebrew itself be problematic here? Does it do recursive cloning?
At least a cursory glance at the repo suggests it might: https://github.com/Homebrew/brew/blob/700d67a85e0129ab8a893f...
It would be odd if it didn't. Although the goal of homebrew is to execute the code in the repo.
The only situation where the RCE here is a problem is if you clone github repos containing data you don't want to execute. That's fairly unusual.
The question is whether recursive submodule checkout happens after some integrity/signature validation or before. The RCE can be an issue in the latter case.
There would also have to be a compromise of the transport (i.e. a MITM of HTTPS or SSH) to use this in most practical scenarios.
It still weakens the security, otherwise why bother with integrity/signature checks if you trust the git remote?
"trivial modification of an existing exploit"...
Why does git not use Landlock? I know it is Linux-only, but why? "git clone" should only have r/o access to the config directory and r/w to the clone directory. And no subprocesses. In every exploit demo: "Yep, <s>it goes to a square hole</s> it launches a calculator".
> no subprocesses
I guess you're okay with breaking all git hooks, including post-checkout, because those are subprocesses as a feature.
You can always run your git operations in a container with seccomp or such if you're not using any of the many features that it breaks
This would also break custom commands. Which if you don't know about it, is a pretty cool feature.
Drop a git-something executable in your path and you can execute it as git something.
Why is this helpful? Just add the executable itself to PATH and execute it with "something" instead of "git something". Why are we making git an intermediary? I am kind of stupid and this is genuine.
Because the joke doesn't land if typing "git gud" doesn't actually do something.
To your point, I would say that it's "easy" rather than strictly helpful. There is a plugin I maintain internally that can be invoked by calling "helm <thing>" if I go through the necessary steps to have it installable by the helm plugin command. Under the hood it's just a small binary that you can put in your $PATH and it'll work fine, but there are tons of developers and PMs and other people at the company who don't know what a path variable is, or how to set it, or what a terminal is, or what shell they're running, or who know that they can do "helm X" and "helm Y", so why not "helm Z" for my plugin, etc. It would be a hell of a lot easier to just ship the raw executable, but to those people and execs and managers and stuff, it looks good if I can show it off next to the native stuff.
Whenever I have to help users with it, I notice that nearly everyone uses it through helm rather than calling the binary directly. It just comes down to the fact that presentation and perceived ease of use count for a lot when people evaluate whether they want to make a tool part of their workflow.
It allows things to be added and removed from the main executable without changing the interface. This means if someone has a good idea everyone starts using, and they implemented it as a subcommand like this, it could eventually be integrated into git without anyone having to migrate. Also all the subcommands are implemented as separate executables like this anyway.
For example in /usr/lib/git-core/ with git 2.25.1 on Ubuntu, "git-rebase" is a symlink to "git". But on an older Centos VM I have access to, in /usr/libexec/git-core/ with git 2.16.5, "git-rebase" is a separate shell script.
Because if it's part of the repo, you don't depend on the host to take the extra step, which, if you're working from ephemeral instances or places where that step would have to be repeated, is a god send
Because it's thematically a part of a git workflow.
Git itself uses this functionality. On my ubuntu system the path is `/usr/lib/git-core/` and in it you see all sorts of bins for "git commands", e.g `git-rm`, `git-mv`, `git-difftool`, etc. A lot of these are just links back to the git binary these days, but many features begin life as a standalone `git-$X` executable, and back in early git days much more functionality was split across executables. (The ones that are now links back to git are largely for scripting purposes; a lot of git "plugins" and various CI type scripts will call `git-mv` rather than trying to get quoting right around calling `git mv`, for example.)
It also helps make plugins easier to distribute. I don't want to have to type `git-x` sometimes and `git y` others, and if I want my plugin to get adoption, I really really don't want that. So things like git-lfs, git-annex, etc can easily be distributed, documented as a plugin, and generally be considered as "a part of git", rather than a separate command.
This pattern is also not unique to git. Other software has followed it, notably cargo.
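The dispatch convention itself is tiny. Here is a sketch of how a front end can fall back to `<tool>-<subcommand>` binaries on $PATH (hypothetical `mytool` naming; this illustrates the convention, not how git implements it internally):

```rust
use std::env;
use std::process::{exit, Command};

fn main() {
    // `mytool foo --bar` looks for an executable called `mytool-foo` on
    // $PATH and passes the remaining arguments straight through.
    let mut args = env::args().skip(1);
    let sub = match args.next() {
        Some(s) => s,
        None => {
            eprintln!("usage: mytool <subcommand> [args...]");
            exit(2)
        }
    };

    let status = Command::new(format!("mytool-{}", sub))
        .args(args)
        .status()
        .unwrap_or_else(|err| {
            eprintln!("mytool: '{}' is not a mytool command ({})", sub, err);
            exit(1)
        });
    exit(status.code().unwrap_or(1));
}
```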
Because something might make less sense on its own or conflict with another tool.
The problem with Landlock, AFAIU, is that it's inherited across exec. OpenBSD's pledge and unveil were deliberately designed to reset across exec by default precisely because the "more secure" behavior makes it difficult or impossible to add to a lot of existing code that needs to invoke other programs. It could be done in theory--e.g. by forking a helper process prior to locking down--but that requires a significant refactor all its own, so what ends up happening is that people just abstain from using seccomp or Landlock. Whereas all OpenBSD core utils were quickly modified to make use of pledge and unveil to some extent, and then over time improved to tighten things further; they were designed to permit and promote this sort of easy, incremental deployment.
I don't know the extent to which Landlock or even unveil would have helped here; maybe they would have only prevented the hook from running during the clone, but not subsequently, when trusted hooks are expected to run. But I'd bet adding unveil support to Git would be an easier task, notwithstanding (or even because of) the way Git invokes subprocesses.
> And no subprocesses.
have you never used git over ssh?
it's 2025 and we still can't decide whether to use \r, \n, or \r\n.
Lone CR died with classic Mac OS over 20 years ago, I think we can ignore that. Lone LF is arguably a Unix-ism, everything else is/was using CRLF. Except that Unix text files are becoming ubiquitous, while existing protocols and formats, as well as Windows conventions, are stuck with CRLF. There's really no good way out.
I'm on team "Windows should just accept and change to write and read CR and '/'. beginning the decades long transition process for those". Most of the APIs accept '/', and most of the software accepts CR-only.
I think even Microsoft have noticed this, which is why WSL exists to provide an island of Linux-flavored open source tooling inside a Windows environment.
I think you mean LF, not CR. The problem with changing the behavior with regard to CRLF is exactly that it would introduce vulnerabilities like the present one here, because some software would still apply the old behavior while others apply the new one. Stuff like https://portswigger.net/web-security/request-smuggling/advan....
Directory separators are another can of worms. A lot of functionality in Windows is driven by command-line invocations taking slash-prefixed options, where it's crucial that they are syntactically distinct from file system paths. I don't think a transition is possible without an unacceptable amount of compatibility breakage.
Of course we can, it's just that some people decide wrong.
It seems like Homebrew still provides a vulnerable version; the same goes for Debian Bookworm.
2.50.1 is available on Homebrew now, for anyone seeing this.
guess I have to wait a bit more... no update to git 2.50.1 on Homebrew yet.
https://github.com/Homebrew/homebrew-core/pull/229423 or brew install git --HEAD
Thanks for the PR. I was also looking into it briefly but did not understand where the SHA256s for the various architectures come from, so I gave up before creating the PR (now I understand they're created automatically by a bot).
Thanks for making that PR! A regular `brew install git` installs 2.50.1 for me now.
Is it just me or is the font an eyestrain on this blog?
I am not sure if there's bias on my part after reading your comment but yes, it is bothersome.
yeah I see what you mean. it's like the anti-aliasing is broken
why tf is git still running submodule hooks during clone at all? like, think: you're cloning a repo you didn't write or audit, and git just... runs a post-checkout hook from a submodule it just fetched off the internet. even with this CRLF bug fixed, that's still bananas
every other day of the year the rest of the industry is laughing in git
Suppose the system call to list a directory examined the place on the disk where a filename should be, and found bytes representing ASCII control characters. Should it deny the existence of the corresponding file? Assume disk corruption? Something else? After all, maybe (this is admittedly more theoretical than practical) those bytes map to something else in the current locale. It's not like modern Windows which assumes the filenames are all UTF-16.
What does it do today if there's a forward slash and a null byte?
Because filenames (and all other strings) are just bags of bytes on unix based systems.
As the article says: "I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages, the common factor being this is a vulnerability in the interprocess communication of the components (either between git and external processes, or within the components of git itself). It is possible to draw a parallel with CRLF injection as seen in HTTP (or even SMTP smuggling)."
You can write this in any language. None of them will stop you. I'm on the cutting edge of "stop using C", but this isn't C's fault.
You can, but in languages like Python/Java/Go/Rust/... you wouldn't, because you wouldn't write serialization/deserialization code by hand; you'd call out to a battle-hardened library.
This vulnerability is the fault of the C ecosystem, where there is no reasonable project-level package manager, so everyone writes everything from scratch. It's exacerbated by C's lack of generics (Rust/Java's solution) and introspection (Java/Python's solution), combined with its poor preprocessor (Go's solution), so it wouldn't even be easy to write an ergonomic general-purpose parser.
Python's pathlib wouldn't help you here, it can encode the necessary bits. Especially with configparser - it's a 20-year-old configuration reader. Java's story is worse.
What part of this would be prevented by another language?
You'd need to switch your data format to something like json, toml, etc. to prevent this from the outset. But JSON was first standardised only about 25 years ago, AJAX hadn't been invented when this code was written, and JSON was a fledgling that wasn't widely used yet.
I guess we had netrc - but that's not standardised and everyone implements it differently. Same story for INI.
There was XML - at a time when it was full of RCEs, and everyone was acknowledging that its parser would be 90% of your program. Would you have joined the people disparaging json at the time as reinventing xml?
This vulnerability is the fault of good data formats not yet being widely invented or in common use.
> What part of this would be prevented by another language?
> You'd need to switch your data format to something like json, toml, etc.
The part where, if you wrote this in any modern language's ecosystem, you would do exactly that.
Yes, modern languages and their ecosystems likely did not exist back then. The lesson going forward is that we shouldn't keep doing new things the way we did back then.
Saying that smithing metal with a pair of hand-driven bellows is inefficient isn't to say that the blacksmiths of ages past, who had no better option, were doing something wrong.
Ok... So you're not saying C is a problem.
You're saying every few years, we should torch our code and rewrite from scratch, using new tools.
... Enjoy your collapsing codebase. I'll stick with what works, thanks.
What an absurdly bad faith interpretation. I never said anything to even suggest abandoning old code.
As demonstrated by vulnerabilities like the one in the article, C (and its ecosystem) doesn't "work", so I'm glad to hear that you won't be sticking with that for new projects going forwards.
... Except you have already admitted that it has bupkus to do with C.
You said it was a lack of "modern tooling". Modern C toolchains vastly outstrip most others in modernity; C23 is three years old.
But no, I won't be breaking compatibility for everyone just to chase a shiny nickel. That is burning the barnyard for fear of geese.
I have a feeling that this code was developed before any of those languages were widely popular and before their package managers or packages were mature.
This file was written like 20 years ago.
Sure, I'm not trying to assign blame to Linus for deciding to write git in C, I'm saying that modern tooling (not C) would prevent the bug with reasonably high probability and that that's a factor when deciding what to do going forwards.
We keep getting RCEs in C because tons of useful programs are written in C. If someone writes a useful program in Rust, we'll get RCEs in Rust.
There are a lot of useful programs written in Rust nowadays. This comment might have made more sense like 5 years ago.
It's not that only C programs are useful. It's that subtle mistakes in C result in more catastrophic vulnerabilities.
Make a mistake in application code in a language like, say, Java, and you'll end up with an exception.
The article refutes that somewhat:
> I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages, the common factor being this is a vulnerability in the interprocess communication of the components (either between git and external processes, or within the components of git itself).
C programmers don't see C problems. They see logic errors that are possible in any language.
Running with scissors isn't a problem. The problem is stabbing yourself with them. Isn't it obvious?
As mentioned in the article, this is a logic error that has nothing to do with C strings.
Whilst true, there's a swathe of modern tooling that will aid in marshalling data for IPC. Would you not agree that if protobuf, JSON, or YAML were used, it'd be far less likely for this bug to have slipped in?
The OC was about language choice. You can use protobuf, json or yaml in C as well.
In general, though, all these can be wildly overkill for many tasks. At some point you just need to write good code and actually test it.
In isolation, for any one particular bug, yes. But if you start applying this logic to everything, even problems as simple as reading some bytes from a file, you end up with a heap of dependencies for the most mundane things. We've tried that, and it's bad.
I don't believe we must apply any guideline ad absurdum. Using a battle tested marshalling/serialization library is clearly the way to go most often. Of course, one can still construct difficult to parse XML and JSON or any other blob for any given format, but the chances that bad input will result in an RCE are lower.
On the contrary, we've tried it and it works great.
No, I think in general you should trust other people with experience in an area more than your own naive self. Division of labor and all that.
There are exceptions, as always, but using dependencies is good as a first approximation.
Having "safe" yaml parsing is a whole topic of head scratching in whatever language of your choice if you want a rabbit hole to look into...
No I would not agree that YAML or JSON parsers in any language are far less likely to have logic errors, and I'm not sure why protobuf (a binary format) would be a good choice for a human readable file.
INI is not a particularly complex format (less complex than YAML for example), and there are existing open source parsers written in C that could have been used.
You can dig in all you want, but this is not an issue with C strings or the INI format.
This isn't even a parser error at all - the INI format comes from DOS/Windows where a trailing carriage return would not be considered part of the value either.
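The asymmetry is between reader and writer, and the generic fix is round-trip-safe quoting on the write side. A minimal sketch of the idea (the quoting rules here are made up for illustration, not Git's or INI's actual rules):

```rust
// Hypothetical writer: refuse to emit a value bare if re-reading the line
// would change it; quote and escape instead.
fn write_value(value: &str) -> String {
    let needs_quoting = value.ends_with('\r')
        || value.contains('\n')
        || value.starts_with(' ')
        || value.ends_with(' ');
    if needs_quoting {
        // Escape backslash and quote first, then the control characters.
        let escaped = value
            .replace('\\', "\\\\")
            .replace('"', "\\\"")
            .replace('\r', "\\r")
            .replace('\n', "\\n");
        format!("\"{}\"", escaped)
    } else {
        value.to_string()
    }
}

fn main() {
    assert_eq!(write_value("normal"), "normal");
    assert_eq!(write_value("evil\r"), "\"evil\\r\""); // trailing CR can't merge into the line ending
}
```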
If only it were just the c code that was causing people to be owned lol
Using other languages would likely fix the issue, but only as a side effect. Most people would expect a C-vs-Rust comparison, so I'll take Go as an example.
Nobody would write the configuration-parsing code by hand; they'd just use whatever TOML library is at hand for Go. No INI shenanigans, and people would just use whatever stricter format is available (strings must be quoted in TOML).
So yeah, Rust and Go and Python and Java and Node and Ruby and whatnot would not have the bug just by virtue of having a package manager. The actual language is irrelevant.
However, whatever the language, the same hand-rolled implementation would have had the exact same bug.
Ah yes, yet ANOTHER vulnerability caused because Linux and most Unixes allow control characters in filenames. This ability's primary purpose appears to be to enable attacks and to make it significantly more difficult to write correct code. For example, you're not supposed to exchange filenames a line at a time, since filenames can contain newlines.
See my discussion here: https://dwheeler.com/essays/fixing-unix-linux-filenames.html
One piece of good news: POSIX recently added xargs -0 and find -print0, making it a little easier to portably handle such filenames. Still, it's a pain.
I plan to complete my "safename" Linux module I started years ago. When enabled, it prevents creating filenames in certain cases such as those with control characters. It won't prevent all problems, but it's a decent hardening mechanism that prevents problems in many cases.
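For what it's worth, here's a minimal sketch (Unix-only; illustrative code, not from any particular tool) of consuming `find . -print0` output on stdin, treating NUL as the only delimiter so a filename containing newlines or other control characters can't split or smuggle an entry:

```rust
use std::ffi::OsStr;
use std::io::Read;
use std::os::unix::ffi::OsStrExt;

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    std::io::stdin().read_to_end(&mut buf)?;

    for raw in buf.split(|&b| b == 0).filter(|s| !s.is_empty()) {
        // Filenames are arbitrary bytes on Unix, not necessarily UTF-8.
        let name = OsStr::from_bytes(raw);
        println!("{:?}", name); // Debug output escapes control characters for display
    }
    Ok(())
}
```

Usage would be something like `find . -print0 | ./list-names`, where a newline-delimited pipeline would silently mis-split the same input.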
You can get similar vulnerabilities with Unicode normalization, with mismatched code pages/character encodings, or, as the article points out, with a case-insensitive file system. That's not to say that control characters should be allowed in file names, but there's an inherent risk whenever byte sequences are being decoded or normalized into something else.
Not to the same degree, though, and the arguments for the status quo are especially weak. There are reasonable arguments for and against case-insensitive filenames. Character-encoding issues are dwindling, since most systems just use UTF-8 for filename encoding (as there is no mechanism for indicating the encoding of each specific filename), and using UTF-8 consistently in filenames supports filenames in arbitrary languages.
Control characters in filenames have no obviously valuable use case, they appear to be allowed only because "it's always been allowed". That is not a strong argument for them. Some systems do not allow them, with no obvious ill effects.
I think a better idea is to make git use user namespaces and sandbox itself to the clone directory so it literally cannot read or write outside of it. This prevents path traversal attacks and limits the amount of damage an RCE could do. Filenames really aren't the problem.
The idea of Defence in Depth is to handle vulnerabilities at several levels, instead of relying on a single technique that becomes a single point of failure.
I'm not saying not to do that. But it seems sandboxing should be the first thing to reach for, especially in the context of git, which allows you to execute all sorts of custom scripts. Filename sanitization is not that; on the contrary, filename sanitization has been known to cause security vulnerabilities and other annoying issues in the past.
I completely disagree with the author's (oft-quoted here in the comments) statement:
> I find this particularly interesting because this isn't fundamentally a problem of the software being written in C. These are logic errors that are possible in nearly all languages
For Christ's sake, Turing taught us that any error in one language is possible in any other language. You can even get a double free in Rust if you take the time to build an entire machine emulator and then run something that uses malloc in the ensuing VM. Rust and similar memory-safe languages can emulate literally any problem C can make a minefield out of... but logic errors being "possible" to perform is significantly different from logic errors being the first tool available to pull out of one's toolbox.
Other comments have noted that in non-C languages a person would be more likely to reach for a security-hardened library first, which I agree might be helpful... but replies to those comments also correctly point out that this trades one problem for another with dependency hell, and I would add on top of that the issue that a widely relied-upon library can also increase the attack surface when a novel exploit gets found in it. Libraries can be a very powerful tool, but neither are they a panacea.
I would argue that the real value in a more data-safe language (be that Rust or Haskell or LISP et al) is in offering the built-in abstractions which lend themselves to more carefully modeling data than as a firehose of octets which a person then assumes they need to state-switch over like some kind of raw Turing machine.
"Parse, don't validate" is a lot easier to stick to when you're coding in a language designed with a precept like that in mind vs a language designed to be only slightly more abstract than machine code where one can merely be grateful that they aren't forced to use jump instructions for every control flow action.
> You can even get double free in Rust if you take the time to build an entire machine emulator and then run something that uses Malloc in the ensuing VM.
No, this wouldn't be a double free in Rust, it'd be a double free in whatever language you used to write the emulated code.
The distinction is meaningful, because the logic error he's talking about is possible in actual rust (even without unsafe), not just theoretically in some virtual system that you can use Rust to write a simulation for.
Another example would be making your own allocator in Rust.
Not possible without unsafe.
Sure it is. Have an array which you allocate from.
I can easily see this bug happening in Rust. At some level you need to transform your data model into text to write out, and to parse incoming text. If you want to parse linewise you might use BufRead::lines(), and then write a parser for those lines. That parser won't touch CRs at all, which means when you do the opposite and write the code that serializes your data model back to lines, it's easy to forget that you need to avoid having a trailing CR, since CR appears nowhere else in your code.
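A minimal sketch of that failure mode (a hypothetical parser, not Git's code): the writer never quotes a trailing CR, while `BufRead::lines()` strips CRLF on the way back in, so the value silently changes across a round trip.

```rust
use std::io::BufRead;

fn main() {
    // Attacker-controlled value that ends in a carriage return.
    let value = "submodule-path\r";

    // Writing: join with '\n' and never quote the trailing CR...
    let serialized = format!("path = {}\n", value);
    assert_eq!(serialized, "path = submodule-path\r\n");

    // Reading: lines() strips a trailing "\r\n", so the CR is silently lost.
    let first_line = serialized.as_bytes().lines().next().unwrap().unwrap();
    assert_eq!(first_line, "path = submodule-path"); // CR gone; different path
}
```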
And - having dealt with parser construction in university for a few months - the only real way to deal with this is fuzzing and round trip tests.
It sounds defeatist, but non-trivial parsers end up with a huge state space very quickly - and entirely strange error situations and problematic inputs. And "non-trivial" starts a lot sooner than one would assume. As the article shows, even "one element per line" ends up non-trivial once you support two platforms. "foo\r\n" could be tokenized/parsed in 3 or even 4 different ways or so.
It just becomes worse from there. And then Unicode happened.
Well, the question then becomes "how do you identify the quoting that needs to happen on the line", and the tactics common in Rust, enabled by features available in Rust, will still lead a person away from this pattern of error.
One tool I'd have probably reached for (long before having heard of this particular corner case to avoid) would have been whitespace trimming, and CR counts as whitespace. Plus, folk outside of C are also more likely to aim a regex at a line they want to parse, and anyone who's been writing regexes for more than 5 minutes gets into the habit of adding `\s*` adjacent to the beginning-of-line and end-of-line markers (and outside of capture groups), which in this case achieves the same end.
You're describing a different format entirely then if you're doing generic whitespace trimming without any consideration for the definition of "whitespace". The Git config format explicitly defines ignorable whitespace as spaces and horizontal tabs, and says that these whitespace characters are trimmed from values, which means nothing else gets trimmed from values. If you try to write a parser for this using a regular expression and `\s*` then you'd better look up what `\s` means to your regex engine because it almost certainly includes more than just SP and HT.
I can't think of any features in Rust that will lead someone away from this pattern of error, where this pattern of error is not realizing that round-tripping the serialized output back through the deserializer can change the boundaries of line endings. It's really easy to think "if I have a bunch of single-line strings and I join them with newlines I now have multiline text, and I can split that back up into individual lines and get back what I started with". This is doubly true if you start with a parser that splits on newline characters and then change it after the fact to use BufRead::lines() in response to someone telling you it doesn't work on Windows.
I've been writing regular expressions for at least 8 years, and I'm not sure I've ever written `\s*`.
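To make the `\s` point concrete, a small sketch assuming the `regex` crate (the patterns are made up, not Git's): the `\s*` version silently trims a trailing CR because `\s` matches it, effectively redefining the format, while a pattern that trims only the SP and HT the format actually defines keeps the CR as part of the value.

```rust
use regex::Regex;

fn main() {
    let line = "path = evil\r";

    // `\s` matches CR (and more), so the trailing CR is trimmed away here...
    let loose = Regex::new(r"^path\s*=\s*(.*?)\s*$").unwrap();
    // ...while trimming only SP and HT preserves it as part of the value.
    let strict = Regex::new(r"^path[ \t]*=[ \t]*(.*?)[ \t]*$").unwrap();

    assert_eq!(&loose.captures(line).unwrap()[1], "evil");
    assert_eq!(&strict.captures(line).unwrap()[1], "evil\r");
}
```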
> You can even get a double free in Rust if you take the time to build an entire machine emulator and then run something that uses malloc in the ensuing VM. Rust and similar memory-safe languages can emulate literally any problem C can make a minefield out of...
That doesn't have any relevance to a discussion about memory safety in C vs. Rust. Invalid memory access in the emulated machine won't be able to access memory from other processes on the host system. Two languages being Turing complete does not make them the same language, and it definitely does not make them bug-for-bug compatible. Rust _really_ does enable you to write memory-safe programs.
Sounds like you actually agree with the comment you are replying to.
> "Parse, don't validate" is a lot easier to stick to when you're coding in a language designed with a precept like that in mind vs a language designed to be only slightly more abstract than machine code
"Parse, don't validate" is easily doable in plain C and almost always has been. See https://www.lelanthran.com/chap13/content.html
As you point out, the most serious way to undermine the "safety" features of a "safe" language like Rust is to implement a VM, programming language, serdes framework, etc., because these operate outside of Rust's type system and memory safety.
And that's exactly what the Git developers did here: they made an in-house configuration file format. If implemented in Rust, it would bypass most of Rust's safety features, particularly type safety.
It is mind-blowing the things people come up with when it comes to Rust vs. C conversations. The same convoluted crap for years at this point.
No, just no. I'm sorry, I've implemented countless custom formats in Rust and have NEVER had to sidestep safe/unsafe or otherwise sacrifice type safety. Just what an absurd claim.
Maybe for some binary (de)serialization you get fancy (lol and are still likely to be far better off than with C) but goodness, I cannot imagine a single reason why a config file parser would need to be (type)-unsafe.
The person you replied to didn't say that you had to bypass safety. This bug is orthogonal to type and memory safety; it's a different issue.
The git bug in question could be written in 100% safe rust using as much or as little of the type system[1] as you want. It's a logic error when parsing a string.
I dev rust full-time, and I've spent a lot of time writing protocol parsers. It's easy to forget to check this or that byte/string for every possible edge case as you're parsing it into some rust type, and it happens all the time in rust, just like it did in C or python or go when I used those languages. This bug (if anything) is the type of thing that is solved with good tokenizer design and testing, and by using more small, independently tested functions - again, not at all related to the type system.
[1] Although in rust you can arrange your types so that this sort of bug is harder to implement or easier to catch than in most languages... but doing that requires an up-front understanding that logic bugs are just as possible in rust as in other languages, as well as some experience to avoid awkwardness when setting the types up.
In practice I think a Rust project would have used TOML, which parses safely. The limitation there would be that TOML requires strings to be UTF-8, so it couldn't represent all possible Unix paths.
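A rough sketch of that, assuming the commonly used `toml` and `serde` crates (the struct here is hypothetical): because TOML requires control characters in basic strings to be escaped, a trailing CR can't silently merge into the line ending, and the round trip is faithful.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Submodule {
    path: String,
}

fn main() {
    let original = Submodule { path: "evil\r".to_string() };

    // The serializer has to escape the CR to emit valid TOML...
    let text = toml::to_string(&original).unwrap();

    // ...so parsing it back yields exactly the value that was written.
    let reparsed: Submodule = toml::from_str(&text).unwrap();
    assert_eq!(original, reparsed);
}
```

(As noted above, the trade-off is that a non-UTF-8 path couldn't be represented this way at all.)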
Zero surprise there's a bug in git's quoting. That code is mental.
I wrote a CLI program to detect the end-of-line format, tabs, BOMs, and NUL characters. It can be installed via Homebrew, or you can download standalone binaries for all platforms:
https://github.com/jftuga/chars