Skip to content

patch_parse: fix parsing unquoted filenames with spaces#4285

Merged
ethomson merged 2 commits into
libgit2:masterfrom
pks-t:pks/patches-with-whitespace
Dec 23, 2017
Merged

patch_parse: fix parsing unquoted filenames with spaces#4285
ethomson merged 2 commits into
libgit2:masterfrom
pks-t:pks/patches-with-whitespace

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented Jun 26, 2017

This is split out from #4272 due to some problems with ambiguous parsing of patches with whitespace and custom prefixes. This should not be merged as-is but is only intended as a reminder.


The patch format allows unquoted path names even when the paths contain
whitespaces in between. To disambiguate the path names in thee header,
one can simply use the slash in "a/" respectively "b/", as no file is
allowed to contain a forward slash.

While this format is distinctly parseable, we fail to do so correctly.
To determine the path length, we simply search for the first whitespace
and assume it is the delimiter between "a/foo.file b/foo.file", which
works for the simple cases. Instead, we should be searching for the
first occurrence of either ' b/' (note the leading space) or '\n' to
either delimit the first file path or the second file path. This does
not suffice yet, though, as the format also allows for quoted file
paths. Insofar, we also have to check for ' "b/' as delimiter for the
current file path.

Fix the issue by improving the logic determining file path length as
outlined in the above paragraph. Include a test to catch future
breakage.

@csware
Copy link
Copy Markdown
Contributor

csware commented Jul 6, 2017

Well, the first filename part could end with a space and be under a subpath starting with b, such as a/some /bla.txt in this case a patch is not parsed correctly with this patch.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jul 7, 2017

Yeah, correct, that's the actual issue. The issue becomes even more obvious when considering that it is possible to create custom prefixes. So instead of "a/" and "b/", the user can have "foobar/" and "whateverhelikes/" as prefixes, making it totally impossible for us to parse in all cases. One solution pointed out by @ethomson was to instead parse the "--- a/file\n+++ b/file" lines, but unfortunately they are not mandated by the patch format, either.

@pks-t pks-t force-pushed the pks/patches-with-whitespace branch from 6c27418 to 99506a4 Compare September 22, 2017 12:12
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Sep 22, 2017

This PR now depends on #4308.

I've changed the code to follow @ethomson's proposal of just bailing on the "diff --git" line if it seems to be unparseable and instead just use the "---" and "+++" lines. What I'm still undecided about is whether we want to fill in patch->header_old_part and patch->header_new_part, which is derived from the "diff --git" line, with the contents of patch->old_part and patch->new_part, derived from the "---" and "+++" lines. Opinions?

@ethomson
Copy link
Copy Markdown
Member

ethomson commented Oct 7, 2017

Opinions?

Yeah, I don't love my proposal of moving to the --- and +++ lines. It was sort of a half baked idea in the first place... I wonder if we should just get aggressive about validating the diff --git line and - if it's bogus - not fall back at all...?

Would your opinion about this behavior be different if I sent a patch to git to always quote filenames with spaces in them, ending this ambiguity once and for all?

(I should do that either way, of course.)

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Oct 9, 2017

I definitly think this should be fixed upstream, it's simply awful right now. But that won't help users who are stuck with older versions of Git in the next 10+ years (if you count things like RHEL), so in my opinion this fix here is a nice thing to have. Bailing out whenever we cannot parse diff --git simply feels wrong when we can in fact just extract the correct information from the +++ and --- lines.

@ethomson
Copy link
Copy Markdown
Member

ethomson commented Oct 9, 2017

Sure. Out of curiosity, how does Git itself choose the filenames? Does it use the +++ and --- lines? I was thinking that it did not, but I've actually forgotten by now, tbqh. If it does, then it's a no brainer to emulate them. If they do not, then it would be nice to know what they are doing and how they're working around this issue.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Oct 11, 2017

Git uses the diff --git line as defname, which is a fallback in case oldname or newname could not be derived from the +++ and --- lines. Here's the comment from git_header_name, which parses the diff --git thing:

/*
 * This is to extract the same name that appears on "diff --git"
 * line.  We do not find and return anything if it is a rename
 * patch, and it is OK because we will find the name elsewhere.
 * We need to reliably find name only when it is mode-change only,
 * creation or deletion of an empty file.  In any of these cases,
 * both sides are the same name under a/ and b/ respectively.
 */
static char *git_header_name(struct apply_state *state,
			     const char *line,
			     int llen)

Inside of that function:

		/* strip the a/b prefix including trailing slash */
		cp = skip_tree_prefix(state, first.buf, first.len);
		if (!cp)
			goto free_and_fail1;

So if they're not able to strip the prefixes, they simply bail out and fail. Same for other stuff if they're not able to parse the names. They only abort parsing the patch later on if neither oldname, newname nor defname are set.

So this PR here goes into the same direction, I'd argue.

@pks-t pks-t force-pushed the pks/patches-with-whitespace branch from 99506a4 to d5a192c Compare November 11, 2017 20:23
When parsing the "---" and "+++" line, we stop after the first
whitespace inside of the filename. But as files containing whitespaces
do not need to be quoted, we should instead use the complete line here.

This fixes parsing patches with unquoted paths with whitespaces.
The git patch format allows for having unquoted paths with whitespaces
inside. This format becomes ambiguous to parse, e.g. in the following
example:

    diff --git a/file b/with spaces.txt b/file b/with spaces.txt

While we cannot parse this in a correct way, we can instead use the
"---" and "+++" lines to retrieve the file names, as the path is not
followed by anything here but spans the complete remaining line. Because
of this, we can simply bail outwhen parsing the "diff --git" header here
without an actual error and then proceed to just take the paths from the
other headers.
@pks-t pks-t force-pushed the pks/patches-with-whitespace branch from d5a192c to 80226b5 Compare November 11, 2017 20:30
@ethomson ethomson merged commit 4110fc8 into libgit2:master Dec 23, 2017
@ethomson
Copy link
Copy Markdown
Member

Thanks for fixing this! 😀

@pks-t pks-t deleted the pks/patches-with-whitespace branch July 11, 2019 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants