Skip to content

Patch ID calculation#4272

Merged
ethomson merged 1 commit into
libgit2:masterfrom
pks-t:pks/patch-id
Jul 19, 2017
Merged

Patch ID calculation#4272
ethomson merged 1 commit into
libgit2:masterfrom
pks-t:pks/patch-id

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented Jun 16, 2017

This is a work in progress. It does produce the same patch IDs for many files, but I'm not quite there yet. I've also discovered an error in parsing patches where the file names have whitespaces inside and added a test, but there is no fix yet.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jun 16, 2017

Seems like we also don't parse patches with multiple diffs correctly. At least the following diff only invokes the file callback once:

diff --git a/x b/x
index 8a1218a..7059ba5 100644
--- a/x
+++ b/x
@@ -1,5 +1,4 @@
 1
 2
-3
 4
 5
diff --git a/y b/y
index e006065..9405325 100644
--- a/y
+++ b/y
@@ -1,4 +1,5 @@
 a
 b
+c
 d
 e

Will investigate later on. But other than multi-file patch IDs and patches of files with white space, this patch seems to correctly generate the stable IDs.

@ethomson
Copy link
Copy Markdown
Member

You need to flip that around. In libgit2 terminology, a patch doesn't have multiple diffs, a diff has multiple patches.

A "patch" is a per-file change. Multiple patches are aggregated into a diff. I sort of loathe this terminology, but here we are. (And please do correct me if I'm wrong, I always get confused by our naming here.)

How we choose to talk about the "patch id" for compatibility with git and our own internal nomenclature woes has yet to be seen. We could call it a "diff id" instead. Or not.

@pks-t pks-t changed the title WIP: Patch ID calculation Patch ID calculation Jun 19, 2017
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jun 19, 2017

Ah, yeah. Have been bitten by that terminology multiple times now, thanks for reminding! Having switched over to the diff interface, all tests pass right now.

So I chose to name the function git_diff_patchid, which I think makes most sense. We do not want to introduce new terminology here with e.g. git_diff_id, but instead state our relation to the upstream git-patch-id(1).

While at it, I've now fixed parsing patches with unquoted file names containing whitespaces. I've also tried to tighten our diff parsing code to return an error whenever we parse a patch containing multiple files, but quickly noticed that we in fact rely on the ability to ignore trailing garbage in a patch. So this interface is potentially dangerous in that it allows a developer to parse a patch as diff where we simply truncate the diff to the first patch. While it can be argued that the developer is at fault for not reading documentation, I can sympathise due to our confusing terminology. But well, here we are. Meh...

@ethomson
Copy link
Copy Markdown
Member

So this interface is potentially dangerous in that it allows a developer to parse a patch as diff where we simply truncate the diff to the first patch.

IIRC, this is necessary, in fact, since they can (legally) have trailing data, IIRC.

(sigh.)

Comment thread src/patch_parse.c Outdated
if (!quoted && git__isspace(ctx->line[len]))
if (!quoted &&
(!git__strncmp(ctx->line + len, " b/", strlen(" b/")) ||
!git__strncmp(ctx->line + len, " \"b/", strlen(" \"b/")) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no guarantee that the prefixes are a/ and b/. You can set custom prefixes with --src-prefix and --dst-prefix, or you can have no prefix at all (with --no-prefix)...

This is a bit of a bummer as it means parsing the header line is completely ambiguous. What does:

diff --git a/b c/d/e f/g h/i j/k

mean?

I think we'll need to store the whole header buffer then parse it with the knowledge given to us by the --- and +++ lines.

:(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that a patch needn't have +++ and --- lines.

diff --git foo bar foo bar
old mode 100755
new mode 100644

(Note: git cannot apply this patch.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

God, that's awful. So at some point, I guess we'd just have to say "This is unsupported." I think it makes most sense to use the "+++" and "---" lines and bail out if they're missing, given that upstream git itself cannot use these kind of patches.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truly hideous, right? Why these aren't just quoted whenever there's a space in the filename, I don't know. 😢

Can we split the space in the filename commit out? I think this removes our ability to deal with custom prefixes. I doubt anybody is actually using that but it might be thinking about this a bit more before we pull it out. I think the patchid stuff is 👍 and ready to go.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, wanted to propose the same. Will create a new PR for the other "fix" and keep it on hold for now

The upstream git project provides the ability to calculate a so-called
patch ID. Quoting from git-patch-id(1):

    A "patch ID" is nothing but a sum of SHA-1 of the file diffs
    associated with a patch, with whitespace and line numbers ignored."

Patch IDs can be used to identify two patches which are probably the
same thing, e.g. when a patch has been cherry-picked to another branch.

This commit implements a new function `git_diff_patchid`, which gets a
patch and derives an OID from the diff. Note the different terminology
here: a patch in libgit2 are the differences in a single file and a diff
can contain multiple patches for different files. The implementation
matches the upstream implementation and should derive the same OID for
the same diff. In fact, some code has been directly derived from the
upstream implementation.

The upstream implementation has two different modes to calculate patch
IDs, which is the stable and unstable mode. The old way of calculating
the patch IDs was unstable in a sense that a different ordering the
diffs was leading to different results. This oversight was fixed in git
1.9, but as git tries hard to never break existing workflows, the old
and unstable way is still default. The newer and stable way does not
care for ordering of the diff hunks, and in fact it is the mode that
should probably be used today. So right now, we only implement the
stable way of generating the patch ID.
@ethomson
Copy link
Copy Markdown
Member

🎉

@pks-t pks-t deleted the pks/patch-id branch September 15, 2017 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants