Patch ID calculation by pks-t · Pull Request #4272 · libgit2/libgit2

pks-t · 2017-06-16T12:22:20Z

This is a work in progress. It does produce the same patch IDs for many files, but I'm not quite there yet. I've also discovered an error in parsing patches where the file names have whitespaces inside and added a test, but there is no fix yet.

pks-t · 2017-06-16T13:03:59Z

Seems like we also don't parse patches with multiple diffs correctly. At least the following diff only invokes the file callback once:

diff --git a/x b/x
index 8a1218a..7059ba5 100644
--- a/x
+++ b/x
@@ -1,5 +1,4 @@
 1
 2
-3
 4
 5
diff --git a/y b/y
index e006065..9405325 100644
--- a/y
+++ b/y
@@ -1,4 +1,5 @@
 a
 b
+c
 d
 e

Will investigate later on. But other than multi-file patch IDs and patches of files with white space, this patch seems to correctly generate the stable IDs.

ethomson · 2017-06-16T16:57:53Z

You need to flip that around. In libgit2 terminology, a patch doesn't have multiple diffs, a diff has multiple patches.

A "patch" is a per-file change. Multiple patches are aggregated into a diff. I sort of loathe this terminology, but here we are. (And please do correct me if I'm wrong, I always get confused by our naming here.)

How we choose to talk about the "patch id" for compatibility with git and our own internal nomenclature woes has yet to be seen. We could call it a "diff id" instead. Or not.

pks-t · 2017-06-19T07:37:16Z

Ah, yeah. Have been bitten by that terminology multiple times now, thanks for reminding! Having switched over to the diff interface, all tests pass right now.

So I chose to name the function git_diff_patchid, which I think makes most sense. We do not want to introduce new terminology here with e.g. git_diff_id, but instead state our relation to the upstream git-patch-id(1).

While at it, I've now fixed parsing patches with unquoted file names containing whitespaces. I've also tried to tighten our diff parsing code to return an error whenever we parse a patch containing multiple files, but quickly noticed that we in fact rely on the ability to ignore trailing garbage in a patch. So this interface is potentially dangerous in that it allows a developer to parse a patch as diff where we simply truncate the diff to the first patch. While it can be argued that the developer is at fault for not reading documentation, I can sympathise due to our confusing terminology. But well, here we are. Meh...

ethomson · 2017-06-21T10:26:51Z

So this interface is potentially dangerous in that it allows a developer to parse a patch as diff where we simply truncate the diff to the first patch.

IIRC, this is necessary, in fact, since they can (legally) have trailing data, IIRC.

(sigh.)

ethomson · 2017-06-25T18:26:15Z

-		if (!quoted && git__isspace(ctx->line[len]))
+		if (!quoted &&
+		    (!git__strncmp(ctx->line + len, " b/", strlen(" b/")) ||
+		     !git__strncmp(ctx->line + len, " \"b/", strlen(" \"b/")) ||


There's no guarantee that the prefixes are a/ and b/. You can set custom prefixes with --src-prefix and --dst-prefix, or you can have no prefix at all (with --no-prefix)...

This is a bit of a bummer as it means parsing the header line is completely ambiguous. What does:

diff --git a/b c/d/e f/g h/i j/k

mean?

I think we'll need to store the whole header buffer then parse it with the knowledge given to us by the --- and +++ lines.

:(

Except that a patch needn't have +++ and --- lines.

diff --git foo bar foo bar old mode 100755 new mode 100644

(Note: git cannot apply this patch.)

God, that's awful. So at some point, I guess we'd just have to say "This is unsupported." I think it makes most sense to use the "+++" and "---" lines and bail out if they're missing, given that upstream git itself cannot use these kind of patches.

Truly hideous, right? Why these aren't just quoted whenever there's a space in the filename, I don't know. 😢

Can we split the space in the filename commit out? I think this removes our ability to deal with custom prefixes. I doubt anybody is actually using that but it might be thinking about this a bit more before we pull it out. I think the patchid stuff is 👍 and ready to go.

Yeah, wanted to propose the same. Will create a new PR for the other "fix" and keep it on hold for now

The upstream git project provides the ability to calculate a so-called patch ID. Quoting from git-patch-id(1): A "patch ID" is nothing but a sum of SHA-1 of the file diffs associated with a patch, with whitespace and line numbers ignored." Patch IDs can be used to identify two patches which are probably the same thing, e.g. when a patch has been cherry-picked to another branch. This commit implements a new function `git_diff_patchid`, which gets a patch and derives an OID from the diff. Note the different terminology here: a patch in libgit2 are the differences in a single file and a diff can contain multiple patches for different files. The implementation matches the upstream implementation and should derive the same OID for the same diff. In fact, some code has been directly derived from the upstream implementation. The upstream implementation has two different modes to calculate patch IDs, which is the stable and unstable mode. The old way of calculating the patch IDs was unstable in a sense that a different ordering the diffs was leading to different results. This oversight was fixed in git 1.9, but as git tries hard to never break existing workflows, the old and unstable way is still default. The newer and stable way does not care for ordering of the diff hunks, and in fact it is the mode that should probably be used today. So right now, we only implement the stable way of generating the patch ID.

ethomson · 2017-07-19T20:28:42Z

🎉

pks-t force-pushed the pks/patch-id branch from 35829f0 to fb07fa0 Compare June 16, 2017 13:00

pks-t force-pushed the pks/patch-id branch from fb07fa0 to 945b079 Compare June 19, 2017 07:30

pks-t changed the title ~~WIP: Patch ID calculation~~ Patch ID calculation Jun 19, 2017

pks-t force-pushed the pks/patch-id branch from 945b079 to 94ed08d Compare June 21, 2017 05:53

ethomson mentioned this pull request Jun 24, 2017

git patch-id libgit2/libgit2sharp#1424

Open

ethomson reviewed Jun 25, 2017

View reviewed changes

pks-t force-pushed the pks/patch-id branch from 94ed08d to 2554b8a Compare June 26, 2017 13:15

pks-t mentioned this pull request Jun 26, 2017

patch_parse: fix parsing unquoted filenames with spaces #4285

Merged

pks-t force-pushed the pks/patch-id branch from 2554b8a to 89a3482 Compare June 26, 2017 13:39

pks-t mentioned this pull request Jul 10, 2017

Support git patch-id to compare the changes in different diffs #4270

Closed

ethomson merged commit a94a540 into libgit2:master Jul 19, 2017

pks-t deleted the pks/patch-id branch September 15, 2017 06:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patch ID calculation#4272

Patch ID calculation#4272
ethomson merged 1 commit into
libgit2:masterfrom
pks-t:pks/patch-id

pks-t commented Jun 16, 2017

Uh oh!

pks-t commented Jun 16, 2017

Uh oh!

ethomson commented Jun 16, 2017

Uh oh!

pks-t commented Jun 19, 2017

Uh oh!

ethomson commented Jun 21, 2017

Uh oh!

ethomson Jun 25, 2017

Uh oh!

ethomson Jun 25, 2017

Uh oh!

pks-t Jun 26, 2017

Uh oh!

ethomson Jun 26, 2017

Uh oh!

pks-t Jun 26, 2017

Uh oh!

ethomson commented Jul 19, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pks-t commented Jun 16, 2017

Uh oh!

pks-t commented Jun 16, 2017

Uh oh!

ethomson commented Jun 16, 2017

Uh oh!

pks-t commented Jun 19, 2017

Uh oh!

ethomson commented Jun 21, 2017

Uh oh!

ethomson Jun 25, 2017

Choose a reason for hiding this comment

Uh oh!

ethomson Jun 25, 2017

Choose a reason for hiding this comment

Uh oh!

pks-t Jun 26, 2017

Choose a reason for hiding this comment

Uh oh!

ethomson Jun 26, 2017

Choose a reason for hiding this comment

Uh oh!

pks-t Jun 26, 2017

Choose a reason for hiding this comment

Uh oh!

ethomson commented Jul 19, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants