Common parser interface by pks-t · Pull Request #4310 · libgit2/libgit2

pks-t · 2017-07-15T12:39:45Z

This is a first attempt at pulling out the diff-header parser and unifying its interface so that it can be reused by other parts of the code. In this PR I did a PoC that replaces the config file parser with this common parser code, which turns out to be quite promising: it removes roughly 150 lines of code that are no longer required.

There are two issues left with this PR, which is why I labeled it as WIP. First, there are some small-ish issues with proper indentation when writing config files, which I'll handle as soon as the second problem is solved. The second problem is with recursive include files, where the wrong structures are touched due to our faulty include handling. These things should be fixed by #4250.

As there are quite a lot of merge conflicts due to #4250, I'll stop working on this issue right now until the other PR is merged.

pks-t · 2017-07-21T11:04:46Z

I've rebased on master with #4250. As expected, the problem with recursive includes was solved by this. The only remaining problem now is improper indentation, which I'll fix later.

pks-t · 2017-07-21T11:37:04Z

Fixed remaining issues. Next steps which I'll not handle inside of this PR:

- rewrite attributes parser to use common interface
- convert existing parsers into state machines

pks-t · 2017-08-25T15:24:48Z

Rebased upon latest master to fix a single conflict due to the new "common.h" headers. I've also added commit messages for every commit.

pks-t · 2017-10-09T09:21:53Z

Gentle ping @ethomson @carlosmn :)

carlosmn · 2017-11-04T14:58:33Z

+
+	/* Next patch */
+	{ "diff --git "         , STATE_END,        0,                NULL },
+	{ "@@ -"                , STATE_END,        0,                NULL },


These 0s should be STATE_START, no? It looks like we do want to start again but using a number instead of the enum variant makes it look like we're after a different value.

carlosmn · 2017-11-04T15:01:07Z

+	STATE_COPY,
+
+	STATE_END,
+} parse_header_state;


It'd be nice to have some documentation of what each of these means. I was briefly confused by STATE_END being the start state for @@, which happens in the middle of a patch.

carlosmn · 2017-11-04T15:06:47Z


-			if (memcmp(ctx->line, op->str, min(op_len, ctx->line_len)) != 0)
+			if (transition->expected_state != state ||
+			    memcmp(ctx->parse_ctx.line, transition->str, min(op_len, ctx->parse_ctx.line_len)) != 0)


I think git__prefixcmp would be more appropriate here. Otherwise you're letting this match when either string is the prefix of the other.
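To make the concern concrete, here is a minimal sketch of the two comparisons. The helper names `loose_match` and `strict_prefix_match` are hypothetical and exist only for this illustration; they are not libgit2 functions, and the sketch does not claim to reproduce what `git__prefixcmp` does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the reviewed condition: only the first
 * min(prefix_len, line_len) bytes are compared. */
static int loose_match(const char *line, size_t line_len, const char *prefix)
{
	size_t prefix_len = strlen(prefix);
	size_t n = prefix_len < line_len ? prefix_len : line_len;

	assert(line != NULL);
	return memcmp(line, prefix, n) == 0;
}

/* Hypothetical stricter variant: the line must be at least as long as
 * the prefix before any bytes are compared. */
static int strict_prefix_match(const char *line, size_t line_len, const char *prefix)
{
	size_t prefix_len = strlen(prefix);

	assert(line != NULL);
	return line_len >= prefix_len && memcmp(line, prefix, prefix_len) == 0;
}
```

With these definitions, a truncated line such as `diff` (or even an empty line) satisfies `loose_match` against `"diff --git "`, while `strict_prefix_match` rejects it.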

carlosmn · 2017-11-04T15:24:05Z

+
 typedef struct {
 	const char *str;
+	parse_header_state expected_state;


Is this called expected_state because it's what you expect to be in in order to match? If we're going to go for a state machine, I think it makes more sense for this to be called current_state.
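For readers unfamiliar with the pattern under discussion, a table-driven state machine could look roughly like the sketch below. Everything here is an assumption made for illustration: the `sketch_` names, the three states, and the four table rows are invented and do not reflect the actual transition table in the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical states, named for illustration only. */
typedef enum {
	SKETCH_START, /* no header line seen yet */
	SKETCH_DIFF,  /* inside a header, after "diff --git " */
	SKETCH_END    /* header is complete */
} sketch_state;

typedef struct {
	const char *str;            /* prefix the line must start with */
	sketch_state current_state; /* state required for this row to apply */
	sketch_state next_state;    /* state entered after a match */
} sketch_transition;

static const sketch_transition transitions[] = {
	{ "diff --git ", SKETCH_START, SKETCH_DIFF },
	{ "--- ",        SKETCH_DIFF,  SKETCH_DIFF },
	{ "+++ ",        SKETCH_DIFF,  SKETCH_DIFF },
	{ "@@ -",        SKETCH_DIFF,  SKETCH_END  },
};

/* Returns 0 and updates *state if a row matches, -1 otherwise.
 * `line` is assumed to be NUL-terminated in this sketch. */
static int sketch_step(sketch_state *state, const char *line)
{
	size_t i;

	assert(state != NULL && line != NULL);
	for (i = 0; i < sizeof(transitions) / sizeof(transitions[0]); i++) {
		const sketch_transition *t = &transitions[i];

		if (t->current_state == *state &&
		    strncmp(line, t->str, strlen(t->str)) == 0) {
			*state = t->next_state;
			return 0;
		}
	}
	return -1;
}
```

A row applies only when the machine is already in that row's `current_state`, so a hunk header arriving before any `diff --git ` line is rejected rather than silently accepted.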

carlosmn · 2017-11-04T15:29:02Z

+		memcpy((char *)ctx->content, content, content_len);
+	} else {
+		ctx->content = NULL;
+	}


Must we copy the contents? We've already read the file or buffer into memory.

carlosmn · 2017-11-04T15:44:00Z

+	return 0;
+}
+
+int git_parse_advance_digit(git_off_t *out, git_parse_ctx *ctx, int base)


This doesn't match the type in the header. We should have int64 here.

carlosmn · 2017-11-04T15:48:41Z

+	return 0;
+}
+
+int git_parse_peek(char *out, git_parse_ctx *ctx, int flags)


A lot of the uses for this do just want to look at the next character. It might not be a huge deal but I worry about the performance hit when we have this loop in the function that we don't usually want.

Well, if the skip-whitespace flag is not set we're not looping at all. Are you worried about bad code generated by the compiler or just about calling that function multiple times? I mean I could also split that up into two different functions git_parser_peek and git_parser_peek_skip_whitespace (naming should obviously be improved), but I don't think it's required right now.

carlosmn · 2017-11-04T15:57:23Z

-		reader.read_ptr = reader.buffer.ptr;
-		reader.eof = 0;
+	if (result == 0 || result == GIT_ENOTFOUND) {
+		git_parse_ctx_init(&reader.ctx, contents.ptr, contents.size);


Do we guarantee that we won't touch the fields in contents if it's not found? It makes sense that we wouldn't touch them, but if that's not part of the contract, this could initialise the context with whatever random pointer.

Do you mean the contents of the context?

The `git_patch_parse_ctx` encapsulates both parser state and options specific to patch parsing. To advance this state and keep it consistent, we provide a few functions which handle advancing the current position and accessing bytes of the patch contents. In fact, these functions are quite generic and not related to patch parsing by themselves. Seeing that we have similar logic inside of other modules, it becomes quite enticing to extract this functionality into its own parser module. To do so, we create a new module `parse` with a central struct called `git_parse_ctx`. It encapsulates the content that is to be parsed, its lengths, and the current position. `git_patch_parse_ctx` now contains only this `parse_ctx`, which is then accessed whenever we need to touch the current parser. This is the first step towards re-using this functionality across other modules which require parsing and towards removing code duplication.

Instead of manually checking the parsing context's remaining length and comparing the leading bytes with a specific string, we can simply re-use the function `git_parse_ctx_contains_s`. Do so to avoid code duplication and to further decouple patch parsing from the parsing context's struct members.

The patch parsing code has multiple recurring patterns where we want to parse an actual number. Create a new function `git_parse_advance_digit` and use it to avoid code duplication.
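As an illustration of what such a helper has to do, the following is a minimal sketch, not the implementation from this PR. The struct `sketch_ctx` and the function `sketch_advance_digit` are hypothetical; the sketch assumes decimal input only (no `base` parameter) and a line buffer that is not NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced parsing context: just the current line. */
typedef struct {
	const char *line; /* current position, not NUL-terminated */
	size_t line_len;  /* bytes remaining in the line */
} sketch_ctx;

/* Parses a run of decimal digits at the current position and advances
 * past it. Returns 0 on success, -1 if there is no digit or the value
 * would overflow int64_t. The context is left untouched on failure. */
static int sketch_advance_digit(int64_t *out, sketch_ctx *ctx)
{
	int64_t value = 0;
	size_t i = 0;

	assert(out != NULL && ctx != NULL);
	while (i < ctx->line_len && ctx->line[i] >= '0' && ctx->line[i] <= '9') {
		int digit = ctx->line[i] - '0';

		if (value > (INT64_MAX - digit) / 10)
			return -1; /* would overflow int64_t */
		value = value * 10 + digit;
		i++;
	}
	if (i == 0)
		return -1; /* no digit at the current position */

	*out = value;
	ctx->line += i;
	ctx->line_len -= i;
	return 0;
}
```

Parsing and advancing in one call is what lets call sites such as hunk-header parsing drop their hand-written digit loops.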

Some code parts need to inspect the next few bytes without actually consuming them yet, for example to examine what content to expect next. Create a new function `git_parse_peek` which returns the next byte without modifying the parsing context and use it at multiple call sites.
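A peek operation with an optional skip-whitespace flag, as discussed in the review, might be sketched as follows. The names `sketch_ctx`, `sketch_peek`, and `SKETCH_PEEK_SKIP_WHITESPACE` are hypothetical stand-ins chosen for this example, and the behavior shown is an assumption rather than a copy of the PR's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SKETCH_PEEK_SKIP_WHITESPACE (1 << 0) /* hypothetical flag */

/* Hypothetical, reduced parsing context: just the current line. */
typedef struct {
	const char *line; /* current position, not NUL-terminated */
	size_t line_len;  /* bytes remaining in the line */
} sketch_ctx;

/* Stores the next byte in *out without consuming anything. With the
 * skip-whitespace flag, leading whitespace is looked past (but still
 * not consumed). Returns -1 if no byte is available. */
static int sketch_peek(char *out, const sketch_ctx *ctx, int flags)
{
	size_t i = 0;

	assert(out != NULL && ctx != NULL);
	if (flags & SKETCH_PEEK_SKIP_WHITESPACE) {
		while (i < ctx->line_len && isspace((unsigned char)ctx->line[i]))
			i++;
	}
	if (i >= ctx->line_len)
		return -1;

	*out = ctx->line[i];
	return 0;
}
```

Because the context is taken as `const`, the sketch cannot advance the position by accident; the loop only runs when the flag is set.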

Upon initializing the parser context, we do not currently initialize the current line, line length and line number. Do so in order to make the interface easier to use and more obvious for future consumers of the parsing API.
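One way to picture an initializer that also sets up the first line is the sketch below. The struct layout and the names `sketch_parse_ctx`, `sketch_line_len`, and `sketch_ctx_init` are assumptions for illustration; in particular, this sketch borrows the caller's buffer instead of copying it, which may differ from what the PR does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical context layout, assumed for this sketch. */
typedef struct {
	const char *content; /* whole buffer (borrowed, not copied) */
	size_t content_len;
	const char *remain;  /* unparsed tail of the buffer */
	size_t remain_len;
	const char *line;    /* start of the current line */
	size_t line_len;     /* length including the trailing newline */
	size_t line_num;     /* 1-based line counter */
} sketch_parse_ctx;

/* Length of the first line in buf, including '\n' if present. */
static size_t sketch_line_len(const char *buf, size_t len)
{
	const char *nl = memchr(buf, '\n', len);

	return nl ? (size_t)(nl - buf) + 1 : len;
}

/* Initializes every field so callers can read the first line directly. */
static void sketch_ctx_init(sketch_parse_ctx *ctx, const char *content, size_t content_len)
{
	assert(ctx != NULL);
	assert(content != NULL || content_len == 0);

	ctx->content = ctx->remain = ctx->line = content_len ? content : NULL;
	ctx->content_len = ctx->remain_len = content_len;
	ctx->line_len = ctx->line ? sketch_line_len(ctx->line, content_len) : 0;
	ctx->line_num = 1;
}
```

With the line fields populated up front, a consumer does not need a separate "advance to first line" call before it starts parsing.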

The configuration file code grew quite big and intermingles actual configuration logic with the parsing logic of the configuration syntax. This makes it hard to refactor the parsing logic on its own and convert it to make use of our new parsing context module. Refactor the code and split it up into two parts. The config file code will only handle configuration files themselves, includes and writing new files. The newly created config parser module is then only responsible for parsing the actual contents of a configuration file, leaving everything else to callbacks passed to the function it provides, `git_config_parse`.

As the config parser is now cleanly separated from the config file code, we can easily refactor the code and make use of the common parser module. This removes quite a lot of duplicated functionality previously used for handling the actual parser state and replaces it with the generic interface provided by the parser context.

pks-t force-pushed the pks/common-parser branch from 4244e6f to cc2e745 Compare July 21, 2017 11:03

pks-t changed the title ~~[WIP] Common parser~~ Common parser interface Jul 21, 2017

pks-t force-pushed the pks/common-parser branch from cbb6343 to 1096c49 Compare August 25, 2017 15:22

pks-t mentioned this pull request Aug 25, 2017

patch_parse: implement state machine for parsing patch headers #4308

Merged

carlosmn reviewed Nov 4, 2017

pks-t mentioned this pull request Nov 6, 2017

Config iterators do not retain order #4361

Closed

pks-t added 7 commits November 11, 2017 17:06

parse: implement and use git_parse_advance_digit

252f2ee

parse: always initialize line pointer

7bdfc0a

pks-t force-pushed the pks/common-parser branch from 1096c49 to 9e66590 Compare November 11, 2017 17:17

ethomson merged commit 1d7c15a into libgit2:master Nov 11, 2017

pks-t deleted the pks/common-parser branch November 11, 2017 20:32
