Skip to content

Common parser interface#4310

Merged
ethomson merged 7 commits into
libgit2:masterfrom
pks-t:pks/common-parser
Nov 11, 2017
Merged

Common parser interface#4310
ethomson merged 7 commits into
libgit2:masterfrom
pks-t:pks/common-parser

Conversation

@pks-t
Copy link
Copy Markdown
Member

@pks-t pks-t commented Jul 15, 2017

This is a first attempt at pulling out the diff-header parser and unify its interface, such that it can be reused for other code parts. In this PR I did a PoC to replace the config file parser with this common parser code, which turns out to be quite promising in that it removes roughly 150 lines of code which are not required anymore.

There are two issues left with this PR, which is why I labeled it as WIP. First, there are some small-ish issues with proper indentation when writing config files, which I'll handle as soon as the second problem is solved. The second problem is with recursive include files, where the wrong structures are touched due to our faulty include handling. These things should be fixed by #4250.

As there are quite a lot of merge conflicts due to #4250, I'll stop working on this issue right now until the other PR is merged.

@pks-t pks-t force-pushed the pks/common-parser branch from 4244e6f to cc2e745 Compare July 21, 2017 11:03
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jul 21, 2017

I've rebased on master with #4250. As expected, the problem with recursive includes was solved by this. Only remaining problem now is improper indentation, which I'll fix later

@pks-t pks-t changed the title [WIP] Common parser Common parser interface Jul 21, 2017
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Jul 21, 2017

Fixed remaining issues. Next steps which I'll not handle inside of this PR:

  • rewrite attributes parser to use common interface
  • convert existing parsers into state machines

@pks-t pks-t force-pushed the pks/common-parser branch from cbb6343 to 1096c49 Compare August 25, 2017 15:22
@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Aug 25, 2017

Rebased upon latest master to fix a single conflict due to the new "common.h" headers. I've also added commit messages for every commit.

@pks-t
Copy link
Copy Markdown
Member Author

pks-t commented Oct 9, 2017

Gentle ping @ethomson @carlosmn :)

Comment thread src/patch_parse.c

/* Next patch */
{ "diff --git " , STATE_END, 0, NULL },
{ "@@ -" , STATE_END, 0, NULL },
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 0s should be STATE_START, no? It looks like we do want to start again but using a number instead of the enum variant makes it look like we're after a different value.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Fixed

Comment thread src/patch_parse.c
STATE_COPY,

STATE_END,
} parse_header_state;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have some documentation of what each of these mean. I was shortly confused by STATE_END being the start start state for @@ which happens in the middle of a patch.

Comment thread src/patch_parse.c Outdated

if (memcmp(ctx->line, op->str, min(op_len, ctx->line_len)) != 0)
if (transition->expected_state != state ||
memcmp(ctx->parse_ctx.line, transition->str, min(op_len, ctx->parse_ctx.line_len)) != 0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think git__prefixcmp would be more appropriate here. Otherwise you're letting this match when either string is the prefix of the other.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Comment thread src/patch_parse.c

typedef struct {
const char *str;
parse_header_state expected_state;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called expected_state because it's what you expect to be in in order to match? If we're going to go for a state machine, I think it makes more sense for this to be called current_state.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread src/parse.c Outdated
memcpy((char *)ctx->content, content, content_len);
} else {
ctx->content = NULL;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must we copy the contents? We're already read whatever file or buffer into memory.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment thread src/parse.c Outdated
return 0;
}

int git_parse_advance_digit(git_off_t *out, git_parse_ctx *ctx, int base)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't match the type in the header. We should have int64 here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread src/parse.c
return 0;
}

int git_parse_peek(char *out, git_parse_ctx *ctx, int flags)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the uses for this do just want to look at the next character. It might not be a huge deal but I worry about the performance hit when we have this loop in the function that we don't usually want.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if the skip-whitespace flag is not set we're not looping at all. Are you worried about bad code generated by the compiler or just about calling that function multiple times? I mean I could also split that up into two different functions git_parser_peek and git_parser_peek_skip_whitespace (naming should obviously be improved), but I don't think it's required right now.

Comment thread src/config_file.c
reader.read_ptr = reader.buffer.ptr;
reader.eof = 0;
if (result == 0 || result == GIT_ENOTFOUND) {
git_parse_ctx_init(&reader.ctx, contents.ptr, contents.size);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we guarantee that we won't touch the fields in contents if it's not found? It makes sense that we wouldn't touch them, but if that's not part of the contract, this could initialise the context with whatever random pointer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the contents of the context?

The `git_patch_parse_ctx` encapsulates both parser state as well as
options specific to patch parsing. To advance this state and keep it
consistent, we provide a few functions which handle advancing the
current position and accessing bytes of the patch contents. In fact,
these functions are quite generic and not related to patch-parsing by
themselves. Seeing that we have similar logic inside of other modules,
it becomes quite enticing to extract this functionality into its own
parser module.

To do so, we create a new module `parse` with a central struct called
`git_parse_ctx`. It encapsulates both the content that is to be parsed
as well as its lengths and the current position. `git_patch_parse_ctx`
now only contains this `parse_ctx` only, which is then accessed whenever
we need to touch the current parser. This is the first step towards
re-using this functionality across other modules which require parsing
functionality and remove code-duplication.
Instead of manually checking the parsing context's remaining length and
comparing the leading bytes with a specific string, we can simply re-use
the function `git_parse_ctx_contains_s`. Do so to avoid code duplication
and to further decouple patch parsing from the parsing context's struct
members.
The patch parsing code has multiple recurring patterns where we want to
parse an actual number. Create a new function `git_parse_advance_digit`
and use it to avoid code duplication.
Some code parts need to inspect the next few bytes without actually
consuming it yet, for example to examine what content it has to expect
next. Create a new function `git_parse_peek` which returns the next byte
without modifying the parsing context and use it at multiple call sites.
Upon initializing the parser context, we do not currently initialize the
current line, line length and line number. Do so in order to make the
interface easier to use and more obvious for future consumers of the
parsing API.
The configuration file code grew quite big and intermingles both actual
configuration logic as well as the parsing logic of the configuration
syntax. This makes it hard to refactor the parsing logic on its own and
convert it to make use of our new parsing context module.

Refactor the code and split it up into two parts. The config file code
will only handle actual handling of configuration files, includes and
writing new files. The newly created config parser module is then only
responsible for parsing the actual contents of a configuration file,
leaving everything else to callbacks provided to its provided function
`git_config_parse`.
As the config parser is now cleanly separated from the config file code,
we can easily refactor the code and make use of the common parser
module. This removes quite a lot of duplicated functionality previously
used for handling the actual parser state and replaces it with the
generic interface provided by the parser context.
@pks-t pks-t force-pushed the pks/common-parser branch from 1096c49 to 9e66590 Compare November 11, 2017 17:17
@ethomson ethomson merged commit 1d7c15a into libgit2:master Nov 11, 2017
@pks-t pks-t deleted the pks/common-parser branch November 11, 2017 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants