
Do Not Merge: POC code to show how objects and not just CFs could be used as return values #3323

Closed
bbakerman wants to merge 7 commits into master from support_objects_or_cfs

Conversation

@bbakerman (Member) commented Sep 4, 2023

This is a POC to show how we could return Objects and not just CompletableFutures from the data fetchers and then beyond

This was driven from my investigations into Java 21 virtual threads, where a DF can in theory return a value that's async BUT is materialised by the time the DF returns (e.g. it ran on a VT inside the data fetcher).

CFs are cheapish but pure objects are (in theory) cheaper.

The price of this PR is the loss of type safety in our code and in callbacks such as Instrumentation.

We have to handle an Object return type as CompletableFuture<Foo> | Foo - so the compiler is not helping our code.

This also breaks the API of ExecutionStrategy BUT I am comfortable with that. On reflection, I think ExecutionStrategy should be @Internal code.

I will spend some time to see if this leads to better performance.

ALGORITHM

The general algorithm here is that at key points we have either a List<CompletableFuture<Foo>> or a List<Foo> - or the single-arity versions of those - and since Java can't represent a union type, this ends up being methods like

protected /* CompletableFuture<FieldValueInfo> | FieldValueInfo */ Object resolveFieldWithInfo(...)

So when these values need to be consumed you end up with 2 paths via if (obj instanceof CompletableFuture) else.
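A minimal sketch of this two-path consumption (a hypothetical helper, not the PR's actual code) might look like:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical helper showing the two-path consumption of a value whose
// static type is Object but whose runtime type is CompletableFuture<T> | T.
public class PolymorphicConsume {
    @SuppressWarnings("unchecked")
    public static <T> void consume(Object value, Consumer<T> onValue) {
        if (value instanceof CompletableFuture) {
            // async path - chain onto the future
            ((CompletableFuture<T>) value).thenAccept(onValue);
        } else {
            // materialised path - consume directly, no scheduling involved
            onValue.accept((T) value);
        }
    }
}
```

Every consumer of such a value has to repeat this instanceof check, which is exactly the maintenance cost called out below.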

I have called the non-promise value a materialised object - I am not sure this is the right nomenclature.

The other key part is our async wait code. Since we can't know up front if a result object will be async or materialised, we have to use a mechanism to discover that.

In theory we could have a polymorphic list of values like CF1, object2, CF3, CF4, object5, object6, object7, say. That list would require us to async wait on 3 objects while having 4 other materialised values we don't have to wait on.

The Async.CombinedBuilder code was retrofitted with an awaitPolymorphic() (I don't love the naming). If ANY of the objects to wait for are CFs then it will truly await and return a CompletableFuture<List<Foo>>. If they are all materialised it can skip that and return a List<Foo>.

Again the consumer of that needs to check via instanceof whether it's a CF or not. If it's not, it can be consumed directly; otherwise it must be consumed as a CF chain.
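As a rough sketch (an assumed shape, not the real Async.CombinedBuilder code), awaitPolymorphic() could be:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the awaitPolymorphic() idea: only pay for CF machinery when at
// least one value in the list is actually a CompletableFuture.
public class AwaitSketch {

    /** Returns a List<Object> if every input is materialised,
     *  otherwise a CompletableFuture<List<Object>>. */
    @SuppressWarnings("unchecked")
    public static Object awaitPolymorphic(List<Object> values) {
        boolean anyAsync = values.stream().anyMatch(v -> v instanceof CompletableFuture);
        if (!anyAsync) {
            return values; // fast path: no CF allocated at all
        }
        List<CompletableFuture<Object>> futures = new ArrayList<>();
        for (Object v : values) {
            futures.add(v instanceof CompletableFuture
                    ? (CompletableFuture<Object>) v
                    : CompletableFuture.completedFuture(v));
        }
        // slow path: truly await everything and hand back a CF of the list
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<Object> out = new ArrayList<>();
                    futures.forEach(f -> out.add(f.join()));
                    return out;
                });
    }
}
```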

Just like CompletableFuture, this is viral - the method return types become Object - which is bad for clean code understanding, but not impossible to understand.

POSSIBLE IMPROVEMENTS

Right now we wrap completed values into ExecutionResult objects and then unwrap them and throw that away. I think this is because when we complete an Object type we call back to executionStrategy.execute() and this returns an ExecutionResult.

This is wasteful for scalars, enums and even lists in many ways. It does however provide a common shape to FieldValueInfo, where the value is today a CF<ExecutionResult> (or possibly an ExecutionResult in this PR).

An ExecutionResult can carry errors and data BUT we never use that, so again a waste. It would be a memory saving if we stopped wrapping values in ERs.

IMPACT

The BIG code impact is in InstrumentationContext, which I didn't change (yet):

public interface InstrumentationContext<T> {
    void onDispatched(CompletableFuture<T> result);
    void onCompleted(T result, Throwable t);
}

Nominally the field fetch context is no longer a CompletableFuture but my union of Object or CF.

DataLoader code needs this onDispatched to work BUT it does not use the value at all.

This calls for a breaking change, but how? TBD.

Summary

So there is real food for thought here.

It's faster if we don't use CFs, but it's only a little bit faster, the code is more complicated, and the InstrumentationContext is a real pain in a breaking-code sense.

There is less memory pressure if we don't allocate CFs, which is something. I didn't spend time on that memory pressure, but it's 3-4 CFs per returned field value (as we compose CFs).

@bbakerman added labels Sep 4, 2023: "breaking change" (requires a new major version to be released) and "Not to be merged" (spikes or other stuff that should never or not yet be merged)
return (FetchedValue) join;
} else {
return (FetchedValue) fetchedField;
}
Epic, you are planning some great things!

@bbakerman (Member, Author) commented Sep 4, 2023

I have done some benchmarks on my machine.

If anything... it's slower in ms/op but faster in throughput ops/s. But not by much, and not to any great degree.

Master - before change

Benchmark                                    Mode  Cnt    Score    Error  Units
BenchMark.benchMarkSimpleQueriesThroughput  thrpt   15  176.481 ± 11.509  ops/s
BenchMark.benchMarkSimpleQueriesAvgTime      avgt   15    5.426 ±  0.283  ms/op


This branch - with changes

Benchmark                                    Mode  Cnt    Score    Error  Units
BenchMark.benchMarkSimpleQueriesThroughput  thrpt   15  177.025 ± 13.248  ops/s
BenchMark.benchMarkSimpleQueriesAvgTime      avgt   15    6.161 ±  0.407  ms/op

This branch - with changes taking out the onDispatch() CF creation

BenchMark.benchMarkSimpleQueriesThroughput  thrpt   15  178.441 ± 7.796  ops/s
BenchMark.benchMarkSimpleQueriesAvgTime      avgt   15    6.119 ± 2.048  ms/op

The above benchmarks test sync-only results. There is nothing async in them at all.

I had expected it to be quicker but it's not - interesting...

Suspect

I suspect the culprit might be the completeObject / completeList methods, which jump us back into CF land.

If you have a List<Foo> source object to run over in graphql then it ends up running the list as a set of field fetches in CF land:

CompletableFuture<List<ExecutionResult>> resultsFuture = Async.each(fieldValueInfos, FieldValueInfo::getFieldValue);

So while each object in the list might be fetched as a non-CF object, it ends up being "completed" as a CF no matter what.

I think we would have to take this PR further and see if we can have a polymorphic list of CFs or materialised objects come back in the completeValue methods

Taking it further

I took this further and made completeList / completeObject all return an Object or CF. At present that does not pass all tests.

The benchmarking is better however

Benchmark                                    Mode  Cnt    Score   Error  Units
BenchMark.benchMarkSimpleQueriesThroughput  thrpt   15  178.598 ± 6.397  ops/s
BenchMark.benchMarkSimpleQueriesAvgTime      avgt   15    5.441 ± 0.220  ms/op

but not amazingly better.

if (exception != null) {
ExecutionResult executionResult = handleNonNullException(executionContext, overallResult, exception);
completeListCtx.onCompleted(executionResult, exception);
return;
@bbakerman (Member, Author)

This is the completeList code - before it did

CompletableFuture<List> resultsFuture = Async.each(fieldValueInfos, FieldValueInfo::getFieldValue);

but this now gets a bit more complicated because the items in the list can be CFs or not.

@bbakerman (Member, Author)

So we unwrap that Async.each and do our CF or Object code

@@ -662,7 +737,7 @@ protected CompletableFuture<ExecutionResult> completeValueForEnum(ExecutionConte
} catch (NonNullableFieldWasNullException e) {
return exceptionallyCompletedFuture(e);
}
@bbakerman (Member, Author)

Notice here we still use CFs - why? Because they can represent exceptions in themselves.

So if the serialisation fails, we return an exceptional CF; otherwise the materialised value.
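A sketch of that pattern (hypothetical names; the real completeValueForEnum does GraphQL coercion):

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: return the materialised value on the happy path, but
// fall back to an exceptionally completed CF when serialisation throws,
// because a plain Object cannot carry an exception the way a CF can.
public class SerialiseSketch {

    public static Object serialiseOrFail(Object raw) {
        try {
            return serialise(raw); // materialised value, no CF allocated
        } catch (RuntimeException e) {
            CompletableFuture<Object> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            return failed; // only the exceptional case pays for a CF
        }
    }

    // stand-in for a coercing/serialising step that may fail
    private static Object serialise(Object raw) {
        if (raw == null) {
            throw new IllegalArgumentException("cannot serialise null");
        }
        return raw.toString();
    }
}
```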

// Calling this from the executionContext to ensure we shift back from mutation strategy to the query strategy.

return executionContext.getQueryStrategy().execute(executionContext, newParameters);
return executionContext.getQueryStrategy().executePolymorphic(executionContext, newParameters);
@bbakerman (Member, Author)

This means that a materialised object can be completed via recursively calling back to the ES, but via a special method that allows CFs or materialised values.

@bbakerman (Member, Author)

All tests now passing

@rrva commented Sep 6, 2023

Awesome to see work in this area 👍 💯

@rrva commented Sep 6, 2023

I can provide real-world benchmarks on this branch vs master if it executes my queries

@rrva commented Sep 6, 2023

I can provide real-world benchmarks on this branch vs master if it executes my queries

Heh, "real-world" and benchmark in the same sentence. What I meant: I can run our load tests on it which involve real queries and provide some comparative indications

@bbakerman (Member, Author)

I can provide real-world benchmarks on this branch vs master if it executes my queries

Heh, "real-world" and benchmark in the same sentence. What I meant: I can run our load tests on it which involve real queries and provide some comparative indications

Please do - we would love to see this in a more real world situation


@bbakerman (Member, Author)

I would like to change this PR again to remove the ExecutionResult wrapping that happens. I suspect this was caused by the old reentrant call back to the executionStrategy when resolving an object, but this PR changes that by having a special method for the callback that knows it's inside the gate.

I think this would also reduce memory pressure by not allocating a wrapper object for enums, scalars and so on - that is, one for every value.

@rrva commented Oct 6, 2023

I can provide real-world benchmarks on this branch vs master if it executes my queries

Heh, "real-world" and benchmark in the same sentence. What I meant: I can run our load tests on it which involve real queries and provide some comparative indications

Please do - we would love to see this in a more real world situation

We measured this branch using some load tests involving real queries and compared this to 21.1. With a quick load test we typically use (10 mins), I am afraid we also do not see any notable improvements in throughput or latency up to P99. However, fewer CompletableFutures are created, as expected (about a third). GC pauses go down.

graphql-java 21.1 baseline

memory allocation distribution [image]

GC pressure and GC pauses [images]

this branch

memory allocation distribution [image]

GC pressure and GC pauses [images]

@rrva commented Oct 6, 2023

I would need to spend more time load testing or measure latency up to P99.9 to really see a difference in throughput/latency. GC is coping right now. But we do see differences in the GC metrics.

@rrva commented Oct 6, 2023

I would need to spend more time load testing or measure latency up to P99.9 to really see a difference in throughput/latency. GC is coping right now. But we do see differences in the GC metrics.

These tests were on Java 21, 16Gb heap, half of which can be used for query processing and G1 GC.

@bbakerman (Member, Author)

Wow thanks so much for these numbers. This really helps put some perspective on the work.

Reducing memory pressure is always a good thing however I think most people would trade more memory usage for less latency and in this case it's just not there.

This code is problematic in the sense of code maintenance. It's entering into a scripting-language world where an Object value can actually be a CompletableFuture<Foo> | Foo and needs to be checked with instanceof whenever used. In fact this is how the graphql-js reference code base does its work: it returns a union type that needs to be checked when used.

It also breaks Instrumentation quite a lot because nominally they are

public interface InstrumentationContext<T> {
    void onDispatched(CompletableFuture<T> result);
    void onCompleted(T result, Throwable t);
}

So in theory one could "do work" on the passed-in CF during onDispatched, but this code doesn't want to give them out, and it defeats the purpose to allocate one for instrumentation when we took such efforts to avoid them in the engine.

The Instrumentation by design of course always locks in your implementation so maybe that's just par for the course.

I am more bullish on #3330 as a way to reduce memory pressure, because there truly is a completely redundant ExecutionResult wrapper on each value. I see that breaking change as more achievable, BUT the combination of both would reduce memory pressure.

@github-actions (Bot) commented Dec 7, 2023

Hello, this pull request has been inactive for 60 days, so we're marking it as stale. If you would like to continue working on this pull request, please make an update within the next 30 days, or we'll close the pull request.

@github-actions (Bot) added the Stale label Dec 7, 2023
@samuelAndalon

@bbakerman any chance we could release a snapshot version of this branch? I am very interested in trying this in graphql-kotlin. We have graphql servers with a high load of traffic plus very large queries. We recently turned off Datadog automatic instrumentation as it was creating spans for every single field, and for very large queries it was creating a lot of objects, causing the minor collection time of the GC to take a lot. After turning off the DD instrumentation we saw this improvement:

[image]

I suspect there will be even more improvement by avoiding creating CFs for fields that are already in memory after being synchronously resolved.

@bbakerman (Member, Author)

any chance we could release a snapshot version of this branch?

I am asking the team if we are able to do this.

@bbakerman (Member, Author)

@samuelAndalon

I have kicked off a custom release : https://github.com/graphql-java/graphql-java/actions/runs/7927320202

The artifact will be called 0.0.0-2024-02-16T00-00-00-support_objects_or_cfs


@samuelAndalon commented Feb 16, 2024

Oh right - so I tried the snapshot and was able to see that now we are not creating CFs for values that are already in memory and can be synchronously resolved. The problem here, as properly mentioned in the code comments, is the InstrumentationContext interface, which will cause issues with the other hooks and some of the default graphql-java instrumentations, like batching.

What I would like to propose is to create a new InstrumentationContext that will be used for beginField, beginFieldFetch, beginFieldComplete and beginFieldListComplete. All of these can use and share a custom interface, just like ExecutionStrategyInstrumentationContext. This is going to be problematic, yes; it is going to be a breaking change, yes; but I absolutely feel that the performance of the engine will improve.

In terms of whether the instrumentations are going to break, I don't think they will; the batching instrumentation only needs to know if a field was dispatched, it actually does not use the value.

@bbakerman (Member, Author)

Yup this code is very much a Proof of Concept (POC)

And yes, I think the InstrumentationContext should be changed anyway to not send in the CF to be completed. Semantically the same thing happens at onCompleted.

public interface InstrumentationContext<T> {

    void onDispatched(CompletableFuture<T> result);

    void onCompleted(T result, Throwable t);

}

could become

public interface InstrumentationContext<T> {

    void onDispatched();

    void onCompleted(T result, Throwable t);

}

In theory, if you were composing on that CF during onDispatched, you could do the same via onCompleted.
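For example, a timing instrumentation (a hypothetical one, purely for illustration) could do the same work across the two callbacks of the proposed interface without ever touching a CF:

```java
// Hypothetical timing instrumentation showing that work once chained onto the
// CF passed to onDispatched can instead be done in onCompleted, which fires
// with the final value (or throwable) anyway.
public class TimingContext<T> {
    private long startNanos;
    private long elapsedNanos = -1;

    public void onDispatched() {
        startNanos = System.nanoTime(); // mark dispatch without needing a CF
    }

    public void onCompleted(T result, Throwable t) {
        elapsedNanos = System.nanoTime() - startNanos;
    }

    public long elapsedNanos() {
        return elapsedNanos;
    }
}
```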

@dondonz (Member) commented Apr 16, 2024

Merged into master in separate PR

@dondonz dondonz closed this Apr 16, 2024