java-dataloader

This small and simple utility library is a pure Java 8 port of Facebook DataLoader.

It can serve as an integral part of your application's data layer to provide a consistent API over various back-ends and reduce message communication overhead through batching and caching.

An important use case for java-dataloader is improving the efficiency of GraphQL query execution. GraphQL fields are resolved independently of each other, and with a true graph of objects you may end up fetching the same object many times.

A naive implementation of graphql data fetchers can easily lead to the dreaded "n+1" fetch problem.

Most of the code is ported directly from Facebook's reference implementation, with one IMPORTANT adaptation to make it work for Java 8 (more on this below).

But before reading on, be sure to take a short dive into the original documentation provided by Lee Byron (@leebyron) and Nicholas Schrock (@schrockn) from Facebook, the creators of the original data loader.

Features

java-dataloader is a feature-complete port of the Facebook reference implementation with one major difference. These features are:

  • Simple, intuitive API, using generics and fluent coding
  • Define batch load function with lambda expression
  • Schedule a load request in queue for batching
  • Add load requests from anywhere in code
  • Request returns a CompletableFuture<V> of the requested value
  • Can create multiple requests at once
  • Caches load requests, so data is only fetched once
  • Can clear individual cache keys, so data is re-fetched on next batch queue dispatch
  • Can prime the cache with key/values, to avoid data being fetched needlessly
  • Can configure cache key function with lambda expression to extract cache key from complex data loader key types
  • Individual batch futures complete / resolve as batch is processed
  • Results are ordered according to insertion order of load requests
  • Deals with partial errors when a batch future fails
  • Can disable batching and/or caching in configuration
  • Can supply your own CacheMap<K, V> implementations
  • Has very high test coverage (see Acknowledgements)

Examples

A DataLoader object requires a BatchLoader function that is responsible for loading a promise of values given a list of keys:

        BatchLoader<Long, User> userBatchLoader = new BatchLoader<Long, User>() {
            @Override
            public CompletionStage<List<User>> load(List<Long> userIds) {
                return CompletableFuture.supplyAsync(() -> {
                    return userManager.loadUsersById(userIds);
                });
            }
        };

        DataLoader<Long, User> userLoader = new DataLoader<>(userBatchLoader);

You can then use it to load values, which are returned as CompletableFuture promises to values:

        CompletableFuture<User> load1 = userLoader.load(1L);

or you can use it to compose future computations as follows. The key requirement is that you call dataloader.dispatch() or its variant dataloader.dispatchAndJoin() at some point in order to make the underlying calls to the batch loader happen.

In this version of data loader, this does not happen automatically. More on this in Manual dispatching.

            userLoader.load(1L)
                    .thenAccept(user -> {
                        System.out.println("user = " + user);
                        userLoader.load(user.getInvitedByID())
                                .thenAccept(invitedBy -> {
                                    System.out.println("invitedBy = " + invitedBy);
                                });
                    });
    
            userLoader.load(2L)
                    .thenAccept(user -> {
                        System.out.println("user = " + user);
                        userLoader.load(user.getInvitedByID())
                                .thenAccept(invitedBy -> {
                                    System.out.println("invitedBy = " + invitedBy);
                                });
                    });
    
            userLoader.dispatchAndJoin();

As stated in the original Facebook project:

A naive application may have issued four round-trips to a backend for the required information, but with DataLoader this application will make at most two.

DataLoader allows you to decouple unrelated parts of your application without sacrificing the performance of batch data-loading. While the loader presents an API that loads individual values, all concurrent requests will be coalesced and presented to your batch loading function. This allows your application to safely distribute data fetching requirements throughout your application and maintain minimal outgoing data requests.

In the example above, the first call to dispatch will cause the batched user keys (1 and 2) to be fired at the BatchLoader function to load 2 users.

Since each thenAccept callback made more calls to userLoader to get the "user who invited them", another 2 user keys are passed to the BatchLoader function.

In this case userLoader.dispatchAndJoin() is used to make a dispatch call, wait for it (aka join it), check whether the data loader has more batched entries (which it does), and repeat until the data loader's internal queue of keys is empty. At this point we have made 2 batched calls instead of the naive 4 calls we might have made if we did not "batch" the calls to load data.
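
The dispatch loop described above can be illustrated with a deliberately tiny stand-in for a data loader. The ToyLoader class below is hypothetical and is not the library's implementation; it only queues keys, hands them to a batch function on dispatch(), and repeats until the queue is empty, counting batches along the way. The rule that user n was invited by user n + 10 is invented for the demo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical stand-in for a data loader: queue keys, batch them on dispatch.
class ToyLoader<K, V> {
    private final Function<List<K>, List<V>> batchFunction;
    private final Map<K, CompletableFuture<V>> cache = new LinkedHashMap<>();
    private final List<K> queue = new ArrayList<>();
    int batchCalls = 0;

    ToyLoader(Function<List<K>, List<V>> batchFunction) {
        this.batchFunction = batchFunction;
    }

    // Returns a cached future, or queues the key for the next dispatch.
    CompletableFuture<V> load(K key) {
        CompletableFuture<V> existing = cache.get(key);
        if (existing != null) {
            return existing;
        }
        CompletableFuture<V> future = new CompletableFuture<>();
        cache.put(key, future);
        queue.add(key);
        return future;
    }

    // Sends all currently queued keys to the batch function in one call.
    void dispatch() {
        if (queue.isEmpty()) {
            return;
        }
        List<K> keys = new ArrayList<>(queue);
        queue.clear();
        batchCalls++;
        List<V> values = batchFunction.apply(keys);
        for (int i = 0; i < keys.size(); i++) {
            // Completing a future may run callbacks that queue further keys.
            cache.get(keys.get(i)).complete(values.get(i));
        }
    }

    // Keeps dispatching until callbacks stop queueing new keys.
    void dispatchAndJoin() {
        while (!queue.isEmpty()) {
            dispatch();
        }
    }
}

class ToyLoaderDemo {
    static int run() {
        // Each "user" is just its id; user n was invited by user n + 10 (made up).
        ToyLoader<Long, Long> loader = new ToyLoader<Long, Long>(ids -> new ArrayList<>(ids));
        loader.load(1L).thenAccept(user -> loader.load(user + 10));
        loader.load(2L).thenAccept(user -> loader.load(user + 10));
        loader.dispatchAndJoin();
        return loader.batchCalls;
    }

    public static void main(String[] args) {
        System.out.println("batch calls = " + run());
    }
}
```

Four keys are requested in total, but the batch function is invoked only twice: once for keys 1 and 2, and once for the two keys queued by the callbacks.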

Batching requires batched backing APIs

You will notice in our BatchLoader example that the backing service had the ability to get a list of users given a list of user ids in one call.

            public CompletionStage<List<User>> load(List<Long> userIds) {
                return CompletableFuture.supplyAsync(() -> {
                    return userManager.loadUsersById(userIds);
                });
            }

This is an important consideration. By using dataloader you have batched up the requests for N keys into a list of keys that can be retrieved in one go.

If you don't have batched backing services, then you can't be as efficient as possible, as you will have to make N calls, one for each key.

       BatchLoader<Long, User> lessEfficientUserBatchLoader = new BatchLoader<Long, User>() {
           @Override
           public CompletionStage<List<User>> load(List<Long> userIds) {
               return CompletableFuture.supplyAsync(() -> {
                   //
                   // notice how it makes N calls to load by single user id out of the batch of N keys
                   //
                   return userIds.stream()
                           .map(id -> userManager.loadUserById(id))
                           .collect(Collectors.toList());
               });
           }
       };

That said, with key caching turned on (the default), using dataloader will still be more efficient than not using it.

Using dataloader in graphql for maximum efficiency

If you are using graphql, you are likely to be making queries on a graph of data (surprise surprise). dataloader will help you make this a more efficient process by both caching and batching requests for that graph of data items. If dataloader has seen a data item before, it will have cached the value and will return it without having to ask for it again.

Imagine we have the StarWars query outlined below. It asks us to find a hero, their friends' names, and their friends' friends' names. It is likely that many of these people will have friends in common.

    {
        hero {
            name 
            friends {
                name
                friends {
                   name
                } 
            }
        }
    }

The result of this query is displayed below. You can see that Han, Leia, Luke and R2-D2 are a tight-knit bunch of friends and share many friends in common.

    [hero: [name: 'R2-D2', friends: [
            [name: 'Luke Skywalker', friends: [
                    [name: 'Han Solo'], [name: 'Leia Organa'], [name: 'C-3PO'], [name: 'R2-D2']]],
            [name: 'Han Solo', friends: [
                    [name: 'Luke Skywalker'], [name: 'Leia Organa'], [name: 'R2-D2']]],
            [name: 'Leia Organa', friends: [
                    [name: 'Luke Skywalker'], [name: 'Han Solo'], [name: 'C-3PO'], [name: 'R2-D2']]]]]
    ]

A naive implementation would call a DataFetcher to retrieve a person object every time it was invoked.

In this case that would be 15 calls over the network, even though the group of people have a lot of friends in common. With dataloader you can make the graphql query much more efficient.

As graphql descends each level of the query (e.g. as it processes hero, then friends, and then each of their friends), the data loader is called to "promise" to deliver a person object. At each level dataloader.dispatch() will be called to fire off the batch requests for that part of the query. With caching turned on (the default), any previously returned person will be returned as is, at no cost.

In the above example only 5 unique people are mentioned, and with caching and batched retrieval in place there will be only 3 calls to the batch loader function. Three calls over the network or to a database is much better than 15, you will agree.
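
The arithmetic behind those numbers can be checked with a small simulation. This is a sketch that only counts calls; it does not use the library, and it assumes one dispatch per query level and a cache that is shared across levels, as described above. The friend lists are copied from the query result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FriendsCallCount {
    // Friend lists copied from the query result above.
    static final Map<String, List<String>> FRIENDS = new HashMap<>();
    static {
        FRIENDS.put("R2-D2", Arrays.asList("Luke Skywalker", "Han Solo", "Leia Organa"));
        FRIENDS.put("Luke Skywalker", Arrays.asList("Han Solo", "Leia Organa", "C-3PO", "R2-D2"));
        FRIENDS.put("Han Solo", Arrays.asList("Luke Skywalker", "Leia Organa", "R2-D2"));
        FRIENDS.put("Leia Organa", Arrays.asList("Luke Skywalker", "Han Solo", "C-3PO", "R2-D2"));
    }

    // Returns {naive single fetches, batched calls, unique people fetched}.
    static int[] count() {
        int naiveCalls = 0;
        int batchCalls = 0;
        Set<String> cached = new HashSet<>();
        List<String> level = Collections.singletonList("R2-D2"); // the hero
        for (int depth = 0; depth < 3; depth++) {
            naiveCalls += level.size();           // naive: one fetch per requested person
            Set<String> toLoad = new LinkedHashSet<>(level);
            toLoad.removeAll(cached);             // cache hits cost nothing
            if (!toLoad.isEmpty()) {
                batchCalls++;                     // one batch call per query level
                cached.addAll(toLoad);
            }
            List<String> next = new ArrayList<>();
            for (String person : level) {
                next.addAll(FRIENDS.getOrDefault(person, Collections.<String>emptyList()));
            }
            level = next;
        }
        return new int[] {naiveCalls, batchCalls, cached.size()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(count())); // prints [15, 3, 5]
    }
}
```

The third batch contains only C-3PO, because every other person requested at that level was already loaded at an earlier level.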

If you use capabilities like java.util.concurrent.CompletableFuture.supplyAsync(), you can make it even more efficient by making the remote calls asynchronous to the rest of the query. This makes the query more timely, since multiple calls can happen at once if need be.
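
As a minimal JDK-only illustration of that point, the sketch below starts two independent calls before waiting on either of them. slowFetch is a hypothetical stand-in for a remote batch call, and the names "characters" and "starships" are made up for the example.

```java
import java.util.concurrent.CompletableFuture;

class AsyncCallsSketch {
    // Hypothetical stand-in for a slow remote batch call.
    static String slowFetch(String name) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return name + "-data";
    }

    static String fetchBoth() {
        // Both calls are started before either result is needed.
        CompletableFuture<String> characters =
                CompletableFuture.supplyAsync(() -> slowFetch("characters"));
        CompletableFuture<String> starships =
                CompletableFuture.supplyAsync(() -> slowFetch("starships"));
        // Combine the two results once both futures have completed.
        return characters.thenCombine(starships, (c, s) -> c + "," + s).join();
    }

    public static void main(String[] args) {
        System.out.println(fetchBoth());
    }
}
```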

Here is how you might put this in place:

       // a batch loader function that will be called with one or more keys for batch loading
       BatchLoader<String, Object> characterBatchLoader = new BatchLoader<String, Object>() {
           @Override
           public CompletionStage<List<Object>> load(List<String> keys) {
               //
               // we use supplyAsync() of values here for maximum parallelisation
               //
               return CompletableFuture.supplyAsync(() -> getCharacterDataViaBatchHTTPApi(keys));
           }
       };

       // a data loader for characters that points to the character batch loader
       DataLoader<String, Object> characterDataLoader = new DataLoader<>(characterBatchLoader);

       //
       // use this data loader in the data fetchers associated with characters and put them into
       // the graphql schema (not shown)
       //
       DataFetcher heroDataFetcher = new DataFetcher() {
           @Override
           public Object get(DataFetchingEnvironment environment) {
               return characterDataLoader.load("2001"); // R2D2
           }
       };

       DataFetcher friendsDataFetcher = new DataFetcher() {
           @Override
           public Object get(DataFetchingEnvironment environment) {
               StarWarsCharacter starWarsCharacter = environment.getSource();
               List<String> friendIds = starWarsCharacter.getFriendIds();
               return characterDataLoader.loadMany(friendIds);
           }
       };

       //
       // DataLoaderRegistry is a place to register all data loaders that need to be dispatched together;
       // in this case there is 1 but you can have many
       //
       DataLoaderRegistry registry = new DataLoaderRegistry();
       registry.register("character", characterDataLoader);

       //
       // this instrumentation implementation will dispatch all the dataloaders
       // as each level of the graphql query is executed and hence make batched objects
       // available to the query and the associated DataFetchers
       //
       DataLoaderDispatcherInstrumentation dispatcherInstrumentation
               = new DataLoaderDispatcherInstrumentation(registry);

       //
       // now build your graphql object and execute queries on it.
       // the data loader will be invoked via the data fetchers on the
       // schema fields
       //
       GraphQL graphQL = GraphQL.newGraphQL(buildSchema())
               .instrumentation(dispatcherInstrumentation)
               .build();

One thing to note is that the above only works if you use DataLoaderDispatcherInstrumentation, which makes sure dataLoader.dispatch() is called. If this were not in place, the promises to data would never be dispatched to the batch loader function and hence nothing would ever resolve.

See below for more details on dataLoader.dispatch().

Error object is not a thing in a type safe Java world

In the reference JS implementation, if the batch loader returns an Error object for a key, the load() promise for that key is rejected with that error. This allows fine-grained (per object in the list) error handling. If I ask for keys A, B and C and B errors out, the promise for B can contain a specific error.

This is not quite as loose in a Java implementation, as Java is a type-safe language.

A batch loader function is defined as BatchLoader<K, V>, meaning that for a key of type K it returns a value of type V.

It can't just return some Exception as an object of type V. Type safety matters.

However, you can use the Try data type, which can encapsulate a computation that either succeeded or threw an exception.

        Try<String> tryS = Try.tryCall(() -> {
            if (rollDice()) {
                return "OK";
            } else {
                throw new RuntimeException("Bang");
            }
        });

        if (tryS.isSuccess()) {
            System.out.println("It worked " + tryS.get());
        } else {
            System.out.println("It failed with exception: " + tryS.getThrowable());
        }

DataLoader supports this type and you can use this form to create a batch loader that returns a list of Try objects, some of which may have succeeded and some of which may have failed. From that data loader can infer the right behavior in terms of the load(x) promise.

        DataLoader<String, User> dataLoader = DataLoader.newDataLoaderWithTry(new BatchLoader<String, Try<User>>() {
            @Override
            public CompletionStage<List<Try<User>>> load(List<String> keys) {
                return CompletableFuture.supplyAsync(() -> {
                    List<Try<User>> users = new ArrayList<>();
                    for (String key : keys) {
                        Try<User> userTry = loadUser(key);
                        users.add(userTry);
                    }
                    return users;
                });
            }
        });

In the above example, if one of the Try objects represents a failure, then its load() promise will complete exceptionally and you can react to that in a type-safe manner.
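
To make the per-key behavior concrete, here is a JDK-only sketch of how a list of success-or-failure results could be turned into one future per key. Outcome is a hypothetical stand-in for a Try-like type, not the library's Try class, and toFutures is an illustration of the idea rather than the library's internal code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for a Try-like result: either a value or a throwable.
class Outcome<V> {
    final V value;
    final Throwable error;

    private Outcome(V value, Throwable error) {
        this.value = value;
        this.error = error;
    }

    static <V> Outcome<V> succeeded(V value) {
        return new Outcome<>(value, null);
    }

    static <V> Outcome<V> failed(Throwable error) {
        return new Outcome<>(null, error);
    }
}

class PartialFailureSketch {
    // Completes one future per outcome, normally or exceptionally, in list order.
    static <V> List<CompletableFuture<V>> toFutures(List<Outcome<V>> outcomes) {
        List<CompletableFuture<V>> futures = new ArrayList<>();
        for (Outcome<V> outcome : outcomes) {
            CompletableFuture<V> future = new CompletableFuture<>();
            if (outcome.error != null) {
                future.completeExceptionally(outcome.error);
            } else {
                future.complete(outcome.value);
            }
            futures.add(future);
        }
        return futures;
    }

    // Batch result for keys A, B and C, where only B failed.
    static List<String> describe() {
        List<Outcome<String>> batch = new ArrayList<>();
        batch.add(Outcome.succeeded("A"));
        batch.add(Outcome.failed(new IllegalStateException("B is broken")));
        batch.add(Outcome.succeeded("C"));

        List<String> report = new ArrayList<>();
        for (CompletableFuture<String> future : toFutures(batch)) {
            report.add(future.isCompletedExceptionally() ? "failed" : future.join());
        }
        return report;
    }

    public static void main(String[] args) {
        System.out.println(describe()); // prints [A, failed, C]
    }
}
```

The failure for B does not affect the futures for A and C, which is the fine-grained behavior the reference implementation achieves with Error objects.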

Disabling caching

In certain uncommon cases, a DataLoader which does not cache may be desirable.

    new DataLoader<String, User>(userBatchLoader, DataLoaderOptions.newOptions().setCachingEnabled(false));

Calling the above will ensure that every call to .load() will produce a new promise, and requested keys will not be saved in memory.

However, when the memoization cache is disabled, your batch function will receive a list of keys which may contain duplicates! Each key will be associated with each call to .load(). Your batch loader should provide a value for each instance of the requested key, as per the contract:

        userDataLoader.load("A");
        userDataLoader.load("B");
        userDataLoader.load("A");

        userDataLoader.dispatch();

        // will result in keys to the batch loader with [ "A", "B", "A" ]
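
One way a batch function could honor that contract without querying the back-end twice for the same key is to look up the distinct keys once and then expand the result back to the requested order. The sketch below is JDK-only; fetchDistinct and the "user-" values are hypothetical stand-ins for a real back-end call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

class DuplicateKeysSketch {
    static int backendLookups = 0;

    // Hypothetical back-end call: one lookup per key it is given.
    static Map<String, String> fetchDistinct(Iterable<String> keys) {
        Map<String, String> found = new HashMap<>();
        for (String key : keys) {
            backendLookups++;
            found.put(key, "user-" + key);
        }
        return found;
    }

    // Returns one value per requested key, duplicates included, in request order.
    static List<String> loadAll(List<String> keys) {
        Map<String, String> byKey = fetchDistinct(new LinkedHashSet<>(keys));
        List<String> values = new ArrayList<>();
        for (String key : keys) {
            values.add(byKey.get(key));
        }
        return values;
    }
}
```

Called with the keys [ "A", "B", "A" ], loadAll returns three values in the same order while performing only two back-end lookups.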

More complex cache behavior can be achieved by calling .clear() or .clearAll() rather than disabling the cache completely.

Caching errors

If a batch load fails (that is, a batch function returns a rejected CompletionStage), then the requested values will not be cached. However, if a batch function returns a Try or Throwable instance for an individual value, then that will be cached to avoid frequently loading the same problem object.

In some circumstances you may wish to clear the cache for these individual problems:

        userDataLoader.load("r2d2").whenComplete((user, throwable) -> {
            if (throwable != null) {
                userDataLoader.clear("r2d2");
                throwable.printStackTrace();
            } else {
                processUser(user);
            }
        });

The scope of a data loader is important

If you are serving web requests then the data can be specific to the user requesting it. If you have user-specific data then you will not want to cache data meant for user A and later give it to user B in a subsequent request.

The scope of your DataLoader instances is important. You might want to create them per web request to ensure data is only cached within that web request and no more.

If your data can be shared across web requests, then you might want to scope your data loaders so that they survive longer than a single web request.

Custom caches

The default cache behind DataLoader is an in-memory HashMap. There is no expiry on this and it lives for as long as the data loader lives.

However, you can create your own custom cache and supply it to the data loader on construction via the org.dataloader.CacheMap interface.

        MyCustomCache customCache = new MyCustomCache();
        DataLoaderOptions options = DataLoaderOptions.newOptions().setCacheMap(customCache);
        new DataLoader<String, User>(userBatchLoader, options);

You could choose to use one of the fancy cache implementations from Guava or Caffeine and wrap it in a CacheMap wrapper ready for data loader. They can do fancy things like time-based eviction and efficient LRU caching.
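
If you only need simple LRU eviction, the JDK already provides the building block. The sketch below shows a small LRU map based on LinkedHashMap; LruMap is a hypothetical name, and the CacheMap wrapper that would delegate to it is deliberately not shown, since it depends on the interface methods of the library version you use. Note that LinkedHashMap is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A small LRU map: access order moves recently used keys to the end.
class LruMap<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruMap(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; evicts the eldest entry when over capacity.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

class LruMapDemo {
    static boolean demo() {
        LruMap<String, String> cache = new LruMap<>(2);
        cache.put("A", "user-A");
        cache.put("B", "user-B");
        cache.get("A");           // A is now the most recently used entry
        cache.put("C", "user-C"); // evicts B, the least recently used entry
        return cache.containsKey("A") && !cache.containsKey("B") && cache.containsKey("C");
    }
}
```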

Manual dispatching

The original Facebook DataLoader was written in JavaScript for NodeJS. NodeJS is single-threaded in nature, but simulates asynchronous logic by queueing functions and invoking them from an event loop, as explained in this post on StackOverflow.

NodeJS generates so-called 'ticks' in which queued functions are dispatched for execution, and Facebook DataLoader uses the nextTick() function in NodeJS to automatically dequeue load requests and send them to the batch execution function for processing.

And here there is an IMPORTANT DIFFERENCE compared to how java-dataloader operates!!

In NodeJS the batch preparation will not affect the asynchronous processing behaviour in any way. It will just prepare batches in 'spare time' as it were.

This is different in Java, as you will actually delay the execution of your load requests until the moment you make a call to dataLoader.dispatch().

Does this make Java DataLoader any less useful than the reference implementation? We would argue this is not the case, and there are also gains to this different mode of operation:

  • In contrast to the NodeJS implementation, you as a developer are in full control of when batches are dispatched
  • You can attach any logic that determines when a dispatch takes place
  • You still retain all other features, full caching support and batching (e.g. to optimize message bus traffic, GraphQL query execution time, etc.)

However, with batch execution control comes responsibility! If you forget to make the call to dispatch() then the futures in the load request queue will never be batched, and thus will never complete! So be careful when crafting your loader designs.
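
As one example of attaching your own dispatch logic, the sketch below triggers a dispatch once a number of loads have been queued, with an explicit flush() so that leftover loads are never stranded. ThresholdDispatcher is a hypothetical helper, not part of the library; the Runnable passed in is assumed to wrap something like a call to dataLoader.dispatch().

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: triggers a dispatch once enough loads have been queued.
class ThresholdDispatcher {
    private final int threshold;
    private final Runnable dispatch;
    private final AtomicInteger pending = new AtomicInteger();

    ThresholdDispatcher(int threshold, Runnable dispatch) {
        this.threshold = threshold;
        this.dispatch = dispatch;
    }

    // Call after each load(); dispatches when the threshold is reached.
    void loadQueued() {
        if (pending.incrementAndGet() >= threshold) {
            flush();
        }
    }

    // Call at the end of a unit of work so that no queued load is left behind.
    void flush() {
        if (pending.getAndSet(0) > 0) {
            dispatch.run();
        }
    }
}
```

With a threshold of 3, seven queued loads trigger two dispatches, and a final flush() dispatches the one remaining load. Forgetting that final flush() is exactly the mistake described above: the last future would never complete.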

Let's get started!

Installing

Gradle users configure the java-dataloader dependency in build.gradle:

repositories {
    jcenter()
}

dependencies {
    compile 'com.graphql-java:java-dataloader:1.0.2'
}

Building

To build from source use the Gradle wrapper:

./gradlew clean build

Other information sources

Contributing

All your feedback and help to improve this project is very welcome. Please create issues for your bugs, ideas and enhancement requests, or better yet, contribute directly by creating a PR.

When reporting an issue, please add detailed instructions and, if possible, a code snippet or test that can be used to reproduce your problem.

When creating a pull request, please adhere to the current coding style where possible, and create tests with your code so that the project keeps its excellent level of test coverage. PRs without tests may not be accepted unless they only deal with minor changes.

Acknowledgements

This library was originally written for use within a VertX world and it used the vertx-core Future classes to implement itself. All the heavy lifting has been done by this project: vertx-dataloader, including the extensive testing (which itself came from Facebook).

This particular port was done to reduce the dependency on Vertx and to write a pure Java 8 implementation with no dependencies and also to use the more normative Java CompletableFuture.

vertx-core is not a lightweight library by any means so having a pure Java 8 implementation is very desirable.

This library is entirely inspired by the great works of Lee Byron and Nicholas Schrock from Facebook whom we would like to thank, and especially @leebyron for taking the time and effort to provide 100% coverage on the codebase. The original set of tests were also ported.

Licensing

This project is licensed under the Apache License, Version 2.0.

Copyright © 2016 Arnold Schrijver, 2017 Brad Baker and other contributors