Sunday, November 8, 2015

Node.js handling results from multiple async functions

I've been using node.js regularly for a couple of months now, and have just started trying to answer questions about it on Stack Overflow.  I frequently see questions about handling multiple asynchronous requests, i.e. starting several async requests in response to some sort of event, waiting until all of them have finished, and then performing some action with their results.  This blog post explains a couple of strategies for dealing with this.

The problem:

For this post I'm going to assume we have a node.js express HTTP server.  It has a single route registered.  The route makes a series of external API calls, waits for their responses, then processes all the data and sends a response to the user.

Naive 'synchronous'-looking approach:

The goal is to get the result of each API call into the `results` array and then aggregate the results.  A naive implementation that does NOT work is:
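Here is a minimal sketch of what such a naive handler might look like. To keep it runnable without installing anything, `fakeRequest` is a hypothetical stand-in for `request` (it calls back asynchronously with the same `(error, response, body)` arguments), `aggregate` is a made-up aggregation step, the URLs are placeholders, and `res` is a bare stub rather than a real express response object.

```javascript
// Hypothetical stand-in for the `request` library: calls back
// asynchronously with (error, response, body).
function fakeRequest(url, callback) {
  setTimeout(() => callback(null, { statusCode: 200 }, 'data from ' + url), 10);
}

// Hypothetical aggregation step: simply joins the response bodies.
function aggregate(results) {
  return results.join(', ');
}

// Shaped like an express route handler, e.g. app.get('/', naiveHandler).
function naiveHandler(req, res) {
  const apiUrls = ['/api/a', '/api/b', '/api/c']; // placeholder URLs
  const results = [];

  apiUrls.forEach((url) => {
    // This only registers the callback; the callback itself runs later.
    fakeRequest(url, (err, response, body) => {
      results.push(body);
    });
  });

  // BUG: this runs before any callback has fired, so `results` is still [].
  res.send(aggregate(results));
}

// Stub response object so the sketch can be run directly with node.
const sent = [];
naiveHandler({}, { send: (body) => sent.push(body) });
console.log(JSON.stringify(sent)); // prints [""]
```

The client receives an empty aggregate because `res.send` is reached before a single response has arrived.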

Why isn't this correct?  The above code sends off 3 requests to the API endpoints, saves their results, performs the aggregation and then sends the response to the client.  If this code were executed synchronously it would work as expected, but since `request` is asynchronous, `results` will be empty when it is aggregated.  Let's trace the code path the way it reads on the page.

1. All URLs are initialized in apiUrls.
2. The results array is initialized; it will hold all data returned by the API calls.
3. apiUrls is iterated over: a request is made to each URL and a callback is registered to handle the response. The callback adds the response to the results array.
4. The results are aggregated.
5. The aggregated results are sent to the client.

Since request() is asynchronous, there is no guarantee of when the callback will be called.  This is the difficulty of asynchronous programming, and of node in general.  If the code executed the way it reads, it would be correct, but because it is asynchronous, we have absolutely no idea when or even IF the call to request() will finish.  (There are certainly ways to guarantee that the call will finish, such as using a timeout.)  The program will actually execute like this:

1. All URLs are initialized in apiUrls.
2. The results array is initialized; it will hold all data returned by the API calls (results = []).
3. apiUrls is iterated over: a request is made to each URL and a callback is registered to handle the response. The callback will add the response to the results array once it runs (results = []).
4. results is still equal to [] because no data has been retrieved! The requests were only SENT, and a function was registered to handle the responses WHEN THEY OCCUR, which could be any time in the future!
5. The results are aggregated (still an empty list).
6. The aggregated results are sent to the client.
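The execution order above can be observed directly with a small sketch that records each step in a log. As before, `fakeRequest` is a hypothetical stand-in for `request` that responds after a short delay, and the URLs are placeholders.

```javascript
// Log of what happened, in the order it happened.
const events = [];

// Hypothetical stand-in for `request`: responds asynchronously.
function fakeRequest(url, callback) {
  setTimeout(() => callback(null, { statusCode: 200 }, 'body of ' + url), 10);
}

const results = [];
['/a', '/b'].forEach((url) => {
  events.push('sent ' + url);
  fakeRequest(url, (err, response, body) => {
    results.push(body);
    events.push('received ' + url);
  });
});

// Reached immediately after the loop, before any response has arrived.
events.push('aggregated ' + results.length + ' results');

// Print the log once the simulated responses have had time to arrive.
setTimeout(() => console.log(events), 50);
```

The log shows both "sent" entries first, then "aggregated 0 results", and only afterwards the two "received" entries.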

Keeping track of the responses:

To be correct the program needs to aggregate the results only AFTER all requests have been made and returned.  That means the program needs to keep track of how many requests are going to be made and how many requests have been completed. When all expected requests have been completed, THEN the results should be aggregated and the response should be sent to the client.

A correct implementation requires that the program keep track of how many responses have been received, which makes it significantly more complicated.  Only when all responses have been received are the results aggregated and the response sent to the client.  While this program handles the case of all requests succeeding, it is extremely deficient in its handling of errors.  Should the results be aggregated if one of the requests times out, or if the API server returns an error?
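A minimal sketch of the counting approach could look like the following. The names are assumptions for illustration: `fakeRequest` is a hypothetical stand-in for `request`, `aggregate` is a made-up aggregation step, and `res` is a stub instead of a real express response. Storing each body by index (rather than pushing) is my own small choice so the results stay in request order. Errors are deliberately ignored here, which is exactly the deficiency described above.

```javascript
// Hypothetical stand-in for the `request` library.
function fakeRequest(url, callback) {
  setTimeout(() => callback(null, { statusCode: 200 }, 'data from ' + url), 10);
}

// Hypothetical aggregation step: simply joins the response bodies.
function aggregate(results) {
  return results.join(', ');
}

// Shaped like an express route handler.
function countingHandler(req, res) {
  const apiUrls = ['/api/a', '/api/b', '/api/c']; // placeholder URLs
  const results = [];
  let completed = 0; // number of responses received so far

  apiUrls.forEach((url, index) => {
    fakeRequest(url, (err, response, body) => {
      // No error handling: `err` is ignored in this sketch.
      results[index] = body;
      completed += 1;

      // The last callback to run triggers aggregation and the response.
      if (completed === apiUrls.length) {
        res.send(aggregate(results));
      }
    });
  });
}

// Stub response object so the sketch can be run directly with node.
const sent = [];
countingHandler({}, { send: (body) => sent.push(body) });
setTimeout(() => console.log(sent), 50);
```

Nothing is sent until the third callback runs, at which point the aggregate of all three bodies goes to the client exactly once.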

An important thing to notice is that the API client request callback is responsible for triggering the aggregation of the data and sending the response to the client.  There is A LOT going on here.  Tracing the flow of this program can be complicated.  If we add in error handling (or short-circuiting of the requests), things can get even more complicated!!  Finally, we are laying the foundation for a nice callback pyramid of doom.  The top-level code queues the API requests along with callbacks to be executed when the requests finish, and then the callbacks are responsible for finalizing the express get request and sending a response to the client.  I would certainly prefer that the callback is NOT responsible for this.  I feel like the callback should only be responsible for handling an individual API response.  Very focused (single responsibility) functions are generally easier to reason about, and usually easier to test.

async, a level of abstraction:

Using the wildly popular async library allows us to separate processing the results and sending the response to the client from making the API requests.
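Here is a sketch of how the handler might be written with `async.each(collection, iteratee, callback)`. So that it runs without installing anything, the `async` object below is a tiny stand-in I wrote with the same call shape, not the real library; in a real project you would use `const async = require('async');` instead. `fakeRequest`, `aggregate`, the URLs and the stub `res` object are likewise assumptions for illustration.

```javascript
// Stand-in with the same call shape as the async library's `each`.
// It is NOT the real implementation, only enough to run this sketch.
const async = {
  each(items, iteratee, done) {
    let remaining = items.length;
    let finished = false;
    if (remaining === 0) return done(null);
    items.forEach((item) => {
      iteratee(item, (err) => {
        if (finished) return; // ignore callbacks after completion
        remaining -= 1;
        if (err || remaining === 0) {
          finished = true;
          done(err || null);
        }
      });
    });
  },
};

// Hypothetical stand-in for the `request` library.
function fakeRequest(url, callback) {
  setTimeout(() => callback(null, { statusCode: 200 }, 'data from ' + url), 10);
}

// Hypothetical aggregation step: simply joins the response bodies.
function aggregate(results) {
  return results.join(', ');
}

// Shaped like an express route handler.
function asyncHandler(req, res) {
  const apiUrls = ['/api/a', '/api/b', '/api/c']; // placeholder URLs
  const results = [];

  async.each(
    apiUrls,
    // Second parameter: handles ONE API response, nothing else.
    (url, callback) => {
      fakeRequest(url, (err, response, body) => {
        if (err) return callback(err);
        results.push(body);
        callback();
      });
    },
    // Third parameter: runs once, after every request has called back
    // or as soon as one of them has called back with an error.
    (err) => {
      if (err) return res.status(500).send('API request failed');
      res.send(aggregate(results));
    }
  );
}

// Stub response object so the sketch can be run directly with node.
const sent = [];
const res = {
  status() { return this; },
  send(body) { sent.push(body); },
};
asyncHandler({}, res);
setTimeout(() => console.log(sent), 50);
```

The stand-in also shows the kind of bookkeeping involved: a counter of outstanding callbacks plus a flag so the final function fires only once.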

The above code may look more complicated, but it could be easier to test, as the callback responsible for aggregating the results and sending the response is no longer located inside an API response callback.  The requests to the API are now triggered by the async library, and they run in parallel.  When all requests have finished (each has called its callback), or as soon as one request has called its callback with an error, the function passed as the third parameter to async.each is executed.

This is great because the API response callback is no longer directly responsible for aggregating the results and sending the response to the client.  Internally, the async library keeps track of the number of requests in much the same way we did in our first correct example.  I would argue that making these requests and performing an action when all responses have completed is significantly cleaner using the async library.

Another approach using promises.... to be continued.....
