Saturday, December 12, 2015

Least Astonishment: Node.js mutable module cache

Node.js' require cache's modules after the initial call to `require`, so that subsequent calls to require return the cached module.  If a module exports an object, and that object is mutated then subsequent calls to require that module will import the mutated object.  Below are 3 files.  One exporting an object, one requiring that object and mutating it, and another re-requiring the object, which loads the mutated object from the cache:

shared.js exports an object, which will be required and mutated.

mutate-require.js imports the object from shared.js.  This is the first import so it is loading the object that shared.js exports fresh and stores it in the cache.  It then mutates that object, and exports null.

Finally index.js imports mutate-require.js, which triggers the require of shared.js, the imports shared.js directly, which results from a cache load.  The comment at the bottom of index.js shows the output of when index.js is executed.

Is this the behavior you'd expect?  I was surprised, when I saw this.  I could imagine a more sane default being: having `require` copy the exported value to the cache and returning the clean copy on each require. But then what about modules that export objects that mutate themselves??  Thoughts?

Sunday, November 29, 2015

Node.js Spying on Exported Module Property

Popular node unit testing frameworks emphasize both spying and mocking in unittests.  Because so much IO regularly occurs in node projects, planning a strategy to mock on IO in unittests should be a high priority.   When testing it's usually pretty clear what methods need to be spied on (any that make IO requests :)  But if code isn't architected to be testable from the beginning, then spies, mocks and patches end up being applied in roundabout ways.  Lately I've been seeing a common mistake made with spying on exported module objects.  This blog posts assumes understanding of testing spies.

What are some ways to spy on exported module methods?

Assume we have a module resources.js.  All of its methods are being exported for unittesting.

buildRequest is easily tested because it has no IO calls.  The difficulty arises when we try to test getData and makeRequest.  Our goal for this blog is to write a test that asserts getData calls makeRequest a single time with the correct arguments.  Since makeRequest performs IO, and IO is a no-no for our unittests, we have to mock it in some way.  Our fist attempt at doing so is:

In our test resources.js is required and then makeRequest is mocked and spied on, so the test can make assertions on it, and no IO is performed.  The output of running the above test is provided, which results in an error.  This is the initial attempt that I've been seeing very frequently.  The issue is the spy is being created on the exported objects property and NOT the makeRequest defined in and used in resources.js.

To illustrate this; if resources.js were to use the object it is exporting in its calls, then the spy would be created and used as expected!!!
The above sample modifies resources.js to use the object it is exporting, which is the same object the spy is being applied to.  The output of running the test in test.spec.js against the above resources.js file is shown. The test PASSES! (In the near future I plan on having a blog that shows, where, why, how objects are exported/required referencing node core code)

While the above works, I personally don't think it is very clear, (and haven't really seen modules that use module.exports in its function implementations.  I also think it is dangerous to reference module.exports internally to a module because, clients of that module can mutate it!!!!

A similar way to mock makeRequest is to export a reference to an object used internally.
The above code, defines a utils object in resources.js and exports a reference to it.  Because it is a reference, the test can be updated to spy on resources.utils.makeRequest and the utils object in resources.js will be mutated.

I think this way is cleaner than using module.exports directly, but is still susceptible to being mutated by a client!

Dependency injection is a strong tool for creating node code that can be easily tested; by providing a clean way to mock IO dependencies.  It requires that the caller provides (injects) a functions dependencies.  In this case getData depends on functions that perform IO, makeRequest.  Refactoring it would require the caller of getData to provide a makeRequest function.  This would seamlessly allow a test to provide a mocked makeRequest method (that doesn't make IO calls), and the actual code to provide a different makeRequest method (which does make IO calls).
getData now requires that a requestor method is provided by the client.  The test code is free to provide a spy as the requestor, while the production clients will be required to provide makeRequest

getData(resource, resources.makeRequest, callback);

For the above example, getData only has a single dependency, while frequently in real world code methods may have multiple IO dependencies.  While functions can usually be decomposed and refactored to achieve making a single call or two, doing so after code is already in production is dangerous.  Because of how easy it is to create extremely nested node.js code it is very beneficial to design testable code from the start.  A powerful tool to do this is dependency injection, something I plan on writing a lot more about very soon.

Keep testing, happy noding!!!

Sunday, November 8, 2015

Node.js handling results from multiple async functions

I've been using node.js regularly for a couple months now, and have just started to try and answer questions about it on stackoverflow.  I have frequently been seeing questions about handling Multiple asynchronous requests, ie starting multiple async requests in response to some sort of event, then waiting until All requests have finished and taking their results and then performing some action with the results.  This blog post should explain a couple of strategies for dealing with this.

The problem:

For this post I'm going to assume we have have a node.js http express server.  It has a single route registered.  The route makes a series of external API calls, waits for their response,  then processes all data and sends a response to the user:

Naive 'Synchronous' looking approach:

The goal is to get the result of each api call in the `results` array and then aggregate the results.  A naive implementation that Does Not Work is:

Why isn't this correct?  The above code sends off 3 requests to the api endpoints, saves their results, performs aggregation and then sends response to client.  If this code was executed synchronously it would work as expected, but since `request` is asynchronous results will be empty when they are aggregated.  Lets trace the code path.

1. all urls are initialized in apiUrls
2. results array is initialized, which will hold all data returned by api calls
3. apiUrls is iterated over, a request to each url is made, a callback is registered to handle the response, the callback will add the response to results array
4. the results are aggregated
5. the aggregated results are sent to the client.

Since request() is asynchronous there is no guarantee of when the callback will be called. This is the difficulty of asynchronous programming, and node in general.  If the code read the way it executed, it would be correct, but because it is asynchronous, we have absolutely no idea when or even IF the call to request() will finish.  (there are certainly ways to guarantee that the call will finish through the use of a timeout).  The program will execute like:

1. all urls are initialized in apiUrls
2. results array is initialized, which will hold all data returned by api calls, (results = [])
3. apiUrls is iterated over, a request to each url is made, a callback is registered to handle the response, the callback will add the response to results array (results = [])
4. results is still equal to [] because no data has been retrieved!!! the requests were only Sent, and a function was registered to handle the responses, WHEN THEY OCCUR, which could be anytime in the future!!!!
5. the results are aggregated, (still an empty list)
6. the aggregated results are sent to the client

Keeping track of the responses:

To be correct the program needs to aggregate the results only AFTER all requests have been made and returned.  That means the program needs to keep track of how many requests are going to be made and how many requests have been completed. When all expected requests have been completed, THEN the results should be aggregated and the response should be sent to the client.

A correct implementation requires that the program keep track of how many responses have been received.  Because of this the program is significantly more complicated.  The function keeps track of how many responses have been received, when all responses have been received, only then are the results aggregated, and the response is sent to the client.  While this program handles the case of all requests succeeding it is extremely deficient in its handling of errors.  Should the results be aggregated if one of the requests times out, or if the API server returns an error?

An important thing to notice is that the API client request callback is responsible for triggering the aggregation the the data and sending response to the client.  There is A LOT going on here.  Tracing the flow of this program can be complicated.  If we add in error handling, (or short circuiting of the requests) things can get even more complicated!!  Finally, we are laying a base for a nice callback pyramid of doom.  The top level code queues the API requests, and callbacks to be executed when the requests finish, and then the callbacks are responsible for finalizing the express get request and sending a response to the client.  I would certainly prefer that the callback is NOT responsible for this.  I feel like the callback should only be responsible for handing an individual API response.  Very focused (single responsibility) functions are generally easier to reason about, and usually easier to test.

async, A level of abstraction:

Using the wildly popular async library allows us to separate processing the results and sending response to client from making the api requests.

The above code may look more complicated, but it could be easier to test, as the callback responsible for aggregating results and sending response, is no longer located inside of an API response callback.  The requests to the API are now triggered by the async library. They are completed in parallel.  When all requests have finished completing (have called the callback method, or when one request has called callback with an error) the function passed as the third parameter to async.each will be executed.

This is great because the API response callback is no longer directly responsible for aggregating results and sending response to a client.  Internally async library is keeping track of the number of requests similar to the way we did in our first correct example.  I would argue that making these requests and performing an action when all responses have been complete is significantly more cleaner using async library.

Another approach using promises.... to be continued.....