Software Engineering Thoughts

Saturday, November 12, 2016

Go Log Centralization/Pipeline

Go concurrency is silly. It's so effective, and easy to use, that programs can quickly become a jumble of go routines spawning go routines, and channels being distributed throughout the application. I feel like static typing, combined with the go tool chain, provides a strong degree of confidence that overly concurrent code compiles, runs and is reasonably correct.

Go log ecosystem looks to be pretty fractured. The standard library `log` package is very different than what I would expect a standard library to provide, as I'm coming from a more "traditional" python/java logging ecosystem, where log levels, handler, formatters are the norms. There are some go logging packages which have adopted the same concepts. The issue is I had already begun development on an MVP and had standard library logging functions in the code, when I realized that file logging or syslog, or multiple handlers based on configuration options, would be helpful.

If I chose to use a logging package in go it would require:

- Passing around a log.Logger instance (which is threadsafe)

- Log specific formatting / text representation becomes the responsibility of the file, and is a tight coupling.

Because of go's asynchronous nature, log often contain important information that can help test validate operations/transitions are occurring correctly. If a test would like to make assertions on logger, using a "traditional" logging library, a logger interface would have to be created, and a mock version would have to be provided by the test. Go promotes many concurrently executing tasks which makes code more prone to be non-deterministic because of the asynchronous nature. If state changes to an object were logged, and a test wanted to make sure certain asynchronous state changes occurred in response to an input, then the test would have to provide a mock logger, which recorded all calls to it, and make assertions on based on that.

A pattern I found, which addresses the above issues, is having each object submit any log-worthy events to an event channel. For my first event, there is a single global even channel that is initialized at the main level. It is then passed to all lower functions. Each object writes to it:

With centralizing logging (or even per package logging) log representation is easily configurable, ie: json, syslog, stdout/err, file, etc. Instead of the log representation being defined through the log library api (perhaps as a handler), the application gets to control that logic. Having a centralized application based logger also allows for application level logging. This could be to do something like write each event of a certain type to a specific file.

- Logging is decoupled from the application, corner cases, and unforeseen errors could create conditions where events leading up to an application crash are not fully represented in the log.

- Blocking channel is used by default, if there are multiple publishers to the channel, a buffered channel should probably be used. It may be difficult to reason about what the buffer size should be since one channel may be shared across an application.

Everything proposed above is easily achievable using more traditional logging libraries (python, java) and even some go 3rd party logging libraries, like logrus. But having a centralized event stream is a lightweight alternative to using a 3rd party library, and allows for easier testing.

Saturday, December 12, 2015

Least Astonishment: Node.js mutable module cache

Node.js' require cache's modules after the initial call to `require`, so that subsequent calls to require return the cached module. If a module exports an object, and that object is mutated then subsequent calls to require that module will import the mutated object. Below are 3 files. One exporting an object, one requiring that object and mutating it, and another re-requiring the object, which loads the mutated object from the cache:

shared.js exports an object, which will be required and mutated.

mutate-require.js imports the object from shared.js. This is the first import so it is loading the object that shared.js exports fresh and stores it in the cache. It then mutates that object, and exports null.

Finally index.js imports mutate-require.js, which triggers the require of shared.js, the imports shared.js directly, which results from a cache load. The comment at the bottom of index.js shows the output of when index.js is executed.

Is this the behavior you'd expect? I was surprised, when I saw this. I could imagine a more sane default being: having `require` copy the exported value to the cache and returning the clean copy on each require. But then what about modules that export objects that mutate themselves?? Thoughts?

Sunday, November 29, 2015

Node.js Spying on Exported Module Property

Popular node unit testing frameworks emphasize both spying and mocking in unittests. Because so much IO regularly occurs in node projects, planning a strategy to mock on IO in unittests should be a high priority. When testing it's usually pretty clear what methods need to be spied on (any that make IO requests :) But if code isn't architected to be testable from the beginning, then spies, mocks and patches end up being applied in roundabout ways. Lately I've been seeing a common mistake made with spying on exported module objects. This blog posts assumes understanding of testing spies.

What are some ways to spy on exported module methods?

Assume we have a module resources.js. All of its methods are being exported for unittesting.

buildRequest is easily tested because it has no IO calls. The difficulty arises when we try to test getData and makeRequest. Our goal for this blog is to write a test that asserts getData calls makeRequest a single time with the correct arguments. Since makeRequest performs IO, and IO is a no-no for our unittests, we have to mock it in some way. Our fist attempt at doing so is:

In our test resources.js is required and then makeRequest is mocked and spied on, so the test can make assertions on it, and no IO is performed. The output of running the above test is provided, which results in an error. This is the initial attempt that I've been seeing very frequently. The issue is the spy is being created on the exported objects property and NOT the makeRequest defined in and used in resources.js.

To illustrate this; if resources.js were to use the object it is exporting in its calls, then the spy would be created and used as expected!!!
The above sample modifies resources.js to use the object it is exporting, which is the same object the spy is being applied to. The output of running the test in test.spec.js against the above resources.js file is shown. The test PASSES! (In the near future I plan on having a blog that shows, where, why, how objects are exported/required referencing node core code)

While the above works, I personally don't think it is very clear, (and haven't really seen modules that use module.exports in its function implementations. I also think it is dangerous to reference module.exports internally to a module because, clients of that module can mutate it!!!!

A similar way to mock makeRequest is to export a reference to an object used internally.
The above code, defines a utils object in resources.js and exports a reference to it. Because it is a reference, the test can be updated to spy on resources.utils.makeRequest and the utils object in resources.js will be mutated.

I think this way is cleaner than using module.exports directly, but is still susceptible to being mutated by a client!

Dependency injection is a strong tool for creating node code that can be easily tested; by providing a clean way to mock IO dependencies. It requires that the caller provides (injects) a functions dependencies. In this case getData depends on functions that perform IO, makeRequest. Refactoring it would require the caller of getData to provide a makeRequest function. This would seamlessly allow a test to provide a mocked makeRequest method (that doesn't make IO calls), and the actual code to provide a different makeRequest method (which does make IO calls).
getData now requires that a requestor method is provided by the client. The test code is free to provide a spy as the requestor, while the production clients will be required to provide makeRequest

getData(resource, resources.makeRequest, callback);

For the above example, getData only has a single dependency, while frequently in real world code methods may have multiple IO dependencies. While functions can usually be decomposed and refactored to achieve making a single call or two, doing so after code is already in production is dangerous. Because of how easy it is to create extremely nested node.js code it is very beneficial to design testable code from the start. A powerful tool to do this is dependency injection, something I plan on writing a lot more about very soon.

Keep testing, happy noding!!!

Sunday, November 8, 2015

Node.js handling results from multiple async functions

I've been using node.js regularly for a couple months now, and have just started to try and answer questions about it on stackoverflow. I have frequently been seeing questions about handling Multiple asynchronous requests, ie starting multiple async requests in response to some sort of event, then waiting until All requests have finished and taking their results and then performing some action with the results. This blog post should explain a couple of strategies for dealing with this.

The problem:

For this post I'm going to assume we have have a node.js http express server. It has a single route registered. The route makes a series of external API calls, waits for their response, then processes all data and sends a response to the user:

Naive 'Synchronous' looking approach:

The goal is to get the result of each api call in the `results` array and then aggregate the results. A naive implementation that Does Not Work is:

Why isn't this correct? The above code sends off 3 requests to the api endpoints, saves their results, performs aggregation and then sends response to client. If this code was executed synchronously it would work as expected, but since `request` is asynchronous results will be empty when they are aggregated. Lets trace the code path.

1. all urls are initialized in apiUrls
2. results array is initialized, which will hold all data returned by api calls
3. apiUrls is iterated over, a request to each url is made, a callback is registered to handle the response, the callback will add the response to results array
4. the results are aggregated
5. the aggregated results are sent to the client.

Since request() is asynchronous there is no guarantee of when the callback will be called. This is the difficulty of asynchronous programming, and node in general. If the code read the way it executed, it would be correct, but because it is asynchronous, we have absolutely no idea when or even IF the call to request() will finish. (there are certainly ways to guarantee that the call will finish through the use of a timeout). The program will execute like:

1. all urls are initialized in apiUrls
2. results array is initialized, which will hold all data returned by api calls, (results = [])
3. apiUrls is iterated over, a request to each url is made, a callback is registered to handle the response, the callback will add the response to results array (results = [])
4. results is still equal to [] because no data has been retrieved!!! the requests were only Sent, and a function was registered to handle the responses, WHEN THEY OCCUR, which could be anytime in the future!!!!
5. the results are aggregated, (still an empty list)
6. the aggregated results are sent to the client

Keeping track of the responses:

To be correct the program needs to aggregate the results only AFTER all requests have been made and returned. That means the program needs to keep track of how many requests are going to be made and how many requests have been completed. When all expected requests have been completed, THEN the results should be aggregated and the response should be sent to the client.

A correct implementation requires that the program keep track of how many responses have been received. Because of this the program is significantly more complicated. The function keeps track of how many responses have been received, when all responses have been received, only then are the results aggregated, and the response is sent to the client. While this program handles the case of all requests succeeding it is extremely deficient in its handling of errors. Should the results be aggregated if one of the requests times out, or if the API server returns an error?

An important thing to notice is that the API client request callback is responsible for triggering the aggregation the the data and sending response to the client. There is A LOT going on here. Tracing the flow of this program can be complicated. If we add in error handling, (or short circuiting of the requests) things can get even more complicated!! Finally, we are laying a base for a nice callback pyramid of doom. The top level code queues the API requests, and callbacks to be executed when the requests finish, and then the callbacks are responsible for finalizing the express get request and sending a response to the client. I would certainly prefer that the callback is NOT responsible for this. I feel like the callback should only be responsible for handing an individual API response. Very focused (single responsibility) functions are generally easier to reason about, and usually easier to test.

async, A level of abstraction:

Using the wildly popular async library allows us to separate processing the results and sending response to client from making the api requests.

The above code may look more complicated, but it could be easier to test, as the callback responsible for aggregating results and sending response, is no longer located inside of an API response callback. The requests to the API are now triggered by the async library. They are completed in parallel. When all requests have finished completing (have called the callback method, or when one request has called callback with an error) the function passed as the third parameter to async.each will be executed.

This is great because the API response callback is no longer directly responsible for aggregating results and sending response to a client. Internally async library is keeping track of the number of requests similar to the way we did in our first correct example. I would argue that making these requests and performing an action when all responses have been complete is significantly more cleaner using async library.

Another approach using promises.... to be continued.....

Friday, June 7, 2013

How much experience does a Computer Science Degree provide?

Many companies employment requirements include a computer science degree or "Equivalent Experience". In reality, the "Equivalent Experience" of a college computer science degree is much lower than employers believe.

Most times companies quantify equivalent experience as an amount of time, which is 4 years. Just because school normally lasts for 4 years, does that mean an individual gains 4 years worth of computer experience from a computer science degree.

A core component of the US bachelor degree is a humanities education. This comes in the form of required courses in art, english, history, philosophy, mathematics, social studies and often contain a physical education aspect. The vast majority of 4 year programs require courses from the above disciplines, in hopes of creating "well rounded" students. These courses are required even in technical programs.

I looked at the requirements for a computer science degree at UMBC (major requirements). 70 computer science and math credits are required for a degree. In my browsings, I have seen anywhere between 70-90 credits required in the field. For calculation purposes I will assume 80 credits required.

80 credits total / 15 credits per semester (average) = 5.3 semesters of school

A semester is roughly 3.5 months of school.

5 semester * 3.5 months per semester = 17.5 months

Um. This is 1.5 years of 15 / per week instruction?

84 weeks (1.5 years) * 15 hours of instruction per week = 1260 hours of instruction

1260 hours is 31.5 40 hour weeks, significantly less than 1 full year.

Friday, May 31, 2013

Django Request/Response Cycle, How Requests Die and Responses Are Born

This summer I decided to start contributing to django. Looking through the easy picking tickets, I realized it would be beneficial to develop an understanding of some of the core components of the django framework before jumping into tickets.

I plan on writing blog posts about the internals of django, starting with today's: What is the path of a request through the django framework, and what is the path or a response out of the framework.

Lets begin with describing the entry point, and all major functions that are called for a request/response. We can then elaborate on the important functions. Django uses the WSGI standard to interact with web servers, there is a lot of information available about this standard on the web.

Basically, django will be returning a callable application with the signature and return value the same as

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return ['Hello world!\n']

When deploying django the webserver needs to know of the location of the wsgi.py file that belongs to your project, this is where it retrieves the callable application (like the one above)

If we look in the wsgi.py file where django that is included with django we see:

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

Ah HA!

So lets start tracing how our application is built!

1. get_wsgi_application is a public wrapper function that returns a WSGIHandler instance.
Notice how WSGI handler has a lot in common with the above sample application:

Django calculates the correct response status codes and headers and calls the `start_response` function that was passed into the app (from the webserver?) when the application is called. It then returns the response.

2. So we are already at a dead end! Django returned our app to the webserver. I don't know when the webserver actually calls our app or what start_response actually does, and we don't need to, we can just trust that the wsgi modules/wsgi servers do things right! The response is calculated when our application (WSGIHandler instance) is called. This takes place in WSGIHandlers.__call__.

3. The first thing WSGIHandler does is attempt to load middleware that is registered in the settings.py file.

4. The next significant action is the instantiation of a a new WSGIRequest

5. The last main action is retrieving a response for the new request object in get_response. This is where the bulk of the request logic comes into play. This function is responsible for calling all middleware hooks at the correct times, and for resolving a url to a view function, and executing that view function. This is where those nice error messages of your view did not return a HTTPResponse object comes from!

This has been a quick, pretty high level architecture overview, of which main functions django calls to turn a request into a response. There ARE other things going on like emitting signals along the way, and a couple config options to prepare a request for processing, but this is pretty much the gist of it.

Friday, April 26, 2013

Loading jQuery plugins using require.js and backbone.js

I recently began using backbone.js through the package grunt-bbb. I ran into a little bit of trouble loading external javascript libraries through require.js, specifically jQuery plugins.

For anyone new to backbone.js development, I would strongly advise you check out grunt-bbb. It is a framework and build system for backbone.js. It provides commands to generate scaffolding and for building/testing/linting your project.

Below are instructions specific to grunt-bbb, but can easily be adapted to any project using require.js.

require.js configuration provides a `shim` option. According to the docs:

Configure the dependencies and exports for older, traditional "browser globals" scripts that do not use define() to declare the dependencies and set a module value.

This allows you to define what order packages are loaded in. This is necessary because your jQuery plugin relies on jQuery.

In your require.js config it is important to provide both the location of your plugin and specify that jQuery is dependent on your plugin



The above code points to where the js plugins are located (without.js extension) and registers them as

 dependent on jQuery.

Now all you have to do is load the scripts in your module's define statement:





define(["app", 'prettyPhoto', 'jqueryui' ], function(app) {

});

The plugins are loaded by the key you define in `paths` object!