My team and I developed and maintain a cloud-based content management platform. The product comprises a set of multi-tenant microservices, which we host, and a .NET client framework for interacting with them. In the process of rounding out our automated integration test suite, we encountered an alarming issue.
A single test — one intended to simply query for some content — was failing in its attempt to authenticate with one of our services.
The kicker was that the issue only presented itself in our CI environment. It never occurred when our test suite was executed locally and there had never been any reports of such an issue in the wild. In fact, the functionality in question was some of the most stable logic in the product. It had remained unchanged since our initial release.
The product implements an HMAC approach to validating the authenticity of a client’s request. Generally speaking, a client computes a hash for each request using a unique secret known only to the client and service. The service receiving the request performs the same hash computation and compares it against the value calculated by the client. If the authentication codes match, the request is deemed to be authentic. If not, the request is rejected.
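The product's client is .NET, but the scheme itself is language-agnostic. Here is a minimal sketch of the sign-and-verify round trip in Python; the canonical-string layout below is a hypothetical example for illustration, not the product's actual wire format:

```python
import hashlib
import hmac

def sign_request(secret: bytes, app_id: str, uri: str, body: bytes) -> str:
    # Both client and service must reproduce these exact bytes.
    # (This layout is an illustrative assumption, not the product's format.)
    canonical = "\n".join([app_id, uri]).encode("utf-8") + b"\n" + body
    return hmac.new(secret, canonical, hashlib.sha256).hexdigest()

def verify_request(secret: bytes, app_id: str, uri: str,
                   body: bytes, claimed: str) -> bool:
    expected = sign_request(secret, app_id, uri, body)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(expected, claimed)
```

Everything hinges on both parties producing byte-identical canonical strings; any divergence in any input, however small, yields a completely different authentication code.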
So, to describe our issue another way, the hash of a request computed in our CI environment differed from the hash of the same request computed in our hosting environment. In fact, the hash computed in our CI environment differed from those computed in all other environments.
So the questions at hand are:
- What are the request factors that cause an HMAC to vary?
- What are the environmental factors that could cause those factors to vary?
- What steps need to be taken to mitigate those factors?
There are a number of components to a request that factor into the computation of an authentication code. Some of those components include:
- The application’s unique identifier
- The application’s secret
- The request URI
- The request body
(There are additional components — time stamps, nonces, etc. — which I’ve chosen to exclude to streamline the discussion.)
It was easy enough to verify that the app ID and secret applied in the CI environment matched our expectations and, indeed, they did. The test in question was a simple query, a GET request, which meant that there would be no request body. This isolated the request URI as the only factor in the computation that could be causing the hash to vary.
The host and path targeted by the request were confirmed to be identical in both the local and CI environments. This left the query string as the only remaining point of variability. The criteria used to formulate that query string were identical from test run to test run, but somehow the two environments were producing different results.
Sure enough, the query strings produced for the same criteria were different: they differed in the encoding of a single character, the tilde (~).
Now, the tilde can be a problematic character in URIs despite being listed as unreserved in RFC 3986. That said, it wasn’t clear why its handling would differ from one environment to the next. To better understand this, I had to look at how the URI was being obtained for the purposes of hash computation.
The hash is computed by evaluating an HttpRequestMessage just prior to it being issued by an HttpClient. The HttpRequestMessage class exposes a RequestUri property (a Uri object), from which we evaluate the AbsoluteUri. The value of this property served as an input to the hash computation. Could the Uri class be our culprit?
Indeed it could. A simple assertion found this operation failing in one environment and passing in another.
var expectedEndpoint = "https://host/api/service/?target=resource%3A%2F%2Fa&sort=-%7eValue";
var uri = new Uri(expectedEndpoint);
Assert.AreEqual(expectedEndpoint, uri.AbsoluteUri);
I decided to go hunting. I decompiled .NET’s System.dll and took a look at the source. There was clearly some amount of encoding and decoding involved in the class’s handling of the URI value but there wasn’t an obvious conditional that marked a decision to handle it differently according to the environment.
I dove deeper. Buried in a supporting class’s definition was this little gem: UriQuirksVersion!
The class was clearly conscious of its environment and leveraging that information to make decisions about its URI parsing, via checks like BinaryCompatibility.TargetsAtLeast_Desktop_V4_5.
Our application has a requirement of .NET 4.5.2 and our build scripts confirmed that we were appropriately targeting that version of the framework. Furthermore, some additional assertions executed in our CI environment confirmed that we were in fact compiled against the expected version of the framework. What else might influence the target framework?
The test container!
As an API product, we simply leverage Microsoft’s existing unit test framework as a test harness for our integration suite. When initiated from Visual Studio, the tests execute through VSTest. In our CI environment, however, the tests were running with MSTest. After modifying the CI scripts to launch our tests with VSTest (and an explicit target framework version of 4.5.2), the test in question began to pass.
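Updating the CI scripts resolved our symptom, but there is also a more defensive mitigation worth noting: canonicalize the URI before it enters the hash, on both the signing and verifying sides, so that runtime parsing quirks can't produce a mismatch. This was not what we shipped — it's a sketch of the idea in Python, decoding percent-escapes only for RFC 3986 unreserved characters and normalizing the hex case of the escapes that remain; a .NET implementation would apply the same rule to AbsoluteUri:

```python
import re

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def canonicalize(uri: str) -> str:
    """Decode percent-escapes for unreserved characters only, and
    uppercase the hex digits of the escapes that must remain."""
    def fix(match: "re.Match") -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)
```

With this in place, "-%7eValue" and "-~Value" canonicalize to the same string, while reserved escapes such as "%2F" are preserved (merely uppercased), so the hash inputs agree regardless of how aggressively the runtime's URI class escapes or unescapes.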
In the end, this was thankfully a non-issue for the product. It required nothing more than an update to our CI scripts. That said, it is an important reminder that seemingly innocuous operations and evaluations may have obscure consequences. And that reminder applies whether we’re consuming a dependency or acting as one for someone else.
(Note: I later found similar complaints regarding NUnit’s handling of the framework target.)