Monday 9 November 2015

Implementing a Replicated Token Service with JSON Web Tokens

Last week I observed one of the eight fallacies of distributed computing in action:
"Topology doesn't change"
A client of mine deployed the latest versions of his web services to a highly-available QA environment. Sanity tests gave initial confirmation that the system was behaving as expected. But then the QA team reported weird behaviour in the system's offline functionality, so I was called in to figure out the problem. The logs showed an application getting seemingly random HTTP 401s from the system's token service.

This token service is a Java web application that creates and verifies JSON Web Tokens (JWTs). A client receives an HTTP 200 OK from the service for a token that passes verification; otherwise, it receives an HTTP 401 Unauthorized. On startup, the token service creates a public/private key pair (PPK) in-memory for signing and verifying these tokens. I knew the token service in QA was replicated, with requests load-balanced across the replicas in a round-robin fashion. This quickly led me to the realisation that the issue occurred whenever a replica of the token service verified, with its own public key, a token that had been created and signed by a different replica with that replica's private key. The issue wasn't caught in the developer's testing environment because services there weren't replicated.

I'm going to describe the solution I implemented for this problem because, though it's simple to program, it might not be obvious. All the code shown is in Java or SQL, but it should be relatively easy to adapt to the technologies of your choice.

At an abstract level, the solution is to make each token service replica's public key and key ID visible to all other replicas. The signing replica embeds its key ID in the token before signing the token with its own private key. When the token service receives a request to verify a token, it extracts the key ID from the token and uses it to look up the public key with which to verify it. Security-wise, this approach lets us keep each private key secret with respect to the other token service replicas.

Now let's delve into the details. Given that the token service replicas share a database, I re-use the database to share the public keys and key IDs between replicas. In a relational database context, the schema for holding such information might look like this:
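(A minimal sketch; the exact column types are my assumptions, but the three columns match the descriptions below.)

    CREATE TABLE Configuration (
        nodeId  VARCHAR(36)  NOT NULL,  -- UUID of the replica owning the row
        name    VARCHAR(255) NOT NULL,  -- type of configuration, e.g. 'publicKey'
        value_  TEXT         NOT NULL,  -- the serialised configuration value
        PRIMARY KEY (nodeId, name)
    );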

  1. nodeId is a UUID representing the replica owning the table row. This enables me to delete the row owned by a replica when it gracefully shuts down, reducing, but not eliminating, the likelihood of orphan records.

  2. name identifies the type of configuration. Although in this solution I'm only storing the public key in the table, you might want to store other configurations.

  3. value_ is where I store the actual public key along with the key ID.

In the token service, I use jose.4.j 0.4.4, an open-source Java implementation of JWT, for generating and verifying tokens. Before I can go on to generate or verify a token, I first need to create a PPK and register the public key, including its key ID, so that it can be read by other replicas:
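(A sketch of this startup step; ConfigurationDataMapper is the service's own data mapper, while the method name, the "publicKey" configuration name, and the use of the servlet context are my assumptions.)

    import java.util.UUID;

    import javax.servlet.ServletContext;

    import org.jose4j.jwk.RsaJsonWebKey;
    import org.jose4j.jwk.RsaJwkGenerator;
    import org.jose4j.lang.JoseException;

    public void registerPublicKey(ServletContext context) throws JoseException {
        // The node ID is a UUID created at startup; it doubles as the key ID.
        String nodeId = UUID.randomUUID().toString();

        // Generate a 2048-bit public/private key pair (PPK).
        RsaJsonWebKey rsaJsonWebKey = RsaJwkGenerator.generateJwk(2048);
        rsaJsonWebKey.setKeyId(nodeId);

        // Register the public key and key ID so that other replicas can read them.
        // toJson() serialises the public key and key ID, NOT the private key.
        ConfigurationDataMapper.insertConfiguration(nodeId, "publicKey", rsaJsonWebKey.toJson());

        // Keep the PPK in the service context to read the private key later for signing.
        context.setAttribute("rsaJsonWebKey", rsaJsonWebKey);
    }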

The above code is executed at startup and merits a brief explanation:

RsaJwkGenerator.generateJwk(2048) returns a 2048-bit PPK. The key ID for the PPK is set to the node ID, which is simply a UUID also created at startup.

ConfigurationDataMapper.insertConfiguration(...) registers the public key by adding a record to the database table Configuration. Its parameters map to the table columns nodeId, name, and value_, respectively.

rsaJsonWebKey.toJson() does the job of serialising the public key and key ID to JSON for us. Note that toJson() does NOT include the private key in the returned JSON.

Finally, the PPK is saved in the service's context so that the private key can be read later on for signing tokens.

As mentioned above, the token service creates signed tokens for clients. The code for this is implemented in the createToken(...) method:
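(A sketch modelled on jose.4.j's examples; the claim values are illustrative.)

    import org.jose4j.jwk.RsaJsonWebKey;
    import org.jose4j.jws.AlgorithmIdentifiers;
    import org.jose4j.jws.JsonWebSignature;
    import org.jose4j.jwt.JwtClaims;
    import org.jose4j.lang.JoseException;

    public String createToken(RsaJsonWebKey rsaJsonWebKey) throws JoseException {
        // Standard claims; the values here are illustrative.
        JwtClaims claims = new JwtClaims();
        claims.setIssuer("token-service");
        claims.setExpirationTimeMinutesInTheFuture(10);
        claims.setGeneratedJwtId();
        claims.setIssuedAtToNow();

        JsonWebSignature jws = new JsonWebSignature();
        jws.setPayload(claims.toJson());
        // Sign with this replica's private key...
        jws.setKey(rsaJsonWebKey.getPrivateKey());
        // ...and embed the PPK's key ID (the node ID) so a verifier can find the right public key.
        jws.setKeyIdHeaderValue(rsaJsonWebKey.getKeyId());
        jws.setAlgorithmHeaderValue(AlgorithmIdentifiers.RSA_USING_SHA256);
        return jws.getCompactSerialization();
    }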

I blatantly copied the code from jose.4.j's excellent examples page, where you can find an explanation of what it does. However, I want to highlight that I'm passing the PPK I saved earlier in the service context to createToken(...). Additionally, observe that setKeyIdHeaderValue(...) sets the token's key ID to the PPK's key ID, which is the node ID.

On receiving a request to verify a token, the service fetches all registered public keys and key IDs from the database before verifying the token [1]:
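(A sketch; selectConfigurations(...) is an assumed data-mapper method returning the persisted value_ column of every replica's 'publicKey' row.)

    import java.util.ArrayList;
    import java.util.List;

    import org.jose4j.jwk.JsonWebKey;
    import org.jose4j.lang.JoseException;

    public boolean verifyToken(String token) throws JoseException {
        // Fetch every replica's serialised public key (and key ID) from the database.
        List<String> serialisedKeys = ConfigurationDataMapper.selectConfigurations("publicKey");

        // Re-construct the public keys from their persisted JSON form.
        List<JsonWebKey> publicKeys = new ArrayList<>();
        for (String json : serialisedKeys) {
            publicKeys.add(JsonWebKey.Factory.newJwk(json));
        }

        return isValid(token, publicKeys);
    }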

In the above method, the public keys are (1) reconstructed from the JSON persisted to the database and (2) added to a list. The list is passed to the isValid(...) method along with the token. isValid(...) returns true if the token passes verification and false otherwise:
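(A sketch; the consumer constraints are illustrative.)

    import java.util.List;

    import org.jose4j.jwk.JsonWebKey;
    import org.jose4j.jwt.consumer.InvalidJwtException;
    import org.jose4j.jwt.consumer.JwtConsumer;
    import org.jose4j.jwt.consumer.JwtConsumerBuilder;
    import org.jose4j.keys.resolvers.JwksVerificationKeyResolver;

    private boolean isValid(String token, List<JsonWebKey> publicKeys) {
        // Resolve the verification key by matching the token's key ID (the node ID)
        // against the key IDs of the fetched public keys.
        JwksVerificationKeyResolver resolver = new JwksVerificationKeyResolver(publicKeys);

        JwtConsumer jwtConsumer = new JwtConsumerBuilder()
                .setRequireExpirationTime()            // illustrative constraint
                .setAllowedClockSkewInSeconds(30)      // illustrative constraint
                .setVerificationKeyResolver(resolver)
                .build();

        try {
            jwtConsumer.processToClaims(token);        // throws if verification fails
            return true;
        } catch (InvalidJwtException e) {
            return false;
        }
    }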

In isValid(...), I pass the list of public keys to the JwksVerificationKeyResolver constructor to create a resolver that picks the public key whose key ID matches the one extracted from the received token. The rest of the code builds a JwtConsumer object that verifies the token.

The last item to tackle is having a token service replica that is shutting down gracefully delete its public key from the Configuration table:
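(A sketch; deleteConfigurations(...) is an assumed data-mapper method keyed on the node ID.)

    // Invoked on graceful shutdown, e.g. from ServletContextListener.contextDestroyed(...).
    public void deregisterPublicKey(String nodeId) {
        // Delete the rows owned by this replica from the Configuration table.
        ConfigurationDataMapper.deleteConfigurations(nodeId);
    }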

This is required because the replica's private key and node ID are kept in-memory and are therefore lost on shutdown. Of course, this isn't a foolproof way of eliminating orphan records. Furthermore, it's possible that a token signed by a replica is still in circulation after the replica has shut down, causing the token to fail verification. I'll leave these problems as exercises for the reader.

1: Fetching every registered public key from the database on each verification request definitely has room for improvement in terms of performance.