Vault eventual consistency
Vault Enterprise license or HCP Vault Dedicated cluster required.
When running in a cluster, Vault has an eventual consistency model. Only one node (the leader) can write to Vault's storage. Users generally expect read-after-write consistency: in other words, after writing foo=1, a subsequent read of foo should return 1. Depending on the Vault configuration this isn't always the case. When using performance standbys with Integrated Storage, or when using performance replication, there are some sequences of operations that don't always yield read-after-write consistency.
Performance standby nodes
When using the Integrated Storage backend without performance standbys, only a single Vault node (the active node) handles requests. Requests sent to regular standbys are handled by forwarding them to the active node. This Vault configuration gives Vault the same behavior as the default Consul consistency model.
When using the Integrated Storage backend with performance standbys, both the active node and performance standbys can handle requests. If a performance standby handles a login request, or a request that generates a dynamic secret, the performance standby will issue a remote procedure call (RPC) to the active node to store the token and/or lease. If the performance standby handles any other request that results in a storage write, it will forward that request to the active node in the same way a regular standby forwards all requests.
With Integrated Storage, all writes occur on the active node, which then issues RPCs to update the local storage on every other node. Between when the active node writes the data to its local disk, and when those RPCs are handled on the other nodes to write the data to their local disks, those nodes present a stale view of the data.
As a result, even if you're always talking to the same performance standby, you may not get read-after-write semantics. The write gets sent to the active node, and if the subsequent read request occurs before the new data gets sent to the node handling the read request, the read request won't be able to take the write into account because the new data isn't present on that node yet.
Performance replication
A similar phenomenon occurs when using performance replication. One example of how this manifests is when using shared mounts. If a KV secrets engine is mounted on the primary with local=false, it will exist on the secondary cluster as well. The secondary cluster can handle requests to that mount, though as with performance standbys, write requests must be forwarded, in this case to the primary cluster's active node. Once data is written to the primary cluster, it won't be visible on the secondary cluster until the data has been replicated from the primary. Therefore, on the secondary cluster, it initially appears as if the data write hasn't happened.
If the secondary cluster is using Integrated Storage, and the read request is being handled on one of its performance standbys, the problem is exacerbated because it has to be sent first from the primary active node to the secondary active node, and then from there to the secondary performance standby, each of which can introduce their own form of lag.
Even without shared secret engines, stale reads can still happen with performance replication. The Identity subsystem aims to provide a view on entities and groups which span across clusters. As such, when logging in to a secondary cluster using a shared mount, Vault tries to generate an entity and alias if they don't already exist, and these must be stored on the primary using an RPC. Something similar happens with groups.
Clock skew and replication lag
As seen above, both performance standbys and replication secondaries can lag behind the active node or the primary. As of Vault 1.17, it's possible to get some insight into that lag using sys/health, sys/ha-status, and the replication status endpoints.
Secondaries and standbys regularly issue an "echo" heartbeat RPC to their upstream source. This heartbeat serves many purposes, one of them being to get a rough idea of whether the clocks of the client and server are in sync. The server response to the heartbeat RPC includes the server's local clock time, and the client takes the delta in milliseconds between that time and the client's local clock time to compute the clock_skew_ms field. No effort is made to factor into that field the time it took to actually perform the RPC, though that information is made available as the last_heartbeat_duration_ms field. In other words, the reported clock skew has an uncertainty of up to last_heartbeat_duration_ms.
Vault assumes that clocks are synced across all nodes in a cluster; if they aren't, problems may arise, e.g. one node may consider a lease expired while another does not yet. Some community-supported storage backends may have further problems relating to HA mode.
There are fewer problems expected when clock skew exists between a replication primary and secondary. However, one known issue is that the replication lag canary discussed next will produce surprising values if clocks aren't synced between the clusters.
Non-secondary active nodes periodically write a small record to storage containing the local clock time for that node. Replication secondaries read that record and compare it to their local clock time, calling the delta the replication_primary_canary_age_ms, which is exposed in the replication status endpoints. Performance standbys do the same computation, exposing replication_primary_canary_age_ms in the sys/health and sys/ha-status endpoints. Performance standbys and replication secondaries include their current replication_primary_canary_age_ms as part of their payload for the aforementioned "echo" heartbeat RPCs they issue, allowing the active node or primary cluster to report on the lag seen by their downstream clients.
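As a rough illustration, the following sketch queries sys/ha-status with the standard Go HTTP client and prints lag-related fields for each node. The exact fields returned depend on the Vault version; clock_skew_ms and replication_primary_canary_age_ms are assumed here based on the description above.

```go
// Sketch only: print per-node lag fields from sys/ha-status.
// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	req, err := http.NewRequest("GET", os.Getenv("VAULT_ADDR")+"/v1/sys/ha-status", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Vault-Token", os.Getenv("VAULT_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Nodes []map[string]interface{} `json:"nodes"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}

	// Field names here are assumptions based on the text above and may
	// not be present on older Vault versions.
	for _, node := range body.Nodes {
		fmt.Printf("%v: clock_skew_ms=%v replication_primary_canary_age_ms=%v\n",
			node["hostname"], node["clock_skew_ms"], node["replication_primary_canary_age_ms"])
	}
}
```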
Mitigations
There has long been a partial mitigation for the above problems. When writing data via RPC, e.g. when a performance standby registers tokens and leases on the active node after a login or generating a dynamic secret, part of the response includes a number known as the "WAL index", aka Write-Ahead Log index.
A full explanation of this is outside the scope of this document, but the short version is that both performance replication and performance standbys use log shipping to stay in sync with the upstream source of writes. The mitigation historically used by nodes doing writes via RPC is to look at the WAL index in the response and wait up to 2 seconds for that WAL index to appear in the logs being shipped from upstream. Once the WAL index is seen, the Vault node handling the request that resulted in RPCs can return its own response to the client: it knows that any subsequent reads will be able to see the value that was just written. If the WAL index isn't seen within those 2 seconds, the Vault node completes the request anyway, returning a warning in the response.
This mitigation still exists in Vault 1.7, though there is now a configuration option to adjust the wait time: best_effort_wal_wait_duration.
Vault 1.7 mitigations
There are now a variety of other mitigations available:
- per-request option to always forward the request to the active node
- per-request option to conditionally forward the request to the active node if it would otherwise result in a stale read
- per-request option to fail requests if they might result in a stale read
- Vault Proxy configuration to do the above for proxied requests
The remainder of this document describes the tradeoffs of these mitigations and how to use them.
Note that any headers requesting forwarding are disabled by default, and must be enabled using allow_forwarding_via_header.
Unconditional forwarding (Performance standbys only)
The simplest solution to never experience stale reads from a performance standby is to provide the following HTTP header in the request:
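In recent Vault Enterprise releases this is the X-Vault-Forward header; confirm the exact header name against the documentation for your Vault version:

```
X-Vault-Forward: active-node
```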
The drawback here is that if all your requests are forwarded to the active node, you might as well not be using performance standbys. So this mitigation only makes sense to use selectively.
This mitigation will not help with stale reads relating to performance replication.
Conditional forwarding (Performance standbys only)
As of Vault Enterprise 1.7, all requests that modify storage now return a new HTTP response header:
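This is the X-Vault-Index header; its value is an opaque token describing, roughly, the minimum storage state produced by the write (shown here as a placeholder):

```
X-Vault-Index: <opaque state value>
```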
To ensure that the state resulting from that write request is visible to a subsequent request, add these headers to that second request:
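Concretely, echo back the X-Vault-Index value from the write response and ask for forwarding when that state is missing:

```
X-Vault-Index: <value from the earlier X-Vault-Index response header>
X-Vault-Inconsistent: forward-active-node
```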
The effect will be that the node handling the request will look at the state it has locally, and if it doesn't contain the state described by the X-Vault-Index header, the node will forward the request to the active node.
The drawback here is that when requests are forwarded to the active node, performance standbys provide less value. If this happens often enough the active node can become a bottleneck, limiting the horizontal read scalability performance standbys are intended to provide.
Retry stale requests
As of Vault Enterprise 1.7, all requests that modify storage now return a new HTTP response header:
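As before, this is the X-Vault-Index header, whose value is an opaque token describing the state produced by the write (shown here as a placeholder):

```
X-Vault-Index: <opaque state value>
```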
To ensure that the state resulting from that write request is visible to a subsequent request, add this header to that second request:
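Here only the X-Vault-Index header is supplied, echoing the value from the write response, with no forwarding header:

```
X-Vault-Index: <value from the earlier X-Vault-Index response header>
```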
When the desired state isn't present, Vault will return a failure response with HTTP status code 412. This tells the client that it should retry the request. The advantage over the Conditional Forwarding solution above is twofold: first, there's no additional load on the active node. Second, this solution is applicable to performance replication as well as performance standbys.
The Vault Go API will now automatically retry 412s, and provides convenience methods for propagating the X-Vault-Index response header into the request header of subsequent requests. Those not using the Vault Go API will want to build equivalent functionality into their client library.
Vault proxy and consistency headers
When configured, the Vault API Proxy will proxy incoming requests to Vault. There is Proxy configuration available in the api_proxy stanza that allows making use of some of the above mitigations without modifying clients.
By setting enforce_consistency="always", Proxy will always provide the X-Vault-Index consistency header. The value it uses for the header will be based on the responses that have passed through the Proxy previously.
The option when_inconsistent controls how stale reads are prevented:
- "fail" means that when a 412 response is seen, it is returned to the client
- "retry" means that 412 responses will be retried automatically by Proxy, so the client doesn't have to deal with them
- "forward" makes Proxy provide the X-Vault-Inconsistent: forward-active-node header as described above under Conditional Forwarding
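For illustration, a minimal api_proxy stanza might look like the following sketch (other required Proxy configuration, such as the listener and vault stanzas, is omitted):

```hcl
# Sketch: always attach the X-Vault-Index consistency header to proxied
# requests and transparently retry 412 responses on behalf of clients.
api_proxy {
  enforce_consistency = "always"
  when_inconsistent   = "retry"
}
```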
Vault 1.10 mitigations
In Vault 1.10, the service token format changed so that tokens carry the information needed for server-side consistency. By default, requests made to nodes that cannot provide read-after-write consistency, because they don't yet have the WAL index needed to check the Vault token locally, return a 412 status code. The Vault Go API automatically retries 412s, so unless there is considerable replication delay, users will experience read-after-write consistency.
The replication option allow_forwarding_via_token can be used to have requests that would otherwise return a 412 in this way forwarded to the active node instead.
Refer to the Server Side Consistent Token FAQ for details.
Client API helpers
There are some new helpers in the api package to work with the new headers. WithRequestCallbacks and WithResponseCallbacks create a shallow clone of the client and populate it with the given callbacks. RecordState and RequireState are used to store the response header from one request and provide it in a subsequent request. For example:
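The following is a minimal sketch, assuming a KV version 2 secrets engine mounted at secret/ and VAULT_ADDR and VAULT_TOKEN set in the environment:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	client, err := api.NewClient(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Record the X-Vault-Index header returned by the write.
	var state string
	wClient := client.WithResponseCallbacks(api.RecordState(&state))
	_, err = wClient.Logical().Write("secret/data/foo", map[string]interface{}{
		"data": map[string]interface{}{"bar": "baz"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Require that state on the read; 412 responses are retried until the
	// node handling the read has caught up.
	rClient := client.WithRequestCallbacks(api.RequireState(state))
	secret, err := rClient.Logical().Read("secret/data/foo")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(secret.Data["data"])
}
```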
This will retry the Read until the data stored by the Write is present.
There are also callbacks to use forwarding: ForwardInconsistent and ForwardAlways.