One of the Tintri SEs asked whether the advice here about tuning NFS was relevant to us:
http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch18_01.htm (It is not.)
Back in the days when CPU cycles and network bits were considered precious, NFS (the Network File System) was often run over UDP. UDP is a connectionless, best-effort "datagram" service, so occasionally packets get lost along the way, or corrupted and dropped (if you didn't disable checksumming because of those CPU costs). But UDP doesn't provide any way to tell that this happened. Thus, the next layer up (SUN-RPC) put a transaction ID in each outgoing call and matched it against the corresponding response. If the client didn't see a matching response within a given time window, it assumed the packet got lost and retransmitted.
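To make the mechanism concrete, here's a minimal sketch of a call-with-retry loop over UDP. This is not real SUN-RPC code: the function name, the toy wire format (a 4-byte transaction ID prepended to the payload), and the retry policy are all invented for illustration.

```python
import os
import socket
import struct

def rpc_call_over_udp(sock, server_addr, payload, timeout=1.0, max_retries=5):
    """Send a call tagged with a transaction ID (XID) over UDP and wait for
    the reply carrying the same XID, retransmitting on timeout."""
    xid = struct.unpack("!I", os.urandom(4))[0]
    message = struct.pack("!I", xid) + payload
    sock.settimeout(timeout)

    for attempt in range(max_retries + 1):
        sock.sendto(message, server_addr)
        try:
            while True:
                reply, _ = sock.recvfrom(65535)
                reply_xid = struct.unpack("!I", reply[:4])[0]
                if reply_xid == xid:      # the response to *this* call
                    return reply[4:]
                # otherwise it's a stale reply to an earlier attempt; ignore it
        except socket.timeout:
            continue                      # assume the packet got lost; retry
    raise TimeoutError("no matching reply after %d attempts" % (max_retries + 1))
```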
This causes two sorts of problems. First, NFS itself must be robust against retransmitted requests, because sometimes it's the response that got lost, or the client just didn't wait long enough for the file server to do its thing. That's not so easy for operations like "create a file" or "delete a file," where a second attempt would normally result in an error ("a file with that name already exists" and "there's no such file," respectively). So NFS servers started using what's called a "duplicate request cache," which tracks recently seen operations; if an incoming request matches one, the server echoes back the same response it sent before.
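The core of a duplicate request cache can be as simple as a map from (client, transaction ID) to the reply already sent. The sketch below is illustrative only; real NFS servers are considerably more careful about eviction, checksums of the request body, and in-progress requests.

```python
class DuplicateRequestCache:
    """Remember recent (client, xid) pairs and the replies sent for them, so a
    retransmitted 'create' or 'remove' gets the original answer replayed instead
    of a spurious 'already exists' / 'no such file' error."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self.cache = {}      # (client_addr, xid) -> reply
        self.order = []      # insertion order, for crude FIFO eviction

    def handle(self, client_addr, xid, request, execute):
        key = (client_addr, xid)
        if key in self.cache:
            return self.cache[key]       # duplicate: replay the earlier reply
        reply = execute(request)         # first time: actually do the operation
        self.cache[key] = reply
        self.order.append(key)
        if len(self.order) > self.max_entries:
            oldest = self.order.pop(0)
            self.cache.pop(oldest, None)
        return reply
```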
The second problem is figuring out the appropriate timeout. You want to keep the average cost of operations low, but not spend a lot of resources retransmitting packets. The latter can be expensive even if you don't have a crappy network: say the server is slow because it's overloaded. You don't want to bombard it with extra traffic.
Say you're operating at 0.1% packet loss. A read (that goes to disk) normally takes about 10ms when the system is not under load. If you set your timeout to 100ms, then the average read takes about 0.999 * 10ms + 0.000999 * 110ms + 0.000000999 * 210ms + ..., roughly 10.1ms. But if the timeout is a second, that becomes 11ms, and if the timeout is 10 seconds then we're talking 20ms.
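The series has a closed form: with per-attempt loss probability p, base service time s, and timeout t, the expected latency is s + t * p / (1 - p), since each lost attempt costs one extra timeout. A quick check of the numbers above (purely illustrative, not measured data):

```python
def expected_read_latency_ms(base_ms=10.0, timeout_ms=100.0, loss=0.001):
    """Closed form of the series 0.999*10 + 0.000999*110 + ... from the text:
    each attempt is lost with probability `loss`, and every loss costs one
    full timeout before the retry."""
    return base_ms + timeout_ms * loss / (1.0 - loss)

for timeout in (100.0, 1000.0, 10000.0):
    print("timeout %6.0f ms -> expected read %.2f ms"
          % (timeout, expected_read_latency_ms(timeout_ms=timeout)))
# timeout    100 ms -> expected read 10.10 ms
# timeout   1000 ms -> expected read 11.00 ms
# timeout  10000 ms -> expected read 20.01 ms
```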
So, at least in theory, this is worth tuning: a 2x difference between a bad setting and a good one matters. Except that the whole setup is completely nuts.
In order to make a reasonable decision, the system administrator needs statistics on how long various NFS calls tend to take, and the client already captures this information. But once you have that, why does the system administrator need to get involved at all? Why shouldn't the NFS client automatically adapt to the observed latency, and dynamically calculate a timeout value in keeping with a higher-level policy? (For a concrete example, the timeout could be set to the 99th percentile of observed response times, for an expected 1% over-retransmit rate.) Why on earth is it better to provide a tuning guide than to build a protocol implementation that doesn't need tuning? This fix wouldn't require any agreement from the server; you could just do it right on your own!
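Here's a minimal sketch of that policy: keep a sliding window of observed response times and derive the retransmit timeout from their 99th percentile. The window size, floor, and cold-start default are arbitrary choices for illustration, not anything an actual NFS client does.

```python
from collections import deque

class AdaptiveTimeout:
    """Track recent response times and set the retransmit timeout to their
    99th percentile, for an expected ~1% over-retransmit rate."""

    def __init__(self, window=1000, percentile=0.99, floor_ms=10.0):
        self.samples = deque(maxlen=window)   # sliding window of RTTs in ms
        self.percentile = percentile
        self.floor_ms = floor_ms              # never drop below a sane minimum

    def record(self, response_time_ms):
        self.samples.append(response_time_ms)

    def current_timeout_ms(self):
        if not self.samples:
            return 1000.0                     # conservative default until we have data
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        return max(self.floor_ms, ordered[index])
```

TCP's retransmission timer takes a similar observational approach, deriving its timeout from a smoothed RTT estimate plus a variance term rather than a fixed percentile.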
Fortunately, in the modern world NFS mainly runs over TCP, which has mechanisms which can usually tell more quickly that a request or response has gone missing. (Not always, and some of its specified timeouts suffer the same problem of being hard-coded rather than adaptive.) This doesn't remove the first problem (you still need the duplicate request cache) but makes an entire class of tuning pretty much obsolete.
Nothing upsets me more in systems research and development than a parameter that the user must set in order for the system to work correctly. If the correct value can be obtained observationally or experimentally, the system should do so. If the parameter must be set based on intuition, "workload type", expensive consultants, or the researcher's best guess, then the real problem hasn't been solved yet.