Yay code review. NFS Lifetime rules are still brain-bending.: landley

landley

Yay code review. NFS Lifetime rules are still brain-bending.

Apr 09, 2011 01:22

Ok, found a workaround for the linux-2.6.39-rc1 hang that Jens Axboe's been distracted from solving for a couple weeks: disable preemption. So I can go back to testing/developing against linus's current -git tree instead of 2.6.38, which is good.

My Ottawa Linux Symposium paper submission 'Why Containers Are Awesome' has been approved, meaning I need to actually _write_ the paper now. I'm collecting a file full of links, might take a stab at it this weekend...

On the NFS front I'm pulling back from my ongoing battle with lockd/statd (which are a horrible mess from a design level and apparently always have been), because I managed to poke some people into reviewing my NFS patches (yay!), and Serge Halyn raised a good point about lifetime rules in the third patch. Which means I have to reevaluate the lifetime rules, which are always the hard part with kernel stuff. Sigh.

The struct net * is reference counted via get_net() and put_net(). The process context has a reference, the RPC transport object has a reference, and my first patch makes nfs_client take a reference. The reason three different objects take a reference is they have independent lifetimes: the mount process going away (or calling unshare()) doesn't mean the NFS mount does, and RPC layer tends to cache things with no obvious rhyme or reason, so it can still have stuff floating around after its cache after the NFS mount gets umount-ed.

The third patch fiddles with an object with a _fourth_ lifetime, the sunrpc auth_unix cache. When I wrote the patch, I convinced myself that the existing lifetime rules cover it, but now I can't reconstruct my reasoning so it's probably best to just make the cache take a reference. (You don't want to do too much of this because reduces SMP scalability, but this is a mount time action rather than an inode operation so it's not really a hot path.)

Among other things, the auth_unix cache maintains a cache of resolved IP addresses, and addresses only mean anything within a given network context. Any time the code has an IP address, it needs an associated network context. I'm starting to suspect that what I REALLY need to do is stick a struct net * in struct sockaddr, except those are just a data blob that gets memcopied all over the place, so having them maintain a counted reference would be a major rewrite... Perhaps some kind of new "struct sockaddr_net" aggregating them?

Anyway: the sunrpc_cache_lookup() code takes a key and a value (except it calls the value the key?), and the "lookup" function creates new instances and adds them to the table if they don't exist.

Yes, "lookup" is adding entries to the list.

The svcauth_unix.c file actually contains _two_ caches, the ip cache and the gid cache. (Which contains both uid and a struct gid *). The ip cache is the one with a sockaddr and thus needing a struct net *. The file svcauth_unix.c defines an instance of struct cache_detail (called unix_gid_cache, with .init initialized to unix_gid_init). It declares it extern, and then defines it later in the same file. (Again, I didn't write this.) But that file does _not_ define one for ip_map.

Instead, "struct cache_detail *ip_map_cache;" is in netns.h (inside struct sunrpc_net) which is initialized by ip_map_cache_create(). Yes, one is a structure statically initialized, the other is a dynamically allocated structure initialized in a function, because consistency is the enemy of NFS.

So .init is initialized to ip_map_init. The .update member is intialized to update (not ip_map_update: remember, consistency is not the NFS cod's job. There is an ip_map_update but it's not used here.) And cache_put is initialized to ip_map_put. (There's no "get" to go with the put so presumably it's a delete funciton rather than reference counting. There's no explanation why some members have a cache_ prefix and others don't.)

I really, really, really hate this code.

The update() function doesn't update the IP address, so it doesn't need to update the network namespace, so I can pretty much ignore it here. (I'll have to get back to what ip_map_update() and __ip_map_update() do once I find which code paths they get called from, and with what arguments.)

(Note: I want the get_net() and put_net() to match the lifetime of ip_map->m_addr, but I can't just grep for that because it's inline in struct ip_map and isn't explicitly freed. So I have to search for it using multiple mechanisms to find all the places I need to touch.)

So, looking at ip_map_put(), it does test_bit(CACHE_VALID) and CACHE_NEGATIVE on calls on the cache_head's flags before doing a put on m_client. I THINK this is just testing whether or not m_client has been set, not whether or not the struct ip_map has been initialized by ip_map_init(). But honestly, I'm guessing here. I have to go read yet another tangent...

Sigh. I honestly thought I understood this patch when I made it. This code has LAYERS of crazy painted over each other, none of which are documented, all of which fight.

It's quite possible that after two days of reading through strange tangents I'll come to the conclusion that my original patch was correct, and will then be able to explain WHY. Those are always the frustrating ones, where it looks like I didn't do anything but it took BUCKETS OF WORK to figure out the correct (small, simple) course of action.

dullboy