This is a classic case of the "GCC is broken" phenomenon. Only this time it was with Thrift.
We have an application built on the LAMP stack, except that the A[pache] there is T[hrift] and the P stands for Python, not PHP. The weird thing about this application was that, as time passed, the CPU usage kept growing and growing while the application's responsiveness kept falling and falling. The really weird thing was that we didn't see any change in memory usage.
While looking for the problem, the first thing I did was set up profiling for the application. The first profile runs didn't give many clues. The longest-running functions were the system calls to socket.accept(), Thread.join(), _mysql.commit(), etc. All network-related stuff. I wanted to get a call stack dump, so I started going through the documentation for inspect and found
this note. Yes, I wasn't explicitly deleting frames. I patched a server with this fix and, yes, the CPU profiles suddenly looked so much better. That was two Fridays ago. The weekend was good.
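For reference, the pattern the inspect documentation recommends looks roughly like this (the helper name and details here are made up for illustration, not my actual patch; the point is the explicit del of the frame reference in a finally clause):

import inspect

def calling_function_name():
    # Grab the caller's frame to inspect it (e.g. for logging).
    frame = inspect.currentframe().f_back
    try:
        return frame.f_code.co_name
    finally:
        # Frame objects participate in reference cycles, so delete the
        # reference explicitly instead of waiting for the garbage collector.
        del frame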
Next Monday, I saw a mail saying that the issue had, in fact, not been fixed. The CPU usage still kept increasing, albeit at a slower rate. The ops team had put in a cron job to restart the Thrift server every 3 hours. Whoa. Begin two weeks of utter torture, compounded many times over by the fact that we didn't really have much time to fix it, since we were losing incalculable revenue every day.
For the first two days, we were still thinking it was an issue with call stack inspection and tried removing it altogether. Unfortunately, that didn't work either. It was time to get serious. On Wednesday, I wrote
thriftbench so that I could do some benchmarking like Real Engineers™.
I had always been skeptical about Thrift's behaviour of keeping a single instance of the application running all the time in a single process. Apache spawns multiple children, each of which dies after serving a certain number of requests, and it refreshes the entire interpreter runtime for PHP on every request. It's a marvelous architecture, built to insulate developers from stupid little issues like the one we're about to uncover. So as soon as I got hold of thriftbench, I also set up a parallel http_load benchmark against the JSON-HTTP version of our application (yeah, we have both right now, but not for long, hopefully).
The difference in results was spectacular. HTTP won on every count -- response times, CPU usage, throughput, network I/O. All except the last were tied to the fact that Thrift was running leaky application logic. The last one is noteworthy: JSON beat ThriftBinary in serialised data size by ~10%. Take note and underline it with a red pen. JSON serialisation ROCKS!
Fortunately or unfortunately, switching to JSON-HTTP instead of Thrift was not an acceptable solution. So we trudged on. First we ascertained that it wasn't our usage of funky Python decorators. Then we ascertained it wasn't call stack inspection, really. We also ascertained it wasn't logging, which was shared by all threads in the server. We looked at what Thrift does differently from Apache, and one difference that stood out was that while Thrift kept a long-running process, Apache renewed the application state on every request. I thought I should try creating a new instance of the application on every request in Thrift; the application instance doesn't contain any expensive resources, just a bunch of config parameters. To my dismay, I discovered that I was already doing that, and it still failed. I was creating a new instance of the application on every request... just like HTTP, I thought.
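For context, the per-request pattern in question looks roughly like this (the class names are made up for illustration; the real handler obviously does more):

class Application(object):
    # Stands in for our real application; it just holds config parameters.
    def __init__(self, config):
        self.config = config

    def get_items(self, user_id):
        return ['item-%d' % user_id]


class ApplicationHandler(object):
    # Thrift handler whose methods build a fresh Application on every call,
    # mimicking what Apache's request-scoped lifecycle gives PHP for free.
    def __init__(self, config):
        self.config = config

    def get_items(self, user_id):
        app = Application(self.config)  # new instance per request
        return app.get_items(user_id)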
More painstaking days of debugging followed, and I ascertained that absolutely nothing was wrong with our interaction with MySQL. That was necessary since the only difference between our application and all the other Thrift applications in the company was that they all use PostgreSQL with a persistent connection (or a connection pool), while we renewed the connection on each request, as is recommended for MySQL and usually done in LAMP.
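That per-request connection style is the usual open-query-commit-close dance, something like this (the connection parameters are placeholders, not our real config):

import MySQLdb

def run_query(config, query, params=None):
    # Open a fresh connection for this request only.
    conn = MySQLdb.connect(host=config['host'], user=config['user'],
                           passwd=config['passwd'], db=config['db'])
    try:
        cursor = conn.cursor()
        cursor.execute(query, params)
        rows = cursor.fetchall()
        conn.commit()
        return rows
    finally:
        # And close it as soon as the request is done.
        conn.close()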
The real desperation set in this Wednesday, when I ascertained that nothing except our own application logic misbehaved with Thrift. There was so much stuff in the application that might need changing, and changing the application also meant introducing major risk to functional stability. Yesterday, I took Python's __getattr__ and __call__ to the limits by quickly generating JSON-HTTP wrappers for all of the application API, plus another wrapper that made the Thrift server go to the JSON-HTTP service over localhost. You have to have fall-backs, you see. We'd have been faster with Thrift-over-HTTP than with Thrift alone and a broken application. Earlier in the week, I had also worked on enhancing a WSGI application that can expose any Python module or class as a JSON-HTTP service, which I'll do my best to open-source.
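The wrapper trick is basically a dynamic proxy: __getattr__ hands out an object whose __call__ forwards the method name and arguments over JSON-HTTP. A minimal sketch of the idea, with a made-up endpoint and payload layout (the real wrappers follow our service's own conventions):

import json
import urllib2

class _RemoteMethod(object):
    # Callable stand-in for a single remote method.
    def __init__(self, url):
        self.url = url

    def __call__(self, *args, **kwargs):
        # POST the arguments as JSON and decode the JSON response.
        payload = json.dumps({'args': args, 'kwargs': kwargs})
        request = urllib2.Request(self.url, payload,
                                  {'Content-Type': 'application/json'})
        return json.load(urllib2.urlopen(request))

class JsonHttpProxy(object):
    # Forwards any attribute access as a call to the JSON-HTTP service.
    def __init__(self, base_url):
        self.base_url = base_url

    def __getattr__(self, method_name):
        return _RemoteMethod(self.base_url + '/' + method_name)

# The Thrift handler can then delegate every API method to the JSON-HTTP
# service on localhost, e.g.:
#   api = JsonHttpProxy('http://localhost:8080/api')
#   api.get_items(42)   # becomes POST /api/get_items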
Today I took just the Python application, stripped away from Thrift, mocked out the DB to return a hard-coded result set, and benchmarked it. The CPU leak was there, as expected. I ran the profiler on it again, and this time the culprit showed up.
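(The mock was nothing fancy -- just something shaped enough like a DB-API cursor for the application code; the names and rows below are made up:)

class FakeCursor(object):
    # Looks enough like a MySQLdb cursor for the application code.
    def execute(self, query, params=None):
        pass

    def fetchall(self):
        # Always hand back the same hard-coded result set.
        return [(1, 'foo'), (2, 'bar')]

class FakeConnection(object):
    def cursor(self):
        return FakeCursor()

    def commit(self):
        pass

    def close(self):
        pass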
Essentially, we were increasing the size of a class-static list with every object we were creating (i.e. once per request) and then doing expensive traversals over it. The reasons why it wasn't easily caught were:
1. The object instances always got the right copy of the list, since we removed the duplicates being added to it, while the copy in the class definition kept growing.
2. The increase in memory per request was minor compared to the overall memory footprint of the process, but the iterations over the list were expensive (constructing a set out of its elements).
The fix is as simple as making the list an instance attribute instead of a class attribute. The issue can be reproduced and understood with this simple script:
class BaseClass:
    # Shared, class-level list -- the "class-static" list described above.
    class_attr = ['a', 'b', 'c']

    def update_class_attr(self):
        # The first access resolves to the class attribute, so this
        # extends the shared list...
        self.class_attr.extend(['d', 'e', 'f'])
        # ...and this assignment then creates an instance attribute
        # that shadows it with a clean copy.
        self.class_attr = ['a', 'b', 'c']


class DerivedClass(BaseClass):
    def __init__(self):
        BaseClass.update_class_attr(self)


def main():
    obj = None
    for i in range(1, 10):
        obj = DerivedClass()
        print obj.class_attr          # always the clean instance copy
        print BaseClass.class_attr    # keeps growing on every iteration
        print DerivedClass.class_attr


if __name__ == "__main__":
    main()
Go ahead. Try it out. See if you can explain what's going on here. The irony of the situation is that what I thought would fix the issue turned out to create the issue in the first place! Had we not been creating a new instance of the application on every request, we'd never have seen it in production, ever!