Optimizing Coastline

Nov 22, 2014 11:45

As you may or may not know, I've written a weird JavaScript library called Coastline. It started as an attempt to not go crazy with all the JavaScript callbacks and promises, and slowly evolved into something you might consider a language of its own. The basic message is: wrap every line of your code in some Coastline construct and never worry about callbacks or race conditions again. Sounds good, right? And with the right macros it's not that much work either.

As you may or may not know, it's also slow as fuck. I knew from the very beginning it couldn't be too fast with its "an object per operation" model, but I never bothered to check how slow it was. Then today I bothered.

c.each
10695 ops/sec, 935ms

c.loop
11325 ops/sec, 883ms

recursive
25445 ops/sec, 393ms

all in
37594 ops/sec, 266ms

Holy crap is that slow! And that's a Xeon CPU with tons of free RAM, not some old garbage. Ok, I quickly remembered my own tip to disable stack traces. I'm sure it's gonna help a lot.
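The tip itself boils down to one line. A sketch, assuming the coastline.trace flag you'll see guarding new Error() in the constructor later in this post:

coastline.trace = false; // no Error object per context = no stack capture cost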

c.each
25000 ops/sec, 400ms

c.loop
27701 ops/sec, 361ms

recursive
66667 ops/sec, 150ms

all in
84034 ops/sec, 119ms

Well, that's much better! Just joking. It only got, what, twice as fast when I disabled the only slow feature I knew of?! What's eating the other half? Surely I must be forgetting something. However, after scanning the code and checking - just in case - that V8 is still perfectly capable of executing a billion basic ops a second (just five orders of magnitude more than it manages with my awesome library), I figured it wouldn't be that easy after all.
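The sanity check was something like this - not the exact snippet, but the same kind of tight loop of basic ops:

var start = Date.now();
var x = 0;
for (var i = 0; i < 1e9; i++) {
    x += i; // one addition per iteration
}
console.log(x, (Date.now() - start) + 'ms'); // print x so the loop can't be optimized away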

I got even more puzzled after I ran the Firefox profiler on the test code and all it showed was CPU time evenly distributed across all the functions. So there is no bottleneck. My code is just slow, all of it. Sucks to be me. Or maybe the code is okay, it just doesn't get optimized properly? So I ran node with the --noopt flag and observed that the timings didn't change a bit. Actually, on average it was even a bit faster. So not only is it not optimized properly, I've managed to trick the compiler into optimizing it improperly.

...Do you know what you get when you google "javascript optimization"? Mostly so-called experts talking about reducing page load times. Why would anyone care about actual code performance in 2014, right? But ok, adding "V8" to that query seems to surface something relevant, so let's go there. I guess SpiderMonkey and whatever else there is will have to wait for now.

Here's the first useful link I found: https://github.com/petkaantonov/bluebird/wiki/Optimization-killers. I found myself guilty of two things listed there. First, I was doing this:

push: function (task) {
    if (typeof task != 'object' || arguments.length != 1) {
        task = fix(arguments);
    }

Not only was I leaking the whole arguments object to a separate function just to make it a proper array, I was then assigning the result to one of the arguments! Oops. So I replaced it with this stupid-looking code:

push: function (task_arg) {
    var task;
    if (typeof task_arg != 'object' || arguments.length != 1) {
        task = new Array(arguments.length);
        for (var i = 0; i < task.length; ++i) {
            task[i] = arguments[i];
        }
    }
    else {
        task = task_arg;
    }

And indeed, the tests got a little bit faster. There are a few more places where I'm doing a similar thing, but they're not involved in the test (or so I think) so I'll fix them later. Second, I was accepting an options object in the coastline.context constructor and applying its properties with the extend() I copied from Underscore. Who knew for-in over non-arrays was a no-no? So instead of an elegant loop we now have this:

coastline.context = function (options) {

    var id = coastline.nextTaskId++;

    if (coastline.trace) {
        this.errorObject = (new Error());
    }

    this.id = id;
    this.queue = [];
    this.flags = options.flags || {};

    if (options.caller) {
        this.caller = options.caller;
        this.top = this.caller.top;
        this.data = this.caller.data;
    }
    else {
        this.top = this;
        this.data = {};
    }

    this.parent = options.parent || null;

    this.bind = options.bind || null;
    this.object = options.object || null;
    this.method = options.method || null;
    this.arguments = options.arguments || null;
    this.addObject = options.addObject || null;

};
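For contrast, the elegant version was roughly this (a sketch from memory - extend() is the usual Underscore-style copy loop):

function extend(obj, source) {
    for (var key in source) { // for-in over a plain object: the no-no
        obj[key] = source[key];
    }
    return obj;
}

coastline.context = function (options) {
    extend(this, options);
    // ...
};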

A pattern is emerging: the more idiotic the code looks, the faster it works. Oh well. How are we doing?

c.each
58480 ops/sec, 171ms

c.loop
26954 ops/sec, 371ms

recursive
129870 ops/sec, 77ms

all in
142857 ops/sec, 70ms

Well, at least we broke the 100k barrier. On a Xeon, anyway.

Now, let's figure out why the numbers are so different. c.each and c.loop are using extra contexts internally, so them being slower confirms it's all about the contexts. But why is c.loop so much slower than c.each, and why is the recursive solution noticeably slower than the "all in" one? In both of the faster cases the code allocates 10k context objects before executing them one by one, as opposed to appending them to the queue one at a time. So how come, with all those allocations and all that large-array shifting, it still works faster? My only guess is it's somehow optimized better.

But still, I'm not satisfied with any of these numbers, so let's go searching for more "optimization killers".

http://www.html5rocks.com/en/tutorials/speed/v8/

Initialize all object members in constructor functions (so the instances don't change type later)
Always initialize object members in the same order
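Here's the kind of thing those two tips are about - a minimal sketch, not Coastline code:

function Ctx() {
    this.id = 0; // every instance starts out with the same hidden class
}

var a = new Ctx();
var b = new Ctx();
b.caller = null; // b transitions to a new hidden class; code that touches
                 // both a and b now deals with two object shapes and its
                 // inline caches go polymorphic, i.e. slower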

I certainly add a lot of properties to context as I go. For some reason I thought it would consume less memory to initialize them as needed, and I never thought about how the compiler would deal with it. Let's try writing more stupid code to preset all of them. At this point I'm simultaneously writing this post and optimizing, which is fun - means I get to write about things that failed. This seems to be one of those things, at first glance. For some reason the benchmark now shows wildly different results on every run, and while I'm seeing numbers above 150k every once in a while, I'm not sure it's gotten better on average. Let's add one more zero to the number of operations and try again.

c.each
41237 ops/sec, 2425ms

c.loop
31182 ops/sec, 3207ms

recursive
69444 ops/sec, 1440ms

all in
101729 ops/sec, 983ms

vs.

c.each
41719 ops/sec, 2397ms

c.loop
33311 ops/sec, 3002ms

recursive
54675 ops/sec, 1829ms

all in
88183 ops/sec, 1134ms

Nope, it's actually slower now. It's slower in both runs with 100k ops - again, not sure why. GC? But it's certainly slower when all properties are preset. Let's comment this stupid code out for now; who knows, maybe it'll show improvements later.

Don't delete elements in arrays, especially numeric arrays

I do a lot of array shifting in Coastline. I know it's slow, but could it be that slow? Let's try avoiding it and see. I replaced a while-shift loop on the queue with a for, slicing the array every 10k elements (so that processed contexts and the objects they point to can be GC'd even if the loop never stops) - roughly the change sketched below. While the optimization itself didn't seem to do much good, what I accidentally noticed was that the "recursive" solution produced 2x more slices, meaning 2x more contexts were being created. So "all in" wasn't actually faster, it was me being an idiot somewhere.
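A sketch of that change, not the exact Coastline code (_process is the method name from the trace output further down):

// before: destructive shift() on every iteration
while (queue.length) {
    queue.shift()._process(); // shift() re-indexes the whole array each time
}

// after: walk by index, cutting off the processed head every 10k items
// so finished contexts can still be collected on a long-running loop
for (var i = 0; i < queue.length; i++) {
    queue[i]._process();
    if (i === 9999) {
        queue = queue.slice(i + 1);
        i = -1; // the for's i++ restarts us at 0 on the sliced array
    }
}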

What it turned out to be, however, was the parent task being added back to the queue after the child task finished. At this point I realized I had totally misunderstood how my own code worked and had been optimizing the wrong array. Still, I figured I'd remove that unneeded addition - and hit the maximum call stack size (avoiding which is the reason processQueue exists in the first place). I guess it stays for now.

So, since the "recursive" solution turned out to actually be recursive, I decided to modify it a bit.

recursive
90744 ops/sec, 1102ms

recursive2
94877 ops/sec, 1054ms

all in
107527 ops/sec, 930ms

At this point the benchmark started deviating more and more again (because of that slicing, I'd guess). If I were not a moron in a hurry, I'd write a proper benchmark suite that would run for a few minutes before spewing out more or less consistent results, but a moron in a hurry I am, so we'll keep speculating.

Overall we can see that "recursive2" (which adds the next task to the caller of the current task, keeping the nesting to 2 levels) is slightly better than "recursive", but still worse than "all in". But at least I know where to dig now. Ok, "shift" and "unshift" are still found in 10 places around the code, but fixing most of that would require me to be less sleepy than I am now.

Let's go and read some actual info at last.

$ node --trace-opt examples/benchmark.js | grep disabled
[disabled optimization for , reason: eval]
[disabled optimization for IN, reason: call to a JavaScript runtime function]
[disabled optimization for ToString, reason: call to a JavaScript runtime function]
[disabled optimization for INSTANCE_OF, reason: call to a JavaScript runtime function]
[disabled optimization for coastline.context._next, reason: reference to a variable which requires dynamic lookup]
[disabled optimization for coastline.context.done, reason: reference to a variable which requires dynamic lookup]
[disabled optimization for DELETE, reason: call to a JavaScript runtime function]
[disabled optimization for ToObject, reason: call to a JavaScript runtime function]
[disabled optimization for coastline.context, reason: reference to a variable which requires dynamic lookup]
[disabled optimization for APPLY_PREPARE, reason: call to a JavaScript runtime function]
[disabled optimization for coastline.context._finish, reason: reference to a variable which requires dynamic lookup]
[disabled optimization for coastline.context.push, reason: reference to a variable which requires 45269 ops/sec, 2209ms
[disabled optimization for coastline.context._process, reason: reference to a variable which requires dynamic lookup]
[disabled optimization for ToNumber, reason: call to a JavaScript runtime function]
[disabled optimization for ToPrimitive, reason: call to a JavaScript runtime function]
[disabled optimization for toString, reason: inlined runtime function: ClassOf]
[disabled optimization for DefaultNumber, reason: call to a JavaScript runtime function]
[disabled optimization for EQUALS, reason: call to a JavaScript runtime function]
[disabled optimization for coastline.context._each, reason: reference to a variable which requires dynamic lookup]

Oh. Reference to a variable which requires dynamic lookup. The hell does that mean? And what the hell about "45269 ops/sec"? That second one can't even be found in Google. I started by looking at the coastline.context constructor, and it turned out the optimization was disabled by this line:

this.errorObject = (new Error());

So I put Error in a variable and used that variable everywhere. This reduced the number of "disabled" messages - only _next, push and _each were left - and increased performance a bit. I did the same for the rest of the globals, and all the reference-to-a-variable messages were gone. And it got a little bit faster. I guess. I really do guess, because by now the benchmark seems to be showing random numbers.
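The fix, sketched (the variable name is mine):

var NativeError = Error; // resolved once, at module scope

// inside the coastline.context constructor:
if (coastline.trace) {
    this.errorObject = new NativeError(); // no lookup of a global on every context
}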

But the numbers do seem to be slowly increasing, yes. When I go back to the 10k test, "all in" tops out at 188k ops/sec, which is certainly more than what we started with, but much less than what would satisfy me.

I'm certainly not done, but now it's time to sleep.

https://bitbucket.org/verypositive/coastline/commits/6cc6954bd1728fee161f41d600d1d16ce17b87ef

I promised them, I'm a sucker, I don't eat mushrooms anymore, work work, wtf, coastline
