[Taken from Part 3 of my talk at the Seattle Node.js meetup on 7/24/2015. Code. Direct link to original screencast. My original Dangerous Cliffs of Node.js post.]
Asynchronous programming is difficult to wrap your mind around: threads, semaphores, and deadlocks, oh my! On one hand, Node.js makes this a whole lot easier: no locking or mid-execution interruptions. But its event loop is foreign territory. Let’s explore by poking it and prodding it - let’s try to break it!
The first way to break the event loop is to execute a task that takes too long. Take a look at this short script:
var fs = require('fs');
var doSyncWork = function() {
var start = new Date();
var now = new Date();
console.log('doSyncWork: start');
while (now.getTime() - start.getTime() < 1000) {
now = new Date();
}
console.log('doSyncWork: done');
};
var getFile = function() {
var start = new Date();
console.log('getFile: start');
fs.readFile('nonexistent', function() {
var now = new Date();
var delta = now.getTime() - start.getTime();
console.log('getFile: done,', delta + 'ms');
});
};
var previous;
var writeInterval = function() {
var delta;
var now = new Date();
if (previous) {
delta = now.getTime() - previous.getTime();
}
console.log('writeInterval:', delta ? delta + 'ms' : '-');
previous = now;
};
setInterval(writeInterval, 100);
setTimeout(getFile, 250);
setTimeout(function() {
getFile();
doSyncWork();
}, 500);
setTimeout(function() {
process.exit();
}, 2000);
We set up an interval to fire every 100ms, logging how long it has been since the last call. Then, after 250ms, we run getFile()
which goes to disk asynchronously and prints how long that took. Finally, at 500ms we call getFile()
again then immediately call doSyncWork()
, a very evil function which takes one whole second to do its synchronous work.
Let’s run it:
writeInterval: -
writeInterval: 141ms
getFile: start
getFile: done, 1ms
writeInterval: 104ms
writeInterval: 106ms
getFile: start
doSyncWork: start
doSyncWork: done
writeInterval: 1047ms
getFile: done, 1000ms
writeInterval: 102ms
writeInterval: 104ms
writeInterval: 104ms
writeInterval: 106ms
There are a few key things to note. First, that initial getFile()
call takes 1ms to go to the filesystem and return its result. Next, the first couple writeInterval()
calls are spaced out by a little bit over 100ms, as requested. These are both signs of a healthy event loop, always ready to take on a new task.
But it gets interesting around the doSyncWork()
call. First, note that no writeInterval()
calls happen during that synchronous work - they are skipped entirely! Second, the getFile()
call takes 1000ms to finish - it is started before the event loop is blocked, and finishes after the event loop is available once more.
It’s key to note that this also happens on a server. If one visitor to your Node.js server is able get a long task onto the event loop, all other requests will be delayed, just like getFile()
was. Unlike Ruby on Rails, with its separate thread per request, a Node.js server allows one operation to monopolize all resources in the process.
The other way to break the event loop is to put too many tasks onto it. Your server may be working through each task very quickly, but there are just too many! The next script is a little more complicated, designed to replicate a server under load:
var async = require('async');
var toobusy = require('toobusy-js');
var LAUNCH_DELAY = 10; // 100 requests/second
var SYNC_WORK = 8; // 8ms of synchronous work per task
var TASK_DELAY = 20;
var concurrent = 0;
var completed = 0;
var doSyncWork = function() {
var start = new Date();
var now = new Date();
while (now.getTime() - start.getTime() < SYNC_WORK) {
now = new Date();
}
};
var doTask = function() {
concurrent += 1;
setTimeout(function() {
concurrent -= 1;
completed += 1;
doSyncWork();
}, TASK_DELAY);
};
var writeStatus = function() {
console.log('lag:', toobusy.lag());
console.log(' concurrent:', concurrent);
console.log(' completed:', completed);
completed = 0;
};
setInterval(writeStatus, 250);
var previous;
var go = function() {
setTimeout(function() {
var now = new Date();
var count = 1;
if (previous) {
// replicating lots of events coming in while event loop blocked
var delta = now.getTime() - previous.getTime();
count = Math.floor(delta / LAUNCH_DELAY);
}
for (var i = 0; i < count; i+= 1) {
doTask();
}
previous = now;
go();
}, LAUNCH_DELAY);
};
go();
The two key variables to look at are LAUNCH_DELAY
and SYNC_WORK
. Doing some quick math, you can see that 100 requests per second, each taking 8ms of work, adds up to 800ms of work per 1000ms. The server will be able to handle that load, and here’s the confirmation:
lag: 0
concurrent: 1
completed: 16
lag: 4
concurrent: 3
completed: 25
lag: 4
concurrent: 2
completed: 21
lag: 3
concurrent: 2
completed: 20
lag: 2
concurrent: 3
completed: 19
lag: 1
concurrent: 2
completed: 19
^C
The server is keeping up. It’s getting work done, and the lag, the time it takes to get through the even loop, is quite low. Now, if we change LAUNCH_DELAY
to 5, doubling the load to 200 requests per second, the amount of work per 1000ms will now be 1600ms. Let’s watch the process struggle:
lag: 0
concurrent: 9
completed: 35
lag: 64
concurrent: 22
completed: 75
lag: 156
concurrent: 81
completed: 108
lag: 247
concurrent: 129
completed: 137
lag: 165
concurrent: 206
completed: 218
lag: 524
concurrent: 329
completed: 348
lag: 1111
concurrent: 526
completed: 556
^C
That is not a healthy event loop. Every iteration is getting slower, and the process never gets a chance to catch its breath under the constant influx of new tasks. This is what will happen to a Node.js server under heavy load. That getFile()
call from the first script will take over a second on this server, because that’s how long it takes to get through the event loop. Once more, interaction between independent user requests can happen in a single Node.js process.
toobusy-js
to statsd
/graphite
on regular intervals. With this information in hand, you can add capacity before things start to fall apart.toobusy-js
once more here to build middleware that rejects new requests when things aren’t looking good. From there, you can set your load-balancer to try another node if it gets this kind of ‘too busy’ response.Breaking something is important for truly understanding it. As Neil deGrasse Tyson said:
“…because children are born scientists, they are always turning over rocks and plucking petals off flowers, they are always doing stuff that by and large is destructive. That’s what exploration really is when you think about it. An adult scientist is a kid who never grew up. That’s what an adult scientist does.”
Software is the perfect playground - it takes some real effort to produce impactful negative consequences from experimentation. No need to replace broken eggs, or clean up the mess. No pain or injury. Just revert the changes to source code or restore data from backup.
So, break that event loop! Know what happens when you forget a callback or call it twice! Test your apps under load! Experiment!
You’ll get better at both designing systems and debugging them.
A good workplace is a welcoming space for everyone, encourages open collaboration, and enables everyone to do their best work. Let’s talk about some of the techniques for collaboration I’ve learned... Read more »
I’ve spoken before about my training and development contracts, but my second stint at the Social Security Administration (SSA) was an interesting mashup. I worked with the code itself as well as... Read more »