Production Practices

Node.js debugging best practices

Debugging Performance Issues

Performance problems are best resolved with a methodical approach. Except in cases of clear regressions and other simple situations, resolving poor performance starts as a system-wide debugging affair. It’s important to be able to narrow scope quickly and definitively.

So while strictly not a specific Node.js practice, the Utilization, Saturation, Error (USE) Method is recommended. The USE Method is a simple strategy you can use to perform a complete a check of system health, identifying common bottlenecks and errors. It can be deployed early in the investigation and quickly identify problem areas, which then can be studied in more detail other methodologies, if need be. The strength of USE is its speed and visibility: by considering all resources, you are unlikely to overlook any issues. It is developed by Joyent Performance Engineer Brendan Gregg.

Performance Methods and Checklists

The checklists below break down common systems into components (CPU, Memory, Network, Storage) and specify what tools are available for, say system-wide CPU utilization as a command-line tool.

  • USE Method Utilization, Saturation, Error
  • SmartOS
    • Comes with DTrace by default
    • mpstat, kstat, sm-cpuinfo, zonememstat, dladm, etc
  • Mac OS X
    • Comes with DTrace by default
    • iostat, top, uptime, vm_stat, netstat, iostat, iosnoop, etc
  • Linux
    • vmstat, dstat, mpstat, sar, iostat, etc
  • System Performance, the book, for the deepest debugging.

Debug a Running Process (DTrace)

DTrace is a dynamic tracing framework with the ability to trace from the kernel all the way through your application’s code. It’s extremely powerful, but most importantly, safe to use in production.

You instrument your application by enabling predefined probes in the kernel (for instance for syscalls like read() and write()), with probes you defined at compile time for your application, or with probes defined at runtime for your application. This visibility into your applciations performance is a crucial tool to effectively practice the USE method.

Node comes with a series of built-in probes to instrument common events throughout your application.

  • gc-start, gc-done
  • net-server-connection, net-stream-end, net-socket-read, net-socket-write
  • http-client-request, http-client-response
  • http-server-request, http-server-respones

By instrumenting these probes you can gain insight in the patterns of your application by observing the latencies of the common tasks you perform. For an example of using these static probes look at nhttpsnoop, which allows you to see accounting of HTTP requests and responses coming into and out of your application.

By using the dtrace-provider module you are able to define DTrace probes at runtime for your Node application. These probes come with little to no overhead for your application, and allow you to customize very quickly what information you will want to observe about your application while it’s running. See bunyan and restify for examples of how these runtime probes can be leveraged.

You can also use other probes built into the system for more sophisticated analysis. For example, you can trace malloc or operator new and get a count of allocations by stacktrace.

Cloud Analytics

Cloud Analytics is a tool for real-time performance analysis of production systems and applications deployed in the Triton Cloud. Built on DTrace, Cloud Analytics provides extraordinary visualizations into how your application is performing, as well as being able to correlate that information with your other related services.

Using Cloud Analytics you’re able to observe latencies throughout many processes running in your environments, from filesystem and network I/O through the static probes defined for Node. You’re able to visualize your performance characteristics through traditional graph formats, as well as powerful heatmaps which helps identify outliers instead of only viewing the average latency of your application.

Read More

Profiling your application

Being able to determine where your application is spending its time is critical to successful operation. Using DTrace, its profile probes, and the ustack helper you can get that information.

# dtrace -n 'profile-97/execname == "node" && arg1/{
    @[jstack(150, 8000)] = count(); } tick-60s { exit(0); }' > stacks.out

Will sample your process 97 times a second (you can go as high as 4999 times a second), aggregate the stack traces for all the times your process was sampled, and store that result in a file.

Once you’ve profiled your application with DTrace, you can turn that file into a flame graph using stackvis.

$ npm install -g stackvis
$ stackvis dtrace flamegraph-svg < stacks.out > stacks.svg

Read more about profiling Node.js

Profiling your add-on modules

Your application may use binary add-on modules for connecting to things like databases, external resources, or integration with existing business logic. Under those circumstances you may have need to quantify the latencies for execution of those interactions. Using just a little bit of DTrace know-how you can see the distribution of latencies for arbitrary methods that queue work in the Node event queue.

Read more about tracing latency in add-ons.

Stuck Process

DTrace can be used for more than just data collection, including debugging non-fatal buggy program behavior. DTrace can instrument very specific events on the system (like “the close syscall in a Node process returned -1 with EBADF” or any application-level event), and you can use that to take action like stopping the process to take a core dump and then resuming it again. Putting these pieces together makes it possible to gather all kinds of debugging information from a program that’s behaving badly.

Read more about this technique.

Postmortem

The technique of determining the root cause of a programming error after a process has died is known as postmortem debugging. This is an important technique to practice regularly, as being able to quickly and accurately diagnose the cause and not the symptom of a failure will mean fewer iterations of these procedures. Fewer iterations of debugging will mean faster development cycles and thus having confidence in your application logic sooner.

To ability to successfully diagnose the root cause of a software problem after a crash depends on the amount and quality of information you have at your disposal. Often quality of information will trump quantity. Depending on the design of your application you may want to be able to garner the state of as many inter-related processes at the time of the crash. Then you’ll need to be able to correlate that information into something actionable.

Aside from quality logs, it’s important to be able to capture as much of the state of the process that died as possible. A stack trace is a good start, but is probably not enough information to be able to act quickly to solve a problem. So along with the stack trace of where the process was executing, you want a dump of all the memory that was wired at the time of the process’s death. This is functionality is often provided by the operating system, and is referred to as a core file.

Not all environments enable core file generation by default, so make sure that when deploying you’ve enabled those features, a simple ulimit -c unlimited in the shell where the process will be running can be sufficient. On SmartOS-based environments you are also able to generate a core file even if the process hasn’t died, you can do this with gcore.

Postmortem Debugging with mdb

mdb is the modular debugger, it can be used to debug running processes or with core files. Core files can be generated in one of two ways, your process abort()d, or you used gcore to generate a core file for a running process.

Generating core files for a running process is a very powerful technique, doing so allows you to inspect your applications state before something has gone wrong. Also, if you’re trying to debug a memory leak having multiple core files allows you to track the changes in the counts and kinds of memory being kept around.

Your Node process may abort() for a variety of reasons, from a programming error inside Node itself, or by programmatically invoking abort directly to ensure a core is generated. There are two ways you can programmatically induce your Node process to abort, the first is process.abort which you can call at anytime in your code. The other is to run your Node process with --abort-on-uncaught-exception which tells V8 that if your process encounters an uncaught exception it should generate a core file.

mdb has a very powerful module designed specifically for Node, which can be loaded with ::load v8. Loading this module allows you to see what JavaScript objects are found in V8’s heap with ::findjsobjects, see the stack trace from the entry point through JavaScript frames with ::jsstack, as well as a variety of other commands to inspect the JavaScript state of your process.

Read more about postmortem debugging.

::findjsobjects

The ::findjsobjects command in mdb is a crucial utility for determining the current state of active objects found in the V8 heap. Running without any parameters will give you a list of representative object addresses, their counts, the number of properties, the constructor name, and the shape (the enumerated order of the properties) for the objects.

af821729        1      997 Object: sxw, mxu, xbd, smf, gdl, cxt, potx, ...
9c31d5d5     1192        1 Array
9c31e6c9      265        7 Module: parent, paths, loaded, exports, ...
9c31bebd      268        7 Array
a470b0ad     1603        2 Arguments: callee, length
9c31bd79      184       40 Array

While that in and of itself is helpful for tracking down memory leaks, you can also iterate over those objects and find who is keeping a reference to them.

> 9c31e6c9::findjsobjects | ::findjsobjects -r
9c31e6c9 referred to by a6e8db0d./opt/marlin/node_modules/restify/lib/request.js
9c31e6c9 referred to by 9c3d8871[1]
b9f45325 referred to by a6e8db0d./opt/marlin/node_modules/manta/node_modules/restify/node_modules/bunyan/lib/bunyan.js
b9f45325 referred to by a6ee4431[0]
b9f45325 referred to by a6efd025.parent
af8fd5a5 referred to by a6e8db0d./opt/marlin/node_modules/restify/lib/response.js
af8fd5a5 referred to by 9c3d8871[2]

Read more about debugging memory leaks.

::jsstack

While operating in mdb on a core file from Node, you can use the ::jsstack command to determine from the entry point through JavaScript frames where the process was when the core file was created.

> ::jsstack
8047358 node::DTRACE_HTTP_CLIENT_REQUEST+0x447
8047408 _ZN2v88internalL21Builtin_HandleApiCallENS0_12_GLOBAL__N_116BuiltinArgumentsILNS0_21BuiltinExtraArgumentsE1EEEPNS0_7IsolateE+0x157
8047424 0x9f90a336 internal (Code: 9f90a2c1)
8047444 0xb702c976 <anonymous> (as OutgoingMessage._finish) (a6e38fbd)
8047484 0xb702c80b <anonymous> (as OutgoingMessage.end) (a6e38f99)
80474a4 0x9f90db41 <ArgumentsAdaptorFrame>
80474bc 0xbe6e72d6 <anonymous> (as StringClient.write.options.headers) (a4781f81)
...
8047b18 node::MakeCallback+0x11e
8047b88 node::MakeCallback+0x69
8047c18 node::PipeWrap::AfterConnect+0x158
8047c78 uv__stream_io+0xc9
8047ca8 ev_invoke_pending+0x63
8047ce8 uv__run+0x94
8047cf8 uv_run+0x17
8047d58 node::Start+0x1c7
8047d78 main+0x1b
8047d94 _start+0x83

That can be quite powerful information, but used in combination with ::jsframe you are also able to inspect the arguments that were passed to the frame as well.

> 80474bc::jsframe -iv
80474bc 0 <anonymous> (as StringClient.write.options.headers) (a4781f81)
    file: /opt/marlin/node_modules/restify/lib/clients/string_client.js
    posn: position 2553
    arg1: b9f08081 (Oddball: "null")
    arg2: a47827f9 (JSObject)

And then with that information you’re able to inspect the arguments directly with ::jsprint

> a47827f9::jsprint
{
    _hasBody: true,
    _headerSent: true,
    _header: "POST /my/jobs/task/live HTTP/1.1
...

::findleaks

In addition to encountering memory leaks from closures in JavaScript you may find you have to tackle traditional memory leaks of malloc()s without corresponding free()s. To diagnose these sorts of issues run your process with UMEM_DEBUG=default set in the environment. This environment variable will (among other things) begin storing audit information about allocations happening in your process, that information includes the stack trace for the allocation.

When you have a core that was run with UMEM_DEBUG set, you can then in mdb use the command ::findleaks which will give you a list of any leaked memory aggregated by their callstacks.

> ::findleaks
...
CACHE LEAKED BUFCTL CALLER
08072c90 1 0807dd08 buf_create+0x12
08072c90 1 0807dca0 func_leak+0x12
08072c90 1 0807dbd0 main+0x12

Once you have a list of addresses that were leaked, you can use the BUFCTL address to see the callstack by using the ::bufctl_audit command.

Read more about ::findleaks.

Logging

Having logs from your application during the time of the failure is a tremendous help, but it’s important to be judicious with your logging. It isn’t normally feasible to run in production with DEBUG or TRACE levels of logging being enabled. Even if you could run in production with those levels, you would have to sift through a lot of noise to be able to find the pertinent information.

bunyan

Bunyan is a straightforward logging library for Node.js applications. Bunyan’s output is line delimited JSON, which makes it easy to consume with your normal unix command line utilities like grep and sed, and also with its own CLI utility, or with the json cli utility.

Bunyan also comes with built-in support for DTrace, which allows you to keep your existing log level for your existing destinations (e.g. INFO) but enable at runtime a more verbose level (e.g. TRACE) and be notified in user space of those log entries, without them also going to your existing destinations and potentially filling the disks. DTrace is completely safe to use at runtime, so if enabling higher levels of logging would result in detrimental effect your system DTrace will exit before doing your system harm.

That is, you’ve already defined in your configuration the level at which you want your application to log, but your process is currently misbheaving and you’re looking to find out more information without potentially restarting the service or increasing the amount of storage where your logs are stored. With bunyan and DTrace you can enable your process to send to you the log level you’re interested in at runtime.

# bunyan -p *
[2013-09-27T18:16:19.679Z] DEBUG: test/1076 on sharptooth.local: something went wrong

Read more about using bunyan to do runtime log snooping.