Debugging Django Management Commands with strace


This past week I ran into an issue where a Django management command running on a cron began to hang. I started rolling back changes to the environment: downgrading Python versions, downgrading PostgreSQL versions, and even rebuilding the application in another region on DigitalOcean. None of it worked. My initial hypothesis was that there was a bug in the management command, or that we had a noisy neighbor on our VM instance. Eventually, I was able to rule both of these out. The management command worked perfectly on my Ubuntu laptop, but as soon as I pushed it to our staging environment it began to hang while running on the cron. After spending days digging through the code and the environment, I decided to use strace to figure out what was going on.

Debugging with strace

If you're not familiar with strace, it's a Linux utility that lets you trace the system calls a process makes and the signals it receives. There are a ton of great articles about how it works; a quick Google search will turn up a book's worth of documentation. A great place to start is here.
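
As a quick illustration, here's how you might trace only the openat calls a command makes (the target command here is arbitrary, just something to trace):

# trace only openat system calls made by the command
strace -e trace=openat cat /etc/hostname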

The good news was that the problem was easily reproducible. The cron job was set up to run every minute; it would run cleanly for a few minutes and then hang for a few minutes. Once it ran again, every cron run that was "missed" while the job was hanging would then fire. You can see the behavior pretty clearly in our logging system below.
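
For reference, the crontab entry for a job like this would look something like the following (the paths are illustrative, not the exact ones from our environment):

* * * * * /path/to/venv/bin/python /path/to/project/manage.py scheduler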

[Graylog graph showing the cron runs, the gap while the job hung, and the burst of missed runs afterwards]

Diagnosing the problem with strace

Once I had ruled out a bug in my code and a noisy neighbor, I turned to strace. The first thing I did was run strace against the PID while the cron job was hanging. Since the problem was easy to reproduce and the job would hang for minutes at a time, I could easily catch it in the act. I ran the following command: sudo strace -p 3695 -s 100000. We're using two flags here, -p and -s. The -p flag specifies the PID to attach to, and the -s flag sets the maximum string size shown in the output, which defaults to 32 characters. I wanted to make sure enough output was displayed on the terminal, so I set it to 100000.
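
If you need to find the PID first, pgrep will show it along with the full command line (the pattern here assumes the job is the scheduler command):

# list PID and full command line of the running job
pgrep -af "manage.py scheduler"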

Since the process was hanging, it was no surprise that strace showed the process blocked in the wait4 system call:

strace: Process 3695 attached
wait4(-1,

This at least verified part of my hypothesis: the operating system was waiting for something, I just wasn't sure what. The -1 argument to wait4 means the process is blocked waiting for any child process to exit. Since the PID was waiting, I let the process finish, and the following was produced:

strace: Process 3695 attached
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3697
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3697, si_uid=1000, si_status=0, si_utime=268, si_stime=29} ---
rt_sigreturn({mask=[]})                 = 3697
exit_group(0)                           = ?
+++ exited with 0 +++
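
Since the parent was simply waiting on a child (PID 3697 in the output above), attaching with the -f flag to follow child processes would likely have shown where the real work was stuck:

# -f follows forked child processes as well as the attached PID
sudo strace -f -p 3695 -s 100000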

The full output didn't give me much else to go on, so it was time to move on to the next step. The other thing I noticed while debugging was that all Django management commands would hang while the cron job was hanging; if I tried to run any management command, it would wait as well. This led me to believe something in the management command was preventing all of the management commands from running. So, I decided to run the management command under strace while the cron was hanging. Once I saw the cron hanging again, I ran strace python manage.py scheduler. This was exactly what I was looking for: strace went through its output and then stopped on the system call below.

openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3
poll([{fd=3, events=POLLIN}], 1, -1
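
The -1 timeout in that poll() call means "block indefinitely" until /dev/random becomes readable, i.e. until the kernel's entropy pool has data to hand out. On a machine where the pool is starved, you can watch this directly (the number will vary):

# bits of entropy currently available in the kernel pool
cat /proc/sys/kernel/random/entropy_avail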

PyNaCl Bindings Hang

As you can see, the management command was hanging on the call to openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3. Since I wrote the management command, I knew the read from /dev/random wasn't something I was doing myself, and it looked like something was going on with a random number generator. A quick Google search and, lo and behold, I was led to an open bug ticket in the PyNaCl library. The bug report described the exact behavior I was running into, and it linked to paramiko, which I'm using with the scheduler. While reading through the ticket, I realized this is something that happens on VMs, which typically have far less entropy available than physical machines, and that's why I wasn't able to reproduce the issue on my local machine.

I'm not going to walk through the details of the bug itself (that could be another blog post), as it's very well documented here. If you do run into this problem, the suggested fix is to install haveged, which solved the problem for me as well. For more information about what haveged is and how it works, click here.
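
On Ubuntu/Debian, installing it is a one-liner (package names may differ on other distributions); haveged runs as a daemon and keeps the kernel entropy pool topped up:

# install the haveged entropy daemon
sudo apt-get install haveged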

Conclusion

This was one of the harder bugs I've had to solve in some time; it took me a week to figure out what was going on. I tried everything, but in the end strace saved me. When debugging your code, if you notice behavior that doesn't seem normal while system resources are running at typical levels, strace is a tool that can help you a great deal.