If you a developer who works on Distributed system, there is one thing you learn well. That is how to debug, and how to avoid having to debug. Following are some of my thoughts and somethings I generally do.
Debugging a Local Java Program
- If you write your program well, generally you will have a stack trace when you have a problem. (This does not apply well with performance problems and memory leaks. I will write a separate note about those. )
- Look at the trace; go to where the error happened. Try to figure out what happened. The best way to do this is by walking through the logic again.
- If that did not work, copy and paste your stack trace in to Google. About 80% of the time, you will find the answer there. Pay special attention to JIRA bug reports for the projects you are using and online forums like stackoverflow.com.
- If that did not work, you will have to debug. You can debug by running the code from your IDE (e.g. Eclipse) or by connecting to a server through a remote debugger. Walk through the logic using the debugger that generally tells what happened. For example, the article explains how to debug with eclipse and how to connect to a remote server.
- If none of these worked, now it is the time to go and ask for someone to help. There are some bugs that are very hard for the author of the code to see. However, you should go for help with problem recreated, debugger attached, and ready to let him step through the execution.
- If you have trouble with a specific tool, you can ask for help as user lists, forums, or general developer forums like stackoverflow.com.
Debugging a Distributed System
- Debugging distributed systems are hard. So best approach is to not to have to debug them. You can almost get there by writing unit tests and tests what you write in small steps. My preferred approach is to make one path work end to end, and do small changes while testing each change.
- Distributed system will have multiple Processors (JVMs). So it is tricky to debug them. If it is at all possible, find a way to run whole your distributed system within the same process (JVM). This will need some imagination from your end, but it will save you lot of trouble later.
- If you are debugging a distributed system, it is often useful to capture messages that are sent and received. You can do this via tools like TCP Monitor, SOAP Monitor, or Wireshark.
- It is doubly important to log all exception that can happen in your code. Otherwise, you will have no idea whether system worked or not.
- I often append the timestamp and the name of process or host to each log line. One way to do this is by writing a Log4j appender. Time stamp and process or host name let me merge sort all the logs into one file and read the execution of the system in one read.
- It is likely that your distributed system process lot of messages. So it is very hard to read and understand the log. One way out of this is to trace every 1000th messages. I do this by having a message count and using.
- If you running a complex system that has more than five nodes, you should invest in some mechanism to collect the logs using something like FLUME and automate their processing to find stack traces etc.