++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ Some Really General Debugging Methods +++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Written by Gerald Lai 2005

OK. So you have put all the components together to form your system. You test it and your whole system grinds to a halt. Nothing works! You take one look at the hairy mess known as your "project" and wonder where the problem could be. After all, how could anything possibly go wrong? You were most absolutely tremendously PERFECT while you were assembling it..

Real Life 1 - You 0

Ever been in this situation? The following methods describe some approaches to tackle the problem of locating and solving bugs in your system or any general hardware/software system for that matter.

The process of debugging is the process of systematically removing problems in a system by correcting them. Exercise systematic checking, logical reasoning and some common sense for good debugging.

There are many more methods that are not mentioned. It is up to you to come up with more ways that will ultimately increase your value as an engineer. Remember, the only way to get good at debugging is through practice. Of course, we do not practice making bugs that we have to solve. Rather, the *experience* of solving many problems thrown our way is what seasons us.

                  +------+         +------+
                  | AUX1 |         | AUX2 |
                  +------+         +------+
                     ||               ||
              +============SYSTEM===========+
+-------+     |     |     |     |     |     |
| INPUT | ==> |  A  |  D  X  E  |  H  |  I  |
+-------+     X     |     |     |     |     |
              |-----|-----|-----|-----|-----|
              |     |     |    X|     |     |     +--------+
              |  B  |  C  |  F  |  G  |  J  | ==> | OUTPUT |
              |     |     |     |     |     X     +--------+
              +============SYSTEM===========+

NOTES:
SYSTEM needs AUX1 & AUX2 to function properly
SYSTEM flow: INPUT -> blocks A through J -> OUTPUT
X = bugs occurring at:
    1. input
    2. interface
    3. functional block
    4. output

OK. So you have put all the components together to form your system.
You test it and your whole system grinds to a halt. Nothing works!

Relax.. take a look at some of these methods:

River Flow Dam (classical approach)
--------------

[BASIC IDEA]
The flow of a system from input to output is like a river. If the problem happens downstream, dam the river upstream and solve the problem up there first before it spreads further down.

[EXAMPLE1 - a simple example]
You find that your 2-input OR logic gate circuit is not powering up your LED like it should. You check the output of the OR gate and find out that the voltage level is not high enough. You proceed by "placing a dam" on the problem by FIRST making sure that the voltage levels on the inputs register high enough.

You find that one of the inputs is tied to ground. Alright. The other input is connected to +5V. Seems fine enough. You then whip out your multimeter to test the input connections and discover that there was a break (inside the wire insulation) in the wire connected to the +5V supply. You change the wire.

{Moral of the story}
Do not even bother with the output if you haven't got the inputs set up right. Always check your inputs. Common sense.

[EXAMPLE2 - a more practical example]
You find that you have some problems with the initialization part of the code for your microcontroller. Your microcontroller C code structure looks something like this:

    main() {
        // initialization part
        GOOD CODE A
        *FAULTY* CODE
        GOOD CODE B

        // main loop
        while (1) {
            MAIN CODE HERE
        }
    }

Proceed to "dam the river" by placing an extra 'while loop' immediately after all the initialization code. Progressively move the loop up till the problem is gone. The moment the problem is gone, the code between the current 'while loop' dam and the previous dam is the problematic code.

On the first dam, modify the initialization part by adding a 'while loop' dam right before the main loop:

        GOOD CODE A
        *FAULTY* CODE
        GOOD CODE B
        while (1) { DAM TEST CODE }     // added loop of first dam

This ensures that the microcontroller only executes initialization code that we wish to test. Note that execution will never get to the main loop because it gets trapped in the added loop. Hence, the main loop does not need to be removed or commented out, which saves editing effort.

For the DAM TEST CODE, insert code that would test the validity of the initializations and signal the result (see method 'Error Propagation & Inducing Indication').

Testing for the first dam reveals that the problems still exist. We then move the dam up to look like this:

        GOOD CODE A
        *FAULTY* CODE
        while (1) { DAM TEST CODE }     // added loop of second dam
        GOOD CODE B

The second dam made no difference. The problems were still there. But we now know that, as far as the dam test code is concerned, code B did not cause the problems.

        GOOD CODE A
        while (1) { DAM TEST CODE }     // added loop of third dam
        *FAULTY* CODE
        GOOD CODE B

Finally, on the third dam, the problems went away. From this, we infer that the problems reside between the second and third 'while loop' dams. In other words, the faulty code is in-between code A and code B.

[VARIATION]
Instead of looking at it in terms of input flowing to output, we can also look at it transiently (in terms of time). If we know that our system produced wrong output results at 3 different times t, t+11 & t+34, then we would know to look for the problem BEFORE time t. That is, look where it first happened.

Debugging is far less effective when we look at the system for time > t (after the first problem has happened). Doing so may, however, give us some clues, albeit sometimes misleading ones. Also see method 'Chronological Screwup'.
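
The dam-moving procedure of EXAMPLE2 can also be modelled on a desktop machine. The following is only a rough sketch under stated assumptions: the initialization sections are stood in for by three made-up functions, the DAM TEST CODE is a made-up check on a single integer state, and there is exactly one faulty section whose damage stays visible once it has run. None of these names come from the original example, and on a real microcontroller each dam position would mean a rebuild and a re-flash rather than a loop iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the initialization sections in EXAMPLE2. */
typedef void (*init_step)(int *state);
static void good_code_a(int *state) { *state += 1; }
static void faulty_code(int *state) { *state = -100; }   /* the planted bug */
static void good_code_b(int *state) { *state += 1; }

const init_step demo_steps[] = { good_code_a, faulty_code, good_code_b };

/* Hypothetical DAM TEST CODE: the state is valid while it is non-negative. */
int demo_check(int state) { return state >= 0; }

/* Run only the steps above the dam, then run the dam test code. */
static int passes_with_dam_at(const init_step *steps, size_t dam,
                              int (*check)(int))
{
    int state = 0;
    for (size_t i = 0; i < dam; i++) {
        steps[i](&state);
    }
    return check(state);
}

/* Start with the dam below all n steps and move it up one step at a time.
   Returns the index of the step that sits between the last failing dam and
   the first passing dam, or -1 if no single step can be blamed. */
int find_faulty_step(const init_step *steps, size_t n, int (*check)(int))
{
    assert(steps != NULL && check != NULL);
    if (passes_with_dam_at(steps, n, check)) {
        return -1;                      /* no problem seen at the first dam */
    }
    for (size_t dam = n; dam > 0; dam--) {
        if (passes_with_dam_at(steps, dam - 1, check)) {
            return (int)(dam - 1);      /* problem vanished: blame this step */
        }
    }
    return -1;                          /* fails even with nothing run */
}
```

Calling find_faulty_step(demo_steps, 3, demo_check) walks the same three dams as the example and blames index 1, the section between code A and code B.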

Catch the Chicken (process of elimination)
-----------------

[MUSE]
Finding a bug is like catching a chicken. When you try to catch a chicken, you should not rush at it. It will elude you. Instead, corner it against a fence until it finds no way out and decides to make a run for it. When it runs towards you, catch it. If it stays put, inch in closer...

[BASIC IDEA]
By the same token, eliminate the possibilities of where a bug could or could not occur. Deductive reasoning is the KEY and testing can help.

Sometimes committing too much to hunting down a bug may not be beneficial in terms of effort or time. It may be better to cover breadth than to go far in depth. Eventually, the bug will surface by itself if we squeeze it out rather than dig for it.

[STRATEGY]
A good strategy would be to determine if the bug is either hardware or software related. If it is hardware related, then we can determine if the bug occurred in either the filter or ADC/DAC components, for example. After ruling out ADC/DAC, we can check, on a lower level, the parameters of the filter to see if we used the right values of resistances and capacitances, check the connections, etc.

[EXAMPLE3 - a PC troubleshooting example]
Your friend calls you up one day and cries for help. He was working on an assignment (due midnight) on his computer when a "fatal error" message popped up. The computer then shut down automatically. (Your friend uses Windows XP)

You rush to his house. Each time you turn on his computer, you see it boot through the startup screen and into the desktop. Around 15 seconds after that, it shuts down. You try turning it on a couple more times. Each time the computer shuts down approximately 15 seconds into the desktop.

As far as you know, fatal errors could be caused by a malfunction in hardware, particularly the hard drive of the computer. You ask your friend how old his computer is. He tells you that he got it from Dell 3 months ago.
The PC is quite new, so it is reasonably safe to assume for now that the bug does not reside in hardware. Besides, you already notice the 15-second-till-shutdown routine, which is too consistent to be an erratic hardware problem.

You proceed to boot the PC in Safe Mode (a safety feature of Windows). While the PC is booting up, you ask your friend if he noticed anything strange lately while visiting websites or checking e-mail. Your friend recalls an e-mail from a relative of his. He opened the e-mail to look at its contents to see if it was anything important. It turned out to be some advertisement on recreational drugs, which seemed weird considering it came from his second aunt.

The PC finally boots up in Safe Mode. You race against time to save his assignment on a floppy disk and, with whatever time remains, you decide to check the startup list of programs. Sure enough, you notice some suspicious entries of programs that should not be loaded when Windows boots up.

15 seconds pass and your friend is amazed that the computer did not shut down. Of course. Windows skips the startup list when it is booted in Safe Mode. Hence, you conclude that one (or more) of the programs in the startup list is the troublemaker.

You disable what you think is malicious and reboot the computer once more. After the reboot, you proceed to run scanners to detect and remove the malware on the computer.

{Moral of the story}
Even though this example is of a troubleshooting nature, it highlights many key points in debugging to consider. For instance, many real-world bugs we may face in the future reside in systems that we either have no prior knowledge of or that are too complex to comprehend all at once. We need to understand the system first and inquire more about it before tackling the bugs.

Also, it is important to remember that if time is of the essence, we need to salvage what we can first (go for the quickest fix) before anything else.

Functional Block Testing (divide and conquer)
------------------------

[DIAGRAM DESCRIPTION]
Almost every system is built from basic parts, adopting full use of the divide and conquer principle. These basic parts come to be known as "functional blocks" of the system. For example, consider the DIAGRAM on the first page. It describes the layout of a general system.

The system in the DIAGRAM first receives an input (or stimulus) and processes it to produce an output (or result). The flow from input to output may pass through (or make use of) several functional blocks. In the case of the DIAGRAM, the flow moves through blocks A through J.

The system also requires exclusive entities AUX1 and AUX2 to function properly and carry out its job. Examples of AUXiliary entities: power supply, existence of another system (for co-dependence), key, operator, etc. In some sense, AUX1 and AUX2 are inputs themselves but they are usually self-sufficient and standalone entities.

As stated in the DIAGRAM, bugs could either occur at the input, output, functional block or interface levels of a system. Occasionally, the bugs could come from the AUXiliary interface. This is worth checking from time to time.

[BASIC IDEA]
A system can generally be disassembled and tested part-by-part. Therefore, one of the exhaustive ways of searching for bugs is to tear the whole system down and test every functional block. See method 'Accuse the Easiest First'.

This method leads to one intriguing matter that ought to be kept in mind:

    If all the functional blocks function as they should BUT the system fails when all the blocks are put together, THEN the culprit must be in the interface!

When that happens, we know that some of the functional blocks are not playing nice with each other. We can track down many bugs in this manner by building our system and testing it part-by-part, trapping the chicken down to the interface when we finally put everything together.
After all, the hardest bugs to find are the ones that reside on the interface level.

Accuse the Easiest First (fix the simplest first)
------------------------

[BASIC IDEA]
If we wish to test several functional blocks of our system to hunt bugs, start with the smallest basic block, the one that is easiest to test and perhaps fix. Proceeding in this order will not only reduce the chances of breaking the system with a minor bug fix but also reduce the chances of introducing phantom bugs that come back to haunt the system.

The reasoning is simple: a more complex functional block plays a larger, more vital role in a system and will have many interface connections with other basic functional blocks. Hence, applying a bug fix to the complex functional block, without first considering if the fix could be made in a more basic functional block, may cause unknown adverse effects on the whole system.

Fixing a basic functional block is not only easier to comprehend, it also helps us notice other bugs and understand the flow of our system better. In the end, it saves a lot of needless hassle.

[EXAMPLE4 - a system design example]
Consider a part of a system that selects and sums multiples of number inputs.

    Functional block A
        DESCRIPTION: reads number input and outputs number multiplied by 2
        INPUT      : in_a
        OUTPUT     : out_a = 2 * (in_a + 3)
        COMMENT    : an offset error of +3 for in_a exists in this block
                     the correct OUTPUT is out_a = 2 * in_a

    Functional block B
        DESCRIPTION: reads number input and outputs number multiplied by 3
        INPUT      : in_b
        OUTPUT     : out_b = 3 * (in_b - 2)
        COMMENT    : an offset error of -2 for in_b exists in this block
                     the correct OUTPUT is out_b = 3 * in_b

    Functional block C
        DESCRIPTION: reads number input and outputs number multiplied by 4
        INPUT      : in_c
        OUTPUT     : out_c = 4 * in_c
        COMMENT    : no errors exist in this block

    Functional block D
        DESCRIPTION: selects 2 number inputs and outputs sum of those numbers
        INPUT      : out_a, out_b, out_c, select
        OUTPUT     : out_d = out_a + out_b    {if select = 0}
                           = out_b + out_c    {if select = 1}
                           = out_c + out_a    {if select = 2}
        COMMENT    : receives outputs of functional blocks A, B and C as inputs

In this case, functional block D is the complex functional block whereas functional blocks A to C are the basic functional blocks of the system part.

The above is a simple mock system part with a couple of unnoticeable offset error bugs. For functional blocks A and B, the OUTPUT function conflicts with the DESCRIPTION of the block. These offset errors were unintentionally introduced during design. You do not realize this at first.

To test this system part, you begin by testing 'out_d' by varying 'select' for many different 'in_a', 'in_b' and 'in_c' values. When 'select' = 0, the resulting 'out_d' is correct. When 'select' = 1, there is a -6 offset from the correct result for 'out_d'. When 'select' = 2, there is a +6 offset.

Without "accusing the easiest first", you fix functional block D such that:

        OUTPUT     : out_d = out_a + out_b        {if select = 0}
                           = out_b + out_c + 6    {if select = 1}
                           = out_c + out_a - 6    {if select = 2}

Is this solution correct? Does it work? Well, the answer is YES if this is the whole system itself. It does work.
Unfortunately, this is part of a larger system. Applying the solution above will not work if the larger system makes use of the basic functional blocks A to C for purposes other than functional block D. With that solution, you have basically "fixed the problem" only as far as block D's job is concerned.

Besides, you altered functional block D's description of "selecting and summing" to "selecting, summing and taking care of offset errors". By introducing avoidable complications like this, you are only setting yourself up for future pitfalls (phantom bugs) and a lashing from colleagues working on the same large system that eventually grew around your functional blocks A to D.

Instead, you decide to explore the problem further by first testing the simpler blocks. Doing so, you find and fix the offset errors in blocks A and B. You have contained the fix to the blocks where it belongs, you did not have to introduce any extra functionality such as offset compensation (as in the previous solution), and the original errors disappear once everything is put together again.

Recreate Error (understanding)
--------------

[BASIC IDEA]
If we do not understand how a bug occurs, try to recreate that bug. Understanding the bug means half the battle is won.

[STRATEGY]
There are 3 approaches to recreate a bug: feed the bug, strip down the system or create it from scratch.

An elusive bug has to be triggered somehow for it to appear. Feeding the bug involves feeding the system with a set of inputs or placing the system in a specific state/environment so that the bug occurs. It is analogous to luring a rodent out of hiding. Once we know what inputs and state of the system cause the bug to appear, we can begin studying how the stimuli affect the functional blocks of the system to produce the bug. This is the most widely used method for bug tracking employed by software writers (via bug reports from beta testers and the internet community).
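
Feeding the bug can be tried out on the mock blocks from EXAMPLE4. The sketch below copies blocks A to D from their OUTPUT fields (offset errors included) and compares the result against what the DESCRIPTION fields promise. The helper names expected_d and observed_offset are made up for this illustration and are not part of the original example.

```c
#include <assert.h>

/* Blocks A to C exactly as specified in EXAMPLE4, offset errors included. */
static int block_a(int in_a) { return 2 * (in_a + 3); }  /* should be 2 * in_a */
static int block_b(int in_b) { return 3 * (in_b - 2); }  /* should be 3 * in_b */
static int block_c(int in_c) { return 4 * in_c; }        /* no error */

/* Block D as originally specified: select two outputs and sum them. */
static int block_d(int out_a, int out_b, int out_c, int select)
{
    assert(select >= 0 && select <= 2);
    switch (select) {
    case 0:  return out_a + out_b;
    case 1:  return out_b + out_c;
    default: return out_c + out_a;
    }
}

/* What the DESCRIPTION fields promise for the same stimulus. */
static int expected_d(int in_a, int in_b, int in_c, int select)
{
    return block_d(2 * in_a, 3 * in_b, 4 * in_c, select);
}

/* Feed one stimulus and return the observed error (actual minus expected). */
int observed_offset(int in_a, int in_b, int in_c, int select)
{
    int actual = block_d(block_a(in_a), block_b(in_b), block_c(in_c), select);
    return actual - expected_d(in_a, in_b, in_c, select);
}
```

For any inputs, observed_offset(..., 0) is 0 because the +6 and -6 errors cancel, while 'select' = 1 gives -6 and 'select' = 2 gives +6. A tester who only ever feeds 'select' = 0 never sees the bug, which is why the choice of stimulus matters.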

The second approach is to strip down the system piece-by-piece until it is easy to notice how the bug occurs. Note that this approach is qualitatively different from the first approach in that it targets bugs that respond to many sets of inputs (i.e. persistent bugs). It is analogous to finding the source of an ant infestation in our home. We know there are ants all around, but where are they coming from? The system has to be stripped down systematically (reduced functionality) until the bug ceases to exist (much like EXAMPLE2) or until the system is bare enough for us to know what is going on. Sometimes, this approach may not be feasible for large complex systems.

The third approach is rare and it is sort of the inverse of the second approach. It involves creating a pseudo-bug replica of the problem from the ground up. Parts of the buggy system may be used for this. The result is a separate mini-system with a mini-error in it. For example, in finding a bug residing in the filter circuit section of a system, a miniature high-pass filter could be created to study the system errors of filtering out low frequencies.

Error Propagation & Inducing Indication (highlighting)
---------------------------------------

[BASIC IDEA]
If there is a way of highlighting a bug, DO IT! In reality, we can only get so close as to indicate clues that lead to the whereabouts of the bugs we are seeking.

High-level programming languages often feature fantastic error propagation for debugging purposes. For example, a piece of software that encountered an error in some deeply nested function can propagate this information up to the main function as an indication. If we are programming a large software system, we should make full use of that feature.

However, complex error propagation indication is largely confined to the software domain. When dealing with hardware circuits, errors do propagate but not on purpose. It is usually infeasible to induce some sort of propagation indication on hardware.
Sometimes, all we have is the simplest of indication methods. However, they can be really powerful if used well.

[EXAMPLE5 - a proof of concept]
An example of hardware indication is to use light emitting diodes (LEDs). When programming a microcontroller device, LEDs are one of the best ways of indicating the execution of a part of code. In this case, lighting up an LED would be a simple microcontroller port output command. Not only is it fast, it is also reliable.

Others have often tried to send an indication to an LCD display instead, which takes several commands just to issue a change. During those commands, some other code may break and mess up the microcontroller, causing an incorrect indication on the LCD display.

C code to toggle an LED on PORTF pin 0:

    PORTF ^= 0x01;                  // flip bit 0 on PORTF

C code to put letter "E" on column 1 line 1 of LCD display:

    PORTB = 0x01;                   // put LCD display in COMMAND mode
    LCDDATA = 0x02;                 // COMMAND: cursor home to column 1 line 1
    PORTB ^= 0x80; PORTB ^= 0x80;   // strobe LCD enable bit 7
    PORTB = 0x00;                   // put LCD display in DATA mode
    LCDDATA = 0x45;                 // DATA: write hex value for ASCII letter "E"
    PORTB ^= 0x80; PORTB ^= 0x80;   // strobe LCD enable bit 7

Chronological Screwup (backtracking)
---------------------

[BASIC IDEA]
As far as building and testing a system is concerned, if a system undergoing construction tested good the last time we tested it, and *lately* it has not been working right, THEN something we have done *recently* MUST have caused the problem.

[MUSE]
We often get lost in our own world while building a system. Sometimes our minds drift elsewhere and before we know it, we have a bug in our soup. It is important (in the process of backtracking) to be able to remember "what we did" or "what we were trying to do". We need to ask ourselves, "when was the system good the last time?" and more importantly, "what happened right after that?".
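
When saved versions of the work exist, the question "when was the system good the last time?" can be answered mechanically. The following is a minimal sketch and not a prescribed procedure: it assumes the saved versions are numbered from oldest to newest, that the oldest one tests good and the newest one tests bad, and that the system stays broken in every version after the one that broke it. Under those assumptions a binary search needs far fewer tests than retesting every version in order. The names first_bad_backup and demo_is_good are invented for this example.

```c
#include <assert.h>

/* Hypothetical test hook: returns non-zero if saved version `i` tests good.
   In this demo, versions 0..4 work and the breaking change is in version 5. */
int demo_is_good(int i) { return i < 5; }

/* Saved versions are numbered 0 (oldest) to n-1 (newest).
   Returns the number of the first bad version, or -1 if the oldest version
   is already bad or the newest version is still good. */
int first_bad_backup(int n, int (*is_good)(int))
{
    assert(n > 0 && is_good != 0);
    if (!is_good(0) || is_good(n - 1)) {
        return -1;                      /* nothing to search for */
    }
    int good = 0;                       /* newest version known to be good */
    int bad = n - 1;                    /* oldest version known to be bad  */
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (is_good(mid)) {
            good = mid;
        } else {
            bad = mid;
        }
    }
    return bad;                         /* what changed here caused the bug */
}
```

With eight saved versions, first_bad_backup(8, demo_is_good) tests versions 0, 7, 3, 5 and 4 and reports version 5. Whatever changed between versions 4 and 5 is "what happened right after that".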

[STRATEGY]
It is imperative to adopt a concrete backtracking strategy if we wish to spend less time thinking about the past and more time on the future.

Examples of good backtracks:

Software backups
  - when programming, save early, save often and save always
  - label the backups well (i.e. with date, time & brief description)
  - utilize some sort of version control such as the Concurrent Versions System (CVS) if possible
  - if we find that we are not using our backups at all, it is a GOOD sign
  - if we find that we are plowing through our backups, it is an even BETTER sign (and a cause for celebration, imagine if we didn't have backups.. oh don't bother, we are bound to run into that situation eventually)

Prototypes
  - hardware circuits are a little different; their backtracks are not works-in-progress of their own
  - work the schematic all out on paper first and triple-check calculations
  - there is a lot of software that can simulate circuits; use it to aid calculations and confirm expectations
  - when possible, breadboard a circuit and test before soldering it shut on a PCB

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++ The Hard Facts About Debugging ++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The ideal way to handle a bug is, of course, to AVOID it in the first place. However, it is humanly impossible to make no mistakes. Debugging is a frustrating, slow and painful process, and it is here to stay.

Luckily, there are "preventive measures" we can take. The following section provides tips on how to make debugging a little easier and a lot more tolerable:

Develop attention to detail
---------------------------

One of the most important skills to develop as an engineer is your attention to detail. That is, to put as much detail as possible into your thoughts, methods and designs.

Take a knob switch for example. You know that the switch is closed when the knob is turned.
But think about which way you have to twist the knob. Clockwise or anti-clockwise? Does it matter? How would you open the switch? Twisting it in the same or opposite direction? If it is the same direction to close and open the switch, then there must be two or more settings in one revolution of the knob to differentiate between the two. How are the settings spaced along the circumference of the knob? Dwell in the details.

Developing your attention to detail also improves your memory to a certain extent. This is useful for backtracking (see method 'Chronological Screwup') to remember what you did previously.

Be neat
-------

Be neat so that you (and others) do not have to wade through your clutter when you debug. Develop a style for whatever design work you do (i.e. describing systems, drawing schematics, writing code, laying out circuits, etc.).

Adopt a naming convention that is concise and helpful to you and others. For example, name active-low signals with a suffix "_n"; an active-low reset would be named "reset_n". Remember, consistency is the key to neatness.

Get intimate with your components
---------------------------------

You cannot construct what you do not understand. When dealing with circuits, study your electronic components first before using them. Always refer to datasheets and timing diagrams. Get intimate. Test the components individually when possible. Use a multimeter, oscilloscope and breadboard. All of this will help you build your understanding of the details.

The same applies to software. When programming, study the library functions you wish to use by first reading the documentation. Experiment with the functions in test code before using them. Never use code you do not fully understand. Otherwise, be prepared to debug later.

Don't blame your tools - blame yourself and then again
------------------------------------------------------

"A bad carpenter always blames her/his tools."

Most electronic tools and components are hardy enough that they do not easily fail. Some even come with protection built into them. Even if they do fail, some of them fail gracefully and you will notice the failure.

Hence, DO NOT blame your tools the first time around if nothing works! Chances are that you have made a mistake that makes it *seem* as though the tools/components are at fault.

Get help & understand how you were helped
-----------------------------------------

"Give a person a fish and you will feed that person for a day. Teach that person how to fish and you will feed that person every day onwards."

If you can get help, please do! However, if you received good help, you need to understand how you were helped. Not knowing how you were helped will only leave you ignorant of the changes in your system. It will also reduce your knowledge and increase the doubts you have of your system.

It would not be absurd to go so far as to reject the help if you do not understand it.

There's no VOODOO
-----------------

One of the best things about the computer/engineering field is that every problem has a logical explanation to it. The systems we work on are deterministic. There is no magic involved. Here are a couple of voodoo myths to look out for:

["I slapped my circuit with my palm and it works now!"]
** No amount of voodoo hand-waving, touching, chanting and cheering will fix your problems. If your circuit did work as a result of a slap, the slap in itself was not the solution. The slap caused a change in your circuit (probably undoing a short of 2 components, for example) that "fixed it".

["If it worked for someone else, it should work for me."]
** Knowing that it worked for someone else is NOT a reason to be using the same system design. Find a better reason. Test a design you wish to adopt and the statement above would change to be "I tested it thoroughly and it fits my specifications, so it should work for me".

Document your solutions
-----------------------

There may be times in the future when you encounter a bug similar to one you fixed in the past. By then, you may have forgotten how you fixed it and end up wasting time debugging it again. Documenting your bug solutions for future reference is a good way to answer those "how did I fix this the last time?" questions.

Documentation can be as easy as opening a text file, writing a brief description of the problem & solution technique/code that you used to fix it and saving the file with a descriptive filename. Perhaps include references of how you found the solution itself - something to jog that memory of yours.

Nirvana mode
------------

Sometimes you feel special, as though the laws of physics and logic have stopped just for you. Some bugs can do that to you. Thus, it is absolutely important to keep a calm _state of mind_ while debugging.

Do not assume that anything is right or wrong. Proceed with clear logical reasoning devoid of emotion. Hence, if you feel frustrated and fatigued at any point, take a break for some fresh air, get a drink and then come back.

There is a plethora of negative attitudes one can adopt that would only hinder the debugging process. They are also not conducive to the well-being of the designer learning from the experience of debugging. Here are some of those attitudes, captioned:

["You are my subordinate. I don't think very highly of you. I know much more than you. How could you ever help me?"]
** In case you have not noticed, you still have a bug you have not yet fixed. Accept any help you can get. What is there to lose? Remember, it is even possible for a baby to point at the buggy location on your circuit. Lose the ego.

["You are my superior/mentor/boss/instructor. You have more experience. You must know what is wrong. What is it?!"]
** Technically, nobody else understands your system part better than you do. You built it!
So do not be disappointed the next time someone (who has more experience) fails to provide you with the solution you needed. Instead, work together to tackle the problem.

["But I have tested that already. What is the point of testing it again?"]
** The ideal number of debugging tests you should conduct is infinity. The constraints would be time, effort and resources. There are never enough tests you can do before the bug is fixed. Missing a test can mean the loss of a valuable piece of information that will help you solve the bug.

["It's broken. I know because I tested it."]
** Claiming that something is broken and that it is the final conclusion because you "tested" it does not make the situation any better. Instead, record your test results on paper *as proof* and recreate those results to be really sure. The test results will serve as useful information later to find out what happened. Remember, words mean nothing without the backing of a technical proof.

["Oh, I don't have to test that. I'm sure it's correct."]
** When you are debugging, do not presume that anything is correct. Reset all your assumptions and begin with a clean slate.

["I didn't expect that it would take this long. I only allocated this much time to it."]
** Debugging is a serious issue, even at top corporations. Professionals often underestimate the time required for debugging. No good rule exists to estimate debugging time. Just be sure to leave yourself ample time. And then triple that amount.

["I didn't expect that it would take this long. I need to get to a party soon."]
** If you are not willing to put the time and effort into fixing bugs, then the bugs will remain in the system. What were you expecting?

[EXAMPLE6 - code problems]
Some day, you will narrow your bug search down to this small piece of C code and shed hair trying to figure out why it always prints "PROGRAM FAIL".

    char k = 2;
    if (k = O) {
        k = 3;
        printf("PROGRAM FAIL");
    }

When this happens, it is time to get into Nirvana mode. You breathe rhythmically and relax your gaze over the code...

    char k = 2;
    if (k == 0) {
        k = 3;
        printf("PROGRAM FAIL");
    }

You find that you had a single "=" within the IF statement, which is an assignment (its value is whatever was assigned, so the condition is true whenever that value is non-zero). Instead, you should have used the double "==" as a comparator. From now on, you force yourself to always put constants on the left side of the comparator (e.g. if (123 == variable)) just in case you forget an "=" again.

You also find that you previously typed a capital "O" for a zero "0" and fix it.
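
The behavior behind EXAMPLE6 can be checked in isolation. This is a small sketch with invented function names (assignment_as_condition, is_zero, nirvana_demo): it shows that an assignment used as a condition takes on the value assigned, and why a constant on the left helps. With the constant on the left, a dropped "=" turns into "0 = k", which a C compiler rejects because a constant cannot be assigned to. Many compilers can also warn about an assignment used as a condition, so it is worth keeping warnings switched on.

```c
#include <assert.h>

/* Hypothetical helper: reports whether `if (k = value)` takes its branch. */
int assignment_as_condition(char value)
{
    char k = 2;
    /* The extra parentheses mark the assignment as intentional. */
    if ((k = value)) {
        return 1;   /* branch taken: the assigned value was non-zero */
    }
    return 0;       /* branch skipped: the assigned value was zero */
}

/* The habit from the text: constant on the left of the comparator.
   Writing `0 = k` here by mistake would not compile. */
int is_zero(char k)
{
    return 0 == k;
}

/* A few checks that document the expected behavior. */
void nirvana_demo(void)
{
    assert(assignment_as_condition(0) == 0);
    assert(assignment_as_condition(7) == 1);
    assert(is_zero(2) == 0);
}
```

Calling nirvana_demo() runs silently when the assertions hold.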