Date: Sun, 16 Apr 2006 03:42:03 -0500 (CDT)
Subject: heisenbugs force real time code rewrite

The robot-side control system is completely rewritten. It seems to be stable now, with much higher performance than the old code.

Here's a video of the rate gyro feedback to the steering system.

http://golem5.org/robot1/video/mvi2875.mpg

I'm holding the robot off the floor and turning it left and right. It fights this by turning the wheels opposite the direction of turn.

The old control system code tried to fork off a child process with RT scheduling from a multithreaded parent process. The real time child process handled bit banging with the motor control board. This approach seemed to be stable, but it was really full of Heisenbugs. With the preemptible Linux kernel, no virtual memory, and even low entropy (no hard disk), the code was very unstable. It was random chance whether it worked or not. Adding a print statement, or even a function that was never actually called in practice, could make or break the bit banging. I could play around with the code and eventually happen upon a solution that worked, but if I made even slight modifications to it, it stopped working.

What torpedoed this was trying to have communications between both computer boards on the robot. The top board handles all motor control and communications. The bottom board handles higher level auxiliary functions. For now, that means it logs robot state and records video frames. I could never get the second board integrated with the first and still have the bit banging work. The extra time spent in socket communications messed up the timing.

So I've basically spent Thursday night, Friday, and Saturday, now into Sunday morning, working on this. Saturday night I realized that I had to trash the existing code and rewrite it.

What's funny about this solution is that it is superior in all respects.

1. less code, simpler (about 30% smaller)
2. more reliable (have not seen Heisenbugs or crashes)
3. higher performance (main loop several times faster without lag)

However, it is weird. I think it works because all of the separate processes are forced into sequential scheduling due to the pipes involved: each process blocks until the one before it in the pipe has written something.

Here's what I run on the various computers.

    geode1 - echo 10650 3 | ./sensor | ./boss 5678 | ./motor
    geode2 - wget -O - http://192.168.1.100/img/mjpeg.cgi | ./hippo abc
    laptop - ./jstick wifi geode1 5678

So jstick on laptop connects to boss on geode1. boss receives sensor readings from sensor. boss sends motor control commands to motor. hippo receives the network camera MJPEG stream and accepts logging commands from boss.

If someone asked me to do hardish soft real time on Linux (like what I'm doing now), I'd first ask if they had considered QNX or VxWorks (even though I have never used either). Then I'd suggest RTAI or RTLinux. But if vanilla Linux were a hard requirement, I think that it's possible to make it work. The trouble is that you may spend a lot of time coming up with a solution. I suspect that the RTOS world has lots of tricks that most software people have never seen before.
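
To make the pipe idea from earlier in this post concrete, here is a minimal sketch of what one stage in a sensor | boss | motor style pipeline could look like. It is not the actual source of sensor, boss, or motor, which is not shown here. The names run_stage and steering_command, the one-number-per-line message format, and the gain value are all assumptions made up for illustration. The point is only that a stage which blocks on reading a line, and flushes after writing one, ends up running in lockstep with its neighbors.

```python
import io


def steering_command(line, gain=0.5):
    """Hypothetical controller: steer against the measured turn rate."""
    rate = float(line)  # assumed format: one gyro rate reading per line
    return f"{-gain * rate:.3f}"


def run_stage(inp, out, transform):
    """Read one line at a time, transform it, and pass it downstream.

    Returns the number of messages handled.
    """
    count = 0
    # readline() blocks until the upstream stage has written a full line,
    # so this process sleeps whenever it has nothing to do.
    for line in iter(inp.readline, ""):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        out.write(transform(line) + "\n")
        # Flush right away so the downstream stage is never left waiting
        # on data sitting in this process's output buffer.
        out.flush()
        count += 1
    return count


# In-memory demo standing in for real pipes.
demo_in = io.StringIO("10.0\n-4.0\n")
demo_out = io.StringIO()
run_stage(demo_in, demo_out, steering_command)
print(demo_out.getvalue(), end="")
```

In a real pipeline the same function would be called with sys.stdin and sys.stdout in place of the in-memory buffers, so the shell's pipes provide the blocking behavior.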