Date: Thu, 20 Apr 2006 07:43:54 -0500 (CDT)
Subject: exactly why the control system failed
X-UID: 166

While falling asleep, the real reason the control system failed came to me.

Here is the core of the control system again.

1. /dev/gpio2 ----2400baud----> sensor
2. sensor     ------pipe------> boss
3. boss       ------pipe------> motor
4. motor      ----2400baud----> /dev/gpio0

The service rate of the entire system is ultimately limited by step 4. 2400 baud means 2400 / (8 bits + 1 bit) = 266 bytes per second. The pipe to the "motor" process will start to back up if more bytes than this arrive.

Each motor control command frame consists of 4 bytes: framing, command, value, checksum. So 266 bytes per second means 266 / 4 = 66 command frames per second. The pipe to the "motor" process will start to back up if more command frames than this arrive.

When the drive is in neutral, "boss" writes 3 command frames every time it cycles through the control loop. As soon as the drive shifts into forwards, "boss" writes 7 command frames per cycle. So in neutral, 12 bytes are written per cycle. In forwards, 28 bytes are written per cycle.

So here is what happens.

1. "sensor" outputs data up to 50 times per second. It's slower when the full control system is running. A rough estimate is 20 times per second if logging is off. I've observed a rate of 11 times per second with logging on (as I could watch the log).

2. "boss" receives the sensor readings. A sensor reading triggers one cycle through the control loop. The control loop in "boss" therefore cycles at about 20 times per second. So "boss" writes 12 * 20 = 240 bytes every second into the pipe to "motor" while in neutral. This is just under the 266 bytes per second limit, so the system is working properly at this point, although very near the edge. I can walk over and pick the robot up, turn it in the air, and the feedback control to the front wheels will fight the turning ok.

3. "boss" is switched from neutral to forwards. From this point, the system is broken.
   In forwards drive, "boss" writes 28 * 20 = 560 bytes per second into the pipe to "motor". This is far in excess of what "motor" can process, so the pipe starts to back up.

4. Eventually, the pipe between "boss" and "motor" fills completely. When this occurs, the writing thread in "boss" blocks and waits for space in the pipe.

5. This causes a cascade failure. "boss" stops reading from the pipe with "sensor". The sensor pipe backs up too.

6. At this point, both pipes are backed up. The sensor pipe is full of several seconds of old readings. The motor control pipe is full of several seconds worth of old commands.

Before I went to bed, my solution was to reduce the output rate of the "sensor" process. This would have failed, as it does not fix the problem, which is really at the other end of the chain. In this case, the intuitive guess would have been completely wrong.

Motor control is done quite inefficiently. Many more command frames are sent than are really necessary. It was just easier to program it this way and send all commands with every cycle. As it turns out, this overloads the bit-banged 2400 baud serial link that limits the system. At a higher baud rate, say 9600 or 19200, this kind of failure probably would never have occurred.

So I'll have to go over the system and retune it so this kind of failure can never happen. The good thing is that there's now a much deeper understanding of how the system works. Queueing service rates are very significant.

It is interesting that something quite similar has occurred at work. A very large number of people were involved over the course of weeks trying to find a system bottleneck that caused intermittent cascading failures. Engineers first assumed that the system was overloaded and that work was arriving too fast. But in the end, a bottleneck at the filesystem level was found. Just as with this robot, the ultimately limiting factor is the service rate, not so much the arrival rate.
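
As a rough check of the service rate versus arrival rate arithmetic for the robot, here is a small Python sketch. The figures (2400 baud, 9 bits per byte, 4-byte frames, 3 and 7 frames per cycle, about 20 cycles per second) come from this note. The function names are made up for illustration, and the 4096-byte pipe capacity in the usage lines is only an assumption; the real capacity depends on the kernel.

```python
# Rate budget for the "boss" -> "motor" -> serial chain.
BAUD = 2400
BITS_PER_BYTE = 9        # 8 data bits + 1 extra bit, as counted in this note
FRAME_BYTES = 4          # framing, command, value, checksum
CYCLES_PER_SEC = 20      # rough estimate of the "boss" control loop rate


def service_rate_bytes(baud=BAUD, bits_per_byte=BITS_PER_BYTE):
    """Bytes per second the serial link can drain (rounded down)."""
    return baud // bits_per_byte


def offered_load_bytes(frames_per_cycle, cycles_per_sec=CYCLES_PER_SEC):
    """Bytes per second that "boss" writes into the motor pipe."""
    return frames_per_cycle * FRAME_BYTES * cycles_per_sec


def backlog_growth(frames_per_cycle):
    """Bytes per second piling up in the pipe; 0 if the link keeps up."""
    return max(0, offered_load_bytes(frames_per_cycle) - service_rate_bytes())


def seconds_until_full(frames_per_cycle, pipe_capacity):
    """Time for an empty pipe to fill, or None if it never fills."""
    growth = backlog_growth(frames_per_cycle)
    return None if growth == 0 else pipe_capacity / growth


def stale_seconds(pipe_capacity):
    """Age of the oldest command once the pipe is full."""
    return pipe_capacity / service_rate_bytes()


if __name__ == "__main__":
    ASSUMED_PIPE_BYTES = 4096  # hypothetical capacity, not measured
    for name, frames in (("neutral", 3), ("forwards", 7)):
        print(name,
              "load:", offered_load_bytes(frames), "bytes/s,",
              "limit:", service_rate_bytes(), "bytes/s,",
              "backlog growth:", backlog_growth(frames), "bytes/s,",
              "full after:", seconds_until_full(frames, ASSUMED_PIPE_BYTES))
    print("stale by (s):", stale_seconds(ASSUMED_PIPE_BYTES))
```

With these numbers, neutral stays just inside the budget and forwards exceeds it by a wide margin, which matches what the robot did.
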
If a service rate at the bottom is exceeded, then the system will back up for everything upstream, no matter the arrival rate.

The bad thing is that bit banging rears its head again. I'm limited to a low baud rate because of it. With a crystal and UART on both sides, the baud rate could be much higher.

I had originally wired the electronics for TTL serial, I2C, or SPI. I2C and SPI are externally clocked from the sender. Had I used them, these kinds of timing issues would never have arisen. Bit banging would be fine. But I chose TTL serial, which is problematic.
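
Short of changing the hardware, one possible way to retune the software side is to stop sending every command with every cycle and only send a frame when its value has changed. The Python sketch below is just an illustration of that idea, not the actual "boss" code. The class name, the `write_frame` callback, and the command names are all hypothetical, and the real 4-byte frame encoding is left out.

```python
class ChangeOnlySender:
    """Forward a (command, value) pair only when the value has changed."""

    def __init__(self, write_frame):
        # write_frame is a hypothetical callable taking (command, value);
        # in the real system it would write one 4-byte frame to the pipe.
        self._write_frame = write_frame
        self._last_sent = {}

    def send(self, command, value):
        """Return True if a frame was written, False if it was skipped."""
        if command in self._last_sent and self._last_sent[command] == value:
            return False
        self._write_frame(command, value)
        self._last_sent[command] = value
        return True


if __name__ == "__main__":
    sent = []
    sender = ChangeOnlySender(lambda command, value: sent.append((command, value)))

    # Simulate one second in forwards drive: 20 cycles, 7 commands per cycle,
    # where only the (made-up) "steer" command changes from cycle to cycle.
    for cycle in range(20):
        commands = {"cmd%d" % i: 0 for i in range(6)}
        commands["steer"] = cycle % 3
        for command, value in commands.items():
            sender.send(command, value)

    print(len(sent), "frames written instead of", 20 * 7)
```

One caveat with this approach: if a frame is ever lost or corrupted on the serial line, the motor side keeps the stale value until the next change, so an occasional full refresh may still be needed.
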