Agile software development has learned about informative workplaces and stop-the-line quality management from lean manufacturing systems. Here is how I implemented these ideas on my last project:
All is well
This video shows how our project looked when all is well.
The large green light in the front shows the status of our build server. The green flashing lights show the status of our staging and production servers.
For the production server, the frequency of green blinks indicates the system load over the last 10 minutes as a percentage of the highest load ever observed. The occasional yellow blink shows the frequency of user errors; in our case, this usually means that a user has submitted an invalid file. The yellow frequency is calibrated against the same historic peak, so the relative frequency of yellow and green blinks is meaningful.
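The calibration against the historic peak can be sketched in a few lines of Ruby. This is a hypothetical reconstruction; the function name and rounding are my assumptions, not the project's actual code:

```ruby
# Hypothetical sketch: express a server's recent load as a percentage of
# the highest load ever observed, which then drives the blink frequency.
def blink_percentage(recent_load, highest_load_ever)
  return 0 if highest_load_ever.zero?  # no history yet: don't blink
  ((recent_load.to_f / highest_load_ever) * 100).round
end
```

Because both green (load) and yellow (user errors) are scaled against the same peak, comparing the two blink rates by eye gives a rough error rate.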
The staging server gets a copy of the production traffic, but delayed by a few minutes. It runs the next version to be released of the software.
If the light blinks red, there is a bug or an operations problem. I think. The thing is: the production server, and even the staging server, hardly ever shows a red light. Why? Because we weed out all the problems earlier in the process.
Fixing errors at the earliest possible time
The build light is key. When we check in code, the build server tries to build the code and run its tests:
If an error was introduced, the light turns red within 20 minutes.
This process was good enough to catch the large majority of the defects the team ever introduced. To catch the next batch, we introduced an automated test that ran real production data over a fully installed system after every build. Sadly, we didn’t find a good way to monitor this.
The technical details
The lamps were purchased from Delcom Products Inc. At $90 a pop, they are pretty expensive, but they are the only option we’ve found that will let you control lights directly using the USB port.
Delcom ships the lights with a driver as a DLL (an open source Linux driver is available, but I've never tried it). We wrote a wrapper in Ruby using the Win32API Ruby package and added a nice set of classes that let us write things like visual_indicators.red.blink(12, 24) to have the 3rd lamp's red LED stay on for 12/100 of a second and off for 24/100 of a second.
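A wrapper along those lines could look like the sketch below. This is my reconstruction, not the project's code: the class names are invented, and the call into Delcom's DLL via Win32API is stubbed out so the sketch stays portable.

```ruby
# Hypothetical sketch of the wrapper classes. In production, Lamp#send_command
# would call Delcom's DLL through Ruby's Win32API; here it just records commands.
class Led
  def initialize(lamp, color)
    @lamp, @color = lamp, color
  end

  # on_time and off_time are in 1/100ths of a second.
  def blink(on_time, off_time)
    @lamp.send_command(@color, :blink, on_time, off_time)
  end

  def solid
    @lamp.send_command(@color, :solid)
  end
end

class Lamp
  attr_reader :red, :yellow, :green, :commands

  def initialize
    @commands = []
    @red    = Led.new(self, :red)
    @yellow = Led.new(self, :yellow)
    @green  = Led.new(self, :green)
  end

  # Stub: record the command instead of talking to the USB device.
  def send_command(*args)
    @commands << args
  end
end

lamp = Lamp.new
lamp.red.blink(12, 24)  # red LED: 12/100 s on, 24/100 s off
```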
The status polling was done over HTTP. We had dedicated rules in the frontend web server to let the polling probe into the staging and production servers. These servers implemented the status as a servlet that would return something like red flash 10%; yellow off; green solid. The most challenging part on the Ruby side was converting 10% frequency flashes into meaningful on and off values for the blinks. The server would look at its work queue items (it was a batch server) for the last ten minutes to determine the status.
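A minimal sketch of that Ruby side might look as follows. The parsing and the percentage-to-duty-cycle mapping are my assumptions about how such a status line could be handled, not the project's actual code:

```ruby
# Hypothetical sketch: parse a servlet status line such as
# "red flash 10%; yellow off; green solid" into a per-color hash.
def parse_status(line)
  line.split(";").map(&:strip).each_with_object({}) do |part, status|
    color, mode, arg = part.split
    status[color.to_sym] = { mode: mode.to_sym, percent: arg && arg.to_i }
  end
end

# Turn a flash percentage into on/off times in 1/100ths of a second.
# Design choice (assumed): keep the on-time fixed so every blink is
# visible, and stretch the off-time as the percentage drops.
def flash_times(percent, on_time = 10, cycle = 100)
  return [0, cycle] if percent.zero?   # 0% means: stay dark
  [on_time, (cycle * 100 / percent) - on_time]
end
```

With this mapping, a 10% flash becomes roughly one short blink every ten seconds, while 100% blinks about once a second.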
The build server was originally implemented using Hudson. We found it difficult to distinguish a build starting from an error state (red blinks) from a build starting from an ok state (green blinks). This eventually led me to implement a 100-line Ruby script as our build server. The script would be triggered from cron every minute; if no other build was going on (determined by a lock file) and the Subversion repository had been changed since the last build, but not within the last 3 minutes, it would start a build.
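The gating logic of such a cron-triggered script can be sketched like this. The lock-file path and function names are assumptions; only the three conditions (lock file, new revision, 3-minute quiet period) come from the description above:

```ruby
# Hypothetical sketch of the build script's "should we build now?" check,
# run from cron once a minute.
LOCK_FILE    = "/var/build/build.lock"  # assumed path, not from the post
QUIET_PERIOD = 3 * 60                   # seconds: let a commit series settle

def should_build?(last_built_revision, head_revision, last_change_time, now = Time.now)
  return false if File.exist?(LOCK_FILE)                # another build is running
  return false if head_revision == last_built_revision  # nothing new to build
  now - last_change_time > QUIET_PERIOD                 # change older than 3 minutes
end
```

Waiting three minutes after the last change avoids building halfway through a multi-commit check-in.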
The lamp would get the build status by polling static files exposed over HTTP with Apache httpd.
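Polling a static file and mapping it to a light state takes only a few lines. The URL scheme and the file contents ("ok", "building", "failed") below are my assumptions; the post only says the status was exposed as static files over HTTP:

```ruby
require "net/http"
require "uri"

# Hypothetical mapping from the status file's body to a light state.
# The three keywords are assumed, not documented in the post.
def status_from_body(body)
  case body.strip
  when "ok"       then :green
  when "building" then :yellow
  when "failed"   then :red
  else                 :unknown
  end
end

# Poll a static status file served by Apache httpd.
def fetch_build_status(url)
  response = Net::HTTP.get_response(URI(url))
  response.is_a?(Net::HTTPSuccess) ? status_from_body(response.body) : :unknown
end
```

Serving a static file keeps the build server itself out of the polling path: the build script just writes the file, and httpd handles the traffic.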
An eye towards excellence
Seeing the status of the testing, staging and production environments gave us confidence that the systems were working. It also gave us a greater understanding of the problems faced by the operations people who would eventually take over responsibility for the solution.
Towards the end of the project, putting new software into production each iteration after one week in staging seemed more and more feasible from a technical point of view. We didn’t quite get there before I had to leave the project, but it was a goal worth striving for.