An informative workplace
Agile software development has learned about informative workplaces and stop-the-line quality management from lean manufacturing. Here is how we implemented these ideas on my last project:
All is well
This video shows how our project looked when all was well.
The large green light in the front shows the status of our build server. The flashing green lights show the status of our staging and production servers.
For the production server, the frequency of green blinks indicates the load on the system over the last 10 minutes, as a percentage of the highest load we have ever observed. The occasional yellow blink shows the frequency of user errors; in our case, this usually means that a user has submitted an invalid file. The yellow frequency is calibrated against the same all-time peak, which means that the relative frequency of yellow and green blinks is correct.
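As a minimal sketch of that calibration (the names and numbers here are invented; the real computation lived inside the servers, described under the technical details below):

    # Illustrative only: derive blink percentages for the last 10
    # minutes, calibrated against the highest load ever observed.
    def status_percentages(requests, user_errors, peak_requests)
      { :green  => 100 * requests    / peak_requests,  # load relative to all-time peak
        :yellow => 100 * user_errors / peak_requests } # errors on the same scale
    end

    # 450 requests and 30 invalid files against an all-time peak of 1000:
    status_percentages(450, 30, 1000)  # => {:green => 45, :yellow => 3}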
The staging server receives a copy of the production traffic, delayed by a few minutes. It runs the next version of the software to be released.
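As a purely hypothetical sketch of such a delayed copy (the log format, paths and host name are made up), a replayer can tail the production access log and re-issue each request against staging a few minutes after it happened:

    require 'net/http'
    require 'time'
    require 'uri'

    # Hypothetical sketch of a delayed traffic copy: replay each
    # production request against staging a few minutes later.
    DELAY = 5 * 60  # seconds

    File.open('/var/log/frontend/access.log') do |log|
      log.each_line do |line|
        timestamp, path = line.split("\t")    # assumed log format
        wait = Time.parse(timestamp) + DELAY - Time.now
        sleep(wait) if wait > 0
        Net::HTTP.get(URI.parse('http://staging.example.com' + path.strip))
      end
    end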
If a light blinks red, there is a bug or an operations problem. I think. The thing is: the production and even the staging servers hardly ever show a red light. Why? Because we weed out all the problems earlier in the process.
Fixing errors at the earliest possible time
The build light is key. When we check in code, the build server builds the code and runs the tests for it:
If an error was introduced, the light turns red within 20 minutes.
This process was good enough to catch the vast majority of the defects the team ever introduced. To catch the next batch, we introduced an automated test that ran real production data through a fully installed system after every build. Sadly, we didn't find a good way to monitor this test with the lights.
The technical details
The lamps were purchased from Delcom Products Inc. At $90 a pop they are pretty expensive, but they are the only option we've found that lets you control lights directly over the USB port.
Delcom ships the lights with a driver as a DLL (an open source Linux driver is available, but I've never tried it). We wrote a wrapper in Ruby using the Win32API Ruby package and added a nice set of classes that let us write things like visual_indicators[2].red.blink(12, 24) to have the 3rd lamp's red LED stay on for 12/100 of a second and off for 24/100 of a second.
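A minimal sketch of what such a wrapper could look like; the DLL entry point, its signature and the color codes below are hypothetical placeholders, not the vendor's documented API:

    require 'Win32API'

    # Sketch of the Ruby wrapper around the Delcom DLL. The DLL entry
    # point and the color codes are hypothetical placeholders.
    class Led
      SEND = Win32API.new('delcomdll', 'SendLedCommand', %w[p i i i], 'i')

      def initialize(device, color)
        @device, @color = device, color
      end

      # Times are in 1/100ths of a second.
      def blink(on, off)
        SEND.call(@device, @color, on, off)
      end

      def solid; blink(1, 0); end
      def off;   blink(0, 1); end
    end

    class VisualIndicator
      COLORS = { :red => 0, :green => 1, :yellow => 2 }  # hypothetical codes

      def initialize(device)
        @device = device
      end

      COLORS.each do |name, code|
        define_method(name) { Led.new(@device, code) }
      end
    end

    visual_indicators = ['USB1', 'USB2', 'USB3'].map { |d| VisualIndicator.new(d) }
    visual_indicators[2].red.blink(12, 24)  # 3rd lamp: red on 0.12s, off 0.24s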
The status polling was done over HTTP. We had dedicated rules in the frontend web server to let the polling probe into the staging and production servers. These servers implemented the status as a servlet that would return something like red flash 10%; yellow off; green solid. The most challenging part on the Ruby side was converting a 10% flash frequency into meaningful on and off values for the blinks. The server would look at its work queue items (it was a batch server) for the last ten minutes to determine the status.
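A sketch of that conversion, reusing the wrapper classes above; the scaling constants are assumptions:

    # Sketch: parse a reply like "red flash 10%; yellow off; green solid"
    # and turn a flash percentage into on/off times in 1/100ths of a second.
    FASTEST_ON, FASTEST_OFF = 12, 24   # assumed fastest practical blink = 100%

    def apply_status(lamp, status_line)
      status_line.split(';').each do |part|
        color, state, arg = part.strip.split
        led = lamp.send(color)           # lamp.red, lamp.yellow or lamp.green
        case state
        when 'off'   then led.off
        when 'solid' then led.solid
        when 'flash'
          percent = arg.to_i             # "10%" => 10
          next led.off if percent.zero?
          # Stretch the blink period so that 100% gives the fastest
          # practical rate and lower load blinks proportionally slower.
          period = (FASTEST_ON + FASTEST_OFF) * 100 / percent
          led.blink(FASTEST_ON, period - FASTEST_ON)
        end
      end
    end

    apply_status(visual_indicators[2], 'red flash 10%; yellow off; green solid')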
The build server was originally implemented using Hudson. We found it difficult to make Hudson distinguish between building after a failure (red blinks) and building after a success (green blinks). This eventually led me to implement a 100-line Ruby script as our build server. The script was triggered from cron every minute; if no other build was going on (determined by a lock file) and the Subversion repository had been changed since the last build, but not within the last 3 minutes, it would start a build.
The lamp would get the build status by polling static files exposed over HTTP with Apache httpd.
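A condensed sketch of such a script; the paths, repository URL and build command are illustrative:

    #!/usr/bin/env ruby
    # Sketch of the cron-triggered build script. Paths, the repository
    # URL and the build command are illustrative.
    require 'fileutils'
    require 'time'

    LOCK   = '/var/build/build.lock'
    STATUS = '/var/www/status/build.txt'  # polled by the lamp via Apache httpd

    exit if File.exist?(LOCK)             # another build is running

    last_built   = File.mtime(STATUS) rescue Time.at(0)
    info         = `svn info --xml http://svn.example.com/project/trunk`
    last_changed = Time.parse(info[%r{<date>(.*?)</date>}, 1].to_s) rescue exit

    # Build only if the repository changed since the last build,
    # but has been quiet for more than 3 minutes.
    exit unless last_changed > last_built && Time.now - last_changed > 3 * 60

    FileUtils.touch(LOCK)
    begin
      File.write(STATUS, 'building')
      ok = system('svn update /var/build/src && cd /var/build/src && ant test')
      File.write(STATUS, ok ? 'ok' : 'failed')
    ensure
      FileUtils.rm_f(LOCK)
    end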
An eye towards excellence
Seeing the status of the testing, staging and production environments gave us confidence that the systems were working. It also gave us a greater understanding of the problems facing the operations people who would eventually take over responsibility for the solution.
Towards the end of the project, putting new software into production each iteration after one week in staging seemed more and more feasible from a technical point of view. We didn’t quite get there before I had to leave the project, but it was a goal worth striving for.
[Embedded videos: build-ok, building, build-fail]
Comments:
j pimmel - Nov 25, 2008
Yes, very clear thanks!
j pimmel - Nov 23, 2008
That is a cool lo-fi way to highlight build status! Everyone will want this now ;) I certainly do!
One thing I wanted to ask which wasn't clear from your post: for the light whose flash frequency is linked to load relative to past measurements, is that the load on the build server, or the flashing load factor for the app while under a load test run?
Automated load/performance testing which can report back with some consistency is something we have often struggled with..
jhannes - Nov 23, 2008
Hi, Jerome
Thank you for your positive feedback.
I guess my description of the flashing (under technical details) could be better. We have a total of four test stages, only three of which are monitored with the lamps (the remaining one via email).
First, the build server runs the unit tests. The foreground lamp will flash while this is happening. It flashes with a set frequency which never changes. (But the flashing turns yellow when the build has been running longer than usual, a manually configured threshold.)
Second, every build is installed in a test environment and we run canned production data over it. The replay happens much faster than production, so we get some indication of how much load the system can handle. This test is not monitored with a lot of automation, just emails from the error logs.
Third, at the end of every iteration, we install a new version in the staging environment. The staging environment receives an asynchronous copy of all requests going into production. The environment is monitored by one of the lamps. We have looked at historical data to find the peak number of requests per 10-minute period, determined the fastest practical blink rate, and decided that the fastest rate should correspond to that peak. The flash rate of the lamp is the current traffic relative to the peak.
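For example (invented numbers): if the historical peak is 1,000 requests per 10-minute period and the fastest practical rate is two blinks per second, then 250 requests in the last 10 minutes is 25% of peak, and the lamp blinks once every two seconds.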
Lastly, we monitor production in the same way as staging.
Was this clearer?