Blogger: Ramon Krikken
A seemingly innocent question on a mailing list - which I paraphrased for brevity - set in motion a series of events with dire consequences. The specific code, which was generating error messages in a certain software quality assurance tool, happened to be a critical part of the random number generator in a cryptographic library package. Removing this code reduced the strength of the cryptographic key material to the point where cracking a key would take minutes instead of decades. The unfortunate thing about cryptography and randomness is that good and bad can be virtually indistinguishable, and in this case the result still looked so random that the problem went unnoticed for about two years. The impact - needing to regenerate two years' worth of key material, and casting doubt on encrypted communication and access performed with those keys - has understandably led to some vigorous discussion and finger-pointing. Search Google for "debian openssl" for more discussions than I can link to.
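To make the "looks random but isn't" point concrete, here is a minimal Python sketch. It is not the library's code: the 15-bit seed space, the hash-based derivation, and the names `weak_key` and `recover_seed` are all assumptions chosen for illustration. It only shows how a key drawn from a tiny seed space can look like noise and still be found by trying every seed.

```python
import hashlib
import random
from typing import Optional

# Assumption for illustration: only 2**15 distinct seeds are ever possible.
SEED_SPACE = 2 ** 15


def weak_key(seed: int) -> bytes:
    """Derive a 16-byte 'key' from a small seed (hypothetical, not real key generation)."""
    return hashlib.sha256(seed.to_bytes(4, "big")).digest()[:16]


def recover_seed(key: bytes) -> Optional[int]:
    """Try every possible seed until one reproduces the given key."""
    for seed in range(SEED_SPACE):
        if weak_key(seed) == key:
            return seed
    return None


if __name__ == "__main__":
    secret_seed = random.randrange(SEED_SPACE)
    key = weak_key(secret_seed)
    # The hex string looks like random noise to the eye...
    print(key.hex())
    # ...but an exhaustive search over the seed space finds it quickly.
    print(recover_seed(key) == secret_seed)
```

Nothing about the printed key hints at the weakness; only knowing how small the seed space is reveals that the search is feasible, which is why a flaw of this kind can go unnoticed for a long time.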
The action - making a change without following a standardized process - is certainly not unique to this situation, and "the system was slow so I turned off this feature", or "I just fiddled around with it and it just started working" are phrases all too commonly heard in many aspects of IT. Some might argue that a commercial development process would likely have prevented this occurrence, but to simply turn this into a comparison of open source and commercial development ignores some very important aspects. There are important lessons to be learned that could benefit any software development process, particularly when parts of the process are being adapted to encompass ever-changing development and security landscapes. In the ideal world, source code would be based on well-documented requirements, consistently structured, well commented, and maintained by easy-to-reach teams that understand the code inside and out. The reality of dealing with the pressure of delivery deadlines, distributed development teams, and code written either long ago or by a third party can make coding a daunting task ... and quality assurance next to impossible, especially if breakdowns in process or communication occur. The myriad of testing tools, sometimes producing output that can run to hundreds of pages, coupled with a lack of understanding about their testing coverage, doesn't make the task any easier.
Looking at how this specific event unfolded can lead us down many paths of analysis, all of which will provide valuable information in attempting to determine a root cause. Unfortunately - and this is something that is also not unique to any specific kind of environment - not all parties involved are neutral, and there can also be a tendency to fixate on symptoms rather than the cause. One reason for this may be the assumption that it's possible to fix specific parts of a process without necessarily re-evaluating the process as a whole; another is that risks and the resulting need for assurance, including process assurance, may be underestimated. Looking at the failures in the flaw-finding process purportedly followed in the Therac-25 accidents, it's easy to see how this can result in unacceptable consequences. And while likely not resulting in loss of life, the potential economic loss associated with a failure of a cryptographic module suggests that a critical security component can't be treated like just any other piece of software.
However unfortunate, this event presents a good opportunity to take a moment and look at our own development processes. Particularly as we start to embrace service orientation, where we loosely couple different business functions while relying on centralized, and often externally developed, security and reliability services, we increase the possibility of creating situations such as this. Using a risk-based process, and testing and revisiting the process itself to ensure it stays current, will be vital in providing appropriate levels of software, system, and information assurance. Building a high-assurance component using a low-assurance process just isn't worth the risk.





