Monday, March 25, 2013

Broad-Spectrum Dogfooding, or Why I Miss Jurassic.

I think most of you dozen readers know what I mean, when I refer to dogfooding. Some people think of Microsoft when they hear the term, but I first heard it from the same person via his being a Sun customer, AND via my old roommate, who worked for him.

I saw this Tweet last week:

I then checked out the blog post. It dealt with how an iSCSI LAN can be a failure point, partially due to the weakness of the ones-complement TCP/IP checksum

Reading this reminded me of an old bug we found in Sun with either NFS or an ethernet device driver, and the only way we caught it was by using IPsec (AH particularly) and seeing packets fail the authentication check. The corrupt NFS packets had 16-bits worth of 1 (0xffff), where it should have had 16-bits worth of 0 (0x0000). Using the standard TCP/IP checksum, there's no difference between those two values, no matter where they fall in the packet. Using IPsec, however, even with HMAC-MD5, showed the packet failure clearly when the packet authentication check failed. This bug wouldn't have been discovered were it not for the Solaris Team's big honking server, jurassic, and how its multiple concurrent uses interacted with each other.

Even before there was OpenSolaris, people knew about jurassic. Solaris people's (not any old Sun people... Solaris people) posts on IETF mailing lists often showed user@jurassic. Jurassic served as the NFS source of home directories, and until the early 2000s e-mail inboxes as well. Every two weeks the in-development Solaris build would be placed upon jurassic. As a Solaris developer, if your changes broke jurassic, you fixed those changes immediately, or risked getting your changes yanked out. Not breaking jurassic was a great motivator for code quality. Also, if you had a new feature, you wanted it used on jurassic, even if not by everyone.

Once the basic IPsec protocols - AH & ESP - went into Solaris 8, I convinced the jurassic maintainers to protect all traffic between jurassic and a couple of workstations. One was mine, naturally. I encrypted all of my traffic to jurassic. Since we only had 100Mbit in our building at that time, the performance hit wasn't too bad, relatively speaking. Another belonged to an NFS developer, who I'd somehow convinced to run AH, because I was already running ESP (and AH used less cycles for protection). It was this NFS developer, surprised he wasn't getting data corruption while other were, who helped suss out the bug in question.

At this point, I'd like to have a moment of silence for all of the made-public Solaris information that Oracle has since put back in its box. I could've had a bug id here, folks, A REAL BUG ID!!!

So for a few of us, jurassic also served as an IPsec testbed. It also was helpful in determining that nobody else's cleartext performance dropped while a few of us were running with network traffic (put more succinctly, connection policy latching worked). Other services would run on jurassic as well: DNS, IMAP, and others I'm sure I'm forgetting. Jurassic core dumps eventually would be used to test out the then-new mdb (oh, those early ::findleaks results...), and I'm sure more than a few DTrace scripts helped diagnose some jurassic-discovered bugs.

At Nexenta, we make a dedicated storage appliance. Naturally, we use them inside where appropriate. We Nexentians (especially the ones in Lowell) use Illumos from other distributions for even greater effect. My Illumos Home Data Center talk touches upon these at about 10:43 in. We use Illumos to host VMs (Thank you Joyent), we use it for site-to-site VPNs, we will be using it for public services at some point, and everything I mentioned all runs on Illumos. It's not quite the magnifying glass Jurassic was, but we do what we can.

I believe Oracle still has jurassic around, I know it did prior to my 2011 departure. I suspect it's helping Oracle Solaris even today. I suspect, however, that a less dense, but more widely instantiated broad-spectrum dogfooding continues on in Illumos today.

1 comment:

  1. Yep, still going strong:
    SunOS jurassic 5.12 s12_17 i86pc i386 i86pc