My cell phone cuts out for a while during my drive to work, I lost the
first draft of this blog posting when I accidentally pressed the wrong
button, and I could probably use up most of my free time if I fixed all
of my friends’ and relatives’ blue screens, spyware
infestations, and other computer woes. I can’t stand it! But it amazes
me that most people seem to accept these inconveniences as part of life
in the high-tech world.
I don’t think that’s right — we need to expect more from our
high-tech products. A few companies do deliver great service
(when was the last time you saw Google down?) but most seem to have an
attitude that a little downtime is okay. For example, my main financial
web site takes its systems offline for hours of maintenance on Sunday
nights — just the time I finally get around to taking care of my
I don’t accept this for QuickBase. QuickBase should be "always on",
available with your data when you need it. For example, if your team
meeting presentations are stored in QuickBase, you can’t afford to have
it go offline in the middle of your meeting.
We think a lot about QuickBase reliability and performance. I hope
you don’t. I hope you expect it to "just work"; even in the middle of
the night when you’re entering your latest inspiration into your
QuickBase To Do List (you do use QuickBase to manage your task list,
Over the last decade, I’ve developed Internet services using
thousands of servers and supporting hundreds of major web sites. Here are my 5 keys to creating successful 7×24 (7 days/week,
24 hours/day) services. We apply all of them here, day in and day out.
- No Tolerance for Downtime
You can’t fix a problem
until you accept that you have a problem. Your whole organization from
senior management through product developers to data center operators must detest
downtime. Must detest it more than their customers do. You need that kind
of emotion to get you out of bed when the pager goes off at 3AM.
A few years ago, when we needed to restart QuickBase for a software
patch, it was offline for 10 minutes. Pretty good, but just not
acceptable to us, so we reduced it to less than a minute. Still too
long. As of our June release, most software upgrades can be done with
no downtime at all.
- Test, Test, Test (aka Don’t Trust the Vendors)
Nothing works the first time. Okay, that’s not always true but it’s a
good approximation to the truth. During my time with QuickBase, I think
we’ve endured failures of every single one of our "highly available"
and redundant pieces of equipment. One dirty secret of "high
availability" systems is that the additional redundancy often just
adds complexity and more ways that things can break.
The only way to prevent this is to test every system with realistic
loads and (temporarily) break each component to see how the system reacts. We do this
on everything from the power grid (it’s quite impressive to hear the
semi-truck-sized diesel generators start up when the utility power is
shut off!), to servers, to network, to each software release. Before you see a new software release from us, we’ve already subjected it to full production-level loads.
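To make the "break each component" idea concrete, here is a minimal sketch in Python. It is not QuickBase's actual test harness: the `Replica` class, the `serve` function, and the `break_each_component` drill are all hypothetical names invented for illustration. The sketch takes one redundant component down at a time, sends a request, and records whether the service as a whole still answered.

```python
from typing import Dict, List


class Replica:
    """A hypothetical redundant component that can be switched off for a drill."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(replicas: List[Replica], request: str) -> str:
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue  # fall through to the next replica
    raise RuntimeError("all replicas failed")


def break_each_component(replicas: List[Replica]) -> Dict[str, bool]:
    """Take one replica down at a time and record whether service survived."""
    results: Dict[str, bool] = {}
    for victim in replicas:
        victim.healthy = False  # the temporary, deliberate failure
        try:
            serve(replicas, "ping")
            results[victim.name] = True
        except RuntimeError:
            results[victim.name] = False
        finally:
            victim.healthy = True  # always restore the component afterwards
    return results


if __name__ == "__main__":
    print(break_each_component([Replica("web-1"), Replica("web-2")]))
```

A drill like this reports a failure for any component that has no working backup, which is exactly the kind of surprise you want to find in a test rather than in production.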
- Log Everything
Stuff happens… However, you have to make sure that you know it happened and how it occurred. Good system status logging saves a lot of time and money: the faster you can identify a problem, the faster you can fix it.
QuickBase has the best logging of any web service I’ve seen. We’ve made it easy for the developers to add diagnostic logging into QuickBase without slowing performance. Because of this, we usually know about problems and potential problems long before they affect customers.
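One common way to add diagnostic logging without slowing down request handling is to hand log records to a queue and let a background thread do the slow writing. The sketch below shows that pattern with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. It is an assumption about how such logging could be built, not a description of QuickBase's internals; `make_async_logger` and `ListHandler` are hypothetical helpers.

```python
import logging
import logging.handlers
import queue
from typing import List, Tuple


class ListHandler(logging.Handler):
    """In-memory sink used only for demonstration; a real sink would write to disk."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record.getMessage())


def make_async_logger(
    name: str, sink: logging.Handler
) -> Tuple[logging.Logger, logging.handlers.QueueListener]:
    """Return a logger whose calls only enqueue; a background thread feeds the sink."""
    log_queue: queue.Queue = queue.Queue(-1)  # unbounded queue
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger's handlers
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    sink = ListHandler()
    logger, listener = make_async_logger("demo.diagnostics", sink)
    logger.warning("disk %s at %d%% capacity", "sda", 91)
    listener.stop()  # drains the queue before returning
    print(sink.records)
```

The calling code pays only for putting a record on the queue, so developers can log generously without adding much latency to the request path.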
- Efficient Crash/Problem Recovery
When the rare problem does occur, we’ve worked to ensure that QuickBase minimizes its impact and records the information we need to quickly find and fix the problem (including the exact source code line number for software problems). Although we prefer to have our operators and developers focused on building out ever faster servers and adding more new features, they are a great team in a crisis and focus on safely restoring service ASAP. They save the "could’ves" and "should’ves" for later, at the "Lessons Learned" sessions we hold to review each significant outage.
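As an illustration of recording the exact source line for a software problem, here is a small Python sketch using the standard `traceback` module. The `describe_failure` helper and the shape of its report are assumptions made for this example; the post does not say how QuickBase captures this information.

```python
import traceback
from typing import Any, Dict


def describe_failure(exc: BaseException) -> Dict[str, Any]:
    """Summarize an exception, including the file and line where it was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    report: Dict[str, Any] = {
        "error": type(exc).__name__,
        "message": str(exc),
        "file": None,
        "line": None,
        "function": None,
    }
    if frames:  # empty if the exception was never actually raised
        last = frames[-1]  # innermost frame: where the error occurred
        report.update(file=last.filename, line=last.lineno, function=last.name)
    return report


if __name__ == "__main__":
    def faulty() -> int:
        return 1 // 0

    try:
        faulty()
    except ZeroDivisionError as err:
        print(describe_failure(err))
```

A report like this can be written to the diagnostic log at the moment of failure, so whoever answers the pager starts with a file name and line number instead of a guess.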
- No Single Point of Failure
While the above keys set the stage for a reliable service, a successful 7×24 architecture must avoid any single point of failure. One of my jobs is to regularly review our architecture to make sure that ANY component can fail without compromising QuickBase’s reliability.
We don’t have just one of anything in QuickBase, and we generally have more than two. Everything is backed up: servers, power cords, network connections, storage, CPUs, even our diesel generators.
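A bit of arithmetic shows why having more than one of everything matters. The sketch below assumes that redundant copies fail independently of each other, which is an optimistic simplification (as noted earlier, redundancy also adds complexity and new ways to break). The availability figures and function names are illustrative assumptions, not QuickBase measurements.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of a group that works as long as at least one copy works.

    Assumes copies fail independently, which real systems only approximate.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a 365-day year at the given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = combined_availability(0.99, copies)
        print(copies, f"{a:.6f}", f"{downtime_minutes_per_year(a):.1f} min/year")
```

Under these assumptions, each extra copy of a 99%-available component cuts the expected downtime by a factor of one hundred, which is why a single unduplicated component tends to dominate the outage budget.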
I hope you expect a lot from us! We expect a lot of ourselves.
P.S. Our Operations manager will hate this post. Every time I brag about QuickBase’s reliability, disaster seems to befall us. So if QuickBase goes down in the next few days, it’s my