5 Keys to 7 x 24

My cell phone cuts out for a while during my drive to work. I lost the
first draft of this blog posting when I accidentally pressed the wrong
button. And I could probably use up most of my free time if I fixed all
of my friends’ and relatives’ blue screens, spyware infestations, and
other computer woes. I can’t stand it! But it amazes me that most people
seem to accept these inconveniences as part of life in the high-tech
world.

I don’t think that’s right — we need to expect more from our
high-tech products.  A few companies do deliver great service
(when was the last time you saw Google down?) but most seem to have an
attitude that a little downtime is okay. For example, my main financial
web site takes its systems offline for hours of maintenance on Sunday
nights — just the time I finally get around to taking care of my
financial health.

I don’t accept this for QuickBase. QuickBase should be "always on",
available with your data when you need it. For example, if your team
meeting presentations are stored in QuickBase, you can’t afford to have
it go offline in the middle of your meeting.

We think a lot about QuickBase reliability and performance. I hope
you don’t. I hope you expect it to "just work", even in the middle of
the night when you’re entering your latest inspiration into your
QuickBase To Do List (you do use QuickBase to manage your task list,
don’t you?).

Over the last decade, I’ve developed Internet services using
thousands of servers and supporting hundreds of major web sites. Here are my 5 keys to creating successful 7×24 (7 days/week,
24 hours/day) services. We apply all of them here, day in and day out.

  1. No Tolerance for Downtime
    You can’t fix a problem
    until you accept that you have a problem. Your whole organization from
    senior management through product developers to data center operators must detest
    downtime. Must detest it more than their customers do. You need that kind
    of emotion to get you out of bed when the pager goes off at 3AM.
    A few years ago, when we needed to restart QuickBase for a software
    patch, it was offline for 10 minutes. Pretty good, but just not
    acceptable to us, so we reduced it to less than a minute. Still too
    long. As of our June release, most software upgrades can be done with
    no downtime at all.

  2. Test, Test, Test (aka Don’t Trust the Vendors)
    Nothing works the first time. Okay, that’s not always true but it’s a
    good approximation to the truth. During my time with QuickBase, I think
    we’ve endured failures of every single one of our "highly available"
    and redundant pieces of equipment. One dirty secret of "high
    availability" systems is that the additional redundancy often just
    adds complexity and more ways that things can break.
    The only way to prevent this is to test every system with realistic
    loads and (temporarily) break each component to see how the system reacts. We do this
    on everything from the power grid (it’s quite impressive to hear the
    semi-truck-sized diesel generators start up when the utility power is
    shut off!), to servers, to network, to each software release. Before you see a new software release from us, we’ve already subjected it to full production-level loads.

  3. Log Everything
    Stuff happens… However, you have to make sure that you know it happened and how it occurred. Good system status logging saves a lot of time and money: the faster you can identify a problem, the faster you can fix it.
    QuickBase has the best logging of any web service I’ve seen. We’ve made it easy for the developers to add diagnostic logging into QuickBase without slowing performance. Because of this, we usually know about problems and potential problems long before they affect customers.
     
  4. Efficient Crash/Problem Recovery
    When the rare problem does occur, we’ve worked to ensure QuickBase minimizes its impact and records the information we need to quickly find and fix the problem (including the exact source code line number for software problems). Although we prefer to have our operators and developers focused on building out ever faster servers and adding more new features, they are a great team in a crisis and focus on safely restoring service ASAP. They save the "could’ves" and "should’ves" for later, at the "Lessons Learned" sessions we hold to review each significant outage.
        
  5. No Single Point of Failure
    While the above keys set the stage for a reliable service, a successful 7×24 architecture must avoid any single point of failure. One of my jobs is to regularly review our architecture to make sure that ANY component can fail without compromising QuickBase’s reliability.
    We don’t have one of anything in QuickBase, and we generally have more than two. Everything is backed up: servers, power cords, network connections, storage, CPUs, even our diesel generators.
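
Keys 2 and 5 can be illustrated together with a small Python sketch. This is only a toy model under stated assumptions, not QuickBase's actual code: the names Replica, RedundantService, and single_points_of_failure are hypothetical. It temporarily breaks each component in turn and reports any component whose loss takes the whole service down.

```python
# Toy model only: every name here is hypothetical and chosen for illustration.

class Replica:
    """One redundant component (a server, a network link, a power feed...)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class RedundantService:
    """Tries each replica in turn, so one failure does not cause an outage."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request):
        errors = []
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError as exc:
                errors.append(str(exc))
        raise RuntimeError("all replicas failed: " + "; ".join(errors))


def single_points_of_failure(service):
    """Break each replica (temporarily) and list those whose loss is an outage."""
    culprits = []
    for replica in service.replicas:
        replica.healthy = False
        try:
            service.handle("ping")
        except RuntimeError:
            culprits.append(replica.name)
        finally:
            replica.healthy = True  # always restore the component afterwards
    return culprits


if __name__ == "__main__":
    service = RedundantService([Replica("a"), Replica("b"), Replica("c")])
    print(single_points_of_failure(service))
```

With three replicas the check reports nothing, while a service built on a single replica reports that replica by name. A real test would also apply realistic load while each component is down, which this sketch leaves out.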

I hope you expect a lot from us! We expect a lot of ourselves.

– Jim

P.S. Our Operations manager will hate this post. Every time I brag about QuickBase’s reliability, disaster seems to befall us. So if QuickBase goes down in the next few days, it’s my fault. :-)

Sorry!

  • Bradley

    Thanks for the posting. I cannot get into quickbase now.

  • Jim Salem

    Hmmm… QuickBase was up and running fine at the time of your message (2:14PM yesterday in all time zones).

    Most of the time, when customers report that QuickBase is unresponsive, it’s due to problems in their local network or with their ISP. However, if you believe it is a problem on our end, I encourage you to report it via our support link so we can track it down. Helpful info is:
    * What kind of response (if any) did you get from QuickBase?
    * Were you able to connect to other Internet sites (e.g., http://www.google.com)?
    * Were you able to connect to other Internet sites via HTTPS (e.g., https://www.fidelity.com)?
    * Were you able to connect to other Intuit sites (e.g., http://www.intuit.com)?
    * What is your web location (i.e., your Internet firewall’s IP address)?
    * If you can, include the results of doing a “traceroute” from your location to www.quickbase.com. This can tell us where the network problem occurred. On Windows you can do this by bringing up a “Command Prompt” and typing ‘tracert www.quickbase.com’ (the command takes a host name, not a full URL). BTW, it’s normal that this report will end with a series of stars because traceroutes are blocked by our firewall.

  • Ted Stephens

    I have been looking on the QuickBase site for any SLA in terms of uptime. I am trying to convince a client to use QB to replace some Excel spreadsheets, but they want to know what they can expect in terms of availability. I looked in the terms of service but didn’t see anything. Can you point me in the right direction, or do I need to speak to sales?

  • Jim Salem

    Sales would be your best bet to discuss Service Level Agreements.

  • David Hoffman (http://www.BostonLawCollaborative.com/)

    My law and dispute resolution office (Boston Law Collaborative, LLC) has about a dozen people using Quickbase from time to time during business hours for conflict-of-interest checking and case management, and a few of us use it at night. I have never had trouble accessing it – ever. I have come to take for granted that, unlike many of the other systems that I use, Quickbase is 100% available all the time. Reading your post gives me some appreciation for why that is. Thank you for all the effort you put into it.
