Over the weekend, we had to quickly undo a mistake we made. We broke some usages of the "rdr" (redirect) option for QuickBase API calls. We fixed the problem as soon as we were able to reproduce it.
"What’s an rdr?", you ask. It’s a cool developer feature typically used when QuickBase provides backend database services for a customer’s application. It’s documented along with the other API features on the Developer Info Page. After going to that page, click on "API Reference Documentation".
I thought you might enjoy a "behind the scenes" peek at this example of our process for rolling out upgrades and fixing critical bugs. It’s fairly representative of how we fix any significant user-visible problems.
First, mea culpa. While multiple people and automated tests cross-check our software before it goes into production, the bug was in an area of the product I’m responsible for and it should never have happened. Fortunately, in these cases the QuickBase team focuses on fixing problems fast and learning the lessons that will prevent them from recurring.
Back in October and November, we upgraded to a new version of our web hosting platform. This version provides more reliability and scalability. Unfortunately, the vendor introduced a new limitation which prevents us from accurately logging performance. Although customers don’t directly see this data, performance logging is very important to our system health monitoring and capacity planning. After working with the vendor to resolve the issue, we decided to change the interface technology we use. In addition to fixing the performance log problem, the new technology shows a slight speed improvement and should be significantly more scalable.
We implemented the new technology in our code, then tested and deployed it in our pre-production environment at the beginning of January. The change was not expected to have any impact on customers, so the testing was limited primarily to our automated test suites and ad hoc testing done while examining other features. Unfortunately, testing for the "rdr" feature was not included in our standard automated test suite, which focuses on the most common user tasks.
We deployed the change on 1/3 of our web servers last Wednesday night. The release went smoothly (so we thought) and there were no reported problems. Given its success, we made the change to the rest of the web servers last Friday night. Even though we expected no downtime, we still like to install upgrades during off-peak hours (generally, nighttime in the U.S., and more specifically late Friday night). This gives us a chance to hear of any bugs before the heavy usage starts on Monday.
SOMETHING’S NOT WORKING
We first received a report of an unexplained API problem at about 10:45pm on Saturday. I reviewed the issue; however, on the surface it seemed unrelated to the upgrade. Unfortunately, there was not enough information in the report to reproduce the problem, so I requested additional detail. We received a second problem report from a different customer at 5:30pm Sunday, this time with enough information to reproduce the issue.
To correct the problem, we rolled back to the earlier version of our software by 6:30pm — problem solved.
On Monday and Tuesday, we corrected the issue in our software and did some thorough testing of the "rdr" feature. Going forward we will test this feature as part of our automated test suite. Last night (Tuesday), we redeployed the new interface and so far everything looks good.
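For readers curious what adding the feature to an automated suite might look like, here is a minimal sketch of a regression test. The helper function and URLs are hypothetical stand-ins, not our actual test code; the point is simply that the redirect target must survive the round trip intact.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_rdr_url(base, action, redirect_to):
    """Assemble an API call URL carrying an rdr parameter.
    (Hypothetical helper, for illustration only.)"""
    return base + "?" + urlencode({"a": action, "rdr": redirect_to})

def test_rdr_round_trips():
    # Regression check: the redirect target must come back out of the
    # query string exactly as it went in.
    url = build_rdr_url("https://www.quickbase.com/db/abcd1234",
                        "API_AddRecord",
                        "https://example.com/done.html")
    query = parse_qs(urlsplit(url).query)
    assert query["rdr"] == ["https://example.com/done.html"]

test_rdr_round_trips()
```

A test this small runs in milliseconds, so there is no excuse for leaving a feature like this out of the standard suite again.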
I want to highlight a few things about our process. First, we do a lot of testing before new software goes into production. We know customers rely on us 24×7 and work hard to ensure upgrades do not change an end-user’s experience of a QuickBase application.
Second, we time our releases to have the least impact on customers. Even if we expect no customer impact, we still make software changes during off hours to leave time for problems to be discovered, reported, and fixed before our peak hours.
Third, we react quickly to problems. Our support team monitors QuickBase 24 hours a day and is not hesitant to call any team member at any hour to resolve serious issues. Most problems with broad impact are resolved within minutes or a few hours. For bugs that have simple workarounds or only affect new product features, we will typically wait for a minor patch release (every few weeks) or a major release (every few months).
Fourth, we build contingency plans into all our releases. We have a reliable "roll-back" strategy which lets us revert to an earlier version if a catastrophic problem occurs.
HELP US TO HELP YOU
If you notice something suddenly changes with your application, let us know by opening a support case!
Only rarely can we fix something if we can’t reproduce it. So make sure to give us enough information. This includes data such as:
- The URL of the page that produced the incorrect data and information on how you got there.
- The date and time you noticed a change in the application’s behavior.
- The browser type and version you are using.
- Any recent changes you made to the application.
The easier you make it for us to reproduce the problem, the faster we can react. In some cases, we may ask you to grant us access to the application. Our support team has no ability to look at your application unless you specifically grant them access.
While most reported issues are not software bugs requiring immediate patches, they often highlight areas of the product that are difficult or confusing and that we need to improve in a future release.
Even though this problem affected only a few users, it is an example of a type of issue that we take very seriously. We know our customers expect us to be available at any time, and we are committed to meeting or exceeding your expectations. Therefore, we react extremely quickly to any issues that affect data integrity or security, or those that have a big impact on the end user.
Wishing you all clear sailing with your QuickBase applications.