Recent Posts

Inexpensive solid state drives and RAID open up exciting opportunities in performance - especially for databases.

I wanted to say a few words about some system performance issues we had last Friday.  Our clients all know about this already, but in the spirit of transparency that we encourage in online communities, I thought I'd share the events from late Thursday night.   

Thursday night was a big step for us.  The data center upgrades we had planned were supposed to give us seven times the horsepower of our existing data center hardware and configuration, which we need to keep up with our fast-growing customer base.  Our load testing confirmed that we would get the performance and capacity upgrade we were looking for.  In addition to adding hardware, we were upgrading our database to the latest version, which brings significant performance and scalability improvements.

As some background, Awareness had recently upgraded our data center as part of our new July 23rd release.  In that release, memcached servers were put in place to take load off our database servers.  This is how companies like Facebook meet the demand for extremely dynamic content from millions of customers.  That upgrade went off without a hitch.
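For readers unfamiliar with the idea, here is a minimal sketch of the general read-through caching pattern that memcached is commonly used for. It is not our production code: the `TTLCache` class is a hypothetical in-process stand-in for a memcached client, and `fake_db_lookup`, the `post:42` key, and the 60-second expiry are all made up for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self) -> None:
        # Maps key -> (expiry timestamp, cached value).
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)


def read_through(
    cache: TTLCache,
    key: str,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Return the cached value, querying the database only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    cache.set(key, value, ttl_seconds)
    return value


# Illustrative usage: count how often the "database" is actually queried.
db_calls = []


def fake_db_lookup(key: str) -> str:
    """Made-up database query that records each call."""
    db_calls.append(key)
    return f"content for {key}"


cache = TTLCache()
first = read_through(cache, "post:42", fake_db_lookup)   # miss: hits the database
second = read_through(cache, "post:42", fake_db_lookup)  # hit: served from cache
```

In this sketch the second read never reaches the database, which is where the load reduction comes from. When content changes, the application would typically delete or overwrite the cached entry so readers do not see stale data.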

Thursday night's upgrade required that we "redirect" our client URLs to a whole new set of servers.  This also allowed us to keep the original servers "as-is" for a roll-back.  The switch-over went very quickly, but early on we saw performance issues arise.  At the same time, we saw the desired performance improvements in other aspects of the system.  The performance characteristics we experienced were so different from what we saw during our load testing that we wanted to work with our database vendor's support engineers to troubleshoot before committing to rolling back.  All indications were that this was some sort of configuration error, but they still have not been able to pinpoint exactly what the issue is.

We notified our customers ahead of time about the maintenance window, and during the event we sent them a number of communications.  Here is an excerpt from one of them:

Throughout Friday, we progressed as much as possible with every configuration, hotfix, and tuning change possible without taking the entire platform offline.  We have worked with our database vendor's support engineers on a series of off-hour tasks which are being performed starting midnight Friday night / early Saturday morning.  

This will take a couple of hours.  At the end of this set of changes we have our benchmark tests to validate and/or isolate bottlenecks.  We will review this with our database vendor once we are complete.  As soon as we have completed our work we are able to re-establish support with our database vendor's support engineers on this case as a Severity 1 incident and will continue to have their support around the clock.

In the end, after watching the situation closely for 24 hours, we decided to roll back.  The impact varied by client.  In general, users creating "new content" were more impacted than those who were just reading it.  We are cleaning up any outstanding issues now.

Of course, this kind of scenario is regrettable and something we try to avoid at all costs.  We will plan a future date to complete the upgrade, but we will first work with our database vendor to establish what the problem was before attempting to bring that set of servers online again.



Doug Caldwell

Doug Caldwell has over 2 decades in the IT industry. Prior to joining Awareness Inc, Doug held the position of CIO and Vice President of Information T...

