Yesterday evening (Friday, 2/26/16) we experienced a Cloud Block Storage volume failure impacting some sites on The LexBlog Platform and all sites on the Premier Managed Platform. Cloud Block Storage provides additional storage capacity for the environment as publisher and visitor demands increase.
This initial issue triggered an unusually long incident, which we want to explain in full.
At approximately 3:10 p.m. Pacific time Friday, we received our first alerts from our system monitoring that some blogs were offline. Our incident response team immediately began investigating the cause and working toward a resolution.
Within 25 minutes, our team was able to restore service.
What we did not immediately realize, however, was that the initial failure of the Cloud Block Storage volume had caused a second issue: it took down our database master server.
As many members know, we maintain multiple redundant servers for situations like this, which is one reason we are able to restore service quickly whenever there is an issue. In this particular instance, the redundant servers fell out of sync because the database master server was down.
With the redundant application servers out of sync, publishers and readers experienced a range of issues, including publishing errors and redirect loops, as we reported in an update.
Our team then worked to re-sync all application servers. All but one re-synced quickly, and all were completely re-synced by 10:50 p.m. Pacific.
As a result of this issue, we recognize the need to fully test publishing and run automated checks after an outage of any kind. This is now a standard part of our incident response process.
Additionally, we are working with our cloud server provider to better identify warning signs and to restore full performance of our publishing software for members as quickly as possible in the future.
Without the redundancy of our servers, strategic technology partnerships, and refined incident response process, this issue would likely have resulted in several hours of downtime instead of the initial 25 minutes.
That said, we know this outage and the subsequent intermittent issues impacted many members' ability to publish timely content. We are very sorry for this impact.
Thank you for your continued membership and understanding as our team worked to resolve this issue.
Sincerely,