Anatomy of Storage on the GRID
Released in October 2006, our GRID platform was a first-of-its-kind hosting service, created to address the limitations imposed by traditional shared-hosting technology.
The GRID succeeded at allowing instantaneous increases in resources when needed, but in doing so it pushed the boundaries of what was possible for a $20 monthly service plan. The rapid adoption of the platform led to numerous growing pains in every aspect of the system. In 2007 and 2008 we fixed hundreds of issues and produced stable clusters for the first time.
In 2009, new customers are being provisioned on 2nd-generation systems that are reliable and stable. We regret to admit, however, that we still have 1st-generation architecture in production for a large number of our older customers.
Last weekend’s storage-related incident exposed this condition very publicly, so we have taken some time to explain ourselves more thoroughly in this article.
A road to better transparency
Customers may remember that early generations of the GRID required significant changes to the way MySQL operated. The initial uptime on this service segment was not very good. Ultimately we solved the database problems, and our discoveries led to the development of some truly unique auto-scaling technology. That series of events pushed us to become more transparent and to write the article “Anatomy of MySQL,” which helped our customers understand our systems much better. We also made a full commitment to our Incident Status System, which has now tracked over 200 public-facing incidents.
We have been successful at improving transparency, but our customers are asking for more information, and we intend to provide it. While our incident reports have delivered a better level of accountability, they have fallen short of satisfying the deeper concerns our customers have regarding the ongoing storage problems.
Our oldest customers (the ones who tend to be early adopters, and our most loyal) have been the group most seriously affected by the storage issues of our 1st-generation architecture. This does not make us happy. Our original transition schedule has not worked out as planned, and many factors have delayed the migration of these customers to more reliable technology.
We’d like to help you understand what’s going on now.
1st Generation vs. 2nd Generation
(mt) caters to demanding customers, so our storage systems need as much performance as possible. For the original GRID storage architecture we selected BlueArc’s Titan hardware, which continues to power our 1st-generation Clusters 1 and 2. Beginning with Clusters 3, 4, and 5, (mt) chose Sun Thumper and Thor equipment.
1st-Generation Architecture
(where last weekend’s incident occurred)
At the time, BlueArc Titan was the fastest storage technology available. Our research indicated that the system was extremely redundant internally: every cable, controller, disk, and front-end head was duplicated. However, even with all of that failover protection, we still had numerous issues with firmware instability and crashes. Because every component was redundant, we assumed the system was protected from failures; in our opinion, however, there are three major reasons why downtime still occurs.
- Unreliable Failover
In the case of a crash, our experience was that failover took an exceptionally long time (5-10 minutes). Some of the crashes, such as the one last weekend, exhibited additional issues. Our assessment is that the bug that caused the first HEAD to crash (in this case, a corrupted filesystem) would cause the second HEAD to crash as well, essentially bringing the redundant system fully offline. This is not cheap equipment; we expected it to work.
- Lack of OS Independence
Originally we created a massive storage pool to serve both the cluster node operating system and User Data. We trusted this coupling because of the internal redundancies inherent in the BlueArc Titan, and it also served an efficiency goal by reducing each cluster’s power footprint. In practice, when there were storage problems, every public-facing server was likely to crash, and engineers had to handle cluster node recovery on top of the storage issue itself. This design was a mistake on (mt)’s part and the practice has been replaced with a much better method.
- Complicated Upgrades & Maintenance
The firmware version in our BlueArc Titan makes upgrades take an extremely long time and require full-cluster downtime. This has led to maintenance windows far longer than we (or you) want for your services.
BlueArc Titan is an extremely robust system that is fantastic at many things, and the company’s engineering and support infrastructure is top-notch. However, we have had too many core issues and have been forced to rethink our storage architecture completely. BlueArc is a tremendous company with a top-tier product, but, in our opinion, it is not the right solution for our needs.
2nd-Generation Architecture
Our new-generation Clusters 3, 4, and 5 combine a new storage design with more flexible storage technology powered by Sun Microsystems. Still fully hardware-redundant at every level, this combination makes the new architecture far more reliable. It is currently being rolled out transparently to Cluster 1 and Cluster 2.
- Decoupled OS & Storage Segment Isolation
If one part of the storage network has a problem, such as a runaway user process causing high disk I/O, the problem is contained and cannot affect customers on other segments. The root OS also remains completely isolated, so there is no degradation in cluster node performance. This, combined with a smaller number of customers per storage segment, makes for a far more reliable system. A problem with one segment (such as the one we had this weekend) is much less likely to cause global problems.
- Better Caching
People look at your site a lot more often than you change it, so we can cache quite a bit of your content for you in the Storage Segments. Spinning mechanical disks are slow. We have increased our levels of cache more than 20x across the storage network, and our customers have already seen notable performance and stability gains because of it.
- Granular Diagnostics
Using DTrace, a very powerful diagnostic tool in Solaris 10, we are able to conduct highly sophisticated real-time monitoring to catch incorrectly coded scripts or other unintentional issues that put excessive load on a given storage segment (a minimal example of this kind of monitoring appears after this list). This level of insight is not available on closed platforms, where real-time diagnostic tools tend to be limited to the vendor’s engineers.
- Quicker Backup Recovery
Under the old architecture, recovery from backup after a serious filesystem failure was possible but took a significant amount of time: even with fast disks and 10-gigabit networking, copying 15 terabytes from one disk system to another takes hours (at a theoretical 10-gigabit line rate of roughly 1.25 gigabytes per second, 15 terabytes takes more than three hours before any protocol or disk overhead). In the new architecture, backup servers have the same performance capabilities as their data source and are larger in capacity. Even in the unlikely case that we need to revert to a backup, engineers can perform the task in minutes.
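To give a flavor of the visibility DTrace provides, below is a minimal sketch of per-process disk I/O accounting using the standard io provider. This is an illustration only, not our actual monitoring code; the Python wrapper and the ten-second reporting interval are ours for the example, and it must be run as root on a Solaris 10 host.

```python
#!/usr/bin/env python
"""Minimal sketch: per-process disk I/O accounting with DTrace's io provider.

Illustration only: not (mt)'s actual monitoring code. Run as root on a
Solaris 10 host with the dtrace(1M) command available.
"""
import subprocess

# Sum the bytes each process issues to disk, then print and reset the
# aggregation every ten seconds.
D_SCRIPT = r"""
io:::start
{
    @bytes[execname] = sum(args[0]->b_bcount);
}

tick-10sec
{
    printf("\n---- disk bytes by process, last 10s ----\n");
    printa("%-24s %@d\n", @bytes);
    trunc(@bytes);
}
"""

if __name__ == "__main__":
    # -q quiets DTrace's default per-probe output; stop with Ctrl-C.
    subprocess.call(["dtrace", "-q", "-n", D_SCRIPT])
```

Even a one-liner at this level of detail is enough to spot a runaway script hammering a storage segment; our production monitoring goes considerably further.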
Moving OLD customers to NEW technology
With all that being said, why are some of our customers still on the original architecture? It seems like they should have access to the improved systems first, right?
First, we needed to prove that our new designs were significantly better than the originals. Even after receiving great results from our lab simulations, we elected to honor the lessons of the past: we have learned time and time again that real-world results teach us things that are impossible to find in simulation. To this end we launched Cluster 3 and began rigorous observation. Second, the original Sun hardware platform displayed some hardware-related glitches once it reached production, which delayed implementation until we were sure that its successor, to which we have since upgraded, had eliminated those issues.
So how are we proceeding with getting the remainder of Cluster 1 and Cluster 2 to this new, proven design?
Two major ways.
Upgrading Cluster 1 and Cluster 2
First, we are well underway with the in-place upgrade. The most time-consuming part of this process is migrating the vast amount of data from one system to the other while keeping transfer rates and load gentle enough not to cause any performance issues for everyday operation (a rough sketch of what this kind of throttled copy looks like appears below). At the time of writing, 37% of Cluster 1 customers and 44% of Cluster 2 customers have already been migrated.
About a month ago we dramatically accelerated this process, and we have purchased 100% of the hardware needed to complete the project. We anticipate that the entire process will be completed by June 2009, and most customers will be on the new architecture much sooner than that.
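For the curious, here is a rough sketch of what a throttled copy of this kind looks like. The paths, host name, and bandwidth cap are invented for the example and our real migration tooling is internal and more involved; the point is simply that each customer’s data is moved with a hard cap on transfer rate, and any failed copy can be retried, so the migration never competes with live traffic for I/O.

```python
#!/usr/bin/env python
"""Rough sketch of a throttled customer-data migration.

Illustration only: the paths, host name, and bandwidth cap below are
assumptions, not (mt)'s actual tooling. Each customer's data is copied
with a hard cap on transfer rate, and a failed copy can be retried later.
"""
import os
import subprocess

OLD_STORAGE = "/old-segment/customers"      # hypothetical mount of 1st-gen storage
NEW_STORAGE = "newstore.example.internal"   # hypothetical 2nd-gen storage host
BWLIMIT_KBPS = 10240                        # rsync's --bwlimit is in KB/s (~10 MB/s here)

def migrate(customer):
    """Copy one customer's data; resumable (--partial) and rate-limited."""
    src = os.path.join(OLD_STORAGE, customer) + "/"
    dst = "%s:/segments/%s/" % (NEW_STORAGE, customer)
    return subprocess.call(["rsync", "-a", "--partial",
                            "--bwlimit=%d" % BWLIMIT_KBPS, src, dst])

if __name__ == "__main__":
    for customer in sorted(os.listdir(OLD_STORAGE)):
        if migrate(customer) != 0:
            print("copy of %s failed; flagging for retry" % customer)
```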
Next 30 days: Cluster-to-Cluster migration tools.
We have committed significant developer and administrator manpower to the development of Cluster-to-Cluster migration tools. Currently, it is possible to migrate yourself to a new cluster using the technique described in our Knowledge Base. This method is complex, however, and we do not recommend it for most customers. The first version of the migrator tool will eliminate many of the manual steps.
We have good technology today. But there is more to come.
Our 2009 storage road map is exciting. As our new architecture continues to prove itself, we are not stopping development of new technologies.
- Storage segment fencing
In our 2nd-generation system, storage segments are more isolated from one another and less likely to cause system-wide disruptions. We are also in the late development phase of special “fencing” software that adds another layer of protection when storage malfunctions, keeping the cluster healthy and functional even during extreme cases of disk turbulence (a conceptual sketch of this idea appears after this list).
- Storage-Eye View
Using the powerful insight given to us by DTrace, we are developing automation that actually solves storage issues without human intervention. These self-healing tools are also being leveraged to provide customers with new reports and details about the behavior of their applications. Awesome.
- SSD
Sun is pioneering the integration of SSD (Solid State Disk) technology in a very interesting way with its Hybrid Storage Pool products. We are currently experimenting with this technology in our labs, and the results are looking fantastic.
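As a rough illustration of the fencing idea above (and only an illustration; the real software is considerably more involved), a fencing watchdog looks something like this: it probes the latency of every storage segment, and a segment that stays slow for several consecutive checks is fenced so that cluster nodes fail fast instead of hanging on it. The probe, fence, and unfence functions below are placeholders.

```python
#!/usr/bin/env python
"""Conceptual sketch of storage-segment fencing.

Illustration only: probe_latency(), fence(), and unfence() are stand-ins
for real health checks and real isolation actions. A segment that stays
slow for several checks in a row is fenced so cluster nodes fail fast for
that segment instead of hanging on it.
"""
import random
import time

LATENCY_LIMIT_MS = 500     # treat a probe slower than this as a bad probe
BAD_PROBES_TO_FENCE = 3    # require several bad probes in a row, not one blip
PROBE_INTERVAL_SEC = 5

def probe_latency(segment):
    """Stand-in for timing a tiny read/write on the segment; simulated here."""
    return random.expovariate(1.0 / 80.0)   # milliseconds; occasionally spikes

def fence(segment):
    """Stand-in: stop routing requests for this segment's customers."""
    print("fencing %s" % segment)

def unfence(segment):
    """Stand-in: resume normal routing once the segment is healthy again."""
    print("restoring %s" % segment)

def watchdog(segments):
    bad = dict((s, 0) for s in segments)
    fenced = set()
    while True:
        for seg in segments:
            if probe_latency(seg) > LATENCY_LIMIT_MS:
                bad[seg] += 1
                if bad[seg] >= BAD_PROBES_TO_FENCE and seg not in fenced:
                    fenced.add(seg)
                    fence(seg)
            else:
                bad[seg] = 0
                if seg in fenced:
                    fenced.discard(seg)
                    unfence(seg)
        time.sleep(PROBE_INTERVAL_SEC)

if __name__ == "__main__":
    watchdog(["segment-a", "segment-b", "segment-c"])
```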
A final note about redundancy
We would like to communicate the exact current high-availability (HA) status of the clusters within our GRID:
Currently HA:
- Every drive. We have 100% RAID throughout the system.
- Every server.
- For storage segments and all other critical servers, we have full internal redundancy (power supplies, fans, etc.).
- Load balancers and networking hardware.
Currently Not HA:
- Intra-cluster networking equipment, including cables. We have hot spares that can be activated within 5-20 minutes, but this is not HA. We are considering changing this in our (cs) product, but we are still debating the uptime advantages.
- Storage segments. We can fail over to the backup if needed, and we can typically recover from any other non-catastrophic issue within 3-5 minutes.
- Individual MySQL servers and Containers.
Summary
We understand that downtime can represent a once-in-a-lifetime, unrecoverable moment, so we are committed to delivering more stable, flexible, and powerful hosting.
(mt) Media Temple has also committed to communicating with our customers more effectively. Given our recent stumble, we clearly need to improve our communication systems. Soon we will further integrate our information flows with rapid-broadcast systems like Twitter and VoIP. We’re going to keep looking for ways to get you information quickly.
(mt) Media Temple aims to be a proactive and agile company working to address the varied needs of our clients. This is a serious promise.
Thank you for your patience and continued business. We welcome your comments and feedback.