Telegram Downtime
Telegram, including all the sites hosted on Telegram, was down for about an hour, from 17:45 GMT-8 to 18:45 GMT-8 on November 12, 2012.
Here's a run-down of what happened.
How Telegram is structured
Currently, the process that converts a repository to a Telegram site runs as part of the main Telegram web application. This means that all the resources (memory and CPU) needed to render a site are shared with the process that serves the main https://telegr.am site (we are in the process of changing this).
The failure
Throughout the day on November 12, the Telegram web site (not other sites hosted by Telegram, just the main site at https://telegr.am) was intermittently unavailable.
The web app/rendering process was consuming all available CPU and all allocated JVM heap space. We took some thread dumps to determine what was happening and found a couple of long-running rendering processes that were consuming all system resources.
The first step was to increase heap space for the JVM process. This yielded no material improvement, as the rendering processes were backed up and the pipeline was filling faster than it could be drained on the medium-sized EC2 instance.
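To illustrate why more heap alone did not help, here is a minimal sketch of a queue whose arrival rate exceeds its service rate. The function name and the rates are hypothetical, chosen only for illustration; they are not measurements from the Telegram server.

```python
def backlog_over_time(arrivals_per_min, renders_per_min, minutes, start=0):
    """Return the number of queued render jobs at the end of each minute."""
    backlog = start
    history = []
    for _ in range(minutes):
        # The queue grows by the arrivals and shrinks by the completed renders,
        # but it can never drop below zero.
        backlog = max(0, backlog + arrivals_per_min - renders_per_min)
        history.append(backlog)
    return history


# Hypothetical rates: 5 jobs arrive per minute, only 3 finish per minute.
print(backlog_over_time(5, 3, 5))            # [2, 4, 6, 8, 10]

# With more rendering capacity, an existing backlog of 10 drains.
print(backlog_over_time(5, 8, 5, start=10))  # [7, 4, 1, 0, 0]
```

In this toy model, extra memory only lets the backlog grow for longer. The queue shrinks only when the render rate exceeds the arrival rate, which is what moving to a machine with more CPU provided.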
We need more power, captain
At 17:45 GMT-8 we took down the entire machine that runs Telegram and all the Telegram sites so that we could migrate to a larger EC2 instance.
We created a new AMI (machine image) by snapshotting the existing EC2 instance and restarted the service on a much larger machine (8x the CPU and 4x the RAM).
The snapshotting process took about an hour.
The new instance came online with significantly more RAM and CPU available for the Telegram web app and rendering service. With the exception of one site that contains Markdown that's causing a problem with the Markdown parser, all other sites have moved through the rendering pipeline.
Future and Scalability
We have been running Telegram in alpha and allowing a limited number of users so that we could determine the system requirements for Telegram. We have now found one limit: rendering inside the main web application can exhaust the CPU and heap of a medium-sized EC2 instance.
It has been our plan to create separate virtual machines for rendering each Telegram site. We already create separate virtual machines each time we render a Dexy site. We've been working over the past two weeks on moving the main Telegram rendering into separate virtual machines. This feature should be available by November 16th.
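The idea behind isolating rendering can be sketched at a smaller scale with a child process and a timeout. This is only a process-level analogue written for illustration, not the virtual-machine setup described above; `run_render`, `WRAP`, and `STUCK` are hypothetical names, and the "renderer" is a stand-in that wraps its input in a paragraph tag.

```python
import subprocess
import sys


def run_render(cmd, source, timeout_s=30.0):
    """Run a render command in a child process.

    Returns the command's stdout, or None if it fails or exceeds the timeout.
    """
    try:
        result = subprocess.run(
            cmd, input=source, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout, so a runaway render
        # cannot keep consuming CPU alongside the web application.
        return None
    if result.returncode != 0:
        return None
    return result.stdout


# Stand-in renderer (an assumption for this sketch): wraps stdin in <p> tags.
WRAP = [sys.executable, "-c",
        "import sys; sys.stdout.write('<p>' + sys.stdin.read().strip() + '</p>')"]
# Stand-in for a render that never finishes.
STUCK = [sys.executable, "-c", "import time; time.sleep(60)"]

if __name__ == "__main__":
    print(run_render(WRAP, "hello"))                  # <p>hello</p>
    print(run_render(STUCK, "hello", timeout_s=1.0))  # None
```

The caller keeps serving requests whether or not a render finishes. A separate virtual machine goes further by also giving each render its own memory and CPU.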
Sadly, we reached the capacity limit before we had the more scalable solution in place and thus suffered significant downtime in order to put a stop-gap solution in place.
Once we complete the migration to virtual machine-based Telegram rendering, we will have all the scalability that Amazon has to offer, as we can spawn new EC2 instances whenever we need more Telegram rendering capacity.
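As a rough sketch of how rendering capacity could be sized once instances can be added on demand, the number of rendering instances is the job arrival rate divided by what one instance can render, rounded up. The function name and the rates are hypothetical and for illustration only.

```python
import math


def instances_needed(jobs_per_min, renders_per_instance_per_min):
    """Smallest number of rendering instances that keeps up with incoming jobs."""
    if renders_per_instance_per_min <= 0:
        raise ValueError("per-instance render rate must be positive")
    return math.ceil(jobs_per_min / renders_per_instance_per_min)


# Hypothetical load: 5 jobs per minute, each instance renders 3 per minute.
print(instances_needed(5, 3))   # 2
print(instances_needed(12, 3))  # 4
```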