What happened?! (outage 10/8 to 10/13)

f0urtyfive » Sun Oct 13, 2019 10:17 pm

Hello all,

As you may have noticed, RadioCapture suffered a fairly long outage during which no audio was available from any receive site. This was because both SSDs supporting the database failed, most likely from extreme write wear. RC's website runs on a single server hosted in downtown Denver that is mostly cobbled together from spare parts. Once the read and write load on the database became too much for spinning disks to handle, I bought two 1 TB SSDs for it to live on. Unfortunately I couldn't afford real server-grade SSDs, so I used consumer-grade SSDs, which are cheaper but have a lower terabytes-written (TBW) rating, and SSDs have a finite write life.

Long story short, one of those first two SSDs failed a while back, and I recovered the existing database VM off the remaining SSD and replaced both. While investigating that failure, I found that my single-server virtualization and storage setup had somehow caused a very significant write amplification problem on these SSDs, and I made a mental note to fix it.
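For anyone wondering why write amplification matters so much here, this is a rough back-of-the-envelope sketch. It is not taken from RC's actual setup: the function names and every number in it are made up for illustration. It compares how much the application asked to write with how much actually reached the drive, and then estimates how long a drive with a given terabytes-written rating would last at that rate.

```python
def write_amplification(app_tb_written, device_tb_written):
    """Ratio of data that reached the SSD to data the application wrote."""
    if app_tb_written <= 0:
        raise ValueError("app_tb_written must be positive")
    return device_tb_written / app_tb_written


def days_until_worn_out(tbw_rating_tb, device_tb_written_so_far, device_tb_per_day):
    """Days left before the drive reaches its rated terabytes written."""
    if device_tb_per_day <= 0:
        raise ValueError("device_tb_per_day must be positive")
    remaining_tb = max(tbw_rating_tb - device_tb_written_so_far, 0)
    return remaining_tb / device_tb_per_day


if __name__ == "__main__":
    # Hypothetical figures: the database wrote 100 TB, but the storage stack
    # turned that into 450 TB of writes on the SSD.
    print(write_amplification(100, 450))
    # Hypothetical drive rated for 600 TB, 150 TB used, 1.5 TB/day hitting it.
    print(days_until_worn_out(600, 150, 1.5))
```

The point of the sketch is that the drive's rating is consumed by the device-side number, so any amplification in the virtualization or storage layers shortens its life by the same factor.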

It seems that I forgot that sticky note somewhere in my brain until about 10 seconds after I went to log in to the database VM, after someone said the site wasn't working right, at which point I immediately remembered the urgency of the write amplification problem I had forgotten. I tried a few times to access the existing database VM in read-only modes and had some limited success copying it off the SSD, but when I pulled both SSDs from the datacenter to do a full recovery, neither would initialize at all, so the database was completely gone.

RadioCapture's current budget is only just able to cover the existing server and object storage; no significant backup equipment, infrastructure, or storage was in the budget. I did, however, have an old backup from November 2018 (just about 11 months ago). I've loaded that backup into a new temporary database server in the cloud and wiped out all the historical audio metadata. During the outage about 2 million calls were recorded, and these are currently being ingested into the now-clean database. Going forward I will configure the system to automatically archive and purge audio data after a configurable time period.
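As a rough idea of what "archive and purge after a configurable time period" could look like, here is a minimal sketch. It assumes a hypothetical `calls` table with an `audio_key` column and a `recorded_at` column holding Unix seconds, and it uses SQLite only so the example is self-contained. RC's real schema, database, and archive target are not described in this post and may look nothing like this.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_expired_calls(conn, retention_days, now, archive=None):
    """Archive, then delete, calls older than retention_days.

    `calls`, `audio_key` and `recorded_at` are hypothetical names.
    Returns the audio keys that were purged, oldest first.
    """
    cutoff = int((now - timedelta(days=retention_days)).timestamp())
    with conn:  # single transaction: rolled back if archiving raises
        rows = conn.execute(
            "SELECT audio_key FROM calls WHERE recorded_at < ? "
            "ORDER BY recorded_at",
            (cutoff,),
        ).fetchall()
        keys = [row[0] for row in rows]
        if archive is not None:
            for key in keys:
                archive(key)  # e.g. copy the audio object to cold storage
        conn.execute("DELETE FROM calls WHERE recorded_at < ?", (cutoff,))
    return keys


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls (id INTEGER PRIMARY KEY, "
        "audio_key TEXT, recorded_at INTEGER)"
    )
    now = datetime.now(timezone.utc)
    sixty_days_ago = int((now - timedelta(days=60)).timestamp())
    conn.execute(
        "INSERT INTO calls (audio_key, recorded_at) VALUES (?, ?)",
        ("example-old-call.wav", sixty_days_ago),
    )
    print(purge_expired_calls(conn, retention_days=30, now=now, archive=print))
```

Archiving before deleting inside one transaction means a failed archive step leaves the metadata rows in place, so the job can simply be retried later.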

On the good news side of things, this complete failure in process/outage/total shitshow/emergency cloud migration does provide an opportunity: with a bit of automation and software development, I should be able to accept more receive site data, or possibly even accept user-submitted data from other trunked radio system capture software.

tl;dr bullet points:
  • database vanished into a puff of smoke
  • your views might be gone or rolled back
  • some systems & talkgroups & radios have been rolled back to before they existed, so they have no metadata. It will take admins a few days to reconfigure all the talkgroups and permissions on these systems & talkgroups.
  • historical audio is no longer publicly available beyond a few weeks, until we can find more revenue to pay for the storage needed.

I want to thank our Patreon supporters for literally making RadioCapture's continued existence possible. Please consider contributing to RadioCapture's Patreon if you'd like to see our coverage area or feature set grow.

Re: What happened?! (outage 10/8 to 10/13)

PC1309 » Mon Oct 14, 2019 8:00 pm

Thank you for all your hard work getting a great site up and running again!!!

