After a Facebook discussion about this RNZ article on Microsoft wanting to push AI systems in New Zealand schools, I got into a conversation with a local hydrologist (also Leon's mum) about cloud services and resilience, especially in light of the recent devastation wrought by Cyclone Gabrielle.
Marianne writes:
I'm genuinely curious as to what the fallback options are when your primary cloud services go down in a situation like Gabrielle and all your data and software are in the cloud and reliant on the internet working.
Monitoring sites are usually independently powered and have battery banks to keep them going. Critical sites are doubled-up. We used to run the comms networks with our own repeaters with backup generators, backup radio comms in case the copper or cell towers went down, backup generators at the EM centre, and a local PC with the necessary data and software installed so if we had to, we could run the entire system for several days from pretty much anywhere.
I keep hearing about cloud services dropping the ball in these situations. So as I said I am really curious and interested to know how cloud service providers can ensure this sort of resilience.
The very short answer: they can't. Not by themselves.
Cloud providers - especially the international ones - rely on economies of scale around processing, power usage and plentiful bandwidth in order to remain profitable. Which typically means centralised computing in giant datacentres placed at strategic geographical locations, and lots of high-bandwidth fibre optic cable to connect everything up.
Resilient solutions require investment at the edges of your network. Local storage, local computation, local services; local hardware, local power. Yes, that includes the radio and/or satellite links that some have deemed "too expensive" to maintain, kicking the risk down the line for everyone else to deal with.
They also require you to design for that: systems that will keep working when they can't dial home to a central server, that can communicate point-to-point across a mesh network when the trunk line isn't there, that work using interoperable open standards, that will prioritise the most important information getting through in a crisis when demand is high and bandwidth is neither plentiful nor reliable.
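To make that a bit more concrete, here is a minimal sketch of the kind of store-and-forward prioritisation I mean, in Python. The station names, payload fields and the `send_via_uplink` function are invented for illustration - this isn't any particular vendor's API - but the idea is simple: readings are buffered locally, and the most critical ones go out first whenever a link happens to be up.

```python
import heapq
import time
from dataclasses import dataclass, field

# A minimal store-and-forward queue: readings are buffered locally and sent
# in priority order whenever an uplink (fibre, cellular, radio, satellite)
# happens to be available. Nothing here depends on a central server being up.

@dataclass(order=True)
class Reading:
    priority: int                      # 0 = critical (e.g. a river level alarm); higher = less urgent
    timestamp: float = field(compare=False)
    station: str = field(compare=False)
    payload: dict = field(compare=False)

class StoreAndForwardQueue:
    def __init__(self, send_fn):
        self._heap = []                # local buffer; survives loss of connectivity
        self._send = send_fn           # whatever link is currently available

    def enqueue(self, reading: Reading) -> None:
        heapq.heappush(self._heap, reading)

    def flush(self, budget: int) -> int:
        """Send up to `budget` readings, most critical first.
        Returns how many actually got through; the rest stay buffered."""
        sent = 0
        while self._heap and sent < budget:
            reading = self._heap[0]
            try:
                self._send(reading)    # may raise if the link drops mid-flush
            except ConnectionError:
                break                  # keep the reading; try again on the next pass
            heapq.heappop(self._heap)
            sent += 1
        return sent

# Illustrative use: a narrow uplink with a small per-pass budget.
def send_via_uplink(reading: Reading) -> None:
    print(f"sent {reading.station} p{reading.priority}: {reading.payload}")

if __name__ == "__main__":
    q = StoreAndForwardQueue(send_via_uplink)
    q.enqueue(Reading(5, time.time(), "station-a", {"rainfall_mm": 2.0}))
    q.enqueue(Reading(0, time.time(), "station-b", {"river_level_m": 9.1, "alarm": True}))
    q.flush(budget=1)                  # only the critical alarm goes out this pass
```

The important property is that the queue lives at the edge: if the trunk line disappears, nothing is lost, and when a narrow radio or satellite window opens, the alarm-level data gets through before the routine telemetry.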
You also need to test that all of this actually works. Before you need it. Simulate a fault and check that disaster recovery kicks in. Make sure your backups really do work, and that you can fail over to them at short notice if you have to. Simulate partial faults, where some but not all of your connectivity works, to check that critical functions are prioritised correctly. Route everything through a satellite or radio link and make sure it still works as smoothly as possible when everything is slow. Confirm you're still sending and receiving good data in low-power mode, when you need to conserve whatever energy is available.
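As a sketch of what "simulate a fault" might look like in code - a toy with made-up numbers and a hypothetical `DegradedLink` wrapper, not a real chaos-engineering tool - you can wrap whatever send function production uses in something that injects latency, packet loss and a tight bandwidth budget, then check that the data you care about most still gets through:

```python
import random
import time

# A toy fault-injection drill: wrap the production "send" function in a
# degraded link (latency, drop rate, per-window byte budget) and check that
# the messages you care about most still arrive. Names and numbers are illustrative.

class DegradedLink:
    def __init__(self, send_fn, latency_s=2.0, drop_rate=0.3, byte_budget=512):
        self._send = send_fn
        self.latency_s = latency_s      # e.g. a slow satellite hop
        self.drop_rate = drop_rate      # e.g. intermittent radio
        self.byte_budget = byte_budget  # bytes allowed in this window

    def send(self, message: bytes) -> bool:
        if len(message) > self.byte_budget:
            return False                # over budget: caller must defer or summarise
        time.sleep(self.latency_s)
        if random.random() < self.drop_rate:
            return False                # simulated packet loss: caller should retry
        self.byte_budget -= len(message)
        self._send(message)
        return True

def drill():
    delivered = []
    link = DegradedLink(delivered.append, latency_s=0.01, drop_rate=0.5, byte_budget=256)

    alarm = b'{"station":"X","river_level_m":9.1,"alarm":true}'
    bulk = b"x" * 1024                  # routine bulk upload, too big for this window

    # Retry the critical message until it gets through; the bulk upload must wait.
    while not link.send(alarm):
        pass
    assert not link.send(bulk), "bulk data should be deferred when the link is constrained"
    assert alarm in delivered
    print("drill passed: critical alarm delivered, bulk upload deferred")

if __name__ == "__main__":
    drill()
```

A real drill would also pull the plug on actual links and run from the backup hardware, but even a toy like this catches code that quietly assumes a fast, reliable connection.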
A cloud or software-as-a-service provider might be able to offer some expertise to support these processes, but I suspect the main challenge will still be dragging governance folks along for the journey - getting them to agree that a resilient solution is worth putting money behind, rather than getting distracted by the shiny-yet-flimsy features that are so often a hallmark of how Silicon Valley "innovates".