
When The Flu = Better DB Practices

Like too many of us, I’ve been fighting a nasty bout of the cold or flu or whatever it is that’s going around. And while I’ve been enjoying (!) the exhaustion and inability to think straight, it actually got me thinking about the many things that keep systems running so you’re not on the hook 24/7.

One of the things that is coming up more and more with the different folks we’re working with is the fact that systems may be documented, backed up and recoverable at the database level, but the interdependencies haven’t received the same care and feeding. This isn’t just a cloud issue, particularly with so many on-premises services and mixed/hybrid on-premises-cloud environments. With so many component parts in place to make the systems work, it’s critical to, frankly, be a pain in the backside to the teams using these services so you can make sure the recoverability knowledge of these systems is fully accounted for.

Some of the things that have helped us address this in our own systems include:

  • Create a system diagram, including processes – a nice, big, poster-style image of your infrastructure.  Using Azure hosted functions?  Great.  Include them.  Have Lambda processes?  Excellent.  Include them.
  • Think about things that can go wrong. Not deep, intricate things, but just normal, run-of-the-mill things: “Email isn’t going out…,” “the database is throwing errors on the user interface…,” “the pages aren’t working because of SSL issues…” Then write up a series of steps and things that can be checked, working down through the architecture. Shine a light on the infrastructure and give yourself (and of course others) a starting point if things do go awry.
  • Start a new intranet site (we use Google Sites) and document whenever something happens: what the symptoms were and what you did to fix it. One of the most important things you can do is add a paragraph of terms that SOMEONE ELSE would be searching for to find that page. As a simple example, not everyone will put in “404,” but they may enter things like “page not found” or “couldn’t find the page” or whatever. This has been one of the most helpful things we have done: putting everyday descriptive terms in the page so it can be found quickly. Corollary: If you have an issue that IS documented, but the person couldn’t find it, ask what they searched on. Add that to the page so next time, the search will work.
  • Think about how people ingest this type of content under stress. Things won’t be pretty if something is going wrong, their key go-to person (you) is out sick and they’re trying to pick up the pieces. Keep the steps extremely specific and extremely concise, and set people up to be winners, if at all possible, simply by following them.
  • Have review cycles. Keep track of when pages are created and updated (again, Google Sites does a great job of this). Review them on a regular basis with the people who may rely on them, both technical and non-technical. The more everyone knows, the less panic there will be. The more they know, the more likely it is that they can troubleshoot, whether it’s a high-level issue (check the service status dashboard of our provider, or check with IT to make sure there isn’t a high-level delivery issue going on) or something deeper where things may have to be tweaked or investigated.
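If your documentation lives somewhere you can script against, the search-terms and review-cycle ideas above can be sketched in a few lines of Python. This is only a rough illustration, not how Google Sites works under the hood: the `RunbookPage` class, the helper function names and the 90-day review window are all made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class RunbookPage:
    """One documented incident page (hypothetical structure)."""
    title: str
    steps: list[str]
    # Everyday phrases someone else might type when looking for this page.
    search_terms: set[str] = field(default_factory=set)
    last_reviewed: date = field(default_factory=date.today)


def find_pages(pages, query):
    """Return pages whose title or search terms match the query text."""
    q = query.strip().lower()
    if not q:
        return []
    matches = []
    for page in pages:
        terms = [t.lower() for t in page.search_terms]
        if q in page.title.lower() or any(q in t or t in q for t in terms):
            matches.append(page)
    return matches


def record_missed_search(page, query):
    """Corollary from the list: add the phrase a person actually searched on."""
    page.search_terms.add(query.strip().lower())


def needs_review(pages, today, max_age_days=90):
    """Return pages not reviewed within the (assumed) review window."""
    cutoff = today - timedelta(days=max_age_days)
    return [p for p in pages if p.last_reviewed < cutoff]
```

The part worth copying is `record_missed_search`: when someone tells you what they typed and it didn't find the page, that exact phrase goes onto the page, whatever tool you use to store it.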

If you can add transparency that spans all of the different services, you’ll be well on your way to having a calmer user base, and a better chance of being able to lie around grumbling about being ill, rather than grumbling about being ill AND fighting some issue on your systems.