Observability Insights From the Hospital#

A few weeks ago my youngest was having a rough night and didn’t sleep much. His breathing was also deeper than usual, and in the morning his pediatrician told us to take him to the ER. As soon as we entered, the doctors and nurses descended on him, running tests, providing oxygen, and starting IVs. When the dust settled, they told us his oxygen saturation was low and that they’d keep us until he was keeping his numbers up on his own.

So we sat and stayed with him, made sure he was comfortable and loved, and watched the monitor showing his levels.

Sitting there, watching his saturation numbers with nothing else on my mind, it struck me how similar it all was to the observability projects I have been focusing on.

Think of Your Audience (Users/Clients)#

Watching the numbers was discouraging; he wasn’t as close to 100 as I thought he needed to be.
Later in the visit, they explained that while 100 is perfect, it wasn’t the goal. Anything around 95 was considered great.

Takeaway: Provide context to dashboards and alerts so it’s clear to others what they’re looking at and what they’re looking for.
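To make that concrete, here’s a minimal sketch of an alert definition that carries its own context. This isn’t from any particular tool; the class and field names are hypothetical, and the numbers echo his monitor rather than anything you should copy.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """Hypothetical alert definition that carries its own context for the reader."""
    name: str
    threshold: float
    summary: str                 # what the alert means when it fires
    what_good_looks_like: str    # the context I wished I'd had at the bedside

spo2_low = Alert(
    name="oxygen_saturation_low",
    threshold=92.0,  # placeholder value, not medical advice
    summary="SpO2 has dropped below the acceptable range.",
    what_good_looks_like="100 is perfect, but it isn't the goal; anything around 95 is great.",
)
```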


Always Have Alerts#

Frequently, the number would drop. After a moment, the monitor would start beeping in his room and notify the nurses’ station, where someone would come to check on him.

Takeaway: Make sure alerts notify you of issues and, just as importantly, have monitoring to ensure those alerts are reviewed and addressed.
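One rough way to “alert on your alerts,” sketched in Python under the assumption of a hypothetical alert record that tracks when it fired and whether anyone acknowledged it:

```python
from datetime import datetime, timedelta, timezone

# If nobody acknowledges an alert within this window, escalate it.
ESCALATION_WINDOW = timedelta(minutes=15)

def needs_escalation(fired_at: datetime, acknowledged: bool) -> bool:
    """True when an alert has sat unreviewed for longer than the escalation window."""
    return not acknowledged and datetime.now(timezone.utc) - fired_at > ESCALATION_WINDOW
```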


Having Working Data Is Critical#

While the saying goes “no news is good news,” that doesn’t hold true when you’re using data to keep people or systems healthy. You need to have the data.

If you don’t - especially if the data stops flowing, even temporarily - it’s critical that the gap be alerted on and fixed ASAP. Otherwise, you risk missing alerts for issues that are about to happen or are already in progress.

Takeaway: Create alerts for missing data, at nearly the same severity as an incident itself.
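A small sketch of what that might look like, assuming you can ask your store for the timestamp of the most recent sample; the staleness window is a placeholder:

```python
from datetime import datetime, timedelta, timezone

# If nothing has arrived within this window, treat the silence itself as an incident.
MAX_DATA_AGE = timedelta(minutes=5)

def data_is_missing(last_sample_at: datetime | None) -> bool:
    """True when the stream has gone quiet, regardless of what the last value was."""
    if last_sample_at is None:
        return True
    return datetime.now(timezone.utc) - last_sample_at > MAX_DATA_AGE
```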


Data Quality and Baselines#

A couple of times the number disappeared and was replaced with a concerning ?, which I learned meant no data. Saturation monitors are finicky, which doesn’t go well with an active toddler.

The nurses explained that the numbers were only usable when there was a clear, definitive pattern in the rise and fall of the graph.

Takeaway: Having a baseline for what your data should look like, and understanding your data, helps ensure you’re making informed decisions.
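As a sketch of the same idea in code, assuming readings arrive as a simple list of floats; the range and jitter limits are made-up placeholders, not real clinical values:

```python
from statistics import stdev

def is_trustworthy(
    readings: list[float],
    baseline_low: float = 90.0,   # placeholder baseline band
    baseline_high: float = 100.0,
    max_jitter: float = 3.0,      # placeholder noise limit
) -> bool:
    """Only trust a window of readings that stays in the expected band and isn't wildly noisy."""
    if len(readings) < 5:
        return False  # not enough signal yet to see a clear pattern
    in_range = all(baseline_low <= r <= baseline_high for r in readings)
    steady = stdev(readings) <= max_jitter
    return in_range and steady
```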


Maintain Your Alerts#

For the first little while I was with him, his monitor would alert us constantly about his low numbers. But that was exactly why he was there; it wasn’t providing new information.

After some time, I was able to have the nurse review it and adjust the thresholds so it would only alert if the numbers dropped low for him.

Takeaway: Alerts shouldn’t be set in stone. During long-running incidents, adjust thresholds to avoid alert fatigue while you’re actively addressing the issue.
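A minimal sketch of a threshold that can be temporarily relaxed during a known, actively worked incident; again, the names and numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    """An alert threshold with an optional, temporary incident-time override."""
    normal: float
    incident_override: float | None = None

    @property
    def current(self) -> float:
        # While the incident is being actively worked, only alert on "low for him".
        return self.incident_override if self.incident_override is not None else self.normal

spo2 = Threshold(normal=95.0)     # placeholder values
spo2.incident_override = 88.0     # relaxed while the known issue is being treated
assert spo2.current == 88.0       # remember to clear the override once the incident ends
```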


Other thoughts: On call has a very different meaning in a hospital.

Nothing like testing in production: taking him off external oxygen to let him breathe just room air was terrifying, even though there was a rollback plan and staff on hand to handle anything.


This experience reminded me that observability isn’t just about data, it’s about trust.
Trust in the numbers, trust in the alerts, and trust that when something goes wrong, someone will be notified and take action.

Whether you’re monitoring a child’s oxygen levels or a fleet of servers, the goal is the same: understand what’s normal, know when to intervene, and make sure you can act before it’s too late.