How we leveraged Intuit Platforms to build a self-healing capability at UIP!

Jul 18, 2022 · 8 min read

Author: Athitya Kumar, Shivansh Maheshwari

We all know how important metrics, monitoring and alerting are to the stability of any good software. But step into the shoes of someone who is "on-call", and following the alert runbook like a robot is not a great experience. Have you felt the same way while you were on call? We did too!

Live footage of on-call folks following the runbook

Do you know who’s great at following steps like a robot? Code. That’s exactly why we’ve implemented a self-healing capability using the Intuit Platforms to reduce our repetitive on-call chores.

Context time

I’m currently working as a Software Engineer-2 with the Unified Ingestion Platform (UIP) at Intuit Inc. We’re building self-serve data lake ingestion capabilities as a platform, for the entire company to leverage!

That is, we support ingestion from many types of data sources (MySQL, Postgres, Oracle, DynamoDB, files, Kafka etc), and expose the ingested data to our data analysts through multiple sinks (Kafka, Data Lake, Delta Lake etc).

For more info on UIP and the ingestion architecture, please refer to the prior articles published by the team: Self-Serve Data Ingestion Platform at Intuit.

The Monotonous Monstrosity

As UIP kept maturing as an ingestion platform and more teams started onboarding their data onto the data lake via UIP, we started seeing an increase in the volume of our platform’s alerts.

There were a lot of intermittent alerts — highly frequent, yet fixable with a simple CLI command or API call. As a result, on-call duty had become a chore of logging into the specific machine and running a contextual restart command.

We started recording these alert volumes and found that one of these “Hung state” alert clauses alone had started firing 3 to 4 alerts daily. Typically, this would happen when the Kafka-Connect layer’s network connection to the Kafka brokers got interrupted, or when the Java process itself had gone down for some reason.

“Within every adversity is an equal or greater benefit. Within every problem is an opportunity.” — Napoleon Hill

A couple of minutes for each alert may not seem like much. But keep adding it up over weeks, sprints and quarters — and suddenly, these innocent alerts are consuming a sizable chunk of developer bandwidth.

The 3-pronged Opportunity

When it comes to alerting, we’ve observed that this 3-prong approach has typically worked well for us:

  • Add extensive error codes, with contextual information
  • Build tooling via APIs & UI, so that the resolution doesn’t need to be contextual
  • Build a self-healing capability leveraging the APIs — so that the issue gets automatically resolved

Of course, the 3rd prong is the ideal state for each alert. However, we need the 1st prong to be in place first, and the 2nd prong as a fallback in case anything goes wrong in the self-healing logic.

Prong-1: Extensive error codes

Basically, for every failure scenario that’s possible in the code, throw a distinct error code — so that the place of failure can be identified just by looking at the error code.

Also, it’s usually preferable to categorise some of these error codes based on the type of failure. Typically, any enterprise software needs to integrate with multiple dependent/upstream services and any of these integration failures could result in errors.

  • E1001, E1002, E1003, … — could all refer to errors with service 1
  • E2001, E2002, E2003, … — could all refer to errors with service 2

And so on. Additionally, we try to return relevant context (variables and their values) as a HashMap object — so that there’s no need to rummage through the logs for any immediate debugging.

And finally, the error code can also carry some boolean flags when thrown by the code — whether the failure is intermittent, whether it can be self-healed, and so on.
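As a rough sketch, error codes with such categories, flags and context could look something like this in Java — the enum values, flags and class names below are illustrative placeholders, not the actual UIP definitions:

```java
// Illustrative sketch — enum values, flags and class names are hypothetical.
import java.util.Map;

public enum ErrorCode {
    // E1xxx — errors while integrating with service 1 (say, Kafka-Connect)
    E1001("Connector task in hung state",    /* intermittent */ true,  /* selfHealable */ true),
    E1002("Connector config rejected",       /* intermittent */ false, /* selfHealable */ false),
    // E2xxx — errors while integrating with service 2 (say, the source DB)
    E2001("Source DB connection timed out",  /* intermittent */ true,  /* selfHealable */ false);

    private final String description;
    private final boolean intermittent;
    private final boolean selfHealable;

    ErrorCode(String description, boolean intermittent, boolean selfHealable) {
        this.description = description;
        this.intermittent = intermittent;
        this.selfHealable = selfHealable;
    }

    public String getDescription()  { return description; }
    public boolean isIntermittent() { return intermittent; }
    public boolean isSelfHealable() { return selfHealable; }
}

// The exception carries the error code plus a context map of relevant
// variables, so nobody has to rummage through logs for immediate debugging.
class IngestionException extends RuntimeException {
    private final ErrorCode errorCode;
    private final Map<String, String> context;

    IngestionException(ErrorCode errorCode, Map<String, String> context) {
        super(errorCode + ": " + errorCode.getDescription() + " | context=" + context);
        this.errorCode = errorCode;
        this.context = context;
    }

    ErrorCode getErrorCode()         { return errorCode; }
    Map<String, String> getContext() { return context; }
}
```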

Prong-2: Tooling via UI

The moment we add error codes, we start collecting useful data points. Within a week or so, we can generate reports on how the alerts are split across error codes, which error codes have high-frequency and high-confidence resolution steps, which ones are low-hanging fruit, and so on.

With these new data points, we can start building tooling around them. At UIP, we built a developer dashboard codenamed IMD (Ingestion Management Dashboard) for exactly this.

The key advantage of building this prong is that we also end up building powerful backend services for the tooling, which the UI can invoke for starters.

And once self-healable alerts with confident resolution steps have been identified, the same tooling backend services can be re-used and invoked there too!
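For instance, one such tooling backend action could be exposed as a small REST endpoint that both the IMD UI and (later) the self-healing bot invoke. The sketch below assumes a Spring-style service; the endpoint path, class names and the ConnectorService wrapper are hypothetical, not UIP’s actual API:

```java
// Hypothetical sketch of a tooling backend, assuming a Spring-style REST service.
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical wrapper around the Kafka-Connect REST API
// (e.g. POST /connectors/{name}/restart on the Connect cluster).
interface ConnectorService {
    void restartConnector(String connectorName);
}

@RestController
@RequestMapping("/api/v1/connectors")
public class ConnectorToolingController {

    private final ConnectorService connectorService;

    public ConnectorToolingController(ConnectorService connectorService) {
        this.connectorService = connectorService;
    }

    // The manual runbook step ("log in and restart the connector"), now a
    // single API call — clickable from the IMD UI, and callable later by
    // the self-healing consumer.
    @PostMapping("/{connectorName}/restart")
    public ResponseEntity<String> restart(@PathVariable String connectorName) {
        connectorService.restartConnector(connectorName);
        return ResponseEntity.ok("Restart triggered for " + connectorName);
    }
}
```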

Prong-3: Self-healing Capability

The moment we have well-documented error codes, reliable resolution steps for them, and a programmatic way to take those resolution steps — we’re ready to go ahead with self-healing!

Typically, these are some of the Enterprise tools used in the industry:

  • Splunk for logs
  • VMWare’s Wavefront for metrics
  • PagerDuty for triggering & handling alerts

The design of self-healing is such that it can be extended to any such tool — as long as it has support for webhooks!

High-Level Design of Self-Healing

To be able to debug / identify self-healing failures, we consciously decided that our metrics collection and self-healing capability would be 2 separate workflows — not coupled together. Yup, Separation of Concerns!

Sequence diagram for self-healing capability

The implementation itself was easier than expected; as we were able to leverage Intuit Platforms for most of the components:

  • Kafka Topics: The Event Bus platform of Intuit has made topic creation self-serve — which means creating and configuring our Kafka topics in multiple environments took a grand total of around 15 minutes!
    Also, by exposing a Kafka Producer API that accepts the webhook payloads from the above-mentioned tools (PagerDuty, Wavefront, Splunk etc), the integration becomes even more seamless (a rough sketch of such a producer follows this list)!
  • Kafka Consumer: The Stream Processing Platform (codenamed SPP) of Intuit already has smart functionality that generates boilerplate Kafka consumer code as per Intuit best practices. Hence, Intuit folks get a kick-start with both writing the code and deploying the consumer code in multiple environments. End-to-end, it took a total of just 1-2 hours to get a sample consumer running in 3 different environments!
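For illustration, here is a minimal sketch of that webhook-to-Kafka hop: the alerting tool POSTs its webhook payload to a thin HTTP handler, which simply forwards it to a Kafka topic. The topic name and class name are placeholders, not the actual Event Bus producer API:

```java
// Hypothetical sketch — topic and class names are placeholders.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AlertWebhookProducer {

    private static final String ALERTS_TOPIC = "uip-self-healing-alerts"; // placeholder topic name

    private final KafkaProducer<String, String> producer;

    // kafkaProps should carry bootstrap.servers and String key/value serializers.
    public AlertWebhookProducer(Properties kafkaProps) {
        this.producer = new KafkaProducer<>(kafkaProps);
    }

    // Called by the webhook HTTP handler with the raw alert payload;
    // the alert source (e.g. "pagerduty") is used as the record key.
    public void publish(String alertSource, String webhookJson) {
        producer.send(new ProducerRecord<>(ALERTS_TOPIC, alertSource, webhookJson));
    }
}
```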

Low-Level Design

  • Define an interface that handles any known/unknown alert clauses, with 2 methods
  • Implement the AlertClauseInterface for all known self-healable failure scenarios/alert clauses
  • And finally, put it all together in the main method of the Kafka Consumer, which is the entry point to the self-healing logic (a combined sketch of all three pieces follows right after this list)
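Here is a minimal sketch of what those three pieces could look like, assuming a plain Kafka consumer. Only the AlertClauseInterface name comes from the design above — the two method names, the HungStateAlertClause example, and the topic/configuration values are hypothetical placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// 1. Interface with 2 methods: one to check whether this clause owns the
//    incoming alert, and one to run the resolution steps for it.
interface AlertClauseInterface {
    boolean matches(String alertPayload);
    void selfHeal(String alertPayload);
}

// 2. One implementation per known self-healable failure scenario / alert clause.
class HungStateAlertClause implements AlertClauseInterface {
    @Override
    public boolean matches(String alertPayload) {
        return alertPayload.contains("E1001"); // hypothetical "hung state" error code
    }

    @Override
    public void selfHeal(String alertPayload) {
        // e.g. call the same tooling backend the IMD UI uses to restart the connector
    }
}

// 3. The Kafka consumer's main method: the entry point to the self-healing logic.
public class SelfHealingConsumer {
    public static void main(String[] args) {
        List<AlertClauseInterface> clauses = List.of(new HungStateAlertClause());

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "uip-self-healing");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("uip-self-healing-alerts"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    // Route each alert to whichever clause claims it.
                    for (AlertClauseInterface clause : clauses) {
                        if (clause.matches(record.value())) {
                            clause.selfHeal(record.value());
                        }
                    }
                }
            }
        }
    }
}
```

The nice part of this shape is that supporting a new self-healable scenario only means adding one more AlertClauseInterface implementation — the consumer loop itself stays untouched.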

And voila! Once your alerts start flowing into your Kafka topics from the alerting tools (like PagerDuty, Splunk, Wavefront etc), they’ll be picked up by the Kafka consumer, which takes the self-healing actions.

Impact

Drumrolls! Alright, let’s quickly dive into the impact that self-healing and tooling have brought about for UIP.

Previously, the alerts that were getting manually resolved looked like this — taking around 20 mins on average to resolve. And because the resolution was manual, it also carried the risk of human error (in case someone doesn’t understand the runbook, or genuinely misses one of the fixes).

Before self-healing: Manual resolution of alerts takes 15–30 mins

After kick-starting self-healing, the same alerts now get resolved automatically within a minute by our Self-Healing Bot — with no scope for manual error!

After self-healing: Programmatic resolution of alerts takes < 1 min

Final Takeaways on Developer Productivity

At first sight, it might seem like — “hey, this is just saving me 15 minutes of my time. Is it even worth putting effort into developing a self-healing/tooling framework for my team?”

A moment of objective truth: Self-Healing has resolved 1200+ alerts automatically and potentially saved 400+ hours of engineering bandwidth so far!

With more alert clauses/failure scenarios being supported via such a self-healing framework, the benefits quickly become multifold. The best part is that it’s a 1-time development effort with a compounding effect — one that keeps saving developer bandwidth quarter over quarter!

Compounding effects of tooling and automation

There’s also an auxiliary benefit of self-healing and tooling. With the job market being hot and a lot of folks switching to new opportunities, a lot of teams are going through a phase of providing context and knowledge transfers (KTs).

With more tooling and automation, less context needs to be handed over — and the smoother the onboarding of new hires becomes! 🎉

Psst psst, interested in applying for a career at Intuit? Have a look at our openings and apply for the position that interests you!

Authors’ Bio

Athitya Kumar is a Software Engineer-2 & Open-Source Community Lead at Intuit India. He has worked on the various ingestion & self-serve capabilities of the Unified Ingestion Platform, and also kick-started the self-healing capabilities at UIP. While not wearing the “work hat”, Athitya loves reading books, writing blogs, and binging TV series!

Shivansh Maheshwari is a Software Engineer-2 at Intuit and is currently working on the Self-Serve component of the Unified Ingestion Platform. He has worked on various self-serve capabilities and has always been passionate about learning new technologies. Besides work, Shivansh loves watching sports.
