How We Improved MySQL Performance at Intuit’s Batch Processing Platform (BPP)!

6 min readApr 5, 2023

Any back-end service generally uses one or more data stores, for persisting & retrieving data — to be used for an amazing hip-and-cool UI, business functionality and metrics analytics with visualisations. Depending on the use case, developers typically have to choose between Relational (like MySQL, Postgres, MSSQL, etc) and Non-Relational Data Stores (like DynamoDB, MongoDB, etc)!

But irrespective of the chosen Data Store, the performance of persisting & retrieving data is very important for Technical & Operational Excellence, as well as Customer Experience! In this blog post, we’ll be sharing our experience of debugging & optimising our Relational MySQL DB performance.

Context time

I’m currently working as a Senior Software Engineer with the Batch Processing Platform (BPP) at Intuit. We’re building self-serve scheduling & orchestration capabilities in a platformed way; providing configurable runtimes (Data Bricks, Kubernetes, Amazon EMRs, etc) for the entire company to onboard & leverage for running Big-Data jobs, as well as trivial Cron jobs!

As a part of the architecture that powers BPP behind the scenes; we have a couple of different micro-services:

Services that interact with Open-Source projects (like Argo Events, and Argo Workflows)
Services for runtime-specific integrations (like Data Bricks Notebooks, and Amazon EMR SDKs)
Services with general CRUD functionality to persist/retrieve metadata regarding the code to execute, spark configurations, status/history of all previous executions etc.
Services to send out failure / alert notifications etc

Schrödinger’s Intermittent Failures

All the above services run in harmony — most of the time. However, there were a couple of times when the different micro-services would start throwing intermittent timeout failures related to running some SELECT / UPDATE queries on MySQL and taking more than 30 seconds for the APIs.

That’s not bad, we can fix it! But the issue is — these timeout failures were not deterministically reproducible, as similar queries succeeded for other instances an hour earlier as well as an hour later. :-/

Real footage of Developers encountering & debugging intermittent failures

Looking at the non-reproducible / intermittent nature of the failures, I had a hunch that it was more probable to be a resource issue rather than a code/business logic issue. By looking at the Amazon RDS metrics on the UI, we were able to correlate that the intermittent failures happened around the same time windows when the CPU utilisation spiked to more than 90% (which is a pretty high CPU utilisation for a Data Store!)

CPU utilization consistently spikes > 90%, for the Data Store — and looks eerily similar to the Stonks meme!

Another interesting observation: Even though the time windows of failure were intermittent, the queries that were failing were consistently recurring and not intermittent. This pointed in a related direction of debugging, that maybe some queries were slow & unoptimised — which would then timeout & fail when the CPU utilisation is also high (ie, a lot of unoptimised queries were being submitted concurrently by the different micro-service calls)

Screenshot from AWS RDS Query Insights, for queries with the highest latency

Windows of Opportunities

Now that we had a hunch that we had some queries to optimise, we wanted to dig deep & find out objectively what queries/parts of code had to be optimised further.

This is where typically tools like AppDynamics, Data Dog, Sentry, New Relic etc come into the picture — they maintain exact metrics on which part of code took how much time, how many errors were faced, when the errors came up, what the frequent API endpoints invoked are etc. We used one of these tools internally at Intuit for all our micro-services & we were able to pinpoint around 7 unoptimised queries in total.

For these 7 queries, it was easy to reproduce the relative slowness even without the high CPU utilisation — as these queries always took around 2+ seconds, compared to the expected couple of milliseconds!

Real footage of UI & API waiting for slow DB queries to be completed!

Additionally, we also had to check for a couple of different scenarios in our code — they were not causing any performance issues in our case, but they’d be good checks to do in general if you’re facing a similar issue:

Lookout for any custom queries you’re invoking through your back-end (ie, outside the ORM model of your web framework — @Query annotations in case of Java Springboot — especially sub-queries or joins)
Lookout for the N+1 queries issue, if there are one-to-many or many-to-many mappings used

A Developer’s Closure

With the debugging done, we’re almost 80% done with wrapping up the performance issue. The easy fix for improving the performance of unoptimised queries — is to either change the queries themselves to use indices, or to add the relevant indices if they’re missing.

For example, let’s say we have an emr_steps table with queries being made on step name and step status like:

SELECT * FROM emr_steps WHERE step_name = "Setup" and step_status = "RUNNING";

To optimise such a query — we can add the MySQL index like this:

ALTER TABLE emr_steps ADD INDEX name_with_status (step_name, step_status);

PS: While adding composite indices, please keep in mind that the sequence of indices is very important — the first index should compulsorily be present in the query, for the index to be used by the query.

After adding an index for optimising all 7 unoptimised queries, we saw a steep reduction of ~70% drop in CPU utilisation! We also haven’t seen any related intermittent errors in the last 2 weeks — which is especially awesome as we just crossed one of the crucial Tax Peaks in Jan 2023, for Intuit with better Operational Excellence! 🎉👏

CPU utilization dropped significantly, after making the above changes!

Huge shoutout to Narayanan & Ramez for pointing me towards the different ways of debugging & fixing this issue in parallel!

Takeaways!

Fix the root cause rather than fixing the issue: High CPU utilisation was a mere symptom of unoptimised use of memory resources, which is why query optimisation was a better solution rather than just blindly increasing the CPU memory resources / upgrading the RDS instance type!
Regular Metrics & Monitoring: Regularly keep measuring & improving your code metrics, to proactively figure out scenarios to optimise.

Continuous monitoring & improvement of metrics are what we need!

Incremental Tiny Gains: While creating any new tables, or columns or changing back-end API logic; always ensure that the queries performed are indexed AND the index will be used by the query.
Recursive Blog Takeaway: When you do cool stuff like this and learn in the process, don’t forget to blog about your learnings — so that it can help folks across the board.