Wondering how CI tests have been passing all this while

Put out a three-day-old fire on our CI environment today and had to blog about it.

TL;DR: Use SA_PASSWORD instead of MSSQL_SA_PASSWORD with the MSSQL Docker image.

We use Azure Pipelines to run tests, among other things. The test setup for one particular repo is special, though, because its tests run on self-hosted agents that live in an Azure Kubernetes Service (AKS) cluster. The cluster has the external dependencies set up, including SQL Server.

Everything had been working fine until recently, when we started seeing this:

One or more errors occurred. (A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server))
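
Error 40 from the TCP provider says the client could not open a TCP connection at all, so a useful first check is whether anything is listening on the SQL Server port. Here is a minimal sketch in Python, assuming the default SQL Server port 1433. The function name and the host name in the usage note are made up for illustration and are not part of our pipeline.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries until roughly timeout_s seconds have passed, then returns False.
    This only proves something is listening; it does not log in to SQL Server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (refused, timed out,
            # unresolvable host) when the port cannot be reached.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Calling something like `wait_for_port("mssql", 1433)` from the agent (with `mssql` standing in for whatever host name the tests use) separates "the server never came up" from "the server is up but the connection string is wrong".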

That was something we weren't expecting. I spent at least 8 hours over 3 days (I had other things on my plate, unfortunately) and finally figured out that it had to do with the MS SQL password environment variable when I checked the image's Docker Hub page: it had to be SA_PASSWORD, not MSSQL_SA_PASSWORD.
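
A cheap guard against this kind of mix-up is a pre-flight check on the environment handed to the container. The sketch below is only an illustration, and the function name is hypothetical. It assumes, as was the case for the image in this post, that the image reads SA_PASSWORD. Which variable name a given image tag honors can differ, so check the image's own documentation before copying the constants.

```python
from typing import Mapping, Optional

# Assumption: names taken from this post; verify against your image tag.
EXPECTED_VAR = "SA_PASSWORD"        # the name that worked here
MISTAKEN_VAR = "MSSQL_SA_PASSWORD"  # the name the setup had been using


def check_sa_password_env(env: Mapping[str, str]) -> Optional[str]:
    """Return a warning if the SA password is missing or set under the
    other name; return None when EXPECTED_VAR has a non-empty value."""
    if env.get(EXPECTED_VAR):
        return None
    if env.get(MISTAKEN_VAR):
        return (f"{MISTAKEN_VAR} is set but {EXPECTED_VAR} is not; "
                f"the container may not pick up the password")
    return f"{EXPECTED_VAR} is not set"
```

Running it against `os.environ`, or against the env section parsed out of a deployment manifest, and failing the job on a non-None result would surface the problem before the tests ever try to connect.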

I still fail to understand why it worked until Friday, but hey, at least it's working now.

CI began to fail on Friday on a PR after I merged master into it. I could put no time into it on Monday, and on Tuesday I realized it was happening on other PRs too. It took me until Wednesday (today) evening to get the CI fixed. (I was lucky nothing had to be merged into master urgently during this time.)

Thankfully, I moved the setup from Terraform-provisioned VMs to AKS about two months back, which made it easier to spot and fix the issue.

But this really got me thinking about how I've been prioritizing tasks at work, and whether I've really gotten it right.