Fragments of memoirs

May 01, 2023 16:08


From various discussions on LJ


life, topic of the day


grumbler October 31 2023, 12:18:52 UTC

II. It Fucking Sucks

It's an insane dumpster fire spiderweb of technical debt and it's only like one week old. Here are some fun details.

I get a friend of mine hired (big fan of nepotism), and he finds, on day one, a file in the project's repository that deletes prod using our CI/CD pipelines if it is ever moved into the wrong folder. It comes complete with the key and password required for an admin account. It was produced by the former lead engineer, who moved on to a new role before his sins could catch up with him.
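Out of morbid curiosity, here is a hypothetical Python reconstruction of how a file like that works; the CI convention, the stack name, and the credentials below are all invented, since the comment doesn't say how the real one was wired up:

    import boto3

    # The real file shipped with live admin credentials committed right
    # next to the destructive call; these placeholders stand in for them.
    ADMIN_KEY = "AKIA...admin-access-key..."
    ADMIN_SECRET = "...admin-password..."

    session = boto3.Session(
        aws_access_key_id=ADMIN_KEY,
        aws_secret_access_key=ADMIN_SECRET,
    )

    # Assumed CI convention: anything in a certain folder gets executed
    # on merge, so moving this file there tears down the prod stack.
    session.client("cloudformation").delete_stack(StackName="prod")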

The entire thing is stitched together by spreadsheets that are parsed by Python, dropped into S3, parsed by Lambdas into more S3, the S3 files are picked up by MongoDB, then MongoDB records are passed by another Lambda into S3, the S3 files are pulled into Snowflake via Snowpipe, the new Snowflake data is pivoted by a Javascript stored procedure into a relational format... and that's how you edit someone's database access. That whole process is to upload like a 2KB CSV to a database that has people's database roles in it.
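To make the shape of this concrete, here is a minimal sketch of a single hop in that chain, assuming the standard S3-put trigger event: a Lambda that reads a CSV out of S3, reshapes the rows, and drops them into another bucket for the next stage to find. The bucket and key names are invented.

    import csv
    import io
    import json

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Locate the CSV that just landed (standard S3 put-event shape).
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))

        # Re-serialize as JSON lines and hand off to the next stage,
        # which in this architecture is... another S3 bucket.
        out = "\n".join(json.dumps(row) for row in rows)
        s3.put_object(
            Bucket="next-stage-bucket",  # invented name
            Key=key + ".jsonl",
            Body=out.encode("utf-8"),
        )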

This is considered more auditable.

Everything is transformed into a CSV because the security team demanded something that could be easily scanned for malicious content. Then they never deployed the scanning tool, so we have all the downsides of CSVs and none of the upsides.

Every Lambda function, the backbone of all the ETL pipelines, starts with counter = 1 because one of the early iterations used a counter, and people have just been copying that line over and over. Senior data engineers have been copying that line over and over.
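The pattern, roughly (the handler body here is invented; the first line is the point):

    def handler(event, context):
        counter = 1  # vestigial: copied from an early iteration, never read
        # ...actual ETL logic follows; counter is never referenced again...
        return {"status": "ok"}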

The test suites in the CI/CD pipelines have been failing for months, because someone, while debugging, used the Linux tee command to log errors to both stdout and a file at the same time. A shell pipeline reports the exit status of its last command, so tee exiting successfully was overwriting the error code from the failing tests.
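For anyone who wants to reproduce the trap, here is a minimal demonstration driven from Python; the failing test suite is stubbed with false:

    import subprocess

    # A pipeline reports the exit status of its LAST command, and tee
    # exits 0 even when the command upstream of it failed.
    broken = subprocess.run(["bash", "-c", "false | tee test.log"])
    print(broken.returncode)  # 0 -- the failure is swallowed, CI stays green

    # With pipefail, the pipeline reports the first non-zero status.
    fixed = subprocess.run(["bash", "-c", "set -o pipefail; false | tee test.log"])
    print(fixed.returncode)  # 1 -- the failure propagates to CI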

To get access to the password for any API we need to hit, you search for something like service-password in an AWS service, which returns the value... service-password (as in, literally all the values are the same as the keys), then you use that to look up the actual password in a completely different service. No one knows why we do this.
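A sketch of that two-hop dance, assuming the first service is SSM Parameter Store and the second is Secrets Manager; the comment only says "an AWS service" and "a completely different service", so both choices (and the key name) are guesses:

    import boto3

    ssm = boto3.client("ssm")
    secrets = boto3.client("secretsmanager")

    # Hop 1: look up "service-password"... and get back "service-password".
    hop1 = ssm.get_parameter(Name="service-password", WithDecryption=True)
    secret_id = hop1["Parameter"]["Value"]  # equals the key, always

    # Hop 2: use the value that equals the key to fetch the real password
    # from a completely different service. No one knows why.
    password = secrets.get_secret_value(SecretId=secret_id)["SecretString"]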

The script that generates configuration files for our pipelines starts with 600 lines of comments, because senior engineers have been commenting lines out in case they're needed later. The lines just set the same variables to different values, and every old version is in the GitHub history anyway.
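The top of such a script, reconstructed with invented variable names and values:

    # BATCH_SIZE = 100                    # commented out "in case we need it later"
    # BATCH_SIZE = 250
    # SNOWFLAKE_WAREHOUSE = "LOAD_WH_XS"
    # SNOWFLAKE_WAREHOUSE = "LOAD_WH_S"
    # ...roughly 595 more lines of the same...
    BATCH_SIZE = 1000
    SNOWFLAKE_WAREHOUSE = "LOAD_WH_M"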

This is at an organization that some percentage of readers will recognize on sheer brand strength if they're in my country.

I'm not even getting started, but we have to stop for now because I am going to catch fire. These details are important because now you understand the kind of operational incompetence that allows you to waste so much money on processing <1TB of data per day that it dwarfs your team's salary.
