A few distractions… openwsman and SQL Server taming

Feb 11, 2010 20:51


Tasks keeping me away from what I’d rather be doing these past few days:

Did some updates on tardis, our general-purpose Linux server (all the servers have names from Doctor Who), including the Dell OpenManage tools. Reboot. Simple, right?

The machine responded to pings, and nothing else.

It was 2pm. Traffic in LA becomes a nightmare somewhere around 3:30pm. Jumped in the car to crash cart the server (because the tech on site didn’t know what a crash cart was and, when I described it, said ‘that sounds complicated.’ Not happy with our colo right now.)

Arrived to find the server paused - not hung, just waiting - on a “Starting openwsmand…” message. Turns out there’s a new service in the latest version of OpenManage that requires a new OpenSSL certificate to function. This, on its own, is okay - but two problems:
  1. The cert is generated by the boot-time init.d script if it’s missing.
  2. The cert is generated using /dev/random as the entropy source.

The latter is a reasonable security choice, but it should never be paired with the former: I’ve had servers hang for hours waiting for /dev/random to accumulate enough entropy.
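A quick way to see whether a box is entropy-starved, and a sketch of a workaround: generate the certificate while the system is up (and entropy is plentiful) so a boot-time script finds it and skips generation. The directory and filenames below are placeholders for illustration, not OpenManage’s actual paths:

```shell
#!/bin/sh
# How many bits of entropy the kernel pool currently holds; reads from
# /dev/random will block once this runs low.
cat /proc/sys/kernel/random/entropy_avail 2>/dev/null || echo "entropy_avail not available"

# Pre-generate a self-signed cert ahead of the reboot. CERT_DIR and the
# filenames are assumptions -- check the init script for the real paths.
CERT_DIR="${CERT_DIR:-./openwsman-demo}"
mkdir -p "$CERT_DIR"
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$CERT_DIR/serverkey.pem" \
  -out "$CERT_DIR/servercert.pem" \
  -days 365 -subj "/CN=$(hostname)"
```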

This was easy enough to fix, but the stress of driving like a madman there (and back) to make it before traffic shut down all routes to West LA was something I could do without. I’m going to submit a patch to the project and see if they’ll fix this ridiculous behavior, and I’ve started looking into a remote KVM solution like kvm2ethernet - just call the colo and ask them to plug into a particular server. Thanks to this post for cutting the debugging time massively.

The other issue was that, for chunks of today, customers weren’t able to sign up because a lock was being held on one of our DB tables. We purge our database (about 50 GB) monthly, but the cruft of leftover billing records takes up huge amounts of space, and deleting them can be a problem: long table scans, during which Microsoft SQL Server takes a table lock - and that’s the ball game. The credit card server can’t record that a valid charge was placed, so it terminates instead. We started with a query from our marketer/data-analysis guy looking like:

DELETE FROM billing
WHERE start_date_time >= '18-JUL-2009'
  AND start_date_time < '24-JUL-2009'
  AND node_type NOT IN (3,4,5)
(He wanted to delete up until 30-AUG, but was slicing it up in the hopes of avoiding this problem.)

The above is about 300,000 rows.

A few issues, however:
  1. billing has a clustered index on account_id. This makes perfect sense: the data is almost always referenced with respect to a particular customer, and keeping those records adjacent to each other on storage is common sense. However, the above query would be jumping all around the 9 GB table removing rows.
  2. billing doesn’t have an index on node_type, meaning each row has to be fetched before it can be considered for deletion.

The latter would be less of a problem if the select and the delete were separated, but the table lock was held throughout the query.

In looking to solve it, one approach was to force MSSQL to use ROWLOCK (and disable escalation from ROWLOCK to TABLELOCK), but this was going to be a performance hit. I considered trying NOLOCK, but I wasn’t sure what the ramifications would be, and I really didn’t want to spend hours recovering a crashed database or corrupted data.
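For reference, the hint-based approach I decided against would look roughly like this (a sketch only, using the same table and columns as above - I hadn’t verified how escalation would actually behave):

```sql
-- Ask for row locks explicitly. SQL Server may still escalate to a
-- table lock once enough rows qualify, which is exactly the behavior
-- I wasn't willing to bet on.
DELETE FROM billing WITH (ROWLOCK)
WHERE start_date_time >= '18-JUL-2009'
  AND start_date_time < '24-JUL-2009'
  AND node_type NOT IN (3,4,5)
```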

The final solution was to carve the deletes up into more manageable bits. SQL Server’s SET ROWCOUNT limits how many rows a statement affects - so we could delete, say, 2000 rows, pause (letting other things have access to the table), and then continue. And, now that we know about the clustering, why make the DELETE run across the entire table - why not let it trim one section at a time? And thusly we have…
(I’m not really familiar with SQL Server syntax, so this is a first effort. It’s lousy code, but a decent query.)

DECLARE @ACCOUNTSTEP int
DECLARE @WINDOWMIN int
DECLARE @WINDOWMAX int
DECLARE @ACCOUNTMAX int
DECLARE @WINDOWCOUNT int
DECLARE @STEPS int

SET @STEPS = 20

SET @ACCOUNTMAX = (SELECT max(account_id) FROM accounts)
SET @ACCOUNTSTEP = (ceiling(@ACCOUNTMAX * 1/@STEPS))
SET @WINDOWMAX = 0
SET @WINDOWCOUNT = 0
PRINT CAST(@STEPS AS CHAR(10)) + '+1 Steps of Size ' + CAST(@ACCOUNTSTEP AS CHAR(10)) + ' leading up to ' + CAST(@ACCOUNTMAX AS CHAR(10))
SET ROWCOUNT 2000

moreaccounts:
SET @WINDOWCOUNT = @WINDOWCOUNT + 1
SET @WINDOWMIN = @WINDOWMAX
SET @WINDOWMAX = @ACCOUNTSTEP * @WINDOWCOUNT
IF @WINDOWMIN > @ACCOUNTMAX GOTO done
PRINT 'Now processing accounts between ' + CAST(@WINDOWMIN AS char(10)) + ' and ' + CAST(@WINDOWMAX AS char(10))

deletemore:
PRINT 'Deleting 2000 rows.'
WAITFOR DELAY '0:0:01'
DELETE FROM billing
WHERE account_id >= @WINDOWMIN AND account_id < @WINDOWMAX
  AND start_date_time >= '18-JUL-2009' AND start_date_time < '30-AUG-2009'
  AND node_type NOT IN (3,4,5)
IF @@ROWCOUNT > 0 GOTO deletemore

PRINT 'Done with this set. Sleeping.'
WAITFOR DELAY '0:0:03'
GOTO moreaccounts

done:
Which effectively breaks down to:
  • Get the highest account_id
  • For each twentieth of the account range:
    • Delete 2000 rows. Wait a second. Repeat until no matching rows remain.
    • Wait three seconds.

Ran the query - no locking issues at all, and the table was purged in about an hour. Victory!

I wanted to get Zabbix monitoring working for our Asterisk boxen - I went a bit overboard, and the result was zasterisk.

Mirrored from The Second Order Effect.

