strange vmware network issues

Nov 01, 2009 03:40

Update... SOLVED.

The default set of IPS rules configured in the router interpreted my testing as a DoS attack and reset the connection. Grumble. Just one more reason why it's cool that my job is paying for me to go take two weeks of CCNA training over the next couple months.

I've been trying to debug a weird message in some Apache logs lately... "software caused connection abort at INSERT_NAME_OF_MOD_PERL_SCRIPT here at line XXX" - and in my travels, I've run into something really, really odd. I have a static HTML file - we'll call it "test.html" - that's identical on two servers. Server A is a low-end server machine on a decent network with no firewall or NAT in front of it. Server B is a VMware guest on a 10.0.0.0/24 network that IS behind NAT and a firewall (Cisco 1812). Server B's network is decent, too. The host hardware for server B is much better than that of server A.

I'm running "ab" (ApacheBench) against both boxes, with the following command string:
ab -n 500 -c 100 http://www.my.server.name.example.com/test.html

500 requests, 100 concurrent, retrieve me a simple HTML file.

Server A has no problem with this, as you can see below:
Concurrency Level: 100
Time taken for tests: 2.392 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 977832 bytes
HTML transferred: 833974 bytes
Requests per second: 209.01 [#/sec] (mean)
Time per request: 478.450 [ms] (mean)
Time per request: 4.785 [ms] (mean, across all concurrent requests)
Transfer rate: 399.17 [Kbytes/sec] received
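
For anyone squinting at the two "Time per request" lines: they fall out of the totals with simple arithmetic. Here's a minimal Python sketch of that arithmetic, fed with the numbers from Server A's run. The function name ab_summary is made up for illustration and is not part of ab.

```python
def ab_summary(complete, concurrency, seconds):
    """Recompute ab's headline figures from its totals.

    ab_summary is a hypothetical helper name, not something ab provides.
    """
    # Requests per second: completed requests divided by wall-clock time.
    rps = complete / seconds
    # Mean time per request as seen by one of the concurrent clients.
    per_request_ms = concurrency * seconds / complete * 1000
    # Mean time per request averaged across all concurrent requests.
    per_request_all_ms = seconds / complete * 1000
    return rps, per_request_ms, per_request_all_ms


# Server A: 500 requests, 100 concurrent, 2.392 seconds.
rps, per_req, per_req_all = ab_summary(500, 100, 2.392)
print(f"{rps:.2f} req/s, {per_req:.1f} ms, {per_req_all:.3f} ms")
```

The results land within a rounding error of what ab printed (209.01, 478.450, 4.785), presumably because ab works from the unrounded elapsed time rather than the 2.392 shown in the report.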

Server B, on the other hand, says "apr_socket_recv: Connection reset by peer (104)" and doesn't even complete a single request. WTF? So, I start poking around. First, I check netstat. There are no connections at all from my test box. Then, I check the Cisco's NAT table. There are a shit-ton of connections. There is no NAT rate limiting configured on the router. Watching the router's CPU while these 100 connections come in shows no spike in load. The router has over 50% free RAM. I've tried this test from different boxes on different networks so as to try to rule out any potential issues with my home ISP. Same result.
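
Since ab gave up before completing anything, it can help to have a cruder client that keeps going and simply counts how each request ended. Below is a rough Python sketch of that idea, standard library only. The names fetch_once and probe, the outcome labels, and the 5 second timeout are all assumptions for illustration; this is not a replacement for ab, and it does not measure throughput.

```python
import errno
import http.client
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_once(host, port, path, timeout=5.0):
    """Make one GET request and return a short label for how it ended."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return "ok" if resp.status == 200 else f"http_{resp.status}"
    except ConnectionResetError:
        # Covers ECONNRESET (errno 104 on Linux) and also the case where
        # the peer closes the connection without sending a response.
        return "reset"
    except OSError as exc:
        # Refused connections, timeouts, unreachable hosts, and so on.
        return f"oserror_{errno.errorcode.get(exc.errno, 'unknown')}"
    except http.client.HTTPException:
        return "http_error"
    finally:
        conn.close()


def probe(host, port, path, total=500, concurrency=100):
    """Issue `total` requests with up to `concurrency` in flight at once
    and tally the outcomes instead of stopping at the first error."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = pool.map(lambda _: fetch_once(host, port, path), range(total))
        return Counter(outcomes)


if __name__ == "__main__":
    # Same placeholder hostname as the ab command above.
    print(probe("www.my.server.name.example.com", 80, "/test.html"))
```

A tally that is mostly "reset" with no matching connections on the server side would point at something in the path between client and server rather than at Apache itself.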

So, what's left? It has to be something with VMware, right? But let's think about what that might mean - VMware ESX/ESXi 4.0, their flagship product, cannot handle 100 simultaneous connections to a single VM? That can't possibly be true - if it were, ESX would be useless for anything more than hobby applications.

The VM in question is a 32-bit VM (CentOS 5.4), using the VMXNET virtual network adapter. Going to try a 64-bit VM running VMXNET3. If that still doesn't work, I'm going to need to take a machine down to the data center that's not virtualized and see what happens there. If I can't figure out how to get ESX to accept and route more than 100 simultaneous HTTP connections, that's going to really put the brakes on some virtualization stuff we're planning to roll out at work.

Update... 64-bit VM with E1000 == no good. VMXNET3 == no good.

Update 2... can't blame VMware. Tried a similar test hitting a VMware guest on a host on another network. No problems:
Concurrency Level: 100
Time taken for tests: 2.607053 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 143500 bytes
HTML transferred: 9500 bytes
Requests per second: 191.79 [#/sec] (mean)
Time per request: 521.411 [ms] (mean)
Time per request: 5.214 [ms] (mean, across all concurrent requests)
Transfer rate: 53.70 [Kbytes/sec] received

So what's left? A: Server B's host hardware. B: the router. C: the switch. I'm pretty sure it's not A.

I just found on Netgear's website that there's a firmware upgrade for the GS724TR, so I've applied it, but still no joy.

@$@!@#!@#!%#$R>!@>#>!@>>??!@#

wtf, ugh, vmware
