solarbird: (molly-thats-not-good-green)
[personal profile] solarbird
eta: A solution? timeout_req defaults to 2 seconds and is how long varnish waits for request header data from the client. Setting that to 10 seconds makes the problem go away. (I just picked a big number - most of the timeouts below ranged between >2sec and <6sec.)

This could sorta make sense except the problem didn't repro from off the LAN, an the LAN is much faster than our WAN uplink. Which implies that there is something our LAN is doing at random which radically slows down packet travel from client machines to server machines with fixed IP addresses.

This is so weird.

Here's pinging the webserver from this laptop:

round-trip min/avg/max/stddev = 1.543/4.381/7.130/1.321 ms

Here's pinging this laptop from the webserver:

rtt min/avg/max/mdev = 2.252/53.628/78.781/24.769 ms

Is this relevant? Possibly? I don't know. But I do know that the 'fast route' also goes from the network address translation section of the LAN over to the fixed-IP section. That's got to be part of it.

-----------

Okay, so, when I get these dropped server connections - which seem to happen only on the LAN, not in the outside world - I see a bunch of these blocks in varnish's log. But I don't know what these are, other than oh hi timeoutes.

From the client standpoint, there's no waiting; you click on the link and instantly get a "server has dropped connection" message. Boom. Much faster than actually loading a page.

Anybody know how to read varnish logs?

* << Session >> 65693
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53818 :80 173.160.243.43 80 1469083778.683280 14
- Link req 65694 rxreq
- Link req 65696 rxreq
- Link req 65698 rxreq
- Link req 65700 rxreq
- Link req 65702 rxreq
- Link req 32788 rxreq
- Link req 360472 rxreq
- SessClose RX_TIMEOUT 6.262
- End

* << Session >> 146
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53820 :80 173.160.243.43 80 1469083780.279525 16
- Link req 147 rxreq
- Link req 149 rxreq
- Link req 151 rxreq
- Link req 153 rxreq
- Link req 229406 rxreq
- Link req 327718 rxreq
- SessClose RX_TIMEOUT 4.675
- End

* << Session >> 294941
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53819 :80 173.160.243.43 80 1469083780.278725 15
- Link req 294942 rxreq
- Link req 294944 rxreq
- Link req 294946 rxreq
- Link req 196635 rxreq
- SessClose RX_TIMEOUT 4.685
- End

* << Session >> 360467
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53824 :80 173.160.243.43 80 1469083780.331048 20
- Link req 360468 rxreq
- Link req 65704 rxreq
- SessClose RX_TIMEOUT 4.641
- End

* << Session >> 98345
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53821 :80 173.160.243.43 80 1469083780.292602 18
- Link req 98346 rxreq
- Link req 32782 rxreq
- Link req 360470 rxreq
- SessClose RX_TIMEOUT 4.689
- End

* << Session >> 327711
- Begin sess 0 HTTP/1
- SessOpen 10.1.10.206 53823 :80 173.160.243.43 80 1469083780.331335 22
- Link req 327716 rxreq
- SessClose RX_TIMEOUT 4.660
- End


using varnishlog directly, it looks more like this:

 328293 SessClose      c RX_TIMEOUT 10.560
    328293 End            c
    196743 SessClose      c RX_TIMEOUT 5.183
    196743 End            c
    163890 SessClose      c RX_TIMEOUT 5.202
    163890 End            c
    360556 SessClose      c RX_TIMEOUT 3.808
    360556 End            c
    393291 SessClose      c RX_TIMEOUT 3.823
    393291 End            c
    131082 SessClose      c RX_TIMEOUT 3.837
    131082 End            c
         0 ExpKill        - EXP_Expired x=32809 t=-10
         0 ExpKill        - EXP_Expired x=393245 t=-10
         0 CLI            - Rd ping
         0 CLI            - Wr 200 19 PONG 1469085933 1.0
         0 ExpKill        - EXP_Expired x=393247 t=-10
    196758 SessClose      c RX_TIMEOUT 5.413
    196758 End            c
    229455 SessClose      c RX_TIMEOUT 5.409
    229455 End            c
    131083 SessClose      c RX_TIMEOUT 5.412
    131083 End            c
    163908 SessClose      c RX_TIMEOUT 5.413
    163908 End            c
    360559 SessClose      c RX_TIMEOUT 5.409
    360559 End            c
    262286 SessClose      c RX_TIMEOUT 5.525
    262286 End            c

Help?

Date: 2016-07-21 09:12 am (UTC)
vatine: Generated with some CL code and a hand-designed blackletter font (Default)
From: [personal profile] vatine
This looks like one of those cases where I would find me two spare switches (or network splits, IF you can find them, but a few-port switch with a SPAN port is probably cheaper...) and do simultaneous packet dumps as close to "server" and "client" as possible, trying to identify what's actually happening on the network layer.

Having the differential packet dumps (with well-synchronised clocks, of course) is usually enough to form a theory as to "what" and once you have that, it's usually just a matter of moving one network element at a time until you've identified "why".

I think the absolute weirdest thing I've ever seen was a "twisted pair to fibre" converter that had malfunctioned and dropped packets with a certain bit pattern. It just happened to be "UDP packets between 512 and 767 bytes long" and the initial symptom was "everything except VOIP works!".

Date: 2016-07-21 07:54 pm (UTC)
vatine: Generated with some CL code and a hand-designed blackletter font (Default)
From: [personal profile] vatine
I've had some real-life network problems I've used as interview troubleshooting questions, for people claiming to be Real Good at networking.

August 2017

S M T W T F S
  12 3 4 5
67 8 910 1112
1314 15 16 171819
20212223242526
2728293031  

Most Popular Tags

Page Summary