Okay, so, once I got enough buffers on the 100Mb backbone to be able to make any sense of the 100Mb hub's lights and tell what was going on, I found that lodestone was flooding the LAN with some sort of packet spew. The flood was not heavy enough to interfere with traffic between most machines. However, if it was on the net while I was bringing up door's LAN-side card (eth0; eth1 is the WAN card), the driver would initialise and appear to transmit something, but the card could not actually talk to anything. With lodestone off the LAN, I was able to get door's card up; it would then survive the lodestone onslaught, at least for a bit.
(I say "at least for a bit," because I had to restart door's LAN card this morning, too. But it came up cleanly and we were online for a few hours before falling down again. door is our router.)
Meanwhile, newmoon's TCP stack fell over as well, but unlike door, the IP layer stayed up; ICMP worked. Restarting the entire network layer brought it back around.
Back at lodestone, shutting down the network did not stop the strange-activity onslaught. That stayed on until power-down, indicating to me that it was a card issue. As lodestone has been up for over a year (I think it said 380 days), I'm initially willing to ascribe this to the bogon flux reaching critical. (Similarly, a simple reboot was not enough to bring lodestone back up; however, a power-cycle brought it up normally.)
Currently, we are back up and seem reasonably normal. However, I've never been happy with the driver situation on door, which exhibits other flaky behaviours as well. (The drivers are self-compiled Netgear FA312 drivers on Debian, with the source grabbed from a Red Hat compile package.) For example, almost every successful packet also increments the carrier-error counter. I don't even know what a carrier error is in this context. This appears to be a driver bug. Since our uptime on door is on the order of four months - with occasional incidents like this, but a little different each time - it's not exactly a first-tier issue. And this time, the problem was, I think, triggered by another machine. But it still bugs me.
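For anyone who wants to put a number on that carrier-error behaviour, here is a minimal sketch. It assumes the counter in question is the `carrier` column in the Transmit half of Linux's /proc/net/dev, and that the file has the usual layout of two header lines followed by 16 numeric fields per interface. The function names are made up for illustration; this is not anything from the FA312 driver itself.

```python
def parse_proc_net_dev(text):
    """Return {iface: (tx_packets, tx_carrier)} from /proc/net/dev-style text.

    Assumes the usual layout: two header lines, then one line per interface
    with 8 receive fields followed by 8 transmit fields.
    """
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # unexpected layout; skip rather than guess
        tx_packets = int(fields[9])   # Transmit "packets" column
        tx_carrier = int(fields[14])  # Transmit "carrier" column
        stats[name.strip()] = (tx_packets, tx_carrier)
    return stats


def carrier_ratio(tx_packets, tx_carrier):
    """Carrier errors per transmitted packet (0.0 if nothing was sent)."""
    return tx_carrier / tx_packets if tx_packets else 0.0


if __name__ == "__main__":
    with open("/proc/net/dev") as f:
        for iface, (pkts, carrier) in parse_proc_net_dev(f.read()).items():
            ratio = carrier_ratio(pkts, carrier)
            print(f"{iface}: tx_packets={pkts} carrier={carrier} ratio={ratio:.2f}")
```

A ratio close to 1 on eth0 alongside otherwise working traffic would match the "almost every successful packet" pattern described above; comparing it against eth1 or another machine on the same hub might help separate a driver quirk from a real link problem.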
Anyway, I'm not really asking for help. But if somebody has something specific to say about this, yay.