Lux - Down

Incident Report for Darren Nathanael's Infra

Postmortem

The techs left the Junos install media in the unit, so when the router rebooted, it booted into the installer and sat there waiting to be used. I can't believe it; once they pulled the USB key and rebooted, it went back into the OS. I literally told them to throw that USB key out 4 years ago. It has Junos 18 on it, absolute clownery.

Posted Oct 08, 2024 - 22:30 UTC

Resolved

We've found the core issue; everything is alive and happy now.
Posted Oct 08, 2024 - 22:21 UTC

Identified

We're live! Some of the router's own anti-DDoS protection was acting up and causing it to drop ARP packets, hence the CPU spike.
Posted Oct 08, 2024 - 19:53 UTC

Update

LUX MX core router CPU usage before the traffic drop: https://lore.dpaste.org/g/xAmUhO.png
Posted Oct 08, 2024 - 15:00 UTC

Update

As we are nearing the 1-hour mark, here's a quick breakdown (local times are GMT-7):

- At around 6:20 this morning (13:20 UTC), traffic to/from the router stopped routing.

- Router CPU usage increased significantly leading up to this incident.

- We sent the router for a reboot, but it has not yet come back online.

- The router may still be in the midst of trying to reboot (stuck kernel thread, etc.).

- LuxConnect was contacted, but no immediate assistance was available (they're like, "Dave's not here, man"). Here's hoping a tech comes back ASAP.

- Reached out to http://root.lu for support; awaiting a response.

- Asked the data center to pull the power in an effort to resolve the situation.
Posted Oct 08, 2024 - 14:27 UTC

Update

We’ve sent off an email to LuxConnect, since the router reboot obviously didn’t go happily.
Posted Oct 08, 2024 - 13:56 UTC

Update

The current issue appears to be related to the network, but we are still investigating to pinpoint the exact cause. We are trying to determine if the problem is limited to the router level or if it extends further upstream in the network stack.
Posted Oct 08, 2024 - 13:42 UTC

Update

We're waiting on the Juniper MX204 core to reboot; the MX takes exactly 5 minutes to reboot.
Posted Oct 08, 2024 - 13:33 UTC

Update

We're rebooting the core router.
Posted Oct 08, 2024 - 13:30 UTC

Investigating

We are currently investigating this issue.
Posted Oct 08, 2024 - 13:27 UTC
This incident affected: Public Infrastructure (cPanel Enterprise Shared Hosting - Lux, DA Enterprise Shared Hosting - Lux) and Core Infrastructure (Billing Panel).