Author: Skylar.W

Crashing again…

05/15 18:40EST:
We are still working with the data center to find the cause and have not
made much headway so far. For the time being, if you are a project member
and you notice that routing has suddenly stopped working or that you cannot
reach our game services, please reach out to Skylar and she will fix things
as quickly as she can.

The data center techs have swapped out the memory in our server. Our team will
be keeping an eye on the server logs for the next few days and varying the
load to see if the fault crops up again. As of now, all services should be
operational again.


I am awake and up for the day. A message has been sent to the data center
about this issue. We are waiting to hear what they say and what our options
are before our team investigates any further.


I will reach out to the data center once I have woken up for the
day, but it appears we are crashing once again without warning.

May 15 00:46:24 adrian kernel: microcode: Current revision: 0x08701021
May 15 00:46:24 adrian kernel: mce: [Hardware Error]: Machine check events logged
May 15 00:46:24 adrian kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
May 15 00:46:24 adrian kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffa3138e02 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 15 00:46:24 adrian kernel: mce: [Hardware Error]: PROCESSOR 2:8
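
For anyone following along with the log above, here is a minimal sketch of how
the status value on the "Machine Check" line could be split into its flag bits.
It assumes the architectural x86 MCi_STATUS layout for bits 57 through 63 and
treats the low 16 bits as the error code. The function name decode_mce_line is
made up for this example, and the model-specific fields are deliberately left
undecoded.

```python
import re

# Flag bits in the top byte of an MCi_STATUS value (assumed architectural layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}

# Matches the "CPU n: Machine Check: n Bank n: <16 hex digits>" part of the line.
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_mce_line(line):
    """Hypothetical helper: return CPU, bank, set flags and error code, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(3), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "flags": flags,
        "error_code": status & 0xFFFF,  # low 16 bits; meaning is model-specific
    }


line = (
    "May 15 00:46:24 adrian kernel: mce: [Hardware Error]: "
    "CPU 12: Machine Check: 0 Bank 5: bea0000000000108"
)
print(decode_mce_line(line))
```

Under those assumptions, the value from our log has UC and PCC set, which
would point to an uncorrected error and would be consistent with the host
rebooting without warning.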

Outages and whatnot…

We had an extended outage today that started around 0400 EST and lasted
until around 1500 EST, caused by either a kernel bug or a hardware fault
that made the primary physical host reboot without warning. When these
reboots happen, one of our volunteer staff has to log into the management
console and restart some services on our routing gear because of a bug
in those VMs, which means routing is lost until someone wakes up to deal
with it. That is not good for anyone relying on our network for transit
or hosting.

We reached out to the data center and they tested both the memory and
the CPU in our “Adrian” host for faults, and both came back good. So,
on a hunch, we updated the board firmware to a newer release and will
be monitoring Adrian very closely for the next few days to see if this
fixes the issue.

Operational Again

Our team has gotten the network operational once again
and will continue to keep an eye on things for the next
few hours. In the event that we experience downtime
again, project members are encouraged to keep up with
what our team is doing over on our status page.