Category: Operations

Information about changes or proposals that affect the way our project operates.

[OPS] Updating Certs…

Most of our team has been off enjoying summer activities and
tending to their other projects. As a result, our SSL certs expired
before we realized it was that time again.

Over the next few days, as time permits, Skylar will push updated
certs out across the network to get everything back up and
running.
--
Update:

  • The main website and our email server should now have updated
    SSL certs and should no longer be throwing errors.
  • The hypervisor had its certs updated and services reloaded.
  • The support desk had its certs and OS updated and was rebooted.
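
Since the certs lapsed before anyone noticed, a small scheduled check could flag them ahead of time. The following is only a sketch using Python's standard library; the host names in the usage note and the 21-day warning window are placeholders, not our real inventory or policy.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a cert's notAfter value."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate served at host:port.

    The default context verifies the chain, so a cert that has already
    expired raises ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_hosts(hosts, warn_days: int = 21) -> None:
    """Print one status line per host; warn_days is an arbitrary threshold."""
    now = datetime.now(timezone.utc)
    for host in hosts:
        try:
            remaining = days_until_expiry(fetch_not_after(host), now)
        except OSError as exc:  # ssl.SSLError is a subclass of OSError
            print(f"{host}: check failed ({exc})")
            continue
        status = "RENEW SOON" if remaining <= warn_days else "ok"
        print(f"{host}: {remaining} days left [{status}]")
```

Run from cron or a systemd timer, something like `check_hosts(["www.example.org", "mail.example.org"])` would give a warning well before the next expiry. Services that do not speak TLS on port 443 would need their own port passed to `fetch_not_after`.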

Stable again… we think

After a project member passed along some information about what
may be causing our server to crash, we have opted to limit CPU
C-states at boot to see whether our main server stays alive with
everything up and running.

So far, we appear to be stable once again.
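
To confirm that a C-state limit actually took effect after a reboot, the kernel command line can be inspected. This is a minimal sketch; the post does not record which boot parameter was used, so the three names below are simply the standard Linux kernel parameters that restrict idle states, listed here as an assumption.

```python
from pathlib import Path

# Standard Linux kernel parameters that restrict CPU idle states.
# Which of these (if any) is set on our host is an assumption, not
# something recorded in the post.
CSTATE_PARAMS = ("processor.max_cstate", "intel_idle.max_cstate", "idle")


def cstate_limits(cmdline: str) -> dict:
    """Extract C-state related name=value parameters from a kernel command line."""
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in CSTATE_PARAMS:
            found[name] = value
    return found


def current_limits(path: str = "/proc/cmdline") -> dict:
    """Read the running kernel's command line and report any C-state limits."""
    return cstate_limits(Path(path).read_text())
```

An empty result from `current_limits()` would mean none of these parameters reached the running kernel, for example because the bootloader configuration was edited but not regenerated.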

Unexpected Reboot

UPDATE Two:
Combing through the physical host logs line by line turned up a
set of errors that might be concerning for our operations, but we
aren’t sure just yet.

Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: Machine check events logged
Apr 25 12:04:34 adrian kernel: microcode: CPU12: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU13: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU14: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU15: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffa5e675c2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1714061070 SOCKET 0 APIC 9 microcode 8701021
Apr 25 12:04:34 adrian kernel: microcode: CPU0: patch_level=0x08701021

We are continuing to check on the machine and watching for problems
that may be affecting us.
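
For anyone who wants to pick these entries out of a longer log, the machine check line can be parsed and its status word partly decoded. This is a rough sketch: the regular expression is written against the exact line format shown above, and only the upper status bits (63 down to 57), which share the same meaning in the Intel and AMD machine-check documentation, are decoded. A full vendor-specific decode of the bank and error code is out of scope here.

```python
import re

# Matches the "Machine Check" line format seen in the log excerpt above.
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)

# Upper bits of the machine-check status register, highest bit first.
STATUS_FLAGS = (
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
)


def decode_status(status: int) -> list:
    """Names of the flag bits set in a machine-check status value."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]


def parse_mce(line: str):
    """Return cpu, bank, status, and flags for an MCE line, else None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match["status"], 16)
    return {
        "cpu": int(match["cpu"]),
        "bank": int(match["bank"]),
        "status": status,
        "flags": decode_status(status),
    }
```

Fed the Bank 5 line from the excerpt, `parse_mce` reports CPU 12, bank 5, and the flags VAL, UC, EN, MISCV, ADDRV, and PCC. UC and PCC together indicate an uncorrected error, which is consistent with the host rebooting, though it does not by itself identify the faulty component.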

UPDATE One:
We got in touch with the data center, and they told us:
“As these are unmanaged servers, we do not actively monitor
customer services, so we do not know when or why a services
goes offline.” We understand that; the operation of the
hardware is on us. From the looks of it, though, this is
something outside of our project.

Our team will continue to work on restoring network services
and will monitor the system for the next few hours to see
if any issues arise.

Our team received alarms around 12:14 EST this afternoon
indicating that large portions of our network had gone offline
without warning. After logging into our management consoles
and looking things over, it appears that our physical host
“Adrian” rebooted.

We have reached out to the data center to see what may have
happened and our team is in the process of restoring our
network functionality.