[Resolved] Power failure at Irideos Datacenter

Started on May 6, 2021 at 3:09 AM. Resolved after 8 days

Affected

SeFlow Cloud
  • Identified
    May 6, 2021 at 3:09 AM

    Dear Customers, at around 21:40 Irideos suffered an electrical system failure that caused the sudden shutdown of the entire datacenter 2 at Caldera Campus (Milan DC2), which caused outages in our network. Most of our services were impacted. Electricians restored power at about 22:30, and SeFlow staff were allowed to enter our rooms about one hour later.

    Our on-site inspection found the following:

    • Both core switches of the Cloud's distributed storage showed errors at boot, which prevented VMs from starting
    • One of the Huawei NE8000 border routers is in an alarm state and no longer boots
    • 4 ToR switches no longer boot or no longer power on

    Next steps: our staff is working to restore service as quickly as possible. Current status and estimates:

    • Cloud Infrastructure: 75% restored - ETA: 8 hours

    • ToR Switch: 4% restored - ETA: 12 hours

    • Border Router: 0% restored - ETA: 3 days

    We have decided to give top priority to restoring the cloud and the ToR switches in order to bring all customers back online.

  • Identified
    May 6, 2021 at 4:37 AM

    New Progress:

    • Cloud Infrastructure: 100% Restored - Customers can now boot their VMs

    • Tor Switch: 4% done - eta: 12 Hours

    • Border Router: 0% done - eta 3 days

  • Monitoring
    May 6, 2021 at 11:15 AM

    New Progress:

    • Cloud Infrastructure: 100% Restored

    • Tor Switch: 70% done - new ETA: 3 hours

    • Border Router: 0% done - eta 3 days

  • Monitoring
    May 6, 2021 at 8:53 PM

    New Progress:

    • Cloud Infrastructure: 100% Restored

    • Tor Switch: 100% done

    • Border Router: 0% done - eta 3 days

  • Identified
    May 7, 2021 at 8:14 PM

    We have fixed all server and switch issues and are now working on the network side.

    The broken border router has been replaced; we are now cabling and reconfiguring the new one.

  • Monitoring
    May 8, 2021 at 2:04 AM

    We have started injecting traffic into the new router. Latency should decrease and speed should increase in most locations.

    NTP service has been restored from the time.inrim.it server

    IPv6 connectivity has been restored

  • Identified
    May 8, 2021 at 10:59 AM

    We have reactivated all PNI connections and the Cogent uplink, and have partially reactivated the MIX link.

    We expect to complete the configuration changes by Monday evening.

  • Identified
    May 9, 2021 at 4:25 PM

    We have completed the setup of Minap IX and are working to complete the MIX link.

    After that we will reactivate all AntiDDoS functionality (it is currently working at 70%).

    Network restoration status: 80% done

  • Monitoring
    May 10, 2021 at 10:21 PM

    All network equipment has been repaired. We are now re-integrating the third AntiDDoS cluster into the network and will then close the incident.

  • Resolved
    May 14, 2021 at 9:00 AM

    All components have been restored.