Outage: DC-Hosting HSphere Cluster


othelloRob
2nd February 2009, 10:09
Early on the morning of 31/January, one of the dc-hosting routers blew a power supply.

The unit tripped the breaker protecting the electrac powerbar running through the suite, turning off all the machines.

This was not picked up immediately as a major failure, as it coincided with a planned maintenance migration of some of the systems to new kit - which has now been postponed until March.

An engineer went onsite to determine the cause of the outage, and found that the electricity was off.

Power was restored to the area at about 7am and systems started firing up in their sequence.

After the power was restored, the router failure was discovered and a spare unit obtained. This arrived onsite at approx 10.45am and had been fitted and configured by 11.15am.

At this point all servers and services should have been accessible.

Unfortunately, as is the case with any machine which is not cleanly shut down, a number required some manual intervention and disk checks/recovery - specifically win3, win6, web1 and the webmail server.

Numerous reports have flooded in of people having problems seeing their websites or collecting their email - on extensive investigation of dozens of cases, IN THE MAIN these have been issues with the customer's connectivity ISP or the local cache on their machine
- it appears that several ISPs did not correctly pick up on the router replacement, are retaining the "arp" of the defunct router, or are somehow "filtering" the traffic incorrectly
- if you are unable to see both your site and the www.dc-hosting.com site then it *WILL* be a problem with your setup or your ISP and you should contact whomever you pay for your internet connection for assistance.

We are unable to fix PlusNet, Entanet or AOL, so you should address your issues to their support teams.

As at 18.00 31/January the only machines/services still not 100% available are:
* win6, which is crashing with a memory fault; it is being repaired and Windows 2k3 will then be reinstalled
* webmail, as the virtual appliance machine has been unable to repair its disk since the outage.

:edit:
webmail is fixed as at 31/January 23:00
win6 is fixed as at 1/February 03:15

If you believe your site is down, please ensure you have
a. tried a traceroute from a command prompt to your domain and to your IP address - if you can trace to the firewall by IP but not by domain then it is a problem with your own or your ISP's DNS - contact them for assistance
b. checked with your ISP first as they may be having routing problems - for at least 5 hours on Sunday 1/February several major ISPs were unable to see each other due to routing issues
c. cleared your local DNS cache and browser cache and tried again
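
For anyone who would rather script checks a-c than run them by hand, here is a rough sketch of the same reasoning. It is not a dc-hosting tool: the function names, the example domain and the example IP are all made up for illustration, and it swaps the traceroute for a plain TCP connection to port 80, which only tells you whether the server answers, not where along the path the traffic stops.

```python
import socket


def resolve(domain):
    """Return the IPv4 address your local resolver gives for domain, or None."""
    try:
        return socket.gethostbyname(domain)
    except OSError:  # covers DNS lookup failures (socket.gaierror)
        return None


def reachable(ip, port=80, timeout=5.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable or timed out
        return False


def diagnose(resolved_ip, expected_ip, ip_reachable):
    """Map the two observations onto the advice in steps a-c (hypothetical helper)."""
    if not ip_reachable:
        # Step b: the server cannot be reached even by IP address.
        return "routing: cannot reach the server by IP, check with your ISP"
    if resolved_ip != expected_ip:
        # Step a: reachable by IP, but the name is missing or points elsewhere.
        return "dns: name does not resolve to the server, contact your ISP or DNS provider"
    # Step c: name and IP both look fine from here.
    return "ok: clear your local DNS and browser caches and try again"


# Example use (assumed values, replace with your own domain and site IP):
#   ip = "192.0.2.10"
#   print(diagnose(resolve("example.com"), ip, reachable(ip)))
```

If it reports "dns" or "routing", the problem is on the path between you and the servers; put the output in the ticket along with the domain name.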


After that if you still have issues, raise a ticket with all the details and do include the domain name - it's impossible to diagnose faults reported as

my web site has vanished - please fix it NOW !!!!!!!!!

when no other information has been provided :(

Rob