Difference Topic HardwareFailures (r1.23 - 10 Feb 2011 - MarcosASeco)

-- MarcosASeco - 14 Mar 2007
| Node | Failure | Date of Failure | Action Taken | Date of Repair |
Line: 33 to 33

| lhcb066.usc.cesga.es | Periodically one bank of memory was disabled because too many 'Correctable ECC' errors occurred; after a reboot things returned to normal. The cause of the failures was a faulty DIMM. The faulty DIMM was identified after moving half of the modules to another machine (lhcb064), where the failures then appeared. The failures were reproducible by running memtest long enough. | 08/04/2008 | Faulty DIMM replaced | 03/11/2008 |
| lhcb065.usc.cesga.es | On 20/10/2008 the memory of lhcb079 was exchanged with the memory of this machine. After about a week and several 'Correctable ECC' errors, the machine would not reboot because it was unable to find any memory. | 01/11/2008 | All problems in the Caton machines were related to the power supply. The power supplies were changed on all machines between August and September 2008. | 01/04/2009 |
| lhcb027.usc.cesga.es | Faulty disk | 17/04/2009 | Disk replaced | 20/04/2009 |
| nodo077.inv.usc.es | Faulty disk | 01/02/2011 | Disk swapped with the one from nodo025 | 03/02/2011 |
| nodo069.inv.usc.es | Faulty disk | 01/02/2011 | Disk swapped with the one from nodo026 | 03/02/2011 |
| nodo065.inv.usc.es | Faulty motherboard and power supply | 01/02/2011 | Motherboard and power supply swapped with those from nodo025 | 03/02/2011 |
| nodo109.inv.usc.es | Faulty disk | 07/02/2011 | Disk replaced | 10/02/2011 |
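
The 'Correctable ECC' errors logged for lhcb066 and lhcb065 can, on many Linux hosts, be watched before a memory bank is disabled. The following is only a minimal sketch, not something taken from this log: it assumes the kernel EDAC driver is loaded and exposes per-controller counters under `/sys/devices/system/edac/mc/mc*/ce_count`. The function names `read_ce_counts` and `over_threshold`, and any threshold value, are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: correctable_error_count} for each mc* directory.

    Assumes the Linux EDAC sysfs layout; returns an empty dict if it is absent.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux host
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            counts[mc.name] = int((mc / "ce_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable: skip this controller
    return counts


def over_threshold(counts, threshold):
    """Keep only the controllers whose count exceeds the (hypothetical) threshold."""
    return {name: n for name, n in counts.items() if n > threshold}


if __name__ == "__main__":
    for name, n in read_ce_counts().items():
        print(f"{name}: {n} correctable errors")
```

A rising count on one controller would be a reason to run memtest on that node, as was done for lhcb066. Which DIMM is at fault still has to be found by other means, such as moving modules between machines.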

Revision r1.22 - 15 Jun 2009 - 15:35 - MarcosASeco
Revision r1.23 - 10 Feb 2011 - 17:32 - MarcosASeco