Difference Topic HardwareFailures (r1.22 - 15 Jun 2009 - MarcosASeco)

META TOPICPARENT HardAtUSC
-- MarcosASeco - 14 Mar 2007
| Node | Failure | Date of Failure | Action Taken | Date of Repair |
Line: 32 to 32

| lhcb079.usc.cesga.es | After too many 'Correctable ECC' errors the machine would not reboot because it was unable to find any memory | 15/10/2008 | After unplugging and reseating all the memory modules the problem disappeared | 16/10/2008 |
| lhcb066.usc.cesga.es | Periodically one bank of memory was disabled because too many 'Correctable ECC' errors occurred; after a reboot things returned to normal. The cause of the failures was a faulty DIMM. The faulty DIMM was identified after moving half of the modules to another machine (lhcb064), where the failures then appeared. The failures were reproducible by running memtest long enough | 08/04/2008 | Faulty DIMM replaced | 03/11/2008 |
| lhcb065.usc.cesga.es | On 20/10/2008 the memory of lhcb079 was exchanged with the memory of this machine; after about a week and several 'Correctable ECC' errors the machine would not reboot because it was unable to find any memory | 01/11/2008 | All problems in the Caton machines were related to the power supply. The power supplies were changed on all machines between August and September 2008 | 01/04/2009 |
Changed:
<
<
| lhcb027.usc.cesga.es | faulty disk | 17/04/2009 | | |
>
>
| lhcb027.usc.cesga.es | faulty disk | 17/04/2009 | Disk replaced | 20/04/2009 |

Revision r1.21 - 20 Apr 2009 - 16:17 - MarcosASeco
Revision r1.22 - 15 Jun 2009 - 15:35 - MarcosASeco