LCGatUSC.LcgInformation r1.4 - 20 Jan 2006 - 17:07 - JuanJSaborido

-- JuanJSaborido - 23 Nov 2005

USC-LCG2


service globus-mds restart

service lcg-bdii restart


To change the LCG_GFAL_INFOSYS environment variable on the worker nodes, I do the following:

  • 1.- Log in to the Quattor server (lhcb01) as lcgadmin.
  • 2.- Open the database as user lcgadmin (from a tmp directory):
    • cdbop
    • open
    • get *

Next I search the templates for LCG_GFAL_INFOSYS and find it here:

pro_lcg2_config_lcgenv.tpl:"/software/components/profile/env/LCG_GFAL_INFOSYS" = BDII_HOST+":"+to_string(BDII_PORT);

It is built from BDII_HOST. I search for that variable and find it in "pro_lcg2_config_site.tpl". I change the corresponding line:

define variable BDII_HOST = "lcg-bdii.cern.ch";

Then, in the database, I run in this order: get, update, commit


Testing the CE and WN:

To test that the site GIIS is working:

ldapsearch -LLL -x -H ldap://lcg-ce.usc.cesga.es:2170 -b "mds-vo-name=usc-lcg2,o=grid"

Now verify that the GRIS on the CE is operating correctly:

ldapsearch -LLL -x -H ldap://lcg-ce.usc.cesga.es:2135 -b "mds-vo-name=local,o=grid"



edg-job-submit --vo lhcb --resource lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-short testJob.jdl

Connecting to host lxn1188.cern.ch, port 7772
Logging to host lxn1188.cern.ch, port 9002

                               JOB SUBMIT OUTCOME
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier
(edg_jobId) is:
 
 - https://lxn1188.cern.ch:9000/NPec0zZBpSjgAu5ta-U2kA


Result:
saborido@fpsunae2:~/LCG/test-UI$ edg-job-status https://lxn1188.cern.ch:9000/NPec0zZBpSjgAu5ta-U2kA

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lxn1188.cern.ch:9000/NPec0zZBpSjgAu5ta-U2kA
Current Status:     Aborted
Status Reason:      Job RetryCount (3) hit
Destination:        lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-short
reached on:         Wed May 25 12:02:57 2005
*************************************************************

Ask Manuel:

In the information published via LDAP I do not see the queue "lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-short" ... is this right?


Another test, this time putting the requirement inside testJob.jdl:



saborido@fpsunae2:~/LCG/test-UI$ edg-job-submit --vo lhcb testJob.jdl

Selected Virtual Organisation name (from --vo option): lhcb
Connecting to host lxn1188.cern.ch, port 7772
Logging to host lxn1188.cern.ch, port 9002


***********************************************************************
                               JOB SUBMIT OUTCOME
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier
(edg_jobId) is:

 - https://lxn1188.cern.ch:9000/7VKmXP8OClYxL-7x7z4fQg


************************************************************************

saborido@fpsunae2:~/LCG/test-UI$ edg-job-status https://lxn1188.cern.ch:9000/7VKmXP8OClYxL-7x7z4fQg


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lxn1188.cern.ch:9000/7VKmXP8OClYxL-7x7z4fQg
Current Status:     Aborted
Status Reason:      Cannot plan: BrokerHelper: no compatible resources
reached on:         Wed May 25 16:54:37 2005
*************************************************************

Another attempt, this time with the test queue:


saborido@fpsunae2:~/LCG/test-UI$ edg-job-submit --vo lhcb -r lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-test testJob.jdl

Selected Virtual Organisation name (from --vo option): lhcb
Connecting to host lxn1188.cern.ch, port 7772
Logging to host lxn1188.cern.ch, port 9002


*********************************************************************************************
                               JOB SUBMIT OUTCOME
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier
(edg_jobId) is:

 - https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ


*********************************************************************************************


saborido@fpsunae2:~/LCG/test-UI$ edg-job-status https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ
Current Status:     Running 
Status Reason:      Job successfully submitted to Globus
Destination:        lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-test
reached on:         Wed May 25 17:20:32 2005
*************************************************************


*************************************************************
BOOKKEEPING INFORMATION:
 
Status info for the Job : https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ
Current Status:     Done (Success)
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-test
reached on:         Wed May 25 17:22:47 2005
*************************************************************

saborido@fpsunae2:~/LCG/test-UI$ edg-job-get-output https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ

Retrieving files from host: lxn1188.cern.ch ( for
https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ )

*********************************************************************************
                        JOB GET OUTPUT OUTCOME

 Output sandbox files for the job:
 - https://lxn1188.cern.ch:9000/gr2uN-kHzCgWPrzJMPoctQ
 have been successfully retrieved and stored in the directory:
 /tmp/jobOutput/saborido_gr2uN-kHzCgWPrzJMPoctQ

*********************************************************************************

Web page listing the GIIS URLs of the sites:

http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf

RPM population in Quattor:

To get the RPMs needed to update the Quattor database, run the following on a machine that has apt-get installed (for example lhcb42 or lcg-ce...):

apt-get dist-upgrade -y --print-uris

Then this (in lcgadmin's home on lhcb01):

lcgadmin@lhcb01:rpm/ ../Scripts/guri.py

Then:

swrep-client put i386_slc3 lcg-CA-0.29-lcg.noarch.rpm /lcg

If there are many of them:

for file in *.rpm; do swrep-client put i386_slc3 "$file" /lcg; done

"Bulk" operations are also possible (see the documentation).

And then:

swrep-client template i386_slc3 > fixes/repository_usc_lcg_i386_slc3.tpl

Then open "cdbop" and run:


cdb> get repository_usc_lcg_i386_slc3.tpl
(replace this file with the other one)
cdb> update repository_usc_lcg_i386_slc3.tpl
cdb> get pro_software_lcg2_service_security_accepted_cas.tpl
(modify this file as needed)
cdb> update pro_software_lcg2_service_security_accepted_cas.tpl
cdb> commit
cdb> close
cdb> exit

And hope it works...

Other useful commands:

ncm-query --dump / | less

ncm_wrapper.sh spma

We ran the latter to trigger the installation of the new packages on the worker nodes. In fact, we did this from fpsunae2:

for a in `seq -w 9 42`; do ssh root@lhcb$a.usc.cesga.es ncm_wrapper.sh spma; done
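
The loop above carries on silently when a node is unreachable or the command fails. A minimal sketch of a variant that records which nodes failed follows. The function names wn_hosts and run_spma_everywhere are hypothetical, and the ssh options BatchMode and ConnectTimeout are an addition that the original command does not use.

```shell
# Print the worker-node host names lhcbNN.usc.cesga.es for NN in [first, last].
# seq -w pads the numbers to equal width (09, 10, ...), as in the loop above.
wn_hosts() {
    for n in $(seq -w "$1" "$2"); do
        echo "lhcb$n.usc.cesga.es"
    done
}

# Run "ncm_wrapper.sh spma" on every node in the range and list the failures.
run_spma_everywhere() {
    failed=""
    for host in $(wn_hosts "$1" "$2"); do
        # BatchMode makes ssh fail instead of waiting at a password prompt.
        if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "root@$host" ncm_wrapper.sh spma; then
            failed="$failed $host"
        fi
    done
    if [ -n "$failed" ]; then
        echo "spma failed on:$failed" >&2
        return 1
    fi
}
```

With these definitions, "run_spma_everywhere 9 42" would cover the same nodes as the loop above and exit non-zero if any of them failed.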

To check whether the GridICE monitoring works, run the following on the CE:

ldapsearch -x -H ldap://lcg-ce.usc.cesga.es:2136 -b 'mds-vo-name=local,o=grid' | grep GlueBatchSystemType

Power failure

During the night of 13/06/2005 there was a power failure at CESGA. On lcg-ce the "lcg-bdii" service does not start by itself, so I start it by hand. There also seems to be a coupling between the lcg-bdii and iptables services. In fact, the "lcg-bdii" service seems to start "iptables" as well. The latter does not start properly on its own.

service lcg-bdii restart

And then check:

ldapsearch -x -H ldap://lcg-ce.usc.cesga.es:2170 -b mds-vo-name=usc-lcg2,o=grid | grep -i free

Important: the service is now called bdii, not lcg-bdii.

Wiki with FAQs about monitoring failures at LCG sites:

http://goc.grid.sinica.edu.tw/gocwiki/FrontPage

The problem:


saborido@lcg-ce:~/tests$ edg-job-submit -r lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-short testJob.jdl

**** Error: UI_NO_VO_CONF_INFO ****
Unable to find configuration information for VO "lhcb"

**** Error: UI_NO_VOMS ****  
Unable to determine a valid user's VO

I fixed it by copying the file "/opt/edg/etc/lhcb/edg_wl_ui.conf", which I had on fpsunae2, to lcg-ce:

scp -r lhcb root@lcg-ce.usc.cesga.es:/opt/edg/etc/lhcb

INSTALLING QUATTOR 1.1.X AT LHCB03.USC.CESGA.ES:


saborido@lhcb03:~$ swrep-client addplatform i386_slc3
Scientific Linux CERN Release 3.0.5 (SL)
Platform i386_slc3 successfully added

saborido@lhcb03:~$ swrep-client addarea i386_slc3 /lcg
Scientific Linux CERN Release 3.0.5 (SL)
Area /lcg successfully created in platform i386_slc3

saborido@lhcb03:~$ swrep-client addarea i386_slc3 /quattor
Scientific Linux CERN Release 3.0.5 (SL)
Area /quattor successfully created in platform i386_slc3

saborido@lhcb03:~$ swrep-client addarea i386_slc3 /updates
Scientific Linux CERN Release 3.0.5 (SL)
Area /updates successfully created in platform i386_slc3

saborido@lhcb03:~$ swrep-client addarea i386_slc3 /base
Scientific Linux CERN Release 3.0.5 (SL)
Area /base successfully created in platform i386_slc3

platform: i386_slc3
areas:    /lcg    /quattor    /updates    /base

How to close the site so it won't receive any more jobs from the RBs

If you want to stop the RB from sending you jobs (for example, because you want to update your CE), there is an attribute in the LDIF schema which the RB consults to check the availability of your site. This page explains how to publish a closed status for your farm through the information system.

The right place:

The attribute GlueCEStateStatus can take several values that the RB looks at. The possible values are:

  • Queueing: the queue can accept job submissions, but can't be served by the scheduler
  • Production: the queue can accept job submissions and is served by a scheduler
  • Closed: the queue can't accept job submissions and can't be served by a scheduler
  • Draining: the queue can't accept job submissions, but can be served by a scheduler

This attribute is published under the DN GlueCEUniqueID=hostname..., and one such DN exists for each queue.
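
The four values can be summarised as a small lookup. The sketch below only restates the list above in shell form; the function name queue_state is hypothetical.

```shell
# Map a GlueCEStateStatus value to what it means for a queue:
# "accepts" = new job submissions are accepted,
# "served"  = the scheduler still serves the queue.
queue_state() {
    case "$1" in
        Production) echo "accepts=yes served=yes" ;;
        Queueing)   echo "accepts=yes served=no" ;;
        Draining)   echo "accepts=no served=yes" ;;
        Closed)     echo "accepts=no served=no" ;;
        *)          echo "unknown status: $1" >&2; return 1 ;;
    esac
}
```

For example, "queue_state Draining" prints "accepts=no served=yes": jobs already queued still run, but no new ones come in.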

Now we are going to change the value of this attribute.

Changes:

Edit /opt/lcg/var/gip/lcg-info-generic.conf and find the line with the right DN. If the attribute does not already exist there, add the line:

GlueCEStateStatus: Closed

to close your site.

Otherwise, you only have to change the value of this attribute. Be careful to remove any trailing space at the end of the line. Do this for each queue you need to change; you should find a DN for each of them.

Applying the changes

The following command should do it:

/opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/gip/lcg-info-generic.conf

Don't forget that, if you're using a BDII as GIIS, you have to wait until the BDII refreshes itself or refresh it manually.

Rollback

If you want to remove the closed status of your site, simply remove the line you added or change the value at will.
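
To confirm what is actually being published after a change or a rollback, the status can be read back from the information system. The helper below is only a sketch: queue_status_from_ldif is a hypothetical name, and it assumes that each queue entry has a DN beginning with GlueCEUniqueID= and that the queue name is not split across folded LDIF lines.

```shell
# Read LDIF on stdin and print "<queue> <status>" for each entry whose DN
# starts with GlueCEUniqueID= and that carries a GlueCEStateStatus attribute.
queue_status_from_ldif() {
    awk '
        { lower = tolower($0) }
        lower ~ /^dn: glueceuniqueid=/ {
            id = substr($0, 20)        # text after "dn: GlueCEUniqueID="
            sub(/,.*/, "", id)         # drop the rest of the DN
            next
        }
        lower ~ /^dn: / { id = ""; next }   # some other kind of entry
        lower ~ /^gluecestatestatus: / && id != "" {
            print id, substr($0, 20)   # text after "GlueCEStateStatus: "
        }
    '
}
```

It could be fed from the site GIIS query used elsewhere on this page, for instance: ldapsearch -LLL -x -H ldap://lcg-ce.usc.cesga.es:2170 -b "mds-vo-name=usc-lcg2,o=grid" '(GlueCEStateStatus=*)' GlueCEStateStatus | queue_status_from_ldif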


I removed the RPM ("rpm -e edg-rgma-gin-4.0.6-1") on lhcb03, because the notorious hourly cron job was enabled and was causing the following error:

/etc/cron.hourly/edg-rgma-gin-monitor.cron: chown: `rgma': invalid user

About certificates:

You need to run "service globus-gatekeeper restart" on the CE after installing the new certificate; otherwise the test

openssl s_client -ssl3 -connect lcg-ce.usc.cesga.es:2119 | openssl x509 -noout -dates

does not return the correct dates... presumably because the old certificate is still loaded somewhere.

By the way, on the SE the globus-gatekeeper service was stopped... I have just started it... (?)

To convert a PEM certificate to PKCS#12 (p12) so that it can be imported into a web browser:

openssl pkcs12 -export -in usercert.pem -inkey userkey.pem -out usercert.p12

Accounting on the MON:

Short recipes for modifying the accounting database on the MON:

mysql -u root -p accounting
show tables;
select * from RepublishInfo; (to see what is there)
update RepublishInfo set MeasurementTime='01:00:00';
update RepublishInfo set MeasurementDate='2005-11-09';

CPU Normalization results:

lhcb10: PIV Xeon 2.4 GHz:

     [root@lhcb10 cpu_normalization_standalone_test]# ./get_spec_int.pl

     RESULTS:
       Current SpecInt = 1078
 ===>  Proposed SpecInt = 980
       Current cputmult = 0.855288
       Proposed cputmult = 0.90988085106383
       Current wallmult = 0.855288
       Proposed wallmult = 0.90988085106383

     SUMMARY:
       * The cputmult factor seems to be OK
       * The wallmult factor seems to be OK

lhcb15: PIV Xeon 2.66 GHz:

     [root@lhcb15 cpu_normalization_standalone_test]# ./get_spec_int.pl

     RESULTS:
       Current SpecInt = 1078
 ===>  Proposed SpecInt = 1098
       Current cputmult = 0.876623
       Proposed cputmult = 1.01932906976744
       Current wallmult = 0.876623
       Proposed wallmult = 1.01932906976744

     SUMMARY:
       * The cputmult factor seems to be OK
       * The wallmult factor seems to be OK

lhcb20: PIV Xeon 2.8 GHz:

     [root@lhcb20 cpu_normalization_standalone_test]# ./get_spec_int.pl

     RESULTS:
       Current SpecInt = 1078
 ===>  Proposed SpecInt = 1295
       Current cputmult = 1.08163
       Proposed cputmult = 1.20181111111111
       Current wallmult = 1.08163
       Proposed wallmult = 1.20181111111111

     SUMMARY:
       * The cputmult factor seems to be OK
       * The wallmult factor seems to be OK


Taking the weighted average over the 74 CPUs we have gives:

(14*980.+14*1098.+46*1295) / 74. = 1198.1351351351352

New version of cpu_normalization_standalone_test:

=========
lhcb10: PIV Xeon 2.4 GHz:
       [root@lhcb10 cpu_normalization_standalone_test]# ./get_spec_int.pl

       RESULTS:
         Current SpecInt = 1198
         Proposed SpecInt = 1039  <========= ? 2.4 GHz
         Current SpecFloat = 800
         Proposed SpecFloat = 951
         Current cputmult = 0.81803
         Proposed cputmult = 0.87
         Current wallmult = 0.81803
         Proposed wallmult = 0.87

=========
lhcb15: PIV Xeon 2.66 GHz:
       [root@lhcb15 cpu_normalization_standalone_test]# ./get_spec_int.pl

       RESULTS:
         Current SpecInt = 1198
         Proposed SpecInt = 945  <========= ? 2.6 GHz
         Current SpecFloat = 800
         Proposed SpecFloat = 898
         Current cputmult = 0.916528
         Proposed cputmult = 0.79
         Current wallmult = 0.916528
         Proposed wallmult = 0.79

=========
lhcb50: PIV Xeon 2.8 GHz:
       [root@lhcb50 cpu_normalization_standalone_test]# ./get_spec_int.pl

       RESULTS:
         Current SpecInt = 1198
         Proposed SpecInt = 1286
         Current SpecFloat = 800
         Proposed SpecFloat = 1496
         Current cputmult = 1.08097
         Proposed cputmult = 1.07
         Current wallmult = 1.08097
         Proposed wallmult = 1.07

Running it on lhcb20 I get the same. The results change slightly with respect
to the previous version of the test, but I will change the SpecInt values when
the update to LCG-2_7_0 is done.

=====

With the old numbers we have:

(14*980. + 14*1098. + 68*1295) / 96. = 1220 SI2K average
(14*980. + 14*1098. + 68*1295) = 117152 SI2K total = 117 kSI2K

With the new numbers we have:

(14*1039. + 14*945. + 68*1286) / 96. = 1200 SI2K average
(14*1039. + 14*945. + 68*1286) = 115224 SI2K total = 115 kSI2K
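
The averages and totals above can be recomputed with a few lines of shell, which is handy when the number of nodes changes. This is only a sketch: si2k_summary is a hypothetical helper that takes COUNT:SPECINT pairs and prints the rounded weighted average followed by the total.

```shell
# Each argument is COUNT:SPECINT (number of CPUs : SpecInt per CPU).
# Prints "<average SI2K, rounded> <total SI2K>".
si2k_summary() {
    printf '%s\n' "$@" | awk -F: '
        { cpus += $1; total += $1 * $2 }
        END { if (cpus > 0) printf "%.0f %d\n", total / cpus, total }
    '
}

si2k_summary 14:980 14:1098 68:1295    # old numbers, 96 CPUs
si2k_summary 14:1039 14:945 68:1286    # new numbers, 96 CPUs
```

The two calls print "1220 117152" and "1200 115224", matching the figures above.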

Problems with RGMA.

To switch off RGMA, go to the MON BOX (usually the SE in small sites) and do the following:

/sbin/service tomcat5 stop

Serial numbers and MAC addresses of the 2 new machines:

  • HR8J12J 00:14:22:B0:B5:98 00:14:22:B0:B5:99 lhcb54.usc.cesga.es
  • GR8J12J 00:14:22:B0:B5:95 00:14:22:B0:B5:96 lhcb55.usc.cesga.es
