LCGatUSC.ProblemsFoundAtUSC-LCG2 r1.14 - 30 Aug 2015 - 08:13 - MarcosASeco

-- MarcosASeco - 28 Mar 2007

pbs_mom cannot connect to pbs_server

  • Symptoms:
    • Jobs in the queue are not dispatched even though there are free nodes
    • The following string appears in log files:
      DateStamp lcg-ce PBS_Server: is_request, bad attempt to connect from ip_number:1020 (address not trusted)

  • Solution:
    • pbs_mom should not be running on the CE

Maradona Problem

  • The line "set queue ops resources_default.nodes = 1:lcgpro" was missing from the PBS configuration. It was added by hand with "qmgr".
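
The check for the missing line can be automated. The following is a minimal sketch, assuming the text passed in is a configuration dump saved by hand (for example the output of `qmgr -c "print server"`); the function name has_default_nodes and the sample dump are hypothetical.

```python
def has_default_nodes(qmgr_dump: str, queue: str) -> bool:
    """Return True if the dump sets resources_default.nodes for the queue."""
    wanted = ["set", "queue", queue, "resources_default.nodes"]
    for line in qmgr_dump.splitlines():
        # Each directive has the form "set queue NAME attribute = value".
        key = line.split("=", 1)[0].split()
        if key == wanted:
            return True
    return False


# Hypothetical dump in which the line is missing for the "ops" queue.
sample = """\
create queue ops
set queue ops queue_type = Execution
set queue ops enabled = True
"""
print(has_default_nodes(sample, "ops"))  # prints False
```

If the function returns False for a queue, the line quoted above is a candidate to add with qmgr.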

Jobs not reaching the queue

  • Symptoms:
    • The queue works fine when jobs are submitted directly via qsub.
    • Jobs submitted through the grid never reach the queue.
    • A file named gram_job_mgr_JOBID.log appears in the account of the mapped user.

  • Hints:
    • Check the gram_job_mgr_JOBID.log file. It contains useful information.
    • On one occasion the problem was due to an empty /opt/globus/libexec/globus-script-initializer.

PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)

  • PBS_Server: is_request, bad attempt to connect from 193.144.34.68:1020 (address not trusted)

  • Solution: Add the hostname of the CE and 'localhost' to /var/spool/pbs/server_priv/nodes file
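
To see which addresses the server is rejecting, the log can be scanned for the message shown above. This is a minimal sketch: the regular expression is built only from the two log lines quoted on this page, and the helper names untrusted_addresses and node_names are hypothetical. It does not resolve IP numbers to hostnames, so the comparison with the nodes file is left to the reader.

```python
import re

# Pattern derived from the log lines quoted above.
UNTRUSTED = re.compile(
    r"bad attempt to connect from (\S+):\d+ \(address not trusted\)"
)


def untrusted_addresses(log_text):
    """Collect the distinct rejected addresses, in the order first seen."""
    seen = []
    for match in UNTRUSTED.finditer(log_text):
        addr = match.group(1)
        if addr not in seen:
            seen.append(addr)
    return seen


def node_names(nodes_text):
    """Return the first field of every non-empty, non-comment nodes line."""
    names = set()
    for line in nodes_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split()[0])
    return names


log = """\
PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)
PBS_Server: is_request, bad attempt to connect from 193.144.34.68:1020 (address not trusted)
PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)
"""
print(untrusted_addresses(log))  # prints ['127.0.0.1', '193.144.34.68']
```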

Jobs are kept in the waiting queue

  • This can happen when the PBS components are started in the wrong order.

  • Solution: Restart maui.

GRAM gatekeeper[???]: GSS failed Major:01090000 Minor:00000000 Token:00000003

  • This error occurs when a user cannot be authenticated.

  • Causes:
    • The CRLs are not up to date
    • With the new yaim, authorization to use a queue is checked against the primary group of the user. This breaks with the new pool accounts, because the primary group of sgm/prd users does not match the name of the queue.

  • Solutions:
    • Update the CRLs
    • Add the VO description of the user to the $(VO_NAME)_GROUP_ENABLE variable:
      • LHCB_GROUP_ENABLE="lhcb /VO=lhcb/GROUP=/lhcb/sgm /VO=lhcb/GROUP=/lhcb/lcgprod"
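
For the first cause, stale CRLs can be spotted by their modification time. This is a minimal sketch under several assumptions that are not stated on this page: that the CRLs are stored as files ending in .r0 in a directory such as /etc/grid-security/certificates, that they are refreshed periodically, and that a file older than 24 hours counts as stale. The function name stale_crls is hypothetical.

```python
import time
from pathlib import Path


def stale_crls(cert_dir, max_age_hours=24.0, now=None):
    """Return the names of *.r0 files not modified within max_age_hours."""
    now = time.time() if now is None else now
    limit = max_age_hours * 3600
    stale = []
    for path in sorted(Path(cert_dir).glob("*.r0")):
        # The modification time is used as a proxy for the last refresh.
        if now - path.stat().st_mtime > limit:
            stale.append(path.name)
    return stale
```

Calling stale_crls on the certificate directory of the CE returns a list of file names; an empty list means every CRL file was touched within the chosen window.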

RGMA-host-cert-valid ops check timeout

  • The error happens when the R-GMA server cannot be contacted from machines outside the local network.

  • Causes:
    • The R-GMA ports are closed by the firewall

  • Solution:
    • Open the necessary ports in the firewall
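
Whether a port is reachable can be tested from a machine outside the local network with a plain TCP connection attempt. This is a minimal sketch; the function name port_open is hypothetical, and the host and port numbers to test are not given on this page, so they must be supplied by the reader.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

A False result from outside combined with a True result from inside the local network points to the firewall rather than to the service itself.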

Wrong user mapping

  • The user is not mapped to the right pool of accounts.

  • Causes:
    • An old mapping is still present in /etc/grid-security/gridmapdir/

  • Solution:
    • Remove the wrong mapping, or wipe the directory clean.
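
Existing mappings can be listed before removing anything. This is a minimal sketch that relies on an assumption not stated on this page: that a mapping in the gridmapdir is stored as a hard link between a file named after the pool account and a file named after the URL-encoded DN of the user (starting with "%"). The function name list_mappings is hypothetical.

```python
import os
from urllib.parse import unquote


def list_mappings(gridmapdir):
    """Map each decoded DN to the pool account sharing its inode."""
    by_inode = {}
    for name in os.listdir(gridmapdir):
        inode = os.stat(os.path.join(gridmapdir, name)).st_ino
        by_inode.setdefault(inode, []).append(name)

    mappings = {}
    for names in by_inode.values():
        # Assumption: encoded DN entries start with "%", accounts do not.
        dns = [n for n in names if n.startswith("%")]
        accounts = [n for n in names if not n.startswith("%")]
        for dn in dns:
            for account in accounts:
                mappings[unquote(dn)] = account
    return mappings
```

A DN that appears next to an account from the wrong pool is a candidate for the stale mapping described above.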

Internal server error when querying the voms server

  • An internal server error appears in the logs when the VOMS server is queried through the VOMSCompatibility module.

  • Causes:
    • The client is trying to map a nonexistent ROLE

  • Solution:
    • Reconfigure the systems with the right roles for the VO.

Got a job held event, reason: Unspecified gridmanager error.

  • In the logging info you see the following:
    • Got a job held event, reason: Unspecified gridmanager error.
    • Job got an error while in the CondorG queue.

This can happen for a variety of reasons. Useful information is available at http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error.

One possible reason (USC-LCG2 failure on 14/11/2007) is a full disk on a worker node. We found this out by looking at the mail inbox of the CE pool account in question, in this particular case /var/spool/mail/opssgm004. PBS usually sends its error reports by e-mail.
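
A full disk can also be detected directly on a worker node. This is a minimal sketch using the Python standard library; the function names and the 95% threshold are hypothetical choices, not values taken from the incident above.

```python
import shutil


def disk_fill_ratio(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem at path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path="/", threshold=0.95):
    """Return True if the filesystem at path is filled up to the threshold."""
    return disk_fill_ratio(path) >= threshold


print(f"root filesystem used: {disk_fill_ratio('/'):.0%}")
```

Running the check for each filesystem that jobs write to gives an early warning before jobs start to be held.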

Cannot read JobWrapper output, both from Condor and from Maradona

APEL: cannot service request, client hostname is currently being blocked

  • There are several possible causes:
    • The IP of the monitoring box was changed. The JDK on the server side caches IPs forever, so after such a change the server needs to be restarted.
    • An old version of the R-GMA servlet is in use and causes problems for the service.
  • Solution:
    • In the first case, contact Alastair Duncan or the R-GMA support team and explain the problem, because the server probably needs to be restarted.
    • In the second case, update the R-GMA version.

The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.

  • From the globus toolkit documentation:
    • Verify with the service administrator that your certificate is signed by a certificate authority that is trusted by the service.
  • Solution:
    • Check that the times in the worker nodes are accurate.
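
One plausible reason why the clock matters is that a certificate is only accepted inside its validity window, so a node whose clock runs behind can see a freshly issued credential as not yet valid. This is an illustration of that idea under stated assumptions, not a description of what the Globus code does; the function name certificate_status and the times used are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def certificate_status(not_before, not_after, node_time):
    """Classify a certificate as seen by a node with the given clock."""
    if node_time < not_before:
        return "not yet valid"
    if node_time > not_after:
        return "expired"
    return "valid"


# Hypothetical credential issued at 12:00 UTC with a 12 hour lifetime.
issued = datetime(2007, 11, 14, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=12)

# A node whose clock is 10 minutes slow at the moment of issue.
slow_node = issued - timedelta(minutes=10)
print(certificate_status(issued, expires, slow_node))  # prints not yet valid
```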
