Difference Topic ProblemsFoundAtUSC-LCG2 (r1.14 - 30 Aug 2015 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 80 to 80

  • An internal server error appears in the logs when querying the VOMS server through the VOMSCompatibility module

  • Causes:
Changed:
<
<
    • The query asks for a nonexistent ROLE
>
>
    • The client is trying to map a nonexistent ROLE

  • Solution:
    • Reconfigure the systems with the right roles for the VO.
Line: 110 to 110

    • In the second case update the r-gma version.
    • Contact Alastair Duncan or the r-gma support team and explain the problem, because the server probably needs to be restarted.
Added:
>
>

The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.

  • From the globus toolkit documentation:
    • Verify with the service administrator that your certificate is signed by a certificate authority that is trusted by the service.
  • Solution:
    • Check that the clocks on the worker nodes are accurate.
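
A minimal sketch of the clock check, assuming the current epoch time of each worker node has already been collected by some other means (for example by running `date +%s` on each node, not shown here). The node names and the 300 second tolerance are illustrative assumptions, not values taken from the site configuration.

```python
from typing import Dict, List


def nodes_with_clock_skew(
    node_times: Dict[str, float],
    reference_time: float,
    tolerance_seconds: float = 300.0,  # assumed tolerance, adjust as needed
) -> List[str]:
    """Return the nodes whose clock differs from the reference by more than the tolerance."""
    return sorted(
        node
        for node, node_time in node_times.items()
        if abs(node_time - reference_time) > tolerance_seconds
    )
```

Any node returned by this function is a candidate for the handshake failure and should have its time synchronization checked first.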
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.13 - 02 Apr 2008 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 101 to 101

Added:
>
>

APEL: cannot service request, client hostname is currently being blocked

  • There are several possible causes:
    • The IP of the monitoring box was changed. This happens because the JDK on the server side caches IP addresses forever, so if an address changes the server needs to be restarted.
    • An old version of the r-gma servlet is in use and causes problems for the service.
  • Solution:
    • In the second case update the r-gma version.
    • Contact Alastair Duncan or the r-gma support team and explain the problem, because the server probably needs to be restarted.
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.12 - 21 Mar 2008 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 27 to 27

  • Hints:
    • Check the gram_job... file. It contains useful information.
Changed:
<
<
    • Once the problem was due to an empty /opt/globus/libexec/globus-script-initialize.
>
>
    • Once the problem was due to an empty /opt/globus/libexec/globus-script-initializer.

PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)

Line: 89 to 89

  • In the logging info you see the following:
    • Got a job held event, reason: Unspecified gridmanager error.
Changed:
<
<
    • Job got an error while in the CondorG? queue.
>
>
    • Job got an error while in the CondorG queue.

This error can have a variety of causes. Useful information is available at http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error.

One of the possible reasons (USC-LCG2 failure on 14/11/2007) is a full disk in a worker node. We found this out by looking at the mail inbox of the CE pool account involved, in this particular case /var/spool/mail/opssgm004. PBS usually sends error reports by e-mail.

Added:
>
>

Cannot read JobWrapper output, both from Condor and from Maradona

Difference Topic ProblemsFoundAtUSC-LCG2 (r1.11 - 29 Feb 2008 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 11 to 11

    • The following string appears in log files:
      DateStamp lcg-ce PBS_Server: is_request, bad attempt to connect from ip_number:1020 (address not trusted)
Changed:
<
<
  • Solution ??:
    • Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file
>
>
  • Solution:
    • pbs-mom should not be running on the CE
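
The log line quoted above can be searched for mechanically. The following is a sketch only: the regular expression is built from the message format shown on this page, and the function name is hypothetical. It lists the distinct addresses that pbs_server rejected, so that a rejected address belonging to the CE itself (such as 127.0.0.1) stands out, which would be consistent with a pbs_mom running on the CE.

```python
import re
from typing import Iterable, List

# Pattern derived from the log message quoted on this page.
UNTRUSTED = re.compile(
    r"PBS_Server: is_request, bad attempt to connect from "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+) \(address not trusted\)"
)


def untrusted_sources(lines: Iterable[str]) -> List[str]:
    """Return the distinct rejected IP addresses, in order of first appearance."""
    seen: List[str] = []
    for line in lines:
        match = UNTRUSTED.search(line)
        if match and match.group("ip") not in seen:
            seen.append(match.group("ip"))
    return seen
```

The function accepts any iterable of strings, so an open log file can be passed to it directly.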

Maradona Problem

Difference Topic ProblemsFoundAtUSC-LCG2 (r1.10 - 14 Nov 2007 - JuanJSaborido)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Deleted:
<
<

pbs_mom can not connect to pbs_server


%RENDERLIST{ depth="1" }%
Added:
>
>

pbs_mom can not connect to pbs_server


  • Symptoms:
    • Jobs in the queue are not submitted even if there are empty nodes
    • The following string appears in log files:
Line: 85 to 85

  • Solution:
    • Reconfigure the systems with the right roles for the VO.
Added:
>
>

Got a job held event, reason: Unspecified gridmanager error.

  • In the logging info you see the following:
    • Got a job held event, reason: Unspecified gridmanager error.
    • Job got an error while in the CondorG? queue.

This error can have a variety of causes. Useful information is available at http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error.

One of the possible reasons (USC-LCG2 failure on 14/11/2007) is a full disk in a worker node. We found this out by looking at the mail inbox of the CE pool account involved, in this particular case /var/spool/mail/opssgm004. PBS usually sends error reports by e-mail.
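
Since a full disk on a worker node was one confirmed cause, a quick check of disk usage can rule it in or out. This is a minimal sketch to be run on the worker node itself; the function name and the 95% threshold are assumptions chosen for illustration.

```python
import shutil
from typing import Iterable, List, Tuple


def nearly_full(paths: Iterable[str], threshold: float = 0.95) -> List[Tuple[str, float]]:
    """Return (path, used_fraction) for each filesystem at or above the threshold."""
    result = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # skip pseudo filesystems that report no capacity
        fraction = usage.used / usage.total
        if fraction >= threshold:
            result.append((path, fraction))
    return result
```

A call such as `nearly_full(["/", "/var", "/tmp"])` covers the usual suspects; which mount points matter depends on where the batch system spools job files on the node.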

Difference Topic ProblemsFoundAtUSC-LCG2 (r1.9 - 23 Oct 2007 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 65 to 65

  • Solution:
    • Open the necessary ports in the firewall
Added:
>
>

Wrong user mapping

  • The user is not mapped to the right pool of accounts

  • Causes:
    • An old mapping is present in /etc/grid-security/gridmapdir/

  • Solution:
    • Remove the wrong mapping, or wipe the directory clean.
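
To find an old mapping before removing it, the directory can be inspected. The sketch below assumes the usual gridmapdir convention, in which a lease is a hard link between a pool account file (for example ops001) and a file whose name is the URL-encoded DN of the user (starting with "%"). That convention and the function name are assumptions, not something stated on this page.

```python
import os
from collections import defaultdict
from typing import Dict, List


def current_mappings(gridmapdir: str) -> Dict[str, List[str]]:
    """Map each leased pool account to the encoded DN names sharing its inode."""
    by_inode = defaultdict(list)
    for name in os.listdir(gridmapdir):
        path = os.path.join(gridmapdir, name)
        if os.path.isfile(path):
            by_inode[os.stat(path).st_ino].append(name)

    mappings: Dict[str, List[str]] = {}
    for names in by_inode.values():
        # Assumed convention: encoded DN files start with "%", accounts do not.
        accounts = sorted(n for n in names if not n.startswith("%"))
        encoded_dns = sorted(n for n in names if n.startswith("%"))
        if accounts and encoded_dns:
            for account in accounts:
                mappings[account] = encoded_dns
    return mappings
```

Calling `current_mappings("/etc/grid-security/gridmapdir")` lists the leased accounts, so that a user mapped to the wrong pool can be spotted before anything is deleted.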

Internal server error when querying the voms server

  • An internal server error appears in the logs when querying the VOMS server through the VOMSCompatibility module

  • Causes:
    • The query asks for a nonexistent ROLE

  • Solution:
    • Reconfigure the systems with the right roles for the VO.
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.8 - 21 Sep 2007 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 55 to 55

    • Add the VO description of the user to the $(VO_NAME)_GROUP_ENABLE:
      • LHCB_GROUP_ENABLE="lhcb /VO=lhcb/GROUP=/lhcb/sgm /VO=lhcb/GROUP=/lhcb/lcgprod"
Added:
>
>

RGMA-host-cert-valid ops check timeout

  • The error happens when the rgma server cannot be contacted from computers outside the local network.

  • Causes:
    • RGMA ports closed by the firewall

  • Solution:
    • Open the necessary ports in the firewall
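
Whether a port is reachable can be tested with a plain TCP connection attempt. This is a minimal sketch; the function name is hypothetical, and the host and port to test are left as parameters because this page does not list the RGMA ports. To be meaningful the check has to be run from a machine outside the local network.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A False result from outside combined with a True result from inside the local network points at the firewall rather than at the service.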
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.7 - 27 Jul 2007 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 35 to 35

  • Solution: Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file
Changed:
<
<

Jobs are keeped in the waiting queue

>
>

Jobs are kept in the waiting queue


  • This can be due to the PBS components being initialized in the wrong order.

  • Solution: Restart maui.
Added:
>
>

GRAM gatekeeper[???]: GSS failed Major:01090000 Minor:00000000 Token:00000003

  • This error occurs when a user could not be authenticated

  • Causes:
    • CRLs are not up to date
    • With the new yaim, queue authorization checks the primary group of the user. This is a problem with the new pool accounts, because the primary group of sgm/prd users does not match the name of the queue.

  • Solutions:
    • Update the CRLs
    • Add the VO description of the user to the $(VO_NAME)_GROUP_ENABLE:
      • LHCB_GROUP_ENABLE="lhcb /VO=lhcb/GROUP=/lhcb/sgm /VO=lhcb/GROUP=/lhcb/lcgprod"
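
For the first cause, stale CRLs can be detected by file age. The sketch below assumes that CRLs are stored as `*.r0` files in a certificates directory (commonly /etc/grid-security/certificates) and that the modification time reflects the last successful update. Both points, the function name and the 24 hour limit are assumptions.

```python
import glob
import os
import time
from typing import List, Optional


def stale_crls(cert_dir: str, max_age_hours: float = 24.0,
               now: Optional[float] = None) -> List[str]:
    """Return the CRL files in cert_dir last modified more than max_age_hours ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        path
        for path in glob.glob(os.path.join(cert_dir, "*.r0"))
        if os.path.getmtime(path) < cutoff
    )
```

An empty result does not prove the CRLs are valid, only that they were refreshed recently; a non-empty result suggests the CRL update job is not running.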
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.6 - 28 Jun 2007 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007

pbs_mom can not connect to pbs_server

%RENDERLIST{ depth="1" }%

Changed:
<
<
  • symptoms:
>
>

  • Symptoms:

    • Jobs in the queue are not submitted even if there are empty nodes
    • The following string appears in log files:
      DateStamp lcg-ce PBS_Server: is_request, bad attempt to connect from ip_number:1020 (address not trusted)
Changed:
<
<
  • solution ??:
>
>

  • Solution ??:

    • Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file

Maradona Problem

Line: 16 to 19

  • The line "set queue ops resources_default.nodes = 1:lcgpro" was missing from the PBS configuration. It was added by hand with "qmgr".

Jobs not reaching the queue

Changed:
<
<
  • symptoms:
>
>

  • Symptoms:

    • The queue works fine when submitting jobs via qsub.
    • No jobs reach the queue.
    • A file named gram_job_mgr_JOBID.log appears in the account of the mapped user.
Line: 27 to 32

PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)

  • PBS_Server: is_request, bad attempt to connect from 193.144.34.68:1020 (address not trusted)
Added:
>
>


  • Solution: Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file
Added:
>
>

Jobs are keeped in the waiting queue

  • This can be due to the PBS components being initialized in the wrong order.

  • Solution: Restart maui.
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.5 - 21 Jun 2007 - JuanJSaborido)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 24 to 24

    • Check the gram_job... file. It contains useful information.
    • Once the problem was due to an empty /opt/globus/libexec/globus-script-initialize.
Added:
>
>

PBS_Server: is_request, bad attempt to connect from 127.0.0.1:1020 (address not trusted)

  • PBS_Server: is_request, bad attempt to connect from 193.144.34.68:1020 (address not trusted)
  • Solution: Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.4 - 20 Jun 2007 - MarcosASeco)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 15 to 15

Maradona Problem

  • The line "set queue ops resources_default.nodes = 1:lcgpro" was missing from the PBS configuration. It was added by hand with "qmgr".
Added:
>
>

Jobs not reaching the queue

  • symptoms:
    • The queue works fine when submitting jobs via qsub.
    • No jobs reach the queue.
    • A file named gram_job_mgr_JOBID.log appears in the account of the mapped user.
  • Hints:
    • Check the gram_job... file. It contains useful information.
    • Once the problem was due to an empty /opt/globus/libexec/globus-script-initialize.
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.3 - 18 Jun 2007 - JuanJSaborido)

META TOPICPARENT WebHome
-- MarcosASeco - 28 Mar 2007
Line: 12 to 12

  • solution ??:
    • Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file
Added:
>
>

Maradona Problem

  • The line "set queue ops resources_default.nodes = 1:lcgpro" was missing from the PBS configuration. It was added by hand with "qmgr".
Difference Topic ProblemsFoundAtUSC-LCG2 (r1.2 - 28 Mar 2007 - MarcosASeco)

META TOPICPARENT WebHome
Added:
>
>
-- MarcosASeco - 28 Mar 2007

Added:
>
>

pbs_mom can not connect to pbs_server


Changed:
<
<
-- MarcosASeco - 28 Mar 2007
>
>

  • symptoms:
    • Jobs in the queue are not submitted even if there are empty nodes
    • The following string appears in log files:
      DateStamp lcg-ce PBS_Server: is_request, bad attempt to connect from ip_number:1020 (address not trusted)
  • solution ??:
    • Add the hostname of the CE and 'localhost' to the /var/spool/pbs/server_priv/nodes file

Difference Topic ProblemsFoundAtUSC-LCG2 (r1.1 - 28 Mar 2007 - MarcosASeco)
Line: 1 to 1
Added:
>
>
META TOPICPARENT WebHome

-- MarcosASeco - 28 Mar 2007

Revision r1.1 - 28 Mar 2007 - 16:49 - MarcosASeco
Revision r1.14 - 30 Aug 2015 - 08:13 - MarcosASeco