HTCondor batch system for job submission to a cluster

Post by **Emmanuel** » Wed 30 Oct 2024 16:42

Hello Didier,

The batch system that allows us to use the servers dedicated to simulation in my lab is HTCondor.

In the TraceWin help, in the section "For all other clusters", twserver is meant to be started with an argument called "cluster", which I guess from the other cluster examples is a name such as ess or isipic. With HTCondor, I don't have such a name to pass to twserver. So I simply typed "cluster"; I also tried the name of the cluster node server, but "Incorrect argument" is then returned when I try to run twserver.

By the way, when I try to submit the job to the cluster in the way recommended by our IT team, no job is launched. To submit a job, I created a test_TraceWin.submit file, which is passed as an argument to the condor_submit command (download it here: https://mycore.core-cloud.net/index.php ... rgWd0qALVw ). In the twserver.log (attached), I can't find any information that would help.

I'm not sure that this kind of job submission is allowed with TraceWin. Can you tell me how to proceed, if you have ever had to use HTCondor, or would it be possible to allow such a job submission method?

Thank you in advance for any help.

Emmanuel

Post by **Didier** » Thu 31 Oct 2024 11:02

Hi Emmanuel,

The three scripts "tw_job_run.sh", "tw_job_status.sh" and "tw_job_kill.sh", given as examples in the manual, must be modified to suit your cluster.
Then, the first step is to check by hand that the three routines work correctly, without going through twserver:
- Can you launch a job via "tw_job_run.sh" and give me the screen output?
- Check its status with "tw_job_status.sh" and give me the screen output.
- Check that you can kill the job via "tw_job_kill.sh".

Regards,

Didier

Post by **Emmanuel** » Mon 4 Nov 2024 08:23

Well, there are no standard commands equivalent to sbatch, squeue or scancel, which come with the slurm package. I can retrieve the job ID, or use kill -9/-15 instead of scancel (I'm not sure that's equivalent). But squeue has strictly no similar command.

At least, if I simply write the following lines in tw_job_run.sh, I can launch the code and retrieve the job ID:
...
$1 $2 $3 $4 &
ID=$(pidof $full_name_code)

But I'm not sure that the code runs on the servers listed: I couldn't see any sign that the other servers of the cluster are being used, even though they appear in the computer list. It seems that the job I can see running runs only on the submission server.

The submit file includes a list of servers. The code isn't meant to choose the servers where it runs. This is HTCondor's job. The cluster's servers aren't reachable with ssh, the base communication protocol of TraceWin. Finally, the .submit file does not run TraceWin.
So I don't know whether HTCondor and the slurm package are compatible.
I'm still clueless about how to use this cluster for TraceWin.
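
For reference, an HTCondor submit description file names the executable to run and leaves the choice of execute node to HTCondor. Below is a minimal sketch in shell, assuming the code can be started from a single command line. The helper name write_submit_file and the output file names are hypothetical, and a site template from the IT team may require additional settings.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal HTCondor submit description file
# that runs one command line on whichever node HTCondor selects.
# Usage: write_submit_file <submit_file> <executable> [arguments...]
# Assumption: the arguments contain no spaces or quotes of their own.
write_submit_file() {
    submit_file=$1
    executable=$2
    shift 2
    # \$(Cluster) is escaped so that the literal HTCondor macro ends up
    # in the file instead of being expanded by the shell.
    cat > "$submit_file" <<EOF
universe   = vanilla
executable = $executable
arguments  = $*
output     = job.\$(Cluster).out
error      = job.\$(Cluster).err
log        = job.\$(Cluster).log
queue
EOF
}
```

The resulting file would then be passed to condor_submit, in the same way as test_TraceWin.submit.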

Emmanuel

Post by **Didier** » Mon 4 Nov 2024 14:33

Dear Emmanuel,

I don't think we understand each other, so if you could give me a call, it would be easier.

Regards,

Didier

Post by **Emmanuel** » Mon 4 Nov 2024 14:35

It seems I was considering things the wrong way.

I had an answer from the IT team saying that slurm and HTCondor are two totally different batch systems.

However, they tell me that there may be ways to convert the scripts from one system to the other (see the web site here: https://portal.osg-htc.org/documentatio ... _HTCondor/).
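
As a rough illustration of such a conversion, the three Slurm commands have HTCondor counterparts: condor_submit for sbatch, condor_q for squeue and condor_rm for scancel. The sketch below wraps them in shell functions. The function names are hypothetical, the exact inputs and outputs that twserver expects from tw_job_run.sh, tw_job_status.sh and tw_job_kill.sh are not shown here and must be taken from the manual, and the parsing assumes that "condor_submit -terse" prints a job ID range such as "1234.0 - 1234.0".

```shell
#!/bin/sh
# Sketch of HTCondor counterparts to the Slurm commands:
#   sbatch -> condor_submit, squeue -> condor_q, scancel -> condor_rm.
# All function names are hypothetical.

# Extract the cluster ID from the output of "condor_submit -terse",
# assumed to be a job ID range such as "1234.0 - 1234.0".
parse_cluster_id() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Submit the job described by a submit file and print its cluster ID.
job_run() {
    out=$(condor_submit -terse "$1") || return 1
    parse_cluster_id "$out"
}

# Show the queue entries for one cluster ID.
job_status() {
    condor_q "$1"
}

# Remove all jobs of one cluster ID from the queue.
job_kill() {
    condor_rm "$1"
}
```

Checked by hand, this would be something like job_run test_TraceWin.submit, followed by job_status and job_kill with the ID that was printed. Note that condor_q only lists jobs still in the queue, so a finished job no longer appears there.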

I'm working on it.

Emmanuel

Post by **Didier** » Mon 4 Nov 2024 14:53

Dear Emmanuel,

The scripts I've given are only examples; they obviously need to be adapted while keeping the inputs and outputs compatible.

Regards,

Didier

Post by **Emmanuel** » Mon 4 Nov 2024 15:57

OK, this is something I understood from the very beginning.

So it requires a learning stage on my part.

Emmanuel

CEA Codes forum
