HTCondor batch system for job submission to a cluster
Hello Didier,
The batch system that allows us to use the servers dedicated to simulation in my lab is HTCondor.
In the TraceWin help, in the section "For all other clusters", twserver is meant to be started with an argument called "cluster", which I guess is ess or isipic, judging from the other cluster examples. With HTCondor, I don't have such a name to give to twserver. So I simply typed "cluster"; I also tried the name of the cluster's node server, but twserver then returns "Incorrect argument".
By the way, when I try to submit the job to the cluster in the way recommended by our IT team, no job is launched. To submit a job, I created a test_TraceWin.submit file that is passed as the argument to the condor_submit command (download it here: https://mycore.core-cloud.net/index.php ... rgWd0qALVw). In twserver.log (attached), I can't find any information that would help.
I'm not sure that this kind of job submission is allowed with TraceWin. Can you tell me how to proceed, if you have ever had to use HTCondor, or would it be possible to allow such a job submission method?
Thank you in advance for any help.
Emmanuel
Attachments: twserver.log (888 Bytes)
Re: HTCondor batch system for job submission to a cluster
Hi Emmanuel,
The 3 scripts "tw_job_run.sh", "tw_job_status.sh", "tw_job_kill.sh", given as examples in the manual must be modified to suit your cluster.
Then, the first step is to check that the 3 routines work correctly by hand without going through twserver.
- Can you launch a job via ‘tw_job_run.sh’ and give me the screen output?
- Can you check its status with ‘tw_job_status.sh’ and give me the screen output?
- Can you kill the job via ‘tw_job_kill.sh’?
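For an HTCondor pool, the status and kill checks could be tried by hand along the following lines. This is only a sketch: condor_q, condor_rm and the JobStatus attribute are standard HTCondor, but the job ID 1234.0 and the helper name condor_status_name are made up for illustration, and the output format that twserver expects from tw_job_status.sh is not covered here.

```shell
#!/bin/bash
# Translate HTCondor's numeric JobStatus attribute into a readable word.
# Codes 1 to 5 are the standard HTCondor values; anything else is
# reported as "unknown".
condor_status_name() {
  case "$1" in
    1) echo "idle" ;;
    2) echo "running" ;;
    3) echo "removed" ;;
    4) echo "completed" ;;
    5) echo "held" ;;
    *) echo "unknown" ;;
  esac
}

# By hand on the submission server (hypothetical job ID, not run here):
#   condor_q 1234.0 -af JobStatus     # prints the numeric status
#   condor_status_name 2              # prints "running"
#   condor_rm 1234.0                  # removes (kills) the job
```

A job that has already finished normally leaves the queue, so condor_q may print nothing for it; in that case condor_history is the place to look.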
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
Well, there are no standard commands equivalent to sbatch, squeue or scancel, which come with the Slurm package. I can retrieve the job ID, or use kill -9/-15 instead of scancel (I'm not sure that's equivalent), but squeue has strictly no similar command.
At least, if I simply write the following lines in tw_job_run.sh, I can launch the code and retrieve the job ID:
...
$1 $2 $3 $4 &
ID=$(pidof $full_name_code)
But I'm not sure that the code runs on the servers listed: I couldn't see any sign that the other servers of the cluster are being used, even though they appear in the computer list. It seems that the job I can see running runs only on the submission server.
The submit file includes a list of servers. The code isn't meant to choose the servers where it runs; this is HTCondor's job. The cluster's servers aren't reachable with TraceWin's base communication protocol, ssh. Finally, the .submit file does not run TraceWin.
So I don't know whether HTCondor and the Slurm package are compatible.
I'm still clueless about how to use this cluster for TraceWin.
Emmanuel
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
I don't think we understand each other, so it would be easier if you could give me a call.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
It seems I was considering things the wrong way.
I had an answer from the IT team saying that Slurm and HTCondor are two totally different batch systems.
However, they tell me that there may be ways to convert the scripts from one system to the other (see the web site here: https://portal.osg-htc.org/documentatio ... _HTCondor/).
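As a rough starting point for such a conversion, the three Slurm commands used in the example scripts each have a commonly used HTCondor counterpart. The lookup below is a minimal sketch; the function name slurm_to_condor is hypothetical, and the options and output formats of the two families differ, so each script still has to be adapted by hand.

```shell
#!/bin/bash
# Print the HTCondor command that plays the same role as a given
# Slurm command. Only the three commands used by the tw_job_*.sh
# example scripts are covered.
slurm_to_condor() {
  case "$1" in
    sbatch)  echo "condor_submit" ;;  # submit a job
    squeue)  echo "condor_q" ;;       # query the queue
    scancel) echo "condor_rm" ;;      # remove a job
    *)       echo "unknown"; return 1 ;;
  esac
}

# Example:
#   slurm_to_condor squeue    # prints "condor_q"
```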
I'm working on it.
Emmanuel
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
The scripts I've given are only examples; they obviously need to be adapted while keeping the inputs and outputs compatible.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
OK, this is something I understood from the very beginning.
So it requires a learning stage on my part.
Emmanuel
Re: HTCondor batch system for job submission to a cluster
Hello Didier,
I'm still trying to run TraceWin on a cluster managed by HTCondor, but without success so far. If someone has any hint or sees any mistake, please help.
The script tw_job_run.sh is written as follows:
#!/bin/bash
export PATH=/pole_acc_ssi/froidef/TW:$PATH
full_name_code=$1
arg1=$2
arg2=$3
arg3=$4
cat > TraceWin_htc.submit << EOL
Universe = vanilla
executable = $full_name_code
Arguments = $arg1 $arg2 $arg3
output = TraceWin_job.out
log = TraceWin_job.log
error = TraceWin_job.err
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
+JobDurationCategory = "Long"
request_cpus = 1
request_memory = 3GB
request_disk = 5GB
Queue 1
EOL
# Launch the job with its arguments and return the job ID in the ID variable
# Launch the job with the appropriate command for your batch system
condor_submit TraceWin_htc.submit
ID=$(condor_q -format "%d." ClusterId -format "%d\n" ProcId | tail -n 1)
Emmanuel
Last edited by Emmanuel on Thu 14 Nov 2024 11:21, edited 1 time in total.
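One possible weak point in the script above is the job ID lookup: taking the last line of condor_q can pick up a different job when several are queued at once. A hedged alternative is to read the ID straight from the submission itself. The sketch below assumes that `condor_submit -terse` prints a job ID range of the form "1234.0 - 1234.0"; the helper name first_job_id is made up for illustration.

```shell
#!/bin/bash
# Read the output of `condor_submit -terse` on stdin and print the
# first job ID of the range (ClusterId.ProcId).
# Assumed input format: "1234.0 - 1234.0".
first_job_id() {
  awk '{ print $1; exit }'
}

# Possible use inside tw_job_run.sh (not run here):
#   ID=$(condor_submit -terse TraceWin_htc.submit | first_job_id)
```

Whether twserver expects the ID with or without the ".0" process suffix is not known from this thread, so that part may need adjusting.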
Re: HTCondor batch system for job submission to a cluster
Hello Didier,
While testing a script that sends jobs to the submission server, I find 149 Remote_TW_*** directories in the TWServer directory, whereas I set the number of runs to 20 for the X & Y magnet element rotation statistical static error study.
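For reference, a count like this can be obtained with a small helper such as the one below. It is only a sketch: the function name count_remote_dirs is hypothetical, and the path to the TWServer directory has to be supplied by the user.

```shell
#!/bin/bash
# Count the Remote_TW_* subdirectories directly under the given directory.
count_remote_dirs() {
  base=$1
  n=0
  for d in "$base"/Remote_TW_*/; do
    # If nothing matches, the pattern stays literal and is not a directory.
    if [ -d "$d" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example (path is a placeholder):
#   count_remote_dirs /path/to/TWServer
```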
Emmanuel
Re: HTCondor batch system for job submission to a cluster
In addition, TraceWin finally indicates (after a few minutes) that all the processors on the submission server have the status "failled". The Remote_TW_*** directories are still visible in the TWServer directory.
Emmanuel