Hello Didier,
The batch system that allows us to use the servers dedicated to simulation in my lab is HTCondor.
In the TraceWin help, the section "For all other clusters" says that twserver should be started with an argument called "cluster", which I guess from the other cluster examples is a name such as ess or isipic. With HTCondor, I have no such name to give to twserver, so I simply typed "cluster". I also tried the name of the cluster node server, but twserver then returns "Incorrect argument".
By the way, when I try to submit the job to the cluster in the way recommended by our IT team, no job is launched. To submit a job, I created a test_TraceWin.submit file, which is passed as an argument to the condor_submit command (download it here: https://mycore.core-cloud.net/index.php ... rgWd0qALVw ). I can't find any helpful information in the attached twserver.log.
I'm not sure that this kind of job submission is supported by TraceWin. If you have ever used HTCondor, can you tell me how to proceed? Otherwise, would it be possible to support this job submission method?
Thank you in advance for any help.
Emmanuel
HTCondor batch system for job submission to a cluster
Attachments: twserver.log (888 Bytes)
Re: HTCondor batch system for job submission to a cluster
Hi Emmanuel,
The three scripts "tw_job_run.sh", "tw_job_status.sh" and "tw_job_kill.sh", given as examples in the manual, must be modified to suit your cluster.
The first step is then to check by hand, without going through twserver, that the three scripts work correctly:
- Launch a job via "tw_job_run.sh" and send me the screen output.
- Check its status with "tw_job_status.sh" and send me the screen output.
- Check that you can kill the job via "tw_job_kill.sh".
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
Well, I find no standard commands equivalent to sbatch, squeue or scancel, which come with the Slurm package. I can retrieve the job ID, or use kill -9/-15 instead of scancel (I'm not sure that's equivalent), but squeue has strictly no similar command.
At least, if I simply write the following lines in tw_job_run.sh, I can launch the code and retrieve the job ID:
...
$1 $2 $3 $4 &
ID=$(pidof $full_name_code)
But I'm not sure that the code runs on the servers listed: I couldn't see any sign that the other servers of the cluster are being used, even though they appear in the computer list. The job I can see running seems to run only on the submission server.
The submit file includes a list of servers. The code isn't meant to choose the servers on which it runs; that is HTCondor's job. The cluster's servers aren't reachable through ssh, TraceWin's base communication protocol. Finally, the .submit file does not run TraceWin.
So I don't know whether HTCondor and the Slurm package are compatible.
I'm still clueless about how to use this cluster for TraceWin.
Emmanuel
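For reference, an HTCondor submit description file normally names the executable that HTCondor should start on whichever machine it selects. The following is only a minimal sketch of generating such a file from a shell script. The function name write_submit_file, the output file names and the example paths are hypothetical and are not taken from the test_TraceWin.submit file discussed above.

```shell
#!/bin/sh
# Sketch: write a minimal HTCondor submit description file.
# Assumption: the function name and every path/argument are placeholders.

write_submit_file() {
    # $1: submit file to create
    # $2: executable HTCondor should run on the execute node
    # $3: argument string passed to that executable
    cat > "$1" <<EOF
universe   = vanilla
executable = $2
arguments  = $3
output     = job.\$(Cluster).\$(Process).out
error      = job.\$(Cluster).\$(Process).err
log        = job.\$(Cluster).log
queue
EOF
}

# Example use (hypothetical names):
#   write_submit_file my_job.submit /path/to/your_executable "arg1 arg2"
#   condor_submit my_job.submit
```

The escaped \$(Cluster) and \$(Process) are written literally into the file so that HTCondor, not the shell, expands them at submission time.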
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
I don't think we understand each other, so it would be easier if you could give me a call.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
It seems I was looking at things the wrong way.
I got an answer from the IT team saying that Slurm and HTCondor are two totally different batch systems.
However, they tell me that there may be ways to convert the scripts from one system to the other (see the web site here: https://portal.osg-htc.org/documentatio ... _HTCondor/).
I'm working on it.
Emmanuel
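As a rough illustration of such a conversion, the HTCondor command-line tools condor_submit, condor_q and condor_rm play roles comparable to sbatch, squeue and scancel. The sketch below wraps them in shell functions. The function names are hypothetical, and the inputs and outputs that twserver expects from the three tw_job_*.sh scripts are not shown in this thread, so this would still have to be adapted to the examples in the manual.

```shell
#!/bin/sh
# Sketch of HTCondor counterparts to the Slurm commands:
#   sbatch  -> condor_submit
#   squeue  -> condor_q
#   scancel -> condor_rm
# Assumption: all function names below are illustrative only.

# Extract the cluster (job) ID from condor_submit's usual confirmation
# line, e.g. "1 job(s) submitted to cluster 1234."
parse_cluster_id() {
    printf '%s\n' "$1" |
        sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Submit a job described by a submit file ($1) and print its cluster ID.
submit_job() {
    out=$(condor_submit "$1") || return 1
    parse_cluster_id "$out"
}

# Show the queue entries for a given cluster ID ($1).
job_status() {
    condor_q "$1"
}

# Remove (kill) all jobs of a given cluster ID ($1).
kill_job() {
    condor_rm "$1"
}
```

Only parse_cluster_id can be tried without an HTCondor installation; the other three functions simply forward to the HTCondor tools and are worth checking by hand on the submission server first.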
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
The scripts I've given are only examples; they obviously need to be adapted, keeping the inputs and outputs compatible.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
OK, this is something I understood from the very beginning.
So it requires a learning stage on my part.
Emmanuel