HTCondor batch system for job submission to a cluster
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
No need to add 20; 2 was enough for a test!
What is in the Remote_TW_xxx directories?
In the remote machine management tool, you can view the communication details using the "Detail" button. What does it say?
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
Yes, I understand, but I wanted to be sure to have enough time to check what's happening.
In each directory I find a list of files; for example, in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/:
remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
Trace_Win.log (x kB)
TraceWin_htc.submit
TraceWin_job.err (0 bytes)
TraceWin_job.log (x kB)
TraceWin_job.out (x kB)
tw_files.merged (0 bytes)
The TraceWin.* files are generated by the TraceWin_htc.submit script file. The content of this script is in my previous message.
The tw_job_run.sh script, by contrast, runs from the command line, and the files listed above can remain even when the directory is marked for deletion. Some of the directories contain only 3 files:
remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
tw_files.merged (0 bytes)
Here are the messages I get:
[38] Exec 'tracelx64' ->Failed [Run Failed]
[38] Delete file 'Remote_DATA.dat_37' -> Ok
[41] Exec 'tracelx64' ->Failed [Run Failed]
[43] Exec 'tracelx64' ->Failed [Run Failed]
[45] Exec 'tracelx64' ->Failed [Run Failed]
[49] Exec 'tracelx64' ->Failed [Run Failed]
[49] Delete file 'Remote_DATA.dat_48' -> Ok
[51] Exec 'tracelx64' ->Failed [Run Failed]
[52] Exec 'tracelx64' ->Failed [ConnectToHost : TimeOut]
[53] Exec 'tracelx64' ->Failed [Run Failed]
[53] Delete file 'Remote_DATA.dat_52' -> Ok
[55] Exec 'tracelx64' ->Failed [Run Failed]
[55] Delete file 'Remote_DATA.dat_54' -> Ok
[57] Exec 'tracelx64' ->Failed [Run Failed]
[57] Delete file 'Remote_DATA.dat_56' -> Ok
[59] Exec 'tracelx64' ->Failed [Run Failed]
[59] Delete file 'Remote_DATA.dat_58' -> Ok
[61] Exec 'tracelx64' ->Failed [Run Failed]
[61] Delete file 'Remote_DATA.dat_60' -> Ok
[63] Exec 'tracelx64' ->Failed [Run Failed]
[63] Delete file 'Remote_DATA.dat_62' -> Ok
[70] Exec 'tracelx64' ->Failed [Run Failed]
[74] Exec 'tracelx64' ->Failed [Run Failed]
[74] GetActif -> Ok
[76] Exec 'tracelx64' ->Failed [Run Failed]
[76] GetActif -> Ok
[78] Exec 'tracelx64' ->Failed [Run Failed]
[78] GetActif -> Ok
[80] Exec 'tracelx64' ->Failed [Run Failed]
[80] GetActif -> Ok
[82] Exec 'tracelx64' ->Failed [Run Failed]
[82] GetActif -> Ok
[83] Exec 'tracelx64' ->Failed [Run Failed]
[83] GetActif -> Ok
[86] Exec 'tracelx64' ->Failed [Run Failed]
[86] GetActif -> Ok
[88] Exec 'tracelx64' ->Failed [Run Failed]
[88] GetActif -> Ok
[97] Exec 'tracelx64' ->Failed [Run Failed]
[97] GetActif -> Ok
[99] Exec 'tracelx64' ->Failed [Run Failed]
[103] Exec 'tracelx64' ->Failed [Run Failed]
[103] GetActif -> Ok
[105] Exec 'tracelx64' ->Failed [Run Failed]
[105] GetActif -> Ok
[126] Exec 'tracelx64' ->Failed [Run Failed]
[43] GetActif -> Ok
[45] GetActif -> Ok
[47] Exec 'tracelx64' ->Failed [Run Failed]
[51] GetActif -> Ok
[52] GetActif -> Ok
[70] GetActif -> Ok
[124] Exec 'tracelx64' ->Failed [Run Failed]
Emmanuel
Re: HTCondor batch system for job submission to a cluster
In case it is useful: .../TW/TWSERV~3/ also contains the following files, which are updated regularly. Their timestamps change each time I check the directory (even that of tracelx64):
tracelx64 (5.4 MB)
Trace_Win.log (x bytes)
TraceWin_htc.submit (367 bytes)
TraceWin_job.err (0 bytes)
TraceWin_job.log (x MB)
TraceWin_job.out (x bytes)
twserver (772.6 kB)
twserver.log (x kB)
Emmanuel
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
Please show me the "TraceWin.log" file; it normally contains the reason why the run failed. I think some files are missing.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
You're right, the init file seems to be missing. Here is what's written in the TraceWin.log file:
Cannot find init file
From->[PROCESS_INIT] : Constructor
I saw this previously, but don't know what to do with it.
Emmanuel
Last edited by Emmanuel on Thu 14 Nov 2024 07:29, edited 3 times in total.
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
That means the ini file is missing. For example, the directory .../TW/TWSERV~3/Remote_TW_24_lpsc/ should contain "remote.ini_24".
That is very strange, because you do have it.
Regards,
Didier
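One way to check this across the remote directories is a small shell sketch like the one below. It only assumes the file names mentioned in this thread (remote.ini_*, tracelx64, tw_files.merged); the function name check_remote_dir is hypothetical and is not part of TraceWin.

```shell
#!/bin/bash
# Hypothetical helper (not part of TraceWin): report the files of one
# Remote_TW_* directory that are missing or have a size of zero.
check_remote_dir() {
    local dir=$1 rc=0 ini_found=0 f
    # The ini file carries a numeric suffix (e.g. remote.ini_24), so use a glob.
    for f in "$dir"/remote.ini_*; do
        [ -e "$f" ] || continue          # an unmatched glob stays literal
        ini_found=1
        [ -s "$f" ] || { echo "EMPTY   $f"; rc=1; }
    done
    [ "$ini_found" -eq 1 ] || { echo "MISSING $dir/remote.ini_*"; rc=1; }
    for f in tracelx64 tw_files.merged; do
        if [ ! -e "$dir/$f" ]; then
            echo "MISSING $dir/$f"; rc=1
        elif [ ! -s "$dir/$f" ]; then
            echo "EMPTY   $dir/$f"; rc=1
        fi
    done
    return "$rc"
}

# Possible usage, run from the server directory while a job is active:
#   for d in Remote_TW_*/; do check_remote_dir "${d%/}"; done
```

Running it a few times during a job would show at which moment a file disappears or drops to zero size.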
Re: HTCondor batch system for job submission to a cluster
Dear Didier,
When I launch TraceWin, I see that tracelx64 and tw_files.merged first have a non-zero size (5.4 MB and 24.36 MB respectively). Then their size drops to zero. I think this happens when the script tries to run tracelx64, because the path given in the submit script differs from the directory where the submit script actually is. For example, if the script is located in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/, then the variable "executable" equals .../TW/TWSERV~3/Remote_TW_73_lpsc*****/twlx64. So TraceWin does not find the files related to tracelx64 in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/, and might get lost.
The problem is then: how do I get rid of the directory in the full_name_code argument of tw_job_run.sh?
I modified tw_job_run.sh by replacing full_name_code=$1 with these two lines:
Code: Select all
current_dir=$(pwd)
full_name_code="$current_dir/tracelx64"
This works only once, in the first directory. Then, as TraceWin creates the other directories, in every .../TW/TWSERV~3/Remote_TW_xx_lpsc*****/ the path written to the "executable" variable of the submit script differs from the directory where the submit script is located.
I think that TraceWin doesn't keep the link between the path where the submit script is written and the path where tracelx64 is launched. I don't know how to address this issue, as I cannot modify anything other than tw_job_run.sh or TraceWin_htc.submit, and I can't send a command to TraceWin to lock down the link between the path in full_name_code and the location of TraceWin_htc.submit.
I also noticed that TraceWin_htc.submit is constantly modified in .../TW/TWSERV~3/ (its timestamp changes during the run).
Emmanuel
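The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming the submit file contains a single "executable = ..." line as in the script posted in this thread; the helper names submit_executable and executable_matches_submit_dir are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of the "executable = ..." line
# of an HTCondor submit file (first match only).
submit_executable() {
    sed -n 's/^[[:space:]]*[Ee]xecutable[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if the executable named in the submit
# file lives in the same directory as the submit file itself.
executable_matches_submit_dir() {
    local submit_file=$1
    local submit_dir exe_dir
    # Resolve the submit file's directory to an absolute path.
    submit_dir=$(cd -- "$(dirname "$submit_file")" >/dev/null && pwd) || return 2
    exe_dir=$(dirname "$(submit_executable "$submit_file")")
    [ "$submit_dir" = "$exe_dir" ]
}

# Possible usage, run from the server directory:
#   for f in Remote_TW_*/TraceWin_htc.submit; do
#       executable_matches_submit_dir "$f" || echo "mismatch: $f -> $(submit_executable "$f")"
#   done
```

A plain string comparison is used on purpose: it reports exactly the situation described in the post, where the submit file in one Remote_TW_xx directory points at tracelx64 in another.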
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
I'm finding it hard to understand this story about a directory that changes for no reason. To simplify things, I suggest you start by using just one core.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
This morning, I modified tw_job_run.sh as shown below, without any progress on the issue. It seems that TraceWin doesn't use it. The submit script that I find in the remote directory is not modified. I restarted twserver, but the issue remains.
I also started TraceWin with one core, but it failed to run tracelx64, even though the path in the submit script was correct.
Emmanuel
Code: Select all
#!/bin/bash
# tw_job_run.sh: write an HTCondor submit file in the current directory and submit it
export PATH=/pole_acc_ssi/froidef/TW:$PATH
# Build the executable path from the current directory instead of taking it from $1
current_dir=$(pwd)
full_name_code="$current_dir/tracelx64"
arg1=$2
arg2=$3
arg3=$4
file_name="TraceWin_htc.submit"
full_file_name="$current_dir/$file_name"
# Single "=" here: with "==" each value would start with a stray "=" character
out_file_name="$current_dir/TraceWin_job.out"
log_file_name="$current_dir/TraceWin_job.log"
err_file_name="$current_dir/TraceWin_job.err"
cat > "$full_file_name" << EOL
Universe = vanilla
executable = $full_name_code
Arguments = $arg1 $arg2 $arg3
output = $out_file_name
log = $log_file_name
error = $err_file_name
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
+JobDurationCategory = "Long"
request_cpus = 1
request_memory = 3GB
request_disk = 5GB
Queue 1
EOL
# Launch the job with its arguments and return the job id in the ID variable
# Launch the job with the appropriate command from your batch system
condor_submit "$full_file_name"
# "\n" puts each ClusterId on its own line, so that tail -n 1 returns a single id
ID=$(condor_q -format "%d\n" ClusterId | tail -n 1)
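Reading the job id back from condor_q can return the id of another job when several are queued at the same time. As a hedged alternative, the sketch below takes the id from the submission itself. It assumes that the installed condor_submit supports the -terse option, which prints the submitted job-id range in the form "cluster.proc - cluster.proc"; the function name cluster_id_from_terse is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the cluster id from the terse output of
# condor_submit, assumed to look like "1234.0 - 1234.0".
cluster_id_from_terse() {
    local first=${1%% *}            # keep the first job id, e.g. "1234.0"
    printf '%s\n' "${first%%.*}"    # drop the ".proc" part, e.g. "1234"
}

# Possible usage in tw_job_run.sh, replacing the condor_q line
# (assumption: condor_submit -terse is available on the cluster):
#   ID=$(cluster_id_from_terse "$(condor_submit -terse "$full_file_name")")
```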
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
I don't understand this sentence:
"This morning, I modified tw_job_run.sh as shown below, without any progress on the issue. It seems that TraceWin doesn't use it. The submit script that I find in the remote directory is not modified. I restarted twserver, but the issue remains."
The server uses the script you provide, so you haven't really modified it!
Regards,
Didier