Dear Emmanuel,
No need to add 20; 2 was enough for a test!
What is in the Remote_TW_xxx directories?
In the remote machine management tool, you can view the communication details using the "Detail" button. What does it say?
Regards,
Didier
HTCondor batch system for job submission to a cluster
Re: HTCondor batch system for job submission to a cluster
Yes, I understand, but I wanted to be sure to have enough time to check what's happening.
In each directory I find a list of files, for example in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/:
remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
Trace_Win.log (x kB)
TraceWin_htc.submit
TraceWin_job.err (0 bytes)
TraceWin_job.log (x kB)
TraceWin_job.out (x kB)
tw_files.merged (0 bytes)
The TraceWin.* files are generated by the TraceWin_htc.submit script file. The contents of this script are in my previous message.
The tw_job_run.sh script, by contrast, runs from the command line, and even if the directory is marked for deletion, the files listed above can remain. In some of the directories there are only 3 files:
remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
tw_files.merged (0 bytes)
Here are the messages I get:
[38] Exec 'tracelx64' ->Failed [Run Failed]
[38] Delete file 'Remote_DATA.dat_37' -> Ok
[41] Exec 'tracelx64' ->Failed [Run Failed]
[43] Exec 'tracelx64' ->Failed [Run Failed]
[45] Exec 'tracelx64' ->Failed [Run Failed]
[49] Exec 'tracelx64' ->Failed [Run Failed]
[49] Delete file 'Remote_DATA.dat_48' -> Ok
[51] Exec 'tracelx64' ->Failed [Run Failed]
[52] Exec 'tracelx64' ->Failed [ConnectToHost : TimeOut]
[53] Exec 'tracelx64' ->Failed [Run Failed]
[53] Delete file 'Remote_DATA.dat_52' -> Ok
[55] Exec 'tracelx64' ->Failed [Run Failed]
[55] Delete file 'Remote_DATA.dat_54' -> Ok
[57] Exec 'tracelx64' ->Failed [Run Failed]
[57] Delete file 'Remote_DATA.dat_56' -> Ok
[59] Exec 'tracelx64' ->Failed [Run Failed]
[59] Delete file 'Remote_DATA.dat_58' -> Ok
[61] Exec 'tracelx64' ->Failed [Run Failed]
[61] Delete file 'Remote_DATA.dat_60' -> Ok
[63] Exec 'tracelx64' ->Failed [Run Failed]
[63] Delete file 'Remote_DATA.dat_62' -> Ok
[70] Exec 'tracelx64' ->Failed [Run Failed]
[74] Exec 'tracelx64' ->Failed [Run Failed]
[74] GetActif -> Ok
[76] Exec 'tracelx64' ->Failed [Run Failed]
[76] GetActif -> Ok
[78] Exec 'tracelx64' ->Failed [Run Failed]
[78] GetActif -> Ok
[80] Exec 'tracelx64' ->Failed [Run Failed]
[80] GetActif -> Ok
[82] Exec 'tracelx64' ->Failed [Run Failed]
[82] GetActif -> Ok
[83] Exec 'tracelx64' ->Failed [Run Failed]
[83] GetActif -> Ok
[86] Exec 'tracelx64' ->Failed [Run Failed]
[86] GetActif -> Ok
[88] Exec 'tracelx64' ->Failed [Run Failed]
[88] GetActif -> Ok
[97] Exec 'tracelx64' ->Failed [Run Failed]
[97] GetActif -> Ok
[99] Exec 'tracelx64' ->Failed [Run Failed]
[103] Exec 'tracelx64' ->Failed [Run Failed]
[103] GetActif -> Ok
[105] Exec 'tracelx64' ->Failed [Run Failed]
[105] GetActif -> Ok
[126] Exec 'tracelx64' ->Failed [Run Failed]
[43] GetActif -> Ok
[45] GetActif -> Ok
[47] Exec 'tracelx64' ->Failed [Run Failed]
[51] GetActif -> Ok
[52] GetActif -> Ok
[70] GetActif -> Ok
[124] Exec 'tracelx64' ->Failed [Run Failed]
Emmanuel
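A minimal sketch for tallying the messages above by failure reason, which makes the single "ConnectToHost : TimeOut" entry easier to separate from the "Run Failed" ones. The line format is inferred only from the messages quoted in this post, the bracketed number is treated as an opaque tag (its meaning is not stated in the thread), and the names `parse_line` and `summarize` are hypothetical helpers, not part of TraceWin.

```python
import re
from collections import Counter

# Matches lines such as: [38] Exec 'tracelx64' ->Failed [Run Failed]
# Format inferred from the messages quoted in the thread (assumption).
LINE_RE = re.compile(
    r"^\[(\d+)\]\s+(.*?)\s*->\s*(Ok|Failed)(?:\s+\[(.*)\])?\s*$"
)


def parse_line(line):
    """Return (tag, action, status, reason), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, action, status, reason = match.groups()
    return int(tag), action, status, reason


def summarize(lines):
    """Count failed messages per reason and collect the tags that failed."""
    failures = Counter()
    failed_tags = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a communication message
        tag, _action, status, reason = parsed
        if status == "Failed":
            failures[reason or "unknown"] += 1
            failed_tags.add(tag)
    return failures, sorted(failed_tags)
```

Calling `summarize` on the lines copied from the "Detail" window returns a counter of reasons and the sorted list of tags with at least one failure.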
Re: HTCondor batch system for job submission to a cluster
Just in case it is useful, I also find the following files in .../TW/TWSERV~3/. They are regularly updated: their timestamps change each time I check the directory (even for tracelx64):
tracelx64 (5.4 MB)
Trace_Win.log (x bytes)
TraceWin_htc.submit (367 bytes)
TraceWin_job.err (0 bytes)
TraceWin_job.log (x MB)
TraceWin_job.out (x bytes)
twserver (772.6 kB)
twserver.log (x kB)
Emmanuel
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
Please show me the "TraceWin.log" file; it normally contains the reason why the run failed. I think some files are missing.
Regards,
Didier
Re: HTCondor batch system for job submission to a cluster
You're right, the init file seems to be missing. Here is what's written in the TraceWin.log file:
Cannot find init file
From->[PROCESS_INIT] : Constructor
I already saw this, but I don't know what to do with it.
Emmanuel
Last edited by Emmanuel on Wed 13 Nov 2024 16:26, edited 1 time in total.
Re: HTCondor batch system for job submission to a cluster
Dear Emmanuel,
That means the ini file is missing. For example, in the directory .../TW/TWSERV~3/Remote_TW_24_lpsc/ you should have "remote.ini_24".
It's very strange, because you do have it.
Regards,
Didier
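As a rough way to check this across all run directories at once, the sketch below lists every Remote_TW_* directory under a given server directory, records each file's size, and notes whether a remote.ini_* file is present. It only reports what is on disk; it does not establish why TraceWin says "Cannot find init file". The function name `inspect_remote_dirs` and the report layout are hypothetical, and the server directory must be supplied by the user.

```python
from pathlib import Path


def inspect_remote_dirs(server_dir):
    """Map each Remote_TW_* directory name to its ini status and file sizes.

    server_dir is the directory containing the Remote_TW_* folders
    (supplied by the caller; no path is assumed here).
    """
    report = {}
    for run_dir in sorted(Path(server_dir).glob("Remote_TW_*")):
        if not run_dir.is_dir():
            continue
        # File name -> size in bytes, for regular files only.
        sizes = {
            entry.name: entry.stat().st_size
            for entry in sorted(run_dir.iterdir())
            if entry.is_file()
        }
        report[run_dir.name] = {
            "has_ini": any(name.startswith("remote.ini_") for name in sizes),
            "empty_files": [name for name, size in sizes.items() if size == 0],
            "sizes": sizes,
        }
    return report
```

A directory whose entry shows `has_ini` as false, or lists a file in `empty_files` that is expected to have content, is a candidate to compare against the moment the corresponding run failed.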