HTCondor batch system for job submission to a cluster

https://www.dacm-logiciels.fr/tracewin
FranceDidier
Administrator
Posts: 965
Joined: Wed 26 Aug 2020 14:40
Country: France

Re: HTCondor batch system for job submission to a cluster

Post by FranceDidier »

Dear Emmanuel,

There was no need to add 20; 2 would have been enough for a test!
What is in the Remote_TW_xxx directories?
In the remote machine management tool, you can view the communication details using the "Detail" button. What does it say?

Regards,

Didier
FranceEmmanuel
Apprentice
Posts: 29
Joined: Thu 22 Sep 2022 08:45
Country: France

Re: HTCondor batch system for job submission to a cluster

Post by FranceEmmanuel »

Yes, I understand, but I wanted to be sure I would have enough time to check what was happening.

In each directory I find a list of files. For example, in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/ :

remote.ini_24 (44.8 KB)
tracelx64 (0 bytes)
Trace_Win.log (x KB)
TraceWin_htc.submit
TraceWin_job.err (0 bytes)
TraceWin_job.log (x KB)
TraceWin_job.out (x KB)
tw_files.merged (0 bytes)

The TraceWin.* files are generated by the TraceWin_htc.submit script. The contents of this script are in my previous message.

The tw_job_run.sh script, by contrast, runs from the command line. Even when a directory is marked for deletion, the files listed above can remain. Some of the directories contain only 3 files:
remote.ini_24 (44.8 KB)
tracelx64 (0 bytes)
tw_files.merged (0 bytes)
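
Since several of the files listed above are reported as empty, it may help to list the zero-byte files in a given Remote_TW_xxx directory automatically. The following is only a minimal sketch: the helper name `find_empty_files` is hypothetical and is not part of TraceWin or HTCondor.

```python
from pathlib import Path


def find_empty_files(directory):
    """Return the sorted names of regular files in `directory` that are 0 bytes long.

    Hypothetical helper for illustration only; not a TraceWin or HTCondor tool.
    """
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )
```

Called on the directory from the example above, it would be expected to report tracelx64 and tw_files.merged, plus TraceWin_job.err where that file exists.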

Here are the messages I get:

[38] Exec 'tracelx64' ->Failed [Run Failed]
[38] Delete file 'Remote_DATA.dat_37' -> Ok
[41] Exec 'tracelx64' ->Failed [Run Failed]
[43] Exec 'tracelx64' ->Failed [Run Failed]
[45] Exec 'tracelx64' ->Failed [Run Failed]
[49] Exec 'tracelx64' ->Failed [Run Failed]
[49] Delete file 'Remote_DATA.dat_48' -> Ok
[51] Exec 'tracelx64' ->Failed [Run Failed]
[52] Exec 'tracelx64' ->Failed [ConnectToHost : TimeOut]
[53] Exec 'tracelx64' ->Failed [Run Failed]
[53] Delete file 'Remote_DATA.dat_52' -> Ok
[55] Exec 'tracelx64' ->Failed [Run Failed]
[55] Delete file 'Remote_DATA.dat_54' -> Ok
[57] Exec 'tracelx64' ->Failed [Run Failed]
[57] Delete file 'Remote_DATA.dat_56' -> Ok
[59] Exec 'tracelx64' ->Failed [Run Failed]
[59] Delete file 'Remote_DATA.dat_58' -> Ok
[61] Exec 'tracelx64' ->Failed [Run Failed]
[61] Delete file 'Remote_DATA.dat_60' -> Ok
[63] Exec 'tracelx64' ->Failed [Run Failed]
[63] Delete file 'Remote_DATA.dat_62' -> Ok
[70] Exec 'tracelx64' ->Failed [Run Failed]
[74] Exec 'tracelx64' ->Failed [Run Failed]
[74] GetActif -> Ok
[76] Exec 'tracelx64' ->Failed [Run Failed]
[76] GetActif -> Ok
[78] Exec 'tracelx64' ->Failed [Run Failed]
[78] GetActif -> Ok
[80] Exec 'tracelx64' ->Failed [Run Failed]
[80] GetActif -> Ok
[82] Exec 'tracelx64' ->Failed [Run Failed]
[82] GetActif -> Ok
[83] Exec 'tracelx64' ->Failed [Run Failed]
[83] GetActif -> Ok
[86] Exec 'tracelx64' ->Failed [Run Failed]
[86] GetActif -> Ok
[88] Exec 'tracelx64' ->Failed [Run Failed]
[88] GetActif -> Ok
[97] Exec 'tracelx64' ->Failed [Run Failed]
[97] GetActif -> Ok
[99] Exec 'tracelx64' ->Failed [Run Failed]
[103] Exec 'tracelx64' ->Failed [Run Failed]
[103] GetActif -> Ok
[105] Exec 'tracelx64' ->Failed [Run Failed]
[105] GetActif -> Ok
[126] Exec 'tracelx64' ->Failed [Run Failed]
[43] GetActif -> Ok
[45] GetActif -> Ok
[47] Exec 'tracelx64' ->Failed [Run Failed]
[51] GetActif -> Ok
[52] GetActif -> Ok
[70] GetActif -> Ok
[124] Exec 'tracelx64' ->Failed [Run Failed]
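
A long message list like the one above is easier to read once it is tallied by action, status, and failure reason. The sketch below assumes every line follows the `[id] action -> status [reason]` shape seen in this post; the function name `tally` and the regular expression are illustrative assumptions, not part of TraceWin.

```python
import re
from collections import Counter

# Matches lines such as:
#   [52] Exec 'tracelx64' ->Failed [ConnectToHost : TimeOut]
#   [38] Delete file 'Remote_DATA.dat_37' -> Ok
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*?)\s*->\s*(Ok|Failed)(?:\s+\[(.*)\])?\s*$")


def tally(lines):
    """Count messages per (first word of the action, status, reason)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a message line
        _job_id, action, status, reason = match.groups()
        verb = action.split()[0]  # e.g. "Exec", "Delete", "GetActif"
        counts[(verb, status, reason or "")] += 1
    return counts
```

Feeding it the lines above would show at a glance that nearly all `Exec` calls end in `Run Failed`, with one `ConnectToHost : TimeOut`.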


Emmanuel

Re: HTCondor batch system for job submission to a cluster

Post by FranceEmmanuel »

In case it is useful: in .../TW/TWSERV~3/ I also find the following files. They are updated regularly, since their timestamps change each time I check the directory (even tracelx64):
tracelx64 (5.4 MB)
Trace_Win.log (x bytes)
TraceWin_htc.submit (367 bytes)
TraceWin_job.err (0 bytes)
TraceWin_job.log (x MB)
TraceWin_job.out (x bytes)
twserver (772.6 KB)
twserver.log (x KB)

Emmanuel

Re: HTCondor batch system for job submission to a cluster

Post by FranceDidier »

Dear Emmanuel,

Please show me the "TraceWin.log" file; it normally contains the reason why the run failed. I think some files are missing.

Regards,

Didier

Re: HTCondor batch system for job submission to a cluster

Post by FranceEmmanuel »

You're right, the init file seems to be missing. Here is what's written in the TraceWin.log file:

Cannot find init file
From->[PROCESS_INIT] : Constructor

I had already seen this, but I don't know what to do with it.

Emmanuel
Last edited by FranceEmmanuel on Wed 13 Nov 2024 16:26, edited 1 time in total.

Re: HTCondor batch system for job submission to a cluster

Post by FranceDidier »

Dear Emmanuel,

That means the ini file is missing. For example, in the directory .../TW/TWSERV~3/Remote_TW_24_lpsc/ you should have "remote.ini_24".
It's very strange, because you do have it.

Regards,

Didier
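
To check the point above across all job directories at once, one could verify that each Remote_TW_<n>_... directory holds a non-empty remote.ini_<n>. This is a rough sketch under a stated assumption: the naming rule (directory number equals ini-file suffix) is inferred only from the single Remote_TW_24 / remote.ini_24 example in this thread, and `check_ini_files` is a hypothetical helper, not a TraceWin feature.

```python
import re
from pathlib import Path

# Assumed naming rule, inferred from one example in the thread:
# directory "Remote_TW_24_<suffix>" should contain "remote.ini_24".
DIR_RE = re.compile(r"^Remote_TW_(\d+)_")


def check_ini_files(server_dir):
    """Map each Remote_TW_<n>_* subdirectory name to 'present', 'empty' or 'missing'."""
    report = {}
    for entry in sorted(Path(server_dir).iterdir()):
        match = DIR_RE.match(entry.name)
        if not (entry.is_dir() and match):
            continue  # ignore files and unrelated directories
        ini = entry / f"remote.ini_{match.group(1)}"
        if not ini.is_file():
            report[entry.name] = "missing"
        elif ini.stat().st_size == 0:
            report[entry.name] = "empty"
        else:
            report[entry.name] = "present"
    return report
```

If every directory reports "present" while TraceWin.log still says "Cannot find init file", the file is probably being looked for somewhere else, for example in a different working directory on the execute node; that is a guess to investigate, not a confirmed diagnosis.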