HTCondor batch system for job submission to a cluster

https://www.dacm-logiciels.fr/tracewin

Post by FranceDidier »

Dear Emmanuel,

No need to add 20; 2 would have been enough for a test!
What is in the Remote_TW_xxx directories?
In the remote machine management tool, you can view the communication details using the "Detail" button. What does that say?

Regards,

Didier

Post by FranceEmmanuel »

Yes, I understand, but I wanted to be sure to have enough time to check what's happening.

In each directory I find a list of files, for example in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/:

remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
Trace_Win.log (x kB)
TraceWin_htc.submit
TraceWin_job.err (0 bytes)
TraceWin_job.log (x kB)
TraceWin_job.out (x kB)
tw_files.merged (0 bytes)

The TraceWin.* files are generated by the TraceWin_htc.submit script file. The contents of this script are in my previous message.

tw_job_run.sh, on the other hand, runs from the command line, and even if the directory is marked for deletion, the list of files can remain. In some of the directories there are only 3 files:
remote.ini_24 (44.8 kB)
tracelx64 (0 bytes)
tw_files.merged (0 bytes)
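
A quick way to spot which files in a Remote_TW_xx directory have dropped to zero size is a small shell helper. This is only a sketch: the function name `list_empty_files` is made up for illustration, and the directory path in the usage comment is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper (not part of TraceWin): print the names of the
# regular files in a directory whose size is 0 bytes.
list_empty_files() {
    local dir="$1" f
    for f in "$dir"/*; do
        # -f: regular file; ! -s: size is zero
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            basename "$f"
        fi
    done
}

# Usage (placeholder path):
#   list_empty_files path/to/Remote_TW_dir
```

Running it right after the directory is created and again after the failure would show when tracelx64 and tw_files.merged become empty.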

Here are the messages I get:

[38] Exec 'tracelx64' ->Failed [Run Failed]
[38] Delete file 'Remote_DATA.dat_37' -> Ok
[41] Exec 'tracelx64' ->Failed [Run Failed]
[43] Exec 'tracelx64' ->Failed [Run Failed]
[45] Exec 'tracelx64' ->Failed [Run Failed]
[49] Exec 'tracelx64' ->Failed [Run Failed]
[49] Delete file 'Remote_DATA.dat_48' -> Ok
[51] Exec 'tracelx64' ->Failed [Run Failed]
[52] Exec 'tracelx64' ->Failed [ConnectToHost : TimeOut]
[53] Exec 'tracelx64' ->Failed [Run Failed]
[53] Delete file 'Remote_DATA.dat_52' -> Ok
[55] Exec 'tracelx64' ->Failed [Run Failed]
[55] Delete file 'Remote_DATA.dat_54' -> Ok
[57] Exec 'tracelx64' ->Failed [Run Failed]
[57] Delete file 'Remote_DATA.dat_56' -> Ok
[59] Exec 'tracelx64' ->Failed [Run Failed]
[59] Delete file 'Remote_DATA.dat_58' -> Ok
[61] Exec 'tracelx64' ->Failed [Run Failed]
[61] Delete file 'Remote_DATA.dat_60' -> Ok
[63] Exec 'tracelx64' ->Failed [Run Failed]
[63] Delete file 'Remote_DATA.dat_62' -> Ok
[70] Exec 'tracelx64' ->Failed [Run Failed]
[74] Exec 'tracelx64' ->Failed [Run Failed]
[74] GetActif -> Ok
[76] Exec 'tracelx64' ->Failed [Run Failed]
[76] GetActif -> Ok
[78] Exec 'tracelx64' ->Failed [Run Failed]
[78] GetActif -> Ok
[80] Exec 'tracelx64' ->Failed [Run Failed]
[80] GetActif -> Ok
[82] Exec 'tracelx64' ->Failed [Run Failed]
[82] GetActif -> Ok
[83] Exec 'tracelx64' ->Failed [Run Failed]
[83] GetActif -> Ok
[86] Exec 'tracelx64' ->Failed [Run Failed]
[86] GetActif -> Ok
[88] Exec 'tracelx64' ->Failed [Run Failed]
[88] GetActif -> Ok
[97] Exec 'tracelx64' ->Failed [Run Failed]
[97] GetActif -> Ok
[99] Exec 'tracelx64' ->Failed [Run Failed]
[103] Exec 'tracelx64' ->Failed [Run Failed]
[103] GetActif -> Ok
[105] Exec 'tracelx64' ->Failed [Run Failed]
[105] GetActif -> Ok
[126] Exec 'tracelx64' ->Failed [Run Failed]
[43] GetActif -> Ok
[45] GetActif -> Ok
[47] Exec 'tracelx64' ->Failed [Run Failed]
[51] GetActif -> Ok
[52] GetActif -> Ok
[70] GetActif -> Ok
[124] Exec 'tracelx64' ->Failed [Run Failed]
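
To see at a glance how the failures split by reason, the messages can be piped through a short filter. This is a sketch that assumes the messages have been saved to a text file in the format shown above; the function name `summarize_failures` and the file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read messages on stdin and count the failed
# lines per bracketed reason, e.g. "Run Failed".
summarize_failures() {
    awk '/->Failed \[/ {
             reason = $0
             sub(/.*->Failed \[/, "", reason)   # drop everything up to "["
             sub(/\][ \t\r]*$/, "", reason)     # drop the closing "]"
             count[reason]++
         }
         END { for (r in count) printf "%d %s\n", count[r], r }' | sort -rn
}

# Usage (hypothetical file name):
#   summarize_failures < messages.txt
```

Lines that report success, such as the "Delete file" and "GetActif" entries, are ignored.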


Emmanuel

Post by FranceEmmanuel »

Just in case it is useful, I also find the following files in .../TW/TWSERV~3/. They are updated regularly: their timestamps change each time I check the directory (even that of tracelx64):
tracelx64 (5.4 MB)
Trace_Win.log (x bytes)
TraceWin_htc.submit (367 bytes)
TraceWin_job.err (0 bytes)
TraceWin_job.log (x MB)
TraceWin_job.out (x bytes)
twserver (772.6 kB)
twserver.log (x kB)

Emmanuel

Post by FranceDidier »

Dear Emmanuel,

Please show me the "TraceWin.log" file; it normally contains the reason why the run failed. I think some files are missing.

Regards,

Didier

Post by FranceEmmanuel »

You're right, the init file seems to be missing. Here is what's written in the TraceWin.log file:

Cannot find init file
From->[PROCESS_INIT] : Constructor

I saw this previously, but I don't know what to do with it.

Emmanuel
Last edited by FranceEmmanuel on Thu 14 Nov 2024 07:29, edited 3 times in total.

Post by FranceDidier »

Dear Emmanuel,

That means the ini file is missing. For example, in the directory .../TW/TWSERV~3/Remote_TW_24_lpsc/ you should have "remote.ini_24".
It's very strange, because you do have it.

Regards,

Didier

Post by FranceEmmanuel »

Dear Didier,

What I see when I launch TraceWin is that tracelx64 and tw_files.merged first have a non-zero size (5.4 MB and 24.36 MB respectively). Then their size becomes zero. I think this happens when the script tries to run tracelx64, because the path given in the submit script is different from the directory where the submit script actually is. For example, if the script is located in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/, then the variable "executable" equals .../TW/TWSERV~3/Remote_TW_73_lpsc*****/twlx64. So TraceWin does not find the files related to tracelx64 in .../TW/TWSERV~3/Remote_TW_24_lpsc*****/, and might get lost.
The problem is then: how to get rid of the directory in the full_name_code argument of tw_job_run.sh?
I modified tw_job_run.sh by replacing full_name_code=$1 with these two lines:

Code:

current_dir=$(pwd)
full_name_code="$current_dir/tracelx64"
It only works once, in the first directory. Then, as TraceWin creates the other directories, in every .../TW/TWSERV~3/Remote_TW_xx_lpsc*****/ the path written to the "executable" variable of the submit script is different from the directory where the submit script is located.

I think that TraceWin doesn't keep the link between the path where the submit script is written and the path where tracelx64 is launched. I don't know how to address this issue, as I cannot modify anything other than tw_job_run.sh or TraceWin_htc.submit, and I can't send a command to TraceWin to lock the link between the path in full_name_code and the location of TraceWin_htc.submit.

I also noticed that TraceWin_htc.submit is constantly modified in .../TW/TWSERV~3/ (its timestamp changes during the run).
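
The mismatch described here can be checked mechanically by comparing the directory that holds a submit file with the directory named on its "executable" line. The following is a minimal sketch; `check_submit_executable` is a hypothetical name, and it assumes the submit file uses the plain "executable = /path" form shown in this thread.

```shell
#!/bin/bash
# Hypothetical helper: report whether the "executable" path in a submit
# file points into the same directory as the submit file itself.
check_submit_executable() {
    local submit="$1" submit_dir exe exe_dir
    submit_dir=$(cd "$(dirname "$submit")" && pwd) || return 2
    # Take the value after "executable =" (first matching line only)
    exe=$(sed -n 's/^[[:space:]]*[Ee]xecutable[[:space:]]*=[[:space:]]*//p' "$submit" | head -n 1)
    [ -n "$exe" ] || { echo "no executable line in $submit"; return 2; }
    exe_dir=$(dirname "$exe")
    if [ "$exe_dir" = "$submit_dir" ]; then
        echo "OK: $exe"
    else
        echo "MISMATCH: submit file in $submit_dir, executable in $exe_dir"
        return 1
    fi
}

# Usage (placeholder path):
#   check_submit_executable path/to/Remote_TW_dir/TraceWin_htc.submit
```

The function returns 0 when the two directories agree, 1 on a mismatch, and 2 when the file cannot be read or has no "executable" line, so it can be run in a loop over all the remote directories.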

Emmanuel

Post by FranceDidier »

Dear Emmanuel,

I'm finding it hard to understand this story about a directory that changes for no reason. To simplify things, I suggest you start by using just one core.

Regards,

Didier

Post by FranceEmmanuel »

This morning, I modified tw_job_run.sh as shown below, without any progress on the issue. It seems that TraceWin doesn't use it. The submit script that I find in the remote directory is not modified. I restarted twserver, but the issue remains.

I also started TraceWin with one core, but it failed to run tracelx64, even though the path in the submit script was correct.

Emmanuel

Code:

#!/bin/bash
export PATH=/pole_acc_ssi/froidef/TW:$PATH

# Build the executable path from the current directory instead of $1
current_dir=$(pwd)
full_name_code="$current_dir/tracelx64"

arg1=$2
arg2=$3
arg3=$4

file_name="TraceWin_htc.submit"
full_file_name="$current_dir/$file_name"
# Single "=": with "==" each value would start with a stray "="
out_file_name="$current_dir/TraceWin_job.out"
log_file_name="$current_dir/TraceWin_job.log"
err_file_name="$current_dir/TraceWin_job.err"

# Write the HTCondor submit description file
cat > "$full_file_name" << EOL
Universe = vanilla
executable = $full_name_code
Arguments = $arg1 $arg2 $arg3
output = $out_file_name
log = $log_file_name
error = $err_file_name
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
+JobDurationCategory = "Long"
request_cpus = 1
request_memory = 3GB
request_disk   = 5GB
Queue 1
EOL

# Launch the job with its arguments and return the job id in the ID variable
# Launch the job with the appropriate command from your batch system

condor_submit "$full_file_name"

# "\n" puts each ClusterId on its own line so that tail picks the last one
ID=$(condor_q -format "%d\n" ClusterId | tail -n 1)
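
Reading the ID back with condor_q | tail -n 1 can pick up another job when several are submitted close together. One possible alternative, sketched here under the assumption that condor_submit is called with its -terse option and prints a job ID range such as "1234.0 - 1234.0", is to take the cluster ID from the submit output itself. The helper name `cluster_id_from_terse` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the cluster ID from "condor_submit -terse"
# output, assumed to look like "1234.0 - 1234.0" (first - last job ID).
cluster_id_from_terse() {
    local terse="$1"
    # Drop everything from the first "." onward, leaving the cluster number
    printf '%s\n' "${terse%%.*}"
}

# Usage (needs HTCondor, so it is left as a comment):
#   ID=$(cluster_id_from_terse "$(condor_submit -terse "$full_file_name")")
```

Whether TraceWin accepts an ID obtained this way is not confirmed anywhere in this thread.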

Post by FranceDidier »

Dear Emmanuel,

I don't understand this sentence:

"This morning, I modified tw_job_run.sh as follows hereafter without any progress on the issue. It seems that TraceWin doesn't use it. The submit script that I find in the remote directory is not modified. I restarted twserver, but issue remains."

The server uses the script you provide, so you haven't really modified it!

Regards,

Didier