
TOPIC: Bug with the maximum number of cores allowed.

Bug with the maximum number of cores allowed. 7 months 1 week ago #43312

  • Youenn
  • Fresh Boarder
  • Posts: 15
Hi everyone,

I run TELEMAC on my Linux system, and to use all of my cores I launch it with the mpirun --oversubscribe -np x command, which seems to behave differently from one mesh to another.
Sometimes I can use 23 cores, sometimes only 18, and it keeps changing.
Here is one of the errors that I get when it crashes.
The exact same configuration always works if I ask for fewer cores.
The administrator has disabled public write access.

Bug with the maximum number of cores allowed. 7 months 1 week ago #43318

  • jtravert
  • Junior Boarder
  • Posts: 36
  • Thank you received: 23
Hello Youenn,

I guess you tried to add an attachment, but it did not work. The forum only supports a few file extensions (.cas, .cli, .zip, .png, .txt, etc.), so please try uploading it again.

Are you experiencing the issue on the same mesh? That is, does a given mesh sometimes run with 23 cores and sometimes only with 18, or does this happen on different meshes?

How many nodes do you have in your meshes? If you have too few points, the parallelisation might be tricky.

Best regards,
Jean-Paul
The following user(s) said Thank You: Youenn

Bug with the maximum number of cores allowed. 7 months 1 week ago #43321

  • Youenn
  • Fresh Boarder
  • Posts: 15
Hi,
Indeed, it didn't work. Here are two files describing the bugs encountered.
The first one is for a 23-core simulation and comes from the terminal, because the .sortie file was not even written. The second is the .sortie file from a 21-core simulation.

I have been experiencing this since I installed TELEMAC on my Linux machine. At the time, I had to add the --oversubscribe option in order to use all the cores from my threads (I also tried the usethwthread method, same effect).

I have experienced it on multiple meshes: the biggest one had about 400 000 nodes and worked perfectly with 23 cores but not with 22. On the current one, I am trying with about 100 000 nodes and it works with only 19 cores. It is never the same.
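As a rough aid for the earlier question about mesh size, the figures quoted in this post can be converted into an average number of mesh nodes per core. This is only a sketch: it assumes the partitioner splits the mesh evenly, which is an idealisation, and the function name is purely illustrative.

```python
def nodes_per_core(n_nodes: int, n_cores: int) -> int:
    """Average mesh nodes per subdomain, assuming an even split (idealised)."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return n_nodes // n_cores


# (mesh nodes, requested cores) pairs taken from the figures quoted in this post
cases = [(400_000, 23), (400_000, 22), (100_000, 19)]

for n_nodes, n_cores in cases:
    print(f"{n_nodes} nodes on {n_cores} cores: "
          f"about {nodes_per_core(n_nodes, n_cores)} nodes per core")
```

Even the smaller mesh leaves several thousand nodes per core under this assumption, so a lack of points per subdomain does not look like the obvious explanation here.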
Attachments:

Bug with the maximum number of cores allowed. 7 months 1 week ago #43332

  • pham
  • Administrator
  • Posts: 1460
  • Thank you received: 562
Hello Youenn,

Have you tried to run an example from the TELEMAC database with the same number of cores that failed with your own case (e.g. examples/telemac2d/malpasset/t2d_malpasset-fine.cas)? Does it work or not?

If not, your issue is not a TELEMAC-2D issue but a Linux issue, and it should have been posted in the dedicated topic (I am not sure Windows users can help you).

Anyway, have you tried adding --use-hwthread-cpus to the mpi_cmdexec command in your systel.cfg file (is this what you mean by "usethwthread method")?
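For illustration, such a line in systel.cfg might look like the fragment below. It is a sketch, not a verified configuration: it assumes Open MPI's mpirun, and the <ncsize> and <exename> placeholders are assumed to be substituted by the TELEMAC launcher scripts, so check them against your own systel.cfg before copying anything.

```
# systel.cfg (fragment, inside your own configuration section)
# Assumption: Open MPI; counts hardware threads as available slots
mpi_cmdexec: mpirun --use-hwthread-cpus -np <ncsize> <exename>
```

With Open MPI, --use-hwthread-cpus makes each hardware thread count as a slot, whereas --oversubscribe allows more processes than slots to be started.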

Chi-Tuan
The following user(s) said Thank You: Youenn

Bug with the maximum number of cores allowed. 7 months 1 week ago #43334

  • Youenn
  • Fresh Boarder
  • Posts: 15
Hi pham,

Yes, and that's why I don't understand this issue: sometimes all of my cores are available, sometimes not. In fact, the number of cores available depends directly on the mesh. If it is 17, for example, none of my simulations on that mesh will be able to exceed 17 cores, no matter what I do.
And indeed, mpi_cmdexec is currently mpirun --oversubscribe -np x, but I also tried your command and the results were the same.

OK, I will move this topic to the Linux section; I wasn't sure whether it was caused by my setup.

Bug with the maximum number of cores allowed. 7 months 1 week ago #43344

  • pham
  • Administrator
  • Posts: 1460
  • Thank you received: 562
Hello Youenn,

To test your Linux installation, can you try to run the enclosed steering file in the examples/telemac2d/break folder with various numbers of cores, e.g. 16, 32 (it is a very short computation, just to run a few time steps and possibly reproduce your issue)?

I have tried it on my laptop (16 cores, up to 16 cores) and also on a cluster (up to 1,152 cores), and it runs without any error message like yours.
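The test described above could be scripted roughly as follows, so that every core count is tried in one go and the failing ones are easy to spot. This is a minimal sketch: it assumes telemac2d.py is on the PATH and that the script is started from the examples/telemac2d/break folder; the helper names build_commands and run_sweep are hypothetical.

```python
import subprocess


def build_commands(cas_file, core_counts, launcher="telemac2d.py"):
    """Return one argument list per requested core count."""
    return [[launcher, cas_file, f"--ncsize={n}"] for n in core_counts]


def run_sweep(cas_file, core_counts):
    """Run the case once per core count and collect the exit codes."""
    results = {}
    for n, cmd in zip(core_counts, build_commands(cas_file, core_counts)):
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results[n] = completed.returncode  # 0 normally means the run finished
    return results
```

For example, run_sweep("t2d_break_test.cas", [16, 18, 19, 21, 22, 23]) returns a dictionary mapping each core count to the exit code of the launcher, which can then be compared with the behaviour seen on your own meshes.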

How many available cores do you have?
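One way to answer this is to compare physical cores with logical CPUs (hardware threads), since Open MPI counts physical cores as slots by default. The sketch below parses the text printed by lscpu for the physical count and asks the operating system for the logical count. The sample text is hypothetical and only shows the expected layout; it is not taken from the machine discussed in this thread.

```python
import os
import re


def physical_cores(lscpu_text: str) -> int:
    """Return sockets x cores-per-socket parsed from `lscpu` output."""
    def field(name: str) -> int:
        pattern = rf"^[ \t]*{re.escape(name)}:[ \t]*(\d+)"
        match = re.search(pattern, lscpu_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"field not found in lscpu output: {name}")
        return int(match.group(1))

    return field("Socket(s)") * field("Core(s) per socket")


def logical_cpus() -> int:
    """Logical CPUs usable by this process (hardware threads included)."""
    try:
        return len(os.sched_getaffinity(0))  # available on Linux
    except AttributeError:
        return os.cpu_count() or 1  # fallback on other platforms


# Hypothetical lscpu excerpt, for illustration only
SAMPLE_LSCPU = """\
CPU(s):              24
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           1
"""

print("physical cores in the sample:", physical_cores(SAMPLE_LSCPU))
print("logical CPUs on this machine:", logical_cpus())
```

If the logical count is twice the physical count, requests above the physical count need either --use-hwthread-cpus or --oversubscribe with Open MPI.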

Chi-Tuan

File Attachment:

File Name: t2d_break_test.cas
File Size: 1 KB
Moderators: pham
