failsafe operation/state save-restore

Post by Hale_812 » Tue 24 Jul 2018, 09:54

Since the computation takes a great amount of time and disk space, I often run into the problem of having to restart week-long simulations over and over again.

There are two simple scenarios:
1) A system error or power outage (for example, someone decided to "reduce" power usage before going home).
2) The HDD is full, and an upgrade or migration is needed.

It would be great to have some mechanism for restarting a partially saved simulation from an intermediate step.
Any approach will do:

I picture it as saving the whole memory state every 1-6 hours, or as restoring the simulation state from the (same) scripted model plus data interpolated from the field dumps. Instead of interpolating, saving a complete field dump every hour would also do the job.

So, every time we run the script, openEMS could check for this state data and continue from the saved step. It need not be the very last step; falling back one hour is acceptable. (Of course, field dumps written between the backup point and the "crash" event would have to be deleted, according to the configuration in the same script.)
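A minimal sketch of the check-and-resume logic described above, in Python. This is not openEMS code and openEMS offers no such feature; every name here (`save_checkpoint`, `load_checkpoint`, `prune_dumps`, the `dump_<step>.txt` naming scheme, the `crash_at` test hook) is hypothetical, and the "state" is a single number standing in for the field arrays.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a crash mid-write keeps the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # replaces the old checkpoint in one step

def load_checkpoint(path):
    """Return (step, state), or (0, None) if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]

def prune_dumps(dump_dir, last_step):
    """Delete dumps written after the checkpoint; the rerun regenerates them."""
    removed = []
    for name in sorted(os.listdir(dump_dir)):
        # Hypothetical naming scheme: dump_<step>.txt
        if name.startswith("dump_") and name.endswith(".txt"):
            step = int(name[len("dump_"):-len(".txt")])
            if step > last_step:
                os.remove(os.path.join(dump_dir, name))
                removed.append(name)
    return removed

def run(workdir, total_steps, checkpoint_every, dump_every, crash_at=None):
    """Toy time loop that resumes from the last checkpoint if one exists."""
    os.makedirs(workdir, exist_ok=True)
    ckpt = os.path.join(workdir, "checkpoint.json")
    step, state = load_checkpoint(ckpt)
    if state is None:
        state = 0
    prune_dumps(workdir, step)
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated power loss")
        step += 1
        state += step  # stand-in for one field update
        if step % dump_every == 0:
            with open(os.path.join(workdir, f"dump_{step:06d}.txt"), "w") as f:
                f.write(str(state))
        if step % checkpoint_every == 0:
            save_checkpoint(ckpt, step, state)
    return state
```

Calling `run()` a second time on the same directory after an interruption picks up at the last checkpoint rather than at step 0. Writing to a temporary file and renaming it means a failure during the save leaves the previous checkpoint intact, which also suits the "carry it home on an SSD" case.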

Consider one more scenario for the feature:
3) I am sick of the slow PC in my lab, so I take the "work" home on a removable SSD... and bring it back to work in the morning.
In short, it is a migration of the simulation state between machines. I see it as updating (overwriting) the state and dump files every time I bring the updated simulation back.

Thanks for reading.
I feel that many people who examine signal propagation in the time domain without HPC cluster resources would find such a feature very useful.

Technically, saving a virtual machine state is already possible at the moment, but any virtual machine means enormous overhead.

Re: failsafe operation/state save-restore

Post by thorsten » Sat 18 Aug 2018, 09:54


I can certainly understand your idea. But I don't think it will happen.
Why? Well, it would be really hard to implement and at the same time rarely used, which means that in the rare event you need it, it would probably fail due to an error.
Additionally, I would rather spend more time (which is close to zero at the moment) on improving the FDTD speed. I know very well that another order of magnitude would still be possible... but very difficult too...

May I ask why the hell you run simulations that take weeks? Are you sure you have optimized your structure and mesh? In my opinion a simulation should never take longer than a day.
Maybe you should even consider a different tool that is faster?
I can't imagine any kind of useful research to be done with simulations running for weeks...


Re: failsafe operation/state save-restore

Post by Hale_812 » Mon 20 Aug 2018, 02:27

I am modelling pulse scattering in open space with obstacles (1-2 meters in every direction). Some guys asked us for a demonstration of why we do things one way and not the other way they suggested.
As a time-domain method, openEMS gives great, intelligible images. But it takes time. Rendering in ParaView is also very slow, and it does not even use the CPU fully...
At other times, when I model antennas and lenses with oblique and curved surfaces, it also takes time because of excessive meshing.
When modelling anti-reflective (deflector) surfaces, the metal sheets also take double or triple the area compared to the object in focus. So sometimes, if the PC stops, I would really like to restart from a savepoint, but can't.
For comparison, HFSS and other FEM tools work by converging iterations, and the last iteration is saved for that purpose. I thought, why can't we save a time step once in a while for the same purpose?
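To illustrate the point about saving a time step, here is a toy 1D leapfrog FDTD in plain Python (normalized units, PEC ends). It is only a sketch and has nothing to do with the openEMS engine; `make_state` and `advance` are made-up names. The idea is that the complete solver state is the E and H arrays plus the step counter, so a snapshot of those three is enough to continue.

```python
import copy
import math

def make_state(n):
    """Full solver state: both field arrays plus the step counter."""
    return {"step": 0, "ez": [0.0] * n, "hy": [0.0] * (n - 1)}

def advance(state, n_steps, courant=0.5):
    """Advance the state in place by n_steps leapfrog updates."""
    ez, hy = state["ez"], state["hy"]
    for _ in range(n_steps):
        t = state["step"]
        # H update from the spatial difference of E
        for i in range(len(hy)):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # E update from the spatial difference of H (end cells stay 0: PEC)
        for i in range(1, len(ez) - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Gaussian pulse source; it depends only on the step counter
        ez[len(ez) // 4] += math.exp(-((t - 30) / 10.0) ** 2)
        state["step"] = t + 1
    return state

# Uninterrupted reference run of 300 steps
reference = advance(make_state(200), 300)

# Interrupted run: 120 steps, snapshot, then 180 more from the snapshot
snapshot = copy.deepcopy(advance(make_state(200), 120))
resumed = advance(snapshot, 180)

# Same operations in the same order, so the fields match exactly
print(resumed["ez"] == reference["ez"] and resumed["hy"] == reference["hy"])
```

In this toy case the resumed run reproduces the uninterrupted one exactly. A real solver would presumably have to save more than the two field arrays: anything that carries memory between steps, such as absorbing-boundary auxiliary variables, dispersive-material terms, or running sums for frequency-domain dumps and probes. That is likely a large part of why it is harder to implement than it looks.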

>improving the FDTD speed.
I think the best improvement would be implementing a nested mesh, as you did in cylindrical coordinates.
It would cut the time spent on open space and improve accuracy in the vicinity of oblique objects.
