failsafe operation/state save-restore
Posted: Tue 24 Jul 2018, 09:54
Since the computation takes a great deal of time and disk space, I often end up restarting week-long simulations over and over again.
There are two simple scenarios:
1) A system error or power outage (for example, someone decides to "reduce" power usage before going home).
2) The HDD is full, and an upgrade or migration is needed.
Request:
It would be great to have some mechanism for restarting a partially saved simulation from an intermediate step.
Any approach will do:
I picture it as saving the whole memory state every 1-6 hours. Alternatively, the simulation state could be restored from the (same) scripted model plus data interpolated from the field dump. Instead of interpolating, saving a complete field dump every hour would also do the job.
So, every time we run the script, openEMS could check for this state data and continue from the saved step. It need not be the very last step; falling back one hour is acceptable. (Of course, field dumps written between the backup point and the "crash" event would have to be deleted, according to the configuration in the same script.)
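To make the idea concrete, here is a minimal sketch of the check-and-resume logic described above. It is not openEMS code and does not use any openEMS API: the "simulation" is a toy counter, and the names `state.ckpt`, `dump_<step>.bin`, `run`, `save_checkpoint`, `load_checkpoint` and `prune_dumps` are all made up for illustration. The checkpoint interval is counted in steps here; a wall-clock interval (every hour, say) would work the same way.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump({"step": step, "state": state}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem


def load_checkpoint(path):
    """Return (step, state), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as fh:
        data = pickle.load(fh)
    return data["step"], data["state"]


def prune_dumps(dump_dir, last_step):
    """Delete dump files written after the checkpointed step."""
    removed = []
    for name in sorted(os.listdir(dump_dir)):
        # Hypothetical naming scheme: dump_<step>.bin
        if name.startswith("dump_") and name.endswith(".bin"):
            step = int(name[len("dump_"):-len(".bin")])
            if step > last_step:
                os.remove(os.path.join(dump_dir, name))
                removed.append(name)
    return removed


def run(total_steps, checkpoint_every, work_dir, crash_at=None):
    """Toy time-stepping loop that resumes from the last checkpoint if present."""
    ckpt = os.path.join(work_dir, "state.ckpt")
    step, state = load_checkpoint(ckpt)
    if state is None:
        state = 0.0  # fresh start
    prune_dumps(work_dir, step)  # drop dumps newer than the checkpoint
    while step < total_steps:
        step += 1
        state += step  # stand-in for one real timestep
        with open(os.path.join(work_dir, f"dump_{step:06d}.bin"), "wb") as fh:
            fh.write(repr(state).encode())
        if step % checkpoint_every == 0:
            save_checkpoint(ckpt, step, state)
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated crash")
    return state
```

Running `run(10, 4, some_dir, crash_at=7)` and then `run(10, 4, some_dir)` resumes from step 4, discards the dumps for steps 5 to 7, and finishes with the same result as an uninterrupted run. Writing to a temporary file and renaming it means a power cut during the save leaves the previous checkpoint intact.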
Also consider a third scenario for this feature:
3) I am sick of the slow PC in my lab, so I take the "work" home on a removable SSD... and then bring it back to work in the morning.
In short, it is a migration of the simulation state between machines. I see it as updating (overwriting) the state and dump files every time I bring the updated simulation back.
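The migration step could look roughly like the sketch below: copy the state and dump files onto the removable drive, overwriting older copies, and record a checksum per file so the receiving machine can confirm nothing was damaged in transit. Again, this is only an illustration under my own assumptions; `export_state`, `verify_state` and the `manifest.json` file name are hypothetical and not part of openEMS.

```python
import hashlib
import json
import os
import shutil

MANIFEST = "manifest.json"  # hypothetical file name


def sha256_of(path, chunk=1 << 20):
    """Hash a file in chunks so large field dumps need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def export_state(src_dir, dst_dir):
    """Copy every file in src_dir to dst_dir (overwriting) and write a manifest."""
    os.makedirs(dst_dir, exist_ok=True)
    manifest = {}
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src) and name != MANIFEST:
            shutil.copy2(src, os.path.join(dst_dir, name))
            manifest[name] = sha256_of(src)
    with open(os.path.join(dst_dir, MANIFEST), "w") as fh:
        json.dump(manifest, fh, indent=2)
    return manifest


def verify_state(directory):
    """Return the names of files that are missing or whose hash differs."""
    with open(os.path.join(directory, MANIFEST)) as fh:
        manifest = json.load(fh)
    bad = []
    for name, digest in manifest.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or sha256_of(path) != digest:
            bad.append(name)
    return bad
```

On the receiving machine, an empty list from `verify_state` means the copy is intact and the simulation can resume from it. Note that this sketch only overwrites files with the same name; files left over on the destination from an earlier run are not removed.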
Thanks for reading.
I feel that many people who examine signal propagation in time without HPC cluster resources would find such a feature very useful.
Technically, it is already possible to save the state of a virtual machine, but any virtual machine means enormous overhead.