OpenEMS multi-processor performance


vinu
Posts: 1
Joined: Tue 14 May 2013, 00:00

OpenEMS multi-processor performance

Post by vinu » Tue 14 May 2013, 00:49

Hi,

I am running the hyp2mat example.
By default the performance seems to vary from 4-11 MC/s.
With --numThreads set on the openEMS command line, I get:
  • --numThreads=2: ~30 MC/s
  • --numThreads=4: ~30 MC/s
  • --numThreads=8: ~10-20 MC/s

My machine has two E5649 processors with HT enabled.
Is this expected?

Thanks,
Vinu

thorsten
Posts: 1393
Joined: Mon 27 Jun 2011, 12:26

Re: OpenEMS multi-processor performance

Post by thorsten » Tue 14 May 2013, 09:08

Hi Vinu,

yes this is something I have seen.

FDTD is mostly limited not by the speed of your CPU but by the speed/bandwidth of your memory. That means fewer threads fighting for the limited memory bandwidth can sometimes be better...
But this also depends on the simulation itself (its size and which extensions are active, e.g. large PMLs).
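
As a rough illustration of the bandwidth argument above, here is a minimal Python sketch. It is not how openEMS measures or predicts speed: the per-thread rate, the memory bandwidth and the bytes moved per cell update are all made-up placeholder numbers. The model simply takes the smaller of a compute limit and a memory limit, so it shows the plateau but not the slowdown that too many threads can cause.

```python
def estimated_speed_mcs(threads, per_thread_mcs, bandwidth_gbs, bytes_per_cell):
    """Estimate speed in MC/s as the smaller of the compute and memory limits."""
    # Compute limit: assumes perfect scaling with the number of threads.
    compute_limit = threads * per_thread_mcs
    # Memory limit: GB/s -> bytes/s, divided by bytes per cell update, then cells/s -> MC/s.
    memory_limit = bandwidth_gbs * 1e9 / bytes_per_cell / 1e6
    return min(compute_limit, memory_limit)


# Placeholder numbers, chosen only to show the shape of the curve.
for n in (1, 2, 4, 8):
    speed = estimated_speed_mcs(n, per_thread_mcs=40.0, bandwidth_gbs=10.0, bytes_per_cell=100.0)
    print(n, "threads:", speed, "MC/s")
```

With these placeholder values the estimate stops growing once the memory limit is reached, which is the qualitative behaviour described above.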

But 30 MC/s seems a bit slow anyhow... Maybe your memory is somewhat slow? Or the simulation you are running is not ideal (e.g. a large PML compared to the simulation domain).
I have a Core i7 920 running 150 MC/s on a cavity example (no PMLs etc.).

I was thinking about doing some automatic testing and choosing how many threads give the best speed for any simulation, but I haven't got around to it...

In your case you should run on 2 threads for this simulation...
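
The idea of testing several thread counts and keeping the fastest can be sketched in a few lines of Python. This is only an outline under assumptions, not an existing openEMS feature: `measure` is a hypothetical callable that would run the simulation with a given thread count (for example via --numThreads) and return the speed in MC/s. Here it is replaced by the rough numbers from the first post, with the 8-thread value taken as the midpoint of the reported 10-20 MC/s.

```python
def sweep_threads(measure, thread_counts):
    """Call measure(n) for each thread count and return {n: speed in MC/s}."""
    return {n: measure(n) for n in thread_counts}


def pick_best(speeds):
    """Return the thread count with the highest speed; ties go to fewer threads."""
    return min(speeds, key=lambda n: (-speeds[n], n))


# Stand-in for real measurements (numbers taken from the first post, see above).
reported = {2: 30.0, 4: 30.0, 8: 15.0}
speeds = sweep_threads(reported.get, [2, 4, 8])
print("best thread count:", pick_best(speeds))
```

Breaking ties in favour of fewer threads is a design choice: if 2 and 4 threads are equally fast, the smaller count leaves the other cores free.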

regards
Thorsten

thorsten
Posts: 1393
Joined: Mon 27 Jun 2011, 12:26

Re: OpenEMS multi-processor performance

Post by thorsten » Tue 14 May 2013, 14:46

Hi again,

I have resurrected my benchmark script. It simulates a simple cavity with a homogeneous mesh and an increasing size and number of cells.
[Figure: Cavity_Speed.png. openEMS speed on a Core i7 920 with 12GB RAM]
Discussion:
  • 1 thread is always slow, the memory bandwidth is not at its limit (CPU bound)
  • small simulations mostly fit into the CPU Cache, 4 threads (physical number without HT) is fastest
  • larger simulations: 2 threads are enough to hit memory bandwidth limit and remain the fastest
I think I'm going to clean this script up a little more and post it here later and include it in the next release...
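
To get a feeling for when a simulation stops fitting into the CPU cache, a back-of-the-envelope check like the following can help. It is a sketch under assumptions: the 100 bytes per cell is a hypothetical placeholder (the real footprint depends on the engine and the active extensions), and the cache size is set to 8 MiB, the last-level cache of a Core i7 920.

```python
def working_set_bytes(cells, bytes_per_cell):
    """Rough memory footprint of the field data for a given number of cells."""
    return cells * bytes_per_cell


def fits_in_cache(cells, bytes_per_cell, cache_bytes):
    """True if the estimated working set is no larger than the cache."""
    return working_set_bytes(cells, bytes_per_cell) <= cache_bytes


CACHE_BYTES = 8 * 1024**2   # 8 MiB last-level cache (Core i7 920)
BYTES_PER_CELL = 100        # hypothetical placeholder, not an openEMS figure

# Cubic cavity with n cells per edge, increasing in size.
for n in (20, 40, 80, 160):
    cells = n**3
    print(n, "cells per edge:", cells, "cells, fits in cache:",
          fits_in_cache(cells, BYTES_PER_CELL, CACHE_BYTES))
```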

regards
Thorsten

montanaviking
Posts: 90
Joined: Mon 30 Sep 2013, 22:27

Re: OpenEMS multi-processor performance

Post by montanaviking » Tue 22 Apr 2014, 08:05

Hi Thorsten and others,
I'm thinking that a two cpu (say two E5-2630v2 Xeons) machine's performance could vary depending on how the job is split between the 12 cores. I have such a machine on order. Assuming we're running OpenEMS multithreaded, I'm wondering if it would be better to:
1. Turn off hyperthreading?
2. Distribute the threads somewhat evenly between the two CPUs (each CPU being one of the 6-core Xeons - not one of the cores)?

From the results posted here, which of course agree with what I've seen in the literature about FDTD, it appears that an excess number of threads can degrade performance, depending on the memory speed and the problem. Could the effective memory bandwidth per thread be improved by distributing the threads between the two CPUs and turning off hyperthreading?
From what I've seen, the motherboard I plan on getting supports a separate memory bus for each Xeon. Also, I get the feeling that two threads running on a single core via hyperthreading would have less memory bandwidth than those two threads running on separate cores or, better, on a core in each of the two Xeons. This assumes that the Xeons are lightly loaded.

If the above are true, is there a way to configure the thread allocation in OpenEMS to obtain a more desirable CPU loading?
Thanks,
Phil

thorsten
Posts: 1393
Joined: Mon 27 Jun 2011, 12:26

Re: OpenEMS multi-processor performance

Post by thorsten » Tue 22 Apr 2014, 17:23

Hi,

I never had a two CPU machine available, thus I can only assume that it should be better to split it. But maybe not? After all in each iteration they need to share data as well...
I think disabling hyper-threading should help more than it does harm, but again, I didn't test that much...

There is no (built in) way to allocate how the cores are used... But maybe you can tell the OS how to do it??
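
One way to let the OS do it, at least on Linux, is to restrict the process to chosen logical CPUs before it starts. The Python sketch below is only an illustration under assumptions: the topology is passed in as hypothetical (cpu_id, socket_id, core_id) tuples, `os.sched_setaffinity` is Linux-only, and whether pinning actually speeds up openEMS is untested here. It also does not control where memory is allocated.

```python
import os
import subprocess


def choose_cpus(topology, n_threads):
    """Pick up to n_threads logical CPUs, one per physical core, alternating sockets.

    topology: list of (cpu_id, socket_id, core_id) tuples.
    """
    # Keep only the lowest-numbered logical CPU of each physical core (skips HT siblings).
    first_per_core = {}
    for cpu, socket, core in sorted(topology):
        first_per_core.setdefault((socket, core), cpu)
    # Group the remaining CPUs by socket.
    by_socket = {}
    for (socket, _core), cpu in sorted(first_per_core.items()):
        by_socket.setdefault(socket, []).append(cpu)
    # Take CPUs from the sockets in round-robin order.
    queues = [by_socket[s] for s in sorted(by_socket)]
    chosen = []
    i = 0
    while len(chosen) < n_threads and any(queues):
        queue = queues[i % len(queues)]
        if queue:
            chosen.append(queue.pop(0))
        i += 1
    return chosen


def run_pinned(command, cpus):
    """Run command restricted to the given logical CPUs (Linux only)."""
    def pin():
        # Runs in the child process just before the program starts.
        os.sched_setaffinity(0, cpus)
    return subprocess.run(command, preexec_fn=pin)
```

A call could then look like `run_pinned(["openEMS", "sim.xml", "--numThreads=2"], choose_cpus(topology, 2))`, where `sim.xml` is a placeholder file name.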

regards
Thorsten

Hale_812
Posts: 171
Joined: Fri 13 May 2016, 02:54

Re: OpenEMS multi-processor performance

Post by Hale_812 » Mon 16 May 2016, 04:08

1) HT is undesirable in most streamed math computations. It is a purely logical CPU optimization for multiple-access, non-uniform code. For anything FPU-based to run well, HT should be disabled unless the code already avoids it.
2) Every MP system has problems with inter-CPU communication delays, especially when accessing memory attached to the remote CPU, and especially when the data cannot fit into the processor cache (that's why cache is critical in MP workstations).
Practically, when using HFSS FEM in a single-domain/single-task MP configuration, performance on a single CPU grows rapidly up to 6 cores and very slowly when approaching 8 cores.
But in a DP system the performance almost stops growing at 10 threads (8+2, 6+4 etc.) and visibly degrades in a 16-core configuration. So a high-clock DP 4x2 (6x2) configuration is preferred over a low-clock many-core (8-16 x2...) configuration.
Here we must take into account that HFSS FEM solves giant matrices filling hundreds of gigabytes. That's why we use DP and QP workstations there.

FDTD is a memory-efficient and well-parallelized method, so I expect these 16x2 core DP systems to be quite promising. But I would like to see real reports, of course.

And, please, use Intel XTU, or better TMonitor_x64 in order to monitor real CPU clocks.

It is possible that, due to bad thermal design or a conservative BIOS configuration (generally predefined by the motherboard developer), the CPU clocks can drop fatally when all 16 (2x8) cores operate under high math load.
You can see this only with specially designed monitoring software.

Hale_812
Posts: 171
Joined: Fri 13 May 2016, 02:54

Re: OpenEMS multi-processor performance

Post by Hale_812 » Mon 16 May 2016, 04:20

P.S.

If you need HT for responsiveness with parallel tasks, on a SINGLE-CPU (only) system you can apply the AMD scheduling patch for Windows 7 so that the scheduler assigns cores in 1-3-2-4 order, like on AMD Bulldozer, instead of the normal 1-2-3-4, where 2 and 4 are HT clones of 1 and 3. (!) This DOES NOT work in an MP (DP, QP) configuration: you can damage your OS fatally without a way to fall back.

Hale_812
Posts: 171
Joined: Fri 13 May 2016, 02:54

Re: OpenEMS multi-processor performance

Post by Hale_812 » Mon 16 May 2016, 10:22

thorsten wrote: "There is no (built in) way to allocate how the cores are used..."
Theoretically, WMI should tell how many processors (nodes) there are, how many cores and how many logical units they have, and which belong to which. I just don't know how to do that, but I've seen tools that show the mapping.
Of course you cannot tell which HT unit is virtual, because both of them are equivalent. And you cannot tell which is loaded, because the counter becomes unbalanced when counting HT units...
But I believe it is possible to lock onto specific units when initializing a thread. Maybe. It should be, because the kernel does this.
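
For comparison, on Linux the same mapping can be read from sysfs rather than WMI. The Python sketch below swaps in that Linux mechanism and is not a Windows solution. It reads the standard files `physical_package_id`, `core_id` and `thread_siblings_list` under `/sys/devices/system/cpu/cpuN/topology/`. The function names are my own, and CPUs without a topology directory (e.g. offline ones) are simply skipped.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,12' or '0-3,8' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_topology(root="/sys/devices/system/cpu"):
    """Return sorted (cpu, socket, core, siblings) tuples for each logical CPU."""
    entries = []
    for path in glob.glob(os.path.join(root, "cpu[0-9]*")):
        topo = os.path.join(path, "topology")
        if not os.path.isdir(topo):
            continue
        cpu = int(re.search(r"cpu(\d+)$", path).group(1))
        with open(os.path.join(topo, "physical_package_id")) as f:
            socket = int(f.read())
        with open(os.path.join(topo, "core_id")) as f:
            core = int(f.read())
        with open(os.path.join(topo, "thread_siblings_list")) as f:
            siblings = parse_cpu_list(f.read())
        entries.append((cpu, socket, core, siblings))
    return sorted(entries)
```

Logical CPUs that appear in each other's siblings list share one physical core, so this shows which units are HT pairs, even though neither of the two is the "real" one.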

Garias
Posts: 14
Joined: Mon 13 Feb 2017, 19:11

Re: OpenEMS multi-processor performance

Post by Garias » Sat 27 May 2017, 21:24

Hello Hale_812

I'm following your thread because, although I'm not computer literate, I've managed (by means of some researching over the weekends) to put together a quad-socket AMD Opteron 6274 system with quad-channel memory (aiming to increase BW) and 16 GB per socket (the memory is KVR16R11S8/4 4GB 1Rx8 512M x 72-Bit PC3-12800 CL11 Registered w/Parity 240-Pin DIMM). That is a total of 64 cores and a balanced 64 GB of RAM, equally distributed and covering all channels to get optimum/maximum BW performance. This project is aimed at seeing what I can get from this configuration (I have a quite complex subwavelength structure to deal with).

Once I run OpenEMS, during the first day or so, all 64 cores go to 100% (which makes me quite happy that they are taking the computational load). Using the Windows Server 2008 Performance and Resource Monitor I don't see cache errors (cache misses, I guess...), so I assume the workspace somehow fits into the caches or is fed from RAM in such a way that the caches never run out of pre-fetched data to feed the cores with.

Then, suddenly (on day 2), the core load drops drastically to very low values (it seems to be a different stage in the FDTD, or in the openEMS code?) and I see alternating use of the cores. They are not solidly loaded as on the first day of computation but rather used one at a time (all 64 participate), taking turns to work (while not parked).

What do you interpret from the behavior of these two stages? What is being done in each of them? It seems that in the second stage (after the fully intensive day 1) the cores start struggling with memory BW?



Kindest Regards

German

thorsten
Posts: 1393
Joined: Mon 27 Jun 2011, 12:26

Re: OpenEMS multi-processor performance

Post by thorsten » Sun 28 May 2017, 08:42

Hi,

there are only different stages during pre-processing (simulation setup); once the main engine runs it should not change.
During setup there are stages that run on all cores (or try to) and stages where only 1 thread is possible (e.g. because it needs the overview)...
Are you telling me your FDTD setup takes some days?? That would be insane! ;)

The main engine tries to run on all cores. But I really never tried on a machine with more than 4 cores.
Garias wrote: "I've managed (by means of some researching over the weekends) to put together a Quad-Socket AMD Opteron 6274"
If you have this kind of resources (money), it might make sense to think about a commercial FDTD solver (e.g. Empire)?
On the machine you describe they would probably run at a couple of GC/s (1 GC/s = 1000 MC/s).
I just do not have the time and resources for engine optimization as these tools do...
Furthermore my goal with openEMS was never the fastest speed possible, but an engine that allows an easy way to extend with new experimental features.
But this open and flexible approach does not go well with being as fast as possible, as you can imagine...
If raw speed is what you care about you really should consider a commercial solver?

Let me know what you think...

regards
Thorsten
