CUDA GPU Acceleration

Discussion about new features and development support

Moderator: thorsten

qfp64
Posts: 1
Joined: Sun 28 Aug 2022, 12:27

CUDA GPU Acceleration

Post by qfp64 » Sun 28 Aug 2022, 12:44

There's been discussion on the forum before about adding GPU acceleration to openEMS (http://openems.de/forum/viewtopic.php?t=185), but as far as I know, nobody has tried doing it yet. The main objection was that FDTD is limited by memory bandwidth, not processing power.

On my PC with an overclocked Ryzen 9 5950X and dual-channel 3600 MHz DDR4 RAM, a memory benchmark gave me between 15 and 20 GB/s. The same system has an RTX 3070 GPU with a quoted 448 GB/s of memory bandwidth, over 20x (!) that of the CPU, not counting caches, which both processors have.

The fastest I could get openEMS to run on this CPU was with 8 threads (out of 32), maxing out at only 16% CPU usage, so it is clearly memory-bandwidth-bound.

I have a little bit of experience messing with CUDA, so I spent a day working on a new CUDA-based engine: https://github.com/aWZHY0yQH81uOYvH/openEMS-CUDA

It uses "managed" memory, which makes interfacing with the rest of the non-GPU code a lot easier (the GPU driver copies data back and forth over PCIe seamlessly, based on accesses through the same pointers). So far I've only ported the field propagation routines (no extensions), and GPU utilization is very inefficient, with a huge amount of overhead from kernel launches and other things. The code on that repo is literally the first thing that worked.
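
For anyone who hasn't used managed memory before, the pattern looks roughly like this (a minimal sketch with made-up names, not the actual engine code):

Code: Select all

    // Sketch of CUDA managed memory: cudaMallocManaged returns a pointer
    // that is valid on both host and device; the driver migrates pages
    // over PCIe on demand, so no explicit cudaMemcpy is needed.
    #include <cuda_runtime.h>

    __global__ void UpdateVoltages(float* volt, const float* curr, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            volt[i] += curr[i]; // placeholder for the real FDTD update
    }

    int main()
    {
        const int n = 132000; // grid cells
        float *volt, *curr;
        cudaMallocManaged(&volt, n * sizeof(float));
        cudaMallocManaged(&curr, n * sizeof(float));

        // Host code (e.g. the non-GPU parts) can touch the arrays directly...
        for (int i = 0; i < n; ++i) { volt[i] = 0.0f; curr[i] = 0.0f; }

        // ...and the very same pointers go straight into the kernel.
        UpdateVoltages<<<(n + 255) / 256, 256>>>(volt, curr, n);
        cudaDeviceSynchronize();

        cudaFree(volt);
        cudaFree(curr);
        return 0;
    }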

To compare to the CPU, I commented out everything but the main field updates in the multithreaded CPU engine so both engines would do the same work. With my test file with 132k grid cells, the CPU was running at ~2300MC/s and the GPU was running at ~1400MC/s. Not great, but I also observed the following:
  • One CPU core and all the GPU cores were pinned at 100%, presumably because it was launching two kernels per iteration with one thread per grid cell, creating and destroying billions of threads per second.
  • NVIDIA Nsight showed reads occupying 43% of the GPU memory bus and writes occupying 12%. I assume the reads are so much higher because of huge numbers of cache misses, due to the threads being spread across thousands of GPU cores.
It should be possible to improve the cache hit rate and the kernel launch overhead by running multiple grid cells per thread, keeping each thread's data in the same core's cache. By synchronizing all the running threads between voltage and current updates on the GPU (I believe there are facilities to do this), it may be possible to run large numbers of time steps with absolutely no CPU intervention (only one kernel launch at the start).

In another project, these kinds of optimizations gave a 5x performance improvement relative to the naïve implementation, possibly even more if the data can remain in the CUDA core caches. There is certainly more memory bandwidth to be utilized. I think this is promising, but please tell me if I'm overlooking something.
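
The facility I have in mind for the single-launch idea is CUDA's cooperative groups, used in a "persistent" kernel. Roughly like this (a sketch only, with placeholder update math and made-up names; it assumes the GPU supports cooperative launches, i.e. compute capability 6.0+):

Code: Select all

    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    // Persistent kernel: one launch runs many timesteps with no CPU round-trips.
    __global__ void RunTimesteps(float* volt, float* curr, int n, int steps)
    {
        cg::grid_group grid = cg::this_grid();
        for (int t = 0; t < steps; ++t)
        {
            // Grid-stride loop: each thread handles several cells, which also
            // keeps a given thread's cells in the same SM's cache across steps.
            for (int i = grid.thread_rank(); i < n; i += grid.size())
                volt[i] += curr[i];           // placeholder voltage update
            grid.sync(); // all voltages done before any current update
            for (int i = grid.thread_rank(); i < n; i += grid.size())
                curr[i] += volt[i];           // placeholder current update
            grid.sync(); // all currents done before the next timestep
        }
    }

    // Host side: grid.sync() only works with the cooperative launch API, and
    // every block must be resident at once, so size the grid by occupancy.
    void Run(float* volt, float* curr, int n, int steps)
    {
        const int threads = 256;
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, RunTimesteps,
                                                      threads, 0);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int blocks = blocksPerSM * prop.multiProcessorCount;

        void* args[] = { &volt, &curr, &n, &steps };
        cudaLaunchCooperativeKernel((void*)RunTimesteps, dim3(blocks),
                                    dim3(threads), args, 0, 0);
        cudaDeviceSynchronize();
    }

Anything that needs the fields on the CPU (dumps, excitation extensions) would have to break the loop into chunks, but for pure propagation this should remove nearly all launch overhead.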

I'm going to be resuming my undergrad studies soon, so I won't have a lot of time to continue work on this at the moment. I wanted to share my findings so far so others can play with it, or tell me I'm wasting my time because I don't know what I'm doing. :)

thorsten
Posts: 1528
Joined: Mon 27 Jun 2011, 12:26

Re: CUDA GPU Acceleration

Post by thorsten » Mon 29 Aug 2022, 20:14

The numbers you need to compare are not the RAM speeds but the bandwidth (and size) of the L1-L3 caches of a CPU.
A GPU simply does not have the same amount of central cache as a CPU.
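
Some rough numbers to illustrate (assuming single precision and that each cell update touches its six field components plus the pre-computed operator coefficients, so on the order of 100 bytes per cell per timestep): the ~2300 MC/s you measured on the CPU would correspond to roughly 200 GB/s of raw field traffic, about ten times the ~20 GB/s you measured for DRAM. That only works because most accesses are served from the CPU caches.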

The fastest FDTD on a GPU that I know of (Tesla P100) reaches just under 4.5 GC/s (see here).

But that pales in comparison to the 10+ GC/s possible on a Ryzen 9 3950X or the 50+ GC/s on a dual-Epyc system (see here). In that chart the P100 (the curve from the pdf above) is the slowest of all tested architectures :D

regards
Thorsten

thorsten
Posts: 1528
Joined: Mon 27 Jun 2011, 12:26

Re: CUDA GPU Acceleration

Post by thorsten » Mon 29 Aug 2022, 20:18

But to make it clear: if you are able to speed up openEMS using CUDA, I'm sure many people would be very happy to give it a try...
That said, keep at it. Would be very cool to see...

Coto
Posts: 11
Joined: Mon 28 Feb 2022, 02:26

Re: CUDA GPU Acceleration

Post by Coto » Wed 31 Aug 2022, 19:44

I have a little bit of experience messing with CUDA, so I spent a day working on a new CUDA-based engine
Super exciting efforts, please keep us updated!
In that chart the P100 (the curve from the pdf above) is the slowest of all tested architectures
I had looked at the papers describing EMPIRE's technology (e.g. https://ieeexplore.ieee.org/document/7228423 - more papers under "XPU FDTD" here: https://ieeexplore.ieee.org/search/sear ... xpu%20fdtd - let me know if you need access), and my (uncertain) understanding is they've implemented a clever method of speeding things up on the CPU. However, I'm not sure if Accelware (their GPU reference) has a fantastic GPU implementation compared to market leaders like CST (no idea if that's the case, just a thought that this could be a possibility). Even then though, would it really be easier trying to speed things up using the paper's (complicated?) CPU-based method?

I personally believe it might be better to try working with GPUs, which may naturally (and therefore more easily) provide speedup. It may not achieve world-record speeds, but FDTD acceleration on GPUs appears to be very popular in the literature, and is likely easier than trying to figure out and practically replicate XPU's method. Even if there are faster CPU solutions using tricky techniques, GPUs may still offer a very meaningful speedup compared to the current CPU implementation of openEMS.

A comment by @jmw who had taken a brief look at the openEMS code some time ago:
It looks like the 'Operator' and 'Engine' classes allocate memory as nested arrays. This isn't great for cache performance; a better way is to allocate contiguous memory with row-major addressing. The STL doesn't have this built in, but there are many examples of how to build it in modern C++.

Another possible improvement is moving from SSE2 to AVX2 vector instructions.
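
For anyone curious, the layout @jmw is describing looks roughly like this (a generic sketch, not the actual openEMS classes):

Code: Select all

    #include <cstddef>
    #include <vector>

    // Contiguous row-major 3D array: a single allocation indexed with
    // arithmetic, instead of pointer-chasing through nested arrays.
    template <typename T>
    class Array3D
    {
    public:
        Array3D(std::size_t nx, std::size_t ny, std::size_t nz)
            : m_ny(ny), m_nz(nz), m_data(nx * ny * nz) {}

        T& operator()(std::size_t x, std::size_t y, std::size_t z)
        {
            // neighbouring z indices are adjacent in memory -> cache friendly
            return m_data[(x * m_ny + y) * m_nz + z];
        }

    private:
        std::size_t m_ny, m_nz;
        std::vector<T> m_data;
    };

    // Usage: Array3D<float> volt(nx, ny, nz); volt(i, j, k) = 0.0f;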
I just ran a solve-time comparison test on a 4-million-cell model on my laptop (CPU vs GPU) in CST:

Code: Select all

                        Number of mesh cells:             	3969756
                        Excitation duration:              	2.36969917e-001 ns 
                        Number of calculated pulse widths:	1.49898 
                        Steady state accuracy limit:      	-40	dB 
                        Simulated number of time steps:   	4171 
                        Maximum number of time steps:     	55651 
                        Time step width:
                           without subcycles:             	8.516236745e-005 ns
                           used:                          	8.516236745e-005 ns
CPU (6 threads - i7 10750H):

Code: Select all

                        Adaptive port meshing time:          	77     	s  ( = 0 h, 1 m, 17 s )
                        Total solver time (all cycles):      	307    	s  ( = 0 h, 5 m, 7 s )
                        solver initialization and clean-up:  	2      	s
                                                             	------------
                        Total simulation time:               	386    	s  ( = 0 h, 6 m, 26 s )
GPU (CUDA, on GTX 1650 Ti):

Code: Select all

                        Adaptive port meshing time:          	0      	s
                        Total solver time (all cycles):      	62     	s  ( = 0 h, 1 m, 2 s )
                        solver initialization and clean-up:  	2      	s
                                                             	------------
                        Total simulation time:               	64     	s  ( = 0 h, 1 m, 4 s )
Overall, 6x speedup with GPU over CPU (at least on my laptop - more practical examples here). You could of course argue that perhaps CST has a poor CPU implementation compared to its GPU one, but I don't know if that's necessarily the case.

Just my two cents - really interested to see if a CUDA implementation can speed openEMS up a bit! :D

thorsten
Posts: 1528
Joined: Mon 27 Jun 2011, 12:26

Re: CUDA GPU Acceleration

Post by thorsten » Wed 31 Aug 2022, 20:26

However, I'm not sure if Accelware (their GPU reference) has a fantastic GPU implementation compared to market leaders like CST
It is my understanding that their implementation is used by one of the bigger vendors and is comparable in speed to that of CST. But I do not really know.
Overall, 6x speedup with GPU over CPU
If you have access to CST, why not run openEMS too and compare the CPU speeds? Would be very interesting. It always was my understanding that their CPU speed was not very good, but again I have never used or even seen CST.
I personally believe it might be better to try working with GPUs, which may naturally (and therefore more easily) provide speedup.
I'm not really sure, but I would partially agree. The only downside I see is that my primary goal with openEMS has always been to make it as extendable and maintainable as possible, and I fear that a GPU implementation would not really help on this front. But I may be wrong.

Furthermore, I do not really like Nvidia and its proprietary drivers ;) I'm not even sure the CUDA license would be compatible with openEMS, or that I would feel comfortable distributing software containing CUDA (e.g. for a Windows version).
Ideally I therefore would prefer an OpenCL or maybe Vulkan solution?
A comment by @jmw who had taken a brief look at the openEMS code some time ago:
I'm not sure where he looked, but I'm well aware that memory needs to be aligned, and openEMS has always done it. See here for example...
my (uncertain) understanding is they've implemented a clever method of speeding things up on the CPU.
Well as you all know for many years now I only work in my spare time on openEMS (which is unfortunately very limited).
My day job is working on said Empire and therefore I know for a fact that this XPU technology is a very clever and extremely advanced approach indeed ;)
That said, it is clear that if anybody wants to try this approach, I cannot really help, as that would not be very fair towards my employer, who already generously allows me to (continue to) work on openEMS in my spare time ;)
But I would not really have the time for it anyhow...

In summary, it will be interesting to see where this CUDA approach takes us, but I would prefer a pure open-source solution if possible, and CUDA unfortunately is the complete opposite.

best regards
Thorsten

Coto
Posts: 11
Joined: Mon 28 Feb 2022, 02:26

Re: CUDA GPU Acceleration

Post by Coto » Wed 21 Sep 2022, 08:34

thorsten wrote:
Wed 31 Aug 2022, 20:26
If you have access to CST, why not run openEMS too and compare the CPU speeds? Would be very interesting. It always was my understanding that their CPU speed was not very good, but again I have never used or even seen CST.
CST (344,250 cells):
3420 timesteps simulated (-30 dB decay was reached then)
CPU: 18 s, GPU: 8 s

Limiting openEMS to 3420 timesteps (because the timestep is likely different due to non-ideal mesh etc.) at 325,467 cells seems to run in 23 s.

Not 100% sure if my comparison was accurate (I expected a bigger difference), but perhaps CST is slow to initialize the simulation, so in longer simulations (e.g. 5 min+) the difference may be bigger. All about how the runtimes scale, I guess. ;)

Not as relevant, but a big deal is definitely the mesher: if you accidentally have a tiny cell, your simulation can take forever due to the enforced timestep. That's what I'm working on with my automatic mesher. I've seen huge speedups, just by improving the mesh itself. I'll probably share a preprint of the paper soon (and code - have to finish some more stuff there).
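
(For context, the reason one tiny cell hurts so much: the global FDTD timestep is bounded by the Courant/CFL limit of the smallest cell. A sketch of the standard vacuum form, ignoring the material corrections a real solver also applies:)

Code: Select all

    #include <cmath>

    // Courant/CFL stability limit for a Yee cell in vacuum. The *global*
    // timestep must satisfy this for the smallest cell in the mesh, so a
    // single tiny cell slows down every timestep of the whole simulation.
    double CourantTimestep(double dx, double dy, double dz)
    {
        const double c0 = 299792458.0; // speed of light in vacuum [m/s]
        return 1.0 / (c0 * std::sqrt(1.0/(dx*dx) + 1.0/(dy*dy) + 1.0/(dz*dz)));
    }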

Anyway, here's a sneak peek to remind everyone of openEMS's accuracy :D (this was to test my automatic mesher against CST's mesh + solver):

[image: comparison of results from openEMS (automatic mesher) and CST (mesh + solver)]

Edit: here's the preprint showing some first results: https://arxiv.org/abs/2209.10260
Furthermore, I do not really like Nvidia and its proprietary drivers ;) I'm not even sure the CUDA license would be compatible with openEMS, or that I would feel comfortable distributing software containing CUDA (e.g. for a Windows version).
Ideally I therefore would prefer an OpenCL or maybe Vulkan solution?

[...]

In summary, it will be interesting to see where this CUDA approach takes us, but I would prefer a pure open-source solution if possible, and CUDA unfortunately is the complete opposite.
Regarding the license: legally, I don't think there's a restriction with the GPL - you'd basically just be distributing source code. E.g. a very similar project that implements FDTD with CUDA (https://github.com/gprMax/gprMax) also appears to use GPLv3.

On the topic of practical development, I totally get and respect that. I certainly wouldn't mind if a faster-than-CPU solution comes up with OpenCL/Vulkan - in fact it would be an extra bonus in terms of the open-source idea of openEMS. That said, if a CUDA implementation comes up, I personally wouldn't mind giving it a shot, especially if the speedup looks promising. It doesn't necessarily have to be released/distributed as part of openEMS itself, but even a separate implementation would be nice to have, at least as an option in my opinion. I'll personally be looking into all three options in the following weeks to see if anything interesting can come up. :)
Well as you all know for many years now I only work in my spare time on openEMS (which is unfortunately very limited).
My day job is working on said Empire and therefore I know for a fact that this XPU technology is a very clever and extremely advanced approach indeed ;)
That said, it is clear that if anybody wants to try this approach, I cannot really help, as that would not be very fair towards my employer, who already generously allows me to (continue to) work on openEMS in my spare time ;)
But I would not really have the time for it anyhow...
Absolutely! I think it's really cool and respectable that Empire even allows you to continue working on the openEMS project from time to time. It has been an incredible tool for learning FDTD in great detail, and it is far more robust than most seem to believe (as long as you understand the basics of FDTD and know how to use it :D). Let's see where GPUs (the "less advanced" speedup candidate) take us...

Cheers,
Apostolos
