RELION GPU Benchmarks
Last Updated February 21, 2019 - Charles Bowman
GPU acceleration via CUDA has been implemented in the popular cryo-EM processing package RELION since early 2016. This allows cryo-EM researchers to harness the parallel processing power of GPUs not only to speed up refinement and classification tasks, but also to shrink their computational footprint, thanks to the high density of GPU computing. NVIDIA's GPU offerings are split into three (ish) tiers:
- GeForce GTX cards are consumer-grade cards designed to render 3D environments for video games. These are typically the lowest priced, but are also the most limited in memory and are only effective at single-precision compute. High-end cards in this category - such as the GTX 1080 and GTX 1080ti - have been used very effectively with RELION and other CUDA-accelerated applications, and are the norm in many labs with GPU workstations. The recently introduced RTX line also falls into this category.
- Quadro cards are professional-grade cards that are specialized for workstation applications such as CGI and CAD rendering. These cards can technically be used for GPU computing, but this is uncommon as they are slower than their consumer counterparts at single-precision compute and have a higher price point.
- Tesla cards are enterprise-grade cards specialized for 24/7 GPGPU computing in datacenters. They are passively cooled, carry more memory than GTX cards, offer strong single- and double-precision performance, and can interface through either PCI-Express or Mezzanine (SXM2) connections.
Purpose of the Page
Here we are aiming to provide reference benchmarks for NVIDIA GPUs in various configurations to support cryoEM scientists looking to purchase equipment for themselves. We plan to update this page's benchmarks over time with relevant technologies. If you have local systems available for benchmarking, are a vendor looking to have your benchmarks listed here, or would like to see any other specifics, please contact firstname.lastname@example.org. I am actively seeking systems that use the following GPUs for testing:
- GTX 1070
- GTX 1070ti
- GTX 1080ti
- Titan Xp
- Titan V
- Titan RTX
- RTX 2070
Benchmarks below are 3D classifications performed on the Plasmodium falciparum ribosome dataset featured in Wong et al., eLife 2014, using the same methods as the RELION Benchmarks and Computer Hardware wiki page with some small modifications:
- If runs are performed using the --scratch_dir option in RELION, the time to copy to scratch is subtracted from the total runtime of the command. This allows for comparison of runs independent of storage pool variability.
- "GPU Time" is reported as the time spent on the Expectation step in each of the 25 iterations of classification, which provides a solid snapshot of GPU speed as this is the primary GPU compute step in RELION.
- All benchmarks in this chart are performed using 5 MPI ranks (-np 5) and two threads per process (--j 2), with the exception of the Tesla K80, which uses 9 MPI ranks (-np 9) due to the dual-GPU nature of the card. Systems in this list have four GPU cards installed.
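To make the scratch-time adjustment and the "GPU Time" metric described above concrete, here is a minimal sketch of the arithmetic; all timing numbers below are hypothetical placeholders, not measured results.

```python
# Hypothetical timings (seconds) for a single benchmark run.
total_runtime = 5400.0      # wall time of the full classification command
scratch_copy_time = 320.0   # time spent copying particles to --scratch_dir

# Runtime used for comparison: total time minus the scratch copy,
# making runs independent of storage-pool variability.
adjusted_runtime = total_runtime - scratch_copy_time

# "GPU Time" is the summed Expectation-step duration over the
# 25 classification iterations (hypothetical per-iteration times here).
expectation_times = [150.0] * 25
gpu_time = sum(expectation_times)

print(adjusted_runtime)  # 5080.0
print(gpu_time)          # 3750.0
```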
The plot below shows each card's GPU compute speedup relative to the slowest card (the Tesla K80) plotted against NVIDIA's rated compute speed in single-precision TFLOPS (tera floating-point operations per second). The 16 GB Tesla V100s are omitted, as they perform identically to their 32 GB counterparts.
In the graph above there is a positive correlation, as expected, but it is not a tight one, and this has become clearer as we add more data points. We hypothesize that differences in GPU memory technology (HBM, GDDR6, GDDR5) are a major source of the variance.
As we benchmark more cards we will update this graph and its interpretation. It is also important to note that many of the bells-and-whistles features of newer NVIDIA GPUs, such as Tensor Cores and NVLink, do not offer significant benefits for RELION processing.
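The speedup metric plotted above reduces to a simple ratio against the baseline card. A minimal sketch follows; the GPU times are hypothetical, and the TFLOPS values are approximate published ratings used only for illustration.

```python
# Hypothetical summed Expectation-step times (seconds) for two cards.
gpu_time = {"Tesla K80": 9000.0, "GTX 1080": 3000.0}

# Approximate single-precision ratings in TFLOPS.
tflops = {"Tesla K80": 8.7, "GTX 1080": 8.9}

# Speedup relative to the slowest card (Tesla K80), as in the plot.
baseline = gpu_time["Tesla K80"]
speedup = {card: baseline / t for card, t in gpu_time.items()}

# If the correlation with rated TFLOPS were perfect, speedup per TFLOP
# would be constant across cards; in practice it varies, e.g. with
# memory technology.
per_tflop = {card: speedup[card] / tflops[card] for card in speedup}
```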
These 3D classification runs are performed using the same data as above with different parameters and represent my attempts at pushing available hardware to the limit! The big thing to notice here is that as GPU compute time shrinks, CPU overhead becomes a significant fraction of total runtime. RELION's GPU acceleration is compatible with any CUDA-capable card of compute capability 3.0 or higher, so yes, you can even dust off your CS department's DGX-1 machines once they upgrade to DGX-2!
- Testing_8x1080ti is a server sporting eight GTX 1080tis that currently holds the record for the fastest GPU compute time in the list.
- Torhild_4x1080 is a custom workstation out of the Yoshioka group that sports liquid cooling and overclocked GTX 1080s. This machine performs on par with some of the Tesla systems above!
- Hyperion_4x2080 is my under-desk workstation sporting 4 ASUS RTX 2080 Turbo "blower" style cards. It doubles as a leg warmer.
- DGX-1_V100_scratch is an updated DGX-1 benchmark performed with the RELION 3.0 release, CUDA 10, and more optimized parameters (--preread_images, --j 4), which pushes the DGX-1 to the top of the list!
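The CPU-overhead effect noted above can be quantified by comparing the summed Expectation-step (GPU) time against the scratch-adjusted total runtime; everything outside Expectation (Maximization, I/O, MPI communication) counts as overhead. A minimal sketch with hypothetical timings:

```python
# Hypothetical timings (seconds) for a fast, heavily optimized run.
total_runtime = 1200.0   # scratch-adjusted wall time
gpu_time = 500.0         # summed Expectation-step time

# Time spent outside the GPU-accelerated Expectation step.
cpu_overhead = total_runtime - gpu_time
overhead_fraction = cpu_overhead / total_runtime

print(round(overhead_fraction, 3))  # 0.583
```

In this hypothetical run, more than half the runtime is spent outside GPU compute, which is why faster cards alone stop helping past a certain point.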
- Download raw data (including system specifications) for the GPU Benchmarks here.
- Download raw data (including system specifications) for the Fastest Runs here.
- Some computational analyses were performed using shared instrumentation funded by 1-S10OD021634.
- Thanks to @LanderLab for access to GPU workstation Azathoth.
- Thanks to Art Kenney, Kihoon Yoon, Deepthi Cherlopalle, Joseph Stanfield and the rest of the Dell EMC HPC team for access to C4140 Nodes.
- Thanks to George Vacek, Mark Adams, and William Beaudin at DDN for access to the NVIDIA DGX-1.
- Thanks to Craig Yoshioka at OHSU for Torhild and 8x1080ti benchmarks.
- Thanks to Ting-Wai Chu at National Taiwan University for faster DGX-1 benchmarks.
- Figures generated using d3.js.