RELION GPU Benchmarks
Last Updated February 21, 2019 - Charles Bowman

Introduction

GPU acceleration via CUDA has been available in the popular cryo-EM processing package RELION since early 2016. This has allowed cryo-EM researchers to harness the parallel processing power of GPUs not only to speed up refinement and classification tasks, but also to shrink their computational footprint, thanks to the high density of GPU computing.

Purpose of the Page

Here we aim to provide reference benchmarks for NVIDIA GPUs in various configurations to support cryo-EM scientists looking to purchase equipment for themselves. We plan to update this page's benchmarks over time as relevant new technologies appear. If you have local systems available for benchmarking, are a vendor looking to have your benchmarks listed here, or would like to see any other specifics covered, please contact bowman@scripps.edu.

GPU Benchmarks

Benchmarks below are 3D classifications performed using the Plasmodium ribosome dataset featured in Wong et al., eLife 2014, following the methods from the RELION Benchmarks and Computer Hardware wiki page with some small modifications (an example launch command is sketched after this list):

  • If runs are performed using the --scratch_dir option in RELION, the time to copy to scratch is subtracted from the total runtime of the command. This allows for comparison of runs independent of storage pool variability.

  • "GPU Time" is reported as the time spent on the Expectation step in each of the 25 iterations of classification, which provides a solid snapshot of GPU speed as this is the primary GPU compute step in RELION.

  • All benchmarks in this chart were performed using 5 MPI ranks (-np 5) and two threads per process (--j 2), with the exception of the Tesla K80 system, which used 9 MPI ranks (one master rank plus one worker per GPU) because each K80 card carries two GPUs. Systems in this list have four GPU cards installed.
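
For reference, a benchmark launch looks roughly like the sketch below. The MPI, thread, GPU, and scratch settings match those described above; the classification parameters and paths are placeholders recalled from the RELION benchmark wiki and should be checked against that page rather than treated as the exact commands behind these numbers.

    # Sketch of a benchmark launch: 5 MPI ranks, 2 threads per rank, GPU
    # acceleration, and particles copied to a local scratch disk. The
    # classification parameters and paths below are placeholders from the
    # RELION benchmark wiki, not the verbatim commands used for this page.
    mpirun -np 5 relion_refine_mpi \
        --i Particles/shiny_2sets.star \
        --o Class3D/benchmark \
        --ref emd_2660.map:mrc --ini_high 60 --firstiter_cc \
        --ctf --ctf_corrected_ref \
        --iter 25 --K 6 --tau2_fudge 4 --particle_diameter 360 \
        --flatten_solvent --zero_mask --oversampling 1 \
        --healpix_order 2 --offset_range 5 --offset_step 2 \
        --sym C1 --norm --scale \
        --dont_combine_weights_via_disc --pool 100 \
        --scratch_dir /local/ssd \
        --gpu --j 2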

Discussion

The plot below shows each card's GPU compute speedup relative to the slowest card (the Tesla K80) for all of the cards in the chart above (with the exception of the 16 GB Tesla V100s, as they perform identically to their 32 GB counterparts), plotted against NVIDIA's rated single-precision compute speed in TFLOPS (tera floating-point operations per second).

In the graph above there is a positive correlation, as expected, but the relationship is not a direct one: as we add more data points, it is becoming clear that the correlation is loose. We hypothesize that differences in GPU memory technology (HBM, GDDR6, GDDR5) are a major cause of the variance.
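
For anyone recomputing the speedups from the raw data linked below, the value is simply the slowest card's Expectation time divided by each card's own Expectation time. A minimal sketch, assuming a two-column CSV of card name and Expectation time (the file name and column layout here are hypothetical, not the actual format of the raw data files):

    # Speedup relative to the slowest card from a hypothetical CSV of
    # "card,expectation_seconds" (header row skipped; file name and columns
    # are assumptions, not the real raw-data layout).
    awk -F, 'NR > 1 { t[$1] = $2; if ($2 > max) max = $2 }
             END    { for (c in t) printf "%-24s %.2fx\n", c, max / t[c] }' \
        gpu_expectation_times.csv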

As we benchmark more cards, we will update this graph and its interpretation. It is also important to note that many of the newer bells-and-whistles features of NVIDIA GPUs, such as Tensor Cores and NVLink, do not offer significant benefits for RELION processing.

Fastest Runs

These 3D classification runs are performed using the same data as above with different parameters and represent my attempts at pushing the available hardware to its limit! The big thing to notice here is that as the time spent on GPU compute shrinks, CPU overhead becomes a significant factor. RELION's GPU acceleration is compatible with any CUDA-capable card of compute capability 3.5 or higher, so yes, you can even dust off your CS department's DGX-1 machines once they upgrade to DGX-2!

Recent Additions

  • Testing_8x1080ti is a server sporting eight GTX 1080 Ti cards that currently holds the record for the fastest GPU compute time in the list.
  • Torhild_4x1080 is a custom workstation out of the Yoshioka group that sports liquid cooling and overclocked GTX 1080s. This machine performs on par with some of the Tesla systems above!
  • Hyperion_4x2080 is my under-desk workstation sporting 4 ASUS RTX 2080 Turbo "blower"-style cards. It also doubles as a leg warmer.
  • DGX-1_V100_scratch is an updated DGX-1 benchmark performed with the RELION 3.0 release, CUDA 10, and more optimized parameters (--preread_images, --j 4), which pushes the DGX-1 to the top of the list! A launch of this sort is sketched below.
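
As a rough illustration of the DGX-1_V100_scratch configuration, a launch along these lines is sketched below. The classification parameters are the same placeholders used in the earlier sketch, and the rank count (-np 9, one master rank plus one worker per V100 on an 8-GPU DGX-1) is my assumption rather than a value reported above.

    # Classification parameters as in the earlier sketch (placeholders from
    # the RELION benchmark wiki), collected here to keep the launch readable.
    CLASS3D_ARGS="--i Particles/shiny_2sets.star --o Class3D/dgx1_fast \
      --ref emd_2660.map:mrc --ini_high 60 --firstiter_cc --ctf --ctf_corrected_ref \
      --iter 25 --K 6 --tau2_fudge 4 --particle_diameter 360 --flatten_solvent \
      --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 \
      --offset_step 2 --sym C1 --norm --scale --dont_combine_weights_via_disc \
      --pool 100"

    # Sketch of the optimized DGX-1 run: RELION 3.0, CUDA 10, particles
    # pre-read into RAM, 4 threads per rank. The -np 9 (one master plus one
    # worker per V100) is an assumption, not a value from this page.
    mpirun -np 9 relion_refine_mpi $CLASS3D_ARGS --preread_images --j 4 --gpu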

Raw Data

  • Download raw data (including system specifications) for the GPU Benchmarks here.
  • Download raw data (including system specifications) for the Fastest Runs here.

Acknowledgements

  • Some computational analyses were performed using shared instrumentation funded by 1-S10OD021634.
  • Thanks to @LanderLab for access to GPU workstation Azathoth.
  • Thanks to Art Kenney, Kihoon Yoon, Deepthi Cherlopalle, Joseph Stanfield and the rest of the Dell EMC HPC team for access to C4140 Nodes.
  • Thanks to George Vacek, Mark Adams, and William Beaudin at DDN for access to the NVIDIA DGX-1.
  • Thanks to Craig Yoshioka at OHSU for Torhild and 8x1080ti benchmarks.
  • Thanks to Ting-Wai Chu at National Taiwan University for faster DGX-1 benchmarks.
  • Figures generated using d3.js.