RELION GPU Benchmarks
Last Updated August 29, 2018 at 6:09 PM PST - Charles Bowman
CUDA GPU acceleration has been available in the popular cryo-EM processing package RELION since early 2016. This lets cryo-EM researchers harness the parallel processing power of GPUs not only to speed up refinement and classification tasks, but also to reduce their computational footprint, thanks to the high-density nature of GPU computing. NVIDIA's GPU offerings are split into three (ish) tiers:
- GeForce GTX cards are consumer-grade cards designed to render 3D environments for video games. They are typically priced the lowest, but are also the most limited in terms of memory and are only effective at single-precision compute. High-end cards in this category, such as the GTX 1080 and GTX 1080ti, have been used very effectively with RELION and other CUDA-accelerated applications, and are the norm in many labs with GPU workstations. The recently introduced RTX line also falls into this category.
- Quadro cards are professional-grade cards that are specialized for workstation applications such as CGI and CAD rendering. These cards can technically be used for GPU computing, but this is uncommon as they are slower than their consumer counterparts at single-precision compute and have a higher price point.
- Tesla cards are enterprise-grade cards that are specialized for running GPGPU workloads 24/7 in datacenters. They are passively cooled, have large amounts of memory compared to GTX cards, handle both single- and double-precision math well, and can interface through either PCI-Express or mezzanine (SXM2) connections.
Purpose of the Page
This page aims to provide reference benchmarks for NVIDIA GPUs in various configurations to support cryo-EM scientists looking to purchase equipment for themselves. We plan to update this page's benchmarks over time with relevant technologies. If you have local systems available for benchmarking, are a vendor looking to have your benchmarks listed here, or would like to see any other specifics, please contact email@example.com. I am actively seeking systems that use the following GPUs for testing:
- GTX 1070
- GTX 1070ti
- GTX 1080ti
- Titan Xp
- Titan Volta
- RTX 2070
- RTX 2080
- RTX 2080ti
Benchmarks below are 3D classifications performed using the Plasmodium ribosome dataset featured in Wong et al., eLife 2014, following the methods from the RELION Benchmarks and Computer Hardware wiki page with some small modifications:
- If runs are performed using the --scratch_dir option in RELION, the time to copy to scratch is subtracted from the total runtime of the command. This allows for comparison of runs independent of storage pool variability.
- "GPU Time" is reported as the time spent on the Expectation step in each of the 25 iterations of classification, which provides a solid snapshot of GPU speed as this is the primary GPU compute step in RELION.
- Unless otherwise noted, all benchmarks in this chart are performed using 5 MPI processes (-np 5) and two threads per process (--j 2). Systems in this list have four GPUs installed, so with RELION's non-computing MPI master rank this gives one worker process per GPU.
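The two timing adjustments described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for this page: the `RunTiming` container and its field names are hypothetical, the timings are assumed to have already been extracted from the run output by hand, and "GPU Time" is assumed to be the sum of the per-iteration Expectation times.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunTiming:
    """Hypothetical container for timings pulled from one benchmark run."""
    total_wall_s: float              # wall-clock time of the whole command
    scratch_copy_s: float            # time spent copying to scratch (0 if unused)
    expectation_s: Sequence[float]   # Expectation-step time, one entry per iteration


def adjusted_runtime(run: RunTiming) -> float:
    """Total runtime with the copy-to-scratch time subtracted."""
    return run.total_wall_s - run.scratch_copy_s


def gpu_time(run: RunTiming, n_iter: int = 25) -> float:
    """Sum of Expectation-step times (assumed definition of "GPU Time")."""
    if len(run.expectation_s) != n_iter:
        raise ValueError(f"expected {n_iter} iterations, got {len(run.expectation_s)}")
    return sum(run.expectation_s)
```

For example, a run that took 1000 s in total, of which 120 s was the scratch copy, is reported as 880 s; the numbers here are placeholders, not measurements.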
Results are largely as expected given the published compute rates for each card, with the exception of the Tesla P100. The plot below shows the GPU compute speedup of each card from the above graph relative to the slowest card (Tesla K80), plotted against its single-precision compute rating from NVIDIA in TFLOPS (tera floating-point operations per second). The 16 GB Tesla V100s are omitted because they perform identically to their 32 GB counterparts.
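As a rough sketch of how a speedup relative to the slowest card might be computed, the Python function below takes GPU times keyed by card name and returns a percentage for each. The formula is an assumption (the slowest card is defined as 0% and a card that finishes in half the time as 100%); the page's figure may use a different definition, and the function name is hypothetical.

```python
from typing import Dict


def percent_speedup(gpu_times_s: Dict[str, float]) -> Dict[str, float]:
    """Percent speedup of each card relative to the slowest one.

    Assumed definition: (baseline_time / time - 1) * 100, where the
    baseline is the largest (slowest) GPU time in the input.
    """
    if not gpu_times_s:
        raise ValueError("no timings given")
    if any(t <= 0 for t in gpu_times_s.values()):
        raise ValueError("GPU times must be positive")
    baseline = max(gpu_times_s.values())
    return {name: (baseline / t - 1.0) * 100.0 for name, t in gpu_times_s.items()}
```

With placeholder inputs such as `{"slow": 200.0, "fast": 100.0}`, the slow card comes out at 0% and the fast card at 100%.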
In the graph above, the regression line is calculated excluding the datapoint for the P100, and shows a strong positive linear correlation between TFLOPS and % speedup in RELION 3D classification (m=0.18, b=8.43, r
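For readers who want to run this kind of fit on their own numbers, here is a minimal Python sketch of an ordinary least-squares line (slope m, intercept b, Pearson r) that leaves out named points, as was done for the P100. It is an assumption about the general approach, not the code behind the figure; the function names are hypothetical, and it makes no claim about which quantity sits on which axis.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit; returns (slope m, intercept b, Pearson r)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    m = sxy / sxx
    b = mean_y - m * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return m, b, r


def fit_excluding(points: Dict[str, Tuple[float, float]],
                  exclude: Iterable[str]) -> Tuple[float, float, float]:
    """Fit a line to all named points except those listed in `exclude`."""
    skip = set(exclude)
    kept = [xy for name, xy in points.items() if name not in skip]
    xs, ys = zip(*kept)
    return linear_fit(xs, ys)
```

A call would look like `fit_excluding(points, exclude=["P100"])`, where `points` maps each card name to its (x, y) pair taken from the raw data.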
These 3D classification runs are performed using the same data as above with different parameters and represent my attempts at pushing available hardware to the limit! A big thing to notice here is that as we minimize the time spent on GPU compute, the CPU overhead becomes a significant factor. RELION's GPU acceleration is compatible with any CUDA-capable card with compute capability 3.0 or higher, so yes, you can even dust off your CS department's DGX-1 machines once they upgrade to DGX-2!
- Testing_8x1080ti is a server sporting eight 1080ti cards that currently holds the record for fastest GPU compute time in the list.
- Torhild_4x1080 is a custom workstation out of the Yoshioka group that sports liquid cooling and overclocked GTX 1080s. This machine performs on par with some of the Tesla systems above!
- Download raw data (including system specifications) for the GPU Benchmarks here.
- Download raw data (including system specifications) for the Fastest Runs here.
- Some computational analyses were performed using shared instrumentation funded by 1-S10OD021634.
- Thanks to @LanderLab for access to GPU workstation Azathoth.
- Thanks to Art Kenney, Kihoon Yoon, Deepthi Cherlopalle, Joseph Stanfield and the rest of the Dell EMC HPC team for access to C4140 Nodes.
- Thanks to George Vacek, Mark Adams, and William Beaudin at DDN for access to the NVIDIA DGX-1.
- Thanks to Craig Yoshioka at OHSU for Torhild and 8x1080ti benchmarks.
- Figures generated using d3.js.