ARTIFACT DESCRIPTION APPENDIX: DESIGNING A GPU-ACCELERATED COMMUNICATION LAYER FOR EFFICIENT FLUID-STRUCTURE INTERACTION COMPUTATIONS ON HETEROGENEOUS SYSTEMS

AUTHOR: ARISTOTLE MARTIN, DUKE UNIVERSITY, ARISTOTLE.MARTIN@DUKE.EDU
PI: AMANDA RANDLES

This repository contains the computational artifacts supporting the contributions of the work titled above. For convenience, the contributions are re-stated below.

- C1: Presentation of GPU-accelerated cell communication routines that resulted in up to an order-of-magnitude speedup on up to 32 million cells across hundreds of nodes on heterogeneous supercomputing platforms. Associated artifacts: A2, A3, A4.
- C2: Elucidation of trade-offs between code performance and GPU memory utilization. Associated artifacts: A1.
- C3: Evaluation of the impact of node architecture on GPU communication optimizations. Associated artifacts: A2, A3.

We employed HARVEY, a massively parallel fluid dynamics solver, to obtain the simulation results presented in this study. HARVEY is generally available under a proprietary research license from Duke University, which includes a provision for a free license for academic use. For access, contact the Duke Office of Licensing and Ventures.

The repository is subdivided into three major directories:

- `memoryanalysis`: Contains the artifacts supporting contribution C2 (A1): specifically, the runscripts and raw output files used to generate Fig. 6. Within this directory, there are three main sub-directories: `geometries`, `singlenode`, and `multinode`. The `geometries` directory contains the input geometries provided to the simulation, in object file format (.OFF), named `cubeXum.off` for a cubic geometry with a side length of X micrometers. Both the `singlenode` and `multinode` directories are sub-divided into `baseline` and `optimized` directories (named for the baseline and GPU-optimized implementations), each of which includes a pair of directories labeled `outputs` and `runscripts`. In total, there are four standard output files and four runscripts, one for each bar in Fig. 6. The standard output files are raw logs generated by the HARVEY simulator. For these experiments, the code was instrumented to report fine-grained memory allocation data, which appears as lines of the form "allocated GPU X data (bytes): Y", where X is the data type and Y is the number of bytes allocated by a given MPI rank. For each data type, the maximum number of bytes across ranks was selected. Note that these experiments were performed on the ALCF Polaris machine. A sketch for extracting these values is given after this list.
- `timinganalysis`: Contains the artifacts supporting contributions C1 (A2, A3) and C3 (A2, A3). This directory is sub-divided into three main directories. As with artifact A1, the `geometries` directory contains the input .OFF geometry files used across all of the studies pertaining to these artifacts. The `singlenode` sub-directory comprises artifact A2, which includes the runscripts and raw profiling data used to generate Fig. 7 and Fig. 8. It contains two main sub-directories, `baseline_singlenode_timing_hct` and `opt_singlenode_timing_hct`; both share the same structure, with the former pertaining to the baseline implementation and the latter to the optimized implementation. Within either directory, there are five sub-directories: one for runscripts, and four named `hctX`, where X is 10, 20, 30, or 40. This value corresponds to the cell density percentage reported along the horizontal axis in Fig. 7 and Fig. 8. The actual profiling data is located at `hctX/rundata_singlenode_hctX/profiling/profiling.csv`. The CSV file is generated automatically by an internal profiler within HARVEY and contains fine-grained timing information. Each line of the CSV file holds values associated with a simulation event; rows whose eventType is "t" give the timings used to generate Fig. 7 and Fig. 8 (a sketch for filtering these rows follows this list).
- `weakscaling`: Contains the artifacts supporting contribution C1 (A4). Here the `geometries` sub-directory is divided into two sub-directories, `cube` (Fig. 2(B)) and `complex` (Fig. 2(A)), corresponding to the input geometry type used for each set of weak scaling runs (Fig. 10 and Fig. 11). The `polaris` directory contains the runscripts and raw outputs used to generate the Polaris weak scaling results in Fig. 10 and Fig. 11, and likewise for the `frontier` directory with respect to the Frontier results. Each of these directories is sub-divided into `baseline` and `optimized` directories, each of which is further divided by geometry type, `complex` and `cube`; these contain the standard output files generated by HARVEY (`outputs`) and the runscripts used to generate them (`runscripts`). For timings, the line of interest within a given standard output file is "SimulationLoop Total Loop time", which gives the simulation time excluding setup, load balancing, and output writing; these values were used directly to plot the lines in Fig. 10 and Fig. 11 (a sketch for extracting this line follows this list).
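For the `memoryanalysis` logs, the following Python sketch illustrates one way to collect the per-data-type maxima from the instrumented allocation lines described above. It is not part of HARVEY; the helper name is ours, and only the quoted phrase "allocated GPU X data (bytes): Y" is taken from this description, so the exact surrounding log formatting is an assumption.

```python
import re
import sys
from collections import defaultdict

# Lines of the form 'allocated GPU <type> data (bytes): <N>' (per the
# description above); the surrounding log format is an assumption.
ALLOC_RE = re.compile(r"allocated GPU (.+?) data \(bytes\): (\d+)")

def max_allocation_per_type(log_path):
    """Return the maximum bytes reported for each GPU data type across ranks."""
    max_bytes = defaultdict(int)
    with open(log_path) as log:
        for line in log:
            match = ALLOC_RE.search(line)
            if match:
                dtype, nbytes = match.group(1), int(match.group(2))
                max_bytes[dtype] = max(max_bytes[dtype], nbytes)
    return dict(max_bytes)

if __name__ == "__main__":
    for dtype, nbytes in sorted(max_allocation_per_type(sys.argv[1]).items()):
        print(f"{dtype}: {nbytes / 2**20:.1f} MiB")
```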
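For the `timinganalysis` profiling data, the rows of interest in `profiling.csv` are those whose eventType is "t". A minimal sketch follows, assuming the CSV carries a header row; only the eventType field name comes from this description, and any other column names depend on the file itself.

```python
import csv
import sys

def timing_rows(csv_path):
    """Yield profiling rows whose eventType field is 't' (timing events)."""
    # Assumes profiling.csv has a header row; only the 'eventType'
    # field name is taken from the artifact description.
    with open(csv_path, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            if row.get("eventType") == "t":
                yield row

if __name__ == "__main__":
    for row in timing_rows(sys.argv[1]):
        print(row)
```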
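For the `weakscaling` outputs, the total loop time can be pulled from the standard output files as sketched below. Only the phrase "SimulationLoop Total Loop time" comes from this description; the assumption that the value appears as the last number on that line is ours.

```python
import re
import sys
from pathlib import Path

def total_loop_time(output_path):
    """Return the 'SimulationLoop Total Loop time' value from a HARVEY log."""
    for line in Path(output_path).read_text().splitlines():
        if "SimulationLoop Total Loop time" in line:
            # Assumption: the reported time is the last number on the line.
            numbers = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", line)
            if numbers:
                return float(numbers[-1])
    return None

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, total_loop_time(path))
```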