#### Performance Modeling as the Key to Extreme Scale Computing www.cs.illinois.edu/~wgropp William Gropp How Do We Know if there is a Performance Problem? My application scales well! So what! Is it efficient? PARALLEL@ILLINOIS This latter is often overlooked, leading to code implementations ## Tuning A Parallel Code - Typical Approach - Profile code. Determine where most time is being - Study code. Measure absolute performance, look at performance counters, compare FLOP rates - Improve code that takes a long time, reduce time spent in "unproductive" operations - Why this isn't the right Approach: - How do you know when you are done? To what do we compare? ◆ How can we know? decreases scalability Making the scalar code more efficient - How do you know how much performance improvement you can obtain? - Why is it hard to know? ## Performance is Key - Parallelism is (usually) used to get more performance - How do you know if you are making good (not even best) use of a parallel system? - Even measurement-based approaches can be (and all to often are) performed without any real basis of comparison - The key questions are - Where is most of the time spent? - What is the achieveable performance, and how do I get there? - erroneous conclusions based on the (immature) state of compiler / runtime / #### Heart of Blue Waters: Two New Chips #### Power7 Chip Up to 8 cores, 32 SMT threads Up to 256 GF peak performance Memory Subsystem L1 (2x64 KB), L2 (256 KB) L3 (32 MB, complex policy) 128 GB/s memory bandwidth Two memory controllers #### **PERCS Hub Chip** 1.128 TB/s total bandwidth 192 GB/s QCM connection896 GB/s to other QCMs40 GB/s general purpose GB/s general purpose I/O ## Quad-chip Module (QCM) ~1 TF peak performance 128 GBytes memory 512 GB/sec memory bw • 32 cores, 128 threads Four Power7 chips - One Hub Chip - 1.128 TB/sec bandwidth - 896 GB/s to other QCMs 192 GB/s QCM connection - ⋄ 336 GB/s to other QCMs in same drawer - 240 GB/s to QCMs in other drawers in same supernode - 320 GB/s to QCMs in other supernodes - 40 GB/s general purpose I/O #### IH Server Node (Drawer) and Supernode #### **Logical View of Blue Waters** Interconnect ## Logical View of Blue Waters Interconnect # Two-level (L, D) Direct-connect ## Logical View of Blue Waters Interconnect ## Another Example System - 128 node GPU Cluster - #3 on Green500 - Each node has - ◆ One Core i3 530 2.93 GHz dualcore CPU - ◆ One Tesla C2050 GPU per node - 33.62 TFLOPS on HPL - 934 MFLOPS/Watt - How can we *engineer* codes for performance on these complex systems? - And an exercise for the viewer: what do performance models tell you about the CPU/GPU comparisons you see? # An Even More Radical System - Rack Scale - Processing:128 Nodes, 1 (+) PF/s - Memory: - 128 TB DRAM - 0.4 PB/s Aggregate Bandwidth - NV Memory - 1 PB Phase Change Memory (addressable) Additional 128 for Redundancy/RAID - Network - 0.13 PB/sec Injection, 0.06 PB/s Bisection SVA # Why Performance Modeling? - What is the goal? - It is not precise predictions - ◆ It is insight into whether a code is achieving the performance it could, and if not, how to fix it - Performance modeling can be used - To estimate the baseline performance - ◆ To estimate the potential benefit of a nontrivial change to the code - To identify the critical resource 14 #### Different Philosophies for Performance Models Simulation: Actually two different models Performance Modeling? What do I mean by First, an analytic expression based on the application - Very accurate prediction, little insight - Traditional Performance Modeling (PM): - Focuses on accurate predictions - Tool for computer scientists, not application developers - PM as part of the software engineering process (our view) - PM for design, tuning and optimization - PMs are developed with algorithms and used in each step of the development cycle - > Performance Engineering Mismatch in developer expectations Inefficiencies in compilation/runtime Also: comparison of the two models with observed performance can identify The obvious: extrapolation to other systems, such as scalability in nodes or different interconnect Why this sort of modeling Note that a series of measurements from Second, an analytic expression based on the application's *algorithm* and data structures benchmarks are *not* a performance mode ### Our Methodology - measurement tools Combine analytical methods and performance - Programmer specifies parameterized expectation - E.g., $T = a+b*N^3$ - Estimate coefficients with appropriate benchmarks - We derive the scaling analytically and fill in the constants with empirical measurements - Focus on upper and lower bounds rather than precise predictions - Models must be as simple and effective as possible - Simplicity increases the insight - Precision needs to be just good enough to drive action. - An example: Sparse matrix-vector multiply 17 # Sparse Matrix-Vector Product - Common operation for optimal (in floating-point operations) solution of linear systems - Sample code (common CSR format): for row=1,n ``` m = i[row] - i[row-1]; y[i] = sum; sum = 0; for k=1,m sum += *a++ * x[*j++]; ``` i[n], x[n], y[n] Data structures are a[nnz], j[nnz], Realistic Measures of Peak Performance One vector, matrix size, m = 90,708, nonzero entries nz = 5,047,120 Sparse Matrix Vector Product ■ Theoretical Peak ■ Mem BW Peak Oper. Issue PeakObserved # Simple Performance Analysis - Memory motion: - nnz (sizeof(double) + sizeof(int)) + n (2\*sizeof(double) + sizeof(int)) - Assume a perfect cache (never load same data twice) - Computation - nnz multiply-add (MA) - Roughly 12 bytes per MA - Typical node can move 1-4 bytes/MA - Maximum performance is 8-33% of peak - Use STREAM benchmark to get sustained memory - Similar analysis gives bound based on instruction issue - on these limits Implementation improvements (tricks) cannot improve - W. K. Anderson, William D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Achieving high sustained performance in an unstructured mesh CFD application, SC'99 (Gordon Bell Prize) $_{\rm 10}$ Power 4 (1.3 GHz) ORNL and ANL for compute time Thanks to Dinesh Kaushik; Pentium 4 Xeon (2.4 GHz) ## But the problem is so big! - Real applications are much larger isn't it hard to do this for the entire application? - Yes, but it doesn't matter for runnable apps. Look at the parts that take the most time. Break the problem into digestible parts - Contributions to performance issues from: - Single thread and node performance - Node and the Network - Placement in the Network 21 ## How Good are Compilers at Vectorizing Codes? S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. In preparation. 2011. - Note rapidly growing numbers of functional units Power7 has 2 multiply-add units per core; BG/ Q has 4, accessed through "vector" instructions - How do we know how well we are doing? - How do we know how well the compiler is doing? - We can model the expected performance, including vectorization! - Using the model, we can also identify where manually applying well-known transformations will help - Also identifies where extra constraints, such as alignment restrictions, may inhibit use of vectorization | Appl | XLC | ICC | GCC | XLC | ICC | GCC | | |-----------|-----|-----------|------|------|--------|------|--| | | | Automatic | atic | | Manual | | | | JPEG Enc | - | 1.33 | - | 1.39 | 2.13 | 1.57 | | | JEPG Dec | 1 | - | - | - | 1.14 | 1.13 | | | H263 Enc | - | - | - | 1.25 | 2.28 | 2.06 | | | H263 Dec | - | - | - | 1.31 | 1.45 | - | | | MPEG2 Enc | ı | ı | ı | 1.06 | 1.96 | 2.43 | | | MPEG2 Dec | 1 | - | 1.15 | 1.37 | 1.45 | 1.55 | | | MPEG4 Enc | ı | ı | ı | 1.44 | 1.81 | 1.74 | | | MPEG4 Dec | ı | - | ı | 1.12 | | 1.18 | | Table shows whole program speedups measured against unvectorized application S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. In preparation. 2011. ## Processes and Memory - performance is the limiting resource - As in sparse matrix-vector multiply - What is the appropriate sustained rate? - Memory bus bandwidth is nearly irrelevant it is the sustained rate that is usually important - What about other ways to increase effective sustained performance, such as prefetch? - Prefetch hardware can detect regular accesses and prefetch data, making use of otherwise idel memory bus time. - However, the hardware must be presented with enough independent data streams 25 # Performance Ratio Compared to CSR Format - S-CSR format is better than CSR format for all (on Power 5 and 6) or Most (on Power 4) matrices - S-BCSR format is better than BCSR format for all (on Power 6) or Most (on Power 4 and 5) matrices - Blocked format performance from ½ to 3x CSR. # Streamed Compressed Sparse Row (S-CSR) format - S-CSR format partitions the sparse matrix into blocks along rows with size of bs. Zeros are added in to keep the number of elements the same in each row of a blockThe first rows of all blocks are stored first, then second, third ... and bs-th rows. - For the sample matrix in the following Figure, NNZ = 29. Using a block size of bs = 4, it generates four equal length streams R, G, B and P. This new design only adds 7 zeros every 4 rows. ## Combining With Other Optimizations - We can further modify the S-CSR and S-BCSR to match the requirements for vectorization - We can use OSKI to optimize "within the loops" ## Processes and SMP nodes - "owns" all of the cores all of the time HPC users typically believe that their code - The reality is that was never true, but they did have all of the cores the same fraction of time when there was one core /node - suggest fixes. We can use a simple performance model to measurements to identify the problem and check the assertion and then use - distributed? amount of work. mesh, with every core having the same Consider a simple Jacobi sweep on a regular How are run times 29 ### Sharing an SMP - they can use them to solve other problems ("no one would use all of them all of the time") Having many cores available makes everyone think that - However, compute-bound scientific calculations are often written as if all compute application resources are owned by the - Such static scheduling leads to performance loss - overhead, but is better Pure dynamic scheduling adds - Careful mixed strategies are even bette - Recent results give 10-16% performance improvements on large, scalable systems Thanks to Vivek Kale ## **Model-guided Optimization** - Application is MILC, a lattice QCG code - Analytic model showed possible improvement communicating of 12% by eliminating the pack before What are the correct parameters? Model the real system, but abstractly For Blue Gene, must model independent communication links How relevant is ping-por Processes and the Network bandwidth and real syste Bandwidth (one send) Halo Exchange (4 nbrs) Halo Exchange (phased) Four Neighbor Halo Exchange analyzed in EuroMPI/10 Torsten Hoefler implemented and 15000 1000 20000 Serial Model Model P=1024 Comm Overhead Pack Overhead Next bottleneck: of up to 18% Demonstrated benefit 5000 8 8 8 8 Communication Overhead [%] - CG phase Grid Points per Process (L) 000 1500 2000 Investigating use of nonblocking collectives in a Also model-driven modified CG 32 Data copies and MPI datatypes Impacts choice of communication algorithm (many benchmarks do not provide a relevant measurement) ## **AMG Performance Model** - lower bounds and establish upper and What if a model is performance compare too difficult? We can - Includes contention, multicore penalties bandwidth, - Gahvari, Baker, 82% accuracy on Hera, 98% on Zeus - Schulz, Yang, (earlier this week) Jordan, Gropp **Not Just Collectives** - So why do people see slow communication with regular mesh codes? - One common culprit is the mapping of process interconnect) topology to physical topology (network - Note that this may be quite complex - We have used modeling to determine that a certain kind of random mapping is often preferable for Blue Waters - Avoiding hot-spots on two-level direct networks, Abhinav Bhatele, Nikhil Jain, William Gropp and Laxmikant V. Kale, submitted - One common case is a halo exchange... ### "MPI Communication is too Slow" How often do you hear - or some send or receive "late" to a collective call Often the real problem is is issued late that some process is - "Fix" (used in PETSc and FPMPI2) - Test using - MPI\_Barrier(comm) MPI\_Allreduce(...,comm); - If Barrier time is too hypothesis is that there is load long (what's that) imbalance 34 #### BG/P and Cray XT4 Halo Exchange on - 2048 doubles to each neighbor - Rate is MB/Sec (for all tables) | BG/P | 4 Neighbors | | 8 Neighbors | | |-------------|-------------|-------------|-------------|-------------| | | lrecv/Send | lrecv/Isend | lrecv/Send | lrecv/Isend | | World | 208 | 328 | 184 | 237 | | Even/Odd | 219 | 327 | 172 | 243 | | Cart_create | 301 | 581 | 242 | 410 | | | | | | | | Cray XT4 | 4 Neighbors | | | 8 Neighbors | | |--------------|-------------|-------------|--------|-------------|-------------| | | Irecv/Send | lrecv/Isend | Phased | Irecv/Send | Irecv/Isend | | World | 311 | 908 | 188 | 262 | 269 | | <br>Even/Odd | 257 | 247 | 279 | 212 | 206 | | Cart_create | 265 | 275 | 266 | 236 | 232 | 36 #### **Discovering Performance Opportunities** - Lets look at a single process sending to its neighbors - Based on our performance model, we expect the rate to be sending, not sending and receiving) roughly twice that for the halo (since this test is only | System | 4 neighbors | | 8 Neighbors | | |----------|-------------|----------|-------------|----------| | | | Periodic | | Periodic | | BG/L | 488 | 490 | 389 | 389 | | BG/L, VN | 294 | 294 | 239 | 239 | | BG/P | 1139 | 1136 | 892 | 892 | | BG/P, VN | 468 | 468 | 600 | 601 | | ХТ3 | 1005 | 1007 | 1053 | 1045 | | XT4 | 1634 | 1620 | 1773 | 1770 | | XT4 SN | 1701 | 1701 | 1811 | 1808 | 37 - Isn't this just a collection of tricks? - Yes and no - Yes, a number of different approaches have been applied - No, the same quantitative approach, based bounds, is applied emphasizing a simple model that estimates on getting performance estimates for the resources under consideration and - Quantitative Thinking - ... must be based on having a hypothesis (model), not just measurements - Ratios of a single sender to all processes sending (in rate) - Expect a factor of roughly 2 (since processes must also | System | 4 neighbors | | 8 Neighbors | | |----------|-------------|----------|-------------|----------| | | | Periodic | | Periodic | | BG/L | 2.24 | | 2.01 | | | BG/L, VN | 1.46 | | 1.81 | | | BG/P | 3.8 | | 2.2 | | | BG/P, VN | 2.6 | | 5.5 | | | хт3 | 7.5 | 8.1 | 9.08 | 9.41 | | XT4 | 10.7 | 10.7 | 13.0 | 13.7 | | XT4 SN | 5.47 | 5.56 | 6.73 | 7.06 | | | | | | | BG gives roughly double the halo rate. XTn is much higher - It should be possible to improve the halo exchange on the XT by scheduling the - Or improving the MPI implementation # Performance Models Provide Insight - SpMV, compiler vectorization - Model identifies limits of achievable performance - Using prefetch in SpMV - Abstract model based on hardwaree identifies opportunity, led to new algorithm - Jitter and adapting to runtime - Simple performance model identifies gap in achieved performance, leading to new approaches - Using MPI Datatypes - Simple model suggests benefit; results show either success or problems in MPI implementation - Topology - Simple model identifies performance gaps, even when multiple communication links involves #### Why is Performance Modeling the Key to Extreme Scale? - Measuring yesterday's applications, even with today's runtimes, is often irrelevant - Look at some of the CPU/GPU comparison (Vuduc et - Focus on achievable performance at scale - Architectures are changing rapidly - Further reduces value of measurements on existing - Models permit quantitative evaluation of different approaches and a priori estimation of possible benefit to a major change - Only way to evaluate radical (and necessary!) 41 - Torsten Hoefler - Performance modeling lead Blue Waters; MPI datatype - Garzaran, Saeed Maleki David Padua, Maria - Compiler vectorization - Dahai Guo Streamed format exploiting - Vivek Kale prefetch - SMP work partitioning - Hormozd Gahvari - AMG application modeling - Marc Snir and William - Performance model advocates - Abhinav Bhatele - Elena Caraba Process/node mapping - Van Bui Nonblocking Allreduce in CG - Performance model-based models evaluation of programming - Ankeeth Ved - Model-based updates to NAS benchmarks - Funding provided by: - Blue Waters project (State of Illinois) Illinois and the University of - Department of Energy, Office of Science - National Science Foundation