

### PAPI - PERFORMANCE API

ANDRÉ PEREIRA ampereira@di.uminho.pt

- Application and function execution time is easy to measure
  - time
  - gprof
  - valgrind (callgrind)
- It is enough to identify bottlenecks, but...
  - Why is it slow?
  - How does the code behave?

### Efficient algorithms should take into account

- Cache behaviour
- Memory and resource contention
- Floating point efficiency
- Branch behaviour

### HW Performance Counters

- Hardware designers added specialised registers to measure various aspects of a microprocessor
- Generally, they provide an insight into
  - Timings
  - Cache and branch behaviour
  - Memory access patterns
  - Pipeline behaviour
  - FP performance
  - IPC
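Raw counter values become useful once they are combined into derived metrics such as IPC or a cache miss rate. The sketch below illustrates that arithmetic only; the counts are hypothetical numbers chosen for the example (not measurements from any machine), and the helper names `ipc` and `miss_rate` are illustrative, not part of PAPI.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle for a measured code region."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def miss_rate(misses: int, accesses: int) -> float:
    """Fraction of accesses that missed (0.0 when nothing was accessed)."""
    if accesses == 0:
        return 0.0
    return misses / accesses


# Hypothetical raw counts, as if read from hardware counters for one region.
counts = {
    "instructions": 8_000_000,
    "cycles": 4_000_000,
    "l1_data_accesses": 2_000_000,
    "l1_data_misses": 50_000,
}

print(f"IPC           = {ipc(counts['instructions'], counts['cycles']):.2f}")
print(f"L1D miss rate = {miss_rate(counts['l1_data_misses'], counts['l1_data_accesses']):.2%}")
```

A low IPC together with a high miss rate would point towards memory behaviour rather than arithmetic as the likely bottleneck, which is the kind of question execution time alone cannot answer.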

### What is PAPI?

Interface to interact with performance counters

- With minimal overhead
- Portable across several platforms
- Provides utility tools and C and Fortran APIs
  - Platform and counter information

# PAPI Organisation



André Pereira, UMinho, 2018/2019

# Supported Platforms

### Mainstream platforms (Linux)

- x86, x86\_64 Intel and AMD
- ARM, MIPS
- Intel Itanium II
- IBM PowerPC

### Utilities

#### papi\_avail

Terminal: ampereira@compute-552-2:~/tools/papi-gcc4.9.0/bin (ssh)

| Field                    | Value                                          |
|--------------------------|------------------------------------------------|
| PAPI Version             | 5.3.2.0                                        |
| Vendor string and code   | GenuineIntel (1)                               |
| Model string and code    | Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (62) |
| CPU Revision             | 4.000000                                       |
| CPUID Info               | Family: 6 Model: 62 Stepping: 4                |
| CPU Max Megahertz        | 2501                                           |
| CPU Min Megahertz        | 1200                                           |
| Hdw Threads per core     | 2                                              |
| Cores per Socket         | 10                                             |
| Sockets                  | 2                                              |
| NUMA Nodes               | 2                                              |
| CPUs per Node            | 20                                             |
| Total CPUs               | 40                                             |
| Running in a VM          | no                                             |
| Number Hardware Counters | 11                                             |
| Max Multiplex Counters   | 32                                             |

| Name         | Code       | Avail | Deriv | Description (Note)                                 |
|--------------|------------|-------|-------|----------------------------------------------------|
| PAPI_L1_DCM  | 0x80000000 | Yes   | No    | Level 1 data cache misses                          |
| PAPI_L1_ICM  | 0x80000001 | Yes   | No    | Level 1 instruction cache misses                   |
| PAPI_L2_DCM  | 0x80000002 | Yes   | Yes   | Level 2 data cache misses                          |
| PAPI_L2_ICM  |            | Yes   | No    | Level 2 instruction cache misses                   |
| PAPI_L3_DCM  | 0x80000004 | No    | No    | Level 3 data cache misses                          |
| PAPI_L3_ICM  |            | No    | No    | Level 3 instruction cache misses                   |
| PAPI_L1_TCM  | 0x80000006 | Yes   | Yes   | Level 1 cache misses                               |
| PAPI_L2_TCM  |            | Yes   | No    | Level 2 cache misses                               |
| PAPI_L3_TCM  |            | Yes   | No    | Level 3 cache misses                               |
| PAPI_CA_SNP  |            | No    | No    | Requests for a snoop                               |
| PAPI_CA_SHR  |            | No    | No    | Requests for exclusive access to shared cache line |
| PAPI_CA_CLN  |            | No    | No    | Requests for exclusive access to clean cache line  |
| PAPI_CA_INV  |            | No    | No    | Requests for cache line invalidation               |
| PAPI_CA_ITV  |            | No    | No    | Requests for cache line intervention               |
| PAPI_L3_LDM  |            | No    | No    | Level 3 load misses                                |
| PAPI_L3_STM  |            | No    | No    | Level 3 store misses                               |
| PAPI_BRU_IDL |            | No    | No    | Cycles branch units are idle                       |
| PAPI_FXU_IDL |            | No    | No    | Cycles integer units are idle                      |
| PAPI_FPU_IDL |            | No    | No    | Cycles floating point units are idle               |
| PAPI_LSU_IDL |            | No    | No    | Cycles load/store units are idle                   |
| PAPI_TLB_DM  |            | Yes   | Yes   | Data translation lookaside buffer misses           |
| PAPI_TLB_IM  | 0x80000015 | Yes   | No    | Instruction translation lookaside buffer misses    |
| PAPI_TLB_TL  | 0x80000016 | No    | No    | Total translation lookaside buffer misses          |
| PAPI_L1_LDM  | 0x80000017 | Yes   | No    | Level 1 load misses                                |
| PAPI_L1_STM  | 0x80000018 | Yes   | No    | Level 1 store misses                               |
| PAPI_L2_LDM  | 0x80000019 | No    | No    | Level 2 load misses                                |
| PAPI_L2_STM  | 0x8000001a | Yes   | No    | Level 2 store misses                               |
| PAPI_BTAC_M  | 0x8000001b | No    | No    | Branch target address cache misses                 |
| PAPI_PRF_DM  | 0x8000001c | No    | No    | Data prefetch cache misses                         |
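Since only events marked as available can actually be counted on a given machine, it can help to filter the papi\_avail listing programmatically. The following is a minimal sketch that assumes the plain-text layout shown in this listing (whitespace-separated Name, Code, Avail, Deriv and Description columns); other PAPI versions may print a different layout. The function name `parse_presets` and the `SAMPLE` text are illustrative, with the sample rows copied from the listing.

```python
import re

# Matches preset rows such as:
#   PAPI_L1_DCM  0x80000000  Yes   No   Level 1 data cache misses
ROW = re.compile(
    r"^(PAPI_\w+)\s+(0x[0-9a-fA-F]+)\s+(Yes|No)\s+(Yes|No)\s+(.*\S)\s*$"
)


def parse_presets(text):
    """Return one dict per preset event row found in papi_avail-style text."""
    events = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, blank, or system-information line
        name, code, avail, deriv, description = match.groups()
        events.append({
            "name": name,
            "code": int(code, 16),
            "available": avail == "Yes",
            "derived": deriv == "Yes",
            "description": description,
        })
    return events


SAMPLE = """\
Name        Code        Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000  Yes   No    Level 1 data cache misses
PAPI_L2_DCM 0x80000002  Yes   Yes   Level 2 data cache misses
PAPI_L3_DCM 0x80000004  No    No    Level 3 data cache misses
"""

usable = [e["name"] for e in parse_presets(SAMPLE) if e["available"]]
print(usable)
```

With the three sample rows, only the Level 1 and Level 2 data cache miss events are kept, because PAPI\_L3\_DCM is listed as unavailable on this machine.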

#### papi\_native\_avail

Terminal: ampereira@compute-552-2:~/tools/papi-gcc4.9.0/bin (ssh)

| Event                | Unit mask / modifier | Description                                                                                    |
|----------------------|----------------------|------------------------------------------------------------------------------------------------|
| TLB_ACCESS           |                      | TLB access                                                                                     |
|                      | :STLB_HIT            | Number of load operations that missed L1TLB but hit L2TLB                                      |
|                      | :LOAD_STLB_HIT       | Number of load operations that missed L1TLB but hit L2TLB                                      |
|                      | :e=0                 | edge level (may require counter-mask >= 1)                                                     |
|                      | :i=0                 | invert                                                                                         |
|                      | :c=0                 | counter-mask in range [0-255]                                                                  |
|                      | :t=0                 | measure any thread                                                                             |
|                      | :u=0                 | monitor at user level                                                                          |
|                      | :k=0                 | monitor at kernel level                                                                        |
| TLB_FLUSH            |                      | TLB flushes                                                                                    |
|                      | :DTLB_THREAD         | Number of DTLB flushes of thread-specific entries                                              |
|                      | :STLB_               | Number of STLB flushes                                                                         |
|                      | :e=0                 | edge level (may require counter-mask >= 1)                                                     |
|                      | :i=0                 | invert                                                                                         |
|                      | :c=0                 | counter-mask in range [0-255]                                                                  |
|                      | :t=0                 | measure any thread                                                                             |
|                      | :u=0                 | monitor at user level                                                                          |
|                      | :k=0                 | monitor at kernel level                                                                        |
| UNHALTED_CORE_CYCLES |                      | Count core clock cycles whenever the clock signal on the specific core is running (not halted) |
|                      | :e=0                 | edge level (may require counter-mask >= 1)                                                     |
|                      | :i=0                 | invert                                                                                         |
|                      | :c=0                 | counter-mask in range [0-255]                                                                  |
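The listing shows each native event followed by its unit masks (for example `:STLB_HIT`) and its modifiers (for example `:u=0`). The sketch below assumes that a fully qualified native event name is written by joining these pieces with colons, as in `TLB_ACCESS:STLB_HIT:u=1:k=0`; that convention is an assumption here and should be checked against the PAPI version in use. The helper `split_native_event` is hypothetical and only illustrates how such a string decomposes.

```python
def split_native_event(spec):
    """Split 'EVENT:UMASK:key=value' into (event, unit_masks, modifiers)."""
    event, *parts = spec.split(":")
    if not event:
        raise ValueError("missing event name")
    unit_masks = []
    modifiers = {}
    for part in parts:
        if "=" in part:
            # Modifier such as u=1 (user level) or c=0 (counter-mask).
            key, value = part.split("=", 1)
            modifiers[key] = int(value)
        else:
            # Anything without '=' is treated as a unit mask name.
            unit_masks.append(part)
    return event, unit_masks, modifiers


print(split_native_event("TLB_ACCESS:STLB_HIT:u=1:k=0"))
```

An event without unit masks, such as `UNHALTED_CORE_CYCLES`, simply yields an empty unit mask list and an empty modifier dictionary.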


#### papi\_event\_chooser

[Terminal screenshot: `[ampereira@compute-552-2 bin]$ ./papi_event_chooser PRESET PAPI_FP_OPS`. The output starts with "Event Chooser: Available events which can be added with given events.", followed by the system information (PAPI Version 5.3.2.0, GenuineIntel (1), Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (62), 11 hardware counters, 32 multiplex counters) and a table of compatible preset events with the columns Name, Code, Deriv and Description (Note).]
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| CPU Min Megahertz       : 1200         Hdw Threads per core       : 2         Cores per Socket       : 10         Sockets       : 2         NUMA Nodes       : 2         Otal CPUs       : 40         Running in a VM       : no         Number Hardware Counters       : 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                     | CPUID Info               |       |                                                       |  |  |  |
| Hdw Threads per core       : 2         Cores per Socket       : 10         Sockets       : 2         (PUs per Node       : 20         Total (PUs       : 40         Running in a W       : no         Number Hardware Counters       : 11         Max Multiplex Counters       : 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                     | CPU Max Megahertz        |       |                                                       |  |  |  |
| Cores per Socket : 10<br>Sockets : 2<br>NUMA Nodes : 2<br>CPUs per Node : 20<br>Total CPUs : 40<br>Running in a VM : no<br>Number Hardware Counters : 32<br>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                     | CPU Min Megahertz        | : 12  | 00                                                    |  |  |  |
| Sockets       : 2         NUMA Nodes       : 2         CPUs per Node       : 20         Total CPUs       : 40         Running in a VM       : no         Number Hardware Counters       : 11         Max Multiplex Counters       : 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| NUMA Nodes: 2CPUs per Node: 20Total CPUs: 40Running in a VM: noNumber Hardware Counters: 11Max Multiplex Counters: 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| CPUs per Node       : 20         Total CPUs       : 40         Running in a VM       : no         Number Hardware Counters : 11         Max Multiplex Counters : 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| Total CPUs: : 40Running in a VM: noNumber Hardware Counters: 11Max Multiplex Counters: 32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| Running in a VM : no<br>Number Hardware Counters : 11<br>Max Multiplex Counters : 32<br>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| Number Hardware Counters : 11<br>Max Multiplex Counters : 32<br>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| Max Multiplex Counters       : 32         Name       Code       Deriv Description (Note)         PAPI_LL1_DCM       0x80000000       No       Level 1 data cache misses         PAPI_L1_ICM       0x80000001       No       Level 1 instruction cache misses         PAPI_L2_IOM       0x80000007       No       Level 2 instruction cache misses         PAPI_L2_TOM       0x80000007       No       Level 2 cache misses         PAPI_L3_TOM       0x80000007       No       Level 2 cache misses         PAPI_L1_BIM       0x80000017       No       Level 3 cache misses         PAPI_L1_LDM       0x80000017       No       Level 1 load misses         PAPI_L1_STM       0x80000018       No       Level 1 store misses         PAPI_L2_STM       0x80000010       No       Level 2 store misses         PAPI_STL_LCY 0x80000025       No       Conditional branch instructions         PAPI_BR_NTK       0x80000026       No       Conditional branch instructions         PAPI_BR_NTK       0x80000022       No       Conditional branch instructions         PAPI_ST_BR_NTS       0x80000032       No       Instructions         PAPI_ST_STNS       0x80000032       No       Load instructions         PAPI_ST_STNS       0x8000                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| NameCodeDerivDescription (Note)PAPI_L1_DCM0x80000000NoLevel 1 data cache missesPAPI_L1_ICM0x80000001NoLevel 1 instruction cache missesPAPI_L2_ICM0x80000007NoLevel 2 instruction cache missesPAPI_L2_TCM0x80000007NoLevel 2 cache missesPAPI_L3_TCM0x80000008NoLevel 3 cache missesPAPI_L1_DM0x80000015NoInstruction translation lookaside buffer missesPAPI_L1_DM0x80000015NoLevel 1 store missesPAPI_L2_STM0x80000018NoLevel 2 store missesPAPI_L2_STM0x80000018NoLevel 2 store missesPAPI_STL_ICY0x80000025NoConditional branch instruction sout takenPAPI_BR_CN0x80000026NoConditional branch instructions not takenPAPI_BR_NTK0x80000024NoConditional branch instructions mispredictedPAPI_SP_TS0x80000034YesFloating point instructionsPAPI_SR_INS0x80000035NoLoad instructionsPAPI_SR_INS0x80000036NoStore instructionsPAPI_SR_INS0x80000036NoStore instructionsPAPI_SR_INS0x80000037NoBranch instructionsPAPI_SR_INS0x80000036NoStore instructionsPAPI_SR_INS0x80000036NoStore instructionsPAPI_SR_INS0x80000037NoBranch instructionsPAPI_L2_DCA0x80000044NoLevel 2 data cache accesses <td></td> <td></td> <td></td>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L1_DCM0x80000000NoLevel 1 data cache missesPAPI_L1_ICM0x80000001NoLevel 1 instruction cache missesPAPI_L2_ICM0x80000003NoLevel 2 cache missesPAPI_L3_TCM0x80000007NoLevel 2 cache missesPAPI_L3_TCM0x80000017NoLevel 3 cache missesPAPI_L1_LM0x80000017NoLevel 1 load missesPAPI_L1_LDM0x80000018NoLevel 1 store missesPAPI_L1_STM0x80000018NoLevel 2 store missesPAPI_STL_ICY0x80000018NoLevel 2 store missesPAPI_STL_ICY0x80000025NoCycles with no instruction issuePAPI_BR_CN0x80000025NoConditional branch instructions not takenPAPI_BR_NTK0x80000026NoConditional branch instructionsPAPI_LTO_INS0x80000027NoInstructionsPAPI_BR_NTK0x80000028NoConditional branch instructionsPAPI_ST_TOT_INS0x80000024NoConditional branch instructionsPAPI_BR_NTK0x80000035NoLoad instructionsPAPI_ST_LD_INS0x80000036NoStore instructionsPAPI_L2_CA0x80000035NoLevel 2 data cache accessesPAPI_L2_DCA0x80000044NoLevel 2 data cache readsPAPI_L2_DCA0x80000044NoLevel 2 data cache writesPAPI_L3_DCW0x80000045NoLevel 2 data cache writesPAPI_L3_DCW0x80000044NoLevel 2 instruction cache hits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| Name Code               | Deriv | Description (Note)                              |  |  |  |
| PAPI_L1_DCM 0x80000000  | No    | Level 1 data cache misses                       |  |  |  |
| PAPI_L1_ICM 0x80000001  | No    | Level 1 instruction cache misses                |  |  |  |
| PAPI_L2_ICM 0x80000003  | No    | Level 2 instruction cache misses                |  |  |  |
| PAPI_L2_TCM 0x80000007  | No    | Level 2 cache misses                            |  |  |  |
| PAPI_L3_TCM 0x80000008  | No    | Level 3 cache misses                            |  |  |  |
| PAPI_TLB_IM 0x80000015  | No    | Instruction translation lookaside buffer misses |  |  |  |
| PAPI_L1_LDM 0x80000017  | No    | Level 1 load misses                             |  |  |  |
| PAPI_L1_STM 0x80000018  | No    | Level 1 store misses                            |  |  |  |
| PAPI_L2_STM 0x8000001a  | No    | Level 2 store misses                            |  |  |  |
| PAPI_STL_ICY 0x80000025 | No    | Cycles with no instruction issue                |  |  |  |
| PAPI_BR_CN 0x8000002b   | No    | Conditional branch instructions                 |  |  |  |
| PAPI_BR_NTK 0x8000002d  | No    | Conditional branch instructions not taken       |  |  |  |
| PAPI_BR_MSP 0x8000002e  | No    | Conditional branch instructions mispredicted    |  |  |  |
| PAPI_TOT_INS 0x80000032 | No    | Instructions completed                          |  |  |  |
| PAPI_FP_INS 0x80000034  | Yes   | Floating point instructions                     |  |  |  |
| PAPI_LD_INS 0x80000035  | No    | Load instructions                               |  |  |  |
| PAPI_SR_INS 0x80000036  | No    | Store instructions                              |  |  |  |
| PAPI_BR_INS 0x80000037  | No    | Branch instructions                             |  |  |  |
| PAPI_TOT_CYC 0x8000003b | No    | Total cycles                                    |  |  |  |
| PAPI_L2_DCA 0x80000041  | No    | Level 2 data cache accesses                     |  |  |  |
| PAPI_L2_DCR 0x80000044  | No    | Level 2 data cache reads                        |  |  |  |
| PAPI_L3_DCR 0x80000045  | No    | Level 3 data cache reads                        |  |  |  |
| PAPI_L2_DCW 0x80000047  | No    | Level 2 data cache writes                       |  |  |  |
| PAPI_L3_DCW 0x80000048  | No    | Level 3 data cache writes                       |  |  |  |
| PAPI_L2_ICH 0x8000004a  | No    | Level 2 instruction cache hits                  |  |  |  |
| PAPI_L2_ICA 0x8000004d  | No    | Level 2 instruction cache accesses              |  |  |  |
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_SR_INS0x80000036NoStore instructionsPAPI_BR_INS0x80000037NoBranch instructionsPAPI_TOT_CYC0x8000003bNoTotal cyclesPAPI_L2_DCA0x80000041NoLevel 2 data cache accessesPAPI_L2_DCR0x80000044NoLevel 2 data cache readsPAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x8000004aNoLevel 2 instruction cache hitsPAPI_L2_ICA0x8000004dNoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_BR_INS0x80000037NoBranch instructionsPAPI_TOT_CYC0x8000003bNoTotal cyclesPAPI_L2_DCA0x80000041NoLevel 2 data cache accessesPAPI_L2_DCR0x80000044NoLevel 2 data cache readsPAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x8000004aNoLevel 2 instruction cache hitsPAPI_L2_ICA0x8000004dNoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_TOT_CYC0x8000003bNoTotal cyclesPAPI_L2_DCA0x80000041NoLevel 2 data cache accessesPAPI_L2_DCR0x80000044NoLevel 2 data cache readsPAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x80000048NoLevel 2 instruction cache hitsPAPI_L2_ICA0x80000044NoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L2_DCA0x80000041NoLevel 2 data cache accessesPAPI_L2_DCR0x80000044NoLevel 2 data cache readsPAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x80000048NoLevel 2 instruction cache hitsPAPI_L2_ICA0x80000044NoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L2_DCR0x80000044NoLevel 2 data cache readsPAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x80000048NoLevel 2 instruction cache hitsPAPI_L2_ICA0x80000044NoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L3_DCR0x80000045NoLevel 3 data cache readsPAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x8000004aNoLevel 2 instruction cache hitsPAPI_L2_ICA0x8000004dNoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L2_DCW0x80000047NoLevel 2 data cache writesPAPI_L3_DCW0x80000048NoLevel 3 data cache writesPAPI_L2_ICH0x8000004aNoLevel 2 instruction cache hitsPAPI_L2_ICA0x8000004dNoLevel 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L3_DCW 0x80000048 No Level 3 data cache writes<br>PAPI_L2_ICH 0x8000004a No Level 2 instruction cache hits<br>PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L2_ICH 0x8000004a No Level 2 instruction cache hits<br>PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
| PAPI_L2_ICA 0x8000004d No Level 2 instruction cache accesses                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                     |                          |       |                                                       |  |  |  |


- Preset events
  - Standard events defined by PAPI for all platforms (availability still depends on the hardware)
    - PAPI\_TOT\_INS
- Native events
  - Platform-dependent events
    - L3\_CACHE\_MISS
- Derived events
  - Preset events that are derived from multiple native events
    - PAPI\_L1\_TCM may be L1 data misses + L1 instruction misses

### PAPI High-level Interface

- Calls the low-level API
- Easier to use
- Enough for coarse-grained measurements
  - You will not optimise code based on the number of L2 TLB flushes per thread...
- For preset events only!

### The Basics

- PAPI\_start\_counters
- PAPI\_stop\_counters

```
/* Classic high-level API (PAPI 5.x); it was removed in PAPI 6.0 */
#include "papi.h"
#define NUM_EVENTS 2

long long values[NUM_EVENTS];
int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
int retval;

/* Start the counters */
retval = PAPI_start_counters(Events, NUM_EVENTS);

/* What we are monitoring... */
do_work();

/* Stop counters and store results in values */
retval = PAPI_stop_counters(values, NUM_EVENTS);
```

### PAPI Low-level Interface

- Increased efficiency and functionality
- More information about the environment
- Concepts to check later
  - EventSet
  - Multiplexing





#### The Basics

```
#include "papi.h"
#define NUM_EVENTS 2

int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
int EventSet = PAPI_NULL;  /* must be PAPI_NULL before PAPI_create_eventset */
long long values[NUM_EVENTS];
int retval;

/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add floating-point instructions and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);
/* What we want to monitor */
do_work();
/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
```

- PAPI is also available for CUDA GPUs
- Uses CUPTI (CUDA Profiling Tools Interface)
  - CUPTI determines which counters can be accessed directly
  - Define a file with the counters and an environment variable
- Gives useful information about the GPU usage
  - IPC
  - Memory loads/stores/throughput
  - Branch divergences
  - SM(X) occupancy

- The whole application?
- PAPI's usefulness is limited when used alone
  - Combine it with other profilers
  - Bottleneck identification + characterisation

#### A Practical Example

```
for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++)
        for (int k = 0; k < SIZE; k++)
            c[i][j] += a[i][k] * b[k][j];
```

### A Practical Example: SGEMM

```
float sum;  /* single-precision accumulator, as SGEMM operates on floats */

for (int i = 0; i < SIZE; i++)
    for (int j = 0; j < SIZE; j++) {
        sum = 0;
        for (int k = 0; k < SIZE; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
```
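When checking floating-point counters against this kernel, it helps to know how many operations to expect. The sketch below is a self-contained variant of the loop above, assuming square row-major `float` matrices stored in flat arrays; the names `sgemm_naive` and `sgemm_expected_flops` are hypothetical and not part of the original slides. The analytical count assumes one multiply and one add per inner iteration; values reported by hardware counters may differ, for example when the compiler vectorises the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two n x n row-major float matrices: c = a * b.
   Uses a scalar accumulator, as in the kernel above. */
void sgemm_naive(size_t n, const float *a, const float *b, float *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Analytical FLOP count: one multiply and one add per inner iteration,
   and the inner statement runs n * n * n times. */
long long sgemm_expected_flops(size_t n)
{
    return 2LL * (long long)n * (long long)n * (long long)n;
}
```

For `n = 1024` the expected count is 2 x 1024^3 floating-point operations, which can be compared with the value measured for the region around the call.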

#### Execution Time

Measured on 2x Intel Xeon E5-2695 v2 (12 cores and 24 threads each) at 2.4 GHz.

#### FLOPs

#### Cache Miss Rate

#### Arithmetic Intensity


- Instruction mix
  - PAPI\_FP\_INS
  - PAPI\_SR\_INS, PAPI\_LD\_INS
  - PAPI\_BR\_INS
  - PAPI\_VEC\_SP, PAPI\_VEC\_DP
- FLOPS and operational intensity
  - PAPI\_FP\_OPS
  - PAPI\_SP\_OPS, PAPI\_DP\_OPS
  - PAPI\_TOT\_INS
- Cache behaviour and bytes transferred
  - PAPI\_L1\_TCM, PAPI\_L2\_TCM, PAPI\_L3\_TCM
  - PAPI\_L1\_TCA
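Raw counter values become useful once they are combined into ratios. The sketch below shows one possible set of helpers operating on values already read from the counters listed above; the function names are hypothetical, and the operational intensity helper assumes that the bytes transferred from memory can be approximated by last-level cache misses multiplied by the cache line size, which is a simplification.

```c
#include <assert.h>

/* Instructions per cycle, e.g. from PAPI_TOT_INS and PAPI_TOT_CYC. */
double ipc(long long tot_ins, long long tot_cyc)
{
    assert(tot_cyc > 0);
    return (double)tot_ins / (double)tot_cyc;
}

/* Miss rate of one cache level, e.g. PAPI_L1_TCM / PAPI_L1_TCA. */
double miss_rate(long long misses, long long accesses)
{
    assert(accesses > 0);
    return (double)misses / (double)accesses;
}

/* Floating-point throughput, e.g. PAPI_FP_OPS over the elapsed time. */
double flops_per_second(long long fp_ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)fp_ops / seconds;
}

/* Operational intensity in FLOP/byte.
   Assumption: bytes moved = last-level misses (e.g. PAPI_L3_TCM)
   multiplied by the cache line size in bytes. */
double operational_intensity(long long fp_ops, long long llc_misses,
                             int line_bytes)
{
    assert(llc_misses > 0 && line_bytes > 0);
    return (double)fp_ops / ((double)llc_misses * (double)line_bytes);
}
```

For instance, 25 misses out of 100 accesses give `miss_rate(25, 100) == 0.25`.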

- Be careful choosing a measurement heuristic
  - Q: Why? Average? Median? Best measurement?
- Automate the measurement process
  - With scripting/C++ coding
  - Using third-party tools that rely on PAPI
    - PerfSuite
    - HPCToolkit
    - TAU
    - VTune
- Available for Java and on virtual machines
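To make the choice of heuristic concrete, the sketch below computes the best (smallest) and the median value of a set of repeated measurements, for example cycle counts from several runs of the same region. The helper names `best_of` and `median_of` are hypothetical; which heuristic suits a given experiment is left open, as in the question above.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort over long long values. */
static int cmp_ll(const void *x, const void *y)
{
    long long a = *(const long long *)x;
    long long b = *(const long long *)y;
    return (a > b) - (a < b);
}

/* Smallest of n samples ("best measurement"). */
long long best_of(const long long *samples, size_t n)
{
    assert(samples && n > 0);
    long long best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Median of n samples. Sorts the array in place; for an even n it
   returns the midpoint of the two middle values (integer division). */
long long median_of(long long *samples, size_t n)
{
    assert(samples && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_ll);
    if (n % 2 == 1)
        return samples[n / 2];
    long long lo = samples[n / 2 - 1];
    long long hi = samples[n / 2];
    return lo + (hi - lo) / 2;
}
```

With the samples 120, 100, 900, 110 and 105, the best value is 100 and the median is 110, while the single outlier pulls the arithmetic mean up to 267.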

- Use the same GCC/G++ version that was used to build PAPI
  - Your own PAPI build in your home directory
  - The PAPI installation available at the cluster
- Set up the environment
  - module load gcc/5.3.0
  - module load papi/5.4.1
  - Add -I/share/apps/papi/5.4.1/include and -L/share/apps/papi/5.4.1/lib to the compilation if PAPI is not recognised
- Code compilation: g++ -O3 c.cpp -lpapi

#### Hands-on

- Assess the available counters on a node (interactive qsub)
  - qsub -I -qmei -lnodes=1,walltime=10:00
- Perform the FLOPs and miss rate measurements interactively
  - https://bitbucket.org/ampereira/papi/downloads
