Thanks for using Compiler Explorer
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <tuple>
#include <typeinfo>

#include <omp.h>

// Much simplified
struct double2 { double x, y; };
struct dim3 {
  unsigned int x, y, z;
  constexpr dim3(unsigned int x = 1u, unsigned int y = 1u, unsigned int z = 1u) : x(x), y(y), z(z) {}
};

//
// include/complex_quda.h
//
template <typename ValueType> struct complex;
template <> struct complex<double> : public double2 {
  typedef double value_type;
  inline complex<double>() {};
  constexpr complex<double>(const double &re, const double &im = double()) : double2{re, im} {}
  constexpr double real() const { return x; }
  constexpr double imag() const { return y; }
  inline void real(double re) { x = re; }
  inline void imag(double im) { y = im; }
  // cast operators
  template <typename T> inline operator complex<T>() const
  {
    return complex<T>(static_cast<T>(real()), static_cast<T>(imag()));
  }
};

//
// TARGET device
//
namespace device
{
  constexpr int warp_size() { return 8; }

  template <int block_size_y = 1, int block_size_z = 1> constexpr unsigned int max_block_size()
  {
    return std::max(warp_size(), 512 / (block_size_y * block_size_z));
  }

  unsigned int processor_count()
  {
    static int m = -1;
    if (m < 0) {
#pragma omp target map(tofrom : m)
      m = omp_get_num_procs();
    }
    return m / warp_size();
  }

  unsigned int max_threads_per_block()
  {
    static int m = -1;
    if (m < 0) {
#pragma omp target teams map(tofrom : m)
      if (omp_get_team_num() == 0) m = omp_get_max_threads();
    }
    return m;
  }

  unsigned int max_threads_per_block_dim(int i) { return max_threads_per_block(); }
  unsigned int max_grid_size(int i) { return 65536; }
}

struct Timer {
  double t;
  void start() { t = omp_get_wtime(); }
  void stop() { t = omp_get_wtime() - t; }
  double last() { return t; }
};
typedef Timer host_timer_t;
typedef Timer device_timer_t;

//
// include/tune_key.h
//
struct TuneKey {
  static const int volume_n = 32;
  static const int name_n = 512;
  static const int aux_n = 256;
  char volume[volume_n];
  char name[name_n];
  TuneKey() {}
  TuneKey(const char v[], const char n[])
  {
    strcpy(volume, v);
    strcpy(name, n);
  }
  /*
  bool operator<(const TuneKey &other) const
  {
    int vc = std::strcmp(volume, other.volume);
    if (vc < 0) {
      return true;
    } else if (vc == 0) {
      int nc = std::strcmp(name, other.name);
      if (nc < 0) { return true; }
      return false;
    }
  */
  friend std::ostream &operator<<(std::ostream &output, const TuneKey &key)
  {
    output << "volume = " << key.volume << ", ";
    output << "name = " << key.name << ", ";
    return output;
  }
};

//
// include/tune_quda.h
//
struct TuneParam {
  dim3 block;
  dim3 grid;
  float time;

  friend std::ostream &operator<<(std::ostream &output, const TuneParam &param)
  {
    output << "block=(" << param.block.x << "," << param.block.y << "," << param.block.z << "), ";
    output << "grid=(" << param.grid.x << "," << param.grid.y << "," << param.grid.z << ")";
    return output;
  }
};

struct Tunable {
  virtual long long flops() const { return 0; }
  virtual long long bytes() const { return 0; }
  virtual unsigned int minThreads() const { return 1; }
  virtual bool tuneGridDim() const { return true; }

  virtual bool advanceGridDim(TuneParam &param) const
  {
    if (tuneGridDim()) {
      const int step = gridStep();
      param.grid.x += step;
      if (param.grid.x > maxGridSize()) {
        param.grid.x = minGridSize();
        return false;
      } else {
        return true;
      }
    } else {
      return false;
    }
  }

  virtual unsigned int maxBlockSize(const TuneParam &param) const
  {
    return device::max_threads_per_block() / (param.block.y * param.block.z);
  }
  virtual unsigned int maxGridSize() const { return 2 * device::processor_count(); }
  virtual unsigned int minGridSize() const { return 1; }
  virtual int gridStep() const { return 1; }
  virtual int blockStep() const;
  virtual int blockMin() const;

  virtual void resetBlockDim(TuneParam &param) const
  {
    if (tuneGridDim()) {
      param.block.x = blockMin();
    } else { // not tuning the grid dimension so have to set a valid grid size
      const auto step = blockStep();
      const auto max_threads = maxBlockSize(param);
      const auto max_blocks = device::max_grid_size(0);

      // ensure the blockDim is large enough given the limit on gridDim
      param.block.x = (minThreads() + max_blocks - 1) / max_blocks;
      param.block.x = ((param.block.x + step - 1) / step) * step; // round up to nearest step size
      if (param.block.x > max_threads && param.block.y == 1 && param.block.z == 1) {
        printf("Local lattice volume is too large for device\n");
        exit(1);
      }
    }
  }

  virtual bool advanceBlockDim(TuneParam &param) const
  {
    const unsigned int max_threads = maxBlockSize(param);
    bool ret;

    param.block.x += blockStep();
    unsigned int nthreads = param.block.x * param.block.y * param.block.z;
    if (param.block.x > max_threads || nthreads > device::max_threads_per_block()) {
      resetBlockDim(param);
      ret = false;
    } else {
      ret = true;
    }

    if (!tuneGridDim()) param.grid.x = (minThreads() + param.block.x - 1) / param.block.x;
    return ret;
  }

  char vol[TuneKey::volume_n];

  virtual TuneKey tuneKey() const = 0;
  virtual void apply() = 0;
  virtual void preTune() {}
  virtual void postTune() {}
  virtual int tuningIter() const { return 32; }

  virtual std::string paramString(const TuneParam &param) const
  {
    std::stringstream ps;
    ps << param;
    return ps.str();
  }

  virtual std::string perfString(float time) const
  {
    float gflops = flops() / (1e9 * time);
    float gbytes = bytes() / (1e9 * time);
    std::stringstream ss;
    ss << std::setiosflags(std::ios::fixed) << std::setprecision(2) << gflops << " Gflop/s, ";
    ss << gbytes << " GB/s";
    return ss.str();
  }

  virtual void initTuneParam(TuneParam &param) const
  {
    const unsigned int max_threads = device::max_threads_per_block_dim(0);
    const unsigned int max_blocks = device::max_grid_size(0);
    const int min_grid_size = minGridSize();
    const int min_block_size = blockMin();

    if (tuneGridDim()) {
      param.block = dim3(min_block_size, 1, 1);
      param.grid = dim3(min_grid_size, 1, 1);
    } else {
      // find the minimum valid blockDim
      param.block = dim3((minThreads() + max_blocks - 1) / max_blocks, 1, 1);
      param.block.x = ((param.block.x + min_block_size - 1) / min_block_size)
        * min_block_size; // round up to the nearest multiple of desired minimum block size
      if (param.block.x > max_threads) {
        printf("Local lattice volume is too large for device\n");
        exit(1);
      }
      param.grid = dim3((minThreads() + param.block.x - 1) / param.block.x, 1, 1);
    }
  }

  virtual void defaultTuneParam(TuneParam &param) const
  {
    initTuneParam(param);
    if (tuneGridDim()) param.grid.x = maxGridSize(); // don't set y and z in case derived initTuneParam has
  }

  virtual bool advanceTuneParam(TuneParam &param) const { return advanceBlockDim(param) || advanceGridDim(param); }
};

//
// TARGET specific
//
#define QUDA_RT_CONSTS \
  const dim3 \
    blockDim = launch_param.block, \
    gridDim = launch_param.grid, \
    threadIdx(omp_get_thread_num() % launch_param.block.x, \
              (omp_get_thread_num() / launch_param.block.x) % launch_param.block.y, \
              omp_get_thread_num() / (launch_param.block.x * launch_param.block.y)), \
    blockIdx(omp_get_team_num() % launch_param.grid.x, \
             (omp_get_team_num() / launch_param.grid.x) % launch_param.grid.y, \
             omp_get_team_num() / (launch_param.grid.x * launch_param.grid.y))

bool invalid_launch_param(TuneParam &param)
{
  const int gd = param.grid.x * param.grid.y * param.grid.z;
  const int ld = param.block.x * param.block.y * param.block.z;
  int gn = 0, ln = 0;
#pragma omp target teams num_teams(gd) thread_limit(ld) map(tofrom : gn, ln)
#pragma omp parallel num_threads(ld)
  {
    if (omp_get_team_num() == 0 && omp_get_thread_num() == 0) {
      gn = omp_get_num_teams();
      ln = omp_get_num_threads();
    }
  }
  return gn != gd || ln != ld;
}

struct LaunchParam {
  dim3 block;
  dim3 grid;
};
LaunchParam launch_param;
#pragma omp declare target to(launch_param)

void qudaSetupLaunchParameter(const TuneParam &tp)
{
  launch_param.grid = tp.grid;
  launch_param.block = tp.block;
#pragma omp target update to(launch_param)
}

struct ColorSpinorField {
  int nColor, nSpin;
  size_t volumeCB;
  size_t length;
  size_t bytes;
  void *v;
  bool alloc;
  char vol_string[TuneKey::volume_n];

  ColorSpinorField(int nColor, int nSpin, size_t volumeCB) :
    nColor(nColor), nSpin(nSpin), volumeCB(volumeCB), length(volumeCB * nColor * nSpin * 2),
    bytes(length * sizeof(double)), v(omp_target_alloc(bytes, omp_get_default_device())), alloc(true)
  {
    snprintf(vol_string, TuneKey::volume_n - 1, "%zu", length);
  }
  ColorSpinorField(int nColor, int nSpin, size_t volumeCB, void *v) :
    nColor(nColor), nSpin(nSpin), volumeCB(volumeCB), length(volumeCB * nColor * nSpin * 2),
    bytes(length * sizeof(double)), v(v), alloc(false)
  {
    snprintf(vol_string, TuneKey::volume_n - 1, "%zu", length);
  }
  ~ColorSpinorField(void)
  {
    if (alloc) omp_target_free(v, omp_get_default_device());
  }
  void *V(void) const { return v; }
  size_t Bytes(void) const { return bytes; }
  int VolumeCB(void) const { return volumeCB; }
  int Stride(void) const { return volumeCB; }
  int SiteSubset(void) const { return 1; }
  inline const char *VolString() const { return vol_string; }
};

template <template <typename> class Functor, typename Arg, bool grid_stride = false>
void Kernel2D_impl(const Arg &arg)
{
  QUDA_RT_CONSTS;
  Functor<Arg> f(arg);

  auto i = threadIdx.x + blockIdx.x * blockDim.x;
  auto j = threadIdx.y + blockIdx.y * blockDim.y;
  if (j >= arg.threads.y) return;

  while (i < arg.threads.x) {
    f(i, j);
    if (grid_stride)
      i += gridDim.x * blockDim.x;
    else
      break;
  }
}

template <template <typename> class Functor, typename Arg, bool grid_stride = false>
void Kernel2D(Arg arg)
{
  const int gd = launch_param.grid.x * launch_param.grid.y * launch_param.grid.z;
  const int ld = launch_param.block.x * launch_param.block.y * launch_param.block.z;
  Arg *dparg = (Arg *)omp_target_alloc(sizeof(Arg), omp_get_default_device());
  omp_target_memcpy(dparg, (void *)(&arg), sizeof(Arg), 0, 0, omp_get_default_device(), omp_get_initial_device());
#pragma omp target teams num_teams(gd) thread_limit(ld) is_device_ptr(dparg)
#pragma omp parallel num_threads(ld)
  {
    char buffer[sizeof(Arg)];
    memcpy(buffer, (void *)dparg, sizeof(Arg));
    Kernel2D_impl<Functor, Arg, grid_stride>(*(Arg *)buffer);
  }
  omp_target_free(dparg, omp_get_default_device());
}

//
// include/targets/omptarget/tunable_kernel.h
//
struct TunableKernel : public Tunable {
  template <bool grid_stride, typename Func, typename Arg>
  void launch_device(Func *func, const TuneParam &tp, const Arg &arg)
  {
    qudaSetupLaunchParameter(tp);
    func(arg);
  }
  TuneKey tuneKey() const { return TuneKey(vol, typeid(*this).name()); }
};

//
// include/tunable_nd.h
//
template <bool grid_stride> struct TunableKernel1D_base : public TunableKernel {
  virtual bool tuneGridDim() const { return grid_stride; }
  TunableKernel1D_base(const ColorSpinorField &field) { strcpy(vol, field.VolString()); }
};

template <bool grid_stride = false> struct TunableKernel2D_base : public TunableKernel1D_base<grid_stride> {
  mutable unsigned int vector_length_y;
  mutable unsigned int step_y;
  bool tune_block_x;

  template <template <typename> class Functor, typename Arg>
  void launch_device(const TuneParam &tp, const Arg &arg)
  {
    const_cast<Arg &>(arg).threads.y = vector_length_y;
    TunableKernel::launch_device<grid_stride>(Kernel2D<Functor, Arg, grid_stride>, tp, arg);
  }

  template <template <typename> class Functor, typename Arg>
  void launch(const TuneParam &tp, const Arg &arg)
  {
    launch_device<Functor, Arg>(tp, arg);
  }

  TunableKernel2D_base(const ColorSpinorField &field, unsigned int vector_length_y) :
    TunableKernel1D_base<grid_stride>(field), vector_length_y(vector_length_y), step_y(1), tune_block_x(true)
  {
  }

  bool advanceBlockDim(TuneParam &param) const
  {
    dim3 block = param.block;
    dim3 grid = param.grid;
    param.block.y = block.y;
    param.grid.y = grid.y;

    bool ret = tune_block_x ? Tunable::advanceBlockDim(param) : false;

    if (ret) {
      return true;
    } else { // block.x (spacetime) was reset
      auto next = param;
      next.block.y += step_y;

      // we can advance spin/block-color since this is valid
      if (param.block.y < vector_length_y && param.block.y < device::max_threads_per_block_dim(1)
          && param.block.x * (param.block.y + step_y) * param.block.z <= device::max_threads_per_block()) {
        param.block.y += step_y;
        param.grid.y = (vector_length_y + param.block.y - 1) / param.block.y;
        return true;
      } else { // we have run off the end so let's reset
        param.block.y = step_y;
        param.grid.y = (vector_length_y + param.block.y - 1) / param.block.y;
        return false;
      }
    }
  }

  void initTuneParam(TuneParam &param) const
  {
    Tunable::initTuneParam(param);
    param.block.y = step_y;
    param.grid.y = (vector_length_y + step_y - 1) / step_y;
  }

  /** sets default values for when tuning is disabled */
  void defaultTuneParam(TuneParam &param) const
  {
    Tunable::defaultTuneParam(param);
    param.block.y = step_y;
    param.grid.y = (vector_length_y + step_y - 1) / step_y;
  }

  void resizeVector(int y) const { vector_length_y = y; }
  void resizeStep(int y) const { step_y = y; }
};

struct TunableKernel2D : public TunableKernel2D_base<false> {
  virtual unsigned int minThreads() const = 0;
  TunableKernel2D(const ColorSpinorField &field, unsigned int vector_length_y) :
    TunableKernel2D_base<false>(field, vector_length_y)
  {
  }
};

//
// lib/tune.cpp
//
static bool tuning = false;

TuneParam tuneLaunch(Tunable &tunable)
{
  TuneKey key = tunable.tuneKey();
  static TuneParam param;

  if (!tuning) {
    TuneParam best_param;
    double best_time;
    tuning = true;
    best_time = DBL_MAX;
    printf("Tuning %s at vol=%s\n", key.name, key.volume);
    device_timer_t timer;
    host_timer_t tune_timer;
    tune_timer.start();

    tunable.initTuneParam(param);
    while (tuning) {
      printf("About to call tunable.apply block=(%d,%d,%d) grid=(%d,%d,%d)", param.block.x, param.block.y,
             param.block.z, param.grid.x, param.grid.y, param.grid.z);
      tunable.apply(); // do initial call in case we need to jit compile for these parameters or if policy tuning
      timer.start();
      for (int i = 0; i < tunable.tuningIter(); i++) {
        tunable.apply(); // calls tuneLaunch() again, which simply returns the currently active param
      }
      timer.stop();
      float elapsed_time = timer.last() / tunable.tuningIter();
      if (elapsed_time < best_time) {
        best_time = elapsed_time;
        best_param = param;
      }
      printf("... gives %s (time %.3f ms x %d)\n", tunable.perfString(elapsed_time).c_str(), elapsed_time * 1e3,
             tunable.tuningIter());
      do {
        tuning = tunable.advanceTuneParam(param);
      } while (tuning && invalid_launch_param(param));
    }
    printf("Tuned %s giving %s for %s\n", tunable.paramString(best_param).c_str(),
           tunable.perfString(best_time).c_str(), key.name);
    best_param.time = best_time;
    param = best_param;
  }
  return param;
}

int Tunable::blockStep() const { return device::warp_size(); }
int Tunable::blockMin() const { return device::warp_size(); }

//
// include/targets/generic/load_store.h
//
template <typename VectorType> inline VectorType vector_load(const void *ptr, int idx)
{
  VectorType value;
  value = reinterpret_cast<const VectorType *>(ptr)[idx];
  return value;
}

template <typename VectorType> inline void vector_store(void *ptr, int idx, const VectorType &value)
{
  reinterpret_cast<VectorType *>(ptr)[idx] = value;
}

//
// include/convert.h
//
template <typename T1, typename T2> constexpr void copy(T1 &a, const T2 &b) { a = b; }
template <typename T1, typename T2> constexpr void copy_scaled(T1 &a, const T2 &b) { copy(a, b); }
template <typename T1, typename T2, typename T3> constexpr void copy_and_scale(T1 &a, const T2 &b, const T3 &)
{
  copy(a, b);
}

//
// include/register_traits.h
//
template <typename> struct mapper { };
template <> struct mapper<double> { typedef double type; };

template <typename Float, int number> struct VectorType;
template <> struct VectorType<double, 1> { typedef double type; };
template <> struct VectorType<double, 2> { typedef double2 type; };

template <bool large_alloc> struct AllocType { };
template <> struct AllocType<true> { typedef size_t type; };
template <> struct AllocType<false> { typedef int type; };

//
// include/color_spinor_field_order.h
//

/**
   @brief colorspinor_wrapper is an internal class that is used to
   wrap instances of colorspinor accessors, currying in a specific
   location on the field.  The operator() accessors in
   colorspinor-field accessors return instances to this class,
   allowing us to then use operator overloading upon this class to
   interact with the ColorSpinor class.  As a result we can include
   colorspinor-field accessors directly in ColorSpinor expressions
   in kernels without having to declare temporaries with explicit
   calls to the load/save methods in the colorspinor-field
   accessors.
*/
template <typename Float, typename T> struct colorspinor_wrapper {
  const T &field;
  const int x_cb;
  const int parity;

  /**
     @brief colorspinor_wrapper constructor
     @param[in] a colorspinor field accessor we are wrapping
     @param[in] x_cb checkerboarded space-time index we are accessing
     @param[in] parity Parity we are accessing
  */
  inline colorspinor_wrapper<Float, T>(const T &field, int x_cb, int parity) :
    field(field), x_cb(x_cb), parity(parity)
  {
  }

  /**
     @brief Assignment operator with ColorSpinor instance as input
     @param[in] C ColorSpinor we want to store in this accessor
  */
  template <typename C> inline void operator=(const C &a) const { field.save(a.data, x_cb, parity); }
};

/**
   @brief Accessor routine for ColorSpinorFields in native field order.
   @tparam Float Underlying storage data type of the field
   @tparam Ns Number of spin components
   @tparam Nc Number of colors
   @tparam N Number of real numbers per short vector
   @tparam spin_project Whether the ghosts are spin projected or not
   @tparam huge_alloc Template parameter that enables 64-bit
   pointer arithmetic for huge allocations (e.g., packed set of
   vectors).  Default is to use 32-bit pointer arithmetic.
*/
template <typename Float, int Ns, int Nc, int N_, bool spin_project = false, bool huge_alloc = false>
struct FloatNOrder {
  static_assert((2 * Ns * Nc) % N_ == 0, "Internal degrees of freedom not divisible by short-vector length");
  static constexpr int length = 2 * Ns * Nc;
  static constexpr int N = N_;
  static constexpr int M = length / N;

  using Accessor = FloatNOrder<Float, Ns, Nc, N, spin_project, huge_alloc>;
  using real = typename mapper<Float>::type;
  using complex = complex<real>;
  using Vector = typename VectorType<Float, N>::type;
  using AllocInt = typename AllocType<huge_alloc>::type;
  using norm_type = float;

  Float *field;
  const AllocInt offset; // offset can be 32-bit or 64-bit
  int volumeCB;
  int stride;
  int nParity;
  void *backup_h; //! host memory for backing up the field when tuning
  size_t bytes;

  FloatNOrder(const ColorSpinorField &a, Float *field_ = 0) :
    field(field_ ? field_ : (Float *)a.V()),
    offset(a.Bytes() / (2 * sizeof(Float) * N)),
    volumeCB(a.VolumeCB()),
    stride(a.Stride()),
    nParity(a.SiteSubset()),
    bytes(a.Bytes())
  {
  }

  inline void load(complex out[length / 2], int x, int parity = 0) const
  {
    real v[length];
    norm_type nrm = 0.0;

#pragma unroll
    for (int i = 0; i < M; i++) {
      // first load from memory
      Vector vecTmp = vector_load<Vector>(field, parity * offset + x + stride * i);
      // now copy into output and scale
#pragma unroll
      for (int j = 0; j < N; j++) copy_and_scale(v[i * N + j], reinterpret_cast<Float *>(&vecTmp)[j], nrm);
    }

#pragma unroll
    for (int i = 0; i < length / 2; i++) out[i] = complex(v[2 * i + 0], v[2 * i + 1]);
  }

  inline void save(const complex in[length / 2], int x, int parity = 0) const
  {
    real v[length];

#pragma unroll
    for (int i = 0; i < length / 2; i++) {
      v[2 * i + 0] = in[i].real();
      v[2 * i + 1] = in[i].imag();
    }

#pragma unroll
    for (int i = 0; i < M; i++) {
      Vector vecTmp;
      // first do scalar copy converting into storage type
#pragma unroll
      for (int j = 0; j < N; j++) copy_scaled(reinterpret_cast<Float *>(&vecTmp)[j], v[i * N + j]);
      // second do vectorized copy into memory
      vector_store(field, parity * offset + x + stride * i, vecTmp);
    }
  }
};
vectorized copy into memory vector_store(field, parity * offset + x + stride * i, vecTmp); } } /** @brief This accessor routine returns a colorspinor_wrapper to this object, allowing us to overload various operators for manipulating at the site level interms of matrix operations. @param[in] x_cb Checkerboarded space-time index we are requesting @param[in] parity Parity we are requesting @return Instance of a colorspinor_wrapper that curries in access to this field at the above coordinates. */ inline auto operator()(int x_cb, int parity) const { return colorspinor_wrapper<real, Accessor>(*this, x_cb, parity); } size_t Bytes() const { return nParity * volumeCB * (Nc * Ns * 2 * sizeof(Float)); } }; template <typename T, int Ns, int Nc, bool project = false, bool huge_alloc = false> struct colorspinor_mapper { }; // double precision template <int Nc, bool huge_alloc> struct colorspinor_mapper<double, 4, Nc, false, huge_alloc> { typedef FloatNOrder<double, 4, Nc, 2, false, huge_alloc> type; }; template <int Nc, bool huge_alloc> struct colorspinor_mapper<double, 4, Nc, true, huge_alloc> { typedef FloatNOrder<double, 4, Nc, 2, true, huge_alloc> type; }; template <int Nc, bool huge_alloc> struct colorspinor_mapper<double, 2, Nc, false, huge_alloc> { typedef FloatNOrder<double, 2, Nc, 2, false, huge_alloc> type; }; template <int Nc, bool huge_alloc> struct colorspinor_mapper<double, 1, Nc, false, huge_alloc> { typedef FloatNOrder<double, 1, Nc, 2, false, huge_alloc> type; }; // // include/kernel_helper.h // template <bool use_kernel_arg_ = true> struct kernel_param { static constexpr bool use_kernel_arg = use_kernel_arg_; dim3 threads; /** number of active threads required */ constexpr kernel_param() = default; constexpr kernel_param(dim3 threads) : threads(threads) { } }; // // include/color_spinor.h // template <typename Float, int Nc, int Ns> struct ColorSpinor { static constexpr int size = Nc * Ns; complex<Float> data[size]; inline ColorSpinor<Float, Nc, Ns>() { #pragma 
unroll for (int i = 0; i < size; i++) { data[i] = 0; } } inline ColorSpinor<Float, Nc, Ns>(const ColorSpinor<Float, Nc, Ns> &a) { #pragma unroll for (int i = 0; i < size; i++) { data[i] = a.data[i]; } } inline ColorSpinor<Float, Nc, Ns>& operator=(const ColorSpinor<Float, Nc, Ns> &a) { if (this != &a) { #pragma unroll for (int i = 0; i < size; i++) { data[i] = a.data[i]; } } return *this; } template<typename S> inline ColorSpinor(const colorspinor_wrapper<Float,S> &a) { a.field.load(data, a.x_cb, a.parity); } }; // // include/kernels/copy_color_spinor.cuh // template <typename FloatOut, typename FloatIn, int nSpin_, int nColor_, typename Out, typename In, template <int, int> class Basis_> struct CopyColorSpinorArg : kernel_param<> { using Basis = Basis_<nSpin_, nColor_>; using realOut = typename mapper<FloatOut>::type; using realIn = typename mapper<FloatIn>::type; static constexpr int nSpin = nSpin_; static constexpr int nColor = nColor_; Out out; const In in; const int outParity; const int inParity; CopyColorSpinorArg(ColorSpinorField &out, const ColorSpinorField &in, FloatOut* Out_, FloatIn *In_, float *outNorm, float *inNorm) : kernel_param(dim3(in.VolumeCB(), in.SiteSubset(), 1)), out(out, Out_), in(in, In_), outParity(0), inParity(0) { } }; template <int Ns, int Nc> struct PreserveBasis { template <typename FloatOut, typename FloatIn> inline void operator()(complex<FloatOut> out[Ns*Nc], const complex<FloatIn> in[Ns*Nc]) const { for (int s=0; s<Ns; s++) for (int c=0; c<Nc; c++) out[s*Nc+c] = in[s*Nc+c]; } }; template <typename Arg> struct CopyColorSpinor_ { const Arg &arg; constexpr CopyColorSpinor_(const Arg &arg): arg(arg) {} inline void operator()(int x_cb, int parity) { ColorSpinor<typename Arg::realIn, Arg::nColor, Arg::nSpin> in = arg.in(x_cb, (parity+arg.inParity)&1); ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out; typename Arg::Basis basis; basis(out.data, in.data); arg.out(x_cb, (parity+arg.outParity)&1) = out; } }; // // 
lib/copy_color_spinor.cuh // template <int Ns, int Nc, typename Out, typename In, typename param_t> class CopyColorSpinor : TunableKernel2D { using FloatOut = typename std::remove_pointer<typename std::tuple_element<2, param_t>::type>::type; using FloatIn = typename std::remove_pointer<typename std::tuple_element<3, param_t>::type>::type; template <template <int, int> class Basis> using Arg = CopyColorSpinorArg<FloatOut, FloatIn, Ns, Nc, Out, In, Basis>; FloatOut *Out_; FloatIn *In_; float *outNorm; float *inNorm; ColorSpinorField &out; const ColorSpinorField ∈ unsigned int minThreads() const { return in.VolumeCB(); } public: CopyColorSpinor(ColorSpinorField &out, const ColorSpinorField &in, const param_t ¶m) : TunableKernel2D(in, in.SiteSubset()), Out_(std::get<2>(param)), In_(std::get<3>(param)), outNorm(std::get<4>(param)), inNorm(std::get<5>(param)), out(out), in(in) { apply(); } void apply() { TuneParam tp = tuneLaunch(*this); launch<CopyColorSpinor_>(tp, Arg<PreserveBasis>(out, in, Out_, In_, outNorm, inNorm)); } long long bytes() const { return in.Bytes() + out.Bytes(); } }; constexpr int Ns = 4; constexpr int Nc = 3; int main(int argc, char *argv[]) { size_t vol, i, fail; double *x, *y; vol = argc <= 1 ? 
1L<<22 : atol(argv[1]); x = (double *)omp_target_alloc(sizeof(double)*vol*Ns*Nc*2, omp_get_default_device()); #pragma omp target teams distribute parallel for is_device_ptr(x) for(i=0;i<vol*Ns*Nc*2;++i) x[i] = (double)i; ColorSpinorField out(Nc, Ns, vol); ColorSpinorField in(Nc, Ns, vol, x); using FloatOut = double; using FloatIn = double; using param_t = std::tuple<ColorSpinorField &, const ColorSpinorField &, FloatOut *, FloatIn *, float *, float *>; using O = typename colorspinor_mapper<FloatOut,Ns,Nc>::type; using I = typename colorspinor_mapper<FloatIn,Ns,Nc>::type; param_t param(out, in, nullptr, nullptr, nullptr, nullptr); CopyColorSpinor<Ns, Nc, O, I, param_t>(out, in, param); fail = 0; y = (double*)out.v; #pragma omp target teams distribute parallel for reduction(+:fail) is_device_ptr(x,y) for(i=0;i<vol*Ns*Nc*2;++i) if(x[i]!=y[i]) fail++; if(fail>0) printf("%zu elements differ!\n", fail); omp_target_free(x, omp_get_default_device()); return fail>0; }