// Branchless, vectorized `atan2f`. Various functions of increasing
// performance are presented. The fastest version is ~50x faster than libc
// on batch workloads, outputting a result every ~2 clock cycles, compared to
// ~110 for libc. The functions all use the same `atan` approximation, and their
// max error is around 1/10000 of a degree.
//
// They do not handle inf / -inf or the origin as inputs as they should -- in
// our case these are a sign that something is wrong anyway. Moreover,
// manual_2 does not handle NaN correctly (it drops them silently), and all
// the auto_ functions do not handle -0 correctly. But manual_1 handles
// everything but +-inf and +-0,+-0 correctly.
//
// Tested on a Xeon W-2145, a Skylake processor. Compiled with
//
//     $ clang++ --version
//     clang version 12.0.1
//     $ clang++ -static --std=c++20 -march=skylake -O3 -Wall vectorized-atan2f.cpp -lm -o vectorized-atan2f
//
// Results:
//
//     $ ./vectorized-atan2f --test-edge-cases 100000 100 1000
//     Generating data... done.
//     Tests will read 824kB and write 412kB (103048 points)
//     Running 1000 warm ups and 1000 iterations
//
//     baseline: 2.46 s, 0.32GB/s, 105.79 cycles/elem, 129.34 instrs/elem, 1.22 instrs/cycle, 27.76 branches/elem, 7.08% branch misses, 0.47% cache misses, 4.29GHz
//     auto_1: 0.21 s, 3.75GB/s, 8.29 cycles/elem, 8.38 instrs/elem, 1.01 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.03% cache misses, 3.89GHz, 0.000109283deg max error, max error point: -0.563291132, -0.544303775
//     auto_2: 97.50ms, 8.19GB/s, 3.79 cycles/elem, 5.38 instrs/elem, 1.42 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.01% cache misses, 3.88GHz, 0.000109283deg max error, max error point: -0.854430377, +0.107594967
//     auto_3: 75.95ms, 10.52GB/s, 2.95 cycles/elem, 4.13 instrs/elem, 1.40 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.03% cache misses, 3.88GHz, 0.000109283deg max error, max error point: -0.854430377, +0.107594967
//     auto_4: 55.89ms, 14.29GB/s, 2.17 cycles/elem, 3.63 instrs/elem, 1.67 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.01% cache misses, 3.88GHz, 0.000109283deg max error, max error point: -0.854430377, +0.107594967
//     manual_1: 52.57ms, 15.20GB/s, 2.04 cycles/elem, 3.63 instrs/elem, 1.78 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.03% cache misses, 3.88GHz, 0.000109283deg max error, max error point: -0.854430377, +0.107594967
//     manual_2: 50.16ms, 15.93GB/s, 1.95 cycles/elem, 3.88 instrs/elem, 1.99 instrs/cycle, 0.13 branches/elem, 0.01% branch misses, 0.02% cache misses, 3.88GHz, 0.000109283deg max error, max error point: -0.854430377, +0.107594967
//
// The atan approximation is from sheet 11 of "Approximations for digital
// computers", C. Hastings, 1955, a delightful document overall.
//
// Functions auto_1 to auto_4 are automatically vectorized. manual_1 and manual_2
// are vectorized manually.
//
// You'll get quite different results with g++. The main problem with g++ is that
// it vectorizes less aggressively.
// However, it inserts FMA instructions _more_
// aggressively, which is a plus. clang needs the `fp-contract` pragma
// or an explicit `fmaf`.
//
// Moreover, the

#include <cmath>
#include <cstdint>   // uintptr_t
#include <cstdlib>   // exit, aligned_alloc
#include <algorithm>
#include <iostream>
#include <limits>
#include <immintrin.h>
#include <vector>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <asm/unistd.h>
#include <random>
#include <sstream>

#define USE_AVX

using namespace std;

#define NOINLINE __attribute__((noinline))
#define UNUSED __attribute__((unused))

// --------------------------------------------------------------------
// AVX utils

#ifdef USE_AVX
template<typename A>
inline void assert_avx_aligned(const A* ptr) {
    if (reinterpret_cast<uintptr_t>(ptr) % 32 != 0) {
        cerr << "Pointer " << ptr << " is not 32-byte aligned, exiting" << endl;
        exit(EXIT_FAILURE);
    }
}

inline void assert_multiple_of_8(size_t num) {
    if (num % 8 != 0) {
        cerr << "Array size " << num << " is not a multiple of 8, exiting" << endl;
        exit(EXIT_FAILURE);
    }
}
#endif

// --------------------------------------------------------------------
// functions

NOINLINE
static void atan2_baseline(size_t cases, const float* ys, const float* xs, float* out) {
    for (size_t i = 0; i < cases; i++) {
        out[i] = atan2f(ys[i], xs[i]);
    }
}

// not tested since it is very slow.
NOINLINE UNUSED
static void atan2_fpatan(size_t cases, const float* ys, const float* xs, float* out) {
    for (size_t i = 0; i < cases; i++) {
        asm (
            "flds (%[ys], %[i], 4)\n"
            "flds (%[xs], %[i], 4)\n"
            "fpatan\n"
            "fstps (%[out], %[i], 4)\n"
            :
            : [ys]"r"(ys), [xs]"r"(xs), [out]"r"(out), [i]"r"(i)
        );
    }
}

// Polynomial approximation of atan between [-1, 1]. Stated max error ~0.000001rad.
// See comment at the beginning of file for source.
inline float atan_approximation(float x) {
    float a1  =  0.99997726f;
    float a3  = -0.33262347f;
    float a5  =  0.19354346f;
    float a7  = -0.11643287f;
    float a9  =  0.05265332f;
    float a11 = -0.01172120f;

    float x_sq = x*x;
    return
        x * (a1 + x_sq * (a3 + x_sq * (a5 + x_sq * (a7 + x_sq * (a9 + x_sq * a11)))));
}

// First automatic version: naive translation of the maths
NOINLINE
static void atan2_auto_1(size_t num_points, const float* ys, const float* xs, float* out) {
    for (size_t i = 0; i < num_points; i++) {
        // Ensure input is in [-1, +1]
        float y = ys[i];
        float x = xs[i];
        bool swap = fabs(x) < fabs(y);
        float atan_input = (swap ? x : y) / (swap ? y : x);

        // Approximate atan
        float res = atan_approximation(atan_input);

        // If swapped, adjust atan output
        res = swap ? (atan_input >= 0 ? M_PI_2 : -M_PI_2) - res : res;
        // Adjust quadrants
        if      (x >= 0 && y >= 0) {}                     // 1st quadrant
        else if (x <  0 && y >= 0) { res =  M_PI + res; } // 2nd quadrant
        else if (x <  0 && y <  0) { res = -M_PI + res; } // 3rd quadrant
        else if (x >= 0 && y <  0) {}                     // 4th quadrant

        // Store result
        out[i] = res;
    }
}

// Second automatic version: get rid of casting to double
NOINLINE
static void atan2_auto_2(size_t num_points, const float* ys, const float* xs, float* out) {
    float pi = M_PI;
    float pi_2 = M_PI_2;

    for (size_t i = 0; i < num_points; i++) {
        // Ensure input is in [-1, +1]
        float y = ys[i];
        float x = xs[i];
        bool swap = fabs(x) < fabs(y);
        float atan_input = (swap ? x : y) / (swap ? y : x);

        // Approximate atan
        float res = atan_approximation(atan_input);

        // If swapped, adjust atan output
        res = swap ? (atan_input >= 0 ?
            pi_2 : -pi_2) - res : res;
        // Adjust quadrants
        if      (x >= 0 && y >= 0) {}                   // 1st quadrant
        else if (x <  0 && y >= 0) { res =  pi + res; } // 2nd quadrant
        else if (x <  0 && y <  0) { res = -pi + res; } // 3rd quadrant
        else if (x >= 0 && y <  0) {}                   // 4th quadrant

        // Store result
        out[i] = res;
    }
}

// Third automatic version: check the sign of x and y only once -- the compiler
// can't do this simplification itself in the presence of NaNs, since
// `0/0 >= 0` is false and `0/0 < 0` is also false.
NOINLINE
static void atan2_auto_3(size_t num_points, const float* ys, const float* xs, float* out) {
    float pi = M_PI;
    float pi_2 = M_PI_2;

    for (size_t i = 0; i < num_points; i++) {
        // Ensure input is in [-1, +1]
        float y = ys[i];
        float x = xs[i];
        bool swap = fabs(x) < fabs(y);
        float atan_input = (swap ? x : y) / (swap ? y : x);

        // Approximate atan
        float res = atan_approximation(atan_input);

        // If swapped, adjust atan output
        res = swap ? (atan_input >= 0 ? pi_2 : -pi_2) - res : res;
        // Adjust the result depending on the input quadrant
        if (x < 0) {
            res = (y >= 0 ? pi : -pi) + res;
        }

        // Store result
        out[i] = res;
    }
}

inline float atan_fma_approximation(float x) {
    float a1  =  0.99997726f;
    float a3  = -0.33262347f;
    float a5  =  0.19354346f;
    float a7  = -0.11643287f;
    float a9  =  0.05265332f;
    float a11 = -0.01172120f;

    // Compute approximation using Horner's method
    float x_sq = x*x;
    return
        x * fmaf(x_sq, fmaf(x_sq, fmaf(x_sq, fmaf(x_sq, fmaf(x_sq, a11, a9), a7), a5), a3), a1);
}

// Fourth automatic version: use FMA for the polynomial
NOINLINE
static void atan2_auto_4(size_t num_points, const float* ys, const float* xs, float* out) {
    float pi = M_PI;
    float pi_2 = M_PI_2;

    for (size_t i = 0; i < num_points; i++) {
        // Ensure input is in [-1, +1]
        float y = ys[i];
        float x = xs[i];
        bool swap = fabs(x) < fabs(y);
        float atan_input = (swap ? x : y) / (swap ? y : x);

        // Approximate atan
        float res = atan_fma_approximation(atan_input);

        // If swapped, adjust atan output
        res = swap ?
            copysignf(pi_2, atan_input) - res : res;
        // Adjust the result depending on the input quadrant
        if (x < 0) {
            res = copysignf(pi, y) + res;
        }

        // Store result
        out[i] = res;
    }
}

#ifdef USE_AVX
inline __m256 atan_avx_approximation(__m256 x) {
    __m256 a1  = _mm256_set1_ps( 0.99997726f);
    __m256 a3  = _mm256_set1_ps(-0.33262347f);
    __m256 a5  = _mm256_set1_ps( 0.19354346f);
    __m256 a7  = _mm256_set1_ps(-0.11643287f);
    __m256 a9  = _mm256_set1_ps( 0.05265332f);
    __m256 a11 = _mm256_set1_ps(-0.01172120f);

    __m256 x_sq = _mm256_mul_ps(x, x);
    __m256 result;
    result = a11;
    result = _mm256_fmadd_ps(x_sq, result, a9);
    result = _mm256_fmadd_ps(x_sq, result, a7);
    result = _mm256_fmadd_ps(x_sq, result, a5);
    result = _mm256_fmadd_ps(x_sq, result, a3);
    result = _mm256_fmadd_ps(x_sq, result, a1);
    result = _mm256_mul_ps(x, result);

    return result;
}

// First manual version: straightforward translation of atan2_auto_4
NOINLINE
static void atan2_manual_1(size_t num_points, const float* ys, const float* xs, float* out) {
    assert_avx_aligned(ys), assert_avx_aligned(xs), assert_avx_aligned(out);

    const __m256 pi = _mm256_set1_ps(M_PI);
    const __m256 pi_2 = _mm256_set1_ps(M_PI_2);

    // Everything but the sign bit. AND'ing with it will give us
    // the absolute value of a float.
    const __m256 abs_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    // Only sign bit. XOR'ing with it will negate a number. AND'ing
    // with it will give us the sign of the number.
    const __m256 sign_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x80000000));

    for (size_t i = 0; i < num_points; i += 8) {
        __m256 y = _mm256_load_ps(&ys[i]);
        __m256 x = _mm256_load_ps(&xs[i]);

        // Prepare input
        __m256 swap_mask = _mm256_cmp_ps(
            _mm256_and_ps(y, abs_mask),
            _mm256_and_ps(x, abs_mask),
            _CMP_GT_OS
        );
        // Use blend instructions, together with a mask that tells us which
        // number has greater magnitude, to ensure the input is within [-1, 1].
        __m256 atan_input = _mm256_div_ps(
            _mm256_blendv_ps(y, x, swap_mask), // pick whichever of y and x has the smaller magnitude
            _mm256_blendv_ps(x, y, swap_mask)  // and whichever has the larger.
        );

        // Approximate atan
        __m256 result = atan_avx_approximation(atan_input);

        // If swapped, adjust atan output
        //
        // Use a blend to pick between the unchanged result and the adjusted
        // one. For the adjusted one, we just apply the sign of the input to
        // pi/2 and subtract the result from it.
        result = _mm256_blendv_ps(
            result,
            _mm256_sub_ps(
                _mm256_or_ps(pi_2, _mm256_and_ps(atan_input, sign_mask)),
                result
            ),
            swap_mask
        );
        // Adjust the result depending on the input quadrant. We create
        // a mask for the sign of `x` using an arithmetic right shift:
        // the mask will be all 0s if the sign is positive, and all 1s
        // if the sign is negative.
        __m256 x_sign_mask = _mm256_castsi256_ps(_mm256_srai_epi32(_mm256_castps_si256(x), 31));
        // Then use the mask to perform the adjustment only when the sign
        // is negative, and use the sign bit of `y` to know whether to add
        // `pi` or `-pi`.
        result = _mm256_add_ps(
            _mm256_and_ps(
                _mm256_xor_ps(pi, _mm256_and_ps(sign_mask, y)),
                x_sign_mask
            ),
            result
        );

        // Store result
        _mm256_store_ps(&out[i], result);
    }
}

// Second manual version: use the abs values we get at the beginning
// for more ILP (see comment below)
NOINLINE
static void atan2_manual_2(size_t num_points, const float* ys, const float* xs, float* out) {
    assert_avx_aligned(ys), assert_avx_aligned(xs), assert_avx_aligned(out);

    const __m256 pi = _mm256_set1_ps(M_PI);
    const __m256 pi_2 = _mm256_set1_ps(M_PI_2);

    // no sign bit -- AND with this to get absolute value
    const __m256 abs_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    // only sign bit -- XOR with this to get negative
    const __m256 sign_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x80000000));

    for (size_t i = 0; i < num_points; i += 8) {
        __m256 y = _mm256_load_ps(&ys[i]);
        __m256 x = _mm256_load_ps(&xs[i]);

        __m256 abs_y = _mm256_and_ps(abs_mask, y);
        __m256 abs_x = _mm256_and_ps(abs_mask, x);

        // atan_input = min(|y|, |x|) / max(|y|, |x|)
        //
        // I've experimented with using blend rather than min/max, given that
        // we compute `swap_mask` later anyway, but that delays the atan
        // computation by 2+ cycles, and is less parallel, since on skylake:
        //
        //                latency  throughput
        //     vminps     4        0.5
        //     vmaxps     4        0.5
        //     vblendvps  2        0.66
        //     vcmpps     4        0.5
        //
        // So while it decreases the number of instructions needed, it makes
        // the overall function slower.
        __m256 atan_input = _mm256_div_ps(
            _mm256_min_ps(abs_y, abs_x),
            _mm256_max_ps(abs_y, abs_x)
        );

        // Approximate atan
        __m256 result = atan_avx_approximation(atan_input);

        // We first do the usual +-pi/2 - res, but in this case
        // we know the sign of pi/2 since the input is always positive.
        //
        // result = (abs_y > abs_x) ?
pi - res : res; __m256 swap_mask = _mm256_cmp_ps(abs_y, abs_x, _CMP_GT_OQ); result = _mm256_add_ps( _mm256_xor_ps(result, _mm256_and_ps(sign_mask, swap_mask)), _mm256_and_ps(pi_2, swap_mask) ); // We now have to adjust the quadrant slightly differently: // // * If both values are positive, do nothing; // * If y is negative and x is positive, do nothing; // * If y is positive and x is negative, do pi + result; // * If y is negative and x is negative, do result - pi; // // These can easily be verified by analyzing what happens to the output // for each input quadrant. // // These cases can be compressed to the two branches below, and then // made branchless without using any blend instruction. // result = (x < 0) ? pi - res : res; __m256 x_sign_mask = _mm256_castsi256_ps(_mm256_srai_epi32(_mm256_castps_si256(x), 31)); result = _mm256_add_ps( _mm256_xor_ps(result, _mm256_and_ps(x, sign_mask)), _mm256_and_ps(pi, x_sign_mask) ); // result = (y < 0) ? -res : res; result = _mm256_xor_ps(result, _mm256_and_ps(y, sign_mask)); // Store result _mm256_store_ps(&out[i], result); } } #endif // -------------------------------------------------------------------- // data generation NOINLINE static void generate_data( bool random_points, bool test_edge_cases, size_t desired_num_cases, size_t* cases, float** ys, float** xs, float** ref_out, float** out ) { cout << "Generating data..." << flush; size_t edge = static_cast<size_t>(sqrtf(static_cast<float>(desired_num_cases))); vector<float> extra_cases; if (test_edge_cases) { // We want to make sure to test some special cases, so we always add them. // We do not include negative zero, since we do not care about the sign // matching up in that case. // We also do not include NaN and inf, since we assume that the input does // not contain it. 
    using limits = numeric_limits<float>;
    extra_cases = {
      1.0f, -1.0f, 0.0f, -0.0f,
      limits::epsilon(), -limits::epsilon(),
      limits::max(), -limits::max(),
      limits::quiet_NaN()
    };
  }

  // The -1 is for the 0, 0 case, which we do not want.
  *cases = (edge+extra_cases.size())*(edge+extra_cases.size()) - 1;
  // Make sure cases is a multiple of 8 so that the AVX functions can work cleanly
  *cases = *cases + (8 - (*cases % 8));
  *ys = reinterpret_cast<float*>(aligned_alloc(32, *cases * sizeof(float)));
  *xs = reinterpret_cast<float*>(aligned_alloc(32, *cases * sizeof(float)));
  *ref_out = reinterpret_cast<float*>(aligned_alloc(32, *cases * sizeof(float)));
  *out = reinterpret_cast<float*>(aligned_alloc(32, *cases * sizeof(float)));
  if (*ys == nullptr || *xs == nullptr || *ref_out == nullptr || *out == nullptr) {
    cerr << "Could not allocate arrays" << endl;
    exit(EXIT_FAILURE);
  }

  mt19937_64 gen(0);

  {
    float bound = 1.0f;
    float step = (bound * 2.0f) / static_cast<float>(edge);
    uniform_real_distribution<float> dist{ -1.0f, 1.0f };
    const auto get_number = [random_points, &extra_cases, &gen, bound, step, &dist](size_t i) -> float {
      if (i < extra_cases.size()) {
        return extra_cases.at(i);
      } else if (random_points) {
        return dist(gen);
      } else {
        return -bound + static_cast<float>(i - extra_cases.size()) * step;
      }
    };
    size_t ix = 0;
    for (size_t i = 0; i < edge + extra_cases.size(); i++) {
      for (size_t j = 0; j < edge + extra_cases.size(); j++) {
        float y = get_number(i);
        float x = get_number(j);
        if (y == 0.0f && x == 0.0f) { continue; }
        (*ys)[ix] = y;
        (*xs)[ix] = x;
        ix++;
      }
    }
    // pad with dummies
    for (; ix < *cases; ix++) {
      (*ys)[ix] = 1.0f;
      (*xs)[ix] = 1.0f;
    }
  }

  // shuffle to confuse branch predictor in the case of explicit branches and
  // non-random points
  {
    for (size_t i = 0; i < *cases - 1; i++) {
      size_t swap_with = static_cast<size_t>(i + 1 + (gen() % (*cases - i - 1)));
      swap((*ys)[i], (*ys)[swap_with]);
      swap((*xs)[i], (*xs)[swap_with]);
    }
  }

  cout << " done."
       << endl;
}

// --------------------------------------------------------------------
// Pin to first CPU

static void pin_to_cpu_0(void) {
  cpu_set_t cpu_mask;
  CPU_ZERO(&cpu_mask);
  CPU_SET(0, &cpu_mask);
  if (sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask) != 0) {
    fprintf(stderr, "Could not set CPU affinity\n");
    exit(EXIT_FAILURE);
  }
}

// --------------------------------------------------------------------
// perf instrumentation -- a mixture of man 2 perf_event_open and
// <https://stackoverflow.com/a/42092180>

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, int group_fd, unsigned long flags) {
  int ret;
  ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  return ret;
}

static void setup_perf_event(
  struct perf_event_attr *evt,
  int *fd,
  uint64_t *id,
  uint32_t evt_type,
  uint64_t evt_config,
  int group_fd
) {
  memset(evt, 0, sizeof(struct perf_event_attr));
  evt->type = evt_type;
  evt->size = sizeof(struct perf_event_attr);
  evt->config = evt_config;
  evt->disabled = 1;
  evt->exclude_kernel = 1;
  evt->exclude_hv = 1;
  evt->read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;

  *fd = perf_event_open(evt, 0, -1, group_fd, 0);
  if (*fd == -1) {
    fprintf(stderr, "Error opening leader %llx\n", evt->config);
    exit(EXIT_FAILURE);
  }

  ioctl(*fd, PERF_EVENT_IOC_ID, id);
}

static struct perf_event_attr perf_cycles_evt;
static int perf_cycles_fd;
static uint64_t perf_cycles_id;

static struct perf_event_attr perf_clock_evt;
static int perf_clock_fd;
static uint64_t perf_clock_id;

static struct perf_event_attr perf_instrs_evt;
static int perf_instrs_fd;
static uint64_t perf_instrs_id;

static struct perf_event_attr perf_cache_misses_evt;
static int perf_cache_misses_fd;
static uint64_t perf_cache_misses_id;

static struct perf_event_attr perf_cache_references_evt;
static int perf_cache_references_fd;
static uint64_t perf_cache_references_id;

static struct perf_event_attr perf_branch_misses_evt;
static int perf_branch_misses_fd;
static uint64_t perf_branch_misses_id;

static struct perf_event_attr perf_branch_instructions_evt;
static int perf_branch_instructions_fd;
static uint64_t perf_branch_instructions_id;

static void perf_init(void) {
  // Cycles (group leader)
  setup_perf_event(
    &perf_cycles_evt, &perf_cycles_fd, &perf_cycles_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1
  );
  // Clock
  setup_perf_event(
    &perf_clock_evt, &perf_clock_fd, &perf_clock_id,
    PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK, perf_cycles_fd
  );
  // Instructions
  setup_perf_event(
    &perf_instrs_evt, &perf_instrs_fd, &perf_instrs_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, perf_cycles_fd
  );
  // Cache misses
  setup_perf_event(
    &perf_cache_misses_evt, &perf_cache_misses_fd, &perf_cache_misses_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES, perf_cycles_fd
  );
  // Cache references
  setup_perf_event(
    &perf_cache_references_evt, &perf_cache_references_fd, &perf_cache_references_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_REFERENCES, perf_cycles_fd
  );
  // Branch misses
  setup_perf_event(
    &perf_branch_misses_evt, &perf_branch_misses_fd, &perf_branch_misses_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES, perf_cycles_fd
  );
  // Branch instructions
  setup_perf_event(
    &perf_branch_instructions_evt, &perf_branch_instructions_fd, &perf_branch_instructions_id,
    PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS, perf_cycles_fd
  );
}

static void perf_close(void) {
  close(perf_clock_fd);
  close(perf_cycles_fd);
  close(perf_instrs_fd);
  close(perf_cache_misses_fd);
  close(perf_cache_references_fd);
  // These two were opened in perf_init as well, so close them too.
  close(perf_branch_misses_fd);
  close(perf_branch_instructions_fd);
}

static void disable_perf_count(void) {
  ioctl(perf_cycles_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
}

static void enable_perf_count(void) {
  ioctl(perf_cycles_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
}

static void reset_perf_count(void) {
  ioctl(perf_cycles_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
}

struct perf_read_value {
  uint64_t value;
  uint64_t id;
};

struct perf_read_format {
  uint64_t nr;
  struct perf_read_value values[];
};

static char perf_read_buf[4096];

struct perf_count {
  uint64_t cycles;
  double seconds;
  uint64_t instructions;
  uint64_t cache_misses;
  uint64_t cache_references;
  uint64_t branch_misses;
  uint64_t branch_instructions;

  perf_count(): cycles(0), seconds(0.0), instructions(0), cache_misses(0), cache_references(0), branch_misses(0), branch_instructions(0) {}
};

static void read_perf_count(struct perf_count *count) {
  // read returns -1 on error and 0 on end of file, so treat both as failure.
  if (read(perf_cycles_fd, perf_read_buf, sizeof(perf_read_buf)) <= 0) {
    fprintf(stderr, "Could not read cycles from perf\n");
    exit(EXIT_FAILURE);
  }
  struct perf_read_format* rf = (struct perf_read_format *) perf_read_buf;
  if (rf->nr != 7) {
    fprintf(stderr, "Bad number of perf events\n");
    exit(EXIT_FAILURE);
  }
  for (int i = 0; i < static_cast<int>(rf->nr); i++) {
    struct perf_read_value *value = &rf->values[i];
    if (value->id == perf_cycles_id) {
      count->cycles = value->value;
    } else if (value->id == perf_clock_id) {
      // the task clock is in nanoseconds
      count->seconds = ((double) (value->value / 1000ull)) / 1000000.0;
    } else if (value->id == perf_instrs_id) {
      count->instructions = value->value;
    } else if (value->id == perf_cache_misses_id) {
      count->cache_misses = value->value;
    } else if (value->id == perf_cache_references_id) {
      count->cache_references = value->value;
    } else if (value->id == perf_branch_misses_id) {
      count->branch_misses = value->value;
    } else if (value->id == perf_branch_instructions_id) {
      count->branch_instructions = value->value;
    } else {
      fprintf(stderr, "Spurious value in perf read (%llu)\n", (unsigned long long) value->id);
      exit(EXIT_FAILURE);
    }
  }
}

// --------------------------------------------------------------------
// running the function

struct max_error {
  float y;
  float x;
  float error;

  max_error(): y(0.0f), x(0.0f), error(0.0f) {}

  void update(float y, float x, float reference, float result) {
    if (isnan(reference) && !isnan(result)) {
      cerr << "Expected NaN in result, but got " << result << endl;
      cerr << "For point " << x << ", " << y << endl;
      exit(EXIT_FAILURE);
    }
    if (!isnan(reference) && isnan(result)) {
      cerr << "Unexpected NaN in result, expected " << reference << endl;
      cerr << "For point " << x << ", " << y << endl;
      exit(EXIT_FAILURE);
    }
    float error = abs(reference - result);
    if (error > this->error) {
      this->y = y;
      this->x = x;
      this->error = error;
    }
  }
};

NOINLINE
static void run_timed(
  int pre_iterations,
  int iterations,
  bool test_negative_zero,
  bool test_max,
  bool test_nan,
  size_t num,
  const string& name,
  void (*fun)(size_t, const float*, const float*, float*),
  const float* ref_out, // if null no comparison will be run
  const float* in_y,
  const float* in_x,
  float* out
) {
  if (iterations < 1) {
    fprintf(stderr, "iterations < 1: %d\n", iterations);
    exit(EXIT_FAILURE);
  }

  for (int i = 0; i < pre_iterations; i++) {
    fun(num, in_y, in_x, out);
  }

  struct perf_count counts;
  disable_perf_count();
  reset_perf_count();
  enable_perf_count();

  // The enabling / disabling adds noise for small timings, so we
  // wrap the whole loop rather than each call: the loop itself counts
  // for very little (one add, one cmp, one predicted conditional jump).
  for (int i = 0; i < iterations; i++) {
    fun(num, in_y, in_x, out);
  }

  disable_perf_count();
  read_perf_count(&counts);

  // two floats per element
  uint64_t bytes = ((uint64_t) num) * sizeof(float) * 2;
  double gb_per_s = (((double) bytes) / 1000000000.0) / (counts.seconds / ((double) iterations));
  double time = counts.seconds;

  const char *unit = " s";
  if (time < 0.1) {
    time *= 1000.0;
    unit = "ms";
  }
  if (time < 0.1) {
    time *= 1000.0;
    unit = "us";
  }

  // Pad the name with spaces to a fixed width so that the columns line up.
  constexpr int padded_name_size = 10;
  char padded_name[padded_name_size];
  int name_len = snprintf(padded_name, padded_name_size, "%s:", name.c_str());
  for (int i = name_len; i < padded_name_size; i++) {
    padded_name[i] = ' ';
  }
  padded_name[padded_name_size-1] = '\0';

  printf(
    "%s %5.2f%s, %6.2fGB/s, %6.2f cycles/elem, %6.2f instrs/elem, %5.2f instrs/cycle, %5.2f branches/elem, %5.2f%% branch misses, %5.2f%% cache misses, %5.2fGHz",
    padded_name, time, unit, gb_per_s,
    ((double) counts.cycles) / ((double) iterations) / ((double) num),
    ((double) counts.instructions) / ((double) iterations) / ((double) num),
    ((double) counts.instructions) / ((double) counts.cycles),
    ((double) counts.branch_instructions) / ((double) iterations) / ((double) num),
    100.0 * ((double) counts.branch_misses) / ((double) counts.branch_instructions),
    100.0 * ((double) counts.cache_misses) / ((double) counts.cache_references),
    ((double) counts.cycles) / counts.seconds / 1000000000.0
  );

  if (ref_out != nullptr) {
    max_error max_error;
    for (size_t ix = 0; ix < num; ix++) {
      float y = in_y[ix];
      float x = in_x[ix];
      // Note that `== -0.0f` is also true for +0.0f, so this skips
      // every point with a zero coordinate, whatever its sign.
      if (!test_negative_zero && (y == -0.0f || x == -0.0f)) { continue; }
      if (
        !test_max &&
        (fabs(y) == numeric_limits<float>::max() || fabs(x) == numeric_limits<float>::max())
      ) {
        continue;
      }
      if (!test_nan && (isnan(y) || isnan(x))) { continue; }
      max_error.update(in_y[ix], in_x[ix], ref_out[ix], out[ix]);
    }
    printf(
      ", %.9fdeg max error, max error point: %+.9f, %+.9f\n",
      max_error.error * 180.0f / M_PI, max_error.x, max_error.y
    );
  } else {
    printf("\n");
  }
}

#define run_timed_atan2(pre_iterations, iterations, test_negative_zero, test_max, test_nan, num, fun, ref_out, in_y, in_x, out) \
  run_timed(pre_iterations, iterations, test_negative_zero, test_max, test_nan, num, #fun, atan2_ ## fun, ref_out, in_y, in_x, out)

// --------------------------------------------------------------------
// main

NOINLINE
static pair<string, size_t> format_size(size_t size) {
  string unit = "B";
  if (size > 10000ull) {
    size /= 1000ull;
    unit = "kB";
  }
  if (size > 10000ull) {
    size /= 1000ull;
    unit = "MB";
  }
  if (size > 10000ull) {
    size /= 1000ull;
    unit = "GB";
  }
  return { unit, size };
}

struct config {
  bool random_points = false;
  bool test_edge_cases = false;
  bool test_baseline = true;
  bool test_auto_1 = true;
  bool test_auto_2 = true;
  bool test_auto_3 = true;
  bool test_auto_4 = true;
  bool test_manual_1 = true;
  bool test_manual_2 = true;
};

int main(int argc, const char* argv[]) {
  cout.precision(numeric_limits<float>::max_digits10);

  pin_to_cpu_0();
  perf_init();

  config config;

  const auto bad_usage = [&argv]() {
    cerr << "Usage: " << argv[0] << " [--random-points] [--test-edge-cases] [--functions COMMA_SEPARATED_FUNCTIONS] NUMBER_OF_TEST_CASES NUMBER_OF_WARM_UPS NUMBER_OF_ITERATIONS" << endl;
    exit(EXIT_FAILURE);
  };

  vector<string> arguments;
  for (int i = 1; i < argc; i++) {
    string arg{argv[i]};
    if (arg == "--random-points") {
      config.random_points = true;
    } else if (arg == "--test-edge-cases") {
      config.test_edge_cases = true;
    } else if (arg == "--functions") {
      // An explicit list replaces the default of running everything.
      config.test_baseline = false;
      config.test_auto_1 = false;
      config.test_auto_2 = false;
      config.test_auto_3 = false;
      config.test_auto_4 = false;
      config.test_manual_1 = false;
      config.test_manual_2 = false;
      if (i < argc - 1) {
        i++;
        stringstream functions{ argv[i] };
        for (string function; getline(functions, function, ','); ) {
          if (function == "baseline") {
            config.test_baseline = true;
          } else if (function == "auto_1") {
            config.test_auto_1 = true;
          } else if (function == "auto_2") {
            config.test_auto_2 = true;
          } else if (function == "auto_3") {
            config.test_auto_3 = true;
          } else if (function == "auto_4") {
            config.test_auto_4 = true;
          }
#ifdef USE_AVX
          else if (function == "manual") {
            config.test_manual_1 = true;
          } else if (function == "manual_2") {
            config.test_manual_2 = true;
          }
#endif
          else {
            cerr << "Bad function " << function << endl;
            bad_usage();
          }
        }
      } else {
        cerr << "Expecting argument after --functions" << endl;
        bad_usage();
      }
    } else {
      arguments.emplace_back(arg);
    }
  }

  if (arguments.size() != 3) {
    bad_usage();
  }
  size_t desired_cases;
  if (sscanf(arguments.at(0).c_str(), "%zu", &desired_cases) != 1) {
    cerr << "Could not parse number of cases" << endl;
    bad_usage();
  }
  int pre_iterations;
  if (sscanf(arguments.at(1).c_str(), "%d", &pre_iterations) != 1) {
    cerr << "Could not parse warm ups" << endl;
    bad_usage();
  }
  int iterations;
  if (sscanf(arguments.at(2).c_str(), "%d", &iterations) != 1) {
    cerr << "Could not parse iterations" << endl;
    bad_usage();
  }

  size_t cases;
  float* ys;
  float* xs;
  float* ref_out;
  float* out;
  generate_data(
    config.random_points, config.test_edge_cases, desired_cases, &cases, &ys, &xs, &ref_out, &out
  );
  // Without the baseline there is no reference output to compare against.
  if (!config.test_baseline) {
    free(ref_out);
    ref_out = nullptr;
  }

  const auto formatted_input_size = format_size(sizeof(float) * 2 * cases);
  const auto formatted_output_size = format_size(sizeof(float) * cases);
  cout << "Tests will read " << formatted_input_size.second << formatted_input_size.first << " and write " << formatted_output_size.second << formatted_output_size.first << " (" << cases << " points)" << endl;
  cout << "Running " << pre_iterations << " warm ups and " << iterations << " iterations" << endl;
  cout << endl;

  if (config.test_baseline) {
    run_timed(pre_iterations, iterations, true, true, true, cases, "baseline", atan2_baseline, nullptr, ys, xs, ref_out);
  }
  // auto_1 to auto_3 do not support the max value because the input gets
  // reduced to negative zero in that case and the atan_input >= 0 branch
  // doesn't preserve the sign properly.
  //
  // Similarly, none of the functions apart from manual_1 and manual_2 support
  // negative zero properly because of the x < 0 branch.
  //
  // manual_2 silently drops NaNs; it can be made not to do that at a slight
  // performance hit (see the comment in it).
  if (config.test_auto_1) {
    run_timed_atan2(pre_iterations, iterations, false, false, true, cases, auto_1, ref_out, ys, xs, out);
  }
  if (config.test_auto_2) {
    run_timed_atan2(pre_iterations, iterations, false, false, true, cases, auto_2, ref_out, ys, xs, out);
  }
  if (config.test_auto_3) {
    run_timed_atan2(pre_iterations, iterations, false, false, true, cases, auto_3, ref_out, ys, xs, out);
  }
  if (config.test_auto_4) {
    run_timed_atan2(pre_iterations, iterations, false, true, true, cases, auto_4, ref_out, ys, xs, out);
  }
#ifdef USE_AVX
  if (config.test_manual_1) {
    run_timed_atan2(pre_iterations, iterations, true, true, true, cases, manual_1, ref_out, ys, xs, out);
  }
  if (config.test_manual_2) {
    run_timed_atan2(pre_iterations, iterations, true, true, false, cases, manual_2, ref_out, ys, xs, out);
  }
#endif

  cout << endl;

  free(ys);
  free(xs);
  free(ref_out);
  free(out);
  perf_close();

  return EXIT_SUCCESS;
}