Thanks for using Compiler Explorer
Sponsors
Jakt
C++
Ada
Analysis
Assembly
C
Carbon
C++ (Circle)
CIRCT
Clean
CMake
C++ for OpenCL
MLIR
Cppx
Cppx-Blue
Cppx-Gold
Crystal
C#
CUDA C++
D
Dart
Erlang
Fortran
F#
Go
Haskell
HLSL
ispc
Java
Kotlin
LLVM IR
Nim
OCaml
OpenCL C
Pascal
Pony
Python
Ruby
Rust
Scala
Solidity
Swift
Toit
TypeScript Native
Visual Basic
Zig
c source #1
Output
Compile to binary
Execute the code
Intel asm syntax
Demangle identifiers
Filters
Unused labels
Library functions
Directives
Comments
Horizontal whitespace
Compiler
6502 cc65 2.17
6502 cc65 2.18
6502 cc65 2.19
6502 cc65 trunk
ARM gcc 10.2 (linux)
ARM gcc 10.2.1 (none)
ARM gcc 10.3 (linux)
ARM gcc 10.3.1 (2021.07 none)
ARM gcc 10.3.1 (2021.10 none)
ARM gcc 11.1 (linux)
ARM gcc 11.2 (linux)
ARM gcc 11.2.1 (none)
ARM gcc 11.3 (linux)
ARM gcc 12.1 (linux)
ARM gcc 4.5.4 (linux)
ARM gcc 4.6.4 (linux)
ARM gcc 5.4 (linux)
ARM gcc 5.4.1 (none)
ARM gcc 6.3.0 (linux)
ARM gcc 6.4 (linux)
ARM gcc 7.2.1 (none)
ARM gcc 7.3 (linux)
ARM gcc 7.5 (linux)
ARM gcc 8.2 (WinCE)
ARM gcc 8.2 (linux)
ARM gcc 8.3.1 (none)
ARM gcc 8.5 (linux)
ARM gcc 9.2.1 (none)
ARM gcc 9.3 (linux)
ARM gcc trunk (linux)
ARM msvc v19.0 (WINE)
ARM msvc v19.10 (WINE)
ARM msvc v19.14 (WINE)
ARM64 Morello gcc 10.1 Alpha 1
ARM64 gcc 10.2
ARM64 gcc 10.3
ARM64 gcc 11.1
ARM64 gcc 11.2
ARM64 gcc 11.3
ARM64 gcc 12.1
ARM64 gcc 5.4
ARM64 gcc 6.3
ARM64 gcc 6.4
ARM64 gcc 7.3
ARM64 gcc 7.5
ARM64 gcc 8.2
ARM64 gcc 8.5
ARM64 gcc 9.3
ARM64 gcc trunk
ARM64 msvc v19.14 (WINE)
AVR gcc 10.3.0
AVR gcc 11.1.0
AVR gcc 4.5.4
AVR gcc 4.6.4
AVR gcc 5.4.0
AVR gcc 9.2.0
AVR gcc 9.3.0
Arduino Mega (1.8.9)
Arduino Uno (1.8.9)
Chibicc 2020-12-07
FRC 2019
FRC 2020
K1C gcc 7.4
K1C gcc 7.5
KVX gcc 7.5 (ACB 4.1.0)
KVX gcc 7.5 (ACB 4.1.0-cd1)
KVX gcc 7.5 (ACB 4.2.0)
KVX gcc 7.5 (ACB 4.3.0)
KVX gcc 7.5 (ACB 4.4.0)
KVX gcc 9.4 (ACB 4.6.0)
MRISC32 gcc (trunk)
MSP430 gcc 4.5.3
MSP430 gcc 5.3.0
MSP430 gcc 6.2.1
RISC-V rv32gc clang (trunk)
RISC-V rv32gc clang 10.0.0
RISC-V rv32gc clang 10.0.1
RISC-V rv32gc clang 11.0.0
RISC-V rv32gc clang 11.0.1
RISC-V rv32gc clang 12.0.0
RISC-V rv32gc clang 12.0.1
RISC-V rv32gc clang 13.0.0
RISC-V rv32gc clang 13.0.1
RISC-V rv32gc clang 14.0.0
RISC-V rv32gc clang 9.0.0
RISC-V rv32gc clang 9.0.1
RISC-V rv32gc gcc 10.2.0
RISC-V rv32gc gcc 10.3.0
RISC-V rv32gc gcc 11.2.0
RISC-V rv32gc gcc 11.3.0
RISC-V rv32gc gcc 12.1.0
RISC-V rv32gc gcc 8.2.0
RISC-V rv32gc gcc 8.5.0
RISC-V rv32gc gcc 9.4.0
RISC-V rv64gc clang (trunk)
RISC-V rv64gc clang 10.0.0
RISC-V rv64gc clang 10.0.1
RISC-V rv64gc clang 11.0.0
RISC-V rv64gc clang 11.0.1
RISC-V rv64gc clang 12.0.0
RISC-V rv64gc clang 12.0.1
RISC-V rv64gc clang 13.0.0
RISC-V rv64gc clang 13.0.1
RISC-V rv64gc clang 14.0.0
RISC-V rv64gc clang 9.0.0
RISC-V rv64gc clang 9.0.1
RISC-V rv64gc gcc 10.2.0
RISC-V rv64gc gcc 10.3.0
RISC-V rv64gc gcc 11.2.0
RISC-V rv64gc gcc 11.3.0
RISC-V rv64gc gcc 12.1.0
RISC-V rv64gc gcc 8.2.0
RISC-V rv64gc gcc 8.5.0
RISC-V rv64gc gcc 9.4.0
Raspbian Buster
Raspbian Stretch
SDCC 4.0.0
SDCC 4.1.0
TCC (trunk)
TCC 0.9.27
WebAssembly clang (trunk)
Xtensa ESP32 gcc 8.2.0 (2019r2)
Xtensa ESP32 gcc 8.2.0 (2020r1)
Xtensa ESP32 gcc 8.2.0 (2020r2)
Xtensa ESP32 gcc 8.4.0 (2020r3)
Xtensa ESP32 gcc 8.4.0 (2021r1)
Xtensa ESP32 gcc 8.4.0 (2021r2)
Xtensa ESP32-S2 gcc 8.2.0 (2019r2)
Xtensa ESP32-S2 gcc 8.2.0 (2020r1)
Xtensa ESP32-S2 gcc 8.2.0 (2020r2)
Xtensa ESP32-S2 gcc 8.4.0 (2020r3)
Xtensa ESP32-S2 gcc 8.4.0 (2021r1)
Xtensa ESP32-S2 gcc 8.4.0 (2021r2)
Xtensa ESP32-S3 gcc 8.4.0 (2020r3)
Xtensa ESP32-S3 gcc 8.4.0 (2021r1)
Xtensa ESP32-S3 gcc 8.4.0 (2021r2)
arm64 msvc v19.28 VS16.9
arm64 msvc v19.29 VS16.10
arm64 msvc v19.29 VS16.11
arm64 msvc v19.30
arm64 msvc v19.31
arm64 msvc v19.32
arm64 msvc v19.latest
armv7-a clang (trunk)
armv7-a clang 10.0.0
armv7-a clang 10.0.1
armv7-a clang 11.0.0
armv7-a clang 11.0.1
armv7-a clang 9.0.0
armv7-a clang 9.0.1
armv8-a clang (trunk)
armv8-a clang (trunk, all architectural features)
armv8-a clang 10.0.0
armv8-a clang 10.0.1
armv8-a clang 11.0.0
armv8-a clang 11.0.1
armv8-a clang 9.0.0
armv8-a clang 9.0.1
cproc-master
mips (el) gcc 5.4
mips gcc 11.2.0
mips gcc 5.4
mips gcc 9.3.0
mips64 (el) gcc 5.4
mips64 gcc 11.2.0
mips64 gcc 5.4
nanoMIPS gcc 6.3.0
power gcc 11.2.0
power gcc 4.8.5
power64 AT12.0 (gcc8)
power64 AT13.0 (gcc9)
power64 gcc 11.2.0
power64le AT12.0 (gcc8)
power64le AT13.0 (gcc9)
power64le clang (trunk)
power64le gcc 11.2.0
power64le gcc 6.3.0
powerpc64 clang (trunk)
ppci 0.5.5
s390x GCC 11.2.0
x64 msvc v19.0 (WINE)
x64 msvc v19.10 (WINE)
x64 msvc v19.14
x64 msvc v19.14 (WINE)
x64 msvc v19.15
x64 msvc v19.16
x64 msvc v19.20
x64 msvc v19.21
x64 msvc v19.22
x64 msvc v19.23
x64 msvc v19.24
x64 msvc v19.25
x64 msvc v19.26
x64 msvc v19.27
x64 msvc v19.28
x64 msvc v19.28 VS16.9
x64 msvc v19.29 VS16.10
x64 msvc v19.29 VS16.11
x64 msvc v19.30
x64 msvc v19.31
x64 msvc v19.32
x64 msvc v19.latest
x86 gcc 1.27
x86 msvc v19.0 (WINE)
x86 msvc v19.10 (WINE)
x86 msvc v19.14
x86 msvc v19.14 (WINE)
x86 msvc v19.15
x86 msvc v19.16
x86 msvc v19.20
x86 msvc v19.21
x86 msvc v19.22
x86 msvc v19.23
x86 msvc v19.24
x86 msvc v19.25
x86 msvc v19.26
x86 msvc v19.27
x86 msvc v19.28
x86 msvc v19.28 VS16.9
x86 msvc v19.29 VS16.10
x86 msvc v19.29 VS16.11
x86 msvc v19.30
x86 msvc v19.31
x86 msvc v19.32
x86 msvc v19.latest
x86 tendra (trunk)
x86-64 clang (assertions trunk)
x86-64 clang (thephd.dev)
x86-64 clang (trunk)
x86-64 clang (widberg)
x86-64 clang 10.0.0
x86-64 clang 10.0.1
x86-64 clang 11.0.0
x86-64 clang 11.0.1
x86-64 clang 12.0.0
x86-64 clang 12.0.1
x86-64 clang 13.0.0
x86-64 clang 13.0.1
x86-64 clang 14.0.0
x86-64 clang 3.0.0
x86-64 clang 3.1
x86-64 clang 3.2
x86-64 clang 3.3
x86-64 clang 3.4.1
x86-64 clang 3.5
x86-64 clang 3.5.1
x86-64 clang 3.5.2
x86-64 clang 3.6
x86-64 clang 3.7
x86-64 clang 3.7.1
x86-64 clang 3.8
x86-64 clang 3.8.1
x86-64 clang 3.9.0
x86-64 clang 3.9.1
x86-64 clang 4.0.0
x86-64 clang 4.0.1
x86-64 clang 5.0.0
x86-64 clang 5.0.1
x86-64 clang 5.0.2
x86-64 clang 6.0.0
x86-64 clang 6.0.1
x86-64 clang 7.0.0
x86-64 clang 7.0.1
x86-64 clang 7.1.0
x86-64 clang 8.0.0
x86-64 clang 8.0.1
x86-64 clang 9.0.0
x86-64 clang 9.0.1
x86-64 gcc (static analysis)
x86-64 gcc (trunk)
x86-64 gcc 10.1
x86-64 gcc 10.2
x86-64 gcc 10.3
x86-64 gcc 10.4
x86-64 gcc 11.1
x86-64 gcc 11.2
x86-64 gcc 11.3
x86-64 gcc 12.1
x86-64 gcc 4.1.2
x86-64 gcc 4.4.7
x86-64 gcc 4.5.3
x86-64 gcc 4.6.4
x86-64 gcc 4.7.1
x86-64 gcc 4.7.2
x86-64 gcc 4.7.3
x86-64 gcc 4.7.4
x86-64 gcc 4.8.1
x86-64 gcc 4.8.2
x86-64 gcc 4.8.3
x86-64 gcc 4.8.4
x86-64 gcc 4.8.5
x86-64 gcc 4.9.0
x86-64 gcc 4.9.1
x86-64 gcc 4.9.2
x86-64 gcc 4.9.3
x86-64 gcc 4.9.4
x86-64 gcc 5.1
x86-64 gcc 5.2
x86-64 gcc 5.3
x86-64 gcc 5.4
x86-64 gcc 6.1
x86-64 gcc 6.2
x86-64 gcc 6.3
x86-64 gcc 7.1
x86-64 gcc 7.2
x86-64 gcc 7.3
x86-64 gcc 7.4
x86-64 gcc 8.1
x86-64 gcc 8.2
x86-64 gcc 8.3
x86-64 gcc 8.4
x86-64 gcc 8.5
x86-64 gcc 9.1
x86-64 gcc 9.2
x86-64 gcc 9.3
x86-64 gcc 9.4
x86-64 gcc 9.5
x86-64 icc 13.0.1
x86-64 icc 16.0.3
x86-64 icc 17.0.0
x86-64 icc 18.0.0
x86-64 icc 19.0.0
x86-64 icc 19.0.1
x86-64 icc 2021.1.2
x86-64 icc 2021.2.0
x86-64 icc 2021.3.0
x86-64 icc 2021.4.0
x86-64 icc 2021.5.0
x86-64 icx 2021.1.2
x86-64 icx 2021.2.0
x86-64 icx 2021.3.0
x86-64 icx 2021.4.0
x86-64 icx 2022.0.0
zig cc 0.6.0
zig cc 0.7.0
zig cc 0.7.1
zig cc 0.8.0
zig cc 0.9.0
zig cc trunk
Options
Source code
#include <immintrin.h> #include <stdint.h> static inline __m256i hsum_epu8_epu64(__m256i v) { return _mm256_sad_epu8(v, _mm256_setzero_si256()); // SAD against zero is a handy trick } static inline uint64_t hsum_epu64_scalar(__m256i v) { __m128i lo = _mm256_castsi256_si128(v); __m128i hi = _mm256_extracti128_si256(v, 1); __m128i sum2x64 = _mm_add_epi64(lo, hi); // narrow to 128 hi = _mm_unpackhi_epi64(sum2x64, sum2x64); __m128i sum = _mm_add_epi64(hi, sum2x64); // narrow to 64 return _mm_cvtsi128_si64(sum); } unsigned long long char_count_AVX2(char const* vector, size_t size, char c) { __m256i C=_mm256_set1_epi8(c); // todo: count elements and increment `vector` until it is aligned to 256bytes, or at least 32 __m256i const * simd_vector = (__m256i const *) vector; // *simd_vector is an alignment-required load, unlike _mm256_loadu_si256() __m256i sum64 = _mm256_setzero_si256(); size_t unrolled_size_limit = size - 4*255*32 + 1; for(size_t k=0; k<unrolled_size_limit ; k+=4*255*32) // outer loop: TODO { __m256i counter[4]; // multiple counter registers to hide latencies for(int j=0; j<4; j++) counter[j]=_mm256_setzero_si256(); // inner loop: make sure that you don't go beyond the data you can read for(int i=0; i<255; ++i) { // or limit this inner loop to ~22 to avoid branch mispredicts for(int j=0; j<4; ++j) { counter[j]=_mm256_sub_epi8(counter[j], // count -= 0 or -1 _mm256_cmpeq_epi8(*simd_vector, C)); ++simd_vector; } } // only need one outer accumulator: OoO exec hides the latency of adding into it sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[0])); sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[1])); sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[2])); sum64 = _mm256_add_epi64(sum64, hsum_epu8_epu64(counter[3])); } uint64_t sum = hsum_epu64_scalar(sum64); // TODO add up remaining bytes with sum. // Including a rolled-up vector loop before going scalar // because we're potentially a *long* way from the end // Maybe put some logic into the main loop to shorten the 255 inner iterations // if we're close to the end. A little bit of scalar work there shouldn't hurt every 255 iters. return sum; }
Become a Patron
Sponsor on GitHub
Donate via PayPal
Source on GitHub
Mailing list
Installed libraries
Wiki
Report an issue
How it works
Contact the author
About the author
Changelog
Version tree