assembly source #1
Source code
global miraQueCoincidencia

section .rodata
; Grayscale weights, one per channel byte of a pixel
gris:   dd 0.114, 0.587, 0.299, 0.0
; Four floats holding white (255.0)
blanco: dd 255.0, 255.0, 255.0, 255.0

;########### TEXT SECTION (PROGRAM)
section .text

; (rdi: uint8_t* A, rsi: uint8_t* B, rdx: uint32_t N, rcx: uint8_t* laCoincidencia) -> void
miraQueCoincidencia:
    ; If N = 0 there is nothing to do
    test edx, edx
    jz .end

    ; r8d = high32(edx*edx)
    ; edx = low32(edx*edx)
    mulx r8d, edx, edx

    ; Process 16 bytes (4 pixels) per iteration
    shr edx, 2
    jz .end                         ; Fewer than 4 pixels: no full group to process

    ; Build all-ones registers
    pcmpeqd xmm7, xmm7
    mov r8d, 0xFFFFFFFF

    ; Load the grayscale conversion coefficients
    movups xmm6, [gris]

    ; Load a register full of white floats
    movups xmm5, [blanco]

.loop:
    movdqu xmm1, [rdi]
    movdqu xmm2, [rsi]
    movdqu xmm0, xmm1

    ; Check which pixels are equal
    pcmpeqd xmm0, xmm2

    ; Invert the mask (we want the pixels that differ)
    pxor xmm0, xmm7

    ; Check whether the mask is all ones (CF = 1 when all four pixels differ)
    ptest xmm0, xmm7
    cmovc eax, r8d                  ; All four differ: the result is four white bytes
    jc .write_pixels                ; All four differ: skip the grayscale computation

    ; Accumulator for the final result
    pxor xmm4, xmm4

    pmovzxbd xmm3, xmm1             ; Load one pixel, zero-extending each channel to a dword
    cvtdq2ps xmm3, xmm3             ; Convert each channel to float
    dpps xmm3, xmm6, 0b0111_0001    ; Weighted sum of lanes 0-2, stored in lane 0
    orps xmm4, xmm3                 ; Merge the result into the accumulator
    psrldq xmm1, 4                  ; Advance to the next pixel

    pmovzxbd xmm3, xmm1
    cvtdq2ps xmm3, xmm3
    dpps xmm3, xmm6, 0b0111_0010    ; Same sum, stored in lane 1
    orps xmm4, xmm3
    psrldq xmm1, 4

    pmovzxbd xmm3, xmm1
    cvtdq2ps xmm3, xmm3
    dpps xmm3, xmm6, 0b0111_0100    ; Same sum, stored in lane 2
    orps xmm4, xmm3
    psrldq xmm1, 4

    pmovzxbd xmm3, xmm1
    cvtdq2ps xmm3, xmm3
    dpps xmm3, xmm6, 0b0111_1000    ; Same sum, stored in lane 3
    orps xmm4, xmm3

    blendvps xmm4, xmm5             ; Replace with white the pixels that differ (xmm0 is the implicit mask)
    cvttps2dq xmm4, xmm4            ; Convert to integer (truncating)
    packusdw xmm4, xmm4             ; Pack dwords to words
    packuswb xmm4, xmm4             ; Pack words to bytes
    movd eax, xmm4                  ; Load the 4 resulting pixels into eax

.write_pixels:
    mov [rcx], eax
    add rdi, 16
    add rsi, 16
    add rcx, 4
    dec rdx
    jnz .loop

.end:
    ret
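
For checking the SIMD routine against something easier to read, here is a minimal scalar sketch in C of what the loop appears to compute. It is not part of the original listing. The name `coincidence_reference` is hypothetical, and the sketch assumes 4-byte pixels whose channel bytes line up with the `gris` weights, N*N pixels in total, and one output byte per pixel (255 where the two inputs differ, the truncated weighted sum where they match). Unlike the assembly, it also handles pixel counts that are not a multiple of 4, and floating-point rounding may differ slightly from `dpps` near integer boundaries.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar counterpart of miraQueCoincidencia (illustration only). */
void coincidence_reference(const uint8_t *a, const uint8_t *b,
                           uint32_t n, uint8_t *out)
{
    /* Same values as the `gris` table: one weight per channel byte. */
    static const float weights[4] = {0.114f, 0.587f, 0.299f, 0.0f};
    uint64_t pixels = (uint64_t)n * n;   /* assumption: an N x N image */

    for (uint64_t i = 0; i < pixels; ++i) {
        const uint8_t *pa = a + 4 * i;
        const uint8_t *pb = b + 4 * i;

        if (memcmp(pa, pb, 4) != 0) {
            out[i] = 255;                /* pixels differ: white */
            continue;
        }

        float gray = 0.0f;
        for (int c = 0; c < 4; ++c)
            gray += weights[c] * (float)pa[c];
        out[i] = (uint8_t)gray;          /* truncates, like cvttps2dq */
    }
}
```

Running both versions on the same small buffers and comparing the output bytes is a quick way to spot mask or lane-ordering mistakes in the vectorized code.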