![]() |
alpaka
Abstraction Library for Parallel Kernel Acceleration
|
Namespaces | |
| namespace | trait |
| The warp traits. | |
Classes | |
| struct | ConceptWarp |
| class | WarpSingleThread |
| The single-threaded warp to emulate it on CPUs. More... | |
| class | WarpUniformCudaHipBuiltIn |
| The GPU CUDA/HIP warp. More... | |
Functions | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | activemask (TWarp const &warp) -> decltype(trait::Activemask< interface::ImplementationBase< ConceptWarp, TWarp > >::activemask(warp)) |
| Returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if the Nth thread of the warp is active. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | all (TWarp const &warp, std::int32_t predicate) -> std::int32_t |
| Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | any (TWarp const &warp, std::int32_t predicate) -> std::int32_t |
| Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | ballot (TWarp const &warp, std::int32_t predicate) |
| Evaluates predicate for all non-exited threads in a warp and returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | getSize (TWarp const &warp) -> std::int32_t |
| Returns warp size. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC constexpr auto | getSizeCompileTime () -> std::int32_t |
| If the warp size is available as a compile-time constant returns its value; otherwise returns 0. | |
| template<typename TWarp > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC constexpr auto | getSizeUpperLimit () -> std::int32_t |
| If the warp size is available as a compile-time constant returns its value; otherwise returns an upper limit on the possible warp size values. | |
| template<typename TWarp , typename T > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl (TWarp const &warp, T value, std::int32_t srcLane, std::int32_t width=0) |
| Exchange data between threads within a warp. | |
| template<typename TWarp , typename T > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_down (TWarp const &warp, T value, std::uint32_t offset, std::int32_t width=0) |
| Exchange data between threads within a warp. It copies from a lane with higher ID relative to caller. The lane ID is calculated by adding delta to the caller’s lane ID. | |
| template<typename TWarp , typename T > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_up (TWarp const &warp, T value, std::uint32_t offset, std::int32_t width=0) |
| Exchange data between threads within a warp. It copies from a lane with lower ID relative to caller. The lane ID is calculated by subtracting delta from the caller’s lane ID. | |
| template<typename TWarp , typename T > | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_xor (TWarp const &warp, T value, std::int32_t mask, std::int32_t width=0) |
| Exchange data between threads within a warp. It copies from a lane based on bitwise XOR of own lane ID. The lane ID is calculated by performing a bitwise XOR of the caller’s lane ID with mask. | |
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::activemask | ( | TWarp const & | warp | ) | -> decltype(trait::Activemask<interface::ImplementationBase<ConceptWarp, TWarp>>::activemask(warp)) |
Returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if the Nth thread of the warp is active.
Note: decltype for return type is required there, otherwise compilation with a CPU and a GPU accelerator enabled fails as it tries to call device function from a host-device one. The reason is unclear, but likely related to deducing the return type.
Note:
| TWarp | The warp implementation type. |
| warp | The warp implementation. |
Definition at line 121 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::all | ( | TWarp const & | warp, |
| std::int32_t | predicate | ||
| ) | -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them.
It follows the logic of __all(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __all_sync(__activemask(), predicate).
Note:
| TWarp | The warp implementation type. |
| warp | The warp implementation. |
| predicate | The predicate value for current thread. |
Definition at line 145 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::any | ( | TWarp const & | warp, |
| std::int32_t | predicate | ||
| ) | -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them.
It follows the logic of __any(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __any_sync(__activemask(), predicate).
Note:
| TWarp | The warp implementation type. |
| warp | The warp implementation. |
| predicate | The predicate value for current thread. |
Definition at line 168 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::ballot | ( | TWarp const & | warp, |
| std::int32_t | predicate | ||
| ) |
Evaluates predicate for all non-exited threads in a warp and returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
It follows the logic of __ballot(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __ballot_sync(__activemask(), predicate). Return type is 64-bit to fit all platforms.
Note:
| TWarp | The warp implementation type. |
| warp | The warp implementation. |
| predicate | The predicate value for current thread. |
Definition at line 195 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::getSize | ( | TWarp const & | warp | ) | -> std::int32_t |
Returns warp size.
| TWarp | The warp implementation type. |
| warp | The warp implementation. |
Definition at line 73 of file Traits.hpp.
|
constexpr |
If the warp size is available as a compile-time constant returns its value; otherwise returns 0.
| TWarp | The warp implementation type. |
Definition at line 84 of file Traits.hpp.
|
constexpr |
If the warp size is available as a compile-time constant returns its value; otherwise returns an upper limit on the possible warp size values.
| TWarp | The warp implementation type. |
Definition at line 96 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl | ( | TWarp const & | warp, |
| T | value, | ||
| std::int32_t | srcLane, | ||
| std::int32_t | width = 0 |
||
| ) |
Exchange data between threads within a warp.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return values[width*(threadIdx.x/width) + srcLane%width];
However, it does not use shared memory.
Notes:
| TWarp | warp implementation type |
| warp | warp implementation |
| value | value to broadcast (only meaningful from threadIdx == srcLane) |
| srcLane | source lane sending value |
| width | number of threads receiving a single value |
Definition at line 229 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_down | ( | TWarp const & | warp, |
| T | value, | ||
| std::uint32_t | offset, | ||
| std::int32_t | width = 0 |
||
| ) |
Exchange data between threads within a warp. It copies from a lane with higher ID relative to caller. The lane ID is calculated by adding delta to the caller’s lane ID.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return (threadIdx.x % width + delta < width) ? values[threadIdx.x + delta] : values[threadIdx.x];
However, it does not use shared memory.
Notes:
| TWarp | warp implementation type |
| T | value type |
| warp | warp implementation |
| value | value to broadcast |
| offset | corresponds to the delta used to compute the lane ID |
| width | size of the group participating in the shuffle operation |
Definition at line 305 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_up | ( | TWarp const & | warp, |
| T | value, | ||
| std::uint32_t | offset, | ||
| std::int32_t | width = 0 |
||
| ) |
Exchange data between threads within a warp. It copies from a lane with lower ID relative to caller. The lane ID is calculated by subtracting delta from the caller’s lane ID.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return (threadIdx.x % width >= delta) ? values[threadIdx.x - delta] : values[threadIdx.x];
However, it does not use shared memory.
Notes:
| TWarp | warp implementation type |
| T | value type |
| warp | warp implementation |
| value | value to broadcast |
| offset | corresponds to the delta used to compute the lane ID |
| width | size of the group participating in the shuffle operation |
Definition at line 267 of file Traits.hpp.
| ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_xor | ( | TWarp const & | warp, |
| T | value, | ||
| std::int32_t | mask, | ||
| std::int32_t | width = 0 |
||
| ) |
Exchange data between threads within a warp. It copies from a lane based on bitwise XOR of own lane ID. The lane ID is calculated by performing a bitwise XOR of the caller’s lane ID with mask.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); int lane = threadIdx.x ^ mask; return values[lane / width > threadIdx.x / width ? threadIdx.x : lane];
However, it does not use shared memory.
Notes:
| TWarp | warp implementation type |
| T | value type |
| warp | warp implementation |
| value | value to broadcast |
| mask | corresponds to the mask used to compute the lane ID |
| width | size of the group participating in the shuffle operation |
Definition at line 343 of file Traits.hpp.