alpaka
Abstraction Library for Parallel Kernel Acceleration
|
Namespaces | |
trait | |
The warp traits. | |
Classes | |
struct | ConceptWarp |
class | WarpSingleThread |
The single-threaded warp to emulate it on CPUs. More... | |
class | WarpUniformCudaHipBuiltIn |
The GPU CUDA/HIP warp. More... | |
Functions | |
template<typename TWarp > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | activemask (TWarp const &warp) -> decltype(trait::Activemask< interface::ImplementationBase< ConceptWarp, TWarp >>::activemask(warp)) |
Returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if the Nth thread of the warp is active. More... | |
template<typename TWarp > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | all (TWarp const &warp, std::int32_t predicate) -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them. More... | |
template<typename TWarp > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | any (TWarp const &warp, std::int32_t predicate) -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them. More... | |
template<typename TWarp > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | ballot (TWarp const &warp, std::int32_t predicate) |
Evaluates predicate for all non-exited threads in a warp and returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active. More... | |
template<typename TWarp > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | getSize (TWarp const &warp) -> std::int32_t |
Returns warp size. More... | |
template<typename TWarp , typename T > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl (TWarp const &warp, T value, std::int32_t srcLane, std::int32_t width=0) |
Exchange data between threads within a warp. More... | |
template<typename TWarp , typename T > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_down (TWarp const &warp, T value, std::uint32_t offset, std::int32_t width=0) |
Exchange data between threads within a warp. It copies from a lane with higher ID relative to caller. The lane ID is calculated by adding delta to the caller’s lane ID. More... | |
template<typename TWarp , typename T > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_up (TWarp const &warp, T value, std::uint32_t offset, std::int32_t width=0) |
Exchange data between threads within a warp. It copies from a lane with lower ID relative to caller. The lane ID is calculated by subtracting delta from the caller’s lane ID. More... | |
template<typename TWarp , typename T > | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto | shfl_xor (TWarp const &warp, T value, std::int32_t mask, std::int32_t width=0) |
Exchange data between threads within a warp. It copies from a lane based on bitwise XOR of own lane ID. The lane ID is calculated by performing a bitwise XOR of the caller’s lane ID with mask. More... | |
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::activemask | ( | TWarp const & | warp | ) | -> decltype(trait::Activemask<interface::ImplementationBase<ConceptWarp, TWarp>>::activemask(warp)) |
Returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if the Nth thread of the warp is active.
Note: decltype for return type is required there, otherwise compilcation with a CPU and a GPU accelerator enabled fails as it tries to call device function from a host-device one. The reason is unclear, but likely related to deducing the return type.
Note:
TWarp | The warp implementation type. |
warp | The warp implementation. |
Definition at line 90 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::all | ( | TWarp const & | warp, |
std::int32_t | predicate | ||
) | -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them.
It follows the logic of __all(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __all_sync(__activemask(), predicate).
Note:
TWarp | The warp implementation type. |
warp | The warp implementation. |
predicate | The predicate value for current thread. |
Definition at line 114 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::any | ( | TWarp const & | warp, |
std::int32_t | predicate | ||
) | -> std::int32_t |
Evaluates predicate for all active threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them.
It follows the logic of __any(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __any_sync(__activemask(), predicate).
Note:
TWarp | The warp implementation type. |
warp | The warp implementation. |
predicate | The predicate value for current thread. |
Definition at line 137 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::ballot | ( | TWarp const & | warp, |
std::int32_t | predicate | ||
) |
Evaluates predicate for all non-exited threads in a warp and returns a 32- or 64-bit unsigned integer (depending on the accelerator) whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
It follows the logic of __ballot(predicate) in CUDA before version 9.0 and HIP, the operation is applied for all active threads. The modern CUDA counterpart would be __ballot_sync(__activemask(), predicate). Return type is 64-bit to fit all platforms.
Note:
TWarp | The warp implementation type. |
warp | The warp implementation. |
predicate | The predicate value for current thread. |
Definition at line 164 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::getSize | ( | TWarp const & | warp | ) | -> std::int32_t |
Returns warp size.
TWarp | The warp implementation type. |
warp | The warp implementation. |
Definition at line 65 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl | ( | TWarp const & | warp, |
T | value, | ||
std::int32_t | srcLane, | ||
std::int32_t | width = 0 |
||
) |
Exchange data between threads within a warp.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return values[width*(threadIdx.x/width) + srcLane%width];
However, it does not use shared memory.
Notes:
TWarp | warp implementation type |
warp | warp implementation |
value | value to broadcast (only meaningful from threadIdx == srcLane) |
srcLane | source lane sending value |
width | number of threads receiving a single value |
Definition at line 198 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_down | ( | TWarp const & | warp, |
T | value, | ||
std::uint32_t | offset, | ||
std::int32_t | width = 0 |
||
) |
Exchange data between threads within a warp. It copies from a lane with higher ID relative to caller. The lane ID is calculated by adding delta to the caller’s lane ID.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return (threadIdx.x % width + delta < width) ? values[threadIdx.x + delta] : values[threadIdx.x];
However, it does not use shared memory.
Notes:
TWarp | warp implementation type |
T | value type |
warp | warp implementation |
value | value to broadcast |
offset | corresponds to the delta used to compute the lane ID |
width | size of the group participating in the shuffle operation |
Definition at line 274 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_up | ( | TWarp const & | warp, |
T | value, | ||
std::uint32_t | offset, | ||
std::int32_t | width = 0 |
||
) |
Exchange data between threads within a warp. It copies from a lane with lower ID relative to caller. The lane ID is calculated by subtracting delta from the caller’s lane ID.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); return (threadIdx.x % width >= delta) ? values[threadIdx.x - delta] : values[threadIdx.x];
However, it does not use shared memory.
Notes:
TWarp | warp implementation type |
T | value type |
warp | warp implementation |
value | value to broadcast |
offset | corresponds to the delta used to compute the lane ID |
width | size of the group participating in the shuffle operation |
Definition at line 236 of file Traits.hpp.
ALPAKA_NO_HOST_ACC_WARNING ALPAKA_FN_ACC auto alpaka::warp::shfl_xor | ( | TWarp const & | warp, |
T | value, | ||
std::int32_t | mask, | ||
std::int32_t | width = 0 |
||
) |
Exchange data between threads within a warp. It copies from a lane based on bitwise XOR of own lane ID. The lane ID is calculated by performing a bitwise XOR of the caller’s lane ID with mask.
Effectively executes:
__shared__ int32_t values[warpsize]; values[threadIdx.x] = value; __syncthreads(); int lane = threadIdx.x ^ mask; return values[lane / width > threadIdx.x / width ? threadIdx.x : lane];
However, it does not use shared memory.
Notes:
TWarp | warp implementation type |
T | value type |
warp | warp implementation |
value | value to broadcast |
mask | corresponds to the mask used to compute the lane ID |
width | size of the group participating in the shuffle operation |
Definition at line 312 of file Traits.hpp.