sellcuda.cu - OpenGrok history log for /petsc/src/mat/impls/sell/seq/seqcuda/sellcuda.cu

Revision	Date	Author	Comments
# 90d2215b	12-Jan-2021	Hong Zhang <hongzhang@anl.gov>	Add the load-balancing kernel for MatMultAdd_SeqSELL and fine-tune the heuristic. Kernel7 is significantly slower than kernel9x in the following two cases: - nrows is too small. Kernel7 uses 2 threads per row (assuming sliceheight=16), so it does not fully utilize the GPU if nrows < 100K. - maxslicewidth is too big. Thanks-to: Peng Wang <penwang@nvidia.com>
# 1f0d1278	08-Jan-2021	Hong Zhang <hongzhang@anl.gov>	Save some operations when transposing
# cca9ff8b	05-Jan-2021	Hong Zhang <hongzhang@anl.gov>	Add new kernels to MatMultAdd
# a9dd396c	05-Jan-2021	Hong Zhang <hongzhang@anl.gov>	Remove input argument totalslices for kernels 2-6. Kernels 2-6 are used simply for performance comparison.
# 4e58db63	31-Dec-2020	Hong Zhang <hongzhang@anl.gov>	Make slice height more flexible - The slice height now does not have to match device memory alignment; it just needs to be divisible by DEVICE_MEM_ALIGN - Pad each slice with extra columns to achieve coalesced memory access if needed
# 07e43b41	10-Sep-2020	Hong Zhang <hongzhang@anl.gov>	Further optimization of MatMult_SeqSELLCUDA - Add more kernels - Use multiple threads per row for matrices with narrow slices - Use multiple blocks per slice for matrices with wide slices - Add three new APIs to return the irregularity ratio, the maximum slice width and the average slice width. Experiments show that column blocking gives much worse performance for wide matrices and permutation based on slice width has almost no impact on the performance.
# 2d1451d4	09-Jan-2020	Hong Zhang <hongzhang@anl.gov>	Initial commit for porting SELL to GPU - Add tiled SpMV and basic SpMV for SeqSELL - Tested in serial - Offloadmask is used to determine when the matrix should be copied to GPU - Use different slice height for CUDA version - By checking the nonzerostate, PETSc can decide if the whole matrix needs to be copied or just the values need to be copied - Make the convert function public so that the very slow MatConvert_Basic can be avoided sometimes. E.g. one can use a two-step convert method: AIJ->SELL, SELL->SELLCUDA instead of the direct convert AIJ->SELLCUDA - Make the FLOPS count for SELL the same as that for AIJCUSPARSE. - MatDisAssemble is not needed. - Change slice height from 32 to 16 for GPU - To overlap communication with MatMult, VecScatterBegin() should be called before MatMult() for the diagonal part. - SLICE_HEIGHT is defined to be 32 to match the warp size of GPU. For other cases, it is still 8. Funded-by: Project: PETSc for GPU Time: 42 hours Reported-by: Thanks-to: