| #
90d2215b
|
| 12-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Add the load-balancing kernel for MatMultAdd_SeqSELL and fine tune the heuristic
Kernel7 is significantly slower than kernel9x in the following two cases:
- nrows is too small. Kernel7 uses 2 threads per row (assuming sliceheight=16), so it does not fully utilize the GPU if nrows < 100K.
- maxslicewidth is too big.
Thanks-to: Peng Wang <penwang@nvidia.com>
|
| #
1f0d1278
|
| 08-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Save some operations when transposing
|
| #
cca9ff8b
|
| 05-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Add new kernels to MatMultAdd
|
| #
a9dd396c
|
| 05-Jan-2021 |
Hong Zhang <hongzhang@anl.gov> |
Remove input argument totalslices for kernels 2-6
Kernels 2-6 are used only for performance comparison
|
| #
4e58db63
|
| 31-Dec-2020 |
Hong Zhang <hongzhang@anl.gov> |
Make slice height more flexible
- The slice height no longer has to match the device memory alignment; it just needs to be divisible by DEVICE_MEM_ALIGN
- Pad each slice with extra columns to achieve coalesced memory access if needed
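The padding step can be illustrated with a small sketch. It assumes the aim is for every slice to occupy a multiple of the alignment, so that the next slice starts at an aligned offset; the commit message does not spell out the exact rule. The function name `padded_slice_width` and the alignment value 32 used in the example are hypothetical, not taken from the PETSc source.

```python
import math


def padded_slice_width(width, slice_height, align):
    """Return the smallest width >= `width` such that
    slice_height * width is a multiple of `align`.

    Assumption: padding whole columns is the only way a slice grows.
    """
    # Widths that satisfy the alignment are exactly the multiples of `step`.
    step = align // math.gcd(slice_height, align)
    # Round `width` up to the next multiple of `step`.
    return -(-width // step) * step


# Example with a hypothetical alignment of 32 entries:
# a slice of height 16 and width 3 holds 48 entries, so one padding
# column is added to reach 64.
print(padded_slice_width(3, 16, 32))
```

When the slice height is itself a multiple of the alignment, `step` is 1 and no padding is ever added.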
|
| #
07e43b41
|
| 10-Sep-2020 |
Hong Zhang <hongzhang@anl.gov> |
Further optimization of MatMult_SeqSELLCUDA
- Add more kernels
- Use multiple threads per row for matrices with narrow slices
- Use multiple blocks per slice for matrices with wide slices
- Add three new APIs to return the irregularity ratio, the maximum slice width and the average slice width
Experiments show that column blocking gives much worse performance for wide matrices, and that permutation based on slice width has almost no impact on performance.
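As a rough illustration of the slice-width statistics mentioned above, the sketch below computes the maximum and average slice width from per-row nonzero counts. It assumes the usual sliced-ELLPACK convention that a slice's width is the length of its longest row. The function names are hypothetical and are not the PETSc APIs; the irregularity ratio is left out because the commit message does not define it.

```python
def slice_widths(row_nnz, slice_height):
    """Width of each slice = longest row among the rows it contains."""
    return [
        max(row_nnz[start:start + slice_height])
        for start in range(0, len(row_nnz), slice_height)
    ]


def max_and_avg_slice_width(row_nnz, slice_height):
    """Return (maximum slice width, average slice width)."""
    widths = slice_widths(row_nnz, slice_height)
    return max(widths), sum(widths) / len(widths)


# Five rows with 1, 2, 7, 1 and 3 nonzeros, slice height 2:
# the slices cover rows {0,1}, {2,3}, {4} and have widths 2, 7, 3.
print(max_and_avg_slice_width([1, 2, 7, 1, 3], 2))
```

A large gap between the two numbers indicates a few very wide slices among mostly narrow ones, which is the situation the narrow-slice and wide-slice kernels address separately.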
|
| #
2d1451d4
|
| 09-Jan-2020 |
Hong Zhang <hongzhang@anl.gov> |
Initial commit for porting SELL to GPU
- Add tiled SpMV and basic SpMV for SeqSELL
- Tested in serial
- Offloadmask is used to determine when the matrix should be copied to GPU
- Use different slice height for CUDA version
- By checking the nonzerostate, PETSc can decide whether the whole matrix needs to be copied or just the values need to be copied
- Make the convert function public so that the very slow MatConvert_Basic can be avoided sometimes. E.g. one can use a two-step convert method: AIJ->SELL, SELL->SELLCUDA instead of the direct convert AIJ->SELLCUDA
- Make the FLOPS count for SELL the same as that for AIJCUSPARSE.
- MatDisAssemble is not needed.
- Change slice height from 32 to 16 for GPU
- To overlap communication with MatMult, VecScatterBegin() should be called before MatMult() for the diagonal part.
- SLICE_HEIGHT is defined to be 32 to match the warp size of GPU. For other cases, it is still 8.
Funded-by:
Project: PETSc for GPU
Time: 42 hours
Reported-by:
Thanks-to:
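For readers unfamiliar with the SELL (sliced ELLPACK) layout these kernels operate on, here is a minimal CPU reference sketch in plain Python. It is not the PETSc implementation: `build_sell` and `sell_spmv` are hypothetical helper names, padding entries are assumed to carry value 0.0 and column index 0, and entries within a slice are assumed to be stored column-major so that consecutive rows are adjacent in memory.

```python
def build_sell(dense, slice_height):
    """Pack a dense row-major matrix into sliced-ELLPACK arrays."""
    nrows = len(dense)
    vals, cols, slice_ptr = [], [], [0]
    for start in range(0, nrows, slice_height):
        # Nonzeros of each row in this slice; rows past the end are empty.
        rows = [
            [(j, v) for j, v in enumerate(dense[i]) if v != 0] if i < nrows else []
            for i in range(start, start + slice_height)
        ]
        width = max(len(r) for r in rows)
        # Column-major within the slice: the k-th entry of every row is contiguous.
        for k in range(width):
            for r in rows:
                j, v = r[k] if k < len(r) else (0, 0.0)  # explicit zero padding
                cols.append(j)
                vals.append(v)
        slice_ptr.append(len(vals))
    return vals, cols, slice_ptr


def sell_spmv(vals, cols, slice_ptr, slice_height, nrows, x):
    """Compute y = A x for a matrix stored by build_sell."""
    y = [0.0] * nrows
    for s in range(len(slice_ptr) - 1):
        base = slice_ptr[s]
        width = (slice_ptr[s + 1] - base) // slice_height
        for r in range(slice_height):
            i = s * slice_height + r
            if i >= nrows:
                break  # padding rows of the last slice
            acc = 0.0
            for k in range(width):
                idx = base + k * slice_height + r
                acc += vals[idx] * x[cols[idx]]
            y[i] = acc
    return y


A = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
vals, cols, ptr = build_sell(A, 2)
print(sell_spmv(vals, cols, ptr, 2, 3, [1.0, 1.0, 1.0]))
```

On a GPU, the loop over `r` maps naturally onto threads, which is why the column-major ordering inside a slice gives adjacent threads adjacent memory addresses.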
|