| 4e58db63 | 31-Dec-2020 |
Hong Zhang <hongzhang@anl.gov> |
Make slice height more flexible
- The slice height no longer has to match the device memory alignment; it only needs to be divisible by DEVICE_MEM_ALIGN
- Pad each slice with extra columns, if needed, to achieve coalesced memory access
|
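The padding idea in the commit above can be illustrated with a minimal sketch. It assumes that padding means widening a slice until its element count (height times width) is a multiple of DEVICE_MEM_ALIGN, so that the next slice starts on an aligned boundary. The helper name `padded_slice_width` and the value 32 are hypothetical and chosen for illustration only; the actual code may pad differently.

```c
#include <assert.h>

/* Assumed alignment in elements; illustrative value, not taken from the commit. */
#define DEVICE_MEM_ALIGN 32

/* Hypothetical helper: return the smallest width >= `width` such that
 * slice_height * width is a multiple of DEVICE_MEM_ALIGN.
 * The loop always terminates because any width that is itself a
 * multiple of DEVICE_MEM_ALIGN satisfies the condition. */
static int padded_slice_width(int slice_height, int width)
{
  assert(slice_height > 0 && width >= 0);
  while ((slice_height * width) % DEVICE_MEM_ALIGN != 0) {
    width++; /* add one padding column */
  }
  return width;
}
```

Under these assumptions, a slice of height 8 and width 5 is padded to width 8 (64 elements), while a slice of height 32 needs no padding at any width.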
| 07e43b41 | 10-Sep-2020 |
Hong Zhang <hongzhang@anl.gov> |
Further optimization of MatMult_SeqSELLCUDA
- Add more kernels
- Use multiple threads per row for matrices with narrow slices
- Use multiple blocks per slice for matrices with wide slices
- Add three new APIs to return the irregularity ratio, the maximum slice width and the average slice width
Experiments show that column blocking gives much worse performance for wide matrices, and that permutation based on slice width has almost no impact on performance.
|