| 07e43b41 | 10-Sep-2020 |
Hong Zhang <hongzhang@anl.gov> |
Further optimization of MatMult_SeqSELLCUDA
- Add more kernels - Use multiple threads per row for matrices with narrow slices - Use multiple blocks per slice for matrices with wide slices - Add thre
Further optimization of MatMult_SeqSELLCUDA
- Add more kernels - Use multiple threads per row for matrices with narrow slices - Use multiple blocks per slice for matrices with wide slices - Add three new APIs to return the irregularity ratio, the maximum slice width and the average slice width
Experiments show that column blocking gives much worse performance for wide matrices and permulation based on slice width has almost no impact on the performance.
show more ...
|