| b06137fd | 27-Jun-2013 |
Paul Mullowney <paulm@txcorp.com> |
Removing TXPETSCGPU from veccusp and mpiaijcusparse
In this next step of removing TXPETSCGPU, the host-device and device-host messaging code has been significantly simplified. In particular, the VecCUSPCopyToGPU/FromGPU methods now use cudaMemcpyAsync with a stream (followed by a stream synchronize). This never hurts, and it can help in the multi-GPU SpMV case because the data transfer will overlap with the MatMult kernel.
The more significant change comes in VecCUSPCopyToGPUSome and VecCUSPCopyFromGPUSome. These methods now move the smallest contiguous block of vector data containing ALL the requested indices in a single asynchronous data transfer. Then, the stream carrying the transfer is synchronized (not the entire device). While this can be wasteful in that it may move more data than strictly necessary, it has shown the best scalability performance across a wide range of matrices. Lastly, the simplicity of the code is a significant advantage over the old way of doing the data transfer. Some old code in these methods is "if 0"-ed out for reference and will be cleaned up later.
One final optimization in the vector code involves registering the host buffer as page-locked, which is done in VecCUSPAllocateCheck. The buffer must then be unregistered in VecDestroy_SeqCUSP. This shows a nice speedup in the data transfer for a parallel MatMult.
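The transfer pattern described above can be sketched as follows. This is a hypothetical illustration of the technique, not the actual PETSc source; the function name copyToGPUAsync is invented for the example.

```c
/* Sketch (not PETSc source): asynchronous host-to-device copy on a
   stream, synchronizing only that stream rather than the whole device. */
#include <cuda_runtime.h>

/* Copy n doubles from a host buffer to the device on `stream`.
   The copy can overlap with kernels queued on other streams. */
static cudaError_t copyToGPUAsync(double *d_v, const double *h_v,
                                  size_t n, cudaStream_t stream)
{
  cudaError_t err;
  err = cudaMemcpyAsync(d_v, h_v, n * sizeof(double),
                        cudaMemcpyHostToDevice, stream);
  if (err != cudaSuccess) return err;
  /* Synchronize only this stream, not the entire device. */
  return cudaStreamSynchronize(stream);
}
```

For the asynchronous copy to be truly asynchronous, the host buffer must be page-locked; registering an existing allocation with cudaHostRegister(h_v, n * sizeof(double), cudaHostRegisterDefault) at allocation time, and unregistering it with cudaHostUnregister(h_v) at destroy time, matches the VecCUSPAllocateCheck/VecDestroy_SeqCUSP pairing described above.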
Also in this commit, I am removing the TXPETSCGPU dependence from the mpiaijcusparse class; it now depends only on CUDA. In order for the same stream to be used in MatMult and MatMultAdd (necessary for an optimal multi-GPU SpMV), the stream is built in mpiaijcusparse and then passed into the seqaijcusparse data structure via a new method (MatCUSPARSESetStream). A similar method is added for the CUSPARSE library handle (context), as I think the stream needs to be attached to a particular context to work properly. When running in parallel on multiple GPUs, the references to the handle held by the seqaijcusparse objects are cleared from the mpiaijcusparse class with the method MatCUSPARSEClearHandle. Then, the mpiaijcusparse class deletes the handle.
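The handle/stream sharing described above boils down to the standard cuSPARSE pattern sketched below. This is an illustrative sketch, not the commit's code; only cusparseCreate, cudaStreamCreate, and cusparseSetStream are real library calls here.

```c
/* Sketch (not PETSc source): one stream attached to one cuSPARSE
   handle, so SpMV calls issued through that handle all run on it. */
#include <cuda_runtime.h>
#include <cusparse.h>

cusparseHandle_t handle;
cudaStream_t     stream;

cusparseCreate(&handle);
cudaStreamCreate(&stream);
/* Attach the stream to the handle (context); subsequent cuSPARSE
   calls made with `handle` execute on `stream`. */
cusparseSetStream(handle, stream);
/* The parallel class would then hand `handle` and `stream` down to its
   sequential diagonal/off-diagonal matrices, in the spirit of
   MatCUSPARSESetStream described above, and clear/destroy them itself. */
```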
One other non-trivial change was made to seqaijcusparse. The alpha and beta parameters to the SpMV are now device data owned by the Mat_SeqAIJCUSPARSEMultStruct structure. This enables slightly better multi-GPU performance, as this data does not need to be copied to the GPU at each kernel launch.
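Keeping the SpMV scalars resident on the device typically uses cuSPARSE's device pointer mode, sketched below. This is an assumption-laden illustration, not the commit's code; the variable names are invented, and `handle` is presumed already created.

```c
/* Sketch (not PETSc source): alpha/beta stored once in device memory so
   they are not re-copied to the GPU at every SpMV launch. */
#include <cuda_runtime.h>
#include <cusparse.h>

double *d_alpha, *d_beta;
const double one = 1.0, zero = 0.0;

cudaMalloc((void **)&d_alpha, sizeof(double));
cudaMalloc((void **)&d_beta,  sizeof(double));
cudaMemcpy(d_alpha, &one,  sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_beta,  &zero, sizeof(double), cudaMemcpyHostToDevice);

/* Tell cuSPARSE that scalar arguments are device pointers; the SpMV
   routines then read alpha/beta directly from GPU memory. */
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_DEVICE);
```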
Multi-GPU SpMV now works without TXPETSCGPU and the performance is recovered as tested on up to 4 GPUs. Code is valgrind clean and cuda-memcheck clean.
Results of the tests have been modified to have one less digit of precision; this yields consistent results across different GPUs. Lastly, the parallel test is set to run on a different matrix (shallow_water1) so that the iteration actually converges.
|
| aa372e3f | 20-Jun-2013 |
Paul Mullowney <paulm@txcorp.com> |
Removal of TXPETSCGPU package from the SEQAIJCUSPARSE class
In this commit, I've removed the dependence of the SEQAIJCUSPARSE class on the TXPETSCGPU package. However, other classes such as SEQAIJCUSP, VECCUSP, MPIAIJCUSPARSE, and MPIAIJCUSP still depend on that package. These dependencies will be removed in subsequent commits once the design and structure are agreed upon.
The reason for this dependency removal is that SEQAIJCUSPARSE only depends on the Nvidia CUSPARSE library, which comes standard with CUDA. Thus, the SEQAIJCUSPARSE class should be built whenever PETSc is built with CUDA support. This will be far more maintainable in the long term. Lastly, most of the CUSP dependencies have been removed from this class. The only remaining CUSP dependencies are in the vector data structures used in the MatMult* and MatSolve* methods. These will be removed in a subsequent branch, as it is not yet clear what the architecture should be.
In order to accommodate all the different functionality for the various Krylov solves, two new data structures were defined in cusparsematimpl.h. The first is a Mat_SeqAIJCUSPARSEMultStruct struct. This contains an opaque pointer for a matrix, a MatDescription data structure, and an indices vector which will be useful in the MatMultAdd functions. The second new data structure is a Mat_SeqAIJCUSPARSETriFactorStruct struct. This contains a CSR matrix pointer, a MatDescription data structure, a solve analysis data structure, and an operation type.
Next, Mat_SeqAIJCUSPARSETriFactors was redefined to hold pointers to up to 4 different Mat_SeqAIJCUSPARSETriFactorStruct structs: one for lower and one for upper solves for both ILU and ICC, and two more for lower and upper solves in algorithms that require a transpose, such as BiCG. The latter two are necessary, as far as I can tell, because one doesn't know until runtime whether data structures for the transpose are needed. Thus, those are created on demand. Indexing vectors for reorderings are also stored in Mat_SeqAIJCUSPARSETriFactors.
Lastly, Mat_SeqAIJCUSPARSE is the data structure that holds the data needed in the multiply. There are 2 pointers to Mat_SeqAIJCUSPARSEMultStruct structs, for MatMult and MatMultTranspose. Several auxiliary data structures, such as work vectors and a few other pieces of data needed for MatMult, are also stored here. One important variable, the cudaStream_t, is stored here but is not owned. Streams are necessary for the parallel SpMV (a subsequent commit will add code setting the stream variables from the MPIAIJCUSPARSE class), and the matrices used in MatMult and MatMultAdd will then use the same stream identifier to attain optimal performance. The MPIAIJCUSPARSE class will own the stream variable, which is then used in the SEQAIJCUSPARSE methods.
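The shapes of the structs described above can be sketched roughly as below. This is a hypothetical reconstruction for illustration only; the field names are invented and will not match the actual cusparsematimpl.h, and the fifth factor-related detail (reordering index vectors) is indicated only by a comment.

```c
/* Sketch (not the actual cusparsematimpl.h): plausible layouts for the
   structs described above, using CUDA-5-era cuSPARSE types. */
#include <cusparse.h>

typedef struct {
  void              *mat;          /* opaque pointer to the matrix        */
  cusparseMatDescr_t descr;        /* matrix description                  */
  void              *cprowIndices; /* indices used by MatMultAdd          */
} Mat_SeqAIJCUSPARSEMultStruct;

typedef struct {
  void                       *csrMat;    /* CSR matrix for this factor    */
  cusparseMatDescr_t          descr;     /* matrix description            */
  cusparseSolveAnalysisInfo_t solveInfo; /* solve analysis data           */
  cusparseOperation_t         solveOp;   /* operation (transpose or not)  */
} Mat_SeqAIJCUSPARSETriFactorStruct;

typedef struct {
  Mat_SeqAIJCUSPARSETriFactorStruct *loTriFactor;   /* lower, ILU/ICC     */
  Mat_SeqAIJCUSPARSETriFactorStruct *upTriFactor;   /* upper, ILU/ICC     */
  Mat_SeqAIJCUSPARSETriFactorStruct *loTriFactorT;  /* for BiCG; created  */
  Mat_SeqAIJCUSPARSETriFactorStruct *upTriFactorT;  /*   on demand        */
  /* indexing vectors for reorderings also live here */
} Mat_SeqAIJCUSPARSETriFactors;

typedef struct {
  Mat_SeqAIJCUSPARSEMultStruct *mat;          /* for MatMult              */
  Mat_SeqAIJCUSPARSEMultStruct *matTranspose; /* for MatMultTranspose     */
  cudaStream_t                  stream;       /* borrowed, not owned here */
  /* work vectors and other MatMult auxiliaries */
} Mat_SeqAIJCUSPARSE;
```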
In matregis.c as well as the petscmat.h and finclude/petscmat.h, I've changed the dependency of SEQAIJCUSPARSE to be on CUDA and not TXPETSCGPU.
The test series TESTEXAMPLES_TXPETSCGPU has been changed to TESTEXAMPLES_CUDA, since SEQAIJCUSPARSE only depends on CUDA as discussed above. ksp/ksp/examples/tests/ex43-aijcusparse.c has been renamed to ksp/ksp/examples/tests/ex43.c, the targets in the makefile have been changed appropriately, and the results files are renamed. Two new test targets were added in ksp/ksp/examples/tests/makefile that test aijcusparse using bicg (and thus the MatSolveTranspose and MatMultTranspose methods), as well as bicg with reordering. The previous results for runex43_2.out (formerly runex43-aijcusparse_2.out) were wrong, so I'm committing new results that agree with the CPU-based computation. The code is valgrind clean and cuda-memcheck clean.
|