| 2acdef25 | 25-Apr-2019 |
Junchao Zhang <jczhang@mcs.anl.gov> |
Fix a vector assembly bug wrt VEC_SUBSET_OFF_PROC_ENTRIES
The bug is related to how we know we can reuse the communication pattern built in one assembly (the first assembly) in subsequent assemblies.
The old code checked x->recvhdr to see whether the receive info was available. If it was, the code assumed the pattern had already been built and tried to reuse it; otherwise it assumed the pattern had not been built and tried to build a new one.
This is wrong because, in the first assembly, some processes can turn out to receive nothing but still send something. By the logic above, in subsequent assemblies these processes mistakenly think there is no existing pattern since their receive info is empty, while the other processes think there is one. This mismatch can lead to deadlocks in MPI send/recv.
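The mismatch can be illustrated with a small stand-alone model. This is only a sketch, not the PETSc code: RankPattern, nsends, nrecvs and both helper functions are hypothetical names, and "nrecvs > 0" stands in for the old check on x->recvhdr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-rank summary of the first assembly (not PETSc's real struct). */
typedef struct {
  int nsends; /* messages this rank sent in the first assembly */
  int nrecvs; /* messages this rank received in the first assembly */
} RankPattern;

/* Model of the old test: a rank believes a pattern exists only if it
   holds receive info, regardless of what it sent. */
static bool old_thinks_pattern_exists(const RankPattern *r)
{
  assert(r != NULL);
  return r->nrecvs > 0;
}

/* Returns true only if every rank reaches the same conclusion under the
   old test. A false result is the situation that can deadlock, because
   some ranks reuse the pattern while others try to build a new one. */
static bool old_test_is_consistent(const RankPattern *ranks, size_t n)
{
  assert(ranks != NULL && n > 0);
  bool first = old_thinks_pattern_exists(&ranks[0]);
  for (size_t i = 1; i < n; i++) {
    if (old_thinks_pattern_exists(&ranks[i]) != first) return false;
  }
  return true;
}
```

With two ranks where rank 0 only sends and rank 1 only receives, rank 0 concludes there is no pattern while rank 1 concludes there is one, so old_test_is_consistent returns false.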
The solution is to have all processes collectively come to the same conclusion. We introduced another flag x->first_assembly_done. When x->assembly_subset is true, x->first_assembly_done is set to true on all processes after an assembly.
In summary, x->assembly_subset=true means users want to reuse a communication pattern. x->first_assembly_done=true means a communication pattern is built for reuse. So we can test x->first_assembly_done to know if it is safe to reuse a pattern.
This commit also handles the scenario where users set VEC_SUBSET_OFF_PROC_ENTRIES to true, then false, then true, and so on.
When VEC_SUBSET_OFF_PROC_ENTRIES turns from true to false, we need to free memory allocated to the communication pattern, and also clear x->first_assembly_done.
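The flag lifecycle described above can be sketched as a single-process model. It is an illustration under assumptions, not the actual implementation: VecModel, model_set_subset and model_assemble are hypothetical names, and the int pointer merely stands in for whatever memory the real communication pattern occupies. Only the two flag names come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the per-vector state (not PETSc's real struct). */
typedef struct {
  bool assembly_subset;     /* user wants to reuse a communication pattern */
  bool first_assembly_done; /* a pattern has been built and is safe to reuse */
  int *pattern;             /* stand-in for the stored communication pattern */
} VecModel;

/* Model of setting VEC_SUBSET_OFF_PROC_ENTRIES. On a true -> false
   transition, free the stored pattern and clear first_assembly_done. */
static void model_set_subset(VecModel *x, bool flag)
{
  assert(x != NULL);
  if (x->assembly_subset && !flag) {
    free(x->pattern);
    x->pattern = NULL;
    x->first_assembly_done = false;
  }
  x->assembly_subset = flag;
}

/* Model of one assembly. Returns true if an existing pattern was reused.
   The decision depends only on first_assembly_done, which is set the same
   way on every process, never on local receive info. */
static bool model_assemble(VecModel *x)
{
  assert(x != NULL);
  bool reuse = x->first_assembly_done;
  if (!reuse) {
    /* Invariant in this model: pattern is NULL whenever first_assembly_done is false. */
    x->pattern = malloc(sizeof *x->pattern);
    assert(x->pattern != NULL);
    *x->pattern = 0;
  }
  if (x->assembly_subset) {
    x->first_assembly_done = true; /* keep the pattern for later assemblies */
  } else {
    free(x->pattern);              /* nothing to reuse, so drop it */
    x->pattern = NULL;
  }
  return reuse;
}
```

In this model, toggling the option true, false, true rebuilds the pattern once after each switch back to true and reuses it afterwards, without leaking the pattern from the earlier true phase.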
See also the short test code in this PR.
|