24 - `-log_view [:filename]` - Prints an ASCII version of performance data at the
26 and require little overhead; thus, `-log_view` is intended as the
28 - `-info [infofile]` - Prints verbose information about code to
33 - `-log_trace [logfile]` - Traces the beginning and ending of all
35 `-info`, is useful to see where a program is hanging without
43 ### Interpreting `-log_view` Output: The Basics
46 option `-log_view [:filename]` activates printing of profile data to standard
51 libraries, followed by any user-defined events (discussed in
74 lower-level operations in these packages. Note also that the nonlinear
90 - `KSPSetUp` - Set up linear solver
92 - `PCSetUp` - Set up preconditioner
94 - `MatILUFactor` - Factor matrix
96 - `MatILUFactorSymbolic` - Symbolic factorization phase
97 - `MatLUFactorNumeric` - Numeric factorization phase
99 - `KSPSolve` - Solve linear system
101 - `PCApply` - Apply preconditioner
103 - `MatSolve` - Forward/backward triangular solves
105 - `KSPGMRESOrthog` - Orthogonalization in GMRES
107 - `VecDot` or `VecMDot` - Inner products
108 - `VecAXPY` or `VecMAXPY` - Vector updates
110 - `MatMult` - Matrix-vector product
112 - `MatMultAdd` - Matrix-vector product + vector addition
114 - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...
116 The summaries printed via `-log_view` reflect this routine hierarchy.
117 For example, the performance summaries for a particular high-level
119 in the lower-level components that make up the routine.
121 The output produced with `-log_view` is flat, meaning that the hierarchy
126 so that interpreting the `-log_view` data should be relatively
134 ### Interpreting `-log_view` Output: Parallel Performance
139 output generated by the `-log_view` option. The program that generated
142 The code loads a matrix and right-hand-side vector from a binary file
145 four processors of an Intel x86_64 Linux cluster, using restarted GMRES
151 performance summary, including times, floating-point operations,
152 computational rates, and message-passing activity (such as the number
154 various user-defined stages of monitoring (as discussed in
163 mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
164 -matload_block_size 3 -pc_type bjacobi -options_left
167 Residual norm 1.088292e-05
169 Residual norm 3.871022e-02
170 ---------------------------------------------- PETSc Performance Summary: -------------------------…
172 ./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
173 Using PETSc Development Git Revision: v3.9-163-gbe3efd42 Git Date: 2018-04-16 10:45:40 -0500
176 Time (sec): 1.849e-01 1.00002 1.849e-01
184 Summary of Stages: ----- Time ------ ----- Flop ----- --- Messages --- -- Message Lengths -- …
186 …0: Main Stage: 5.9897e-04 0.3% 0.0000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0% …
187 …1: Load System 0: 2.9113e-03 1.6% 0.0000e+00 0.0% 3.550e+01 4.3% 5.984e+02 0.1% …
188 …2: KSPSetUp 0: 7.7349e-04 0.4% 9.9360e+03 0.0% 0.000e+00 0.0% 0.000e+00 0.0% …
189 …3: KSPSolve 0: 1.7690e-03 1.0% 2.9673e+05 0.0% 1.520e+02 18.4% 1.800e+02 0.1% …
190 …4: Load System 1: 1.0056e-01 54.4% 0.0000e+00 0.0% 3.700e+01 4.5% 5.657e+05 62.4% …
191 …5: KSPSetUp 1: 5.6883e-03 3.1% 2.1205e+07 2.3% 0.000e+00 0.0% 0.000e+00 0.0% …
192 …6: KSPSolve 1: 7.2578e-02 39.3% 9.1979e+08 97.7% 6.000e+02 72.8% 2.098e+04 37.5% …
194 ---------------------------------------------------------------------------------------------------…
198 ---------------------------------------------------------------------------------------------------…
202 --- Event Stage 3: KSPSolve 0
224 {\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})
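As a rough illustration (assuming the stage's Flop entry above is the sum over all processors and its time is the per-processor maximum), stage `6: KSPSolve 1` gives approximately 1.0e-06 * 9.1979e+08 / 7.2578e-02, or about 1.3e+04 Mflop/sec.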
235 mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
236 -matload_block_size 3 -pc_type bjacobi -options_left
238 ---------------------------------------------- PETSc Performance Summary: -------------------------…
243 Time and Flop/sec: Max - maximum over all processors
244 Ratio - ratio of maximum to minimum over all processors
249 …Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and Pets…
250 %T - percent time in this phase %F - percent flop in this phase
251 %M - percent messages in this phase %L - percent message lengths in this phase
252 %R - percent reductions in this phase
254 ---------------------------------------------------------------------------------------------------…
255 …Count Time (sec) Flop/sec --- Global --- --- Stage ---- Total
257 ---------------------------------------------------------------------------------------------------…
260 --- Event Stage 5: KSPSetUp 1
262 MatLUFactorNum 1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 62…
263 MatILUFactorSym 1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 26…
264 MatGetRowIJ 1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0…
265 MatGetOrdering 1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1…
266 KSPSetUp 2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 1 9…
267 PCSetUp 2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00 3 2 0 0 0 90…
268 PCSetUpOnBlocks 1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00 3 2 0 0 0 90…
270 --- Event Stage 6: KSPSolve 1
272 MatMult 60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37 0 32…
273 MatSolve 61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25 0 0 0 26…
274 VecMDot 59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01 7 21 0 0 27 18…
275 VecNorm 61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01 1 1 0 0 28 4…
276 VecScale 61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1…
277 VecCopy 2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0…
278 VecSet 64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1…
279 VecAXPY 3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0…
280 VecMAXPY 61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00 5 22 0 0 0 12…
281 VecScatterBegin 60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00 0 0 73 37 0 1…
282 VecScatterEnd 60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 8…
283 VecNormalize 61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01 2 2 0 0 28 5…
284 KSPGMRESOrthog 59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41 0 0 27 29…
285 KSPSolve 1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56 99…
286 PCSetUpOnBlocks 1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0…
287 PCApply 61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25 0 0 0 28…
288 ---------------------------------------------------------------------------------------------------…
293 higher-level PETSc routines include the statistics for the lower levels
295 matrix-vector products `MatMult()` consists of vector scatter
302 total computation and to any user-defined stages (discussed in
309 The additional option `-log_view_memory` causes the display of additional columns of information ab…
315 ### Using `-log_mpe` with Jumpshot
320 {cite}`mpich-web-page` implementation of MPI. The option
323 -log_mpe [logfile]
334 merely required adding `-llmpich` to the library list *before*
335 `-lmpich`.
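As an illustration (a sketch only: the example program and logfile name are arbitrary, and PETSc must have been configured with MPE support), a run that writes an MPE logfile for later viewing in Jumpshot could look like:

```console
mpiexec -n 2 ./ex2 -log_mpe ex2.mpe_log
```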
346 One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
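For instance (the program and filename below are illustrative), the XML summary can be written to a file with:

```console
mpiexec -n 4 ./ex10 -f0 medium -log_view :performance.xml:ascii_xml
```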
350 The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
359 mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
362 Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
369 PETSc automatically logs object creation, times, and floating-point
372 steps involved in logging a user-defined portion of code, called an
396 Here `string` is a user-defined event name, and `color` is an
397 optional user-defined event color (for use with *Jumpshot* logging; see
410 with the event. For instance, in a matrix-vector product they would be
412 specifying 0 for `o1` - `o4`. The code between these two routine
419 the command line option `-log_sync` is used; however, we do check for collective
422 The user can log the number of floating-point operations for this
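Putting these pieces together, a minimal sketch of a user-defined event (the class name, event name, and flop count below are illustrative) is:

```c
#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscLogEvent  USER_EVENT;
  PetscClassId   classid;
  PetscLogDouble user_flops = 2.0e6; /* illustrative flop count for the monitored segment */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscClassIdRegister("User class", &classid));
  PetscCall(PetscLogEventRegister("User event", classid, &USER_EVENT));

  PetscCall(PetscLogEventBegin(USER_EVENT, 0, 0, 0, 0));
  /* ... code segment to monitor ... */
  PetscCall(PetscLogFlops(user_flops));
  PetscCall(PetscLogEventEnd(USER_EVENT, 0, 0, 0, 0));

  PetscCall(PetscFinalize());
  return 0;
}
```

When the program is run with `-log_view`, the event appears as its own line in the performance summary.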
437 By default, the profiling produces a single set of statistics for all
455 whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
477 `-log_view` for a program that employs several profiling stages. In
478 particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector…
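A minimal sketch of registering and using profiling stages (the stage names below are illustrative) is:

```c
#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscLogStage stages[2];

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscLogStageRegister("Load system", &stages[0]));
  PetscCall(PetscLogStageRegister("Solve", &stages[1]));

  PetscCall(PetscLogStagePush(stages[0]));
  /* ... load the matrix and right-hand-side vector ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(stages[1]));
  /* ... solve the linear system ... */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscFinalize());
  return 0;
}
```

Each registered stage then receives its own section in the `-log_view` summary, as in the example output above.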
531 ## Interpreting `-info` Output: Informative Messages
534 data structures, etc. to the screen by using the option `-info` or by
538 {any}`sec_matsparse`, `-info` activates the printing of
556 about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
562 -info [filename][:[~]<list,of,classnames>[:[~]self]]
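For example (the filename and class list below are illustrative), the following writes the informative messages for only the `Mat` and `Vec` classes to a file:

```console
mpiexec -n 1 ./ex1 -info myrun.info:mat,vec
```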
601 profile user-defined segments of code.
607 using the command line option `-history [filename]`. If no file name
624 files generated by `-log_mpe`, which is described in
636 When this procedure is used in conjunction with the user-defined stages
662 preloading. The command line options `-preload true` and
663 `-preload false` may be used to turn on and off preloading at run
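In code, preloading is typically expressed with the `PetscPreLoadBegin()`, `PetscPreLoadStage()`, and `PetscPreLoadEnd()` macros; the following is a minimal sketch (the stage names are illustrative):

```c
#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* The code between PetscPreLoadBegin() and PetscPreLoadEnd() is executed twice
     when preloading is enabled; the second, preloaded pass provides the meaningful
     timings. The -preload option controls this behavior at run time. */
  PetscPreLoadBegin(PETSC_TRUE, "Load");
    /* ... load the matrix and right-hand-side vector ... */
  PetscPreLoadStage("Solve");
    /* ... solve the linear system ... */
  PetscPreLoadEnd();

  PetscCall(PetscFinalize());
  return 0;
}
```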
668 Nsight Systems will generate profiling data with a CUDA executable
673 nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
678 The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
679 (<https://developer.nvidia.com/nsight-systems>). The Nsight Systems GUI
681 memory mallocs and frees, CPU-GPU communication, and high-level data like time, sizes
684 To view the data, start `nsys-ui` without any arguments and then `Import` the
686 A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be vie…
687 with `nsys-ui` in the future.
692 or Open MPI - we can run a parallel job on 4 MPI tasks as:
695 mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -…
704 To check the version of Nsight on the compute node, run `nsys-ui` and
709 …be displayed as ranges in trace files generated by `rocprof` by running with the `-log_roctx` flag.
710 See the `rocprof` [documentation](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest) for…
714 mpiexec -n 1 rocprofv3 --marker-trace -o file_name -- ./path/to/application -log_roctx
721 use of the perfstubs package. By default, PETSc is configured with `--with-tau-perfstubs`.
726 ./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
736 mpiexec -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
746 ---------------------------------------------------------------------------------------
749 ---------------------------------------------------------------------------------------
773 ```{eval-rst}