Files
nvcc/docs/source/usage.rst
T
Cosmin Ștefan Ciocan 0bddf6a6e6 Add option to choose between NSYS and NCU profilers (#28)
* Add option to give nvcc extra arguments

* Add test for nvcc options that changes c++ dialect from c++17 to c++14

* Add make and the english language pack to devcontainer to be able to build the documentation

* Update documentation config to automatically import the current version of the package

* Document new --compiler-args argument

* Improve tests coverage by testing for bad arguments and the error output during a failed compilation

* Add IPython to docs requirements to allow the __version__ import for readthedocs env

* Change devcontainer base image to have the latest CUDA toolkit

* Mock the nsight compute tool with a bash script

* Add test to compile with opencv

* Add new page to documentation that contains a new notebook that explains compiling with external libraries

* Add autodocstring vscode extension to devcontainer

* Add function that modifies the default profiler/compiler arguments to allow reusing them in multiple magic command calls

* Update pylint exceptions

* Update contributing instructions

* Change version from 1.0.3 to 1.1.0 due to adding features in a backward-compatible manner

* Install latest CUDA toolkit on the test runner to pass the OpenCV compilation test

* Install opencv in test runner and update code coverage install

* Add CUDA bin to PATH in test and coverage runners

* Add cuda bin to path variable in .bashrc

* Update way to set environment variable PATH in github action

* Change devcontainer base image back to ubuntu:22.04 to match the environment from the test runner

* Add option to choose between NSYS and NCU profilers

* Add tests for choosing the profiler

* Add isort config to help it find local modules so they are not considered 3rd party libraries

* Replace experimental-string-processing black formatter config with enable-unstable-feature as it was removed in version 24.1.0

* Search for profiling tools executable paths when they are required

* Install dev dependencies in editable mode

* Add documentation for using Nsight Systems instead of the default Nsight Compute profiling tool

* Fix cuda typo

* Mention Nsight Systems in README.md
2024-03-20 11:42:27 +01:00

324 lines
11 KiB
ReStructuredText

Usage
=====
Installation
------------
To use nvcc4jupyter, first install it using pip:
.. code-block:: console
(venv) $ pip install nvcc4jupyter
Load the Extension
------------------
Now we need to load the IPython extension to be able to use its cell and line
magic commands:
.. code-block::
%load_ext nvcc4jupyter
Hello World
-----------
We will use the :ref:`cuda <cuda_magic>` cell magic command to run a simple
hello world program.
.. code-block:: c++
%%cuda
#include <stdio.h>
__global__ void hello(){
printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}
int main(){
hello<<<2, 2>>>();
cudaDeviceSynchronize();
}
Groups
------
Now we will demonstrate a more complex scenario that uses source file groups.
If you want to split your code into multiple source files, either for code reuse
or just to have an easier to read project, you want to use groups. A group of
source files will be compiled together. Because of this, you can include headers
from the same group and use the code defined in other ".cu" files. There is also
a special group named "shared" whose files will be compiled together with all
other groups, which is a great feature for error handling code as we'll show now:
.. code-block:: c++
%%cuda_group_save --group shared --name "error_handling.h"
// error checking macro
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
Now we can use that error handling macro in this vector addition program but
also in other programs that we define in other Jupyter cells:
.. code-block:: c++
%%cuda
#include <stdio.h>
#include "error_handling.h"
const int DSIZE = 4096;
const int block_size = 256;
// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < ds) {
C[idx] = A[idx] + B[idx];
}
}
int main(){
float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
// allocate space for vectors in host memory
h_A = new float[DSIZE];
h_B = new float[DSIZE];
h_C = new float[DSIZE];
// initialize vectors in host memory to random values (except for the
// result vector whose values do not matter as they will be overwritten)
for (int i = 0; i < DSIZE; i++) {
h_A[i] = rand()/(float)RAND_MAX;
h_B[i] = rand()/(float)RAND_MAX;
}
// allocate space for vectors in device memory
cudaMalloc(&d_A, DSIZE*sizeof(float));
cudaMalloc(&d_B, DSIZE*sizeof(float));
cudaMalloc(&d_C, DSIZE*sizeof(float));
cudaCheckErrors("cudaMalloc failure"); // error checking
// copy vectors A and B from host to device:
cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy H2D failure");
// launch the vector adding kernel
vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
cudaCheckErrors("kernel launch failure");
// wait for the kernel to finish execution
cudaDeviceSynchronize();
cudaCheckErrors("kernel execution failure");
cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H failure");
printf("A[0] = %f\n", h_A[0]);
printf("B[0] = %f\n", h_B[0]);
printf("C[0] = %f\n", h_C[0]);
return 0;
}
Above we use the :ref:`cuda <cuda_magic>` magic command which saves the code
in the cell to an anonymous source file group, compiles, and executes that
code. This only allows us to have one source file (besides the ones in the
"shared" group). In order to have multiple source files we need to use the
:ref:`cuda_group_save <cuda_group_save_magic>` and
:ref:`cuda_group_run <cuda_group_run_magic>` magics.
First, we save the vector addition function to its own file:
.. code-block:: c++
%%cuda_group_save --name "vector_add.cu" --group "vector_add"
// vector add kernel: C = A + B
__global__ void vadd(const float *A, const float *B, float *C, int ds){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < ds) {
C[idx] = A[idx] + B[idx];
}
}
Now we create a header file so the main cuda file knows the signature of "vadd":
.. code-block:: c++
%%cuda_group_save --name "vector_add.h" --group "vector_add"
__global__ void vadd(const float *A, const float *B, float *C, int ds);
To tie it all together, we save the main cuda file, which includes our vector
addition code:
.. code-block:: c++
%%cuda_group_save --name "main.cu" --group "vector_add"
#include <stdio.h>
#include "error_handling.h"
#include "vector_add.h"
const int DSIZE = 4096;
const int block_size = 256;
int main(){
float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
// allocate space for vectors in host memory
h_A = new float[DSIZE];
h_B = new float[DSIZE];
h_C = new float[DSIZE];
// initialize vectors in host memory to random values (except for the
// result vector whose values do not matter as they will be overwritten)
for (int i = 0; i < DSIZE; i++) {
h_A[i] = rand()/(float)RAND_MAX;
h_B[i] = rand()/(float)RAND_MAX;
}
// allocate space for vectors in device memory
cudaMalloc(&d_A, DSIZE*sizeof(float));
cudaMalloc(&d_B, DSIZE*sizeof(float));
cudaMalloc(&d_C, DSIZE*sizeof(float));
cudaCheckErrors("cudaMalloc failure"); // error checking
// copy vectors A and B from host to device:
cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
cudaCheckErrors("cudaMemcpy H2D failure");
// launch the vector adding kernel
vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE);
cudaCheckErrors("kernel launch failure");
// wait for the kernel to finish execution
cudaDeviceSynchronize();
cudaCheckErrors("kernel execution failure");
cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H failure");
printf("A[0] = %f\n", h_A[0]);
printf("B[0] = %f\n", h_B[0]);
printf("C[0] = %f\n", h_C[0]);
return 0;
}
Now we can compile all the source files in the group and execute the main
function with the following command:
.. code-block:: c++
%cuda_group_run --group "vector_add"
Profiling
---------
Another important feature of nvcc4jupyter is its integration with the NVIDIA
Nsight Compute / NVIDIA Nsight Systems profilers, which you need to make sure
are installed and the executables can be found in a directory in your PATH
environment variable.
To profile using Nsight Compute with custom arguments:
.. code-block:: c++
%cuda_group_run --group "vector_add" --profile --profiler-args "--section SpeedOfLight"
Running the cell above will compile and execute the vector addition code in the
"vector_add" group and profile it, keeping only the metrics from the
"SpeedOfLight" section. The output will contain something similar to:
.. code-block::
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 4.65
SM Frequency cycle/usecond 544.31
Elapsed Cycles cycle 2,145
Memory Throughput % 3.19
DRAM Throughput % 3.19
Duration usecond 3.94
L1/TEX Cache Throughput % 6.67
L2 Cache Throughput % 1.98
SM Active Cycles cycle 383.65
Compute (SM) Throughput % 1.19
----------------------- ------------- ------------
To profile using Nsight Systems with custom arguments:
.. code-block:: c++
%cuda_group_run --group "vector_add" --profiler nsys --profile --profiler-args "profile --stats=true"
Running the cell above will compile and execute the vector addition code in the
"vector_add" group and profile it with Nsight Systems. The output will contain
multiple tables, one of which will look similar to this:
.. code-block::
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ----------- ----------------------
77.3 200,844,276 1 200,844,276.0 200,844,276.0 200,844,276 200,844,276 0.0 cudaMalloc
22.6 58,594,762 2 29,297,381.0 29,297,381.0 29,153,999 29,440,763 202,772.8 cudaMemcpy
0.1 305,450 1 305,450.0 305,450.0 305,450 305,450 0.0 cudaLaunchKernel
0.0 1,970 1 1,970.0 1,970.0 1,970 1,970 0.0 cuModuleGetLoadingMode
Compiler arguments
------------------
In the same way profiler arguments can be passed to the profiling tool,
compiling arguments can be passed to **nvcc**:
.. code-block:: c++
%cuda_group_run --group "vector_add" --compiler-args "--optimize 3"
Running the cell above will compile and execute the vector addition code in the
"vector_add" group. During compilation, **nvcc** receives the "\-\-optimize"
option which specifies the optimization level for host code.
Set default arguments
---------------------
In the case where you execute multiple magic commands with the same compiler or
profiler arguments you can avoid writing them every time by setting the default
arguments:
.. code-block:: python
from nvcc4jupyter import set_defaults
set_defaults(compiler_args="--optimize 3", profiler_args="--section SpeedOfLight")
The same effect can be achieved by running "set_defaults" once for each config
due to the fact that the default value is not changed if an a value is not
given to the "set_defaults" function.
.. code-block:: python
from nvcc4jupyter import set_defaults
set_defaults(compiler_args="--optimize 3")
set_defaults(profiler_args="--section SpeedOfLight")
Now we can run the following cell without specifying the compiler and profiler
arguments once again.
.. code-block:: c++
%cuda_group_run --group "vector_add" --profile