Usage ===== Installation ------------ To use nvcc4jupyter, first install it using pip: .. code-block:: console (venv) $ pip install nvcc4jupyter Load the Extension ------------------ Now we need to load the IPython extension to be able to use its cell and line magic commands: .. code-block:: %load_ext nvcc4jupyter Hello World ----------- We will use the :ref:`cuda ` cell magic command to run a simple hello world program. .. code-block:: c++ %%cuda #include __global__ void hello(){ printf("Hello from block: %u, thread: %u\n", blockIdx.x, threadIdx.x); } int main(){ hello<<<2, 2>>>(); cudaDeviceSynchronize(); } Groups ------ Now we will demonstrate a more complex scenario that uses source file groups. If you want to split your code into multiple source files, either for code reuse or just to have an easier to read project, you want to use groups. A group of source files will be compiled together. Because of this, you can include headers from the same group and use the code defined in other ".cu" files. There is also a special group named "shared" whose files will be compiled together with all other groups, which is a great feature for error handling code as we'll show now: .. code-block:: c++ %%cuda_group_save --group shared --name "error_handling.h" // error checking macro #define cudaCheckErrors(msg) \ do { \ cudaError_t __err = cudaGetLastError(); \ if (__err != cudaSuccess) { \ fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \ msg, cudaGetErrorString(__err), \ __FILE__, __LINE__); \ fprintf(stderr, "*** FAILED - ABORTING\n"); \ exit(1); \ } \ } while (0) Now we can use that error handling macro in this vector addition program but also in other programs that we define in other Jupyter cells: .. code-block:: c++ %%cuda #include #include "error_handling.h" const int DSIZE = 4096; const int block_size = 256; // vector add kernel: C = A + B __global__ void vadd(const float *A, const float *B, float *C, int ds){ int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < ds) { C[idx] = A[idx] + B[idx]; } } int main(){ float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C; // allocate space for vectors in host memory h_A = new float[DSIZE]; h_B = new float[DSIZE]; h_C = new float[DSIZE]; // initialize vectors in host memory to random values (except for the // result vector whose values do not matter as they will be overwritten) for (int i = 0; i < DSIZE; i++) { h_A[i] = rand()/(float)RAND_MAX; h_B[i] = rand()/(float)RAND_MAX; } // allocate space for vectors in device memory cudaMalloc(&d_A, DSIZE*sizeof(float)); cudaMalloc(&d_B, DSIZE*sizeof(float)); cudaMalloc(&d_C, DSIZE*sizeof(float)); cudaCheckErrors("cudaMalloc failure"); // error checking // copy vectors A and B from host to device: cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice); cudaCheckErrors("cudaMemcpy H2D failure"); // launch the vector adding kernel vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE); cudaCheckErrors("kernel launch failure"); // wait for the kernel to finish execution cudaDeviceSynchronize(); cudaCheckErrors("kernel execution failure"); cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost); cudaCheckErrors("cudaMemcpy D2H failure"); printf("A[0] = %f\n", h_A[0]); printf("B[0] = %f\n", h_B[0]); printf("C[0] = %f\n", h_C[0]); return 0; } Above we use the :ref:`cuda ` magic command which saves the code in the cell to an anonymous source file group, compiles, and executes that code. This only allows us to have one source file (besides the ones in the "shared" group). In order to have multiple source files we need to use the :ref:`cuda_group_save ` and :ref:`cuda_group_run ` magics. First, we save the vector addition function to its own file: .. code-block:: c++ %%cuda_group_save --name "vector_add.cu" --group "vector_add" // vector add kernel: C = A + B __global__ void vadd(const float *A, const float *B, float *C, int ds){ int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < ds) { C[idx] = A[idx] + B[idx]; } } Now we create a header file so the main cuda file knows the signature of "vadd": .. code-block:: c++ %%cuda_group_save --name "vector_add.h" --group "vector_add" __global__ void vadd(const float *A, const float *B, float *C, int ds); To tie it all together, we save the main cuda file, which includes our vector addition code: .. code-block:: c++ %%cuda_group_save --name "main.cu" --group "vector_add" #include #include "error_handling.h" #include "vector_add.h" const int DSIZE = 4096; const int block_size = 256; int main(){ float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C; // allocate space for vectors in host memory h_A = new float[DSIZE]; h_B = new float[DSIZE]; h_C = new float[DSIZE]; // initialize vectors in host memory to random values (except for the // result vector whose values do not matter as they will be overwritten) for (int i = 0; i < DSIZE; i++) { h_A[i] = rand()/(float)RAND_MAX; h_B[i] = rand()/(float)RAND_MAX; } // allocate space for vectors in device memory cudaMalloc(&d_A, DSIZE*sizeof(float)); cudaMalloc(&d_B, DSIZE*sizeof(float)); cudaMalloc(&d_C, DSIZE*sizeof(float)); cudaCheckErrors("cudaMalloc failure"); // error checking // copy vectors A and B from host to device: cudaMemcpy(d_A, h_A, DSIZE*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, DSIZE*sizeof(float), cudaMemcpyHostToDevice); cudaCheckErrors("cudaMemcpy H2D failure"); // launch the vector adding kernel vadd<<<(DSIZE+block_size-1)/block_size, block_size>>>(d_A, d_B, d_C, DSIZE); cudaCheckErrors("kernel launch failure"); // wait for the kernel to finish execution cudaDeviceSynchronize(); cudaCheckErrors("kernel execution failure"); cudaMemcpy(h_C, d_C, DSIZE*sizeof(float), cudaMemcpyDeviceToHost); cudaCheckErrors("cudaMemcpy D2H failure"); printf("A[0] = %f\n", h_A[0]); printf("B[0] = %f\n", h_B[0]); printf("C[0] = %f\n", h_C[0]); return 0; } Now we can compile all the source files in the group and execute the main function with the following command: .. code-block:: c++ %cuda_group_run --group "vector_add" Profiling --------- Another important feature of nvcc4jupyter is its integration with the NVIDIA Nsight Compute / NVIDIA Nsight Systems profilers, which you need to make sure are installed and the executables can be found in a directory in your PATH environment variable. To profile using Nsight Compute with custom arguments: .. code-block:: c++ %cuda_group_run --group "vector_add" --profile --profiler-args "--section SpeedOfLight" Running the cell above will compile and execute the vector addition code in the "vector_add" group and profile it, keeping only the metrics from the "SpeedOfLight" section. The output will contain something similar to: .. code-block:: Section: GPU Speed Of Light Throughput ----------------------- ------------- ------------ Metric Name Metric Unit Metric Value ----------------------- ------------- ------------ DRAM Frequency cycle/nsecond 4.65 SM Frequency cycle/usecond 544.31 Elapsed Cycles cycle 2,145 Memory Throughput % 3.19 DRAM Throughput % 3.19 Duration usecond 3.94 L1/TEX Cache Throughput % 6.67 L2 Cache Throughput % 1.98 SM Active Cycles cycle 383.65 Compute (SM) Throughput % 1.19 ----------------------- ------------- ------------ To profile using Nsight Systems with custom arguments: .. code-block:: c++ %cuda_group_run --group "vector_add" --profiler nsys --profile --profiler-args "profile --stats=true" Running the cell above will compile and execute the vector addition code in the "vector_add" group and profile it with Nsight Systems. The output will contain multiple tables, one of which will look similar to this: .. code-block:: [5/8] Executing 'cuda_api_sum' stats report Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- ------------- ------------- ----------- ----------- ----------- ---------------------- 77.3 200,844,276 1 200,844,276.0 200,844,276.0 200,844,276 200,844,276 0.0 cudaMalloc 22.6 58,594,762 2 29,297,381.0 29,297,381.0 29,153,999 29,440,763 202,772.8 cudaMemcpy 0.1 305,450 1 305,450.0 305,450.0 305,450 305,450 0.0 cudaLaunchKernel 0.0 1,970 1 1,970.0 1,970.0 1,970 1,970 0.0 cuModuleGetLoadingMode Compiler arguments ------------------ In the same way profiler arguments can be passed to the profiling tool, compiling arguments can be passed to **nvcc**: .. code-block:: c++ %cuda_group_run --group "vector_add" --compiler-args "--optimize 3" Running the cell above will compile and execute the vector addition code in the "vector_add" group. During compilation, **nvcc** receives the "\-\-optimize" option which specifies the optimization level for host code. Set default arguments --------------------- In the case where you execute multiple magic commands with the same compiler or profiler arguments you can avoid writing them every time by setting the default arguments: .. code-block:: python from nvcc4jupyter import set_defaults set_defaults(compiler_args="--optimize 3", profiler_args="--section SpeedOfLight") The same effect can be achieved by running "set_defaults" once for each config due to the fact that the default value is not changed if an a value is not given to the "set_defaults" function. .. code-block:: python from nvcc4jupyter import set_defaults set_defaults(compiler_args="--optimize 3") set_defaults(profiler_args="--section SpeedOfLight") Now we can run the following cell without specifying the compiler and profiler arguments once again. .. code-block:: c++ %cuda_group_run --group "vector_add" --profile