Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
Merge pull request #35 from andreinechaev/feature/readme-import-notebook-kaggle
2026-06-15 11:40:48 +05:30 · 2024-09-13 14:04:10 +02:00 · 2024-06-23 18:09:04 +02:00 · 2024-06-08 22:01:43 +00:00 · 2024-06-03 18:57:11 +02:00 · 2024-06-02 09:13:56 +00:00
31 changed files with 1727 additions and 94 deletions
@@ -1,15 +1,29 @@
-FROM ubuntu
+FROM ubuntu:22.04

 ARG VENV_PATH=/opt/dev-venv
 ENV VENV_ACTIVATE=${VENV_PATH}/bin/activate
+ENV DEBIAN_FRONTEND="noninteractive"

+# install the latest CUDA toolkit (https://developer.nvidia.com/cuda-downloads)
 RUN apt update
-RUN apt install -y python3.10-venv nvidia-cuda-toolkit gcc vim git
+RUN apt install -y wget
+RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+RUN dpkg -i cuda-keyring_1.1-1_all.deb
+RUN apt update
+RUN apt -y install cuda-toolkit-12-3
+RUN echo "PATH=\"\$PATH:/usr/local/cuda/bin\"" >> ~/.bashrc

-# the mkdir command bypasses a profiler error, which allows us to run it with
-# host code only to at least check that the profiler parameters are correctly
-# provided; without this line, some tests will fail
-RUN mkdir -p /usr/lib/x86_64-linux-gnu/nsight-compute/sections
+# install OpenCV to test compilation with external libraries
+RUN apt install -y libopencv-dev pkg-config
+
+# make & language-pack-en are for documentation
+RUN apt install -y \
+    gcc \
+    git \
+    language-pack-en \
+    make \
+    python3.10-venv \
+    vim

 # we create the virtualenv here so that the devcontainer.json setting
 # python.defaultInterpreterPath can be used to find it; if we do it in the
@@ -16,10 +16,12 @@
                "ms-python.isort",
                "ms-python.flake8",
                "ms-python.black-formatter",
-                "ryanluker.vscode-coverage-gutters"
+                "ryanluker.vscode-coverage-gutters",
+                "njpwerner.autodocstring"
            ],
            "settings": {
-                "python.defaultInterpreterPath": "/opt/dev-venv/bin/python"
+                "python.defaultInterpreterPath": "/opt/dev-venv/bin/python",
+                "autoDocstring.docstringFormat": "google-notypes"
            }
        }
    }
@@ -1,7 +1,7 @@
 #!/bin/bash

 # install developer dependencies
-pip install .[dev]
+pip install -e .[dev]

 # make sure the developer uses pre-commit hooks
 pre-commit install
@@ -38,7 +38,7 @@ jobs:

    steps:
    - name: Download all the dists
-      uses: actions/download-artifact@v3
+      uses: actions/download-artifact@v4.1.7
      with:
        name: python-package-distributions
        path: dist/
@@ -27,14 +27,19 @@ jobs:
        with:
          python-version: ${{ matrix.python-version }}

-      # the mkdir command bypasses a profiler error, which allows us to run it
-      # with host code only to at least check that the profiler parameters are
-      # correctly provided
-      - name: Install CUDA tools
+      - name: Install CUDA toolkit
        run: |
          sudo apt update
-          sudo apt install nvidia-cuda-toolkit
-          sudo mkdir -p /usr/lib/x86_64-linux-gnu/nsight-compute/sections
+          sudo apt install -y wget
+          wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+          sudo dpkg -i cuda-keyring_1.1-1_all.deb
+          sudo apt update
+          sudo apt -y install cuda-toolkit-12-3
+          echo "PATH=$PATH:/usr/local/cuda/bin" >> $GITHUB_ENV
+
+      - name: Install OpenCV
+        run: |
+          sudo apt install -y libopencv-dev pkg-config

      - name: Install Python dependencies
        run: |
@@ -65,11 +70,19 @@ jobs:
        with:
          python-version: "3.10"

-      - name: Install CUDA tools
+      - name: Install CUDA toolkit
        run: |
          sudo apt update
-          sudo apt install nvidia-cuda-toolkit
-          sudo mkdir -p /usr/lib/x86_64-linux-gnu/nsight-compute/sections
+          sudo apt install -y wget
+          wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+          sudo dpkg -i cuda-keyring_1.1-1_all.deb
+          sudo apt update
+          sudo apt -y install cuda-toolkit-12-3
+          echo "PATH=$PATH:/usr/local/cuda/bin" >> $GITHUB_ENV
+
+      - name: Install OpenCV
+        run: |
+          sudo apt install -y libopencv-dev pkg-config

      - name: Install Python dependencies
        run: |
@@ -28,9 +28,6 @@ pip-delete-this-directory.txt
 nosetests.xml
 coverage.xml

-# Virtual Environment
-*env*
-
 # Misc
 .pytest_cache/
 .DS_Store
@@ -51,3 +51,10 @@ repos:
    - id: bandit
      args: ["-c", "pyproject.toml"]
      additional_dependencies: ["bandit[toml]"]
+
+  # remove notebook cell output
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.7.1
+    hooks:
+      - id: nbstripout
+        files: ".ipynb"
@@ -45,7 +45,8 @@ to own a GPU yourself.
 Here are just a few of the things that nvcc4jupyter does well:

  - [Easily run CUDA C++ code](https://nvcc4jupyter.readthedocs.io/en/latest/usage.html#hello-world)
-  - [Profile your code with NVIDIA Nsight Compute](https://nvcc4jupyter.readthedocs.io/en/latest/usage.html#profiling)
+  - [Profile your code with NVIDIA Nsight Compute or Nsight Systems](https://nvcc4jupyter.readthedocs.io/en/latest/usage.html#profiling)
+  - [Compile your code with external libraries (e.g. OpenCV)](https://nvcc4jupyter.readthedocs.io/en/latest/notebooks.html#compiling-with-external-libraries)
  - [Share code between different programs in the same notebook / split your code into multiple files for improved readability](https://nvcc4jupyter.readthedocs.io/en/latest/usage.html#groups)

 ## Install
@@ -88,13 +89,14 @@ The official documentation is hosted on [readthedocs](https://nvcc4jupyter.readt

 ## Contributing

-Install the package with the development dependencies:
-```bash
-pip install .[dev]
-```
+The recommended setup for development is using the devcontainer in GitHub
+Codespaces or locally in VSCode.

-As a developer, make sure you install the pre-commit hook before commiting any changes:
+If not using the devcontainer you need to install the package with the
+development dependencies and install the pre-commit hook before commiting any
+changes:
 ```bash
+pip install -e .[dev]
 pre-commit install
 ```

@@ -1,2 +1,3 @@
 sphinx==7.1.2
 sphinx-rtd-theme==1.3.0rc1
+IPython>=8.19.0
@@ -6,11 +6,18 @@
 # -- Project information -----------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

+import os
+import sys
+
+sys.path.append(os.path.join("..", ".."))
+from nvcc4jupyter.__init__ import __version__  # noqa: E402
+
 project = "nvcc4jupyter"
 copyright = "2024, Andrei Nechaev & Cosmin Stefan Ciocan"
 author = "Andrei Nechaev & Cosmin Stefan Ciocan"
-release = "1.0.1"
-version = "1.0.1"
+release = __version__
+version = __version__
+

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -10,4 +10,5 @@ which provides CUDA capable GPUs with the CUDA toolkit already installed.
   :caption: Contents:

   usage
+   notebooks
   magics
@@ -21,23 +21,47 @@ Usage
   - ``%%cuda``: Compile and run this cell.
   - ``%%cuda -p``: Also runs the Nsight Compute profiler.
   - ``%%cuda -p -a "<SPACE SEPARATED PROFILER ARGS>"``: Also runs the Nsight Compute profiler.
+   - ``%%cuda -c "<SPACE SEPARATED COMPILER ARGS"``: Passes additional arguments to "nvcc".
   - ``%%cuda -t``: Outputs the "timeit" built-in magic results.

 Options
 -------

+.. _timeit:
+
 -t, --timeit
   Boolean. If set, returns the output of the "timeit" built-in
   ipython magic instead of stdout.

+.. _profile:
+
 -p, --profile
-   Boolean. If set, runs the NVIDIA Nsight Compute profiler whose
-   output is appended to standard output.
+   Boolean. If set, runs the NVIDIA Nsight Compute (or NVIDIA Nsight Systems
+   if changed via the \-\-profiler option) profiler whose output is appended to
+   standard output.
+
+.. _profiler:
+
+-l, --profiler
+   String. Can either be "ncu" (the default) to use NVIDIA Nsight Compute
+   profiling tool, or "nsys" to use NVIDIA Nsight Systems profiling tool.
+
+.. _profiler_args:

 -a, --profiler-args
   String. Optional profiler arguments that can be space separated
-   by wrapping them in double quotes. See all options here:
-   `Nsight Compute CLI <https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options>`_
+   by wrapping them in double quotes. Will be passed to the profiler selected
+   by the \-\-profiler option.. See profiler options here:
+   `Nsight Compute <https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options>`_
+   or `Nsight Systems <https://docs.nvidia.com/nsight-systems/UserGuide/index.html#command-line-options>`_.
+
+.. _compiler_args:
+
+-c, --compiler-args
+   String. Optional compiler arguments that can be space separated
+   by wrapping them in double quotes. They will be passed to "nvcc".
+   See all options here:
+   `NVCC Options <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#nvcc-command-options>`_

 .. note::
   If both "\-\-profile" and "\-\-timeit" are used then no profiling is
@@ -47,10 +71,11 @@ Examples
 --------
 ::

-   # compile, run, and profile the code in the cell with the Nsight
-   # compute profiler while collecting only metrics from the
-   # "MemoryWorkloadAnalysis" section.
-   %%cuda --profile --profiler-args "--section MemoryWorkloadAnalysis"
+   # compile, run, and profile the code in the cell with the Nsight compute
+   # profiler while collecting only metrics from the "MemoryWorkloadAnalysis"
+   # section; also provides the "--optimize 3" option to "nvcc" during
+   # compilation to optimize host code
+   %%cuda -p -a "--section MemoryWorkloadAnalysis" -c "--optimize 3"

 ------

@@ -0,0 +1,34 @@
+*********
+Notebooks
+*********
+
+This page provides a list of useful Jupyter notebooks written with the
+**nvcc4jupyter** library.
+
+.. note::
+   These notebooks are written for Google's Colab, but you may run them in
+   other environments by installing all expected dependencies. If running in
+   Colab, make sure to set the runtime type to a GPU instance (at the time of
+   writing this, T4 is the GPU offered for free by Colab).
+
+------
+
+.. _compiling_with_external_libraries:
+
+Compiling with external libraries
+=================================
+
+[`NOTEBOOK <https://colab.research.google.com/drive/1iuY46DCwv4hy3SqDhJgFeO8kgpHnzjTh?usp=sharing>`_]
+
+If you need to compile CUDA C++ code that uses external libraries in the host
+code (e.g. OpenCV for reading and writing images to disk) then this section is
+for you.
+
+To achieve this, use the :ref:`compiler-args <compiler_args>` option of the
+:ref:`cuda <cuda_magic>` magic command to pass the correct compiler options
+of the OpenCV library to **nvcc** for it to link the OpenCV code with the
+code in your Jupyter cell. Those compiler options can be provided by the
+`pkg-config <https://www.freedesktop.org/wiki/Software/pkg-config/>`_ tool.
+
+In the notebook we show how to use OpenCV to load an image, blur it with a CUDA
+kernel, and then save it back to disk using OpenCV again.
@@ -225,10 +225,11 @@ Profiling
 ---------

 Another important feature of nvcc4jupyter is its integration with the NVIDIA
-Nsight Compute profiler, which you need to make sure is installed and its
-executable can be found in a directory in your PATH environment variable.
+Nsight Compute / NVIDIA Nsight Systems profilers, which you need to make sure
+are installed and the executables can be found in a directory in your PATH
+environment variable.

-In order to use it and provide the profiler with custom arguments, simply run:
+To profile using Nsight Compute with custom arguments:

 .. code-block:: c++

@@ -255,3 +256,68 @@ Running the cell above will compile and execute the vector addition code in the
    SM Active Cycles                cycle       383.65
    Compute (SM) Throughput             %         1.19
    ----------------------- ------------- ------------
+
+To profile using Nsight Systems with custom arguments:
+
+.. code-block:: c++
+
+    %cuda_group_run --group "vector_add" --profiler nsys --profile --profiler-args "profile --stats=true"
+
+Running the cell above will compile and execute the vector addition code in the
+"vector_add" group and profile it with Nsight Systems. The output will contain
+multiple tables, one of which will look similar to this:
+
+.. code-block::
+
+    [5/8] Executing 'cuda_api_sum' stats report
+
+    Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name
+    --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
+        77.3      200,844,276          1  200,844,276.0  200,844,276.0  200,844,276  200,844,276          0.0  cudaMalloc
+        22.6       58,594,762          2   29,297,381.0   29,297,381.0   29,153,999   29,440,763    202,772.8  cudaMemcpy
+         0.1          305,450          1      305,450.0      305,450.0      305,450      305,450          0.0  cudaLaunchKernel
+         0.0            1,970          1        1,970.0        1,970.0        1,970        1,970          0.0  cuModuleGetLoadingMode
+
+Compiler arguments
+------------------
+
+In the same way profiler arguments can be passed to the profiling tool,
+compiling arguments can be passed to **nvcc**:
+
+.. code-block:: c++
+
+    %cuda_group_run --group "vector_add" --compiler-args "--optimize 3"
+
+Running the cell above will compile and execute the vector addition code in the
+"vector_add" group. During compilation, **nvcc** receives the "\-\-optimize"
+option which specifies the optimization level for host code.
+
+Set default arguments
+---------------------
+
+In the case where you execute multiple magic commands with the same compiler or
+profiler arguments you can avoid writing them every time by setting the default
+arguments:
+
+.. code-block:: python
+
+    from nvcc4jupyter import set_defaults
+    set_defaults(compiler_args="--optimize 3", profiler_args="--section SpeedOfLight")
+
+The same effect can be achieved by running "set_defaults" once for each config
+due to the fact that the default value is not changed if an a value is not
+given to the "set_defaults" function.
+
+.. code-block:: python
+
+    from nvcc4jupyter import set_defaults
+    set_defaults(compiler_args="--optimize 3")
+    set_defaults(profiler_args="--section SpeedOfLight")
+
+
+Now we can run the following cell without specifying the compiler and profiler
+arguments once again.
+
+.. code-block:: c++
+
+    %cuda_group_run --group "vector_add" --profile
@@ -0,0 +1,3 @@
+# Importing the example notebooks in Kaggle from GitHub
+
+![Import Notebook in Kaggle](https://nvcc4jupyter.s3.eu-central-1.amazonaws.com/notebooks/import-notebook-kaggle.gif)
@@ -0,0 +1,405 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "4_qHS_1yjcrK"
+   },
+   "source": [
+    "# Compiling CUDA with OpenCV\n",
+    "In this notebook we will show how to compile CUDA C++ code with the OpenCV library, which we will use to load an image and save it back to disk after applying a blur kernel.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "2XyG-R3nk1p6"
+   },
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "J2iU1zyrK4FF"
+   },
+   "source": [
+    "Install and load the nvcc4jupyter extension, which will allow us to run CUDA C++ code using cell magics. See [the documentation](https://nvcc4jupyter.readthedocs.io/en/latest/usage.html) for details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "K1pfzEEsjlFD",
+    "outputId": "f14a4888-9215-4280-e723-1a3e1477135d"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install nvcc4jupyter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "w-4m2tFtlt_M",
+    "outputId": "9118eb6e-63ee-426f-a618-fb35d9750b81"
+   },
+   "outputs": [],
+   "source": [
+    "%load_ext nvcc4jupyter"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Make sure OpenCV is installed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!apt update && apt install -y libopencv-dev"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "hBZR7kZkk4UV"
+   },
+   "source": [
+    "## Code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "JppQ-78qLYjP"
+   },
+   "source": [
+    "### Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "vYyG7Gd0vJJ0"
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import subprocess\n",
+    "from pathlib import Path\n",
+    "from IPython.display import Image"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Zhs1emaLLemz"
+   },
+   "source": [
+    "### Download an image"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "1vyu_yKStyKG"
+   },
+   "source": [
+    "Get a random image of IMG_SIZE x IMG_SIZE pixels."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 417
+    },
+    "id": "T3yN-g8dvZ4P",
+    "outputId": "d5b8602c-6a1a-4de0-9fbf-8b72ef9f6071"
+   },
+   "outputs": [],
+   "source": [
+    "IMG_SIZE = 400\n",
+    "os.environ[\"IMG_SIZE\"] = str(IMG_SIZE)\n",
+    "IMG_FNAME = \"image.jpg\"\n",
+    "os.environ[\"IMG_FNAME\"] = IMG_FNAME\n",
+    "\n",
+    "!wget -O $IMG_FNAME https://picsum.photos/$IMG_SIZE.jpg &> /dev/null\n",
+    "Image(filename=IMG_FNAME)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "BeOimvKtNluI"
+   },
+   "source": [
+    "### Compiler arguments\n",
+    "These are options that you need to pass to the compiler in order to link with the OpenCV library."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# make sure the file containing the necessary compilation flags for the opencv library\n",
+    "# exists in a directory in the PKG_CONFIG_PATH environment variable\n",
+    "for path in Path('/usr/lib').rglob('opencv4.pc'):\n",
+    "    os.environ[\"PKG_CONFIG_PATH\"] = os.path.dirname(str(path))\n",
+    "    break\n",
+    "    \n",
+    "# get the compilation flags required to compile our CUDA C++ code with opencv\n",
+    "COMPILER_ARGS = subprocess.check_output([\"pkg-config\", \"--cflags\", \"--libs\", \"opencv4\"]).decode().strip()\n",
+    "COMPILER_ARGS = f\"-I/usr/include/opencv4 {COMPILER_ARGS}\"\n",
+    "print(COMPILER_ARGS)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "SAQ1x3_JNvhT"
+   },
+   "source": [
+    "### Blurring the image"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda_group_save -n \"error_handling.h\" -g \"shared\"\n",
+    "\n",
+    "// error checking macro\n",
+    "#define cudaCheckErrors(msg) \\\n",
+    "    do { \\\n",
+    "        cudaError_t __err = cudaGetLastError(); \\\n",
+    "        if (__err != cudaSuccess) { \\\n",
+    "            fprintf(stderr, \"Fatal error: %s (%s at %s:%d)\\n\", \\\n",
+    "                msg, cudaGetErrorString(__err), \\\n",
+    "                __FILE__, __LINE__); \\\n",
+    "            fprintf(stderr, \"*** FAILED - ABORTING\\n\"); \\\n",
+    "            exit(1); \\\n",
+    "        } \\\n",
+    "    } while (0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda_group_save -n \"blur_kernel.h\" -g \"shared\"\n",
+    "\n",
+    "#define BLUR_SIZE 5\n",
+    "\n",
+    "// kernel taken from https://stackoverflow.com/a/65973288\n",
+    "__global__ void blurKernel(\n",
+    "    unsigned char* in,\n",
+    "    unsigned char* out,\n",
+    "    int width,\n",
+    "    int height,\n",
+    "    int num_channel,\n",
+    "    int channel\n",
+    ") {\n",
+    "\n",
+    "    int col = blockIdx.x * blockDim.x + threadIdx.x;\n",
+    "    int row = blockIdx.y * blockDim.y + threadIdx.y;\n",
+    "\n",
+    "    if(col < width && row < height) {\n",
+    "        int pixVal = 0;\n",
+    "        int pixels = 0;\n",
+    "        for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {\n",
+    "            for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {\n",
+    "                int curRow = row + blurRow;\n",
+    "                int curCol = col + blurCol;\n",
+    "                if(curRow > -1 && curRow < height && curCol > -1 && curCol < width) {\n",
+    "                    pixVal += in[curRow * width * num_channel + curCol * num_channel + channel];\n",
+    "                    pixels++;\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "        out[row * width * num_channel + col * num_channel + channel] = (unsigned char)(pixVal/pixels);\n",
+    "    }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "hhkNS3HJlFlC",
+    "outputId": "bb0b3fad-f132-4a26-e247-32bf5e29c1a4"
+   },
+   "outputs": [],
+   "source": [
+    "%%cuda --compiler-args \"$COMPILER_ARGS\"\n",
+    "\n",
+    "#include <iostream>\n",
+    "\n",
+    "#include <opencv2/core.hpp>\n",
+    "#include <opencv2/imgcodecs.hpp>\n",
+    "\n",
+    "#include \"error_handling.h\"\n",
+    "#include \"blur_kernel.h\"\n",
+    "\n",
+    "#define R 0\n",
+    "#define G 1\n",
+    "#define B 2\n",
+    "\n",
+    "int main()\n",
+    "{\n",
+    "    std::string image_path = cv::samples::findFile(\"image.jpg\");\n",
+    "    cv::Mat img = imread(image_path, cv::IMREAD_COLOR);\n",
+    "    if(img.empty())\n",
+    "    {\n",
+    "        std::cerr << \"Could not read the image: \" << image_path << std::endl;\n",
+    "        return 1;\n",
+    "    }\n",
+    "\n",
+    "    // image shape\n",
+    "    int width = img.cols;\n",
+    "    int height = img.rows;\n",
+    "    int n_channels = 3; // hard-coded RGB processing\n",
+    "\n",
+    "    // we will read and write directly into the data array of the OpenCV Matrix\n",
+    "    unsigned char *host_image_input = img.data;\n",
+    "    unsigned char *host_image_output = img.data;\n",
+    "\n",
+    "    // allocate memory for device arrays\n",
+    "    unsigned char* dev_image_input = NULL;\n",
+    "    unsigned char* dev_image_output = NULL;\n",
+    "    cudaMalloc(\n",
+    "        (void**)&dev_image_input,\n",
+    "        sizeof(unsigned char) * height * width * n_channels\n",
+    "    );\n",
+    "    cudaMalloc(\n",
+    "        (void**)&dev_image_output,\n",
+    "        sizeof(unsigned char) * height * width * n_channels\n",
+    "    );\n",
+    "    cudaCheckErrors(\"cudaMalloc failure\");\n",
+    "\n",
+    "    // transfer data from host to device for processing on GPU\n",
+    "    cudaMemcpy(\n",
+    "        dev_image_input,\n",
+    "        host_image_input,\n",
+    "        sizeof(unsigned char) * height * width * n_channels,\n",
+    "        cudaMemcpyHostToDevice\n",
+    "    );\n",
+    "    cudaCheckErrors(\"cudaMemcpy H2D failure\");\n",
+    "\n",
+    "    // run a blur kernel on each channel\n",
+    "    dim3 blockSize(16, 16, 1);\n",
+    "    dim3 gridSize(width/blockSize.x, height/blockSize.y, 1);\n",
+    "    blurKernel<<<gridSize, blockSize>>>(dev_image_input, dev_image_output, width, height, n_channels, R);\n",
+    "    blurKernel<<<gridSize, blockSize>>>(dev_image_input, dev_image_output, width, height, n_channels, G);\n",
+    "    blurKernel<<<gridSize, blockSize>>>(dev_image_input, dev_image_output, width, height, n_channels, B);\n",
+    "\n",
+    "    cudaDeviceSynchronize();\n",
+    "    cudaCheckErrors(\"kernel failure\");\n",
+    "\n",
+    "    // copy results back to host\n",
+    "    cudaMemcpy(\n",
+    "        host_image_output,\n",
+    "        dev_image_output,\n",
+    "        sizeof(unsigned char) * height * width * n_channels,\n",
+    "        cudaMemcpyDeviceToHost\n",
+    "    );\n",
+    "    cudaCheckErrors(\"cudaMemcpy D2H failure\");\n",
+    "\n",
+    "    cudaFree(dev_image_input);\n",
+    "    cudaFree(dev_image_output);\n",
+    "\n",
+    "    // save the blurred image to disk\n",
+    "    cv::imwrite(\"image_blurred.jpg\", img);\n",
+    "    return 0;\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 417
+    },
+    "id": "Bn9X_cm4rALH",
+    "outputId": "ead0c3b5-013b-45a3-f91a-7866d33b2404"
+   },
+   "outputs": [],
+   "source": [
+    "Image(filename=\"image_blurred.jpg\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "T4",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "kaggle": {
+   "accelerator": "nvidiaTeslaT4",
+   "dataSources": [],
+   "dockerImageVersionId": 30716,
+   "isGpuEnabled": true,
+   "isInternetEnabled": true,
+   "language": "python",
+   "sourceType": "notebook"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -0,0 +1,578 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction\n",
+    "\n",
+    "This notebook is an adaptation of [this session](https://www.olcf.ornl.gov/calendar/introduction-to-cuda-c/) (presentation and assignments) from [the CUDA training series](https://www.olcf.ornl.gov/cuda-training-series/) provided to the Oak Ridge National Laboratory by [Bob Crovella](https://developer.nvidia.com/blog/author/bob-crovella/), who is on the Solution Architecture team at NVIDIA. While not meant as a replacement to the course, this notebook goes over the main points and acts as a way to quickly put them into practice. Remember that the theoretical part may be a bit overwhelming at first, so make sure you run the examples and play around with them to get a better understanding of the concepts."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Setup\n",
+    "- Make sure you run the notebook on either **Kaggle**, **Colab** or anywhere there is an NVIDIA GPU readily available.\n",
+    "- We need to install and load the [nvcc4jupyter](https://github.com/andreinechaev/nvcc4jupyter) extension, which enables us to run CUDA C++ code directly from the notebook cells"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install nvcc4jupyter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext nvcc4jupyter"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction to CUDA C++\n",
+    "\n",
+    "This course covers what CUDA is, the CUDA programming model, CUDA syntax, the basics of error handling and running simple CUDA programs with little care for performance, as that will be covered in the next courses. It contains:\n",
+    "- The theoretical basics of CUDA\n",
+    "- Learning CUDA syntax\n",
+    "- Vector addition example\n",
+    "- Writing and launching CUDA C++ kernels\n",
+    "- Managing GPU Memory\n",
+    "- Matrix multiplication assignment"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## CUDA basics\n",
+    "\n",
+    "CUDA is a **parallel computing platform and API** created by NVIDIA which allows developers to use NVIDIA cards for **accelerating repetitive computational tasks**. The most well known applications of this are **video games**, where each pixel is processed in parallel with many others, and in running **artificial neural networks**. Both applications use lots of matrix multiplications and other mathematical operations which can be run in parallel. We will call this accelerator card **\"the device\"** (also known as GPU).\n",
+    "\n",
+    "The device is a separate hardware component from the CPU (which we will call **\"the host\"**), but they are similar in many ways. Both have memory and compute resources to manage, which is the task of their operating system. Both have the concept of **cores**, which are \"independent\" processors which execute instructions. We will see that device cores are less independent than host cores, and they also are less complex, meaning you can fit a lot more of them in the same space. Having more cores means the GPU is capable of processing the same workload a lot faster if it can be parallelized. The host operating system is much more complex as well, having to manage the interactions with the network, disk, etc.\n",
+    "\n",
+    "<div style=\"text-align:center\"><img src=\"https://nvcc4jupyter.s3.eu-central-1.amazonaws.com/notebooks/introduction-to-cuda-cpp/host_device_communication.png\"/></div>\n",
+    "\n",
+    "The figure above shows the basic interactions between device and host. The **host is what decides what needs to be done (the kernel code) and on what data**, and the **device will execute that code on all of the data with as much parallelism as it can**. This interaction makes sense only if it is faster to send the data to the device, process it with its numerous cores, and have it sent back to the host, than to directly process it on the host. This is a matter of **exposing enough parallelism** in your problem to justify sending the data. In this session we will cover only [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) problems such as **vector addition** and **matrix multiplication** which should easily benefit from device acceleration.\n",
+    "\n",
+    "<div style=\"text-align:center\"><img src=\"https://nvcc4jupyter.s3.eu-central-1.amazonaws.com/notebooks/introduction-to-cuda-cpp/porting_to_cuda.png\"/></div>\n",
+    "\n",
+    "In most applications, there will be sequences of code that **cannot be parallelized** or **require I/O** (as shown in blue in the figure above) and which must be run on the much fewer in number, but faster and smarter cores of the host (CPU). The developer has to identify those functions which are **computationally intensive and parallelizable** (shown in green).\n",
+    "\n",
+    "## Cuda programming model\n",
+    "\n",
+    "### 3-step processing flow\n",
+    "\n",
+    "Using a GPU to accelerate a computation can be reduced into **3 main steps** (with unified memory the story is a bit more complex, but that is a topic of discussion for future sessions):\n",
+    "\n",
+    "1. Copy input data from host to device.\n",
+    "2. Send kernel code from host to device and process the data from step 1.\n",
+    "3. Copy output data from device to host.\n",
+    "\n",
+    "Communication is done through a high performance bus such as [PCIe](https://en.wikipedia.org/wiki/PCI_Express) or [NVLink](https://en.wikipedia.org/wiki/NVLink), who ensure data transfer at high bandwidth and as low latency as possible.\n",
+    "\n",
+    "It is important to keep in mind the fact that host and device memory are separate entities, so a pointer to host memory will not be usable in device code and vice versa.  \n",
+    "\n",
+    "### Host function definition\n",
+    "\n",
+    "CUDA code is compiled by [nvcc](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html), which is a **compiler driver** that **calls multiple compilers under the hood** to perform various compilation stages. It will separate host and device functions, host functions being compiled by standard host compilers such as [gcc](https://gcc.gnu.org/), and device functions by the NVIDIA compiler. Now we will take a look at how to write syntactically correct CUDA C++ code. The first step is writing a simple function to run on the device:\n",
+    "\n",
+    "```cpp\n",
+    "/* \n",
+    " * the \"__global__\" keyword signals to \"nvcc\" that this code must be compiled to\n",
+    " * run on the device, not the host; other than that, this is very much the same\n",
+    " * way you would define a function in C/C++\n",
+    " */\n",
+    "__global__ void my_kernel(void) {\n",
+    "    // FUNCTION CODE\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "Now for a more complex example:\n",
+    "\n",
+    "```cpp\n",
+    "/*\n",
+    " * this is the signature of the vector addition function we will encounter in \n",
+    " * the examples; we're passing the pointers for the two input vectors, A and \n",
+    " * B, one for the output vector C, and the length of those vectors so we know\n",
+    " * where to stop;\n",
+    " */\n",
+    "__global__ void vector_add_device(const float *A, const float *B, float *C, int vector_size){\n",
+    "    // FUNCTION CODE\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "### Host function calls\n",
+    "\n",
+    "Calling a device function is a topic on its own. Compared to C/C++, device functions are called with the **kernel launch** syntax:\n",
+    "\n",
+    "```cpp\n",
+    "vector_add_device<<1, 1>>(A, B, C, vector_size);\n",
+    "```\n",
+    "\n",
+    "For reference, here is how it would look for C/C++:\n",
+    "\n",
+    "```cpp\n",
+    "vector_add_host(A, B, C, vector_size);\n",
+    "```\n",
+    "\n",
+    "The kernel launch configuration is one of the ways we tell the device how to parallelize our function when we want to use multiple workers at the same time. Just as a quick example to get some intuition, this is how you would tell it to run **N workers in parallel** to perform vector addition:\n",
+    "\n",
+    "```cpp\n",
+    "vector_add_device<<N, 1>>(A, B, C, vector_size);\n",
+    "```\n",
+    "\n",
+    "### Block and thread indices\n",
+    "\n",
+    "Up until now we have only looked at function signatures, but we have not actually completed the `vector_add_device` function with its code. Let's take a look without worrying about understanding everything from the start:\n",
+    "\n",
+    "```cpp\n",
+    "__global__ void vector_add_device(const float *A, const float *B, float *C, int vector_size){\n",
+    "    int idx = threadIdx.x + blockIdx.x * blockDim.x;\n",
+    "    if (idx < vector_size) {\n",
+    "        C[idx] = A[idx] + B[idx];\n",
+    "    }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "It is time to have a brief explanation on what **threads** and **blocks** are. A thread is the **smallest unit of work** in CUDA. Take a quick look at the code above and you will notice there is no for-loop to process all elements of the two input vectors. The function only adds and saves the result of the elements at the index `idx`. What you see here is the view of one of the many threads that exist in a device function call. Each thread computes an index in the array, adds the elements at that index and saves the result in the output array at that index. This is an essential part of writing device functions: figuring out how to divide work among threads by giving each of them an index to process. Of course, we are simplifying a bit.\n",
+    "\n",
+    "Each thread is assigned into a block of multiple threads (at most 1024 threads per block). This is where understanding the hardware starts being important and we will cover the details in the following courses. For now, you only need to know that threads that run in the same block can more easily communicate and synchronize with one another because they are executed on the same Streaming Multiprocessor.\n",
+    "\n",
+    "<div style=\"text-align:center\"><img src=\"https://nvcc4jupyter.s3.eu-central-1.amazonaws.com/notebooks/introduction-to-cuda-cpp/kernel_execution_on_gpu.png\"/></div>\n",
+    "\n",
+    "Now take a look at this line:\n",
+    "\n",
+    "```cpp\n",
+    "int idx = threadIdx.x + blockIdx.x * blockDim.x;\n",
+    "```\n",
+    "\n",
+    "- `threadIdx`: Is a built-in structure (it is not defined anywhere in your code, but it is provided at runtime) that contains the index of the thread inside its own block. This index is not unique by itself as there are threads in other blocks that will share the same index.\n",
+    "- `blockIdx`: Is a built-in structure that contains the index of the block that contains this thread. Is constant for all threads in the same block.\n",
+    "- `blockDim`: Is a built-in structure that contains the number of threads per block. Is constant for all threads, no matter the block they belong to. When multiplied with `blockIdx`, it yields the index of the first thread in the block with index `blockIdx`. When also adding `threadIdx` you get the global index of the given thread. This example matches the figure above which represents a 1-dimensional grid of 1-dimensional blocks. Multiple dimensions are possible, but we only use one, hence accessing the first value of all 3 structures: `threadIdx.x`, `blockIdx.x`, and `blockDim.x`.\n",
+    "\n",
+    "\n",
+    "At this point we know enough to run our hello world example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda\n",
+    "#include <stdio.h>\n",
+    "\n",
+    "__global__ void hello(){\n",
+    "    int global_idx = threadIdx.x + blockIdx.x * blockDim.x;\n",
+    "    printf(\n",
+    "        \"Hello from thread with global index %u (threadIdx.x: %u, blockIdx.x: %u, blockDim.x: %u)\\n\", \n",
+    "        global_idx, threadIdx.x, blockIdx.x, blockDim.x\n",
+    "    );\n",
+    "}\n",
+    "\n",
+    "int main() {\n",
+    "    int n_blocks = 2;\n",
+    "    int n_threads_per_block = 3;\n",
+    "    hello<<<n_blocks, n_threads_per_block>>>();\n",
+    "    \n",
+    "    // wait for the execution of the asynchronous \"hello\" function call to finish\n",
+    "    cudaDeviceSynchronize();\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you run the hello world example with `n_blocks = 2` and `n_threads_per_block = 3` then you will be able to see that there are 6 threads with global indices from 0 to 5. There are two threads with thread id (`threadIdx.x`) equal to 0, one of them being on the first position of the first block and the second thread being on the first position of the second block. The block id (`blockIdx.x`) is shared between threads from the same block (3 threads have block id 0 and 3 have id 1). The block size (`blockDim.x`) is equal to 3 for all threads since this is a global setting (the number of threads per block) that is set in the kernel launch configuration.\n",
+    "\n",
+    "<div style=\"text-align:center\"><img src=\"https://nvcc4jupyter.s3.eu-central-1.amazonaws.com/notebooks/introduction-to-cuda-cpp/block_thread_indexing.png\"/></div>\n",
+    "\n",
+    "If you change `n_blocks = 4` and `n_threads_per_block = 8` then you will see how the globally unique index is computed for the example in the figure above. \n",
+    "\n",
+    "### Memory management\n",
+    "\n",
+    "To be able to transition from our simple hello world program to vector addition we need to see how to handle memory. The most important thing to remember is that **device memory and host memory are two different entities**. This means you cannot use pointers for one memory space in the other. This also means you need to have **two copies of each input and output array**, one in host memory, and one in device memory. Usually, memory handling follows this flow: \n",
+    "\n",
+    "1. Allocate host memory (input and output arrays)\n",
+    "    - `malloc()`\n",
+    "2. Allocate device memory (input and output arrays)\n",
+    "    - `cudaMalloc()`\n",
+    "3. Fill host input arrays from any of a number of sources: files, network, etc.\n",
+    "    - `memcpy()` or other means of populating the arrays\n",
+    "4. Copy host input arrays' content to device input arrays (this is done over the NVLink / PCIe bus)\n",
+    "    - `cudaMemcpy()`\n",
+    "5. Run kernel code to turn inputs into outputs\n",
+    "6. Copy device output arrays' content to the host output arrays (NVLink / PCIe bus)\n",
+    "    - `cudaMemcpy()`\n",
+    "7. Free device memory\n",
+    "    - `cudaFree()`\n",
+    "8. Free host memory\n",
+    "    - `free()`\n",
+    "    \n",
+    "We have **equivalents** between **C/C++** and the **CUDA API**:\n",
+    "- Memory allocation: \n",
+    "    - [malloc](https://en.cppreference.com/w/c/memory/malloc) / [cudaMalloc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html)\n",
+    "- Copying data from one array to another: \n",
+    "    - [memcpy](https://en.cppreference.com/w/c/memory/memcpy) / [cudaMemcpy](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html)\n",
+    "- Memory deallocation: \n",
+    "    - [free](https://en.cppreference.com/w/c/memory/free) / [cudaFree](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html)\n",
+    "    \n",
+    "Unlike **memcpy**, **cudaMemcpy** can copy and write data to/from either host or device memory. This means there are 4 types of data transfer, **host-to-host** (same as the regular memcpy), **host-to-device** (same as step 4 from above), **device-to-host** (same as step 6 from above), and **device-to-device**. The developer always needs to pay attention to what kind of pointers he provides to cudaMemcpy.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Useful functions\n",
+    "\n",
+    "We will define some functions which will be useful for both the vector addition example and the matrix multiplication assignment. You may skip reading those (especially those marked as assignment solutions). These functions will be available when adding `#include \"utils.h\"`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda_group_save --group shared --name \"utils.h\"\n",
+    "#include <math.h>\n",
+    "\n",
+    "// error checking macro\n",
+    "#define cudaCheckErrors(msg) \\\n",
+    "    do { \\\n",
+    "        cudaError_t __err = cudaGetLastError(); \\\n",
+    "        if (__err != cudaSuccess) { \\\n",
+    "            fprintf(stderr, \"Fatal error: %s (%s at %s:%d)\\n\", \\\n",
+    "                msg, cudaGetErrorString(__err), \\\n",
+    "                __FILE__, __LINE__); \\\n",
+    "            fprintf(stderr, \"*** FAILED - ABORTING\\n\"); \\\n",
+    "            exit(1); \\\n",
+    "        } \\\n",
+    "    } while (0)\n",
+    "    \n",
+    "int divide_ceil(int dividend, int divisor) {\n",
+    "    // equivalent to \"ceil(dividend / divisor)\" converted to int, but fast\n",
+    "    return (dividend + divisor - 1) / divisor;\n",
+    "}\n",
+    "\n",
+    "bool almost_equal(float first, float second, float abs_tol = 0.001, float rel_tol = 0.001)\n",
+    "{\n",
+    "    float diff = fabs(first - second);\n",
+    "    if (diff <= abs_tol)\n",
+    "        return true;\n",
+    "\n",
+    "    first = fabs(first);\n",
+    "    second = fabs(second);\n",
+    "    float largest = (second > first) ? second : first;\n",
+    "\n",
+    "    if (diff <= largest * rel_tol)\n",
+    "        return true;\n",
+    "    return false;\n",
+    "}\n",
+    "\n",
+    "void compare_arrays(float *x, float *y, int n, float abs_tol = 0.001, float rel_tol = 0.001) {\n",
+    "    for (int i = 0; i < n; i++) {\n",
+    "        if (!almost_equal(x[i], y[i], abs_tol, rel_tol)) {\n",
+    "            printf(\"[ERROR] Arrays/matrices are not equal. At index %d expected \\\"%f\\\", got \\\"%f\\\"\\n\", i, x[i], y[i]);\n",
+    "            return;\n",
+    "        }\n",
+    "    }\n",
+    "    printf(\"[SUCCESS] Arrays/matrices are equal.\\n\");\n",
+    "}\n",
+    "\n",
+    "// this was obfuscated so you do not accidentally see the solution; it is also\n",
+    "// a purely C/C++ solution so not a 1:1 replacement for the assignment\n",
+    "void assignment_solution(float *x, float *y, float *z, int w) {\n",
+    "   for (int i = 0; i < w; ++i) { for (int j = 0; j < w; ++j) { z[i*w+j] = 0; for (int k = 0; k < w; ++k) { z[i*w+j] += x[i*w+k] * y[k*w+j]; } } }\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Vector addition example\n",
+    "\n",
+    "Now we can finally implement the vector addition program which will also give us a view into how to manage memory in CUDA C++. Read the comments for detailed explanations:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda\n",
+    "#include <stdio.h>\n",
+    "#include \"utils.h\"\n",
+    "\n",
+    "const int VECTOR_SIZE = 4096;  // the size of the input and output vectors\n",
+    "const int BLOCK_SIZE = 256;  // the number of threads per block; limited to 1024\n",
+    "\n",
+    "// vector add kernel: C[i] <- A[i] + B[i] for i in [0, 1, ..., VECTOR_SIZE - 1]\n",
+    "__global__ void vadd(const float *A, const float *B, float *C, int vector_size) {\n",
+    "    // compute the globally unique index which tells the current thread \n",
+    "    // what element it is supposed to be processing\n",
+    "    int global_idx = threadIdx.x + blockIdx.x * blockDim.x;\n",
+    "    \n",
+    "    // if BLOCK_SIZE does not divide VECTOR_SIZE exactly, the last block\n",
+    "    // will have threads whose global index will be out of bounds\n",
+    "    if (global_idx < vector_size) {\n",
+    "        C[global_idx] = A[global_idx] + B[global_idx];\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "int main(){\n",
+    "    // all vectors are float pointers, but some of them will contain an\n",
+    "    // address in host space (h_A, h_B, h_C) and the others an address in\n",
+    "    // device space (d_A, d_B, d_C)\n",
+    "    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;\n",
+    "\n",
+    "    // allocate space for vectors in host memory\n",
+    "    h_A = new float[VECTOR_SIZE];\n",
+    "    h_B = new float[VECTOR_SIZE];\n",
+    "    h_C = new float[VECTOR_SIZE];\n",
+    "\n",
+    "    // assign random values to input vectors in host memory\n",
+    "    for (int i = 0; i < VECTOR_SIZE; i++) {\n",
+    "        h_A[i] = rand()/(float)RAND_MAX;\n",
+    "        h_B[i] = rand()/(float)RAND_MAX;\n",
+    "    }\n",
+    "\n",
+    "    // allocate space for vectors in device memory\n",
+    "    cudaMalloc(&d_A, VECTOR_SIZE*sizeof(float));\n",
+    "    cudaMalloc(&d_B, VECTOR_SIZE*sizeof(float));\n",
+    "    cudaMalloc(&d_C, VECTOR_SIZE*sizeof(float));\n",
+    "    // check that the last CUDA API call was successful; if not, exit\n",
+    "    cudaCheckErrors(\"cudaMalloc failure\"); \n",
+    "\n",
+    "    // copy vectors A and B from host to device:\n",
+    "    cudaMemcpy(d_A, h_A, VECTOR_SIZE*sizeof(float), cudaMemcpyHostToDevice);\n",
+    "    cudaMemcpy(d_B, h_B, VECTOR_SIZE*sizeof(float), cudaMemcpyHostToDevice);\n",
+    "    cudaCheckErrors(\"cudaMemcpy H2D failure\");\n",
+    "\n",
+    "    // we need to have as many threads as there are elements in the vector;\n",
+    "    // one option would be to call the function like this:\n",
+    "    //     vadd<<<VECTOR_SIZE, 1>>>(d_A, d_B, d_C, VECTOR_SIZE);\n",
+    "    // however, we want to showcase the use of multiple threads per block,\n",
+    "    // which is a feature that will benefit us in future problems for performance\n",
+    "    // optimization; the number of threads is N_BLOCKS * BLOCK_SIZE, both of which\n",
+    "    // are integers; this means we need to choose the number of blocks such that\n",
+    "    // N_BLOCKS * BLOCK_SIZE >= VECTOR_SIZE; this is the reason for the if \n",
+    "    // statement in the vector addition kernel function (we may have more threads\n",
+    "    // than elements in the vector so the extra threads will do nothing useful)\n",
+    "    int n_blocks = divide_ceil(VECTOR_SIZE, BLOCK_SIZE);\n",
+    "\n",
+    "    // launch the vector adding kernel\n",
+    "    vadd<<<n_blocks, BLOCK_SIZE>>>(d_A, d_B, d_C, VECTOR_SIZE);\n",
+    "    cudaCheckErrors(\"kernel launch failure\");\n",
+    "\n",
+    "    // wait for the kernel to finish execution\n",
+    "    cudaDeviceSynchronize();\n",
+    "    cudaCheckErrors(\"kernel execution failure\");\n",
+    "\n",
+    "    // copy output array data from device memory to host memory\n",
+    "    cudaMemcpy(h_C, d_C, VECTOR_SIZE*sizeof(float), cudaMemcpyDeviceToHost);\n",
+    "    cudaCheckErrors(\"cudaMemcpy D2H failure\");\n",
+    "    \n",
+    "    // verify results\n",
+    "    float *ground_truth = new float[VECTOR_SIZE];\n",
+    "    for (int i = 0; i < VECTOR_SIZE; i++) {\n",
+    "        ground_truth[i] = h_A[i] + h_B[i];\n",
+    "    }\n",
+    "    compare_arrays(ground_truth, h_C, VECTOR_SIZE);\n",
+    "    \n",
+    "    // print the first element of each host array to see that C[0] == A[0] + B[0]\n",
+    "    printf(\"A[0] = %f\\n\", h_A[0]);\n",
+    "    printf(\"B[0] = %f\\n\", h_B[0]);\n",
+    "    printf(\"C[0] = %f\\n\", h_C[0]);\n",
+    "    return 0;\n",
+    "}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Assignment - Matrix Multiplication\n",
+    "\n",
+    "- **TODO_01** - For this task you need to add the indices for the dot product operation between the row of the first matrix and the column of the second matrix in order to finish the matrix multiplication code. Keep in mind that the matrices are stored in "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%cuda\n",
+    "#include <stdio.h>\n",
+    "#include <time.h>\n",
+    "#include \"utils.h\"\n",
+    "\n",
+    "const int MATRIX_SIZE = 512;\n",
+    "const int BLOCK_SIZE = 16;\n",
+    "\n",
+    "// matrix multiply (naive) kernel: C <- A * B\n",
+    "__global__ void matrix_multiply(const float *A, const float *B, float *C, int matrix_size) {\n",
+    "    // each thread must compute one element in the output matrix, which is \n",
+    "    // determined by the column and row index; together they form a globally \n",
+    "    // unique pair of indices since there will be only one thread that will\n",
+    "    // compute C[c_row][c_col]\n",
+    "    int c_col = threadIdx.x + blockDim.x * blockIdx.x;\n",
+    "    int c_row = threadIdx.y + blockDim.y * blockIdx.y;\n",
+    "\n",
+    "    if ((c_col < matrix_size) && (c_row < matrix_size)) {\n",
+    "        // compute in \"temp\" the dot product of row \"c_row\" from A \n",
+    "        // and column \"c_col\" in B; save the result in C[c_row][c_col]\n",
+    "        // assuming row-major ordering\n",
+    "        float temp = 0;\n",
+    "        /**************** TODO_01 *****************/\n",
+    "        for (int i = 0; i < matrix_size; i++) {\n",
+    "            temp += A[FIXME] * B[FIXME];\n",
+    "        }\n",
+    "        /******************************************/\n",
+    "        C[c_row * matrix_size + c_col] = temp;\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "int main(){\n",
+    "    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;\n",
+    "    int n_elements = MATRIX_SIZE * MATRIX_SIZE;\n",
+    "\n",
+    "    // variables used for computing the execution time\n",
+    "    clock_t t0, t1, t2, t3;\n",
+    "    double t1_diff = 0.0;\n",
+    "    double t2_diff = 0.0;\n",
+    "    double t3_diff = 0.0;\n",
+    "    t0 = clock();\n",
+    "    \n",
+    "    h_A = new float[n_elements];\n",
+    "    h_B = new float[n_elements];\n",
+    "    h_C = new float[n_elements];\n",
+    "    for (int i = 0; i < n_elements; i++){\n",
+    "        h_A[i] = rand()/(float)RAND_MAX;\n",
+    "        h_B[i] = rand()/(float)RAND_MAX;\n",
+    "    }\n",
+    "\n",
+    "    // Initialization timing\n",
+    "    t1 = clock();\n",
+    "    t1_diff = ((double)(t1-t0))/CLOCKS_PER_SEC;\n",
+    "    printf(\"Init took %f seconds. Begin compute\\n\", t1_diff);\n",
+    "\n",
+    "    // Allocate device memory and copy input data over to GPU\n",
+    "    cudaMalloc(&d_A, n_elements*sizeof(float));\n",
+    "    cudaMalloc(&d_B, n_elements*sizeof(float));\n",
+    "    cudaMalloc(&d_C, n_elements*sizeof(float));\n",
+    "    cudaCheckErrors(\"cudaMalloc failure\");\n",
+    "    cudaMemcpy(d_A, h_A, n_elements*sizeof(float), cudaMemcpyHostToDevice);\n",
+    "    cudaMemcpy(d_B, h_B, n_elements*sizeof(float), cudaMemcpyHostToDevice);\n",
+    "    cudaCheckErrors(\"cudaMemcpy H2D failure\");\n",
+    "\n",
+    "    // Launch kernel\n",
+    "    dim3 block(BLOCK_SIZE, BLOCK_SIZE);  // dim3 variable holds 3 dimensions\n",
+    "    dim3 grid(divide_ceil(MATRIX_SIZE, block.x), divide_ceil(MATRIX_SIZE, block.y));\n",
+    "    matrix_multiply<<<grid, block>>>(d_A, d_B, d_C, MATRIX_SIZE);\n",
+    "    cudaCheckErrors(\"kernel launch failure\");\n",
+    "\n",
+    "    // Copy results back to host\n",
+    "    cudaMemcpy(h_C, d_C, n_elements*sizeof(float), cudaMemcpyDeviceToHost);\n",
+    "\n",
+    "    // GPU timing\n",
+    "    t2 = clock();\n",
+    "    t2_diff = ((double)(t2-t1))/CLOCKS_PER_SEC;\n",
+    "    printf (\"Done. Compute took %f seconds\\n\", t2_diff);\n",
+    "\n",
+    "    cudaCheckErrors(\"kernel execution failure or cudaMemcpy H2D failure\");\n",
+    "\n",
+    "    // verify results\n",
+    "    float *ground_truth = new float[n_elements];\n",
+    "    assignment_solution(h_A, h_B, ground_truth, MATRIX_SIZE);\n",
+    "    t3 = clock();\n",
+    "    t3_diff = ((double)(t3-t2))/CLOCKS_PER_SEC;\n",
+    "    printf (\"Verifying results on CPU took %f seconds\\n\", t3_diff);\n",
+    "    compare_arrays(ground_truth, h_C, n_elements);\n",
+    "\n",
+    "    return 0;\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Solution\n",
+    "\n",
+    "We are not yet dealing with CUDA optimization problems. We just want to make this program produce the correct answer. The TODO is only a matter of mapping a 2D matrix into a 1D vector using [row-major ordering](https://en.wikipedia.org/wiki/Row-_and_column-major_order) like this:\n",
+    "\n",
+    "```cpp\n",
+    "/**************** TODO_01 *****************/\n",
+    "for (int i = 0; i < matrix_size; i++) {\n",
+    "    temp += A[c_row * matrix_size + i] * B[i * matrix_size + c_col];\n",
+    "}\n",
+    "/******************************************/\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Future sessions\n",
+    "\n",
+    "- CUDA Shared Memory\n",
+    "- CUDA GPU architecture and basic optimizations\n",
+    "- Atomics, Reductions, Warp Shuffle\n",
+    "- Using Managed Memory\n",
+    "- Concurrency (streams, copy/compute overlap, multi-GPU)\n",
+    "- Analysis Driven Optimization\n",
+    "- Cooperative Groups"
+   ]
+  }
+ ],
+ "metadata": {
+  "kaggle": {
+   "accelerator": "nvidiaTeslaT4",
+   "dataSources": [],
+   "dockerImageVersionId": 30698,
+   "isGpuEnabled": true,
+   "isInternetEnabled": true,
+   "language": "python",
+   "sourceType": "notebook"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -2,6 +2,7 @@
 nvcc4jupyter: CUDA C++ plugin for Jupyter Notebook
 """

+from .parsers import Profiler, set_defaults  # noqa: F401
 from .plugin import NVCCPlugin, load_ipython_extension  # noqa: F401

-__version__ = "1.0.3"
+__version__ = "1.2.1"
@@ -3,6 +3,63 @@ Parsers for the CUDA magic commands.
 """

 import argparse
+from enum import Enum
+from typing import Callable, Optional, Type, TypeVar
+
+
+class Profiler(Enum):
+    """Choice between Nsight Compute and Nsight Systems profilers."""
+
+    NCU = "ncu"
+    NSYS = "nsys"
+
+
+_default_profiler: Profiler = Profiler.NCU
+_default_profiler_args: str = ""
+_default_compiler_args: str = ""
+
+T = TypeVar("T")
+
+
+def set_defaults(
+    profiler: Optional[Profiler] = None,
+    compiler_args: Optional[str] = None,
+    profiler_args: Optional[str] = None,
+) -> None:
+    """
+    Set the default values for various arguments of the magic commands. These
+    values will be used if the user does not explicitly provide those arguments
+    to override this behaviour on a cell by cell basis.
+
+    Args:
+        profiler: If not None, this value becomes the new default profiler.
+            Defaults to None.
+        compiler_args: If not None, this value becomes the new default compiler
+            config. Defaults to None.
+        profiler_args: If not None, this value becomes the new default profiler
+            config. Defaults to None.
+    """
+
+    # pylint: disable=global-statement
+    global _default_profiler
+    if profiler is not None:
+        _default_profiler = profiler
+    global _default_compiler_args
+    if compiler_args is not None:
+        _default_compiler_args = compiler_args
+    global _default_profiler_args
+    if profiler_args is not None:
+        _default_profiler_args = profiler_args
+
+
+def str_to_lambda(arg: str) -> Callable[[], str]:
+    """Convert argparse string to lambda"""
+    return lambda: arg
+
+
+def class_to_lambda(arg: str, cls: Type[T]) -> Callable[[], T]:
+    """Convert string value to class and then to lambda"""
+    return lambda: cls(arg)


 def get_parser_cuda() -> argparse.ArgumentParser:
@@ -18,7 +75,28 @@ def get_parser_cuda() -> argparse.ArgumentParser:
    )
    parser.add_argument("-t", "--timeit", action="store_true")
    parser.add_argument("-p", "--profile", action="store_true")
-    parser.add_argument("-a", "--profiler-args", type=str, default="")
+
+    # the type of the following arguments is a lambda lambda function to allow
+    # changing the default value at runtime
+    parser.add_argument(
+        "-l",
+        "--profiler",
+        type=lambda arg: class_to_lambda(arg, cls=Profiler),
+        default=lambda: _default_profiler,
+    )
+    parser.add_argument(
+        "-a",
+        "--profiler-args",
+        type=str_to_lambda,
+        default=lambda: _default_profiler_args,
+    )
+    parser.add_argument(
+        "-c",
+        "--compiler-args",
+        type=str_to_lambda,
+        default=lambda: _default_compiler_args,
+    )
+
    return parser


@@ -0,0 +1,61 @@
+"""
+Helper functions relating to file paths.
+"""
+
+import os
+from glob import glob
+from typing import List, Optional
+
+CUDA_SEARCH_PATHS: List[str] = [
+    "/opt/nvidia/nsight-compute",
+    "/usr/local/cuda",
+    "/opt",
+    "/usr",
+]
+
+
+def is_executable(fpath: str) -> bool:
+    """Check if file exists and is executable"""
+    return os.path.isfile(fpath) and os.access(fpath, os.X_OK)
+
+
+def which(name: str) -> Optional[str]:
+    """Find an executable by name by searching the PATH directories"""
+    for path_dir in os.environ.get("PATH", "").split(os.pathsep):
+        exec_path = os.path.join(path_dir, name)
+        if is_executable(exec_path):
+            return exec_path
+    return None
+
+
+def find_executable(
+    name: str, search_paths: Optional[List[str]] = None
+) -> Optional[str]:
+    """
+    Find an executable, either by searching in the directories of the PATH
+    environment variable or, if that did not work, by searching recursively
+    in directories a list given as parameter.
+
+    Args:
+        name: The name of the executable to be found.
+        search_paths: If None, only executables that are available from PATH
+            will be found. Otherwise, will recursively search these
+            directories. Defaults to None.
+
+    Returns:
+        The path to the executable if it is found, and None otherwise.
+    """
+    if search_paths is None:
+        search_paths = []
+
+    which_path = which(name)
+    if which_path is not None:
+        return which_path
+
+    for search_path in search_paths:
+        search_path = os.path.abspath(search_path)
+        search_path = os.path.join(search_path, f"**/{name}")
+        for exec_path in glob(search_path, recursive=True):
+            return exec_path
+
+    return None
@@ -9,13 +9,21 @@ import shutil
 import subprocess
 import tempfile
 import uuid
-from typing import List, Optional
+from typing import Dict, List, Optional

 # pylint: disable=import-error
 from IPython.core.interactiveshell import InteractiveShell
 from IPython.core.magic import Magics, cell_magic, line_magic, magics_class

-from . import parsers
+from .parsers import (
+    Profiler,
+    get_parser_cuda,
+    get_parser_cuda_group_delete,
+    get_parser_cuda_group_run,
+    get_parser_cuda_group_save,
+)
+from .path_utils import CUDA_SEARCH_PATHS, find_executable
+from .setup_env import setup_environment

 DEFAULT_EXEC_FNAME = "cuda_exec.out"
 SHARED_GROUP_NAME = "shared"
@@ -37,14 +45,19 @@ class NVCCPlugin(Magics):
        super().__init__(shell)
        self.shell: InteractiveShell  # type hint not provided by parent class

-        self.parser_cuda = parsers.get_parser_cuda()
-        self.parser_cuda_group_save = parsers.get_parser_cuda_group_save()
-        self.parser_cuda_group_delete = parsers.get_parser_cuda_group_delete()
-        self.parser_cuda_group_run = parsers.get_parser_cuda_group_run()
+        self.parser_cuda = get_parser_cuda()
+        self.parser_cuda_group_save = get_parser_cuda_group_save()
+        self.parser_cuda_group_delete = get_parser_cuda_group_delete()
+        self.parser_cuda_group_run = get_parser_cuda_group_run()

        self.workdir = tempfile.mkdtemp()
        print(f'Source files will be saved in "{self.workdir}".')

+        self.profiler_paths: Dict[Profiler, Optional[str]] = {
+            Profiler.NCU: None,
+            Profiler.NSYS: None,
+        }
+
    def _save_source(
        self, source_name: str, source_code: str, group_name: str
    ) -> None:
@@ -87,7 +100,10 @@ class NVCCPlugin(Magics):
            shutil.rmtree(group_dirpath)

    def _compile(
-        self, group_name: str, executable_fname: str = DEFAULT_EXEC_FNAME
+        self,
+        group_name: str,
+        executable_fname: str = DEFAULT_EXEC_FNAME,
+        compiler_args: str = "",
    ) -> str:
        """
        Compiles all source files in a given group together with all source
@@ -97,6 +113,7 @@ class NVCCPlugin(Magics):
            group_name: The name of the source file group to be compiled.
            executable_fname: The output executable file name. Defaults to
                "cuda_exec.out".
+            compiler_args: The optional "nvcc" compiler arguments.

        Raises:
            RuntimeError: If the group does not exist or if does not have any
@@ -121,27 +138,52 @@ class NVCCPlugin(Magics):

        executable_fpath = os.path.join(group_dirpath, executable_fname)

-        args = [
-            "nvcc",
-            "-I" + shared_dirpath + "," + group_dirpath,
-        ]
+        args = ["nvcc"]
+        args.extend(compiler_args.split())
+        args.append("-I" + shared_dirpath + "," + group_dirpath)
        args.extend(source_files)
-        args.extend(
-            [
-                "-o",
-                executable_fpath,
-                "-Wno-deprecated-gpu-targets",
-            ]
-        )
+        args.extend(["-o", executable_fpath, "-Wno-deprecated-gpu-targets"])
+
        subprocess.check_output(args, stderr=subprocess.STDOUT)

        return executable_fpath

-    def _run(
+    def _get_profiler_path(self, profiler: Profiler) -> str:
+        """
+        Get the path of the executable of a given profiling tool. Searches
+        the directories of the PATH environment variable and some extra
+        directories where CUDA is usually installed.
+
+        Args:
+            profiler: The profiler whose executable should be found.
+
+        Raises:
+            RuntimeError: If the profiler executable could not be found.
+
+        Returns:
+            The file path of the executable.
+        """
+        profiler_path = self.profiler_paths[profiler]
+        if profiler_path is not None:
+            return profiler_path
+
+        profiler_path = find_executable(profiler.value, CUDA_SEARCH_PATHS)
+        if profiler_path is None:
+            raise RuntimeError(
+                f'Could not find the "{profiler.value}" profiling tool.'
+                " Consider searching for where it is installed and adding its"
+                " directory to the PATH environment variable."
+            )
+
+        self.profiler_paths[profiler] = profiler_path
+        return profiler_path
+
+    def _run(  # pylint: disable=too-many-arguments
        self,
        exec_fpath: str,
        timeit: bool = False,
        profile: bool = False,
+        profiler: Profiler = Profiler.NCU,
        profiler_args: str = "",
    ) -> str:
        """
@@ -152,8 +194,9 @@ class NVCCPlugin(Magics):
            timeit: If True, returns the result of the "timeit" magic instead
                of the standard output of the CUDA process. Defaults to False.
            profile: If True, the executable is profiled with NVIDIA Nsight
-                Compute profiling tool and its output is added to stdout.
-                Defaults to False.
+                Compute or NVIDIA Nsight Systems and the profiling output is
+                added to stdout. Defaults to False.
+            profiler: The profiling tool to use.
            profiler_args: The profiler arguments used to customize the
                information gathered by it and its overall behaviour. Defaults
                to an empty string.
@@ -175,7 +218,8 @@ class NVCCPlugin(Magics):
        else:
            run_args = []
            if profile:
-                run_args.extend(["ncu"] + profiler_args.split())
+                profiler_path = self._get_profiler_path(profiler)
+                run_args.extend([profiler_path] + profiler_args.split())
            run_args.append(exec_fpath)
            output = subprocess.check_output(
                run_args, stderr=subprocess.STDOUT
@@ -188,12 +232,16 @@ class NVCCPlugin(Magics):
        self, group_name: str, args: argparse.Namespace
    ) -> str:
        try:
-            exec_fpath = self._compile(group_name)
+            exec_fpath = self._compile(
+                group_name=group_name,
+                compiler_args=args.compiler_args(),
+            )
            output = self._run(
                exec_fpath=exec_fpath,
                timeit=args.timeit,
                profile=args.profile,
-                profiler_args=args.profiler_args,
+                profiler=args.profiler(),
+                profiler_args=args.profiler_args(),
            )
        except subprocess.CalledProcessError as e:
            output = e.output.decode("utf8")
@@ -317,5 +365,6 @@ def load_ipython_extension(shell: InteractiveShell):
    """
    Method used by IPython to load the extension.
    """
+    setup_environment()
    nvcc_plugin = NVCCPlugin(shell)
    shell.register_magics(nvcc_plugin)
@@ -0,0 +1,72 @@
+"""
+Setup steps for platforms such as Kaggle, Colab, etc. to allow our extension
+to work on them immediately after loading it.
+"""
+
+# pylint: disable=missing-function-docstring
+
+import os
+import traceback
+from subprocess import DEVNULL, STDOUT, check_call
+from typing import Optional
+
+PATH_PRIORITY_DIR = "/usr/bin/priority"
+KAGGLE_GCC_8_PATH = "/usr/bin/gcc-8"
+
+
+def print_platform(platform: str) -> None:
+    print(f'Detected platform "{platform}". Running its setup...')
+
+
+def kaggle_setup() -> None:
+    print("Updating the package lists...")
+    check_call(["/usr/bin/apt-get", "update"], stdout=DEVNULL, stderr=STDOUT)
+
+    print("Installing nvidia-cuda-toolkit, this may take a few minutes...")
+    check_call(
+        ["/usr/bin/apt-get", "install", "-y", "nvidia-cuda-toolkit"],
+        stdout=DEVNULL,
+        stderr=STDOUT,
+    )
+    os.makedirs(PATH_PRIORITY_DIR, exist_ok=True)
+
+    gcc_symlink_path = os.path.join(PATH_PRIORITY_DIR, "gcc")
+    if not os.path.exists(gcc_symlink_path):
+        os.symlink(KAGGLE_GCC_8_PATH, gcc_symlink_path)
+
+    if PATH_PRIORITY_DIR not in os.environ["PATH"].split(":"):
+        os.environ["PATH"] = f"{PATH_PRIORITY_DIR}:" + os.environ["PATH"]
+
+
+def colab_setup() -> None:
+    pass
+
+
+def setup_environment() -> None:
+    """
+    Detect the platform the extension was loaded on and run the necessary
+    steps (install dependencies, add executables to PATH, etc.) for the
+    extension to work.
+    """
+
+    if "NVCC4JUPYTER_NO_SETUP" in os.environ:
+        return
+
+    platform: Optional[str] = None
+    try:
+        if "KAGGLE_URL_BASE" in os.environ:
+            platform = "Kaggle"
+            print_platform(platform)
+            kaggle_setup()
+        elif "COLAB_RELEASE_TAG" in os.environ:
+            platform = "Colab"
+            print_platform(platform)
+            colab_setup()
+    except Exception:  # pylint: disable=broad-exception-caught
+        print(
+            f'Setup failed for detected platform "{platform}". Set the'
+            ' "NVCC4JUPYTER_NO_SETUP" environment variable to disable running'
+            " the setup on load. Please report the following error to"
+            " https://github.com/andreinechaev/nvcc4jupyter/issues:"
+            f" following error message:\n{traceback.format_exc()}"
+        )
@@ -69,6 +69,7 @@ exclude_lines = [

 [tool.isort]
 profile = "black"
+src_paths = ["nvcc4jupyter"] # tells isort where to find local modules to not consider them 3rd party libraries

 [tool.bandit]
 exclude_dirs = ["build","dist","tests","scripts"]
@@ -82,10 +83,15 @@ skips = ["B101", "B311", "B404", "B603"]
 [tool.black]
 line-length = 79
 fast = true
-experimental-string-processing = true
+preview = true
+enable-unstable-feature = ["string_processing"]

 [tool.coverage.run]
 branch = true
+omit = [
+    # cannot test installing dependencies on platforms such as kaggle
+    "nvcc4jupyter/setup_env.py",
+]

 [tool.pyright]
 include = ["src"]
@@ -286,6 +292,6 @@ deprecated-modules="optparse,tkinter.tix"

 [tool.pylint.'EXCEPTIONS']
 overgeneral-exceptions= [
-    "BaseException",
-    "Exception"
+    "builtins.BaseException",
+    "builtins.Exception"
 ]
@@ -0,0 +1,47 @@
+#include <cstdlib>
+#include <iostream>
+#include <set>
+#include <string>
+#include <iterator>
+
+#include <tuple>
+
+struct S {
+    int n;
+    std::string s;
+    float d;
+    bool operator<(const S& rhs) const
+    {
+        // compares n to rhs.n,
+        // then s to rhs.s,
+        // then d to rhs.d
+        return std::tie(n, s, d) < std::tie(rhs.n, rhs.s, rhs.d);
+    }
+};
+
+int main()
+{
+    std::set<S> mySet;
+
+    // pre C++17:
+    {
+	    S value{42, "Test", 3.14};
+	    std::set<S>::iterator iter;
+	    bool inserted;
+
+	    // unpacks the return val of insert into iter and inserted
+	    std::tie(iter, inserted) = mySet.insert(value);
+
+	    if (inserted)
+		    std::cout << "Value was inserted\n";
+    }
+
+	// with C++17:
+    {
+        S value{100, "abc", 100.0};
+        const auto [iter, inserted] = mySet.insert(value);
+
+        if (inserted)
+		    std::cout << "Value(" << iter->n << ", " << iter->s << ", ...) was inserted" << "\n";
+    }
+}
@@ -0,0 +1,8 @@
+#include <opencv2/core.hpp>
+#include <iostream>
+
+int main(int argc, char** argv)
+{
+    std::cout << cv::getBuildInformation() << std::endl;
+    return 0;
+}
@@ -1,9 +1,11 @@
+import argparse
 import glob
 import os

 import pytest
 from IPython.core.interactiveshell import InteractiveShell

+from nvcc4jupyter.parsers import Profiler
 from nvcc4jupyter.plugin import NVCCPlugin


@@ -27,10 +29,25 @@ def fixtures_path(tests_path):
    return os.path.join(tests_path, "fixtures")


+@pytest.fixture(scope="session")
+def scripts_path(fixtures_path: str):
+    return os.path.join(fixtures_path, "scripts")
+
+
+@pytest.fixture(scope="session")
+def compiler_cpp_17_fpath(fixtures_path: str):
+    return os.path.join(fixtures_path, "compiler", "cpp_17.cu")
+
+
+@pytest.fixture(scope="session")
+def compiler_opencv_fpath(fixtures_path: str):
+    return os.path.join(fixtures_path, "compiler", "opencv.cu")
+
+
@pytest.fixture(scope="session")
 def sample_magic_cu_line():
    # fmt: off
-    return '--profile --profiler-args "--metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"'  # noqa: E501
+    return '--profile --profiler-args "--metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum" --compiler-args "--optimize 3"'  # noqa: E501
    # fmt: on


@@ -55,3 +72,14 @@ def multiple_source_fpaths(fixtures_path: str):
    pattern_h = os.path.join(fixtures_path, "multiple_files", "*.h")
    pattern_cu = os.path.join(fixtures_path, "multiple_files", "*.cu")
    return list(glob.glob(pattern_h)) + list(glob.glob(pattern_cu))
+
+
+@pytest.fixture(scope="session")
+def default_args():
+    return argparse.Namespace(
+        timeit=False,
+        profile=True,
+        profiler=lambda: Profiler.NCU,
+        profiler_args=lambda: "",
+        compiler_args=lambda: "",
+    )
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+echo "[NCU]"
+
+# this is a mock of nsight compute cli tool that just executes the program
+# given as the last argument
+"${@: -1}"
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+echo "[NSYS]"
+
+# this is a mock of nsight systems cli tool that just executes the program
+# given as the last argument
+"${@: -1}"
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+echo "This is just used to test the path_utils.find_executable function"
@@ -0,0 +1,16 @@
+import os
+
+from nvcc4jupyter.path_utils import find_executable
+
+
+def test_which():
+    assert find_executable("ls") == "/usr/bin/ls"
+
+
+def test_find_executable(fixtures_path: str):
+    exec_path = find_executable("searchforme", [fixtures_path])
+    assert exec_path is not None
+
+    exec_dir, exec_fname = os.path.split(exec_path)
+    assert exec_fname == "searchforme"
+    assert os.path.basename(exec_dir) == "scripts"
@@ -1,27 +1,26 @@
-import argparse
 import math
 import os
 import re
 import shutil
+import subprocess
+from argparse import ArgumentParser, Namespace
+from copy import deepcopy
 from typing import List

 import pytest

+from nvcc4jupyter.parsers import Profiler, get_parser_cuda, set_defaults
 from nvcc4jupyter.plugin import NVCCPlugin


-def check_profiler_output(output: str):
-    # the profiler output will be a line of "Hello World!" along with some
-    # warning lines which start with "==WARNING=="
+def check_profiler_output(output: str, profiler: str = "[NCU]"):
+    # the output from the profiler will first be a line containing only
+    # "[NCU]" or "[NSYS]" depending on what profiler was used and another
+    # line containing the string "Hello World!"
    lines = output.strip().split("\n")
-    warn_count = 0
-    for line in lines:
-        if not line.startswith("==WARNING=="):
-            assert line == "Hello World!"
-        else:
-            warn_count += 1
-    assert warn_count >= 1
-    assert warn_count == len(lines) - 1
+    assert len(lines) >= 2
+    assert lines[0] == profiler
+    assert lines[1] == "Hello World!"


 def copy_source_to_group(
@@ -36,11 +35,19 @@ def copy_source_to_group(
    return destination_fpath


+@pytest.fixture(autouse=True, scope="session")
+def before_all(scripts_path: str):
+    os.environ["PATH"] = scripts_path + os.pathsep + os.environ["PATH"]
+
+
@pytest.fixture(autouse=True, scope="function")
 def before_each(plugin: NVCCPlugin):
-    shutil.rmtree(plugin.workdir, ignore_errors=True)  # before test
+    # BEFORE TESTS
+    set_defaults(profiler=Profiler.NCU, compiler_args="", profiler_args="")
+    shutil.rmtree(plugin.workdir, ignore_errors=True)
    yield
-    pass  # after test
+    # AFTER TESTS
+    pass


 def test_save_source(plugin: NVCCPlugin, sample_cuda_code: str) -> None:
@@ -88,6 +95,49 @@ def test_compile(
        plugin._compile(gname)


+def test_compile_args(
+    plugin: NVCCPlugin,
+    compiler_cpp_17_fpath: str,
+    default_args: Namespace,
+):
+    gname = "test_compile_args"
+    copy_source_to_group(compiler_cpp_17_fpath, gname, plugin.workdir)
+
+    exec_fpath = plugin._compile(gname, compiler_args="--std c++17")
+    assert os.path.exists(exec_fpath)
+
+    # should fail due to the source file having c++ 17 features
+    with pytest.raises(subprocess.CalledProcessError):
+        exec_fpath = plugin._compile(gname, compiler_args="--std c++14")
+
+    args = deepcopy(default_args)
+    args.compiler_args = lambda: "--std c++14"
+    output = plugin._compile_and_run(group_name=gname, args=args)
+    assert "errors detected in the compilation of" in output
+
+
+def test_compile_opencv(
+    plugin: NVCCPlugin,
+    compiler_opencv_fpath: str,
+    default_args: Namespace,
+):
+    gname = "test_compile_opencv"
+    copy_source_to_group(compiler_opencv_fpath, gname, plugin.workdir)
+
+    # check that "pkg-config" exists
+    assert subprocess.check_call(["which", "pkg-config"]) == 0
+
+    pkg_config_args = ["pkg-config", "--cflags", "--libs", "opencv4"]
+    opencv_compile_options = (
+        subprocess.check_output(args=pkg_config_args).decode().strip()
+    )
+
+    args = deepcopy(default_args)
+    args.compiler_args = lambda: opencv_compile_options
+    output = plugin._compile_and_run(group_name=gname, args=args)
+    assert "General configuration for OpenCV" in output
+
+
 def test_run(
    plugin: NVCCPlugin,
    sample_cuda_fpath: str,
@@ -133,7 +183,9 @@ def test_run_profile(plugin: NVCCPlugin, sample_cuda_fpath: str):


 def test_compile_and_run_multiple_files(
-    plugin: NVCCPlugin, multiple_source_fpaths: List[str]
+    plugin: NVCCPlugin,
+    multiple_source_fpaths: List[str],
+    default_args: Namespace,
 ):
    """
    Compiles and executes 3 cuda source files from
@@ -142,14 +194,14 @@ def test_compile_and_run_multiple_files(
    gname = "test_compile_and_run_multiple_files"
    for fpath in multiple_source_fpaths:
        copy_source_to_group(fpath, gname, plugin.workdir)
-    output = plugin._compile_and_run(
-        gname, argparse.Namespace(timeit=False, profile=True, profiler_args="")
-    )
+    output = plugin._compile_and_run(group_name=gname, args=default_args)
    check_profiler_output(output)


 def test_compile_and_run_multiple_files_shared(
-    plugin: NVCCPlugin, multiple_source_fpaths: List[str]
+    plugin: NVCCPlugin,
+    multiple_source_fpaths: List[str],
+    default_args: Namespace,
 ):
    """
    Compiles and executes 3 cuda source files from
@@ -164,14 +216,12 @@ def test_compile_and_run_multiple_files_shared(
            copy_source_to_group(fpath, gname, plugin.workdir)
        else:
            copy_source_to_group(fpath, "shared", plugin.workdir)
-    output = plugin._compile_and_run(
-        gname, argparse.Namespace(timeit=False, profile=True, profiler_args="")
-    )
+    output = plugin._compile_and_run(group_name=gname, args=default_args)
    check_profiler_output(output)


 def test_read_args(plugin: NVCCPlugin):
-    parser = argparse.ArgumentParser()
+    parser = ArgumentParser()
    parser.add_argument("-a", type=str, required=True)
    parser.add_argument("-b", type=float, required=True)
    args = plugin._read_args(
@@ -181,6 +231,29 @@ def test_read_args(plugin: NVCCPlugin):
    assert math.isclose(args.b, 0.75)


+def test_set_defaults():
+    parser = get_parser_cuda()
+    args = parser.parse_args([])
+    assert args.profiler_args() == ""
+    assert args.compiler_args() == ""
+    set_defaults(profiler_args="123")
+    args = parser.parse_args([])
+    assert args.profiler_args() == "123"
+    assert args.compiler_args() == ""
+    set_defaults(compiler_args="456")
+    args = parser.parse_args([])
+    assert args.profiler_args() == "123"
+    assert args.compiler_args() == "456"
+    set_defaults(profiler_args="")
+    args = parser.parse_args([])
+    assert args.profiler_args() == ""
+    assert args.compiler_args() == "456"
+    set_defaults(profiler_args="123")
+    args = parser.parse_args(["--profiler-args", "789"])
+    assert args.profiler_args() == "789"
+    assert args.compiler_args() == "456"
+
+
 def test_magic_cuda(
    capsys,
    plugin: NVCCPlugin,
@@ -191,6 +264,28 @@ def test_magic_cuda(
    check_profiler_output(capsys.readouterr().out)


+def test_magic_cuda_set_default_profiler(
+    capsys,
+    plugin: NVCCPlugin,
+    sample_cuda_code: str,
+    sample_magic_cu_line: str,
+):
+    # set the default profiler to Nsight Systems
+    set_defaults(profiler=Profiler.NSYS)
+    plugin.cuda(sample_magic_cu_line, sample_cuda_code)
+    check_profiler_output(capsys.readouterr().out, profiler="[NSYS]")
+
+
+def test_magic_cuda_bad_args(
+    capsys,
+    plugin: NVCCPlugin,
+    sample_cuda_code: str,
+):
+    plugin.cuda("--this-is-an-unrecognized-argument", sample_cuda_code)
+    output = capsys.readouterr().out
+    assert output.startswith("usage: ")
+
+
 def test_magic_cuda_group_save(plugin: NVCCPlugin, sample_cuda_code: str):
    gname = "test_save_source"
    sname = "sample.cu"
Author	SHA1	Message	Date
dependabot[bot]	28f872a2f9	Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 3 to 4.1.7. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](https://github.com/actions/download-artifact/compare/v3...v4.1.7) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2024-09-13 14:04:10 +02:00
Cosmin Ștefan Ciocan	801584cceb	Merge pull request #35 from andreinechaev/feature/readme-import-notebook-kaggle Add README.md with GIF describing steps to import an example notebook…	2024-06-23 18:09:04 +02:00
Cosmin Ștefan Ciocan	97d8889238	Add README.md with GIF describing steps to import an example notebook in Kaggle from GitHub	2024-06-08 22:01:43 +00:00
Cosmin Ștefan Ciocan	0b60c3637e	Merge pull request #34 from andreinechaev/feature/first-example-notebooks Add example notebooks for compiling with opencv and an adaptation of a CUDA training series	2024-06-03 18:57:11 +02:00
Cosmin Ștefan Ciocan	1b238ff107	Create separate files for blur kernel and the error handling macro to make the main function more readable	2024-06-02 09:13:56 +00:00
Cosmin Ștefan Ciocan	e15f41d213	Add example notebooks for compiling with opencv and an adaptation of a CUDA training series	2024-05-19 12:45:58 +00:00
Cosmin Ștefan Ciocan	326b0a57a8	Remove duplicated compiler args from magics.rst	2024-05-04 08:47:07 +02:00
Cosmin Ștefan Ciocan	5741c52254	Setup Kaggle environment on extension load (#31 ) * Automatically setup kaggle environment https://github.com/andreinechaev/nvcc4jupyter/issues/29 * Update package lists before installing cuda toolkit	2024-04-03 17:23:33 +02:00
Cosmin Ștefan Ciocan	4664a4ef47	Update __init__.py Bump version to 1.2.0	2024-03-20 12:13:23 +01:00
Cosmin Ștefan Ciocan	0bddf6a6e6	Add option to choose between NSYS and NCU profilers (#28 ) * Add option to give nvcc extra arguments * Add test for nvcc options that changes c++ dialect from c++17 to c++14 * Add make and the english language pack to devcontainer to be able to build the documentation * Update documentation config to automatically import the current version of the package * Document new --compiler-args argument * Improve tests coverage by testing for bad arguments and the error output during a failed compilation * Add IPython to docs requirements to allow the __version__ import for readthedocs env * Change devcontainer base image to have the latest CUDA toolkit * Mock the nsight compute tool with a bash script * Add test to compile with opencv * Add new page to documentation that contains a new notebook that explains compiling with external libraries * Add autodocstring vscode extension to devcontainer * Add function that modifies the default profiler/compiler arguments to allow reusing them in multiple magic command calls * Update pylint exceptions * Update contributing instructions * Change version from 1.0.3 to 1.1.0 due to adding features in a backward-compatible manner * Install latest CUDA toolkit on the test runner to pass the OpenCV compilation test * Install opencv in test runner and update code coverage install * Add CUDA bin to PATH in test and coverage runners * Add cuda bin to path variable in .bashrc * Update way to set environment variable PATH in github action * Change devcontainer base image back to ubuntu:22.04 to match the environment from the test runner * Add option to choose between NSYS and NCU profilers * Add tests for choosing the profiler * Add isort config to help it find local modules so they are not considered 3rd party libraries * Replace experimental-string-processing black formatter config with enable-unstable-feature as it was removed in version 24.1.0 * Search for profiling tools executable paths when they are required * Install dev dependencies in editable mode * Add documentation for using Nsight Systems instead of the default Nsight Compute profiling tool * Fix cuda typo * Mention Nsight Systems in README.md	2024-03-20 11:42:27 +01:00
Cosmin Ștefan Ciocan	781ff5b76b	Feature: Passing arguments to NVCC compiler (#26 ) * Add option to give nvcc extra arguments * Add test for nvcc options that changes c++ dialect from c++17 to c++14 * Add make and the english language pack to devcontainer to be able to build the documentation * Update documentation config to automatically import the current version of the package * Document new --compiler-args argument * Improve tests coverage by testing for bad arguments and the error output during a failed compilation * Add IPython to docs requirements to allow the __version__ import for readthedocs env * Change devcontainer base image to have the latest CUDA toolkit * Mock the nsight compute tool with a bash script * Add test to compile with opencv * Add new page to documentation that contains a new notebook that explains compiling with external libraries * Add autodocstring vscode extension to devcontainer * Add function that modifies the default profiler/compiler arguments to allow reusing them in multiple magic command calls * Update pylint exceptions * Update contributing instructions * Change version from 1.0.3 to 1.1.0 due to adding features in a backward-compatible manner * Install latest CUDA toolkit on the test runner to pass the OpenCV compilation test * Install opencv in test runner and update code coverage install * Add CUDA bin to PATH in test and coverage runners * Add cuda bin to path variable in .bashrc * Update way to set environment variable PATH in github action * Change devcontainer base image back to ubuntu:22.04 to match the environment from the test runner	2024-02-12 17:29:26 +01:00
Cosmin Ștefan Ciocan	5cd225851b	Merge pull request #24 from andreinechaev/docs/readme-badge-rename Change "cosminc98" to "andreinechaev" in badge URLs	2024-01-23 16:14:31 +01:00