Submitted by Xilodyne on Wed, 04/29/2020 - 21:26

I needed to compile and debug a CUDA ".cu" file on my Windows 10 Pro machine with an Nvidia Quadro 2000 GPU card (yes, an old HP Z600 workstation with an old GPU). Installing CUDA 10.2 and the last Quadro 2000 driver (377.83 from 2017) with Visual Studio 2019 worked great.

Until I needed to debug.

Nsight unable to debug error
`Break points ignored. Nsight message: A CUDA context was created on a GPU that is not currently debuggable. Breakpoints will be disabled. Adapter: Quadro 2000`

After some trial and error (errors at the bottom of this blog entry) I have a working environment that can debug in Visual Studio and compile on the command line. This required going back in time and using executables released more than five years ago. Luckily, all of the problems I encountered have long been solved and are easily searched online.

I first had to determine that the Nvidia Quadro 2000 GPU uses the Fermi microarchitecture (https://en.wikipedia.org/wiki/CUDA). The last Nsight version (used for CUDA debugging) that works with Fermi is Nsight 4.7, which comes with CUDA 7.5. This means features that require GPUs newer than Fermi will not work in this environment (e.g. cudaMallocManaged).
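If a CUDA toolkit is already installed, the architecture can also be checked from code rather than looked up by hand. The following is a minimal sketch, not something from my original setup: it assumes that compute capability 2.x identifies a Fermi card and that managed memory needs 3.0 or higher. The file name capcheck.cu in the note below is made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceProperties(%d) failed: %s\n",
                   dev, cudaGetErrorString(err));
            continue;
        }

        // Assumption: compute capability 2.x means a Fermi GPU.
        printf("Device %d: %s, compute capability %d.%d%s\n",
               dev, prop.name, prop.major, prop.minor,
               prop.major == 2 ? " (Fermi)" : "");

        // Assumption: cudaMallocManaged needs compute capability 3.0 or higher.
        printf("  compute capability high enough for cudaMallocManaged: %s\n",
               prop.major >= 3 ? "yes" : "no");
    }
    return 0;
}
```

This can be compiled from a Visual Studio command prompt with something like `nvcc -o capcheck capcheck.cu`.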

Install

CUDA: cuda_7.5.18_win10.exe (https://developer.nvidia.com/cuda-75-downloads-archive). Install all options.
Visual Studio Community 2013 with Update 5: en_visual_studio_community_2013_with_update_5_x86_6816332.exe (https://my.visualstudio.com/Downloads?q=visual%20studio%202013&wt.mc_id=o~msft~vscom~older-downloads)
VS Build Tools 2013 (for command line access): BuildTools_Full.exe (https://www.microsoft.com/en-us/download/confirmation.aspx?id=40760) Be sure to install the command line interface (CLI).

Verify Visual Studio 2013 works with CUDA .cu files and GPU

Following: https://riptutorial.com/cuda

In VS2013 Menu Bar: File -> Open -> Project Solution, Open the CUDA samples Samples_vs2013.sln.

In the Solution Explorer (panel on right hand side), Highlight 1_Utilities -> DeviceQuery
Right-click, Build Solution
VS2013 Menu Bar: DEBUG -> Start Without Debugging

Result (success) `DeviceQuery.cpp`

    E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery\../../bin/win64/Debug/deviceQuery.exe Starting...

    CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "Quadro 2000"
      CUDA Driver Version / Runtime Version 7.5 / 7.5
      CUDA Capability Major/Minor version number: 2.1
      Total amount of global memory: 1024 MBytes (1073741824 bytes)
      ( 4) Multiprocessors, ( 48) CUDA Cores/MP: 192 CUDA Cores
      GPU Max Clock rate: 1251 MHz (1.25 GHz)
      Memory Clock rate: 1304 Mhz
      Memory Bus Width: 128-bit
      L2 Cache Size: 262144 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
      Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 32768
      Warp size: 32
      Maximum number of threads per multiprocessor: 1536
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 1 copy engine(s)
      Run time limit on kernels: Yes
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 15 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro 2000
    Result = PASS
    Press any key to continue . . .

Verify Visual Studio 2013 CUDA debugging

Creating a new CUDA project - https://www.youtube.com/watch?v=2EbHSCvGFM0

VS2013 Menu Bar: File -> New Project -> Installed -> Templates -> NVIDIA -> CUDA 7.5

Name the Project checkDebug, click OK

Build and run the kernel.cu without debugging (let's see it works first).

VS2013 Menu Bar: BUILD -> Build Solution
VS2013 Menu Bar: DEBUG -> Start Without Debugging

Result (success): checkDebug
`{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55} Press any key to continue . . .`

Run with debugging

In kernel.cu, set a breakpoint on line 18: const int arraySize = 5;
- (highlight code F9, or right click in far left column)
VS2013 Menu Bar: NSIGHT -> Start CUDA Debugging

Nsight will start (will require Administrator privileges to launch), the green icon will appear in the taskbar.

Nsight running icon

Press F10 to step through the lines. The Autos and Call Stack window should update appropriately

Result (success)
`{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}`

Create custom CUDA code in Visual Studio 2013

Try CPU code only

VS2013 Menu Bar: File -> New Project -> Installed -> Templates -> NVIDIA -> CUDA 7.5
Name the Project vectorAddCPU, click OK
Delete the code in kernel.cu and copy/paste with the vectorAddCPU.cu code (https://github.com/siddharthsharmanv/cudacasts/tree/master/YourFirstCUDACProgram).

`vectorAddCPU.cu`

    #include <stdio.h>
    #include <stdlib.h>   // malloc, free

    #define SIZE 1024

    // Add two integer vectors element by element on the CPU.
    void VectorAdd(int *a, int *b, int *c, int n)
    {
        int i;

        for (i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        int *a, *b, *c;

        a = (int *)malloc(SIZE * sizeof(int));
        b = (int *)malloc(SIZE * sizeof(int));
        c = (int *)malloc(SIZE * sizeof(int));

        for (int i = 0; i < SIZE; ++i)
        {
            a[i] = i;
            b[i] = i;
            c[i] = 0;
        }

        VectorAdd(a, b, c, SIZE);

        // Print only the first 10 results.
        for (int i = 0; i < 10; ++i)
            printf("c[%d] = %d\n", i, c[i]);

        free(a);
        free(b);
        free(c);

        return 0;
    }

VS2013 Menu Bar: BUILD -> BUILD

Build (success): `vectorAddCPU`

    1>------ Build started: Project: vectorAddCPU, Configuration: Debug Win32 ------
    1> Compiling CUDA source file kernel.cu...
    1>
    1> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\vectorAddCPU>"E:\CUDA\CUDA_v7.5\Toolkit\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "E:\Microsoft Visual Studio\2013\VC\bin" -IE:\CUDA\CUDA_v7.5\Toolkit\include -IE:\CUDA\CUDA_v7.5\Toolkit\include -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\vectorAddCPU\kernel.cu"
    1> kernel.cu
    1> vectorAddCPU.vcxproj -> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\Debug\vectorAddCPU.exe
    1> copy "E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart*.dll" "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\Debug\"
    1> E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart32_75.dll
    1> E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart64_75.dll
    1> 2 file(s) copied.
    ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Run vectorAddCPU (aka. kernel.cu)

VS2013 Menu Bar: DEBUG -> Start Without Debugging

Result (success): run `vectorAddCPU.exe`

    c[0] = 0
    c[1] = 2
    c[2] = 4
    c[3] = 6
    c[4] = 8
    c[5] = 10
    c[6] = 12
    c[7] = 14
    c[8] = 16
    c[9] = 18
    Press any key to continue . . .

Try GPU code

VS2013 Menu Bar: File -> New Project -> Installed -> Templates -> NVIDIA -> CUDA 7.5
Name the Project helloWorldGPU, click OK

HelloWorldGPU.cu

    #include <stdio.h>
    #include <stdlib.h>

    // Kernel: runs on the GPU and prints the thread and block index.
    __global__ void print_from_gpu(void)
    {
        printf("Hello World! from thread [%d,%d] From device\n", threadIdx.x, blockIdx.x);
    }

    int main(void)
    {
        printf("Hello World from host!\n");

        // Launch one block with one thread, then wait for the kernel to finish.
        print_from_gpu<<<1, 1>>>();
        cudaDeviceSynchronize();

        return EXIT_SUCCESS;
    }

Run HelloWorldGPU.cu

Result (success): HelloWorldGPU.exe

    Hello World from host!
    Hello World! from thread [0,0] From device
    Press any key to continue . . .

Verify CUDA from command line

If missing, find the Visual Studio 2013 command prompt (https://stackoverflow.com/questions/21476588/where-is-developer-command-prompt-for-vs2013 )
Look in: C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Visual Studio 2013

Verify that CUDA appears in the PATH

>echo %PATH%

CUDA in VS2013 Command Prompt Path
E:\Microsoft Visual Studio\2013>echo %PATH% E:\Microsoft Visual Studio\2013\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86)\Microsoft SDKs\F#\3.1\Framework\v4.0\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0;C:\Program Files (x86)\MSBuild\12.0\bin;E:\Microsoft Visual Studio\2013\Common7\IDE\;E:\Microsoft Visual Studio\2013\VC\BIN;E:\Microsoft Visual Studio\2013\Common7\Tools;C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319;E:\Microsoft Visual Studio\2013\VC\VCPackages;C:\Program Files (x86)\HTML Help Workshop;E:\Microsoft Visual Studio\2013\Team Tools\Performance Tools;C:\Program Files (x86)\Windows Kits\8.1\bin\x86;C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\;E:\CUDA\CUDA_v7.5\Toolkit\bin;E:\CUDA\CUDA_v7.5\Toolkit\libnvvp;;;;;;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;E:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64;C:\Users\adminroot\.dnx\bin;C:\Program Files\Microsoft DNX\Dnvm\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\120\Tools\Binn\;C:\Users\aholi_000\AppData\Local\Microsoft\WindowsApps; `E:\Microsoft Visual Studio\2013>`

With the VS2013 command prompt, verify nvcc works

>nvcc --version

Result (success) nvcc --version

    E:\Microsoft Visual Studio\2013>nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Tue_Aug_11_14:49:10_CDT_2015
    Cuda compilation tools, release 7.5, V7.5.17

    E:\Microsoft Visual Studio\2013>

Navigate to where your CUDA 7.5 Samples are stored, then into the 1_Utilities\deviceQuery directory, in my case: E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

Try compiling and running the CUDA sample deviceQuery.cpp

>nvcc -o testDevQuery deviceQuery.cpp

Error: fatal error C1083: Cannot open include file: 'helper_cuda.h': No such file or directory

Include the header files with nvcc.

>nvcc -o devQuery -I E:\CUDA\CUDA_v7.5\Samples\common\inc deviceQuery.cpp

Run devQuery.exe

Result (success): `devQuery.exe`

    E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>devQuery.exe
    devQuery.exe Starting...

    CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "Quadro 2000"
      CUDA Driver Version / Runtime Version 7.5 / 7.5
      CUDA Capability Major/Minor version number: 2.1
      Total amount of global memory: 1024 MBytes (1073741824 bytes)
      ( 4) Multiprocessors, ( 48) CUDA Cores/MP: 192 CUDA Cores
      GPU Max Clock rate: 1251 MHz (1.25 GHz)
      Memory Clock rate: 1304 Mhz
      Memory Bus Width: 128-bit
      L2 Cache Size: 262144 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
      Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 32768
      Warp size: 32
      Maximum number of threads per multiprocessor: 1536
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 1 copy engine(s)
      Run time limit on kernels: Yes
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 15 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro 2000
    Result = PASS

    E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>

Create and Run New Project from Command Line

Create a new file hello.cu (https://riptutorial.com/cuda/example/9316/let-s-launch-a-single-cuda-thread-to-say-hello)
Using the VS2013 command prompt, compile and run
- Make sure to include the necessary headers from the CUDA toolkit:

`hello.cu`

    #include <stdio.h>

    // __global__ functions, or "kernels", execute on the device
    __global__ void hello_kernel(void)
    {
        printf("Hello, world from the device!\n");
    }

    int main(void)
    {
        // greet from the host
        printf("Hello, world from the host!\n");

        // launch a kernel with a single thread to greet from the device
        hello_kernel<<<1,1>>>();

        // wait for the device to finish so that we see the message
        cudaDeviceSynchronize();

        return 0;
    }

Compile

>nvcc -o hello hello.cu

Build (success) `hello.cu`

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>nvcc -o hello hello.cu
    hello.cu
       Creating library hello.lib and object hello.exp

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>

Run

>hello

Result (success) `hello.exe`

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>hello
    Hello, world from the host!
    Hello, world from the device!

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>

Errors

Nsight debug problems

Nsight unable to debug error
`Break points ignored. Nsight message: A CUDA context was created on a GPU that is not currently debuggable. Breakpoints will be disabled. Adapter: Quadro 2000`

My Nsight 5.2 doesn't work with Fermi family GPUs (i.e. Quadro 2000)

https://stackoverflow.com/questions/43030274/a-cuda-context-was-created-on-a-gpu-that-is-not-currently-debuggable

Missing `#include <stdio.h>`

`nvcc` compile (fails): missing `stdio.h`

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>nvcc -o test test.cu
    test.cu
    test.cu(8): error: identifier "printf" is undefined

    1 error detected in the compilation of "C:/Users/AHOLI_~1/AppData/Local/Temp/tmpxft_000020dc_00000000-6_test.cpp4.ii".

Correct code

    #include <stdio.h>

    __global__ void foo() {}

    int main()
    {
        foo<<<1,1>>>();

        cudaDeviceSynchronize();
        printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

        return 0;
    }

`test.bat`
`nvcc -o test test.cu`

Executes normally.

Result (success): test.exe

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>test.bat

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>nvcc -o test test.cu
    test.cu
       Creating library test.lib and object test.exp

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>test
    CUDA error: no error

    E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>

CUDA 7.5 cannot use `cudaMallocManaged`

Using the revised code from https://www.youtube.com/watch?v=2EbHSCvGFM0

Convert VectorAddCPU.cu to VectorAddGPU.cu

`VectorAddGPU.cu`

    #include <stdio.h>

    #define SIZE 1024

    __global__ void VectorAdd(int *a, int *b, int *c, int n)
    {
        int i = threadIdx.x;

        if (i < n)
            c[i] = a[i] + b[i];
    }

    __global__ void print_from_gpu(void)
    {
        printf("Hello World! from thread [%d,%d] From device\n", threadIdx.x, blockIdx.x);
    }

    int main()
    {
        int *a, *b, *c;

        printf("Hello World from host!\n");
        print_from_gpu<<<1, 1>>>();
        cudaDeviceSynchronize();

        // The return values are not checked, so a failed allocation goes unnoticed here.
        cudaMallocManaged(&a, SIZE * sizeof(int));
        cudaMallocManaged(&b, SIZE * sizeof(int));
        cudaMallocManaged(&c, SIZE * sizeof(int));

        printf("passed cudaMallocManaged\n");

        for (int i = 0; i < SIZE; ++i)
        {
            a[i] = i;
            b[i] = i;
            c[i] = 0;
        }

        printf("passed var addition");

        VectorAdd<<<1, SIZE>>>(a, b, c, SIZE);
        cudaDeviceSynchronize();

        for (int i = 0; i < 10; ++i)
            printf("c[%d] = %d\n", i, c[i]);

        cudaFree(a);
        cudaFree(b);
        cudaFree(c);

        return 0;
    }

Builds correctly.

Build (success): `VectorAddGPU.cu`

    1>------ Build started: Project: vectorAddGPU, Configuration: Debug Win32 ------
    1> Compiling CUDA source file vectorAddGPU.cu...
    1>
    1> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\vectorAddGPU>"E:\CUDA\CUDA_v7.5\Toolkit\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "E:\Microsoft Visual Studio\2013\VC\bin" -IE:\CUDA\CUDA_v7.5\Toolkit\include -IE:\CUDA\CUDA_v7.5\Toolkit\include -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\vectorAddGPU.cu.obj "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\vectorAddGPU\vectorAddGPU.cu"
    1> vectorAddGPU.cu
    1> vectorAddGPU.vcxproj -> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\Debug\vectorAddGPU.exe
    1> copy "E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart*.dll" "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\Debug\"
    1> E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart32_75.dll
    1> E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart64_75.dll
    1> 2 file(s) copied.
    ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

VS2013 Menu Bar: DEBUG -> Start Without Debugging

The correct answer should match the vectorAddCPU.cu output shown earlier in this post (c[0] = 0 through c[9] = 18).

Does not return correct answer.

Result (failure): `VectorAddGPU.exe`

    Hello World from host!
    Hello World! from thread [0,0] From device
    passed cudaMallocManaged
    Press any key to continue . . .

Running in debug, VS2013 Menu Bar: NSIGHT -> Start CUDA Debugging

Fails on Line 37: a[i] = i;

Unhandled Exception
`Unhandled exception at 0x004E152B in vectorAddGPU.exe: 0xC0000005: Access violation writing location 0x00000000.`

Variable "a" is never initialized.

This is because the Quadro 2000 has CUDA compute capability 2.1, while cudaMallocManaged requires compute capability 3.0 or higher.
http://selkie.macalester.edu/csinparallel/modules/TimingCUDA/build/html/0-Introduction/Introduction.html
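The failing program never looks at what cudaMallocManaged returns, which is why the problem only shows up later as an access violation. A minimal sketch of checking the return value is below. This is not the code I originally ran, and I am not asserting which error string a Fermi card reports; the point is only that the failure becomes visible at the allocation instead of at the first write.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE 1024

int main(void)
{
    int *a = NULL;

    // cudaMallocManaged returns a cudaError_t; anything other than
    // cudaSuccess means the pointer must not be used.
    cudaError_t err = cudaMallocManaged(&a, SIZE * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only reached when the allocation worked (expected on compute capability >= 3.0).
    a[0] = 42;
    printf("a[0] = %d\n", a[0]);

    cudaFree(a);
    return 0;
}
```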

On this GPU, cudaMalloc and cudaMemcpy (with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost) must be used instead of cudaMallocManaged.

https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
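Applied to the vector addition from this post, the explicit-copy approach could look like the sketch below. It is an illustration rather than code I ran as part of the original setup: the `check` helper and the `d_a`, `d_b`, `d_c` device pointer names are introduced here, and it uses `int` data to match VectorAddGPU.cu rather than the `float` in the tutorial snippet.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define SIZE 1024

// Helper introduced for this sketch: stop with a message if a CUDA call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void VectorAdd(const int *a, const int *b, int *c, int n)
{
    int i = threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const size_t bytes = SIZE * sizeof(int);

    // Host buffers, allocated with plain malloc.
    int *a = (int *)malloc(bytes);
    int *b = (int *)malloc(bytes);
    int *c = (int *)malloc(bytes);
    if (!a || !b || !c) {
        fprintf(stderr, "host malloc failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < SIZE; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    // Device buffers, allocated with cudaMalloc instead of cudaMallocManaged.
    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc d_b");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc d_c");

    // Copy the inputs from host to device.
    check(cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice), "copy a to device");
    check(cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice), "copy b to device");

    // One block of SIZE threads, as in VectorAddGPU.cu.
    VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);
    check(cudaGetLastError(), "kernel launch");

    // Copy the result back; this cudaMemcpy waits for the kernel to finish first.
    check(cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost), "copy c to host");

    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);
    return 0;
}
```

The host and device pointers are kept separate on purpose: the host code only touches `a`, `b`, `c`, and the kernel only touches `d_a`, `d_b`, `d_c`, which is the bookkeeping that cudaMallocManaged would otherwise hide.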

Step by Step: CUDA 7.5 on Windows
