Step by Step: CUDA 7.5 on Windows

Submitted by Xilodyne on Wed, 04/29/2020 - 21:26

I needed to compile and debug a CUDA ".cu" file on Windows 10 Pro with an Nvidia Quadro 2000 GPU (yes, an old HP Z600 workstation with an old GPU).  Installing CUDA 10.2 and the last Quadro 2000 driver (377.83, from 2017) alongside Visual Studio 2019 worked great.

Until I needed to debug.

Nsight unable to debug error
Break points ignored. Nsight message: A CUDA context was created on a GPU that is not currently debuggable. Breakpoints will be disabled.
Adapter: Quadro 2000

After some trial and error (the errors are listed at the bottom of this blog entry) I have a working environment that can debug in Visual Studio and compile on the command line.  This required going back in time and using executables released more than five years ago.  Luckily, all of the problems I encountered were solved long ago and are easy to find online.

I first had to determine that the Nvidia Quadro 2000 GPU uses the Fermi microarchitecture (https://en.wikipedia.org/wiki/CUDA).  The last Nsight version (used for CUDA debugging) that works with Fermi is Nsight 4.7, which comes with CUDA 7.5.  This means features that need a GPU newer than Fermi will not work in this environment (e.g. cudaMallocManaged).
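Once a CUDA toolkit is installed, the architecture can also be checked programmatically rather than looked up in a table. The following is a minimal sketch, assuming the CUDA runtime API is available; the helper names is_fermi and capability_allows_managed_memory are made up for this example and are not part of CUDA. It lists each device with its compute capability, where major version 2 corresponds to Fermi.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: Fermi devices report compute capability 2.x.
bool is_fermi(int major) { return major == 2; }

// Hypothetical helper: Unified Memory (cudaMallocManaged) needs compute
// capability 3.0 or higher. It also needs a 64-bit build, which this
// check does not cover.
bool capability_allows_managed_memory(int major) { return major >= 3; }

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: %s, compute capability %d.%d%s\n",
               dev, prop.name, prop.major, prop.minor,
               is_fermi(prop.major) ? " (Fermi)" : "");
        printf("  capability allows cudaMallocManaged: %s\n",
               capability_allows_managed_memory(prop.major) ? "yes" : "no");
    }
    return 0;
}
```

On a Quadro 2000 this would be expected to report compute capability 2.1 and "no" for cudaMallocManaged.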

Install

 

Verify Visual Studio 2013 works with CUDA .cu files and GPU

Following:  https://riptutorial.com/cuda

In VS2013 Menu Bar:  File -> Open -> Project/Solution, then open the CUDA samples solution Samples_vs2013.sln.

  • In the Solution Explorer (panel on right hand side), Highlight 1_Utilities -> DeviceQuery
  • Right-click, Build Solution
  • VS2013 Menu Bar:  DEBUG -> Start Without Debugging
Result (success) DeviceQuery.cpp

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery\../../bin/win64/Debug/deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro 2000"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 4) Multiprocessors, ( 48) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            1251 MHz (1.25 GHz)
  Memory Clock rate:                             1304 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro 2000
Result = PASS
Press any key to continue . . .

 

Verify Visual Studio 2013 CUDA debugging

Creating a new CUDA project -  https://www.youtube.com/watch?v=2EbHSCvGFM0

VS2013 Menu Bar:  File -> New Project -> Installed -> Templates -> NVIDIA -> CUDA 7.5

 Name the Project checkDebug, click OK

Visual Studio new Nvidia CUDA project

Build and run the kernel.cu without debugging (let's see it works first).

  • VS2013 Menu Bar: BUILD -> Build Solution
  • VS2013 Menu Bar: DEBUG -> Start Without Debugging
Result (success): checkDebug
{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}
Press any key to continue . . .

 

Run with debugging

  • In kernel.cu, set a breakpoint on line 18:  const int arraySize = 5;
    • (place the cursor on the line and press F9, or click in the far-left margin)
  • VS2013 Menu Bar:  NSIGHT -> Start CUDA Debugging

 

Nsight will start (it requires Administrator privileges to launch) and its green icon will appear in the taskbar.

Nsight running icon

Press F10 to step through the lines.  The Autos and Call Stack windows should update as you step.

Result (success)
{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}

 

Create custom CUDA code in Visual Studio 2013

Try CPU code only

 vectorAddCPU.cu

#include <stdio.h>
#include <stdlib.h>   // malloc, free
#define SIZE    1024

void VectorAdd(int *a, int *b, int *c, int n)
{
    int i;

    for (i=0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main()
{
    int *a, *b, *c;
    
    a = (int *)malloc(SIZE * sizeof(int));
    b = (int *)malloc(SIZE * sizeof(int));
    c = (int *)malloc(SIZE * sizeof(int));
    
    for (int i = 0; i < SIZE; ++i)
    {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }
    
    VectorAdd(a, b, c, SIZE);

    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    free(a);
    free(b);
    free(c);

    return 0;
}

  • VS2013 Menu Bar: BUILD -> Build Solution
Build (success): vectorAddCPU
1>------ Build started: Project: vectorAddCPU, Configuration: Debug Win32 ------
1>  Compiling CUDA source file kernel.cu...
1>  
1>  E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\vectorAddCPU>"E:\CUDA\CUDA_v7.5\Toolkit\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "E:\Microsoft Visual Studio\2013\VC\bin"  -IE:\CUDA\CUDA_v7.5\Toolkit\include -IE:\CUDA\CUDA_v7.5\Toolkit\include  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\vectorAddCPU\kernel.cu"
1>  kernel.cu
1>  vectorAddCPU.vcxproj -> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\Debug\vectorAddCPU.exe
1>  copy "E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart*.dll" "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddCPU\Debug\"
1>  E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart32_75.dll
1>  E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart64_75.dll
1>          2 file(s) copied.
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

 

Run vectorAddCPU (aka. kernel.cu)

  • VS2013 Menu Bar:  DEBUG -> Start Without Debugging
Result (success): run vectorAddCPU.exe

c[0] = 0
c[1] = 2
c[2] = 4
c[3] = 6
c[4] = 8
c[5] = 10
c[6] = 12
c[7] = 14
c[8] = 16
c[9] = 18
Press any key to continue . . .

 

Try GPU code

  • VS2013 Menu Bar:  File -> New Project -> Installed -> Templates -> NVIDIA -> CUDA 7.5
  • Name the Project helloWorldGPU, click OK

 

HelloWorldGPU.cu

#include<stdio.h>
#include<stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
        From device\n", threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

Run HelloWorldGPU.cu

Result (success): HelloWorldGPU.exe

Hello World from host!
Hello World! from thread [0,0]                     From device
Press any key to continue . . .
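With a <<<1, 1>>> launch the kernel only ever prints [0,0]. To see the thread and block indices actually change, a variation like the following can help. This is a sketch rather than part of the original project; the kernel name print_indices and the helper global_index are invented for the example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: flattened index of a thread across all blocks
// of a one-dimensional launch.
__host__ __device__ int global_index(int block, int block_dim, int thread)
{
    return block * block_dim + thread;
}

__global__ void print_indices(void)
{
    int block  = (int)blockIdx.x;
    int thread = (int)threadIdx.x;
    printf("block %d, thread %d -> global index %d\n",
           block, thread, global_index(block, (int)blockDim.x, thread));
}

int main(void)
{
    // 2 blocks of 4 threads each, so 8 lines are printed in total.
    print_indices<<<2, 4>>>();

    // Wait for the kernel so the device-side printf output is flushed.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The order in which the blocks print is not guaranteed, so the lines may appear grouped differently from run to run.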

 

Verify CUDA from command line

If missing, find the Visual Studio 2013 command prompt (https://stackoverflow.com/questions/21476588/where-is-developer-command-prompt-for-vs2013 )
Look in:  C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Visual Studio 2013

Verify that CUDA appears in the PATH

>echo %PATH%

CUDA in VS2013 Command Prompt Path

E:\Microsoft Visual Studio\2013>echo %PATH%
E:\Microsoft Visual Studio\2013\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86)\Microsoft SDKs\F#\3.1\Framework\v4.0\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0;C:\Program Files (x86)\MSBuild\12.0\bin;E:\Microsoft Visual Studio\2013\Common7\IDE\;E:\Microsoft Visual Studio\2013\VC\BIN;E:\Microsoft Visual Studio\2013\Common7\Tools;C:\WINDOWS\Microsoft.NET\Framework\v4.0.30319;E:\Microsoft Visual Studio\2013\VC\VCPackages;C:\Program Files (x86)\HTML Help Workshop;E:\Microsoft Visual Studio\2013\Team Tools\Performance Tools;C:\Program Files (x86)\Windows Kits\8.1\bin\x86;C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\;E:\CUDA\CUDA_v7.5\Toolkit\bin;E:\CUDA\CUDA_v7.5\Toolkit\libnvvp;;;;;;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;E:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64;C:\Users\adminroot\.dnx\bin;C:\Program Files\Microsoft DNX\Dnvm\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Microsoft SQL Server\120\Tools\Binn\;C:\Users\aholi_000\AppData\Local\Microsoft\WindowsApps;

E:\Microsoft Visual Studio\2013>

With the VS2013 command prompt, verify nvcc works

>nvcc --version

Result (success) nvcc --version

E:\Microsoft Visual Studio\2013>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:49:10_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

E:\Microsoft Visual Studio\2013>

Navigate to where your CUDA 7.5 Samples are stored, into the 1_Utilities directory; in my case:  E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

Locate deviceQuery.cpp

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>dir
 Volume in drive E is Data
 Volume Serial Number is AAD3-8C76

 Directory of E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

04/30/2020  10:02 AM    <DIR>          .
04/30/2020  10:02 AM    <DIR>          ..
05/27/2015  04:39 PM            13,208 deviceQuery.cpp
08/16/2015  02:32 PM               871 deviceQuery_vs2010.sln
08/16/2015  02:32 PM             4,712 deviceQuery_vs2010.vcxproj
08/16/2015  02:32 PM               871 deviceQuery_vs2012.sln
08/16/2015  02:32 PM             4,757 deviceQuery_vs2012.vcxproj
04/29/2020  01:38 PM        18,219,008 deviceQuery_vs2013.sdf
04/29/2020  11:58 AM               950 deviceQuery_vs2013.sln
04/27/2020  09:39 AM             4,753 deviceQuery_vs2013.vcxproj
08/16/2015  02:32 PM               176 readme.txt
04/29/2020  11:12 AM    <DIR>          x64
              13 File(s)     18,449,834 bytes
               3 Dir(s)  1,450,845,192,192 bytes free

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>

 

Try compiling and running the CUDA sample deviceQuery.cpp

>nvcc -o devQuery deviceQuery.cpp

Error: fatal error C1083: Cannot open include file: 'helper_cuda.h': No such file or directory

nvcc compile failure: missing helper_cuda.h

E:\Microsoft Visual Studio\2013>cd \CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>dir
 Volume in drive E is Data
 Volume Serial Number is AAD3-8C76

 Directory of E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

04/29/2020  01:38 PM    <DIR>          .
04/29/2020  01:38 PM    <DIR>          ..
05/27/2015  04:39 PM            13,208 deviceQuery.cpp
08/16/2015  02:32 PM               871 deviceQuery_vs2010.sln
08/16/2015  02:32 PM             4,712 deviceQuery_vs2010.vcxproj
08/16/2015  02:32 PM               871 deviceQuery_vs2012.sln
08/16/2015  02:32 PM             4,757 deviceQuery_vs2012.vcxproj
04/29/2020  01:38 PM        18,219,008 deviceQuery_vs2013.sdf
04/29/2020  11:58 AM               950 deviceQuery_vs2013.sln
04/27/2020  09:39 AM             4,753 deviceQuery_vs2013.vcxproj
04/29/2020  12:14 PM           198,144 devQuery.exe
04/29/2020  12:14 PM               648 devQuery.exp
04/29/2020  12:14 PM             1,736 devQuery.lib
08/16/2015  02:32 PM               176 readme.txt
04/29/2020  11:12 AM    <DIR>          x64
              12 File(s)     18,449,834 bytes
               3 Dir(s)  1,450,922,340,352 bytes free

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>nvcc -o devQuery deviceQuery.cpp
deviceQuery.cpp
deviceQuery.cpp(20) : fatal error C1083: Cannot open include file: 'helper_cuda.h': No such file or directory

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>

Point nvcc at the samples' common include directory with -I so it can find helper_cuda.h.

>nvcc -o devQuery -I E:\CUDA\CUDA_v7.5\Samples\common\inc  deviceQuery.cpp

nvcc compile success

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>nvcc -o devQuery -I E:\CUDA\CUDA_v7.5\Samples\common\inc  deviceQuery.cpp
deviceQuery.cpp
   Creating library devQuery.lib and object devQuery.exp

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>dir
 Volume in drive E is Data
 Volume Serial Number is AAD3-8C76

 Directory of E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery

04/30/2020  10:02 AM    <DIR>          .
04/30/2020  10:02 AM    <DIR>          ..
05/27/2015  04:39 PM            13,208 deviceQuery.cpp
08/16/2015  02:32 PM               871 deviceQuery_vs2010.sln
08/16/2015  02:32 PM             4,712 deviceQuery_vs2010.vcxproj
08/16/2015  02:32 PM               871 deviceQuery_vs2012.sln
08/16/2015  02:32 PM             4,757 deviceQuery_vs2012.vcxproj
04/29/2020  01:38 PM        18,219,008 deviceQuery_vs2013.sdf
04/29/2020  11:58 AM               950 deviceQuery_vs2013.sln
04/27/2020  09:39 AM             4,753 deviceQuery_vs2013.vcxproj
04/30/2020  10:04 AM           198,144 devQuery.exe
04/30/2020  10:04 AM               648 devQuery.exp
04/30/2020  10:04 AM             1,736 devQuery.lib
04/30/2020  10:02 AM                 0 nvcc
08/16/2015  02:32 PM               176 readme.txt
04/29/2020  11:12 AM    <DIR>          x64
              13 File(s)     18,449,834 bytes
               3 Dir(s)  1,450,922,340,352 bytes free

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>

Run devQuery.exe

Result (success): devQuery.exe

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>devQuery.exe
devQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro 2000"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 4) Multiprocessors, ( 48) CUDA Cores/MP:     192 CUDA Cores
  GPU Max Clock rate:                            1251 MHz (1.25 GHz)
  Memory Clock rate:                             1304 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 15 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro 2000
Result = PASS

E:\CUDA\CUDA_v7.5\Samples\1_Utilities\deviceQuery>

 

Create and Run New Project from Command Line

hello.cu

#include <stdio.h>

// __global__ functions, or "kernels", execute on the device
__global__ void hello_kernel(void)
{
  printf("Hello, world from the device!\n");
}

int main(void)
{
  // greet from the host
  printf("Hello, world from the host!\n");

  // launch a kernel with a single thread to greet from the device
  hello_kernel<<<1,1>>>();

  // wait for the device to finish so that we see the message
  cudaDeviceSynchronize();

  return 0;
}

Compile

>nvcc -o hello hello.cu

Build (success) hello.cu

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>nvcc -o hello hello.cu
hello.cu
   Creating library hello.lib and object hello.exp

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>

Run

>hello

Result (success) hello.cu

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>hello
Hello, world from the host!
Hello, world from the device!

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\hello>

 

Errors

Nsight debug problems

Nsight unable to debug error
Break points ignored. Nsight message: A CUDA context was created on a GPU that is not currently debuggable. Breakpoints will be disabled.
Adapter: Quadro 2000

The Nsight 5.2 I had installed does not support Fermi-family GPUs (e.g. the Quadro 2000).

https://stackoverflow.com/questions/43030274/a-cuda-context-was-created-on-a-gpu-that-is-not-currently-debuggable
 

Missing #include <stdio.h>

nvcc compile (fails): missing stdio.h

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>nvcc -o test test.cu
test.cu
test.cu(8): error: identifier "printf" is undefined

1 error detected in the compilation of "C:/Users/AHOLI_~1/AppData/Local/Temp/tmpxft_000020dc_00000000-6_test.cpp4.ii".

Correct code

#include <stdio.h>

__global__ void foo() {}

int main()
{
  foo<<<1,1>>>();

  cudaDeviceSynchronize();
  printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

  return 0;
}

test.bat
nvcc -o test test.cu

Executes normally.

Result (success): test.exe

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>test.bat

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>nvcc -o test test.cu
test.cu
   Creating library test.lib and object test.exp

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>test
CUDA error: no error

E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\test>

 


cudaMallocManaged does not work on the Quadro 2000

Using the revised code from https://www.youtube.com/watch?v=2EbHSCvGFM0

Convert VectorAddCPU.cu to VectorAddGPU.cu

VectorAddGPU.cu

#include <stdio.h>
#define SIZE    1024


__global__ void VectorAdd(int *a, int *b, int *c, int n)
{

    int i = threadIdx.x;

    if (i < n)
        c[i] = a[i] + b[i];
}

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] \
        From device\n", threadIdx.x, blockIdx.x);
}

 

int main()
{
    int *a, *b, *c;

    printf("Hello World from host!\n");
    print_from_gpu << <1, 1 >> >();
    cudaDeviceSynchronize();

    cudaMallocManaged(&a, SIZE * sizeof(int));
    cudaMallocManaged(&b, SIZE * sizeof(int));
    cudaMallocManaged(&c, SIZE * sizeof(int));

    printf("passed cudaMallocManaged\n");

    for (int i = 0; i < SIZE; ++i)
    {
        a[i] = i;
        b[i] = i;    
        c[i] = 0;
    }

    printf("passed var addition\n");


    VectorAdd <<<1, SIZE>>> (a, b, c, SIZE);
    cudaDeviceSynchronize();


    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);


    cudaFree(a);
    cudaFree(b);
    cudaFree(c);

    return 0;
}

Builds correctly.

Build (success): VectorAddGPU.cu
1>------ Build started: Project: vectorAddGPU, Configuration: Debug Win32 ------
1>  Compiling CUDA source file vectorAddGPU.cu...
1>  
1>  E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\vectorAddGPU>"E:\CUDA\CUDA_v7.5\Toolkit\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "E:\Microsoft Visual Studio\2013\VC\bin"  -IE:\CUDA\CUDA_v7.5\Toolkit\include -IE:\CUDA\CUDA_v7.5\Toolkit\include  -G   --keep-dir Debug -maxrregcount=0  --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\vectorAddGPU.cu.obj "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\vectorAddGPU\vectorAddGPU.cu"
1>  vectorAddGPU.cu
1>  vectorAddGPU.vcxproj -> E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\Debug\vectorAddGPU.exe
1>  copy "E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart*.dll" "E:\Projects\CudaTest (VisualStudio)\CUDA 7.5 - Check Dev Env\vectorAddGPU\Debug\"
1>  E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart32_75.dll
1>  E:\CUDA\CUDA_v7.5\Toolkit\bin\cudart64_75.dll
1>          2 file(s) copied.
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

VS2013 Menu Bar:  DEBUG -> Start Without Debugging

The correct output should match the VectorAddCPU.cu result shown earlier in this post.

It does not: the program stops right after the cudaMallocManaged calls.

Result (failure): VectorAddGPU.exe

Hello World from host!
Hello World! from thread [0,0]                     From device
passed cudaMallocManaged
Press any key to continue . . .

 

Running in debug, VS2013 Menu Bar: NSIGHT -> Start CUDA Debugging

Fails on Line 37:  a[i] = i;

Unhandled Exception
Unhandled exception at 0x004E152B in vectorAddGPU.exe: 0xC0000005: Access violation writing location 0x00000000.


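Variable "a" never receives a valid pointer: cudaMallocManaged fails, and the code does not check its return value before writing to a[i].

The cause is that the Quadro 2000 has compute capability 2.1, while cudaMallocManaged (Unified Memory) requires compute capability 3.0 or higher. Unified Memory also requires a 64-bit application, and the build above is Win32 (--machine 32).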
http://selkie.macalester.edu/csinparallel/modules/TimingCUDA/build/html/0-Introduction/Introduction.html
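Checking the value returned by cudaMallocManaged turns the access violation into a readable error message. The sketch below assumes nothing beyond the CUDA runtime API; cuda_ok is a hypothetical helper written for this example, not a CUDA function.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE 1024

// Hypothetical helper: print a message and return false if a CUDA call failed.
bool cuda_ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main(void)
{
    int *a = NULL;

    // On a GPU or build that lacks Unified Memory support this call is
    // expected to return an error instead of a usable pointer.
    if (!cuda_ok(cudaMallocManaged((void **)&a, SIZE * sizeof(int)),
                 "cudaMallocManaged"))
        return 1;   // do not dereference a after a failed allocation

    // Only reached when the allocation succeeded.
    for (int i = 0; i < SIZE; ++i)
        a[i] = i;
    printf("a[10] = %d\n", a[10]);

    cudaFree(a);
    return 0;
}
```

With this check in place the program should exit with an error message on the Quadro 2000 instead of crashing at a[i] = i.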

On this GPU, allocate device memory with cudaMalloc and copy data explicitly with cudaMemcpy (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost) instead of using cudaMallocManaged.

 https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/
    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
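Applied to the vector addition from this post, the explicit-copy approach might look like the following. It is a sketch, not the code from the linked tutorial: vector_add_gpu and VectorAddKernel are names chosen for this example, and the single-block launch assumes n is at most 1024 (the per-block thread limit deviceQuery reports for the Quadro 2000).

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define SIZE 1024

// One thread per element, all in a single block (so n must be <= 1024 here).
__global__ void VectorAddKernel(const int *a, const int *b, int *c, int n)
{
    int i = threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two host arrays on the GPU using explicit
// device buffers and copies. Returns true if every CUDA call succeeded.
bool vector_add_gpu(const int *a, const int *b, int *c, int n)
{
    int *d_a = NULL, *d_b = NULL, *d_c = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    bool ok = false;

    if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_c, bytes) == cudaSuccess &&
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess)
    {
        VectorAddKernel<<<1, n>>>(d_a, d_b, d_c, n);
        // cudaGetLastError catches launch errors; the blocking copy back
        // waits for the kernel and reports execution errors.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts NULL, so this is safe after a partial failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}

int main(void)
{
    int *a = (int *)malloc(SIZE * sizeof(int));
    int *b = (int *)malloc(SIZE * sizeof(int));
    int *c = (int *)malloc(SIZE * sizeof(int));
    if (!a || !b || !c)
        return 1;

    for (int i = 0; i < SIZE; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    if (!vector_add_gpu(a, b, c, SIZE)) {
        fprintf(stderr, "GPU vector add failed\n");
        return 1;
    }

    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    free(a);
    free(b);
    free(c);
    return 0;
}
```

If everything works, the first ten values should match the VectorAddCPU.cu output. Unlike the managed-memory version, the host arrays stay ordinary malloc allocations and only the d_ pointers refer to device memory.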

 

Links