Windows CUDA install and testing

To get CUDA 8.0 up and running on 64-bit Windows, download and install:

CUDA 8.0

Microsoft Visual Studio Community 2015 with Update 3 x64

Then follow the testing directions at:


I've finally started dabbling in TensorFlow, using the nicely explained Python kernel for MNIST (recognizing handwritten digits 0-9) at

Using PyCharm / Anaconda on Ubuntu 16.04, running in VMware Player 12 on a slow Celeron laptop with 3GB of RAM, I kept hitting performance problems and then memory errors during the TensorFlow validation step, which showed up as:

training_accuracy / validation_accuracy => 0.96 / 0.98 for step 2499
starting validation
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

On my limited budget I recently replaced my workstation (my trusty 10-year-old Dell XPS 410 w/ 3GB, Win 10 Pro) with another used computer, an HP Z600 with 8GB, for 40000 yen ($350).  I pulled the RAID disks out of the Dell, put them into the new workstation, and it worked like a charm.  I was pleasantly surprised that it came with an Nvidia Quadro 2000 GPU, which would cost more new than the used PC it came in.  What a great deal.

Installing CUDA

After updating my graphics driver and rebooting, I downloaded the 1.2GB CUDA installer (cuda_8.0.61_win10.exe) and the patch.  Halfway through the install it showed a notification that Visual Studio was not installed, with a link.

Conveniently, the link in the notification goes to Visual Studio, and even better, there is a Community edition.  I cancelled the CUDA install and installed Visual Studio 2017 Community Edition.  VS is massive, with lots of options.  I ended up installing the Universal Windows Platform development and the Desktop development with C++ workloads, as I figured that would get me the C++ libraries I needed.  Installed, rebooted.

Again I tried the CUDA install.  Again the same notification that VS was not installed.  I decided to troubleshoot later and finished the install.
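
Once a working host compiler is in place, one quick way to confirm that the toolkit and the driver can see each other is to ask the CUDA runtime for its version, the driver's supported version, and the device count. The following is a minimal sketch of that idea, not something taken from the installer or the tutorial; the helper names versionMajor and versionMinor are my own and purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (8000 means 8.0).
// These two helpers are illustrative names, not part of the CUDA API.
static int versionMajor(int v) { return v / 1000; }
static int versionMinor(int v) { return (v % 1000) / 10; }

int main()
{
    int runtimeVersion = 0;
    int driverVersion = 0;
    int deviceCount = 0;

    // Version of the CUDA runtime this program was built against.
    cudaError_t status = cudaRuntimeGetVersion(&runtimeVersion);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // Latest CUDA version the installed driver supports (0 if no driver is found).
    status = cudaDriverGetVersion(&driverVersion);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    // Number of CUDA-capable devices visible to the runtime.
    status = cudaGetDeviceCount(&deviceCount);
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    printf("runtime %d.%d, driver supports up to %d.%d, %d device(s)\n",
           versionMajor(runtimeVersion), versionMinor(runtimeVersion),
           versionMajor(driverVersion), versionMinor(driverVersion),
           deviceCount);

    return deviceCount > 0 ? 0 : 1;
}
```

If the driver's supported version comes back lower than the runtime version, or the device count is zero, the problem is likely with the driver or the GPU rather than with Visual Studio.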

There is a nice tutorial to see if your CUDA is up and running:

Following those steps and comparing against his screenshot, I could see that I was missing not only CUDA in Visual Studio but also the VS templates.  After some more googling, I found this page:

which shows VS 2015 as the supported version for CUDA 8.0.  Downloading an older VS requires a free subscription to Microsoft's developer program.  I uninstalled VS 2017 and installed VS Community 2015 with Update 3.  That generated this message:

I'm not sure yet how important this is for CUDA.


Back to

Creating a CUDA project, I got this error message:

So apparently a common problem:

With an easy solution (courtesy of Dydzej from the above link):

I  had the same problem and I solved it by installing "Visual C++ 2015 Tools for Windows Desktop".

Visual studio 2015 does not support for C++ by default, so you have to go to projects Visual C++ and click on "Install Visual C++ 2015 Tools for Windows Desktop".

The default program was slightly different from the one mentioned in the example:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

// Each thread adds one pair of elements; the thread index selects the element.
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Every "goto Error" above lands here so the GPU buffers are always released.
Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}


When running it, the following output is produced:

{1,2,3,4,5} + {10,20,30,40,50} = {11,22,33,44,55}
Press any key to continue . . .
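
The default program launches addKernel as <<<1, size>>>, that is, a single block with one thread per element, so it only works while the array fits within the device's per-block thread limit (reported by the runtime as maxThreadsPerBlock). As a rough sketch of how the same addition could be spread over several blocks, here is a variant I put together. It is not part of the Visual Studio template: the names addKernelGrid, blocksFor and ok, the array size of 100000, and the block size of 256 are all my own assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of addKernel: each thread computes a global index
// from its block and thread position, so the work can span many blocks.
__global__ void addKernelGrid(int *c, const int *a, const int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end
        c[i] = a[i] + b[i];
    }
}

// Number of blocks needed to cover n elements, rounding up.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Report a failed CUDA call and return false so the caller can stop early.
static bool ok(cudaError_t status, const char *what)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        return false;
    }
    return true;
}

int main()
{
    cudaDeviceProp prop;
    if (!ok(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) {
        return 1;
    }
    printf("%s: compute capability %d.%d, max threads per block %d\n",
           prop.name, prop.major, prop.minor, prop.maxThreadsPerBlock);

    const int n = 100000;     // arbitrary size, more than one block can hold
    const int threads = 256;  // assumed block size; must not exceed the limit printed above
    const size_t bytes = n * sizeof(int);

    std::vector<int> a(n), b(n), c(n, 0);
    for (int i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = 10 * i;
    }

    int *dev_a = 0, *dev_b = 0, *dev_c = 0;
    // The && chain stops at the first failing call.
    bool good = ok(cudaMalloc((void**)&dev_a, bytes), "cudaMalloc dev_a")
             && ok(cudaMalloc((void**)&dev_b, bytes), "cudaMalloc dev_b")
             && ok(cudaMalloc((void**)&dev_c, bytes), "cudaMalloc dev_c")
             && ok(cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice), "cudaMemcpy a")
             && ok(cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice), "cudaMemcpy b");

    if (good) {
        addKernelGrid<<<blocksFor(n, threads), threads>>>(dev_c, dev_a, dev_b, n);
        // The blocking device-to-host copy also waits for the kernel to finish.
        good = ok(cudaGetLastError(), "addKernelGrid launch")
            && ok(cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy c");
    }

    // cudaFree on a null pointer does nothing, so this is safe on every path.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    if (!good) {
        return 1;
    }

    // Verify on the host instead of printing 100000 values.
    for (int i = 0; i < n; ++i) {
        if (c[i] != 11 * i) {
            fprintf(stderr, "mismatch at index %d\n", i);
            return 1;
        }
    }
    printf("all %d sums match\n", n);
    return 0;
}
```

The bounds check in the kernel matters because the block count is rounded up, so the last block usually has a few threads with nothing to do. The first line of output also shows what the card reports about itself, which is handy on an older GPU like mine.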