[Solved] Redirecting CUDA printf to a C++ stream

Problem statement

I am working on a large project which uses a logger for debugging. Since I like to keep a trace of what’s happening in some CUDA kernels, I tried to find a way to redirect the printf of my CUDA kernels to a stringstream (or any stream), which can then be forwarded to the logger.

Possible solution

I managed to do it by using the following code:

#include <cuda.h>
#include <stdio.h>
#include <unistd.h> // dup

#include <iostream>
#include <sstream> // stringstream
#include <fstream> // ofstream

char* output_file = "printf_redirect.log";

__global__ void printf_redirect(int* src, int* res)
    res[threadIdx.x] = threadIdx.x;
    printf("  %i: Hello World!n", res[threadIdx.x]);

int main()
    using namespace std;

    const uint N = 2;

    // Note: dummy arrays are not actually used, but this should prevent the
    //       compiler from discarding the printf in the kernel.

    int *d_A, *d_B, *h_A, *h_B;
    size_t size = N * sizeof (int);
    cudaMalloc (&d_A, size);
    cudaMalloc (&d_B, size);
    h_A = (int*) malloc (size);
    h_B = (int*) malloc (size);
    cudaMemcpy (d_A, h_A, size, cudaMemcpyHostToDevice);

    std::cout << "std::cout - start" << std::endl;
    printf ("stdout - startn");

    // Print to regular stdout
    std::cout << "Output to stdout:" << std::endl;
    printf_redirect<<<1,1>>> (d_A, d_B);
    cudaDeviceSynchronize ();

    std::stringstream ss;
    // Redirect std::cout to a stringstream
    std::streambuf* backup_cout = std::cout.rdbuf ();
    std::cout.rdbuf (ss.rdbuf ());
    // Redirect stdout to a buffer
    char buf[1024] = "";
    int backup_stdout = dup (fileno (stdout));
    freopen ("/dev/null", "w", stdout);
    setbuf (stdout, buf);

    std::cout << "Redirected output:" << std::endl;
    printf_redirect<<<1,N>>> (d_A, d_B);
    cudaDeviceSynchronize ();

    // Add CUDA buffer to a stringstream
    ss << buf;

    // Write stringstream to file
    std::ofstream outFile;
    outFile.open (output_file);
    outFile << ss.str ();
    outFile.close ();

    // Redirect back to initial stdout
    fflush (stdout);
    setbuf (stdout, NULL);
    fclose (stdout);
    FILE *fp = fdopen (backup_stdout, "w");
    fclose (stdout);
    *stdout = *fp;
    // Redirect back to initial std::cout
    std::cout.rdbuf (backup_cout);

    std::cout << "std::cout - end" << std::endl;
    printf ("stdout - endn");

    cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);

    free (h_A);
    free (h_B);

I used the following questions to achieve this:

Running the program, we get in the console:

std::cout - start
stdout - start
Output to stdout:
  0: Hello World!
std::cout - end
stdout - end

And in printf_redirect.log:

Redirected output:
  0: Hello World!
  1: Hello World!


Is there any easier way to achieve this? (e.g. hidden CUDA option or neat C/C++ trick)

Note that the final solution will end up in a utility class to make this less verbose in the actual code.

Enquirer: BenC


Solution #1:

Device side printf() causes implicit serialization of the threads that are printing so you probably wouldn’t want to use it in production code.

Device side printf() works by having the kernel copy messages to a preallocated ring buffer. Upon implicit or explit device synchronization (cudaDeviceSynchronize()), CUDA dumps any contents in the buffer to stdout and then clears it.

You could simply implement your own device printf(). Its performance would probably not be any worse than the built in one. The only disadvantage is that you would have to pass the ring buffer to the kernel and add a call to process it after the kernel returns.

Implementation would be something like this:

  • Create a buffer with room for a fixed numer of printf() formatting strings and associated 32-bit or 64-bit parameters.

  • Create a device function that uses atomicInc() to keep track of the current print location and takes a formatting string and parameters and copies them into the current location.

  • Pass the ring buffer to the kernel, which then passes it to the device function together with the print parameters.

  • Create a host function that takes the ring buffer, runs the formatting strings and parameters through host side sprintf() and passes the results to the logger.

Respondent: Roger Dahl

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .

Leave a Reply

Your email address will not be published.