![]() |
Site Archive (Complete) | |||
|
ABOUT US |
CONTACT |
ADVERTISE |
SUBSCRIBE |
SOURCE CODE |
CURRENT PRINT ISSUE |
NEWSLETTERS
|
RESOURCES
|
BLOGS
|
PODCASTS
|
CAREERS
|
||||
May 13, 2008
CUDA, Supercomputing for the Masses: Part 3Rob Farber
Error handling and global memory performance limitations
Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at rmfarber@gmail.com.
Congratulations! Thanks to Part 1 and Part 2 of this series on CUDA (short for "Compute Unified Device Architecture"), you are now a CUDA-enabled programmer with the ability to create and run programs that can use many hundreds of simultaneous threads on CUDA-enabled devices. In Part 2, incrementArrays.cu, I provided a working example of a pervasive CUDA application pattern -- move data to device, run one or mores kernels to perform a calculation, and retrieve the result(s). Essentially, incrementArrays.cu can be morphed into whatever application you desire simply by substituting your own kernel and loading your own data (which is what I do for the example in this column). Subsequent columns will discuss CUDA asynchronous I/O and streams.
"You now know enough to be dangerous!" is a humorous and accurate way to summarize the previous paragraph. The good news about CUDA is that it provides a natural way to translate your thoughts as a programmer into massively-parallel programs. The bad news is that more understanding is required to make those programs both robust and efficient.
Don't be cautious. Start experimenting and go for it! CUDA provides both programming tools and constructs to create excellent software and the only way to really learn it is to experiment. In reality, these columns will supplement your experimentation and learning process by highlighting CUDA features through short examples and by bringing good sources of information on the Internet to your attention. Remember that the CUDA Zone is a great place for all things CUDA and the forums are a great place to look for answers to your questions, plus they have the advantage of being interactive so you can post questions and get answers.
This and the next few columns utilize a simple array reversal application to expand your knowledge and highlight the performance impact of shared memory. I discuss error checks and performance behavior, along with the CUDA profiling tool. I've also included the source listing for the next column so you can see how to implement array reversal with shared memory. The program reverseArray_multiblock.cu implements an obvious, yet low performance, way to reverse an array in global memory on a CUDA device. Do not use it as a model for your applications because global memory is not the best memory type to use for this application -- plus this version also performs uncoalesced memory accesses, which adversely affects global memory performance. The best global memory bandwidth is achieved when simultaneous memory accesses can be coalesced into a single memory transaction. In subsequent columns, I discuss the differences between global and shared memory as well as the various requirements for memory accesses to coalesce based on the compute capability of the device.
CUDA Error Handling
Detecting and handling errors is essential to creating robust and usable software. Users tend to get very grumpy when their applications fail or produce incorrect results. For developers, adding error-handling code can be annoying, and tedious. It can clutter up the elegance of the code, and slow the development process in attempting to deal with every conceivable error. Yes, error handling is a thankless job but keep in mind you are not doing it for yourself (although I have been saved countless times through good error checking) -- rather it is being done for the people who are going to use the program. If something can fail, users need to know why it failed and, more importantly, what they can do to fix the problem. Enough said, good error-handling and recovery can really make your application a hit with the users. Commercial developers should especially take note.
The CUDA designers are aware of the importance of good error handling. To facilitate this, every CUDA call -- with the exception of kernel launches -- returns an error code of type cudaError_t. Upon successful completion, cudaSuccess is returned. Otherwise, an error code is returned.
A human-readable description of the error can be obtained from:
char *cudaGetErrorString(cudaError_t code);
C-language programmers will recognize a similarity between this method and the C library, which uses the variable errno to indicate errors and the reporting of human-readable error messages with perror and strerror. The C library paradigm has worked well for many millions of lines of C-code, and there is no doubt it should work well in the future for CUDA software.
CUDA also provides a method, cudaGetLastError, which reports the last error for any previous runtime call in the host thread. This has multiple implications:
Looking at the Source Code
Looking at the source code for reverseArray_multiblock.cu, you notice that the structure of the program is very, very similar to the structure of moveArrays.cu from Part 2. An error routine, checkCUDAError is provided so the host can print out a human-readable message and exit when an error is reported by cudaGetLastError. As can be seen, checkCUDAError is judiciously utilized throughout the program to check for errors.
The program reverseArray_multiblock.cu essentially creates a 1D array of integers, h_a, containing the integer values [0 .. dimA-1]. Array h_a is moved via cudaMemcpy to array d_a, which resides in global memory on the device. The host then launches the reverseArrayBlock kernel to copy the array contents in reverse order from d_a to d_b, which is another global memory array. Again, cudaMemcpy is used to transfer data -- this time from d_b to the host. A check is then performed on the host to verify that the device produced the correct result (e.g, [dimA-1 .. 0]).
// includes, system #include <stdio.h> #include <assert.h>
A key design feature of this program is that both arrays d_a and d_b reside in global memory on the device. The CUDA SDK provides an example program, bandwidthTest, which provides some information about the device characteristics. On my system, the global memory bandwidth is slightly over 60 GB/s. This is excellent until you consider that this bandwidth must service 128 hardware threads -- each of which can deliver a large number of floating-point operations. Since a 32-bit floating-point value occupies four (4) bytes, global memory bandwidth limited applications on this hardware will only be able to deliver around 15 GF/s -- or only a small percentage of the available performance capability. (This assumes the application only reads from global memory and does not write to it.) Obviously, higher performance applications must reuse data in some fashion. This is the function of shared and register memory and it is our job as programmers to gain the maximum benefit of these memory types. To gain a better understanding of machine balance as floating-point capability relates to memory bandwidth (and other machine characteristics), read my article HPC Balance and Common Sense.
Shared Memory Version
The following source listing is for arrayReversal_multiblock_fast.cu, which I discuss in the next installment. I provide it now for your convenience so you can see how to use shared memory on this problem right now.
// includes, system #include <stdio.h> #include <assert.h>
In the next column, I begin looking at the use of shared memory to increase performance. Until then, I suggest looking into the CUDA memory types -- specifically __shared__, __constant__, and register memory.
For More Information
Click here for more information on CUDA and here for more information on NVIDIA.
|
|
||||||||||||||||||||||||||
|
|