Gaussian Blur using OpenCL and the built-in Images/Textures

If used correctly, OpenCL images / textures can give you large speedups on GPUs. In this post, I’ll show you a very short example of how to use OpenCL to blur/smooth an image. The goal is to show how images/textures are used in OpenCL and the benefits of using them.

The source code can be download from by GitHub page.

How Gaussian blurring works

Blurring is to smooth an image as shown in the top of this post. In practice, this is done by discrete convolution of the image and a mask. Gaussian blurring entails using an approximation of the Gaussian distribution as the convolution mask. Discrete convolution performs the same set of instructions on each pixel and is thus ideal for execution on a GPU. Each pixel gets a new value that is a weighted average of its closest neighbors. Mathematically, convolution of an image I, and a Gaussian mask G, with a specific \(\sigma\), can be expressed as:

\((I * G_\sigma)(\vec v) = \frac{1}{Z} \sum^N_{a=-N} \sum^N_{b=-N} I(\vec v+(a,b)) e^{-\frac{a^2+b^2}{2\sigma^2}}\)

where Z is a normalization constant equal to the sum of the mask.

OpenCL – Images and Textures

Most modern GPUs have a separate texture cache. These texture caches exists on GPUs because a lot of video games and 3D applications use texture mapping to map an image to 3D objects to create a realistic 3D scene. Textures are simply images, either 1, 2 or 3 dimensional. When a specific pixel in the texture is requested, the GPU will store the data and the neighboring data in a special buffer that is close to where the actual calculations are performed. Unlike regular linear storage buffers which only have caching in one dimension, textures can cache neighboring data in 2 or 3 dimensions. Thus, when a pixel is requested, the neighboring pixels above and below as well as those to the left and right are cached.

Note that writing to a 3D texture from inside a kernel is not supported in OpenCL by default. It is enabled through an extension called cl_khr_3d_image_writes. AMD supports this extension, while NVIDIA does not.

Caching is one of the main features of textures, but other nice features are:

  • Interpolation – Request a point in the image and get the linear approximation of that point using the closest neighboring pixels (enabled with CLK_FILTER_LINEAR flag in sampler)
  • Data type conversion – The texture fetch units on the GPU can convert between different data types. Used with the suffixes of the read_image and write_image functions (f for float, i for integer and ui for unsigned integer).
  • Normalized data types – These data types (CL_SNORM_INT8, CL_UNORM_INT8, CL_SNORM_INT16, CL_UNORM_INT16) can reduce the amount of memory used and transferred and may give you large speedups. Use them if you are using floats with a normalized range (0.0-1.0, -1.0-1.0) and accuracy is not that important. These data types store floats as a 16 or 8 bit integer in the texture and the texture fetch unit converts it to a float with a normalized range when requested.
  • Out of bounds handling – Reading outside a texture will not create a seg fault. Instead, you can define what should be returned by the read_image function in the sampler (CLK_ADDRESS_CLAMP_TO_EDGE, CLK_ADDRESS_CLAMP, CLK_ADDRESS_NONE..).
  • Channels – Each pixel in a texture can contain 1 to 4 channels. If you want to pack several values in each pixel this can be very useful. Defined in the creation of the image. ImageFormat(CL_RGBA, CL_FLOAT) is for instance a 4 channel (RGBA: red, green, blue, alpha) image of floats.

The code

OpenCL Kernel Code

Below is the OpenCL code for the Gaussian blur kernel. The functions write_image{f|i|ui} are used to write to an image in a kernel and read_image{f|i|ui} to read. Note that the read_image functions will always return a vector of size 4 and since we only use one channel in this application we retrieve the first component (.x) read_imagef(..).x

Another thing to note, is that you can only read OR write to a an image in an kernel. You can’t read and write to the same image. If you want to use images in several kernel calls after another I suggest using a dual buffering approach.

When reading images you have to define a sampler which you can see in the first line in the code below. Normalized coordinates enables accessing the image using normalized floating point coordinates (0.0 to 1.0). CLK_ADDRESS_CLAMP_TO_EDGE means that if a pixel outside the image is requested, it will use the pixel value that is closest to the edge. The last flag CLK_FILTER_NEAREST instructs the fetch unit to fetch the pixel that is closest to the coordinate. The alternative is CLK_FILTER_LINEAR which fetches a linear approximation of the 4(2D) or 8(3D) closest pixels. For more info on samplers, see the specification.

__kernel void gaussian_blur(
        __read_only image2d_t image,
        __constant float * mask,
        __global float * blurredImage,
        __private int maskSize
    ) {
    const int2 pos = {get_global_id(0), get_global_id(1)};
    // Collect neighbor values and multiply with Gaussian
    float sum = 0.0f;
    for(int a = -maskSize; a < maskSize+1; a++) {
        for(int b = -maskSize; b < maskSize+1; b++) {
            sum += mask[a+maskSize+(b+maskSize)*(maskSize*2+1)]
                *read_imagef(image, sampler, pos + (int2)(a,b)).x;
    blurredImage[pos.x+pos.y*get_global_size(0)] = sum;

Host Code

The host code is written in C++ and depends on two small libraries (SIPL and OpenCLUtilities). The function createBlurMask creates a Gaussian blur mask with a specific sigma. The mask size N, is calculated from the sigma. Larger sigma gives a larger mask size. The memory flag CL_MEM_COPY_HOST_PTR orders OpenCL to copy the contents of the last argument pointer to the the device. Each image is created with a specific image format using the ImageFormat class. The first argument is the channel format and the second is the data type. A list of all channel formats and data types can be found here. The rest should be self-explanatory.

#include "SIPL/Core.hpp"
#include "OpenCLUtilities/openCLUtilities.hpp"
using namespace cl;
float * createBlurMask(float sigma, int * maskSizePointer) {
    int maskSize = (int)ceil(3.0f*sigma);
    float * mask = new float[(maskSize*2+1)*(maskSize*2+1)];
    float sum = 0.0f;
    for(int a = -maskSize; a < maskSize+1; a++) {
        for(int b = -maskSize; b < maskSize+1; b++) {
            float temp = exp(-((float)(a*a+b*b) / (2*sigma*sigma)));
            sum += temp;
            mask[a+maskSize+(b+maskSize)*(maskSize*2+1)] = temp;
    // Normalize the mask
    for(int i = 0; i < (maskSize*2+1)*(maskSize*2+1); i++)
        mask[i] = mask[i] / sum;
    *maskSizePointer = maskSize;
    return mask;
int main(int argc, char ** argv) {
    // Load image
    SIPL::Image<float> * image = new SIPL::Image<float>("images/sunset.jpg");
    // Create OpenCL context
    Context context = createCLContextFromArguments(argc, argv);
    // Compile OpenCL code
    Program program = buildProgramFromSource(context, "");
    // Select device and create a command queue for it
    VECTOR_CLASS<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    CommandQueue queue = CommandQueue(context, devices[0]);
    // Create an OpenCL Image / texture and transfer data to the device
    Image2D clImage = Image2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, ImageFormat(CL_R, CL_FLOAT), image->getWidth(), image->getHeight(), 0, (void*)image->getData());
    // Create a buffer for the result
    Buffer clResult = Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float)*image->getWidth()*image->getHeight());
    // Create Gaussian mask
    int maskSize;
    float * mask = createBlurMask(10.0f, &maskSize);
    // Create buffer for mask and transfer it to the device
    Buffer clMask = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*(maskSize*2+1)*(maskSize*2+1), mask);
    // Run Gaussian kernel
    Kernel gaussianBlur = Kernel(program, "gaussian_blur");
    gaussianBlur.setArg(0, clImage);
    gaussianBlur.setArg(1, clMask);
    gaussianBlur.setArg(2, clResult);
    gaussianBlur.setArg(3, maskSize);
        NDRange(image->getWidth(), image->getHeight()),
    // Transfer image back to host
    float* data = new float[image->getWidth()*image->getHeight()];
    queue.enqueueReadBuffer(clResult, CL_TRUE, 0, sizeof(float)*image->getWidth()*image->getHeight(), data); 
    // Save image to disk and display it
    image->save("images/result.jpg", "jpeg");

Download and run the example

Download the source code at or clone it using git (Note that the example uses two small submodules: SIPL and OpenCLUtilities. The two git submodule commands below downloads these as well):

# Download
git clone git://
cd OpenCL-Gaussian-Blur
git submodule init
git submodule update
# Compile and run (use ./blur --device cpu/gpu to run on different devices)
cmake CMakeLists.txt
./blur --device gpu

You may also like...

37 Responses

  1. Sebastian says:

    Hi, thanks for this useful example! Despite being a complete n00b I’ve been messing around with your code, trying to write a simple chroma keying algorithm. For this to work I need an RGB image passed to the kernel. But as soon as I change CL_R to CL_RGB in the Image2D(…) line, clCreateImage fails and terminates the program.
    Assuming that SIPL getData() provides or, at least can provide RGB values, what am I doing wrong?

    • Erik Smistad says:

      Hi Sebastian
      CL_RGB is reserved for only some special data types, see here:
      You should instead use CL_RGBA, where A stands for alpha channel (transparency).
      However this example needs several changes in order to work with color:
      * The kernel code only goes over 1 channel (.x). You need to do this for all channels (rgb -> .xyz)
      * If you import a color image with SIPL, you only get RGB, so you need to pad it with an extra channel to get in RGBA format as OpenCL requires this.

  2. Kavya says:

    HI, Thanks for the code! I’m a beginner of OpneCL and when I tried running this code, I get an error:
    “Attempt to unlock mutex that was not locked
    Aborted (core dumped)”.

    Could you please help me

    • Erik Smistad says:

      You need to update SIPL. Go to the SIPL source folder and write the following:
      git checkout master
      git pull origin master

      Then recompile and run.

  3. Royi says:

    Great post.

    I would like to know. how do you handle boundaries?

    Thank You.

  4. Nico says:

    Can you explain me please how does the last lines in kernel work? I don’t understand what is porpose for…

    sum += mask[a+maskSize+(b+maskSize)*(maskSize*2+1)] *read_imagef(image, sampler, pos + (int2)(a,b)).x;
    blurredImage[pos.x+pos.y*get_global_size(0)] = sum;

    Thank you

    • Erik Smistad says:

      The summation in the for loops is the convolution as described with the equation on the top of this page. Every element in the mask is multiplied with pixels in a neighborhood in the image (read_imagef). a and b are the neighborhood offset for the x and y direction. In the end, the sum is assigned to the new (blurred) image.

      You can read more about this here:

  5. presam01 says:

    Many Thanks for the headstart. I used libjpeg to read the image file and then as suggested used clCreateImage2D to create an OpenCL image.

  6. presam01 says:

    Is the image object created with line below of type cl_mem?
    SIPL::Image * image = new SIPL::Image(“images/lena.jpg”);

    I would like to write the same piece of code in C and use clCreateImage2D function. But I am not sure how to create a reference to the file “lena.jpg” in C., i.e. how to create the same image object as described above using C.


    • Erik Smistad says:

      That line reads the image from disk.

      This line creates an OpenCL image (cl_mem) and transfers the image to the device:
      Image2D clImage = Image2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, ImageFormat(CL_R, CL_FLOAT), image->getWidth(), image->getHeight(), 0, image->getData());

      Thus, first you need a way to read the image from disk. Then get the pixel data as an array and finally you can put the data in an OpenCL memory object.

  7. yafish says:

    I edit some code of openCLUtilities.cpp and it is finished
    And now, new problem is Image2D this function =(
    It stop with an 0xC0000005:access violation on cl.hpp(3690)
    Sorry i have a lot of problems,because i am a beginner of OpenCL

  8. yafish says:

    It’s okay. Thanks.
    Now I have another problem,
    Context context = createCLContextFromArguments(argc, argv);
    I can’t create correct context of correct device.
    Because I don’t understand this form at all, I only can use
    context = clCreateContext(0,1,&devices,NULL,NULL,&err);
    Can you help me to solve this problem?

    • Erik Smistad says:

      I’m not sure I understand your problem.
      The program doesn’t work with ./blur –device gpu or ./blur –device cpu?
      I.e. the program can’t find any devices?

  9. yafish says:

    It is same as mine version ( I use windows 8 64 bit) ,and the install instruction I have read.
    I have a question that what content of CMakeLists.txt I need to type,

    So I google for other’s answer to type this file,it still have the problem of
    Error: missing binary operator before token “(”

    You have no solution to this,too? Or maybe I loss some important files?

  10. yafish says:

    When i compiled this program .I got an error is
    c:\program files (x86)\amd app sdk\2.9\include\gdk\gdkversionmacros.h(129):
    Error: missing binary operator before token “(”

    I had googled for severel weeks . What I sholud do?

  11. Vincent says:

    What kind of performance do you expect for a 1920×1080 grayscale image when running on a somewhat old NVIDIA GPU ( Quadro NVS 295 )? I added profiling and I measure it takes 14 seconds. Is this completely off or this is what I should expect?

  12. Will says:

    I can’t seem to get this to compile on OSX unless I replace the OCL vectors with std::vectors. Any idea why this is so? Or whether this is an issue?

    • Erik Smistad says:

      I have experienced the same thing on OSX. Not sure what the reason is, but I think it is a bug from Apple’s side. I haven’t experienced that it is an issue. However, if you have __NO_STD_VECTORS defined, it will not work on other platforms. So for cross-platform compatibility you can use the macro VECTOR_CLASS like this:

      VECTOR_CLASS devices = context.getInfo();

    • Will says:

      Ok, still having issues with this. When I create the Image2D, clCreateImage is throwing the error CL_INVALID_VALUE, and I can’t figure out why.

      • Will says:

        Just putting this here in case anyone stumbles upon it. I was having issues with creating the Image2d on the device. I’m not sure why, but using the deprecated function clCreateImage2d, rather than using the constructor (which calls the newer clCreateImage) works for me. It may be because my device only supports an older version of OpenCL, but I’m not sure.

        • Erik Smistad says:

          In OpenCL 1.2 they have replaced the clCreateImage2D and clCreateImage3D with one function clCreateImage. However, the cl.hpp file doesn’t seem to have been updated yet:\

          “cl.hpp – OpenCL 1.1 C++ Bindings Header File, implementing the C++ Bindings Specification. This header works for all versions of OpenCL, but has not yet been updated with new OpenCL 1.2 entry points. ”

          I did not know Apple had implemented 1.2. Which version of OSX are you using?

          AMD has 1.2, and they have a workaround for this problem “#define CL_USE_DEPRECATED_1_1_APIS”

  13. Elhassan says:


    I have a question regarding the use of Image2D from OpenCL. Is it better to convert the image that I read in the host to an array and send this array to gpu to convolve it?.
    I am wondering which way will be faster?.



    • Erik Smistad says:

      On GPUs, using textures/images for this sort of operation is usually faster than using arrays/buffers. This is due to the caching mechanisms of textures.

  14. Brandon says:

    What are the benefits of using Image2D over Buffer? It looks like it has more restrictions and is harder to use. I assume it can provide better performance. If that is true, how?

    • Erik Smistad says:

      As it states in the text above, textures on GPUs are optimized for 2D(and 3D) spatial cache locality. Thus, using textures (called images in OpenCL) can increase the cache hit ratio which will speed up memory access.

      I think images are easier to use and have less restrictions than buffers. For instance reading outside a buffer gives undefined results while with images you can define the outcome yourself and thus won’t have to check if the address is out of bounds.

      Another thing you don’t have to do with images is address translations. If you have x and y coordinate you can just insert that directly to the function. You don’t have to do x+y*width to calculate a linear address.

  15. Elmo says:

    You could mention that there is a Problem when using large Blur filters that would not fit in the Constant Memory. A Blur with a 101^2 float filter would result in an overflow of cosntant memory which will result in an execution error

  16. Liss says:

    You forgot to include CMakeLists.txt in your repository, so it is impossible to compile your project.

    I’m new to CMake and OpenCL so it is not obvious for me how to create CMakeLists.txt from scratch.

    I really appreciate if you fix your repository by adding CMakeLists.txt to it. Thanks.

    • Erik Smistad says:

      Sorry about that. I have now added the CMakeLists.txt file to the github repo.

      • Liss says:

        Thank you. The project gave me very strange compile errors like this:

        /usr/include/CL/cl.hpp: In function ‘cl_int cl::UnloadCompiler()’:
        /usr/include/CL/cl.hpp:1606:12: error: ‘::clUnloadCompiler’ has not been declared

        I do not post full log because it seems that this is known problem with OpenCL:

        So, according to recommendation from Khronos forum, I added:


        …to OpenCLUtilities/openCLUtilities.hpp and now everything compiles perfectly. But when I try to run it, it fails with “Error: images/lena.jpg not found”. So I fixed this by running:

        mkdir images; cd images; wget; cd ..

        And now when I try to run “./blur –device gpu” I get:

        Using platform vendor: NVIDIA Corporation

        ..and then the program hangs (if I just run ./blur the result is the same). I have NVidia GTX 295 and gcc 4.7.1 in case it matters. All I discovered so far that for some reason it hangs after saving the image. So, I get result.jpg, but the program cannot exit gracefully. Since I wanted to write my own image processing program based on your example with your library, I really need to solve this problem, but so far I could not find out why this happens.

        This bug reproducible for me even with 1×1 JPG but if I set breakpoint at image->save() call in main.cpp, run it and press “step” few times, the bug does not happen, the program completes successfully. I’m not sure how to debug something not reproducible in gdb. Here are few last lines after running ltrace ./blur:

        clReleaseProgram(0xc00d90, 0, 0xc01190, 134464, 0x7f0c43e7b420) = 0
        clReleaseContext(0xa2a6d0, 0, 0xc10c70, 134464, 0x7f0c43e7b420) = 0
        g_thread_join(0x9c4800, 0, 0x7f0c490ad490, 0x7f0c490ad490, 0x7f0c43e7b420

        After this it just hangs with 0% CPU usage. But, as I have said, if I slowly step with gdb, this does not happen and it exits normally.

        Since you wrote SIPL library, perhaps you have any idea what might be the cause of this issue? If not, then perhaps you can give some clue how to debug this?

        • Erik Smistad says:

          Thanks for the bug report! This was a bug with the library that I fixed a while ago, but this project did not have an updated version of SIPL. It should be fixed now.

          You can probably use a “git pull origin master” in the SIPL folder to update the SIPL library yourself. Or you can do the entire “Download and run the example” code from the post.

          • Liss says:

            Great! Everything works as expected now.

            Thank you very much for writing all these articles about OpenCL and for creating libraries so useful for image processing!

  1. July 11, 2013

    […] on the GPU to reduce memory access latency. Read more on textures in OpenCL my previous post on Gaussian Blur using OpenCL. If you want to look into further optimizing the level set computation you should look into the […]

Leave a Reply

Your email address will not be published.