Marching Cubes implementation using OpenCL and OpenGL

by Erik Smistad · Published October 7, 2011 · Updated May 30, 2016

In a school project I recently created a fast implementation of Marching Cubes that uses OpenCL to extract surfaces from volumetric datasets and OpenGL to render the surfaces on screen. I wrote a paper together with my two supervisors about the implementation and presented it at the Joint Workshop on High Performance and Distributed Computing for Medical Imaging at the MICCAI 2011 conference. Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory. The paper entitled “Real-Time Surface Extraction and Visualization of Medical Images using OpenCL and GPUs” describing the implementation can be downloaded here. The source code of the implementation can be downloaded from my GitHub page.

Marching Cubes

Marching Cubes (MC) is a well known algorithm, introduced by Lorensen and Cline [1] for extracting surfaces from 3D datasets. It works by dividing the entire dataset into a grid of cubes. Triangles are then created in each cube by determining if each corner in the cube is inside or outside the object. Because 2^8=256 different cube configurations are possible, an efficient implementation using a table lookup of where to put the triangles is possible.

Marching Cubes in parallel on the GPU

MC is a completely data parallel algorithm since each cube can be calculated independently on the other cubes and thus ideal for running on GPUs. The problem with running MC in parallel is how to store the triangles in parallel. The dataset is usually very sparse, only a few percentage of the cubes actually produce output. Assuming that all cubes produce output will quickly exhaust the limited GPU memory. Thus we need some way to remove all the cubes that don’t produce any output. This is a common problem in GPU applications and the solution is often termed “Stream Compaction”, which refers to removing unwanted elements from a stream of elements.

Several stream compaction methods exists. NVIDIA for instance use prefix sum scan in their CUDA and OpenCL MC implementations. In our implementation we used a stream compaction method called Histogram Pyramids by Ziegler et al. [2] which have been used by Dyken et al. [3] on MC. This data structure has the benefit of exploiting the texture caching mechanisms on GPUs.

Conclusions

We compared performance of the implementation against NVIDIAs and Dyken et al. [3] implementations and got the following conclusions:

Our implementation can handle larger datasets due to a more compressed storage format of the Histogram Pyramids

Should be applicable for other frameworks as CUDA and shader programming also
Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory.

For smaller datasets our implementation is slower than other methods

This was found to be mainly due to an expensive OpenGL-OpenCL synchronization where the CPU is used
OpenCL and OpenGL extensions are proposed to deal with this
Hopefully this synchronization will happen on the GPU in the future

NVIDIA GPUs don’t support writing to 3D textures. It should be possible to pack the data into a 2D texture instead, but problems with NVIDIAs current OpenCL implementation has haltered this. Instead our implementation stores the results in regular buffers and then copy the contents to 3D textures afterwards. This is slower and use a bit more memory, but it works and is reasonably fast.
Histogram Pyramids is a very efficient stream compaction method well suited for GPUs
OpenCL is a good framework for writing cross-platform GPU applications

See our paper for more details. The source code of the implementation can be downloaded from my GitHub page

References

[1] Lorensen W E, Cline H E. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics. 1987;21(4)
[2] Ziegler G, Tevs A, Theobalt C, Seidel H-P. On-the-fly Point Clouds through Histogram Pyramids. Vision, modeling, and visualization. 2006, 137-144
[3] C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel. High-speed Marching Cubes using HistoPyramids. Computer Graphics Forum, 27(8):2028–2039, Dec. 2008.

Tags: GPGPU GPU Marching Cubes OpenCL OpenCL-OpenGL interoperability OpenGL surface extraction

zhudongyu says:

February 20, 2024 at 03:14

Can I use the minimum distance function datasets for Marching Cubes? Is it possible to make relatively simple modifications？

Reply
saber says:

March 22, 2017 at 14:56

hi erik
i want to ask you a question
haw can i display a lot of imgs in viisuel studio + openGL + openCV
because i use all of this in marching cube in my school
I have no idea
pllz help me
thnx for u’r attention sir <3

Reply
Y.S. says:

March 24, 2016 at 03:22

Hi Erik, Are you aware of if any Nvidia card now is supporting 3d texture writing. Thank you.
Y.S.

Reply
- Erik Smistad says:
  
  April 1, 2016 at 12:56
  
  As far as I know, there are no NVIDIA cards that support 3D texture writing with OpenCL. It is, however, supported if you use CUDA. Hence, the hardware supports it, they just haven’t bothered implementing it in OpenCL.
  
  Reply
Ritika says:

June 30, 2015 at 05:21

can someone please post how do we use the mc algorithm for surface rendering?

Reply
mauro says:

April 29, 2015 at 11:11

many many thanks for sharing! it’s a very powerfull algorithm! i love you so much! 🙂 …thanks thanks thanks!

Reply
Y.S. says:

February 4, 2015 at 06:58

Hi Erik, I noticed the marching cube result is different at each running, which seems random. The number of triangles is also different. Is there a particular reason causing this. The isosurface result should be the same each time. Thank you.

Best regards,
Y.S.

Reply
- Erik Smistad says:
  
  February 4, 2015 at 10:13
  
  It should not be random, I have not experienced this.
  The source of randomness is probably that the program is reading out of bounds or some other uninitialized memory.
  I will have to investigate this.
  What is the size of your volume?
  
  Reply
  - Y.S. says:
    
    February 5, 2015 at 09:33
    
    The size is 128x128x128. The overall shape is the same. But it seems at some voxels, the triangles each time generated are different. It happens on the whole surface. I can observe this by showing the surface and keep updating it from the same volume. I am using a binary volume only with 0 and 255, which I don’t know if it is the reason. Also if I print out the number of triangles generated, each time it is different but it only differs a few. It should not be a out of bound problem, since the isosurface is located very inside of the volume.
    
    Thank you!
    Y.S.
    
    Reply
    - Erik Smistad says:
      
      February 7, 2015 at 22:11
      
      I am not able to reproduce the error. What GPU are you using and could you upload the volume somewhere?
      
      Reply
      - Y.S. says:
        
        February 15, 2015 at 06:27
        
        Erik, I am currently on vacation. I will get back to you later. Thank you very much.
        Y.S.
        
        Reply
      - Y.S. says:
        
        March 8, 2015 at 07:44
        
        Erik, I found the error was in my code. Sorry for the confusion. Thanks a lot!
        
        Y.S.
        
        Reply
Lalita says:

November 4, 2014 at 02:29

Hi Erik,
When I use OPenCL in my application, there are lots of errors about the cl:size_t and size_t,as following:
qglobal.h(1908) : error C2872: ‘size_t’ : ambiguous symbol could be ‘predefined C++ types (compiler internal)(19) : unsigned int size_t’
or ‘predefined C++ types (compiler internal)(19) : size_t’
or ‘\include\CL/cl.hpp(683) : cl::size_t’
How can I deal with these kind of problems?
Thank you !

Reply
- Erik Smistad says:
  
  November 6, 2014 at 09:57
  
  You can avoid using “using namespace cl;” in your application
  
  Reply
Lalita says:

October 14, 2014 at 09:23

I have another question. Do we must keep the raw data is a cube, such as 64*64*64, can it be 64*128*256? And the raw data size must be power of 2, if 129*129*129 can be run well?

Reply
- Erik Smistad says:
  
  October 14, 2014 at 13:34
  
  The input raw data does not have to be a power of 2 or equal in each dimension.
  
  Reply
  - Lalita says:
    
    October 15, 2014 at 03:18
    
    Thank you for your help. I will try the different size of raw data to test.
    
    Best regards,
    Lalita
    
    Reply
Lalita says:

October 14, 2014 at 09:11

Hi Erik,
Your application is useful for me.
I want get the GPU results to CPU, such as all of the vertex, normal and the triangles index. I try to use “queue.enqueueReadBuffer(cubeIndexesBuffer, CL_TRUE, 0, sizeof(ushort)*num, triangleIndex);” or “queue.enqueueReadBuffer(VBOBuffer, CL_FALSE, 0, sizeof(cl_float)*num, vertex);”, but the value sames not as my expects.
Can you tell me the method to get the surface info, vertex, normal and triangle index.
Thank you .

Reply
- Erik Smistad says:
  
  October 14, 2014 at 13:33
  
  The vertices and normals are stored in the VBO. The vertices and normals are interleaved and sorted for each triangle:
  VBO = [x00,y00,z00,nx00,ny00,nz00, x01,y01,z01,nx01,ny01,nz01, x02,y02,z02,nx02,ny02,nz02, x10,y10,z10,nx10,ny10,nz10, …]
  where x00 is the x position of vertex 0 in triangle 0, x10 is x position of vertex 0 in triangle 1. [nx,ny,nz] are the corresponding normal vectors.
  You can then extract the VBO data in this way:
  int num = 3*2*nrOfTriangles; float* VBOdata = new float[num]; queue.enqueueReadBuffer(VBOBuffer, CL_TRUE, 0, sizeof(cl_float)*num, VBOdata);
  
  The cube index buffer on the other hand is a flat 3D structure and has the same size as the raw data. Also, it is stored as uchar not ushort.
  
  Good luck.
  Cheers,
  – Erik
  
  Reply
  - Lalita says:
    
    October 15, 2014 at 03:12
    
    Thank you Erik. The method results are same as mine. Maybe I should analyze the results carefully before, in order to avoid doubting the results.
    Thank you again.
    
    Lalita
    
    Reply
Romain says:

September 10, 2014 at 19:29

Hi Erik, thank you very much for this !
I have compiled it under visual studio, but i had to replace all the log2 calls because my visual studio 12 express doesn’t know them. (replace log2 with log and i divide by log(2.0))

And when running, enqueueReadBuffer() line 113 in gpu-mc.cpp returns a -30 error.
Maybe my error is related to the log2 replacements i’ve made …

To be sure, for example, when you put (log2((float)SIZE)) (line 398), and if SIZE=256, this equals to 8, no ?

Thank you, i’m struggling so much to get it working on Nvidia configuration
: )
Romain

Reply
- Erik Smistad says:
  
  September 10, 2014 at 20:44
  
  Hi Romain!
  
  You are right about the log2 function, log(x)/log(2) is the way to do it.
  Though, I am not sure about the OpenCL error you get. What is the error message? -30 doesn’t tell me much 🙂
  
  – Erik
  
  Reply
  - Romain says:
    
    September 10, 2014 at 22:18
    
    Thank you for your answer !
    
    I’ve attached 3 simple screenshots so that you can see : http://star-en-maths.tv/tmp/tmp.html
    
    1st is the output (i’ve done a try catch around “clEnqueueReadBuffer()”)
    2nd is the error location in gpu-mc.cpp
    and 3rd is one of the log2 replacements i’ve done. Maybe the runtime error i get is due to those replacements …
    
    I use 256^3 size skull.raw from volvis and 3D textures are not supported by graphic card.
    
    If you can help me, i’ll show you the beautiful results of what i’m coding ^^,
    Thank you,
    Romain
    
    Reply
    - Erik Smistad says:
      
      September 11, 2014 at 10:43
      
      Unfortunately, I am not able to reproduce your error. The error type is “invalid value” which means that some of the arguments you send in to the enqueueReadBuffer function is wrong.
      
      I found some other bugs which I fixed including creating the log2 function which does not exist on windows. See changes here: https://github.com/smistad/GPU-Marching-Cubes/commit/9223cc79c2555134c0d33fc2933519ad8697fc5d
      
      Reply
      - Romain says:
        
        September 11, 2014 at 13:21
        
        Erik : ) !! It works !
        
        BUT, i had to add some code because you make a “new” without “delete” in your last code !
        
        So i initialize VBOBuffer to nullptr, and then i a add this just before your “new BufferGL” line 625 :
        
        if (VBOBuffer != nullptr)
        delete VBOBuffer;
        
        Indeed, without this, it worked but after pressing “e”, it crashed (histoPyramidConstruction() is called again).
        
        Another small thing on my configuration :
        
        1* I had to define round() because it’s not knowed :
        double round(double number)
        {
        return number < 0.0 ? ceil(number – 0.5) : floor(number + 0.5);
        }
        
        2* When you replace **HP_SIZE** with actual data size in the .cl file (line 423 : sourceCode = sourceCode.replace(pos, 11, to_string(MY_SIZE));),
        oddly it adds some "E**" to it and then .cl file compilation doesn't work at run time. I'm sure this replacement does well for you, but not for me.
        So, because i was just trying your code with skull.raw which is 256^3, i dirtily #define SIZE 256 in gpu-mc-morton.cl, just for this to work.
        
        So, thank you very much : )
        Romain
        
        Reply
        
        Erik Smistad says:
        
        September 11, 2014 at 15:37
        
        Great!
        
        I changed the code to use -DSIZE=X when compiling instead of doing the string replace operation. I also added the round function.
        
        I experience the same problem with deleting the VBOBuffer. Not sure why. However, the glDeleteBuffers call will the delete the data in the VBOBuffer, so memory leakage should not be a problem there.
        
        Cheers
        – Erik
        
        Romain says:
        
        September 12, 2014 at 13:07
        
        Yes, i undestand what you say about the VBOBuffer.
        However I think that every “new” deserves a “delete”.
        
        But, in fact, i just replaced your “new VBOBuffer” with a local variable
        “BufferGL VBOBuffer(context, CL_MEM_WRITE_ONLY, VBO_ID);” (local to the function histoPyramidTraversal()) and it works well.
        
        I noticed another thing : i tried to make it work with 64x64x64 or 128x128x128 volumes from volvis (for example Fuel or Hydrogen Atom)
        and it does not work for me. I remind you that i’m in “no 3d texture” mode. For 256x256x256 skull runs Ok.
        
        Where do you think the error is ?
        
        Thank you,
        Romain
        
        Romain says:
        
        September 13, 2014 at 15:45
        
        Hi Erik,
        
        My bad ! It runs perfectly.
        That was me trying some things …
        Thank you very much for helping me in the first place, and for your time.
        
        Bye,
        Romain
Ionutcelgroaznic says:

April 18, 2014 at 11:09

I tried to use it with fltk( fl_gl_window class).

It works, but if I interact with fltk the VBO seems to disapear, only native Opengl code seems to work.

Do you know if OpenCL and your implementation works with fltk?

Reply
- Erik Smistad says:
  
  April 24, 2014 at 10:09
  
  I have experimented previously with FLTK, OpenGL and OpenCL and that works. See the code for my experiment here: https://github.com/smistad/FLTK-OpenGL-OpenCL-Interoperability
  
  But I have not tried using this marching cubes implementation with FLTK.
  
  Reply
Anonymous says:

January 28, 2014 at 20:17

I tried to run it in Windows 7.It runs but when I try to change the isovalue I get an exception error in this line

VBOBuffer = BufferGL(context, CL_MEM_WRITE_ONLY, VBO_ID);

Do you have any idea why it crashes because I could not solve it

Reply
- Ionut says:
  
  January 28, 2014 at 21:41
  
  Actually I found a fix by creating the buffer each time.
  BufferGL* VBOBuffer=new BufferGL;
  
  My question is why do you keep this VBOBuffer for each frame?Is it necessary?
  
  Reply
  - Erik Smistad says:
    
    January 30, 2014 at 18:32
    
    It is probably not necessary. But remember to cleanup the object afterwards (delete VBOBuffer) or you will get memory leaks.
    
    Reply
Mark Essel says:

March 7, 2013 at 15:18

Pretty amazing application, digging what can be done with the proper tools. Grabbed the pdf you linked above for further reading.

Reply
- Erik Smistad says:
  
  March 7, 2013 at 15:57
  
  I discovered that the article linked in this post is an older version. The newer version is here: http://tapironline.no/last-ned/1050
  
  Reply
huangxiang says:

September 29, 2012 at 14:52

problem fixed.
it runs ok on NVIDIA GT240. thankyou for your code.
anyhow ati radeon hd 5145 neither support 3d textures or 32 bits via pointer..

Reply
- Erik Smistad says:
  
  September 30, 2012 at 17:01
  
  Not all of AMDs/ATIs devices have full OpenCL support. The list of supported devices is here: http://developer.amd.com/tools/hc/AMDAPPSDK/pages/DriverCompatibility.aspx
  
  Reply
huangxiang says:

September 26, 2012 at 15:32

NVIDIA GT240 cuda 4.0 vs2008 compile ok,run crash!
show some kind of access violation. seems the precompiled c-lglutility wrong? or just can’t run on N-card?

Reply
Henry says:

February 2, 2012 at 22:27

Great! Have u a version compatible with NVIDIA drivers? I mean, not using image_3d writing, but buffer objects instead?

Reply
- Erik Smistad says:
  
  February 3, 2012 at 10:13
  
  I have tried making two different versions to work on NVIDIA, one with buffers and one with packing 3D into 2D textures. Both should in theory work on NVIDIA cards, though maybe not as fast as the 3d-image version. But I encountered a lot of problems with NVIDIAs OpenCL implementation. I even got syntax errors in their generated assembly (PTX). This was a long time ago, so I might revisit my old implementation and see if works with a newer version of NVIDIAs OpenCL implementation.
  
  Reply
  - Henry says:
    
    February 5, 2012 at 10:22
    
    Can you try this NVIDIA compatible version, in order to verify if it works? i’m doing a “state of the art” for opencl surface reconstruction algorithm and i would appreciate it if you could make this code available. Tks a lot
    
    Reply
    - Adam says:
      
      April 21, 2016 at 16:47
      
      Hello Henry,
      i’m also doing a state of the art about GPU surface reconstruction, it would be so helpful if you can give me your previous state of the art.
      Thank you.
      
      Reply
Jesse says:

January 4, 2012 at 16:35

Thanks for the response, your paper very clearly answered my questions.

Reply
Jesse says:

January 4, 2012 at 15:10

Great post! Could you explain briefly why the first 5 or so planes of the 3d image only use 8 and 16 bit integers instead of 32 like the rest? It begins on line 347 in gpu-mcc.cpp. Thanks.

Reply
- Erik Smistad says:
  
  January 4, 2012 at 15:17
  
  Two reasons: reducing memory usage and the amount of data transfered from global memory to the chip.
  
  The max sum for each cell in the first HP level is 5, thus 8 bits is more than enough. Since the HP sum 2x2x2=8 cells into a cell in the next level, max sum for the second level 8*5=40 which also is within the 8 bit limit (256). The third level has to be able to store numbers up to 320, thus 16 bit is needed and so on.
  
  See section 3.2 in http://www.mendeley.com/download/public/2730891/4440929055/23c4462b09e9d9d17392a861b3bca6697f0c0d4c/dl.pdf for more details
  
  Reply