Marching Cubes implementation using OpenCL and OpenGL
In a school project I recently created a fast implementation of Marching Cubes that uses OpenCL to extract surfaces from volumetric datasets and OpenGL to render the surfaces on screen. I wrote a paper together with my two supervisors about the implementation and presented it at the Joint Workshop on High Performance and Distributed Computing for Medical Imaging at the MICCAI 2011 conference. Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory. The paper entitled “Real-Time Surface Extraction and Visualization of Medical Images using OpenCL and GPUs” describing the implementation can be downloaded here. The source code of the implementation can be downloaded from my GitHub page.
Marching Cubes (MC) is a well known algorithm, introduced by Lorensen and Cline  for extracting surfaces from 3D datasets. It works by dividing the entire dataset into a grid of cubes. Triangles are then created in each cube by determining if each corner in the cube is inside or outside the object. Because 2^8=256 different cube configurations are possible, an efficient implementation using a table lookup of where to put the triangles is possible.
Marching Cubes in parallel on the GPU
MC is a completely data parallel algorithm since each cube can be calculated independently on the other cubes and thus ideal for running on GPUs. The problem with running MC in parallel is how to store the triangles in parallel. The dataset is usually very sparse, only a few percentage of the cubes actually produce output. Assuming that all cubes produce output will quickly exhaust the limited GPU memory. Thus we need some way to remove all the cubes that don’t produce any output. This is a common problem in GPU applications and the solution is often termed “Stream Compaction”, which refers to removing unwanted elements from a stream of elements.
Several stream compaction methods exists. NVIDIA for instance use prefix sum scan in their CUDA and OpenCL MC implementations. In our implementation we used a stream compaction method called Histogram Pyramids by Ziegler et al.  which have been used by Dyken et al.  on MC. This data structure has the benefit of exploiting the texture caching mechanisms on GPUs.
We compared performance of the implementation against NVIDIAs and Dyken et al.  implementations and got the following conclusions:
- Our implementation can handle larger datasets due to a more compressed storage format of the Histogram Pyramids
- Should be applicable for other frameworks as CUDA and shader programming also
- Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory.
- For smaller datasets our implementation is slower than other methods
- This was found to be mainly due to an expensive OpenGL-OpenCL synchronization where the CPU is used
- OpenCL and OpenGL extensions are proposed to deal with this
- Hopefully this synchronization will happen on the GPU in the future
- NVIDIA GPUs don’t support writing to 3D textures. It should be possible to pack the data into a 2D texture instead, but problems with NVIDIAs current OpenCL implementation has haltered this. Instead our implementation stores the results in regular buffers and then copy the contents to 3D textures afterwards. This is slower and use a bit more memory, but it works and is reasonably fast.
- Histogram Pyramids is a very efficient stream compaction method well suited for GPUs
- OpenCL is a good framework for writing cross-platform GPU applications
See our paper for more details. The source code of the implementation can be downloaded from my GitHub page
 Lorensen W E, Cline H E. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics. 1987;21(4)
 Ziegler G, Tevs A, Theobalt C, Seidel H-P. On-the-fly Point Clouds through Histogram Pyramids. Vision, modeling, and visualization. 2006, 137-144
 C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel. High-speed Marching Cubes using HistoPyramids. Computer Graphics Forum, 27(8):2028–2039, Dec. 2008.
i want to ask you a question
haw can i display a lot of imgs in viisuel studio + openGL + openCV
because i use all of this in marching cube in my school
I have no idea
pllz help me
thnx for u’r attention sir <3
Hi Erik, Are you aware of if any Nvidia card now is supporting 3d texture writing. Thank you.
As far as I know, there are no NVIDIA cards that support 3D texture writing with OpenCL. It is, however, supported if you use CUDA. Hence, the hardware supports it, they just haven’t bothered implementing it in OpenCL.
can someone please post how do we use the mc algorithm for surface rendering?
many many thanks for sharing! it’s a very powerfull algorithm! i love you so much! 🙂 …thanks thanks thanks!
Hi Erik, I noticed the marching cube result is different at each running, which seems random. The number of triangles is also different. Is there a particular reason causing this. The isosurface result should be the same each time. Thank you.
It should not be random, I have not experienced this.
The source of randomness is probably that the program is reading out of bounds or some other uninitialized memory.
I will have to investigate this.
What is the size of your volume?
The size is 128x128x128. The overall shape is the same. But it seems at some voxels, the triangles each time generated are different. It happens on the whole surface. I can observe this by showing the surface and keep updating it from the same volume. I am using a binary volume only with 0 and 255, which I don’t know if it is the reason. Also if I print out the number of triangles generated, each time it is different but it only differs a few. It should not be a out of bound problem, since the isosurface is located very inside of the volume.
I am not able to reproduce the error. What GPU are you using and could you upload the volume somewhere?
Erik, I am currently on vacation. I will get back to you later. Thank you very much.
Erik, I found the error was in my code. Sorry for the confusion. Thanks a lot!
When I use OPenCL in my application, there are lots of errors about the cl:size_t and size_t,as following:
qglobal.h(1908) : error C2872: ‘size_t’ : ambiguous symbol could be ‘predefined C++ types (compiler internal)(19) : unsigned int size_t’
or ‘predefined C++ types (compiler internal)(19) : size_t’
or ‘\include\CL/cl.hpp(683) : cl::size_t’
How can I deal with these kind of problems?
Thank you !
You can avoid using “using namespace cl;” in your application
I have another question. Do we must keep the raw data is a cube, such as 64*64*64, can it be 64*128*256? And the raw data size must be power of 2, if 129*129*129 can be run well?
The input raw data does not have to be a power of 2 or equal in each dimension.
Thank you for your help. I will try the different size of raw data to test.
Your application is useful for me.
I want get the GPU results to CPU, such as all of the vertex, normal and the triangles index. I try to use “queue.enqueueReadBuffer(cubeIndexesBuffer, CL_TRUE, 0, sizeof(ushort)*num, triangleIndex);” or “queue.enqueueReadBuffer(VBOBuffer, CL_FALSE, 0, sizeof(cl_float)*num, vertex);”, but the value sames not as my expects.
Can you tell me the method to get the surface info, vertex, normal and triangle index.
Thank you .
The vertices and normals are stored in the VBO. The vertices and normals are interleaved and sorted for each triangle:
VBO = [x00,y00,z00,nx00,ny00,nz00, x01,y01,z01,nx01,ny01,nz01, x02,y02,z02,nx02,ny02,nz02, x10,y10,z10,nx10,ny10,nz10, …]
where x00 is the x position of vertex 0 in triangle 0, x10 is x position of vertex 0 in triangle 1. [nx,ny,nz] are the corresponding normal vectors.
You can then extract the VBO data in this way:
int num = 3*2*nrOfTriangles;
float* VBOdata = new float[num];
queue.enqueueReadBuffer(VBOBuffer, CL_TRUE, 0, sizeof(cl_float)*num, VBOdata);
The cube index buffer on the other hand is a flat 3D structure and has the same size as the raw data. Also, it is stored as uchar not ushort.
Thank you Erik. The method results are same as mine. Maybe I should analyze the results carefully before, in order to avoid doubting the results.
Thank you again.
Hi Erik, thank you very much for this !
I have compiled it under visual studio, but i had to replace all the log2 calls because my visual studio 12 express doesn’t know them. (replace log2 with log and i divide by log(2.0))
And when running, enqueueReadBuffer() line 113 in gpu-mc.cpp returns a -30 error.
Maybe my error is related to the log2 replacements i’ve made …
To be sure, for example, when you put (log2((float)SIZE)) (line 398), and if SIZE=256, this equals to 8, no ?
Thank you, i’m struggling so much to get it working on Nvidia configuration
You are right about the log2 function, log(x)/log(2) is the way to do it.
Though, I am not sure about the OpenCL error you get. What is the error message? -30 doesn’t tell me much 🙂
Thank you for your answer !
I’ve attached 3 simple screenshots so that you can see : http://star-en-maths.tv/tmp/tmp.html
1st is the output (i’ve done a try catch around “clEnqueueReadBuffer()”)
2nd is the error location in gpu-mc.cpp
and 3rd is one of the log2 replacements i’ve done. Maybe the runtime error i get is due to those replacements …
I use 256^3 size skull.raw from volvis and 3D textures are not supported by graphic card.
If you can help me, i’ll show you the beautiful results of what i’m coding ^^,
Unfortunately, I am not able to reproduce your error. The error type is “invalid value” which means that some of the arguments you send in to the enqueueReadBuffer function is wrong.
I found some other bugs which I fixed including creating the log2 function which does not exist on windows. See changes here: https://github.com/smistad/GPU-Marching-Cubes/commit/9223cc79c2555134c0d33fc2933519ad8697fc5d
Erik : ) !! It works !
BUT, i had to add some code because you make a “new” without “delete” in your last code !
So i initialize VBOBuffer to nullptr, and then i a add this just before your “new BufferGL” line 625 :
if (VBOBuffer != nullptr)
Indeed, without this, it worked but after pressing “e”, it crashed (histoPyramidConstruction() is called again).
Another small thing on my configuration :
1* I had to define round() because it’s not knowed :
double round(double number)
return number < 0.0 ? ceil(number – 0.5) : floor(number + 0.5);
2* When you replace **HP_SIZE** with actual data size in the .cl file (line 423 : sourceCode = sourceCode.replace(pos, 11, to_string(MY_SIZE));),
oddly it adds some "E**" to it and then .cl file compilation doesn't work at run time. I'm sure this replacement does well for you, but not for me.
So, because i was just trying your code with skull.raw which is 256^3, i dirtily #define SIZE 256 in gpu-mc-morton.cl, just for this to work.
So, thank you very much : )
I changed the code to use -DSIZE=X when compiling instead of doing the string replace operation. I also added the round function.
I experience the same problem with deleting the VBOBuffer. Not sure why. However, the glDeleteBuffers call will the delete the data in the VBOBuffer, so memory leakage should not be a problem there.
Yes, i undestand what you say about the VBOBuffer.
However I think that every “new” deserves a “delete”.
But, in fact, i just replaced your “new VBOBuffer” with a local variable
“BufferGL VBOBuffer(context, CL_MEM_WRITE_ONLY, VBO_ID);” (local to the function histoPyramidTraversal()) and it works well.
I noticed another thing : i tried to make it work with 64x64x64 or 128x128x128 volumes from volvis (for example Fuel or Hydrogen Atom)
and it does not work for me. I remind you that i’m in “no 3d texture” mode. For 256x256x256 skull runs Ok.
Where do you think the error is ?
My bad ! It runs perfectly.
That was me trying some things …
Thank you very much for helping me in the first place, and for your time.
I tried to use it with fltk( fl_gl_window class).
It works, but if I interact with fltk the VBO seems to disapear, only native Opengl code seems to work.
Do you know if OpenCL and your implementation works with fltk?
I have experimented previously with FLTK, OpenGL and OpenCL and that works. See the code for my experiment here: https://github.com/smistad/FLTK-OpenGL-OpenCL-Interoperability
But I have not tried using this marching cubes implementation with FLTK.
I tried to run it in Windows 7.It runs but when I try to change the isovalue I get an exception error in this line
VBOBuffer = BufferGL(context, CL_MEM_WRITE_ONLY, VBO_ID);
Do you have any idea why it crashes because I could not solve it
Actually I found a fix by creating the buffer each time.
BufferGL* VBOBuffer=new BufferGL;
My question is why do you keep this VBOBuffer for each frame?Is it necessary?
It is probably not necessary. But remember to cleanup the object afterwards (delete VBOBuffer) or you will get memory leaks.
Pretty amazing application, digging what can be done with the proper tools. Grabbed the pdf you linked above for further reading.
I discovered that the article linked in this post is an older version. The newer version is here: http://tapironline.no/last-ned/1050
it runs ok on NVIDIA GT240. thankyou for your code.
anyhow ati radeon hd 5145 neither support 3d textures or 32 bits via pointer..
Not all of AMDs/ATIs devices have full OpenCL support. The list of supported devices is here: http://developer.amd.com/tools/hc/AMDAPPSDK/pages/DriverCompatibility.aspx
NVIDIA GT240 cuda 4.0 vs2008 compile ok,run crash!
show some kind of access violation. seems the precompiled c-lglutility wrong? or just can’t run on N-card?
Great! Have u a version compatible with NVIDIA drivers? I mean, not using image_3d writing, but buffer objects instead?
I have tried making two different versions to work on NVIDIA, one with buffers and one with packing 3D into 2D textures. Both should in theory work on NVIDIA cards, though maybe not as fast as the 3d-image version. But I encountered a lot of problems with NVIDIAs OpenCL implementation. I even got syntax errors in their generated assembly (PTX). This was a long time ago, so I might revisit my old implementation and see if works with a newer version of NVIDIAs OpenCL implementation.
Can you try this NVIDIA compatible version, in order to verify if it works? i’m doing a “state of the art” for opencl surface reconstruction algorithm and i would appreciate it if you could make this code available. Tks a lot
i’m also doing a state of the art about GPU surface reconstruction, it would be so helpful if you can give me your previous state of the art.
Thanks for the response, your paper very clearly answered my questions.
Great post! Could you explain briefly why the first 5 or so planes of the 3d image only use 8 and 16 bit integers instead of 32 like the rest? It begins on line 347 in gpu-mcc.cpp. Thanks.
Two reasons: reducing memory usage and the amount of data transfered from global memory to the chip.
The max sum for each cell in the first HP level is 5, thus 8 bits is more than enough. Since the HP sum 2x2x2=8 cells into a cell in the next level, max sum for the second level 8*5=40 which also is within the 8 bit limit (256). The third level has to be able to store numbers up to 320, thus 16 bit is needed and so on.
See section 3.2 in http://www.mendeley.com/download/public/2730891/4440929055/23c4462b09e9d9d17392a861b3bca6697f0c0d4c/dl.pdf for more details