Marching Cubes implementation using OpenCL and OpenGL
In a school project I recently created a fast implementation of Marching Cubes that uses OpenCL to extract surfaces from volumetric datasets and OpenGL to render the surfaces on screen. I wrote a paper together with my two supervisors about the implementation and presented it at the Joint Workshop on High Performance and Distributed Computing for Medical Imaging at the MICCAI 2011 conference. Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory. The paper entitled “Real-Time Surface Extraction and Visualization of Medical Images using OpenCL and GPUs” describing the implementation can be downloaded here. The source code of the implementation can be downloaded from my GitHub page.
Marching Cubes (MC) is a well known algorithm, introduced by Lorensen and Cline  for extracting surfaces from 3D datasets. It works by dividing the entire dataset into a grid of cubes. Triangles are then created in each cube by determining if each corner in the cube is inside or outside the object. Because 2^8=256 different cube configurations are possible, an efficient implementation using a table lookup of where to put the triangles is possible.
Marching Cubes in parallel on the GPU
MC is a completely data parallel algorithm since each cube can be calculated independently on the other cubes and thus ideal for running on GPUs. The problem with running MC in parallel is how to store the triangles in parallel. The dataset is usually very sparse, only a few percentage of the cubes actually produce output. Assuming that all cubes produce output will quickly exhaust the limited GPU memory. Thus we need some way to remove all the cubes that don’t produce any output. This is a common problem in GPU applications and the solution is often termed “Stream Compaction”, which refers to removing unwanted elements from a stream of elements.
Several stream compaction methods exists. NVIDIA for instance use prefix sum scan in their CUDA and OpenCL MC implementations. In our implementation we used a stream compaction method called Histogram Pyramids by Ziegler et al.  which have been used by Dyken et al.  on MC. This data structure has the benefit of exploiting the texture caching mechanisms on GPUs.
We compared performance of the implementation against NVIDIAs and Dyken et al.  implementations and got the following conclusions:
- Our implementation can handle larger datasets due to a more compressed storage format of the Histogram Pyramids
- Should be applicable for other frameworks as CUDA and shader programming also
- Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory.
- For smaller datasets our implementation is slower than other methods
- This was found to be mainly due to an expensive OpenGL-OpenCL synchronization where the CPU is used
- OpenCL and OpenGL extensions are proposed to deal with this
- Hopefully this synchronization will happen on the GPU in the future
- NVIDIA GPUs don’t support writing to 3D textures. It should be possible to pack the data into a 2D texture instead, but problems with NVIDIAs current OpenCL implementation has haltered this. Instead our implementation stores the results in regular buffers and then copy the contents to 3D textures afterwards. This is slower and use a bit more memory, but it works and is reasonably fast.
- Histogram Pyramids is a very efficient stream compaction method well suited for GPUs
- OpenCL is a good framework for writing cross-platform GPU applications
 Lorensen W E, Cline H E. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics. 1987;21(4)
 Ziegler G, Tevs A, Theobalt C, Seidel H-P. On-the-fly Point Clouds through Histogram Pyramids. Vision, modeling, and visualization. 2006, 137-144
 C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel. High-speed Marching Cubes using HistoPyramids. Computer Graphics Forum, 27(8):2028–2039, Dec. 2008.