3D Reconstruction for Stereo Endoscopic Video

[ deep-learning computer-vision augmented-reality hololens ] 13 Dec 2019

Stereoscopic endoscope is used for da Vinci robotic surgery. Some surgeons also use them in laparoscopic surgeries. It provides depth perception to the users, who either sit in the robot console, or watch a stereoscopic monitor with a pair of shutter glass.

It has been a long desire to reconstruct the 3D surgical scene. With this capability, computers are able to better understand the procedure, to provide analytics and generate artificial intelligence. In addition, it can improve the visualization of endoscopy, in combination with mixed reality technologies.

Throughout my study, I have attempted to tackle the 3D reconstruction of endoscopic video using two methods, via traditional computer vision algorithms and deep learning.

Traditional Approach

The key step to 3D reconstruct from stereo images is to compute the disparity between the image pair. The depth of each pixel could be explicitly calculated from the disparity value, given the camera intrinsic parameters. In ARAMIS, we use the traditional approach to calculate the disparity map, semi-global matching algorithm (SGM). In SGM, the algorithm minimizes the matching cost and the disparity smoothness between adjacent pixels.

SGM works reasonably well and reasonably fast. However, to use it for real-time augmented reality visualization, the disparity calculation needs to be done within tens of milliseconds. We decide to use parallel programming (CUDA) to further optimize the time efficiency. Parallel programming can drastically help with the Cost Aggregation, which is done individually for each pixel across the image. As a side effect, all the preprocessing of the image also took advantage of the paralleled computation, e.g. image smoothness and grayscaling.

In the end, our implementation achieves 16.7 milliseconds to compute a dense disparity map for 1080x680 input images. With real-time streaming and rendering, ARAMIS was evaluated to be helpful for users to perform laparoscopic tasks.

More details about the implementation is available in our recently published MICCAI paper: ARAMIS, which I presented in Shenzhen, China.

Deep Learning?

Together with Xiran Zhang and Maxwell Li, we participated in the 2019 MICCAI EndoVis challenge: the SCARED sub-challenge. SCARED stands for Stereo Correspondence and Reconstruction of Endoscopic Data. One particular issue with stereo reconstruction for medical scene is the lack of data with ground truth. Previous popular datasets are KITTI for autonomous driving and Middlebury Stereo Dataset for general scene. Thank Intuitive Surgical for collecting the data and release for an open challenge!

We read many recent papers on the topic, and decided to use Pyramid Stereo Matching Network (PSMNet) to prepare for the challenge, as our survey suggested that PSMNet has been chosen as the backbone for many recent works of good performence.

The main innovation in PSMNet is the spatial pyramid pooling module, and dilated convolution in the feature extraction stage. By using the dilated convolution during feature extraction, we have a larger receptive field in image without increasing the number of parameters. By using spatial pyramid pooling module, the network can integrate the information of image of different scale and the bilinear interpolation in SPP won’t introduce new parameters which will speed up the running time.

The multiple stacked hourglass architecture (encoder-decoder) in 3D convolution helps to differentiate the feature pair in different disparity regions.

Data Preprocessing

While visual screening through the dataset, we found out that the ground truth depth map did not perfectly match the RGB images. There is perceivable misalignment between the depth and color channels! With some investigation, we realized that the depth map is the result of interpolation with robotic kinematic between keyframes. In order to guarantee good input data, we proposed to fix the misalignment issue with data preprocessing.

We first use traditional method (SGM) mentioned earlier in this post to estimate a point cloud from the color image pairs. Then we try to register the point cloud from SGM with the “ground truth” point cloud, with a Euclidean transformation matrix. Afterwards, we apply the transformation to correc the “ground truth”.

The data are then fed into the network for training. We use pre-trained weights from Scene Flow dataset, and then fine-tune on SCARED dataset. Here are two clips of sample visualization.

As can be seen, the reconstructed scene is smoother, without the noise from SGM. However, it does not run as efficient yet. Our team ranked 4th place in the challenge.

However, we know that there are still many challenges for such deep learning algorithm to work efficiently and reliably. It is our future work to investigate how to combine the advantages of traditional methods and deep learning. The perfromance of current algorithms are also limited by the size of dataset. There is still much to be done for the community.

Thank you for reading!

Teleporting the Expert Surgeon into Your OR

[ augmented-reality hololens2 azure-kinect unity3d computer-vision ] 29 Aug 2019

The Medical Augmented Reality Summer School (MARSS) 2019 was successfully hosted in Balgrist Hospital, Zurich, during Aug 5-16.

It was a two-week event. Over the second week, participants formed teams and developed medical augmented reality projects. Our team worked on the project Teleporting the Expert Surgeon into Your OR. It was a quite successful Hackathon project with many exciting memories, and luckily, we won the Audience Award of the summer school.

Above is our project demo video. In this blog post, I would like to share with you my experiences of this project and the summer school.

The Idea

Telepresence is not a new idea in the mixed reality community. Imagine a “3D Skype” app, where you can see your friend walking around in your own 3D space. In the medical domain, such fun idea can be actually very useful. We propose to teleport an expert surgeon into the operating room of a less experienced young surgeon. Via realistic 3D telepresence, the expert surgeon can explain to the young surgeon about the procedure, and provide hand-over-hand guidance.

Here is the picture of our project illustration during brainstorming.

New Toys

Everyone gets his or her own excitement from the summer school. For me, the most exciting part is to have hands-on experience on the new or even unreleased devices: Azure Kinect and HoloLens 2. Thanks to Nassir, Microsoft is convinced to borrow us a few HoloLens 2 devices for development.

The pipeline of our system is pretty clear:

With Azure Kinect, we capture the real-time RGB and depth image of the surgeon.
Substract the background of the depth image
We convert the RGB and depth image into a colored point cloud, with proper calibration parameters.
The point cloud is streamed to the HoloLens
Visualize the point cloud on HoloLens.

The GPU of HoloLens 2 is much improved over the first generation, which could support the rendering of MORE points. While I was developing ARAMIS (to appear in MICCAI 2019), where a point cloud is streamed to HoloLens 1, the major performance bottleneck I identified is the visualization of points, instead of the bandwidth of streaming. HoloLens 2 perfectly solves this by offering a better GPU.

The field-of-view of HoloLens 2 is great. It becomes even more obvious when I came back in the lab and put on HoloLens 1. The interaction paradigm with HoloLens 2 is completely different. It is so natural that you can interact with holograms with your intuition. The system understands your intention very well by tracking your hands.

The Iterations

We went through three development phases within the Hackathon week.

In the first phase, we developed a fake PC-based server to generate arbitrary point cloud, and a point cloud client on Unity Editor. The first phase sets a starting point for everyone to separately work on their own parts.

In the second phase, Federica worked on the PC server, including point cloud generation from Azure Kinect, background substraction, interations to tune the behavior of the server. I worked on the HoloLens 2 program to receive, and visualize the point cloud. Arnaud worked on the user interation on HoloLens 2 using MRTK. Christ worked on the clinical part and designed the demo.

In the third phase, everything got combined together, optimized and we made the demo video (of course, over the night).

The Demo

The last day of summer school was dedicated to demos. We saw the excitement from the faces of our audience.

Dr. Farshad acting the expert surgeon (he is) to his fellow:

A virtual me shaking hand with participants:

A virtual me doing high-five with participants:

Our Team

I feel very lucky to work with these enthusiastic people during the summer school.

We are group 14.

Team mentor： Prof. Ulrich Eck.
Clinical partner: Dr. Christian Hofsepian
Industrial partner: Dr. Federica Bogo
Engineering student: Arnaud Allemang-Trivalle and me.