Abstract
Minimally invasive robotic surgery offers benefits such as reduced physical trauma, faster recovery, and less pain for the patient. For these procedures, visual and haptic feedback to the surgeon is crucial when surgical tools are operated with a robot without direct line of sight. External force sensors are biased by friction at the tool shaft and therefore cannot estimate forces between the tool tip and the tissue. As an alternative, vision-based force estimation was proposed. Here, interaction forces are learned directly from the deformation observed by an external imaging system. Recently, an approach based on optical coherence tomography and deep learning has shown promising results. However, most experiments are performed on ex-vivo tissue. In this work, we demonstrate that models trained on dead tissue do not perform well on in vivo data. We performed multiple experiments on a human tumor xenograft mouse model, both on in vivo, perfused tissue and on dead tissue, and compared two deep learning models in different training scenarios. Training on perfused, in vivo data improved model performance by 24% for in vivo force estimation.
Problem
Robot-assisted surgery for minimally invasive interventions has become popular since physical trauma can be reduced through motion compensation and scaling [1]. Over the last decade, visual feedback for these systems has improved considerably through image fusion of preoperative data and head-mounted displays. However, many systems still lack force feedback, which can be beneficial during surgical tasks to avoid malfunction or damage to organs as well as to distinguish tissues with respect to type and condition [2].
Force feedback can be enabled through electro-mechanical force sensors attached to the tool base outside of the surgical field. However, such measurements are biased by friction forces at the tool shaft, which is undesirable. Gessert et al. [3] proposed a miniature force sensor integrated into the tool tip. These sensors can be problematic with respect to sterilization, biocompatibility, and integration into microsurgical instruments with a working channel. Therefore, vision-based force estimation was proposed as a contact-free alternative.
Previous approaches to vision-based force estimation include deformable template matching methods [4] and mechanical deformation models [5]. These methods are mainly based on single images of the sample. A different, more recent approach is to include temporal information in the force estimation model. This provides a more realistic scenario since in vivo tissue is always in motion due to pulsation, breathing, and the force interaction between surgical tools and tissue. Temporal information can be modeled efficiently with convolutional neural networks (CNNs) or recurrent neural networks (RNNs), which has been demonstrated with RGB(D) images [6].
Recently, optical coherence tomography (OCT) was proposed as an imaging modality for vision-based force estimation, as it provides high spatial and temporal resolution. The feasibility of mapping OCT surface deformation to forces was demonstrated in [7]. Furthermore, learning force estimates from full OCT volumes with CNNs has been studied [8], [9], where promising results were achieved on ex-vivo data.
Predicting forces acting on ex-vivo tissue surrogates is always limited to a feasibility study in the laboratory. Such measurements do not reflect the complex physiological and biomechanical properties of in vivo tumor tissue, which exhibits a different elastic response than ex-vivo tissue [10]. A static laboratory setup neglects properties such as the soft tissue surrounding the tumor, pulsation, breathing, muscle twitches, and speckle characteristics.
In this paper, we investigate vision-based force estimation in a human tumor xenograft mouse model. We employ a high-speed OCT imaging device to acquire OCT volumes at a high temporal rate and two different 4D CNNs that process the high-dimensional 4D spatio-temporal OCT data to predict forces acting on the tissue. We investigate how the deep learning models perform in an in vivo setting when trained on either perfused or dead tissue data. Force estimation has been studied for different tissue types; however, there are no studies on tumor tissue so far.
Material and methods
Experimental setup
For data acquisition, the experimental setup shown in Figure 1 was designed. A robot (H-820.D1, Physik Instrumente) was employed for positioning the OCT field of view (FOV) relative to the tumor. Note that the position of an acquired volume does not change relative to the world reference system. Rather, the mouse, which is fixed with tape to a heated bed mounted on the robot, is moved relative to the static FOV.

Figure 1: Experimental setup for data acquisition. Left: An OCT scan head (A) for volume acquisition was employed. A robot (B) drives a heated bed (E) on which the mouse is fixed. Tumor tissue is deformed with a needle (C), which is driven by a stepper motor along the needle axis. The force sensor (D) is mounted between the stepper motor and the needle. Right: The skin was carefully dissected and the anesthetized mouse was fixed with tape to the heated bed (left image); the right image indicates the approximate position of the field of view (FOV).
Xenograft mouse model
Experiments were conducted on pathogen-free BALB/c severe combined immunodeficient (SCID) mice (Charles River, Wilmington, MA, USA). They were housed in individually ventilated cages and provided with sterile water and food ad libitum. For injection,
Data acquisition and datasets
Each mouse was anesthetized and the skin above the subcutaneous tumor was carefully removed with a scalpel. Next, the mouse was fixed to the heated bed, which can be easily mounted to the robot. For each experiment, the robot positions the FOV on the tumor (Figure 1, right). Next, the tissue surface was detected by advancing the needle along its shaft direction until a force of 0.02 N was registered. The tumor was then palpated by advancing and retracting the needle over a distance of 2 mm while OCT volumes were acquired continuously. For data variation, we performed palpations at five different velocities ranging from 0.3 mm/s to 0.7 mm/s. Tissue deformation of perfused tissue was compared to experiments performed on dead tissue. We refer to the acquired data as Ante-Mortem (AM) and Post-Mortem (PM) datasets, respectively. In total, 10 AM and 10 PM datasets were acquired from five mice.
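The acquisition procedure can be illustrated with a short sketch: advance the needle until the 0.02 N contact threshold is registered, then palpate over 2 mm at a chosen velocity. This is not the authors' control code; read_force and move_needle are hypothetical placeholders for the force sensor and stepper motor interfaces, and the simulated force response is purely illustrative.

```python
# Illustrative sketch of the palpation protocol, assuming hypothetical hardware interfaces.
import time

CONTACT_THRESHOLD_N = 0.02                   # contact force threshold for surface detection
PALPATION_DEPTH_MM = 2.0                     # insertion/retraction distance
VELOCITIES_MM_S = [0.3, 0.4, 0.5, 0.6, 0.7]  # palpation velocities used for data variation
STEP_MM = 0.01                               # step size of the simulated stepper motor

position_mm = 0.0                            # simulated needle position along its axis

def read_force():
    # Placeholder: simulated force that rises linearly after 1 mm of travel.
    return max(0.0, (position_mm - 1.0) * 0.05)

def move_needle(delta_mm):
    # Placeholder for the stepper motor command; here it only tracks position.
    global position_mm
    position_mm += delta_mm

def detect_surface():
    """Advance along the needle axis until the contact threshold is registered."""
    while read_force() < CONTACT_THRESHOLD_N:
        move_needle(STEP_MM)

def palpate(velocity_mm_s):
    """Insert and retract the needle over the palpation depth at a given velocity."""
    step_duration_s = STEP_MM / velocity_mm_s       # pacing that realizes the commanded velocity
    n_steps = int(PALPATION_DEPTH_MM / STEP_MM)
    for direction in (+1, -1):                      # insertion, then retraction
        for _ in range(n_steps):
            move_needle(direction * STEP_MM)
            time.sleep(step_duration_s)             # OCT volumes are acquired continuously

detect_surface()
palpate(VELOCITIES_MM_S[0])
```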
For deep learning model training and evaluation, a 10-fold cross-validation (CV) scheme was employed. Each fold contains approximately 6,000 volumes and represents one experiment with the five different velocities at a different location on the tumor. Iteratively, we leave out one fold for validation and one fold for testing, and train the deep learning model on all remaining folds. The final performance is expressed as the mean across all test folds.
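A minimal sketch of this scheme is shown below: in each of ten iterations, one fold is held out for validation, one for testing, and the model is trained on the remaining eight. The exact pairing of the two held-out folds and the train_and_evaluate routine are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the 10-fold scheme with separate validation and test folds (illustrative only).
import numpy as np

def train_and_evaluate(train_folds, val_fold, test_fold):
    """Placeholder: train on train_folds, tune on val_fold, return the test-fold error."""
    return np.random.rand()  # stands in for the real test error in mN

n_folds = 10
test_errors = []
for i in range(n_folds):
    val_fold = i
    test_fold = (i + 1) % n_folds  # assumed pairing of the two held-out folds
    train_folds = [f for f in range(n_folds) if f not in (val_fold, test_fold)]
    test_errors.append(train_and_evaluate(train_folds, val_fold, test_fold))

print("mean test error across folds:", np.mean(test_errors))
```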
Deep learning architectures
A sequence of 3D OCT volumes represents 4D data that needs to be processed. For this purpose, we employ 4D spatio-temporal CNNs that perform simultaneous spatial and temporal processing. Due to their high-dimensional nature, 4D CNNs are very parameter-intensive, which increases the risk of overfitting. Therefore, a second, more efficient variant that uses factorized convolutions was employed [11]. Here, spatial and temporal processing are decomposed into separate kernels. Thus, a fully 4D convolutional kernel of size kt×kx×ky×kz is factorized into a purely spatial kernel of size 1×kx×ky×kz followed by a purely temporal kernel of size kt×1×1×1, which substantially reduces the number of parameters. The two resulting architectures, ResNet4D and fResNet4D, are built from ResNet blocks [12] and are shown in Figure 2.

Figure 2: The two 4D CNN architectures employed. The red boxes represent ResNet blocks with 4D convolutions.
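To illustrate the factorization, the following PyTorch sketch implements a single factorized spatio-temporal convolution on inputs of shape (batch, channels, time, depth, height, width). Layer and kernel sizes are illustrative and do not correspond to the authors' exact architecture.

```python
# Minimal sketch of a factorized 4D (spatio-temporal) convolution block (illustrative sizes).
import torch
import torch.nn as nn

class FactorizedConv4d(nn.Module):
    """Replaces a full 4D kernel with a spatial 3D kernel applied per time step
    followed by a temporal 1D kernel applied per voxel."""
    def __init__(self, in_ch, out_ch, k_spatial=3, k_temporal=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, k_spatial, padding=k_spatial // 2)
        self.temporal = nn.Conv1d(out_ch, out_ch, k_temporal, padding=k_temporal // 2)

    def forward(self, x):
        b, c, t, d, h, w = x.shape
        # Spatial convolution: fold the time axis into the batch axis.
        x = x.transpose(1, 2).reshape(b * t, c, d, h, w)
        x = self.spatial(x)
        c_out = x.shape[1]
        # Temporal convolution: fold all spatial locations into the batch axis.
        x = x.reshape(b, t, c_out, d, h, w).permute(0, 3, 4, 5, 2, 1)
        x = x.reshape(b * d * h * w, c_out, t)
        x = self.temporal(x)
        # Restore the (batch, channels, time, depth, height, width) layout.
        x = x.reshape(b, d, h, w, c_out, t).permute(0, 4, 5, 1, 2, 3)
        return x

if __name__ == "__main__":
    # Example: a sequence of 4 small single-channel OCT volumes.
    dummy = torch.randn(2, 1, 4, 32, 32, 32)
    block = FactorizedConv4d(1, 8)
    print(block(dummy).shape)  # torch.Size([2, 8, 4, 32, 32, 32])
```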
Results
Table 1 shows the results for tool tip force predictions from the two 4D CNN architectures. For all experiments, we report the mean absolute error (MAE) in mN, the mean absolute error relative to the target's standard deviation (rMAE), and the average correlation coefficient (ACC). Training and evaluating only on PM data or only on AM data yields similar ACCs of 0.69 and 0.67, respectively. If PM data is used for training and evaluation is performed on AM data, the ACC drops to 0.36. The results are similar for both architectures. Example predictions are shown in Figure 3. Clearly, predictions on perfused tissue are best if training is also performed on AM datasets.
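For reference, the reported metrics can be computed as follows for a measured and a predicted force trace. The definitions match the descriptions above (MAE, MAE divided by the target standard deviation, and the mean Pearson correlation over test folds) but are a sketch rather than the authors' evaluation code.

```python
# Sketch of the evaluation metrics on 1D force traces (not the authors' evaluation code).
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error in mN."""
    return np.mean(np.abs(y_true - y_pred))

def rmae(y_true, y_pred):
    """MAE relative to the standard deviation of the target forces."""
    return mae(y_true, y_pred) / np.std(y_true)

def acc(fold_pairs):
    """Average Pearson correlation coefficient across test folds."""
    return np.mean([np.corrcoef(t, p)[0, 1] for t, p in fold_pairs])

# Example with a synthetic force trace:
t = 5.0 * np.sin(np.linspace(0, 10, 500))    # "measured" forces in mN
p = t + np.random.normal(0.0, 0.5, t.shape)  # "predicted" forces in mN
print(mae(t, p), rmae(t, p), acc([(t, p)]))
```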
Table 1: Results for the two deep learning architectures. The training and evaluation sets include either AM or PM datasets, referring to data acquisition in perfused and dead tissue, respectively.
| Architecture | Train. | Eval. | MAE (mN) | rMAE | ACC |
|---|---|---|---|---|---|
| ResNet4D | PM | PM | | | 0.69 |
| ResNet4D | AM | AM | | | 0.67 |
| ResNet4D | PM | AM | | | 0.36 |
| fResNet4D | PM | PM | | | 0.64 |
| fResNet4D | AM | AM | | | 0.69 |
| fResNet4D | PM | AM | | | 0.34 |

Figure 3: Example predictions of tool tip forces acting on perfused tumor tissue. The model was trained with Post-Mortem data (blue) or Ante-Mortem data (black).
Discussion and conclusion
Our results show that 4D CNNs can predict forces in perfused tumor tissue with an error of 4.8 mN. Furthermore, our results indicate that CNNs trained on dead tissue perform poorly when applied to perfused tissue. Note that all experiments were performed on tumors embedded in soft tissue. Hence, the change in physiological properties such as perfusion and breathing motion between the vital and the dead tissue state strongly influences the acquired datasets.
In summary, we find that vision-based force estimation on in vivo data with deep learning models is heavily influenced by the training data. In vivo force estimation performs substantially better when models are trained on perfused, in vivo data.
Research funding: This work was partially funded by the TUHH i3 initiative and partially by DFG SCHL 1844/2-2.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Conflict of interest: Authors state no conflict of interest.
Informed consent: Informed consent has been obtained from all individuals included in this study.
Ethical approval: The research related to human use complies with all the relevant national regulations, institutional policies and was performed in accordance with the tenets of the Helsinki Declaration, and has been approved by the ethics committee of the Medical Center Hamburg-Eppendorf.
References
1. Song, SE. Robotic interventions. In: Handbook of medical image computing and computer assisted intervention. Elsevier; 2020:841–60. https://doi.org/10.1016/B978-0-12-816176-0.00039-9.
2. Diana, M, Marescaux, J. Robotic surgery. Br J Surg 2015;102:e15–28. https://doi.org/10.1002/bjs.9711.
3. Gessert, N, Priegnitz, T, Saathoff, T, Antoni, ST, Meyer, D, Hamann, M, et al. Needle tip force estimation using an OCT fiber and a fused convGRU-CNN architecture. In: MICCAI. Springer; 2018:222–9. https://doi.org/10.1007/978-3-030-00937-3_26.
4. Greminger, MA, Nelson, BJ. Vision-based force measurement. IEEE TPAMI 2004;26:290–8. https://doi.org/10.1109/tpami.2004.1262305.
5. Mozaffari, A, Behzadipour, S, Kohani, M. Identifying the tool-tissue force in robotic laparoscopic surgery using neuro-evolutionary fuzzy systems and a synchronous self-learning hyper level supervisor. Appl Soft Comput 2014;14:12–30. https://doi.org/10.1016/j.asoc.2013.09.023.
6. Marban, A, Srinivasan, V, Samek, W, Fernández, J, Casals, A. A recurrent convolutional neural network approach for sensorless force estimation in robotic surgery. Biomed Signal Process Contr 2019;50:134–50. https://doi.org/10.1016/j.bspc.2019.01.011.
7. Otte, C, Beringhoff, J, Latus, S, Antoni, ST, Rajput, O, Schlaefer, A, et al. Towards force sensing based on instrument-tissue interaction. In: MFI 2016. IEEE; 2016:180–5. https://doi.org/10.1109/MFI.2016.7849486.
8. Gessert, N, Beringhoff, J, Otte, C, Schlaefer, A. Force estimation from OCT volumes using 3D CNNs. IJCARS 2018;13:1073–82. https://doi.org/10.1007/s11548-018-1777-8.
9. Gessert, N, Bengs, M, Schlüter, N, Schlaefer, A. Deep learning with 4D spatio-temporal data representations for OCT-based force estimation. Med Image Anal 2020;64:101730. https://doi.org/10.1016/j.media.2020.101730.
10. Carter, FJ, Frank, TG, Davies, PJ, McLean, D, Cuschieri, A. Measurements and modelling of the compliance of human and porcine organs. Med Image Anal 2001;5:231–6. https://doi.org/10.1016/s1361-8415(01)00048-2.
11. Sun, L, Jia, K, Yeung, DY, Shi, BE. Human action recognition using factorized spatio-temporal convolutional networks. In: Proceedings of the IEEE ICCV; 2015:4597–605. https://doi.org/10.1109/ICCV.2015.522.
12. He, K, Zhang, X, Ren, S, Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE CVPR; 2016:770–8. https://doi.org/10.1109/CVPR.2016.90.
© 2020 Maximilian Neidhardt et al., published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.