Skip to content
BY-NC-ND 3.0 license Open Access Published by De Gruyter March 28, 2014

Humanoid Robot Navigation: Getting Localization Information from Vision

  • Emilie Wirbel EMAIL logo , Silvère Bonnabel , Arnaud de La Fortelle and Fabien Moutarde


In this article, we present our work to provide a navigation and localization system on a constrained humanoid platform, the NAO robot, without modifying the robot sensors. First, we tried to implement a simple and light version of classic monocular Simultaneous Localization and Mapping (SLAM) algorithms, while adapting to the CPU and camera quality, which turned out to be insufficient on the platform for the moment. From our work on keypoints tracking, we identified that some keypoints can be still accurately tracked at little cost, and used them to build a visual compass. This compass was then used to correct the robot walk because it makes it possible to control the robot orientation accurately.

Keywords: humanoid; SLAM; vision

1 Introduction

This article presents recent work that extends the results presented in Wirbel et al. [20] and proposes new algorithms for SLAM based on quantitative data.

1.1 Specificities of a Humanoid Platform: The NAO Robot

All the following methods have been applied to the humanoid robot NAO. This robot is an affordable and flexible platform. It has 25 degrees of freedom, and each motor has a magnetic rotary encoder (MRE) position sensor, which makes proprioception possible with a good precision (0.1° per MRE). Its sensing system includes, in particular, two color cameras (see Figure 1). It is equipped with an on-board Intel ATOM Z530 1.6 GHz CPU, and programmable in C++ and Python. A stable walk API is provided; however, it is based only on the joint position sensors and the inertial unit (see Ref. [6]). It is affected by the feet slipping: the robot orientation is often not precise. This API makes it possible to control the robot position and speed.

Figure 1 The Humanoid Platform NAO.(A) The NAO NextGen robot. (B) On-board sensors and actuators.
Figure 1

The Humanoid Platform NAO.

(A) The NAO NextGen robot. (B) On-board sensors and actuators.

Working on such a platform has undeniable advantages in terms of human–robot interaction, or more generally with the environment but it also has specific constraints. The reduced computational power, limited field of view, and resolution of the camera and unreliable odometry are very constraining factors. The aim here is to get some localization information without adding any sensor and still being able to run online on the robot.

1.2 Related Work

SLAM is a recurrent problem in robotics. It has been extensively covered, using metric, topological, or hybrid approaches. However, the existing algorithms often have strong prerequisites in terms of sensors or computational power. Our goal here is to find a robust method on a highly constrained platform that has not been specifically designed for navigation, such as the NAO robot.

Metric SLAM is the most common type of SLAM. Most rely on a metric sensor, such as a laser range sensor, a depth camera, a stereo pair, an array of sonars, etc. Thrun [19] presents a survey of some algorithms in that category: Kalman filter based such as FastSLAM by Montemerlo et al. [13], expectation–maximization algorithms by Thrun [18], occupancy grids (originally from Elfes [5]), etc.

Vision-based metric SLAM relies on landmark position estimation to provide the map, using mostly Kalman-based approaches. For example, Chekhlov et al. [3] implement a SLAM based on unscented Kalman filter for position estimation and SIFT features for tracking: this method is designed to deal with high levels of blur and camera movements. However, these algorithms require having a calibrated camera or a high resolution and frame rate, which is not available on the NAO: the extrinsic parameters of the camera vary from one robot to another. A VGA resolution is usually enough for most algorithms; however, most existing algorithms use either a calibrated camera or a wider field of view than what is available on NAO. A higher resolution is available, but at a limited frame rate and at the cost of higher CPU usage. It is also worth noting that most handheld visual SLAM algorithms implicitly rely on the hypothesis that the tracked keypoints are high above the ground and so of good quality and relatively easy to track. Among other related work, PTAM is very promising. The initial algorithm was described by Klein and Murray [8], improved and optimized for a smartphone in another work by the authors [9], which is a setup close to NAO in terms of computing power and camera field of view, but does not seem to be fully reliable yet.

The previous methods use sparse keypoint information. It is also possible to use dense visual information to obtain qualitative localization. Matsumoto et al. [12] implement navigation based on omni-directional view sequences. This kind of approach is heavier in terms of computational power, but more robust to blur and less sensitive to texture quality. This is why it is also an interesting trail to explore.

Several approaches already exist to provide the NAO robot with a navigation system; however, they all have strong prerequisites and/or require additional sensors. Some perform a metric localization and estimate the six-dimensional pose of the robot by adding some metric sensor: Hornung et al. [7] use a combination of laser range and inertial unit data, while Maier and Bennewitz [11] use an Asus Xtion Pro Live depth sensor set on the top of the head. Both require a prebuilt three-dimensional (3D) map of the static environment, although Maier and Bennewitz [11] also deal with dynamic obstacles. As far as vision is concerned, Maier et al. [10] use sparse laser data to detect obstacles and classify them using vision. Osswald et al. [15] estimate the pose using floor patches to perform reinforcement learning and reach a destination as fast as possible. It is worth noting that adding any kind of sensor on the robot (a laser head or 3D camera) affects the robot balance and walk. Chang et al. [2] propose a method for the fixed Robocup context, using only the regular sensors; however, they rely on the specific field features.

The first aim of this article is to try out simple keypoints positions method, in order to be as light as possible in terms of computational power, which means in particular avoiding Kalman-based approaches. In Section 2, we describe a lightweight keypoints position estimation and try to adapt it to the NAO platform. We show that although we have good theoretical results, the application to the robot is difficult. Despite this, in Section 3, we use some of our observations to derive a partial localization information, which consists in estimating the robot orientation, and test it both in simulation and on a real robot.

2 A Metric Visual SLAM Algorithm

2.1 A Keypoint-Based Metric SLAM

2.1.1 Notations and Model

Let x be the planar position of the robot and θr its orientation in the world reference. For each keypoint indexed by i, we let pi denote its position in the world reference. The robot axes are described in Figure 2.

Figure 2 Axes Definition on the Robot.
Figure 2

Axes Definition on the Robot.

Let ω denote the angular velocity of the robot, u ∈ R; the value of its velocity expressed in m/s, and Rθ the rotation of angle θ. If we assume that the robot velocity vector is collinear with the orientation axis X, we end with the following kinematic model for the motion of the robot:

(1)dθrdt=ωdxdt=uRθXdpidt=0. (1)

This model is simple but corresponds to the existing walk API, which provides a speed control for translations and rotations. It is also possible to control the walk step by step; however, as the focus of this work is not on the walk control, we have kept the high-level simple control approximation.

In the robot reference frame (X, Y, Z), the position of the keypoint i is the vector


The camera measurement being the direction (bearing) of the keypoint, for each i the vector


is assumed to be measured.

In the present article, we propose a simplified two-step approach to the SLAM problem. First, the odometry is assumed to yield the true values of ω and u, and a filter is built to estimate each keypoint’s position, and then, using those estimates globally, the knowledge of ω and u is refined (i.e., the sensor’s biases can be identified). The present section focuses on the first step, that is, estimating pi from measurement yi and perfectly known ω, u.

Our approach is based on the introduction of the vector yi=Rπ/2yi. Letting ′ denote the transposition of a vector, and writing that yi is orthogonal to yi, we have

(2)0=zi(t)yi(t)=(Rθ(pix(t)))yi(t)=(pix(t))(Rθ(t)yi(t)). (2)

Introducing the vector ηi(t)=Rθ(t)yi(t) that is yi mapped into the world reference frame, we thus have

(3)(pix(t))η(t)=0=η(t)(pix(t)). (3)

The previous notations are illustrated in Figure 3: Figure 3A displays the notations in the absolute world reference and Figure 3B in the robot reference.

Figure 3 Notations for the Metric SLAM Position Estimation.(A) Notations in the world reference. (B) Notations in the robot reference.
Figure 3

Notations for the Metric SLAM Position Estimation.

(A) Notations in the world reference. (B) Notations in the robot reference.

Equation (2) simplifies all following computations by performing a reference change. Contrary to what is usually done, all observations are considered in the robot reference and not in an absolute reference.

Knowing at each time the previous relation (3) should be true, pix(t) and thus pi (up to an initial translation x(0)) can be identified on the basis of the observations over a time interval as the solution to a linear regression problem (the transformed outputs η(t) being noisy). To tackle this problem online, three main possibilities can be thought of.

2.1.2 Gradient Observer

The problem can be tackled through least squares, trying to minimize at each step the following cost function:


To do so, we propose the (non-linear) gradient descent observer (5), the estimation of x and θ being a pure integration of the sensor’s outputs, and the estimation of pi being ki > 0 times the gradient of C(t) with respect to pi at the current iteration:

(5)dθ^dt=ωdx^dt=uRθ^(t)X1dp^idt=kiηi(t)(p^i(t)x^i(t))ηi(t)  i. (5)

The gradient of the cost function with respect to pi is


Let e(t) denote the estimation error between the estimated keypoint position and the actual position in the robot reference (7), i.e.,

(7)e(t)=Rθ^r(t)(z^i(t)zi(t)). (7)

The following result allows proving asymptotic convergence of the error to zero under some additional conditions.

Proposition 1

(8)ddt||e(t)||2=2ki(ηi(t)e(t))20. (8)

Equation (8) proves that the estimation error is always non-increasing as the time derivative of its norm is non-positive. Moreover, if there exists T, ε, α > 0 such that


is a matrix whose eigenvalues are lower bounded by ε and upper bounded by α, the estimation error converges exponentially to zero (the input is said to be persistently exciting).

Proof. By construction:

(9)ddt||e(t)||2=ddt(e(t)e(t)).=2ededt (9)

First we compute the error derivative

(10)dedt=ddt(Rθ^r(z^z))=ddt(Rθ^r(Rθ^r(p^x^)Rθr(px)))=ddt((p^x^)Rθ^rθr(px))=dp^dtdx^dtddt(Rθ^rθr(px))). (10)

Now if we denote θ^r(t)θr(t)=θ˜(t):


We know that dθ^dt=dθdt=ω by construction of the observer, so dθ˜(t)dt=0 and also that dxdt=uRθX1 and dpdt=0, as the observed point is considered fixed.



Thus, if we inject this in eq. (10) we obtain



(11)dedt=dp^dt. (11)

Therefore, using eqs. (9) and (11)

(12)ddt||e||2=2edp^dt=2keη(p^ix^i)η. (12)



So if we use relation (3) we have


So, finally, if we inject this in eq. (12)


However, in simulations, we see that the convergence with such an observer can be very slow. This is because the instantaneous cost C(t) is only an approximation of the true cost function viewed as an integral of C(s) over all the past observations, i.e., observed during the time interval [0, t]. For this reason, we should consider some other possibilities.

2.1.3 Kalman Filtering

The solution of this problem could be computed using a Kalman filter approach. As we do not have any a priori knowledge of the keypoints positions, the initial covariance matrix would be P0 = ∞.

Although this approach is perfectly valid, we will take advantage of the simplicity of the current problem and consider an even simpler solution, described in Section 2.1.4. This way, we avoid having to take care of numerical issues due to the initial jump of the covariance matrix from an infinite value to a finite one. We also avoid having to maintain a Riccati equation at each step.

Another similar solution would be to use a recursive least squares (RLS) algorithm, with a forgetting factor λ< 1. The following simpler approach is closely related to RLS: it replaces the forgetting factor by a fixed-size sliding time window.

2.1.4 Newton Algorithm over a Moving Window

Owing to the simplicity of the problem, and having noticed eq. (2), we choose the following approach over Kalman or RLS filtering.

A somewhat similar approach to RLS consists in using a Newton algorithm to find the minimum of the cost function C(s) integrated over a moving window. This has the merit of leading to simple equations and being easily understandable. The cost at its minimum provides an indication of the quality of the keypoint estimation p^i. We choose this simple alternative to Kalman filtering. Consider the averaged cost

(13)Q(t,p)=tTt(ηiηi(s)(px(s)))2ds. (13)

It is quadratic in p. Its gradient with respect to p merely writes


The minimum is reached when the gradient is set equal to zero; that is

(14)p^(t)=A1(t)B(t). (14)

To be able to invert matrix A(t), it has to be of full rank. By construction, it is possible iff the successive directions of the keypoint observations are not collinear. This means that the keypoint has been observed from two different viewpoints, which is the case in practice when the robot moves and the keypoint is not in the direction of the move.

The convergence speed of this algorithm depends on the size of the sliding window T. It also depends on the determinant of matrix A(t), which is easily interpreted: the triangulation will be better if we have observations that have very different bearings.

This algorithm can be interpreted as a kind of a triangulation. The cost function (13) can be interpreted as the sum of the distances from the estimated position to observation lines.

2.2 Adapting to the Humanoid Platform Constraints

The previous algorithm is generic, but it requires that keypoints must be tracked robustly and observed under different directions over time. This means that a few adaptations must be made on the NAO.

2.2.1 Images

As shown in Figure 4, the on-board camera has a limited field of view. On the other side, the keypoints observations must be as different from each other as possible, which is obtained at best when the keypoints are observed in a perpendicular direction to the robot.

Figure 4 NAO Robot Head with Camera Fields of View.(A) Side view. (B) Top view.
Figure 4

NAO Robot Head with Camera Fields of View.

(A) Side view. (B) Top view.

To compensate for this, we use the fact that the robot can move its head around and build semi-panoramic images. The robot moves its head to five different positions successively, without moving its base, stopping the head to avoid blur. Figure 5 shows an example of such a panorama. The different images are positioned using the camera position measurement in the robot reference, which is known with a precision under 1°, thanks to the joint position sensors. This means that it is not required to run any image stitching algorithm because this positioning is already satisfactory.

Figure 5 Example of Panoramic Image.
Figure 5

Example of Panoramic Image.

This makes it possible to have a larger field of view, and thus to track keypoints on the side of the robot, which are the ones where the triangulation works at best. However, it requires the robot to stop in order to have sharp and well-positioned images. In practice, we have determined that the robot must stop and take a half panorama every 50 cm.

The robot head is not stabilized during the walk, which means the head is moving constantly up and down. Because of the exposure time of the camera, most of the images are blurred. There is also a rolling shutter effect: as the top part of the image is taken before the bottom part, the head movement causes a deformation of the image that seems “squeezed.” Because of these factors, it is necessary to stop the robot to get usable images.

The observer described in Section 2.1.4 has been implemented in C++ to run on board on the robot.

2.2.2 Keypoints

The tracked keypoints are chosen for their computation cost and robustness. The keypoint detector is a multiscale FAST (features from accelerated segment test [16]) detector because it is quick to compute and provides many points. The descriptor is upright SURF or USURF, which is a variation of SURF [1] that is not rotation invariant and lighter to compute than regular SURF. The fact that the keypoints are not rotation invariant is an advantage here because it reduces the risk of confusion between points that are similar after a rotation, for example, a table border and a chair leg. This makes the tracking more robust and avoids false positives.

Keypoint tracking is performed by matching the keypoints descriptor using a nearest neighbor search with a KdTree to speed up the search. A keypoint is considered matched when its second nearest neighbor distance is sufficiently greater than the first nearest neighbor distance. This eliminates possible ambiguities in keypoints appearance: such keypoints are considered not sure enough to be retained.

The keypoint extraction and matching have been implemented in C++ using the OpenCV library (see Ref. [14]). Both FAST and USURF have proved robust enough to blur, so that the tracking error is usually less than a few pixels. The FAST detector has a pixel resolution, which means 0.09 degrees for a VGA resolution.

2.3 Results and Limitations

The previous algorithm has been tested on several runs in office environment (see Figure 5). During the runs, the robot walked around, stopping every 50 cm to take a half panorama. The experiments were recorded in order to be replayable and filmed in order to compare the estimated keypoint positions to the real ones.

Figure 6 shows the different stopping positions of the robot during a run. The robot positions and orientation are marked with red lines. Figure 5 was taken at the first stop of this run. Figure 7 shows a successful keypoint position estimation. The successive robot positions are marked with red arrows, where the arrow shows the direction of the robot body. The green lines show the observed keypoint directions. The pink cross corresponds to the estimated keypoint position. The runs have shown that there are many keypoint observations that cannot be efficiently used for the observer. For each of these cases, a way to detect them and so ignore the keypoint has been implemented. Parallel keypoint observations can be problematic. If the keypoint observations are exactly parallel, the matrix that has to be inverted is not of full rank. In practice, the observations are not exactly parallel, so the matrix would be invertible in theory. This leads to estimations of keypoints extremely far away because of small measuring errors. To identify this case, we rely on the determinant of the matrix, which must be high enough. Figure 8 shows an example of such observations: the observations are nearly parallel, probably observing a keypoint very far away in the direction highlighted by the dashed blue line.

Figure 6 Robot Stopping Positions during an Exploration Run.(A) Stops 0–4. (B) Stops 5–8.
Figure 6

Robot Stopping Positions during an Exploration Run.

(A) Stops 0–4. (B) Stops 5–8.

Figure 7 Example of Correct Triangulation from Keypoints Observations.
Figure 7

Example of Correct Triangulation from Keypoints Observations.

Figure 8 Parallel Keypoints Observations.
Figure 8

Parallel Keypoints Observations.

Because of matching errors or odometry errors, the triangulation may be of low quality, resulting in a triangulation with too much error. In that case, we can use the interpretation of the cost function as the sum of distances from the estimated position to the observation to correlate its value with the size of the zone where the keypoints could be. Figure 9 shows an example of low-quality triangulations: the estimated keypoint is 0.5 m away from some observation lines. The triangulation zone is drawn with striped lines. A threshold on the cost value has been set experimentally to remove these low-quality estimations.

Figure 9 Low-Quality Triangulation.
Figure 9

Low-Quality Triangulation.

Sometimes, inconsistent observations can still result in an apparently satisfactory triangulation. One of the ways to detect this is to check that the estimated keypoint position is still consistent with the robot field of view. Figure 10 shows that the intersection between observations 0 and 4 is inconsistent because it is out of the robot field of view in position 4. This is the same for observations 0 and 8: the intersection is behind both positions. The observation directions have been highlighted with dashed blue lines, showing that the line intersections should be invisible from at least one robot pose. To avoid this case, the algorithm checks that the estimation is consistent with every robot position in terms of field of view and distance (to be realistic, the distance must be between 20 cm and 5 m).

Figure 10 Inconsistent Keypoints Observations: Intersection Behind the Robot.
Figure 10

Inconsistent Keypoints Observations: Intersection Behind the Robot.

Figure 11 shows the influence of the different criteria on the keypoints estimation. The robot trajectory is drawn in red, with the robot positions in green, and the keypoint identifiers are put at the estimated positions. Note that the scale changes from one image to the other. Figure 11A presents the results without any criterion: the keypoints are very dense, but some estimations are too far away to be realistic. One keypoint is estimated 30 m away, for example, and most could not have been that far because of the room configuration. Figure 11B adds the quality criterion: some points are eliminated, and the distance values seem less absurd. The maximum distance is now around 6 m. Figure 11C reduces the number of keypoints by adding the consistency criterion: it is visible that the number of unrealistic points has been reduced. For example, there are much less keypoints behind the first robot position. Finally, Figure 11D also adds the matrix determinant criterion. The distance to the keypoints is again reduced, which is consistent with the purpose of the criterion.

Figure 11 Effects of Different Criteria on the Keypoint Position Estimations.(A) No criterion. (B) Quality criterion. (C) Quality and consistency criteria. (D) Quality, consistency, and determinant criteria.
Figure 11

Effects of Different Criteria on the Keypoint Position Estimations.

(A) No criterion. (B) Quality criterion. (C) Quality and consistency criteria. (D) Quality, consistency, and determinant criteria.

However, even if the number of unreliable estimations has been greatly reduced, an examination of the tracked keypoints has revealed that most of them are in fact false matches that happen to meet all of the previous criteria. The ones that effectively were correct matches were correctly estimated; however, it was not possible to distinguish them from false matches. This very high number of false matches comes from different causes. The first one is perceptual aliasing: in the tested environment, there were a lot of similar chairs that were often mixed up. Possible solutions would be either a spatial check when matching keypoints and the use of color descriptors. However, this would not solve all problematic cases. Another difficulty comes from the size of the platform. At the height of the robot, parallax effects cause a large number of keypoints to appear or disappear from one position to the other. A way to solve this would be to take more half panoramas; however, this implies more stops, which is problematic because the exploration is already very slow. In fact, there were a lot of keypoints that were efficiently tracked. However, they corresponded to keypoints that were very far away, or directly in front of the robot when it walked. These keypoints cannot be used because they do not satisfy our estimation hypothesis. In Section 3, we try to take advantage of these points in another way.

In conclusion, even if simulation results seemed promising, and even if it was possible to meet some of the requirements of the algorithm, it was not enough to track reliable, high-quality keypoints separately.

3 A Useful Restriction: The Visual Compass

As explained in Section 2.3, keypoints that are straight in front of the robot or that behave as if they were infinitely far away can be tracked. These observations cannot be used to approximate their positions, but they still can be useful.

3.1 From the Metric SLAM to a Compass

Keypoints that are infinitely far away (or behave as such) have a common characteristic: when the robot orientation changes, they shift accordingly in the image. It is the same when the robot performs pure rotation, without any hypothesis on the keypoint’s actual position. Ideally, when the orientation changes, all keypoints will shift of the same number of pixels in the image. This shift depends on the image resolution and camera field of view and can be known in advance. Here, we make the assumption that the camera intrinsic parameters do not deform the image and that we can simply use a proportional rule to convert a pixel shift into an angular shift. This assumption is only a simplification, so a more complex model could be used, and it is met in practice on the camera of NAO.

3.1.1 Observation of the Deviation Angle

Ideally, if all keypoints were tracked perfectly, and satisfying the desired hypothesis, only one pair of tracked points would be required to get the robot orientation. However, this is not the case because of possible keypoints mismatches, imprecisions in the keypoints position due to the image resolution. The assumption that the robot makes only pure rotations is not true because the head is going up and down during the walk, and the robot torso is not exactly straight. This implies some pitch around the Y-axis and roll around the X-axis, which added to the yaw deviation around the Z-axis (see Figure 2 for the axes definitions). When the robot is going forward, because the tracked keypoints are not actually infinitely far away, the zooming or dezooming effect will add to the shifting, and our assumptions will be less and less valid. To compensate for this effect, a possible solution is to rely on all tracked keypoints to get the most appropriate global rotation. A way to this is to use random sample consensus (RANSAC) (or any variation such as PROSac). A model with two pairs of matched keypoints is used to deduce the yaw, pitch, and roll angles. To make the model computations easier, we will use the complex representation of the keypoints position. Let z1 be the position of the first keypoint in the reference image, and z1 its position in the current image. z2 and z2 are defined in a similar fashion for the second keypoint. We define the following variables

(15)θ=arg(z1z2z1z2)o=z1z1eiθ. (15)

The angles can then be obtained by the following equations, if c is the image center coordinates:


It is worth noticing that the model (15) is not usable for every couple of matched points. The model assumes that this pair of points is consistent with the model. This is not the case, for example, if the keypoints are mismatched. However, these false computations will be eliminated by the RANSAC because they cannot have more inliers than a model based on valid points.

To compute the inliers for one particular model, we have to define a distance from one point to the other. For a point pair (z, z′), we compute the theoretical image of z:


The distance is then

(16)d=||z˜z||. (16)

We can then use the model computations from eq. (15) and the distance computation from eq. (16) to perform the algorithm. The number of inliers from the final best model can be used as an indication of how reliable the computation is. In particular, this can be used when the assumption that keypoints are infinitely far away becomes dubious, as in Section 3.1.2. The keypoints used are multiscale FAST detector as before, but the chosen descriptor was the ORB descriptor (oriented BRIEF, see Ref. [17]) for speed and computational cost reasons. This descriptor is a vector of 256 binary values, which are quicker to extract and to compare than the usual SURF descriptors, with only on a slight difference in robustness.

In practice, the camera angular resolution is 7 × 10–3 radians per pixel for a 160 × 120 resolution (which is the one used here). This means that if we have an uncertainty of 5 pixels, which is a reasonable upper bound for the uncertainty of the keypoint detection, this results in an uncertainty of about 2 degrees maximum, which is a considerable improvement from the uncorrected walk (see, e.g., Figure 16 for a comparison). The robustness and precision of this method is linked to the keypoint density. In practice, most indoor or outdoor environments have enough keypoints to provide a reliable heading estimation.

3.1.2 Control of the Robot Movement

The principle of our visual compass is to use this information to control the robot orientation. Let ω be the angle to rotate. Before rotating, the robot first turns its head toward a reference direction. This direction is situated at the angle position ω2 in the initial robot reference, so that any rotation within [–π, π] is possible. The robot then maintains its head in that direction and rotates its body toward the final orientation. The head target position is 0, and the base target position is ω2, in the direction reference. Let α be the head angle in the fixed reference and θ be the body angle. Their initial values are 0 and ω2, respectively. The robot is controlled with its body rotation speed θ˙ and with its head speed in the robot reference, which is α˙θ˙.

The control is done with two Proportional-Integral-Derivative (PID) controllers, described in Figure 12. The observed variable is the current angular deviation of the head in the fixed reference, α^. We deduce the current deviation of the robot body using the internal position sensors of the robot, S. The desired robot body speed is then computed. It is used to predict the head movement due to the body movement Δαp=θ˙Δt, which is then added to the current observed angle for the error computation. This step has been added to make the head tracking more robust for high rotation speeds.

Figure 12 PID Controllers for the Head and Robot Body.
Figure 12

PID Controllers for the Head and Robot Body.

For translations, it is possible to make the robot walk a straight line by simply adding a forward speed x˙. When the estimation starts being less reliable (see Section 3.1.1), the robot stops going forward, aligns itself in the desired direction, and renews its reference image. This makes the reference reliable again, and the robot can keep going forward.

3.2 Simulation Results on Webots

To obtain quantitative results with a reliable ground truth measurement, we have used the Webots simulator from Cyberbotics [4]. This simulator provides physics simulation along with various sensors simulation, including cameras. A simulated world has been built to test the visual compass (see Figure 13) using the simulator as a ground truth. This simulator emulates a real robot and reproduces the architecture of the robot operating system, NAOqi. The simulator differs from the real world in the odometry estimation, which is slightly worse in the simulator (because the robot is slipping more), and the images are more stable (because there is no motion blur or lighting conditions changes). However, this is an interesting validation step and makes it possible to get repeatable and measurable experiments.

Figure 13 The Simulated Environment (using Webots).
Figure 13

The Simulated Environment (using Webots).

Two different simulated experiments have been run. During each experiment, we record the estimated angle from the compass, the ground truth position of the robot and the estimated position of the robot by the odometry. The odometry and compass estimation are reset to the ground truth between each new target movement.

The first experiment makes the robot perform two consecutive pure rotations, of an angle π then –π (each lasting around 20 s in the simulator). Figure 14 shows a comparison of the different estimations of the robot base angle. The ground truth angle is drawn in dashed blue line, the compass angle is drawn in full red line, and the odometry angle is drawn in dotted green line. It is visible that the odometry measurement drifts steadily during the movement, while the compass estimation stays close to the ground truth.

Figure 14 Comparison of the Robot Angle Estimations during Pure Rotations (Two Rotations of 20 s Each).
Figure 14

Comparison of the Robot Angle Estimations during Pure Rotations (Two Rotations of 20 s Each).

The second experiment makes the robot walk in a straight line for about 25 s. Figure 15 shows a comparison of the estimations of the robot base angle (with the same display conventions as the previous figure). This shows that the odometry has a nearly constant bias of approximately 0.15 radians (around 9°). This also shows that in the ground truth, the robot walks in a near straight line (or at least, a set of parallel lines with a slight lateral shift). The estimated compass angle oscillates around the ground truth, with a weak oscillation amplitude (0.05 radians).

Figure 15 Comparison of Robot Angle Estimations during a Translation (25 s Experiment).
Figure 15

Comparison of Robot Angle Estimations during a Translation (25 s Experiment).

Figure 16 compares the robot trajectory as estimated by the odometry and the ground truth trajectory. The odometry trajectory is drawn as a full blue line, while the ground truth is represented as a dashed green line. The oscillations come from the fact that the robot torso is oscillating during the walk, and the oscillation period corresponds to the stepping period. Note that the odometry underestimates the amplitude of these oscillations. The bias highlighted by Figure 15 can be seen quite clearly because the trajectory is clearly steadily drifting from the real trajectory, with a constant angle of around 0.12 radians (about 7°).

Figure 16 Comparison of Robot Trajectories Estimation during a Translation (25 s Experiment).
Figure 16

Comparison of Robot Trajectories Estimation during a Translation (25 s Experiment).

3.3 Results on the Real Robot

The compass has been integrated on the robot, with a C++ implementation based on the OpenCV library. Figure 17 shows a typical output of the keypoint matching with the RANSAC model. The reference image is shown on top, and the current one on the bottom part. The inliers are marked in green, while the outliers are drawn in red. This figure shows how all inliers behave consistently, here with a global shift to the left. In contrast, outliers correspond to inconsistent points, which are due to false matches. Note that the rotation change between the two images is clearly visible in the selected inliers.

Figure 17 Example of RANSAC Output on Keypoint Matching.
Figure 17

Example of RANSAC Output on Keypoint Matching.

Table 1 shows the CPU usage of the algorithm on the robot for available resolutions of the camera. For the walk corrections, the typical parameters are 160 × 120 and 15 fps. The maximum resolution is only available at five frames per second, so around 36% CPU. Note that the results are significantly better than in Ref. [20] because the keypoints descriptor has been changed from USURF to ORB.

Table 1

Comparison of CPU Usage for Different Resolutions and Frame Rates.

Resolution160 × 120320 × 240640 × 4801080 × 960
5 fps2%4%10%36%
13 fps6%11%31%NA
30 fps13%23%60%NA

PID parameters have been tuned for the robot. The following results are presented for a desired rotation of π2, which lasted around 10 s. Figure 18 presents the computed speeds in the robot reference. Both speeds are capped for walk stability. The head speed (blue full line) drawn is compensating both the base speed and the errors on the head and the base position. The base speed (in green dashed line) is first capped then decreases slowly. When the body angle reaches the destination, the robot is stopped immediately, which explains why the speed is not necessarily zero at the end of the experiments.

Figure 18 Evolution of Head and Base Speeds with Time (9 s Experiment).
Figure 18

Evolution of Head and Base Speeds with Time (9 s Experiment).

Figure 19 shows the evolution of the head angle in the fixed reference. The head angle (in blue full line) is oscillating around the consign (in green dashed line). Figure 20 shows the evolution of the base angle in the fixed reference. The goal value is π4 and the start value is π4. The robot stops when the error is under a fixed threshold. Overshoot is compensated by taking one last measurement at the end of the rotation to retrieve the final angle, which makes it possible to correct this overshoot later on, if necessary. Future plans include using a motion-capture device to get quantitative and precise data on a real NAO robot, in order to confirm the simulation experiments.

Figure 19 Evolution of Head Angle with Time (9 s Experiment).
Figure 19

Evolution of Head Angle with Time (9 s Experiment).

Figure 20 Evolution of Base Angle with Time (9 s Experiment).
Figure 20

Evolution of Base Angle with Time (9 s Experiment).

4 Conclusion

In this article, we have presented our work on localization and mapping on a constrained humanoid platform, the NAO robot. We have designed a light and easily interpreted observer to estimate tracked keypoint positions. We have added a series of adaptations for our platform, including increasing the field of view by moving the head and compensating false matches by checking some consistency conditions and estimation quality. However, the final results have not proven satisfying enough on NAO. This comes in particular from the small robot height, which makes it very difficult to extract high-quality keypoints and track them. The robot limited field of view is also an issue, along with its reduced speed, which makes frequent stops long and tedious. It might be possible to reuse some of this work on taller robots, such as ROMEO, for example, or some other platform, because there are very few requirements except the ability to track keypoints.

Nevertheless, from our work on keypoint tracking, we have been able to derive a partial localization information, the robot orientation. To do so, we have built an efficient visual compass that is based on matching reference image keypoints with the current observed keypoints and deducing the current robot rotation. FAST keypoints with ORB descriptors are extracted, and their global rotation is computed using a RANSAC scheme. PID controllers make it possible to control the robot orientation precisely, performing rotations and walking along straight lines with much higher precision that the open loop odometry. The implementation is running in real time on the robot, and its accuracy has been controlled quantitatively on a simulator and qualitatively on a robot. Future work includes getting quantitative data on a real robot with motion capture.

In future work, we will look more closely at other types of vision-based approaches, such as dense visual methods based on image correlation, whose characteristics and weaknesses are complementary to the sparse algorithms described in this article.

Corresponding author: Emilie Wirbel, A-Lab Aldebaran Robotics, 170 rue Raymond Losserand, 75015 Paris, France, e-mail: ; ; and MINES ParisTech, Centre for Robotics, 60 Bd St. Michel, 75006 Paris, France


[1] H. Bay, T. Tuytelaars and L. Van Gool, Surf: speeded up robust features, in: Computer Vision-ECCV 2006, pp. 404–417, 2006.10.1007/11744023_32Search in Google Scholar

[2] C.-H. Chang, S.-C. Wang and C.-C. Wang, Vision-based cooperative simultaneous localization and tracking, in: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 5191 –5197, May 2011.Search in Google Scholar

[3] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas and A. Calway, Real-time and robust monocular SLAM using predictive multi-resolution descriptors, in: Advances in Visual Computing, pp. 276–285. Springer, 2006. Available at in Google Scholar

[4] Cyberbotics, Webots simulator, Available at in Google Scholar

[5] A. Elfes, Using occupancy grids for mobile robot perception and navigation, Computer 22 (1989), 46–57.10.1109/2.30720Search in Google Scholar

[6] D. Gouaillier, C. Collette and C. Kilner, Omni-directional closed-loop walk for NAO, in: 2010 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 448 –454, December 2010.10.1109/ICHR.2010.5686291Search in Google Scholar

[7] A. Hornung, K. M. Wurm and M. Bennewitz, Humanoid robot localization in complex indoor environments, in: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1690–1695, October 2010.Search in Google Scholar

[8] G. Klein and D. Murray, Parallel tracking and mapping for small AR workspaces, in: Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan, November 2007.10.1109/ISMAR.2007.4538852Search in Google Scholar

[9] G. Klein and D. Murray, Compositing for small cameras, in: Proc. Seventh IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’08), Cambridge, September 2008.10.1109/ISMAR.2008.4637324Search in Google Scholar

[10] D. Maier, M. Bennewitz and C. Stachniss, Self-supervised obstacle detection for humanoid navigation using monocular vision and sparse laser data, in: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 1263 –1269, May 2011.Search in Google Scholar

[11] D. Maier and M. Bennewitz, Real-time navigation in 3D environments based on depth camera data, in: Proceedings of the IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS), 2012.10.1109/HUMANOIDS.2012.6651595Search in Google Scholar

[12] Y. Matsumoto, K. Sakai, M. Inaba and H. Inoue, View-based approach to robot navigation, in: 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2000. (IROS 2000). Proceedings, vol. 3, pp. 1702–1708, 2000.Search in Google Scholar

[13] M. Montemerlo, S. Thrun, D. Koller and B. Wegbreit. FastSLAM: a factored solution to the simultaneous localization and mapping problem, in: Proceedings of the National Conference on Artificial Intelligence, pp. 593–598, 2002.Search in Google Scholar

[14] OpenCV, Open computer vision library, Available at in Google Scholar

[15] S. Osswald, A. Hornung and M. Bennewitz, Learning reliable and efficient navigation with a humanoid, in: 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 2375–2380, 2010.Search in Google Scholar

[16] E. Rosten and T. Drummond, Machine learning for high-speed corner detection, Computer Vision-ECCV 2006, pp. 430–443, 2006.10.1007/11744023_34Search in Google Scholar

[17] E. Rublee, V. Rabaud, K. Konolige and G. Bradski, ORB: an efficient alternative to SIFT or SURF, in: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2564–2571, 2011.Search in Google Scholar

[18] S. Thrun, A probabilistic on-line mapping algorithm for teams of mobile robots, The International Journal of Robotics Research 20 (2001), 335–363. ISSN 0278-3649, 1741-3176. doi: 10.1177/02783640122067435. Available at in Google Scholar

[19] S. Thrun, Robotic mapping: a survey, in: Exploring Artificial Intelligence in the New Millennium, Morgan Kaufmann, San Francisco, CA, 2002.Search in Google Scholar

[20] E. Wirbel, B. Steux, S. Bonnabel and A. de La Fortelle, Humanoid robot navigation: from a visual SLAM to a visual compass, in: 2013 10th IEEE International Conference on Networking, Sensing and Control (ICNSC), pp. 678–683, 2013. doi: 10.1109/ICNSC.2013.6548820.10.1109/ICNSC.2013.6548820Search in Google Scholar

Received: 2013-9-30
Published Online: 2014-3-28
Published in Print: 2014-6-1

©2014 by Walter de Gruyter Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded on 7.2.2023 from
Scroll Up Arrow