Identifying relevant feature-action associations for grasping unmodelled objects

Abstract Action affordance learning based on visual sensory information is a crucial problem within the development of cognitive agents. In this paper, we present a method for learning action affordances based on basic visual features, which can vary in their granularity, order of combination and semantic content. The method is provided with a large and structured set of visual features, motivated by the visual hierarchy in primates, and finds relevant feature-action associations automatically. We apply our method in a simulated environment on three different object sets for the case of grasp affordance learning. When presented with novel objects, we achieve a success probability of 0.90 for box objects, 0.80 for round objects and up to 0.75 for open objects. In this work, we demonstrate, in particular, the effect of choosing appropriate feature representations: we show a significant performance improvement when increasing the complexity of the perceptual representation. We thereby present important insights into how the design of the feature space influences the actual learning problem.


Introduction
Identifying sensory features that indicate action affordances is a crucial problem to be solved by cognitive agents, since it allows for the identification of "action opportunities". A fundamental problem is the design of the perceptual feature space in which affordances emerge. This space can make the problem rather trivial (e.g., in case features that have a strong link to specific affordances are already provided). It can also be very difficult, when the link between features and affordances can only be established by a high-order combination of simple features (e.g., on the pixel level as in [1]). In this paper, we investigate grasp affordances which are triggered by visual features of different order (see Fig. 1a), different granularity (see Fig. 1b) and semantic abstraction (see Fig. 1c). We are aware that the feature space we span is still of much lower complexity than what the human visual system provides in the occipital cortex. However, we investigate variation along three important dimensions of this feature space, as further discussed in section 2.1.
In this paper, we introduce a method for finding feature-action associations in a complex visual feature space. The method for affordance learning described in the paper is not specific to a certain type of affordance; it can in principle be applied to any parameterizable action affordance. We nevertheless choose grasping as an example problem for three reasons. First, grasping is of general importance. Second, we can simplify the learning problem by neglecting certain feature dimensions provided by the human visual system. For example, colour can be ignored as a relevant dimension for grasping. In this paper, we also neglect 2D shape information, which might already be a more questionable design decision. A third reason for addressing grasping is that relevant, related prior work already exists: in [2] (see Fig. 5), grasp affordances have been manually designed as first- and second-order relations of visual entities (local surfaces and 3D edges/contours). By that, a grasp performance of around 30% success could already be reached. In [2], however, the grasp affordances were defined "by hand"; in this paper, besides improving performance, we aim at replacing such a manual design step by learning.
For this, we want to explore the cross-space of surface features and their combinations, as shown in Figs. 1a-1c, and grasping actions. Fig. 2 shows how the variation of the complexity of the input feature relates to the learning task. In Fig. 2a, left, we see a surface patch being related to a grasp. Learning grasp affordances with high success from this kind of weak feature is impossible: to exemplify this, imagine that the feature-grasp association would predict success for all surface patches occurring in the scene. However, in the three examples given, an actual success would only occur for the grasp at the right, but not for the other two grasps shown in Fig. 2a. When we extend the feature space to second-order combinations of surface patches (see Fig. 2b), the grasp on the left could be recognized as an unsuccessful one from the fact that no surface patch at the inside of the object is observable. However, it is still impossible to recognize that the middle grasp cannot be successful. When we also add the concept of a boundary and its direction to the surface patch (see Fig. 2c), the system could be able to distinguish that only the right grasp can be successful in the given context. Similarly, in this paper we investigate the consequences for learning when we vary dimensions of the feature space such as the order of the features or their semantic content.

Figure 3: An illustration of how different kinds of bias for grasping actions for a two- or three-finger hand can be defined. The space is defined by two measures of complexity. The first is the feature bias for the simple grasp reflex (based on either a single feature (1), (2), (3) and (4) or multiple features (5) and (6)); the second is the complexity of the manipulator, here exemplified by a two- and a three-fingered manipulator.

The algorithm we apply for this is a rather simple clustering method combined with a voting approach, and part of the investigation is to explore the potential but also the limitations of such an approach. The complexities associated with our approach primarily stem from two sources. Appropriate action bias: Unsuccessful actions are of limited usefulness for action affordance computation (although they can be used for sorting out uninteresting areas), and hence the system needs to be able to initially perform actions with a certain success likelihood. This can be achieved by introducing an action bias (see Fig. 3), e.g., by designing simple feature-based heuristics that trigger actions with sufficient success likelihood (as in, e.g., [2]). In our case, we define rather weak biases that already lead to reasonable success likelihoods between 10-50%, depending on the object class. Feature space design: A further problem is to provide a feature space that covers features sufficiently correlated to successful actions. The feature space applied in this work does not provide independent feature coefficients. On the contrary, the feature space is highly structured: it provides geometric relations between surface patches, which require appropriate parametrisations, careful choices of metrics as well as a proper association of semantics.
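As a minimal illustration of such a feature-based action bias, the sketch below proposes one grasp candidate per surface patch by approaching along the patch normal. This is a hypothetical heuristic in the spirit of the single-feature biases of Fig. 3, not the exact reflex used in our system; the function name, tuple layout and offset value are assumptions.

```python
import numpy as np

def biased_grasp_candidates(surfling_poses, approach_offset=0.05):
    """Propose one grasp candidate per surface patch.

    surfling_poses: iterable of (position, normal) pairs for surface patches.
    Returns (pre-grasp point, approach direction) pairs: the gripper starts
    a small offset away from the patch and approaches along the inverted
    surface normal.
    """
    candidates = []
    for position, normal in surfling_poses:
        normal = np.asarray(normal, dtype=float)
        normal = normal / np.linalg.norm(normal)
        target = np.asarray(position, dtype=float) + approach_offset * normal
        candidates.append((target, -normal))  # (pre-grasp point, approach dir)
    return candidates
```

Such a weak bias is only meant to raise the initial success likelihood enough that learning from evaluated actions becomes feasible.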
Which features are actually relevant might depend significantly on the actual task, and as we show, most features are highly uncorrelated to action success and therefore insignificant. The richer the visual space we provide, the more complex the learning problem becomes, since feature-action associations then need to be found in a larger space. This holds in particular when feature relations of high order are computed, since this very quickly leads to a dimensionality that cannot be explored exhaustively anymore (dimensionality explosion). As a way to reduce the learning problem, the semantic content of the features can be increased (as indicated in Fig. 1c). This, however, usually requires the introduction of additional heuristics and would thereby jeopardize the generality of the approach. In our work, we also need to deal with this trade-off. In particular, we will show how the different design choices change the statistical distributions of particles in the feature space and thereby also the structure and complexity of the learning problem. In this paper, we describe how we approach the above-mentioned complexities. We demonstrate how the affordance learning problem constitutes itself when important parameters such as the order of the features, their granularity and their semantic complexity are varied.
In particular we show:
- that grasp affordance predictions comparable to the heuristically defined grasp affordances in [2] can be learned as second-order combinations of surface features. In that way, heuristics depending on the insight of the designer can be replaced by learning from experience;
- that the complexity of the feature space we span is of significant importance for the ability to learn affordances with a high rate of success. In particular, we show that we can improve the quality of affordance prediction by combining multiple features and adding semantic information. By that, we are able to identify grasp affordances for a set of different object types with a high likelihood of success;
- that suboptimal choices of feature representation provide insufficient information to serve as a good basis for grasp affordance learning.
This paper is structured as follows: We relate our work to the state of the art in grasp affordance learning and other relevant work in section 2. The problem formulation our approach is based on is outlined and formalised in section 3. The approach to address the problem domain is presented in section 4. In section 5, the experimental settings are explained, whereas the experimental results are presented in section 6. Finally, the paper is concluded in section 7.

State of the art
In the following, we relate our work to the state of the art, first in terms of analogies to the primate's visual processing and second in terms of work on grasp affordance learning.

Analogies to the primate's visual processing
It is generally acknowledged that for humans, vision is a strong cue for affordance generation [3]. More than half of the primate's cortex is connected to visual tasks. As already pointed out in [4], the primate visual space is fundamentally of higher complexity than the action space. This, in the first place, concerns the dimensionality of visual information compared to the still rather low dimensionality of action parametrisation, connected to the limited number of joints to be actuated.

Figure 4: The receptive field sizes increase with the level of the hierarchy; note for example the smaller receptive fields of V1 neurons compared to V2 neurons. The abstraction of the features assumed to be processed at the specific levels (as indicated in the left part of boxes a-e) also increases. For example, the rods and cones in the retina (box a) carry information similar to RGB pixel information in a camera, while in area V1 edge information and a more advanced colour representation is applied. In V2, even more abstract concepts such as border ownership [5] are computed. This figure uses material from [6], to which we also refer for further details.
The human visual system constitutes a deep hierarchy, covering a large number of complementary feature descriptors at different levels of granularity, of different order and semantic abstraction (see Fig. 4 and [6] for a review of today's knowledge about the human visual system). More than 2/3 of the visual cortex (the so-called "occipital cortex") is associated with task-independent feature processing, displayed as yellow areas in Fig. 4. In these areas, a rich set of visual feature descriptors covering different aspects of visual information, such as colour, 2D and 3D shape as well as motion, are extracted. At least at early stages of processing, this is done in largely separated processing streams [6].
As shown in Fig. 4 and described in its caption, the level of abstraction of the feature representation as well as the receptive field size increases (and thereby the granularity of the features decreases) in this hierarchical process. As a consequence, and as modelled in our approach, the human visual system can search for affordances in rather different feature spaces, ranging from, e.g., low-level 2D contrast information in retinal ganglion cells to 3D edge information with semantic association (such as border ownership [5]) in area V2. Moreover, it is not only the features themselves but their combination (as we also investigate in our work) that provides relevant affordance cues (see Fig. 5b). From search tasks it is, for example, known that feature combinations up to third order are computed in parallel in the human visual system, which results in so-called "pop-out effects" in visual search tasks [7]. Hence, finding structures relevant for affordance programming in this high-dimensional space, at appropriate levels of granularity, order and semantic abstraction, poses one of the major problems for affordance learning.

Figure 5: Simple manually defined grasps: (A) grasp affordances defined with respect to a single 3D surface feature (hence defined with respect to a first-order feature relation); (B) grasp affordances defined with respect to two 3D contours (hence defined with respect to a second-order feature relation). Source: [2].

Related work on computing grasp affordances
Visually triggered action affordance learning is important for the development of cognitive agents. Within the grasping community, an object is typically grasped in order to be further manipulated. However, affordance work such as [8-10] takes a more generic approach towards affordance learning, with the aim of finding which visual features afford actions.
In [10], visually triggered affordance learning was investigated, with the purpose of finding which visual 2D feature cues of an object afford graspability. A supervised learning approach was employed, where a robot interacts with an object to discover graspability and links it to extracted feature cues. A different approach is adopted in [9], where affordance cues are extracted from the inspection of human interaction. By identifying which areas of an object are occluded by the human during a grasp/action, it is learned which local areas of an object afford grasping, e.g., a handle.
In our work, we take a similarly generic approach towards affordance learning, but while the authors of [9] learn object properties, e.g., graspability, we learn the coupling of visual features and actions that enables a specific action. In that sense our work is more in line with the work in [8], where grasping points are learned from local visual descriptors, resulting in particular grasping points with associated probabilities.
Given the grasping application in our work, approaches towards learning to grasp unknown objects are also of interest. This topic has been extensively investigated due to its importance for robotic applications. For the problem of grasping unknown objects, two different strategies have generally been adopted: feature-based methods and shape-based methods. Examples of feature-based approaches are [2, 11-14], where a hand-designed grasp hypothesis is proposed given a certain situation. These works range from grasp hypotheses based on a single feature or a combination of two simple features in [2] to grasp hypotheses based on a circle-fitting approach for cylindrical objects [14].
In contrast to feature-based approaches, shape-driven approaches such as [1, 15-17] assume that the agent has a shape model in its database with associated grasps. The shape is matched to the new scene, and in case a good match to a shape primitive is found, the grasps associated with this shape are performed. In [17], a set of prototypical object instances is captured with associated grasps from human demonstration and afterwards used for matching in novel situations. Other approaches like [16] and [15] approximate the object in terms of an oriented bounding box [16] or multiple bounding boxes [15] and then suggest grasp hypotheses based on the configuration of the bounding box. In a similar sense, [18] decomposes an object into superquadrics to obtain an approximated object on which grasping can be performed. Another example of a model-based approach is [19], where object shapes, based on height maps extracted from 3D data and human-demonstrated grasps, are learned and matched against new scene contexts.
Another branch within grasp affordance learning is the utilisation of a closed-loop structure by adding tactile feedback. By introducing tactile feedback from the finger contact points, the stability of the grasp can be assessed before execution or replanning, thereby enabling a better chance of grasping successfully. In [20] it is shown how a grasp is planned based on an initial grasping pose acquired from rather simple vision and then evaluated by the tactile feedback before eventually a grasp or replanning takes place. In another work based on haptics [21], it is shown how tactile feedback before grasp execution and a predictor based on visual information can complement each other for grasp prediction. In [22], tactile feedback is utilised to refine knowledge of an unknown object, thereby enabling the planning of a suitable grasp based on the acquired geometry. For a broader overview of the grasping domain, see [23], where data-driven grasp synthesis for known, familiar and unknown objects is surveyed extensively, including some of the work mentioned here.
Our work is a feature-based approach, as we introduce simple feature constellations with associated actions, to be used for action prediction. The work can be seen as an extension of the work performed in [2], but with the advantage that we learn feature-to-action constellations by exploring different visual representations. In recent work [24], it was shown in a similar way how deep learning techniques can be used to learn a feature representation suitable for learning grasp affordances, as compared to a previous work with a hand-designed feature representation [25]. In contrast to [24], we provide some kind of hierarchy to the learning algorithm, which can then pick out promising candidates from this hierarchy. However, as discussed in the next paragraph, our approach can be seen as a step towards the learning of a deep hierarchy.
The focus on the underlying visual representation also links to work in non-action domains, namely the work by the group of Ales Leonardis on learning hierarchical representations [26]. In this work, visual hierarchies are built up layer by layer. Each higher-level entity is a combination of usually three elements of a lower level, where such combinations represent a certain spatial arrangement of simpler features. The selection of such combinations is done unsupervised at the lower levels of the hierarchy, based on, e.g., the criterion of frequency of occurrence, and in a supervised fashion at higher levels. Our work can be understood as a step towards such hierarchy building, since the relevant particles derived in this paper (see equation 4) are also spatial constellations of simpler entities, which could be used as input to a higher level of a deep hierarchical structure. Different from Leonardis' work, however, we apply 3D entities instead of 2D entities, and we also have task-specific evaluation criteria already at rather early levels of processing.

Problem description and formalisation
The main topic we investigate throughout this paper is the cross-space between perceptual features and actions.We explore how different aspects of the visual representation can provide relevant information for predicting action affordances in a reliable way.

Formalisation
To be able to perform these investigations, we initially formalise the building blocks that we will utilise throughout the paper. The general space we are working in is a cross-space of perception and (grasping) action. We represent the perception side using 3D surfling features. 3D surfling features describe small surface patches in terms of a pose. In addition, we introduce a granularity measure that depicts the size of the features. Based on this description, we formalise 3D surfling features as Π_σ = {SE(3)} (see Fig. 6a). σ depicts the granularity level of the feature. The granularity is measured in the number of sub-features that a 3D surfling feature relies on and hence is a measure of the surface area it covers. SE(3) depicts the 6D pose of the feature described in the Special Euclidean Group, SE(3), hence the name. With the description of the basic 3D surfling feature on the perception side in place, we introduce the concept of feature relations. Feature relations are essentially combinations of multiple features (3D surflings), described through their spatial and/or perceptual relationship, that allow for a set of higher-level features.
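The surfling formalisation Π_σ = {SE(3)} can be sketched as a simple data structure. The class name, field layout and the convention that the first frame axis holds the surface normal are our assumptions for illustration, not the implementation used in the paper:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Surfling:
    """A 3D surfling: a small surface patch with a 6D pose and granularity σ."""
    position: np.ndarray        # (3,) patch centre in world coordinates
    rotation: np.ndarray        # (3, 3) rotation matrix; column 0 = surface normal
    granularity: int            # σ: number of sub-features the patch aggregates
    is_boundary: bool = False   # β: if True, the first axis points toward the boundary

    @property
    def normal(self) -> np.ndarray:
        # By the convention assumed here, the first frame axis is the normal.
        return self.rotation[:, 0]
```

The boundary variant Π_σ,β introduced below would simply set `is_boundary=True` with the first axis oriented towards the boundary.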
One motivation for introducing the concept of feature relations is to compensate for the ambiguity in the 3D surfling feature pose, because the pose is derived from a principal component analysis of the underlying sub-features (see Figs. 6a and 6b). The result is an unambiguous surface normal, but the other components of the orientation are ill-defined. Hence we need other means to define a stable orientation of a 3D surfling feature.
By introducing feature relations, we add information through the spatial relationships between features, which theoretically compensates for the uncertainties in the original pose. Moreover, we gain local structure information when we combine multiple features and hence achieve a more expressive visual representation. By means of feature relations, we create a representation in which we can derive robust structures for predicting action affordances despite the simplicity of the basic building blocks. A complementary approach to tackling the issue of pose ambiguity in the basic building block is to introduce a more elaborate or expressive feature through additional levels of semantics. We therefore introduce a boundary feature, where the pose is decided by the direction towards a given boundary. The boundary surfling is described by Π_σ,β = {SE(3)}, where β denotes that it is a boundary surfling and, by definition, the first axis of the pose frame is directed towards the boundary, see Figs. 6b and 6c.
Based on these basic 3D surfling features, we introduce a notation for feature relations in equation 1, where N denotes the number of combined features, also referred to as the order of the relation, and σ denotes the granularity of the features it relies on. The function f transfers a combination of features into a parametrisation depending on the order and abstraction. To exemplify the transfer, we describe a feature relation of second order based on generic 3D surflings (an illustration of such a feature relation is shown in Fig. 7), which is parametrised as described in equation 2. The angles α_1 to α_3 and the distance d_1 are defined as depicted in Fig. 7, whereas the feature relation coordinate system is described in world coordinates.
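A second-order relation of this kind can be sketched as follows. Since the exact definitions of α_1 to α_3 and d_1 follow Fig. 7, the concrete choices below (the angle between the two normals, the angles between each normal and the connecting vector, and the Euclidean distance) are one plausible reading rather than a verified reproduction of equation 2:

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two (non-zero) vectors."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def relation_parameters(p1, n1, p2, n2):
    """Parametrise a second-order surfling relation as (α1, α2, α3, d1).

    p1, p2: patch positions; n1, n2: patch normals. The angle definitions
    are assumptions made for illustration (see the note above).
    """
    d_vec = p2 - p1
    d1 = float(np.linalg.norm(d_vec))   # distance between the two patches
    alpha1 = angle(n1, n2)              # angle between the two normals
    alpha2 = angle(n1, d_vec)           # normal 1 vs. connecting vector
    alpha3 = angle(n2, d_vec)           # normal 2 vs. connecting vector
    return alpha1, alpha2, alpha3, d1
```

A parametrisation of this form is invariant to rigid motions of the pair, which is what makes relations reusable across object poses.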

Action representation
Until now, we have not covered the action side of the perception × action space that we want to investigate. For this, we introduce grasping actions as an example. We define a minimalistic grasping action in equation 3, which essentially describes a target action pose in world coordinates (SE(3)^A_W) and an evaluation of the grasp outcome (E). The evaluation can theoretically take any value, but for the grasping case in this paper, we utilise a binary description. Other parameters, such as preshape joint angles of the gripper, could also be added to obtain a more elaborate action description.

Linking perception and action
In the final step, we link the perception part with the action part. Instances of the combined representation will be referred to as particles and denoted ρ, as depicted in equation 4, and described in a condensed form using ρ's with superscripts A (for action) and P (for perception), respectively.
A linked particle based on the previous examples of perception (equation 2) and action (equation 3) is presented in equations 5 to 6, where SE(3)^A_P is a condensation of the poses from the different domains into a single pose, in which the action is described in terms of the coordinate system of the perception side. In Fig. 8, an illustration of a particle is shown for two different levels of perception.
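The pose condensation into SE(3)^A_P amounts to expressing the action pose in the frame of the feature relation; a minimal sketch with 4x4 homogeneous transforms (the function name is ours):

```python
import numpy as np

def link_particle(T_rel_W, T_act_W):
    """Condense a feature-relation pose and an action pose into SE(3)^A_P.

    T_rel_W: (4, 4) feature-relation frame in world coordinates.
    T_act_W: (4, 4) action target pose in world coordinates.
    Returns the action pose expressed in the feature-relation frame,
    i.e. T_rel_W^{-1} @ T_act_W.
    """
    return np.linalg.inv(T_rel_W) @ T_act_W
```

Expressing the action relative to the perception side is what makes a learned particle transferable: the same relative grasp can be re-instantiated wherever a matching feature relation is observed.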

Learning algorithm
In this section, the algorithm for learning and applying the visually predicted action affordances is explained. An overview of the process is shown in Fig. 9. The figure covers the steps from the Object/Action environment through a data creation process, a learning process whose results are stored in an Action Perception database, and finally a prediction step where the knowledge is used to predict actions to be performed in the Object/Action environment. In the following subsections, the different components shown in the overview diagram are covered: first we describe the data creation process (section 4.1), next the learning phase is explained (section 4.2), and finally the utilisation of the learned knowledge for predicting actions is described (section 4.3).

Data creation
The data creation process relies on the formalism defined in section 3.1, where the two domains, action and perception, are combined. From the Object/Action environment, we acquire evaluated action information as well as visual information, in terms of extracted 3D surfling features, for the training set objects. From the features, we compute feature relations and then link the two domains together such that the action is defined with respect to the feature combination (see equation 6).
The linking procedure is explained in algorithm 1. Note that for every particle ρ, a random action and a random feature relation are chosen and combined into a particle. The random selection is introduced due to the intractability of exhaustively combining feature relations and actions. In the combination step, additional constraints such as, e.g., locality (the action target pose should be close to the feature relation pose) could be added.

Algorithm 1: Combining feature relations with actions.
Input: FeatureRelations ρ^P, Actions ρ^A
Output: Particles ρ
1 N ; // Number of particles we use
2 i = 0;

A fundamental part of the data creation process is the input actions. Such actions could be provided from various sources, e.g., real-world experiments, simulation, hand-labelled data or human demonstration. The desirable properties of the input actions are that they provide reasonable coverage and a reasonable success rate for a given situation. In this work, we approach the data creation with a simulated environment, which allows for a more explorative approach compared to real-world experiments. We utilise visually extracted surfling features as a bias for proposing the input action set. In Fig. 3, a number of examples are shown of how features can act as a bias for proposing candidate actions in the grasping case. That said, the action candidate creation is likely to be very dependent on the type of action. The input actions are then evaluated in simulation. Hereby, we retain some control over the number of input actions while also being able to guide the rate of success.
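The random pairing step of algorithm 1 can be sketched as follows (only the pairing itself; the optional locality constraint is left out, and all names are illustrative):

```python
import random

def create_particles(relations, actions, n_particles, seed=0):
    """Randomly pair feature relations ρ^P with evaluated actions ρ^A.

    Exhaustively combining relations and actions is intractable, so N
    particles are drawn by random selection instead. An additional
    constraint, e.g. locality of the action pose to the relation pose,
    could be checked inside the loop before accepting a pair.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    particles = []
    while len(particles) < n_particles:
        rho_p = rng.choice(relations)
        rho_a = rng.choice(actions)
        particles.append((rho_p, rho_a))
    return particles
```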

Neighbourhood analysis
In this section, the foundation for learning is described in terms of its different components. First the learning approach is presented, next a two-stage extension is introduced, and finally an optimisation of the learning outcome is considered.

Algorithm outline
The overall outline of the learning process is depicted in Fig. 10. This illustration encapsulates the steps from the feature extraction and action creation to the establishment of an action perception database in terms of particles ρ.

Figure 10: Overview of the learning process; note the two-stage neighbourhood analysis, performed initially on the instance level and finally on the combined set. On the instance level, the diagram shows how the visual features are first extracted, then utilised as a bias for the candidate grasp creation and finally evaluated. Secondly, the features are used to compute feature relations. Given the computed feature relations, ρ^P, and the evaluated grasps, ρ^A, the two are linked to form particles ρ. Then the instance-level neighbourhood analysis is performed, before the global neighbourhood analysis merges the acquired knowledge from the instances. Finally, the results are stored in the ActionPerceptionDB.

The core of the learning process is a neighbourhood analysis, which is illustrated in Fig. 11. The first step is to find the set of supporting particles in the neighbourhood, formally described by A_k in equation 7. Based on this set of particles, the two measures probability and support are computed. The support, s_k, is given as the size of the set inside the neighbourhood (equation 8), and the probability, P_k, is defined as the average success probability within the neighbourhood (equation 9).
As we will show in the results section, both variables are essential for the efficient prediction of affordances.
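The neighbourhood measures of equations 7-9 can be sketched as follows. The representation of particles as (parameter vector, success flag) pairs and the elementwise distance function are assumptions made for illustration:

```python
def evaluate_neighbourhood(particles, dist, t, i):
    """Support s_k and probability P_k for particle i (sketch of eqs. 7-9).

    particles: list of (parameter_vector, success) pairs.
    dist: function returning the elementwise distance vector between two
          parameter vectors.
    t: threshold vector defining the neighbourhood per dimension.
    """
    centre = particles[i][0]
    # A_k: all particles whose distance is below the threshold in every dimension
    neighbours = [(p, e) for (p, e) in particles
                  if all(d < th for d, th in zip(dist(centre, p), t))]
    s_k = len(neighbours)                                   # support (eq. 8)
    p_k = sum(e for _, e in neighbours) / s_k if s_k else 0.0  # probability (eq. 9)
    return s_k, p_k
```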
Given these two measures, we have a description of the action perception space in terms of success-outcome likelihood and the support for this likelihood. The latter can also be seen as the particle density in the neighbourhood. From a formal point of view, we go from particles in the form of equation 5 to evaluated particles of the form expressed in equation 10.
The elementwise Dist function in equation 7 is used to decide whether the particle ρ_k is in the neighbourhood of ρ_i. For the distance computation, we split SE(3)^A_P from equation 6 into a rotational part, described by a quaternion q, and a positional part (x, y, z), described by three components. The distance is computed in the individual dimensions of the parametrisation, with the exception of the orientation part of the SE(3)^A_P pose, which is computed as the shortest angular distance between the orientations of ρ_k and ρ_i. Using a quaternion representation, the computation can be done with the formula in equation 12, where ⟨q_1, q_2⟩ denotes the inner product of the two quaternions q_1 and q_2. This approach ensures a well-defined neighbourhood for the rotational part of the pose, as compared to a distance measure on the full quaternion parametrisation, which only resembles the angular distance in a suboptimal way. For the other parameters in the space this is not a problem; hence for these a direct subtraction is used.
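A common form of the quaternion angular distance referred to in equation 12 is θ = 2·arccos(|⟨q1, q2⟩|); since equation 12 is not reproduced here, this particular form is an assumption. Taking the absolute inner product makes q and -q equivalent, so the result is well defined on the rotation group:

```python
import math

def quat_angle(q1, q2):
    """Shortest angular distance (radians) between two unit quaternions,
    given as 4-tuples (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))  # clamp guards against rounding
```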
In equation 13, the distance computation is expressed between two particles of the type described in equation 6.
It should be noted that the comparison operator (<) in equation 7 is an elementwise comparison of the distance vector (see equation 13) and the threshold vector t. For it to evaluate to true, all of the elementwise comparisons must be true.
The basic process for performing a neighbourhood analysis is captured by algorithm 2.
Algorithm 2: Neighbourhood analysis.
Input: Particles ρ
Output: ActionPerceptionDB, ρ_DB
1 t = Compute threshold;

The decisive parameter when performing a neighbourhood analysis is the choice of "neighbourhood" or vicinity, expressed as the threshold vector t in equation 7. We propose two options for setting the threshold t: a manual choice and an automatic choice. The manual approach involves setting a fixed threshold for each individual dimension based on common sense and then enabling a scaling of the fixed parameter vector t by a scalar multiplier, M_m (see equation 14).
The automatic setting is based on a rule of thumb from kernel density estimation. Scott [27] proposed such a rule (see equation 15). The estimated threshold or bandwidth, t_scott, depends on the number of instances in the data, n, the dimensionality of the space, d, and the estimated standard deviation of the data points within the dataset, σ. It should be noted that the dimensions of the vectors t and σ depend on the parametrisation used for the particles ρ.
We can then use Scott's rule as a guideline for the ratio between the distances in the different dimensions. To adjust the neighbourhood distance, we introduce an additional scaling parameter, M_s, similar to the multiplier used for the manually defined threshold.
t_scott,M = M_s · t_scott (16)

In the Appendix, a comparison of an automatically versus a manually set threshold is carried out. There it becomes apparent that there might be a gain in prediction performance from choosing an appropriate manual threshold. Although there is a small gain, it is unlikely that the effort is worth it, especially when considering even more advanced visual representations of higher dimension.
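Scott's rule (equation 15) with the additional multiplier M_s of equation 16 can be sketched per dimension as t_i = M_s · σ_i · n^(-1/(d+4)); the function name and interface are ours:

```python
import numpy as np

def scott_threshold(data, M_s=1.0):
    """Per-dimension neighbourhood threshold via Scott's rule of thumb.

    data: (n, d) array of n particle parameter vectors of dimensionality d.
    M_s: optional scaling multiplier applied to the estimated bandwidth.
    Returns a length-d threshold vector.
    """
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    sigma = data.std(axis=0, ddof=1)        # estimated per-dimension std dev
    return M_s * sigma * n ** (-1.0 / (d + 4))
```

Because the rule scales each dimension by its own standard deviation, it mainly fixes the ratio between the per-dimension thresholds, with M_s tuning the overall neighbourhood size.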

Two-stage neighbourhood analysis
As displayed in the overview diagram (see Fig. 10), the neighbourhood analysis is performed as a two-stage process. This is motivated by the desire to decrease the computation time. The cost of performing the neighbourhood analysis is related to the number of particles (see equation 4), n, due to the reliance on the KD-tree data structure. The computational cost of a single search query in a KD-tree is O(log n), where n is the number of nodes in the tree; since we need to perform one search for each of the n particles to find its neighbours, the total computational cost adds up to O(n · log n). We can reduce the computational complexity by decreasing the number of particles on which we perform the neighbourhood analysis.
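The per-dimension test of equation 7 can be mapped onto a standard KD-tree query by rescaling each dimension with its threshold, which turns the neighbourhood into a Chebyshev ball of radius 1. A sketch using SciPy, with random placeholder particles and assumed threshold values:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
particles = rng.random((500, 4))       # stand-in particle vectors
t = np.array([0.05, 0.05, 0.1, 0.1])   # assumed per-dimension thresholds

# |d_i| < t_i for all i  <=>  max_i |d_i| / t_i < 1,
# i.e. a Chebyshev (p = inf) ball of radius 1 in rescaled coordinates.
scaled = particles / t
tree = cKDTree(scaled)                 # O(log n) per query, O(n log n) overall
neighbours = tree.query_ball_point(scaled, r=1.0, p=np.inf)

# Neighbour count per particle, excluding the particle itself.
support = [len(nb) - 1 for nb in neighbours]
```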
In an initial stage, we perform a neighbourhood analysis on the particles from the individual objects in the full dataset. This partitioning provides us with a set of significantly smaller neighbourhood problems instead of a single large problem. Having a set of smaller, independent problems also facilitates a parallelisation of the first stage. By filtering the output particles of the first stage before performing the second neighbourhood analysis on the combined problem, we can drastically improve the computation time. One way of filtering away "unpromising" particles is to set up a criterion for the minimum support that a particle should have for it to be taken into account. Such a filter could be expressed in absolute, average or median values of the support in the dataset. There are, however, some pitfalls when using support as a filtering parameter, namely the risk of filtering away the diversity in the particles. This aspect of the learning is addressed in the results (section 6.4), where different levels of support filtering have been applied to verify the effect on the prediction outcome. In practice, the introduction of support filtering in the neighbourhood analysis amounts to a small extension that removes particles below a certain support threshold from the final dataset.
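The support-filtering extension mentioned above might look as follows; the median criterion and all values are assumptions for illustration:

```python
import numpy as np

def filter_by_support(particles, support, min_support):
    """Keep only particles whose support reaches min_support.

    particles:   (n, d) array of first-stage particles
    support:     (n,) per-particle support counts
    min_support: the cutoff; an absolute value, or e.g. the median support
    """
    keep = support >= min_support
    return particles[keep], support[keep]

# Illustrative use with a median-based criterion on random placeholder data.
rng = np.random.default_rng(2)
particles = rng.random((100, 4))
support = rng.integers(1, 20, size=100)
kept, kept_support = filter_by_support(particles, support, np.median(support))
```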

Prediction
In order to apply the learned data in novel situations, two different methods have been applied: one, denoted "direct action proposition", where we look for similarities on the perception side and use these as direct cues for proposing actions, and a second, denoted "voting scheme", where we suggest a candidate list of actions and use the ActionPerceptionDB to vote for these actions. The two approaches are explained in the following subsections.

Direct action propositions
The direct action proposition approach is based on the assumption that our learned high-probability, high-support action-perception particles are descriptive enough for predicting actions. Initially, we extract feature relations, the ρ_P part of the particles, from the novel object and search for similar ρ_P parts in the ActionPerceptionDB. If we find a similar perception part with a high probability of success and a high level of support, we take its action part, ρ_A, attach it to our ρ_P part, and obtain a proposed action.
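A sketch of this lookup, reusing the rescaled KD-tree idea; the database contents, thresholds and cutoffs (min_prob, min_sup) are all hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree

def direct_action_proposition(db_P, db_A, db_prob, db_sup,
                              query_P, t, min_prob=0.7, min_sup=5):
    """Return the action parts rho_A of DB particles whose perception part
    rho_P is similar to the query and has high probability and support."""
    tree = cKDTree(db_P / t)
    idx = tree.query_ball_point(query_P / t, r=1.0, p=np.inf)
    idx = [i for i in idx if db_prob[i] >= min_prob and db_sup[i] >= min_sup]
    return db_A[idx]

# Tiny illustrative ActionPerceptionDB (all values made up).
db_P = np.array([[0.0, 0.0], [1.0, 1.0], [0.02, 0.0]])
db_A = np.array([[10.0], [20.0], [30.0]])
db_prob = np.array([0.9, 0.9, 0.5])
db_sup = np.array([10, 10, 10])
proposed = direct_action_proposition(db_P, db_A, db_prob, db_sup,
                                     np.array([0.0, 0.0]), np.array([0.1, 0.1]))
# Only the close, high-probability neighbour's action survives.
```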
Given its simplicity, the direct action proposition approach has some limitations. The main problem is that the approach relies heavily on a discriminative perceptual representation in order to make reliable predictions. If the perceptual representation is too simple, a particular simple relation can predict very different actions depending on the object it was learned from. This problem should eventually disappear when a more descriptive perceptual representation is utilised. We therefore introduce a second approach, the voting scheme. For comparison, experiments have been carried out with the direct action proposition method (see Appendix A), where the prediction performance and the limitations of the method are presented.

Voting scheme
The principle behind the voting scheme is that we want to utilise our learned ActionPerceptionDB as a means to vote for a set of candidate actions. Hereby we utilise multiple perception descriptors to predict the action outcome of a single candidate action, and by that improve the robustness of the prediction. The voting procedure is formalised in algorithm 3. The process is very similar to the actual learning phase; however, where we in the learning phase "forget" the origin actions when we combine them with the perception part, ρ_P, we remember them in the voting scheme. This allows for a final step in which we can project a prediction probability back to the origin candidate action, and thereby give a prediction based on multiple perception-action particles. In Fig. 12, an example is presented where we utilise multiple feature relations (Figs. 12d to 12g) to vote for a single candidate action (Fig. 12h).
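The back-projection step can be sketched as below; how the per-particle probabilities are combined is not prescribed here, so the simple average is an assumption of this sketch:

```python
import numpy as np

def vote(particle_votes):
    """Combine the success probabilities of all particles voting for each
    candidate action into one predicted probability (averaging is assumed).

    particle_votes: dict mapping candidate-action id -> list of success
    probabilities of the perception-action particles voting for it.
    """
    return {a: float(np.mean(p)) for a, p in particle_votes.items()}

# Mirroring the Fig. 12 idea: four feature relations vote for one candidate.
predicted = vote({0: [0.9, 0.8, 0.85, 0.95]})  # predicted[0] ≈ 0.875
```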

Setting
In this section, the settings for the experimental work are explained. They involve the object data set (section 5.1), the simulation environment (section 5.2), the feature extraction (section 5.3), the visually biased action sampling (section 5.4) and details regarding the action and perception parametrisation (section 5.5).
[Fig. 12 caption, continued: (d), (e), (f) and (g) show feature relations that are used to vote for the candidate action; the probabilities shown below are those found in the database. Given the example probabilities, the combined probability for the candidate grasp is shown in (h).]

Object set
In Fig. 13, the object sets used in the experiments are shown.

Simulation environment
The experiments in this paper are all performed in a simulated environment utilising the robotics library RobWork [30]. RobWork is used to create a realistic environment that facilitates simulated sensors (such as RGB-D sensors and stereo cameras) as well as simulation of dynamics. The dynamics simulation is carried out using the associated simulation environment, RobWorkSim [31]. Fig. 14 shows a view of a dynamic grasp simulation with the Schunk SDH-2 hand and a pitcher. The grasping simulations are performed in a free-floating world where gravity is not taken into account, since this facilitates grasping from every direction. It should be mentioned that although gravity is not taken into account, other forces acting between the gripper and the object are simulated. The masses of the objects used in the simulations vary between 0.2 kg and 0.6 kg, estimated based on their size. The friction coefficient between the gripper fingers and the objects is set to µ = 0.2, corresponding to the friction between rubber and plastic. In Appendix A.3, a set of additional experiments, performed in a table scenario, is presented where gravity is taken into account. These results show a high degree of similarity to the results achieved in the free-floating scenario. Due to this similarity, and since the focus of this work is on exploring the perceptual representations, we use the free-floating scenario in our experiments.

Feature extraction
An essential part of the setting is the feature extraction from the simulated environment. In Fig. 15, our setup of RGB-D sensors is displayed. Three sensors surrounding the object and an additional sensor from below provide an approximately full view of the objects in the centre.
Although the simulated recording situation is not very common, there exist some robot set-ups with multiple cameras providing a rather complete scene representation (see, e.g., [32]). Our recording situation lies somewhere in between dealing with full and perfect CAD models and the common recording context with a single camera. Since humans have mechanisms to extract rather complete 3D representations in an observation context, by either merging different views or by matching a complete representation existing in memory to a given scene context, one can assume that affordance reasoning in humans can make use of information beyond what is directly visible from one viewing direction. For example, the detection of complex features such as 'wall features' as second order relations of basic surface features is only possible when the inside and outside of an object are taken into account.
We are aware that this is a compromise between different possible options for a set-up. It allows for affordance reasoning on complete representations in which, however, controlled amounts of sensory noise are still present, introduced by the simulated RGB-D sensor as well as by the fact that the scene is observed from only a small set of viewing angles. Based on the simulated setup in RobWork, we are able to extract the 3D surfling features at different granularities and with added semantics. An example of the feature extraction of surflings at four different granularity levels is visualised in Fig. 16. Furthermore, the extracted features are shown both with and without the added semantics for boundary features. The boundary features are shown with an additional vector depicting the direction of the boundary.

Action sampling
The action sampling, biased through the visually extracted features, is a prerequisite for learning the grasp affordances in an automatic way, since it ensures a reasonable chance of success as well as a limit on the number of considered actions; see Fig. 3 for an overview of potential biases. We propose two template grasp types for the sampling, visualised in Fig. 17: one denoted the SidePinchGrasp and the other denoted the TopGrasp. The SidePinchGrasp has a rather narrow opening between the two fingers, such that it can grasp within a container, and the TopGrasp has wide-open fingers to make an encompassing grasp of larger objects. We create a set of candidate grasps by means of extracted 3D surfling features with a small feature size, such that we can achieve a reasonable coverage of the objects. Based on the features, we propose a set of template grasps by rotating them in 32 steps around the feature normal. From this sampling we achieve an average
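The rotation of a template grasp around the feature normal can be sketched with Rodrigues' formula; the plain 3x3 orientation used here is a simplification of the actual grasp parametrisation:

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation matrix about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def sample_candidate_grasps(feature_normal, template_orientation, steps=32):
    """Rotate a template grasp orientation around the feature normal in
    `steps` equal increments."""
    return [rotation_about_axis(feature_normal, 2.0 * np.pi * k / steps)
            @ template_orientation for k in range(steps)]

grasps = sample_candidate_grasps(np.array([0.0, 0.0, 1.0]), np.eye(3))
```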

Parametrisation of feature relations
Throughout the experiments, we rely on a limited set of feature relation types, namely first and second order relations with different levels of boundary semantics. In equations 17 to 22, the different parametrisations are presented. The reason for limiting ourselves to first and second order combinations is partly the exponential combination explosion that our approach exhibits due to its simplicity. When utilising higher order combinations, three or more, the number of possible combinations of visual features and actions during the learning phase becomes intractable to cover exhaustively. This relates directly to the parameter space, which also grows; if we are not able to direct or limit the space, for instance by heuristics like the boundary feature, sparsity becomes a problem.
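The explosion can be made concrete by counting unordered feature combinations; the feature count per object is an assumed, illustrative number:

```python
from math import comb

F = 200  # assumed number of extracted features for one object
for N in (1, 2, 3):
    print(N, comb(F, N))
# 1 200
# 2 19900
# 3 1313400
```

Already at third order, each object contributes over a million candidate combinations, before the action space is even crossed in.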
In Fig. 18, visualisations are shown of the different types of feature relations used in the experiments. Note that only four different feature relations are visualised. The reason is that the parameters for equations 17 and 18 are identical, the only difference being that the feature in equation 18 is known to be a boundary feature. The same holds for the two cases in equations 20 and 21. The parametrisation covers three first order cases: a plain feature (Υ^σ_1), one where the feature is known to be a boundary feature (Υ^{σ,β̄}_1) and one where we utilise the boundary semantic with direction (Υ^{σ,β}_1). As for first order, we introduce a parametrisation for three second order cases: one without semantics (Υ^σ_2), one with knowledge of a boundary but not its direction (Υ^{σ,β̄}_2) and finally one with boundary semantic and direction (Υ^{σ,β}_2).

Results
The results section is divided into four subsections. In section 6.1, we present the outcome of the learning phase in terms of the associated support and probability of the evaluated particles. In section 6.2, we present the core results, comparing the prediction performance when features at different granularities, different levels of abstraction and different semantics are input to the voting scheme. Subsequently (section 6.3), a qualitative analysis of the results is presented. Finally (section 6.4), we present results regarding the impact of support filtering. In the experimental work, the different object sets have been split into two classes, such that the learning from the first class is applied on the second and vice versa. In the Appendix, a number of additional results are presented, primarily focussing on methodological aspects such as automatic vs. manual threshold, the direct prediction method and table vs. free-floating simulations.

Learning outcome
In order to examine the learning outcome before it is used for prediction, we visualise the frequency of occurrence of the evaluated particles (see equation 6) in terms of support and probability. Fig. 19 shows the distributions as 2D histograms for the different parametrisations described in equations 17 to 22, where the colour depicts the frequency. The colouring is based on the log10 transform of the actual frequency in the area to allow for a visible distinction. As a comparison, a histogram corresponding to Fig. 19a but without the log10 transformation of the frequency is shown in Fig. 20. In this plot, we only see that the majority of the particles have low support and probability. When assessing the 2D histograms in Fig. 19, we can acquire indications of the predictive power of the different visual representations. We see a shift towards the higher probability areas when the order is raised or semantics are added to the feature relation, e.g., compare Fig. 19a with Fig. 19f. This change is reflected in the prediction results presented later (see Fig. 23).

Core experiments
The outcome of the voting method (section 4.3.2) is a set of candidate actions with associated predicted probabilities. To discretise these outcomes, which allows for a comparison with the binary grasp outcome from simulation and hence a quantification of the performance, we introduce a probability selection threshold. We vary the actual value of the threshold between the extremes, resulting in the plots in Figs. 21-23. In order to assess the prediction results, we present two different average measures of the prediction success over the object set, in addition to a measure of the percentage of grasped objects from the set:
- Avg-1 - An average computed over all the objects in the set, independent of whether feature combinations leading to any grasp prediction were found for a certain object. If no prediction was found, the object contributes to the average with a success rate of zero. This average type is plotted with a full line.
- Avg-2 - An average computed over the average success prediction for only those object instances where a prediction was found. This average type is plotted with a dashed line.
- random - The average chance on the object set of randomly getting a successful outcome given the candidate actions. This measure is plotted with a dashed black line.
- Coverage - A measure of the percentage of objects from the object set that have been grasped for a given selection threshold. This measure is shown with a dashed-dotted line.
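For a fixed selection threshold, the per-object measures can be computed as sketched below; the input format (one success rate per object, None when no prediction passed the threshold) is an assumption of this sketch:

```python
def prediction_measures(per_object_success):
    """Avg-1, Avg-2 and coverage for one selection threshold.

    per_object_success: one entry per object -- the success rate of its
    predicted grasps, or None if no prediction passed the threshold.
    """
    found = [s for s in per_object_success if s is not None]
    n = len(per_object_success)
    avg1 = sum(found) / n if n else 0.0            # objects without predictions count as 0
    avg2 = sum(found) / len(found) if found else 0.0
    coverage = len(found) / n if n else 0.0
    return avg1, avg2, coverage

a1, a2, cov = prediction_measures([0.8, 1.0, None, 0.6])  # ≈ 0.6, 0.8, 0.75
```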
When assessing the result plots, there are multiple aspects to consider when identifying a good result. One aspect is the difference between random chance and the top point of the predictions; another is how well a change of the moving threshold to a higher value is reflected as a higher rate of success prediction. The result plots show in general a drop in the success predictions after a top point. The reason is that the number of predicted grasps after the top point drops drastically, because no or very few grasps are found with a prediction rate higher than the selection threshold at the top point, enabling outliers to have a strong impact.
[Fig. 19 caption, continued: Note that the areas with high support and high probability in the action-perception space increase when a more elaborate perceptual representation is used, e.g., compare (a) and (f). This change indicates that the latter representation has more prediction potential. The number of particles in the databases ranges from ∼250,000 to ∼400,000.]
Finally one should note the ability to predict grasps for the full object set, which is covered by the percentage of objects that have been grasped by a certain selection threshold.

Box objects:
The results for the box object set are presented in Fig. 21. The plots show results where the two dimensions "order" (denoted N, equation 1) and "feature granularity" (denoted σ, equation 1) were varied.
From the results we derive: (1) When the order is increased from N=1 to N=2, we see a clear improvement of the prediction rates when comparing the top points of the six lines in the plot. This is explained by the added knowledge introduced by a more complex visual feature. (2) When the feature size is changed, small changes in the performance are observed. For the first order case, we see the best performance with a medium-sized feature, whereas there is little or no difference when comparing the second order cases at different granularities. Variations based on the used feature granularity are related to the ability of a given feature size to represent the object with adequate accuracy. (3) The object set seems well covered, as there are predictions for all objects up to a selection threshold of 0.90, where a drop is seen for N=1 at granularity 15.

Round objects:
The experimental results acquired for the round object set are shown in Fig. 22. As above, the plots show results where the two dimensions "order" and "feature granularity" were varied. We see: (1) When the order is increased from N=1 to N=2, a clear improvement is seen in the prediction rates, specifically when observing the top points of the plots. This is explained by the information gain from a second visual feature, and (2), when the feature size is varied, we see small changes in the performance for the first order case, whereas we see a clear dis[…]ary without direction (comparing the red and blue lines, and comparing the orange and yellow lines), although we have a better object set coverage, as the full line results in a higher success probability. A significant improvement of the success prediction rating is, however, achieved for second order relations with boundary and direction (brown line). We see, however, a small drop when we reach the higher end of the selection filter. This can be explained by the fact that the voting method acts as a smoothing operator; hence high-prediction areas will in general occur rarely. When we compare the results acquired for the different granularities, we see a similar outcome as in Figs. 21 and 22. The results for the percentage of grasped objects show some interesting patterns. In general, the parametrisations with N=2 show a rather good coverage, with a percentage between 0.6 and 1.0. In particular, the most elaborate representation (brown line) shows close to full coverage. The parametrisations with N=1 show in general less coverage, as the generalisation is worse. For the parametrisation of N=1 with boundary and direction, it is seen that even at the lowest selection threshold only around half the objects are covered, which tells us that although we have a good prediction (dotted green line), the generalisation over the object set is not convincing.

Qualitative analysis of the power of semantic information
In order to illustrate the performance gain obtained when we introduce the boundary semantic, we present a visualisation of the ActionPerceptionDB for the three first order cases. The visualisations are shown in Fig. 24. In the centre, a surfling feature is placed, and the coloured area around the feature represents how the actions are distributed with respect to the pose of the feature. The colour coding of the actions depicts the likelihood of success for that particular particle. For Υ^5_1 we see a uniform distribution of the success probability, whereas for Υ^{5,β̄}_1 we see two rather uniformly coloured areas. Noticeable is an inner part with a higher success likelihood compared to the outer part. This is explained by the added knowledge of the boundary, specifically by the fact that, at the boundary, a successful action will be closer to the feature; hence the inner circle captures both successful and unsuccessful boundary grasps, whereas the outer part mostly captures the non-boundary actions.
When assessing Υ^{5,β}_1, it becomes obvious what we gain by introducing the direction towards the boundary. The visualisation shows a high likelihood of success along the direction of the boundary; the further a grasp is located, orientation-wise, from the boundary direction, the lower the observed success likelihood.
To visualise how the power of the semantics manifests itself when applied for predicting actions, a visualisation of the distribution of predicted grasps for an object is shown in Fig. 25. The figure shows the prediction result for a pitcher, where the order and level of semantics are varied. One can easily notice how the introduction of boundary and direction information for both the first and second order cases allows for high-success areas at the boundary of the pitcher.

Support filtering
In order to investigate the impact of the support filter, a series of experiments based on the open object set have been performed, in which the number of particles used from the first stage of the neighbourhood analysis is varied. We filter by choosing the zeroth to the tenth decile of the particles based on their support, e.g., split the first-decile lowest-supported particles from the highest-supported particles and then utilise the highest-supported part. Hereby we cover the extreme situations, from using every particle to using very few. The acquired results are presented in Fig. 26. Note that the support level is described as a measure between zero and 1.0.
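The decile-based split can be sketched as follows; the particle data are random placeholders:

```python
import numpy as np

def decile_filter(particles, support, decile):
    """Keep particles whose support lies at or above the given decile (0..10).

    decile=0 keeps every particle; decile=10 keeps only the maximally
    supported ones.
    """
    cutoff = np.quantile(support, decile / 10.0)
    return particles[support >= cutoff]

rng = np.random.default_rng(3)
particles = rng.random((1000, 4))
support = rng.integers(1, 50, size=1000)
kept = decile_filter(particles, support, 5)  # keep roughly the upper half by support
```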
From the results, three main points are derived: (1) When assessing the results for Avg-1 for the four cases Υ_1, Υ^β̄_1, Υ_2 and Υ^β̄_2, the observed pattern shows that a lower support filter results in a higher success rate, although only at lower selection thresholds. When comparing the results of Avg-1 with Avg-2 for the same four cases, it is noticed that a larger support level results in a higher success rate for the instances that are found. This is in particular seen for Υ_1 and Υ_2 as the selection threshold increases towards 1.0. This result indicates that, with a higher support level, very good predictions for a subset of the objects can be derived.
(2) When assessing the Υ^β_1 results, the pattern is significantly different. For Avg-1, the prediction results show similar performance independent of the applied support level, the only exception being the highest support level, where the performance degrades at low selection thresholds. The results for Avg-2 show that if a prediction is found, then a higher success rate is achieved when a high support level is used.
(3) When assessing the results for Υ^β_2, the recognised pattern for both averages, Avg-1 and Avg-2, shows similar performance with a small advantage at the higher support levels. Especially at the two highest support levels, an improved performance is noticed. The reason for the improvement is related to the predictive power of the representation: if we are able to find particles of high support and high success probability, then, when filtering for a high support level, we keep these "good" particles. To summarise the outcome of the support filter experiment, it can be observed that for the less elaborate feature representations, good predictions can be found for individual object instances at a high support level, whereas generalisation is in general not observed when utilising many instances (a low support level). For the more elaborate visual representations, it becomes evident that we are able to achieve an improved performance and still retain the generalisation when using a higher support level. This result indicates that there indeed exist particular feature relations which are predictive for grasping in the provided visual representation.

Summary and conclusion
In this paper, we have introduced a method for finding combinations of visual features that are predictive for actions.The method has been exemplified for the problem of learning grasping actions.We have performed an analysis of the cross space of perceptual features and grasping actions with special focus on how an enrichment of the perception side leads to improvements of the derived prediction.
Through the performed investigations, we have been able to learn actions with a high likelihood of success for three different object classes, namely box-like, round and open objects. For the box and round object sets, we were able to reach a grasp prediction success of up to 0.90 and 0.80, respectively, when utilising a second order feature constellation as a perceptual descriptor. This high success rate should be seen in the context that grasping those objects is a rather simple task. For the more difficult open object set, we investigated, in addition to the granularity and order of the feature combinations, also the impact of additional semantic information attached to the features through boundary information. Here, we were able to achieve a success rate of up to 0.75 when second order features with added semantics were utilised on the perception side.
By that, we have replaced the manual design of affordances, as done in [2], by learning. We could confirm that relatively high success rates are possible for action-feature associations built by means of rather basic features. Moreover, and most importantly, we have shown how the structure of the feature space influences the results of the algorithm. For that, we investigated three important dimensions of a feature space motivated by the visual hierarchy of the human visual system: granularity, order of features and semantic abstraction. Since our approach is not restricted to grasping, in future work we plan to apply our algorithm to other action affordances.

A Learning methodology experiments
In the following subsections, two aspects of the learning approach and one aspect of the simulation scenario will be investigated: (1) the prediction results when the direct action proposition approach (see section 4.3.1) is applied, (2) the difference between an automatically and a manually set threshold (see section 4.2.1), and (3) a comparison between a free-floating environment and a table environment with gravity acting, for grasp simulation (see section 5.2).

A.1 Direct action proposition approach
As a comparison to the voting scheme (see section 4.3.2), a number of experiments were performed using the direct action proposition method (see section 4.3.1). The experimental results are presented in table 1. Compared to the results presented when utilising the voting method (see section 6.1), these results are evaluated with a single measure depicting the success prediction. In the experiments, the order and granularity were varied for the box and round object classes, whereas for the open object class the level of semantics was varied in addition. For the box and round objects, two things can be observed: (1) A larger feature size improves the success rate for the first order cases, whereas it degrades the success rate for the second order cases, and (2) the success rate is in general higher for the second order cases. The improvement due to a larger feature is explained by the increased object knowledge. This information gain, however, seems to be counteracted by the addition of another feature, resulting in a degradation of prediction performance for the larger feature. For the open objects, three things can be observed: (1) The performance when utilising the representations without any semantics is very low; however, an improvement is noticed when going from the first order cases to the second order cases. (2) For the first order cases, a larger feature results in a better prediction rate. This is not the case for the second order cases, where the highest prediction rate is achieved at a feature size of 15.
(3) The highest overall prediction rate is achieved with a representation based on Υ^30_1. This essentially tells us that the information gain from a larger feature is superior to adding another feature when used in connection with the direct action proposition method.
Finally, when comparing the results with the voting method, the direct action approach shows lower performance, which is explained by the direct attachment of an action to a single perceptual representation, as opposed to the multiple particles that vote for a single action in the voting method.

A.2 Automatic vs. manual threshold
In this experiment, we show the impact of an automatically chosen threshold compared to a manually chosen threshold (see equations 14, 15 and 16 in section 4.2.1). In Fig. 27, the outcome of the experiments is shown for the three different object classes. We focus on the results with the highest abstraction and order, meaning Υ^β_2 for the open objects and Υ_2 for the box and round objects. For the open objects, we see an improved performance when the manual threshold is used. Both the top point of the curve and the consistency between the selection threshold and the prediction rate at the high end of the selection threshold show superior performance compared to automatic thresholding. For the box and round objects, the automatic threshold shows slightly better performance, as the top point has a higher success rate, although the curve drops earlier than for the manually selected threshold.
From these results, it can be derived that an automatically chosen threshold tends to smooth the data more. Hence, the correspondence between the selection threshold and the actual success outcome is suboptimal close to a selection threshold of 1.0. Although the manually chosen threshold shows better consistency between the selection threshold and the actual prediction, it comes at the cost of a lower top point and the need to manually define the threshold for the individual dimensions of the parametrisation.

A.3 Table vs. Free-floating scenario
In this experiment, a comparison between dynamic grasp simulations performed free-floating and in a gravitational field is carried out. By means of this comparison, we justify the usage of the "simpler" free-floating environment for the experiments performed in this work. To make this comparison, an experiment is performed on one object set, the open object set (see Fig. 13), and based on the results we discuss why we believe the simpler scenario is preferable in this context. Initially, the two different scenarios are explained.
The free-floating scenario, as explained in section 5.2, is a simulation performed with the object floating in space, enabling grasping from every direction without any gravity acting in the simulation. However, the dynamic forces between the manipulator and the object are still simulated. In contrast, the scenario with gravity, denoted the "table scenario", is limited by the fact that the object has to stand on a table perpendicular to the direction of gravity, hereby enabling grasping under gravity. In the free-floating environment there is no limitation on the ability to execute the grasps, whereas in the table scenario many potential grasps will initially be in collision with the table and therefore do not make sense to execute. In the next subsection, the setting for the experiments is presented, then the results, and finally the results are summarised and interpreted.

A.3.1 Setting
The general setting for the experiments in the free-floating scenario is as explained in section 5. The setting for the table scenario is slightly different. All the open objects are placed on a table in a stable position with the opening facing away from the table. The gravity is set to −9.82 m/s² along the normal of the table.
Given that the open objects are primarily graspable by the rim, this object placement still allows the objects to be grasped. Moreover, this can be seen as the "natural" pose of the objects. However, when we then filter grasps based on gripper collision with the table, we remove a lot of potential grasps from the equation, and due to the pose of the object this means that the probability of picking a successful grasp by chance increases significantly. This can be seen when comparing the random line (see dotted black line) in the two result plots of Figs. 29 and 30. In Fig. 28, the distribution of grasps is visualised for the pitcher object with and without the filtering. As intuition suggests, all the grasps at the bottom of the object are filtered away as they collide with the table. Although not obvious from the visualisation, approximately half the grasps are filtered away for this object.
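The effect of the collision filtering on the random baseline can be sketched as follows. The grasp totals match the counts reported for the pitcher object (22,000 free-floating, 12,794 after table filtering); the number of successful grasps is a hypothetical value chosen purely for illustration, under the assumption that successes concentrate at the rim and therefore mostly survive the filter.

```python
# Sketch: how table-collision filtering raises the random-baseline
# success probability. The totals are the counts reported for the
# pitcher object; the success count is a hypothetical assumption.

def random_baseline(n_success, n_total):
    """Probability of picking a successful grasp uniformly at random."""
    return n_success / n_total

n_total_free = 22000    # all sampled grasps (free-floating scenario)
n_total_table = 12794   # grasps surviving the table-collision filter
n_success = 4000        # hypothetical: rim grasps, assumed to survive filtering

p_free = random_baseline(n_success, n_total_free)
p_table = random_baseline(n_success, n_total_table)

print(f"free-floating baseline: {p_free:.3f}")   # ~0.182
print(f"table baseline:         {p_table:.3f}")  # ~0.313
```

Since the removed grasps are predominantly unsuccessful, the same number of successes is divided by a smaller total, which is exactly the upward shift of the dotted random line between Figs. 29 and 30.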

A.3.2 Results
In order to justify the use of the free-floating environment, we present results obtained in both environments, free-floating and table, and discuss the differences. Initially, we assess the results in Fig. 28. Qualitatively, the grasp outcomes of the two scenarios are very similar. The distribution of successful grasps around the rim of the object is similar, whereas there are differences around the handle of the pitcher, suggesting that the addition of gravity makes it easier to grasp the handle. Otherwise, the most obvious observation is the lack of grasps from the bottom of the object, due to collisions with the table. This result indicates that although there are some differences between the two scenarios due to the required filtering, the similarity between them is significant. The second set of results is presented in Figs. 29 and 30. These results show the prediction performance, similar to the core results presented in section 6.2, on the open objects for the two different scenarios. A number of conclusions can be derived from the results. First, the chance of randomly picking a successful grasp is significantly increased in the table scenario (see dotted black line). This increase is caused by the filtering process and the fact that the successful grasps are primarily at the top rim of the objects; hence, the vast majority of grasps filtered away are unsuccessful. Regarding the performance of the prediction, a certain pattern emerges. The parametrisation with N = 2, β performs very similarly in the two scenarios. The other parametrisations, however, show a slightly different outcome. They all have increased performance in the table scenario, which is explained by the increased chance of being successful by random choice. Other than that, the most significant change is the increased performance in the table scenario for the parametrisations where boundary information is used. This can in particular be seen for the case N = 1, β, which
shows similar performance as the N = 2, β case. The reason for this drastic improvement is that by filtering away the grasps at the bottom of the objects, a significant amount of the boundary features found are not considered in learning and prediction, making a single boundary an even more significant feature in terms of successful grasps. See Fig. 16d and notice the boundary semantic features at the bottom of the object. The results for the percentage of objects that have been grasped show a similar pattern for the two different experiments. The percentages of grasped objects for the table scenario seem to drop a bit faster compared to the free-floating scenario. A significant change is seen for N = 1, β, where in the table scenario a higher coverage can be noted. This is explained with the same argument as before, namely that the table filtering removes many of the misleading boundary features at the bottom of the objects. What the experiments also exposed was the strong context-specific bias that a table scenario induces into the system, which would blur the exploration of the perceptual spaces for grasp affordance learning and prediction performed in this paper. Based on these conclusions, we find it justified to use the free-floating environment for our investigations of the perceptual action space.

Figure 1 :
Figure 1: Overview of different aspects of the perceptual space that are investigated throughout this paper. In (a), it is shown how we can add complexity to the perceptual representation by combining multiple features into more elaborate structures. In (b), it is shown how we can increase or decrease the complexity on the perception side by changing the size of the features. In (c), it is shown how the level of abstraction of the feature representation can be raised by means of semantics (here, adding a boundary label and a boundary direction to a surface patch).

Figure 2 :
Figure 2: Illustration of how different perceptual spaces can be used to limit the amount of grasp options. (a) shows a single-feature grasp association which is not able to distinguish between the three grasping situations on the left, of which only the leftmost leads to a success. (b) shows a second-order feature grasp association that is rich enough to identify the left grasp situation as unsuccessful. (c) shows a two-feature grasp association for which also the boundary direction (red line) is taken into account. This enriched feature allows for distinguishing that only the rightmost situation leads to a success.

Figure 4 :
Figure 4: The primate's visual cortex: The figure shows the deep hierarchical organisation of the human visual system with the brain areas of the occipital cortex, the ventral and the dorsal pathway at the top right. For selected visual areas, the receptive field size of neurons as well as some of the features that are assumed to be processed in the specific areas are shown for the retina, area V1, V2, V4 and TE in boxes a-e. The receptive field sizes are represented as they occur in the upper left quarter of the visual field in the right part of boxes a-d. For area TE (box e), which has neurons with large receptive fields, the receptive fields are indicated in the whole visual field. Note that the receptive field sizes are usually bigger for neurons representing the periphery of the visual field (which is very clearly visible in areas V1 and V2). It is evident that the receptive field sizes also increase with the level of the hierarchy; note for example the smaller receptive fields of V1 neurons compared to V2 neurons. The abstraction of the features assumed to be processed at the specific levels (as indicated in the left part of boxes a-e) increases as well. For example, the rods and cones in the retina (box a) carry information similar to the RGB pixel information in a camera, while in area V1 edge information and a more advanced colour representation are computed. In V2, even more abstract concepts such as border ownership [5] are computed. This figure uses material from [6], which we also refer to for further details.

Figure 6 :
Figure 6: Visualisation of the two basic building blocks. (a) A 3D surfling, Π σ, where a principal component analysis is performed on the sub-features (black) to decide the orientation. (b) A boundary-corrected 3D surfling, Π σ,β, where the orientation is decided by the direction of a boundary. In (c), we see both boundary 3D surflings (blue with a red arrow) and standard 3D surflings.

Figure 7 :
Figure 7: Example of a feature relation of order two. Note how the angles α 2 and α 3 describe the normal of the second feature, Π σ 2, in terms of the coordinate system of the first feature, Π σ 1.

Figure 8 :
Figure 8: Illustration of the linkage between action and perception for the first-order case (left) and the second-order case (right), essentially being a linkage (the dotted line) between the frame of the perception descriptor and the frame of the action.

Figure 11 :
Figure 11: 2D illustration of the neighbourhood analysis around a particle, highlighted in green.

Figure 12 :
Figure 12: A 2D idealised example illustrating the basic principle of the voting scheme given a candidate grasp. (a) An idealised cross-section of a 2D container, (b) a two-finger gripper, (c) a feature representation with a candidate grasp. Figures (d), (e), (f) and (g) show feature relations that are used to vote for the candidate action; the probabilities shown below each relation are those found in the database. Given the example probabilities, the combined probability for the candidate grasp is shown in (h).
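The voting principle of Figure 12 can be sketched in a few lines. The lookup table, the relation keys and the product-combination rule below are all illustrative assumptions for this sketch, not the paper's exact formulation; the paper's database stores learned success probabilities per feature relation, which is mimicked here by a plain dictionary.

```python
# Sketch of the voting scheme: each feature relation observed around a
# candidate grasp looks up a success probability in a (hypothetical)
# ActionPerceptionDB and votes; the votes are combined into a single
# prediction. The product rule is an assumption for illustration.
from math import prod

# Hypothetical database: feature-relation key -> success probability
action_perception_db = {
    ("rim", "rim"): 0.9,
    ("rim", "wall"): 0.6,
    ("wall", "wall"): 0.2,
    ("wall", "bottom"): 0.1,
}

def predict(relations, db):
    """Combine the per-relation success probabilities for one candidate grasp."""
    votes = [db[r] for r in relations if r in db]
    if not votes:
        return 0.0        # no supporting evidence in the database
    return prod(votes)    # assumed combination rule (product of votes)

candidate = [("rim", "rim"), ("rim", "wall")]
print(round(predict(candidate, action_perception_db), 2))  # 0.54
```

A grasp supported only by low-probability relations (e.g., wall-bottom pairs at the base of a container) is thus predicted as unlikely to succeed, matching the intuition conveyed by panels (d)-(h).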
an overview of the different objects used in the experiments is given. The objects are split into three different categories, namely box-like objects, curved/cylindrical objects and open/container objects. The objects in the set are partly taken from the KIT object database [28] and partly from the online database archive3D [29]. The KIT objects are digitalised real objects, which potentially simplifies the transfer from a simulated environment to the real world. Furthermore, they add realism to the feature extraction as the objects are textured based on the real objects. However, due to the lack of open/container objects in the KIT set, we needed to extend the object set with objects from other sources, which are not digitalised real objects.

Figure 13 :
Figure 13: Visualisation of the three different categories of objects.(Top), box objects, (middle), round objects and (bottom) open objects.

Figure 14 :
Figure 14: Visualisation from RobWork showing a grasping action with the Schunk SDH-2 hand.

Figure 15 :
Figure 15: Visualisation of the four simulated RGB-D sensor views, illustrated with the four coloured frames, and the object of interest in the centre.The frames depict the position and the camera-view (along the negative z-axis, coloured blue).The views from the four cameras are shown in the small images.

Figure 16 :
Figure 16: Visualisation of extracted features at four different granularities with (right column) and without (left column) boundary semantic.

Figure 18 :
Figure 18: Visualisation of the utilised feature relations and the associated parameters.

Figure 19 :
Figure 19: Visualisation of the particle distribution for the open object set in terms of their support and probability for the learned ActionPerceptionDBs. Note that the areas with high support and high probability in the action-perception space increase when a more elaborate perceptual representation is used, e.g., compare (a) and (f). This change indicates that the latter representation has more prediction potential. The number of particles in the databases ranges from ∼250,000 to ∼400,000.

Figure 20 :
Figure 20: Visualisation of the particle distribution in terms of support and probability for a learned ActionPerceptionDB, where the particle frequency is shown without any modifications.

Figure 23 :
Figure 23: Prediction results for open objects of granularity 5, 15 and 30. See equations 17-22 for the used parametrisations, and see the text for further details.

Figure 24 :
Figure 24: The three visualisations show how the learned particles are distributed when the feature part of the particles is positioned in the centre. The three cases are Υ 5 1 (left), Υ 5, β 1 (middle) and Υ 5,β 1 (right). Red depicts a success likelihood of 0.0 and green a success likelihood of 1.0.

Figure 25 :
Figure 25: Visualisation of the grasp predictions for a pitcher object with feature relations of different order and with different semantics. The colour depicts the predicted likelihood of success: green means a success likelihood of 1.0 and red a success likelihood of 0.0.

Figure 26 :
Figure 26: Prediction results for the open object set with a feature size of 5 and different support filters. See equations 17-22 for the used parametrisations, and see the text for further details.

Figure 27 :
Figure 27: Automatically vs. manually set threshold for the three different object sets. See equations 17-22 for the used parametrisations, and see the text for further details.

Figure 28 :
Figure 28: Visualisation of the consequences of filtering due to collision in a table scenario, using the pitcher object as an example. The left column shows the grasp distribution (shown with small stick figures; the pink and red ones are failed grasps and the green ones are successful grasps) for the free-floating scenario, and the right column shows the grasp distribution after filtering for the table scenario. The top row shows the full grasp set and the bottom row shows a randomly chosen subset of 1000 grasps. The free-floating scenario contains 22,000 grasps; the table scenario, due to filtering, only 12,794.
Input: ActionPerceptionDB ρ DB , Features
Output: Candidate actions with prediction ρ A C,E
1 ρ A C = Create candidate actions through visual bias;
2 ρ P C = Compute feature relations;
3 ρ C = Combine feature relations with candidate actions as in Alg. 1;
4 for ρ C,k in ρ

Table 1 :
Prediction results when utilising the direct action proposition method. The used parametrisations are found in equations 17-22; see the text for further details.