Abstract
Extracting spatial objects and their key points from remote sensing images has attracted great attention from researchers worldwide who work on intelligent machine perception of the Earth’s surface. However, the key points of spatial objects (KPSOs) extracted by the conventional mask region-convolution neural network (Mask R-CNN) model are difficult to sort reasonably, which is a key obstacle to enhancing machine perception of spatial objects. Widely distributed artificial structures with stable morphological and spectral characteristics, such as sports fields, cross-river bridges, and urban intersections, are selected to study how to extract their key points with a multi-hot cross-entropy loss function (MHCELF). First, the location point in KPSOs is treated as a separate category to distinguish it from the morphological feature points. Then, the two categories of key points are arranged in order while the points within each category remain unordered, and the mapping between KPSOs and the prediction heat map is changed from one heat map per key point to one heat map per category. Therefore, the predicted heat map of each category can predict all the corresponding key points at once. Experimental results for part of Huai’an City show that the prediction accuracy of KPSOs extracted by the new method is 80.6%. It is reasonable to believe that this method will promote the development of intelligent machine perception of the Earth’s surface.
1 Introduction
How to recognize spatial objects automatically from remote sensing images and access abundant information quickly and accurately is one of the main research directions of remote sensing data processing. However, current research mainly focuses on how to extract spatial objects more accurately from remote sensing images. If the key feature points can be obtained while the spatial objects are extracted, it becomes possible to dig deeper into the morphological information of spatial objects and to enhance automatic perception of the Earth’s surface.
Feature point extraction is a traditional way for machines to perceive remote sensing image information; the main algorithms are Harris, SIFT, SURF, FAST, and ORB. Harris is an early algorithm for feature point extraction that detects corner points by examining the image through a small local window. The corner points it extracts are rotation invariant [1,2]. Its disadvantage is that it is sensitive to scale, i.e., it lacks scale invariance. The SIFT algorithm was first proposed by Lowe [3]. The key points found by SIFT are prominent points that are robust to changes in illumination, affine transformation, and noise, such as corner points, edge points, bright points in dark areas, and dark points in bright areas. Because of these advantages, SIFT has been widely used to select interest points for matching remote sensing images [4,5,6]. However, SIFT requires the image to have sufficient texture; otherwise, the constructed 128-dimensional descriptor vectors are not distinctive enough. Bay et al. proposed the SURF algorithm, a SIFT-like interest point detector [7] that retains the strong performance of the SIFT operator while reducing its high computational complexity and time consumption. There have been many studies on SURF in the field of image mosaicking and registration [8,9,10]. To improve the efficiency of feature point detection on remote sensing images, other algorithms such as FAST [11] and ORB [12,13,14] have been proposed. Despite the significant progress in this domain and the promising results presented recently, these approaches obtain feature points only by analyzing the local pixel distribution of images.
Unfortunately, a significant limitation is that the extracted feature points cannot be attributed to the object they belong to and thus cannot be used to intelligently perceive features of spatial objects, which is an obstacle to a deeper perception of the Earth’s surface.
Artificial intelligence provides new ideas for feature point extraction. Recently, the development of the mask region-convolution neural network (Mask R-CNN) model has made it possible for machines to extract spatial objects intelligently from remote sensing images [15,16,17,18]. On the basis of traditional instance segmentation, it adds a key point extraction branch and, for the first time, associates feature points with objects. The Mask R-CNN model was initially used to estimate human body posture [15]. Gradually, researchers applied it to the key point extraction of other objects. Allaberdiev et al. used an a priori symmetric constraint to refine the key points located by any backbone detection network. To deal with uncertainty in labeling clothing, a new loss was introduced to utilize all available data that contain “maybe” labels, and better results were achieved on different recognition data sets [19]. Zhang et al. utilized key point detection technology to construct the edge contour of targets, which requires each target edge to be sampled by a fixed number of boundary points and a heat map to be generated for each detection point [20]. Wong et al. proposed a manipulation planning method for object reorientation based on semantic segmentation key point detection, which is able to detect randomly placed objects and reorient them to a specified position and pose [21]. To address the problem that key point detection performance degrades when super resolution (SR) is applied to people with a larger initial segmentation area, Hardy introduced a novel Mask R-CNN approach that uses a segmentation area threshold to decide when to use SR during the key point detection step [22].
The traditional Mask R-CNN uses the one-hot method [15], which extracts feature points that have a clear order. Because this method places strict requirements on the order of the key points of the extracted objects, the disorder of the key points of spatial objects is an obstacle that the traditional method struggles to overcome, and this problem has not received much attention from scholars. In this article, we improve the feature point extraction branch of the traditional Mask R-CNN, abandon the traditional one-hot idea of generating features one by one, and propose a multi-hot method that generates a heat map of all points in the same category in one image without considering their order. The experimental results show that this method has high accuracy.
2 Definition of KPSOs
Key points, also known as points of interest, are feature points that describe the stability and differentiation of objects. Position and morphology are the most important features of spatial objects. Therefore, the KPSOs in this article refer to the feature points from which the position and morphology information of objects can be extracted and recognized. Nevertheless, the spatial objects in remote sensing images show strong spatial differentiation in their internal configuration and form. The same type of spatial object can exhibit different internal configurations or morphological features in different regions. Taking sports fields as an example, their shapes include ellipses and rectangles, and their internal structure is related to their purpose. There are prominent differences between the internal configuration of a football field and that of a basketball field. Similarly, urban intersections differ significantly in morphology, with “+,” “Y,” and “T” shapes, among others. Thus, it is necessary to comprehensively analyze the features of spatial objects in different regions and determine their commonalities as the basis for the definition of key points. KPSOs can be divided into the following two categories.
Location point: The key point used to locate a spatial object is called the location point of the spatial object. Generally, a position near the center of a spatial object that is easy to locate can be selected as the location point.
Morphological feature points: Morphological feature points are key points that can be used to describe the morphological features of a spatial object. Since Mask R-CNN can predict the mask area of each object, we only need to define the most critical feature points and then combine them with the predicted mask area of each object to obtain its morphological characteristics. For example, the number and directions of the branches of an urban intersection can be defined by putting a key point at each branch, and the size and orientation of a sports field can be determined by placing two key points at the two ends of its main axis.
3 Extraction method of KPSOs
Mask R-CNN is an intelligent model for instance segmentation developed from Faster R-CNN [23]. Mask R-CNN has an extensible open network structure. The model can predict the key points of each object by adding a key point prediction branch.
3.1 Limitations of traditional Mask R-CNN for key point extraction
Conventional key point prediction adopts a one-hot multi-classification method to predict the positions of key points. In the Mask R-CNN model, the prediction heatmap with 56 × 56 elements is flattened into a one-dimensional vector for each key point. Thus, each element in the vector represents a category, and a multi-classification method is used to predict the probability that a key point belongs to a certain category. The category with the maximum probability is then mapped back to the two-dimensional 56 × 56 grid to calculate the position of the key point. Usually, the cross-entropy loss function is adopted to calculate the prediction loss of key points as follows:
$$L = -\frac{1}{K}\sum_{i=1}^{K}\sum_{j=1}^{M} p_{ij}\log q_{ij} \qquad (1)$$
where $L$ is the prediction loss of key points, $K$ is the number of key points participating in the prediction, $M$ is the number of categories (equal to $56^2$), $p_{ij}$ is an indicator, and $q_{ij}$ is the predicted probability that key point $i$ belongs to category $j$. The value of $p_{ij}$ is 1 when key point $i$ belongs to category $j$ and 0 otherwise. $q_{ij}$ is calculated by the Softmax function [15]:
$$q_{ij} = \frac{\exp(v_{ij})}{\sum_{k=1}^{M}\exp(v_{ik})} \qquad (2)$$
where $v_{ij}$ is the value of grid cell $j$ of the prediction heatmap for key point $i$.
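The one-hot scheme described above can be illustrated with a minimal pure-Python sketch. It assumes that each heatmap is already flattened into a list of raw values and that cells are laid out row by row; the function names and the row-major layout are illustrative assumptions, not the reference Mask R-CNN implementation.

```python
import math

def softmax(values):
    """Numerically stable Softmax over one flattened heatmap."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_keypoint_loss(heatmaps, target_indices):
    """Mean cross-entropy over K key points.

    heatmaps: K flat lists holding the M raw heatmap values of each key point.
    target_indices: K integers, the cell j for which the indicator equals 1.
    """
    losses = []
    for values, target in zip(heatmaps, target_indices):
        q = softmax(values)
        losses.append(-math.log(q[target]))
    return sum(losses) / len(losses)

def decode_one_hot(values, size):
    """Map the arg-max cell back to (column, row), assuming row-major order."""
    best = max(range(len(values)), key=values.__getitem__)
    return best % size, best // size
```

On a toy 2 × 2 heatmap whose four raw values are all equal, every cell receives probability 1/4, so the loss for a single key point is log 4 regardless of the target cell.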
The conventional Mask R-CNN model requires the key points of the label to be marked in order, and the feature map of the key points carries out position prediction and loss calculation in strict accordance with the order defined by the key points. For the key points of a human body, it is easy to determine the order, and each key point can be predicted accurately through model training. However, KPSOs on remote sensing images are difficult to sort reasonably. Taking sports fields as an example, the two points at the ends of the main axis are usually defined as key points. However, these two points are rotationally symmetric about the center of the field; that is, the shape of the sports field gives no basis for deciding which point comes first.
3.2 Extraction algorithm of KPSOs based on the multi-hot cross-entropy loss function (MHCELF)
First, all the key points are classified and organized so that each point belongs to a key point category. A key point category consists of one or more points, which have no defined sequence. Then, the key point categories are numbered sequentially; that is, each key point category has a specific sequence number. After that, the prediction model starts training. It should be noted that the key point categories are read in the same order during the prediction process. As mentioned above, the key points of each spatial object include one location point and several morphological feature points, which means that each spatial object has two key point categories: the location category and the morphological category.
3.2.1 Input label of KPSOs
For the common key point extraction branch network, the input label of key points is a three-dimensional tensor [N, K, 3], where N is the number of samples, and K is the number of input key points of each sample. The x-coordinate, y-coordinate, and status of each key point are stored in order. The status takes the values 0, 1, and 2, representing “unmarked,” “marked but invisible,” and “marked and visible,” respectively.
In contrast, the input label of key points in our model is designed as a four-dimensional tensor [N, C0, P, 3], since the key points are organized by categories. Here, C0 is the number of key point categories, and P is the maximum number of morphological feature points among all spatial objects. C0 is set to 2; that is, the key point categories are the location category and the morphology category.
Each KPSO is written in the order of the location point first and then the morphological feature points. Because the categories contain different numbers of key points, the input label has some redundant space. For example, the location category has only one key point, so its remaining P − 1 entries are set to 0. This redundant layout makes it easy for programmers to check the number of key point categories and the valid key points of each category directly from the input label.
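The padded label layout described above can be sketched for a single object as follows. This is a minimal illustration using nested lists; the helper name `build_label` and the choice of `(0.0, 0.0, 0)` for unused slots are assumptions for the sketch.

```python
def build_label(location_point, feature_points, max_points):
    """Return a [C0=2][P][3] nested list of (x, y, status) entries.

    Status follows the text: 0 means unmarked, 2 means marked and visible.
    Unused slots stay (0.0, 0.0, 0); they form the redundant space.
    """
    if len(feature_points) > max_points:
        raise ValueError("more morphological feature points than P")
    empty = (0.0, 0.0, 0)
    # Category 0: the single location point followed by P - 1 empty slots.
    location_row = [(location_point[0], location_point[1], 2)]
    location_row += [empty] * (max_points - 1)
    # Category 1: the morphological feature points, in any order, then padding.
    feature_row = [(x, y, 2) for x, y in feature_points]
    feature_row += [empty] * (max_points - len(feature_points))
    return [location_row, feature_row]
```

For a sports field with P = 4, the location row holds one valid entry and three padded entries, and the morphology row holds two valid entries and two padded entries.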
3.2.2 Prediction branch based on MHCELF
In our model, the common mapping relationship, i.e., from key point to prediction heatmap, is changed to a mapping from key point category to prediction heatmap. In this way, the problem of key point disorder is solved. The position index of a KPSO in the prediction box can be obtained by the following formula:
where $I$ is the position index, and $I_x$ and $I_y$ are the indices of the heatmap grid cell in the x and y directions, respectively, which are expressed as follows:
where $(x, y)$ is the coordinate of the key point, and $(x_1, y_1, x_2, y_2)$ is the coordinate of the prediction box.
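As a rough illustration of the coordinate-to-grid mapping described above, the following Python sketch converts a key point inside a prediction box into a flat heatmap index. The floor-based scaling, the clamping at the box border, and the row-major layout (index = I_y × 56 + I_x) are assumptions made for this sketch, and the function name `grid_index` is hypothetical.

```python
HEATMAP_SIZE = 56

def grid_index(x, y, box, size=HEATMAP_SIZE):
    """Flat index of the heatmap cell containing the point (x, y).

    box is (x1, y1, x2, y2), the prediction box in image coordinates.
    """
    x1, y1, x2, y2 = box
    # Scale the offset inside the box to the heatmap grid and truncate.
    ix = int((x - x1) * size / (x2 - x1))
    iy = int((y - y1) * size / (y2 - y1))
    # Points on the far border would fall into cell `size`; clamp them back.
    ix = min(max(ix, 0), size - 1)
    iy = min(max(iy, 0), size - 1)
    return iy * size + ix  # row-major layout (assumed)
```

With a 112 × 112 box anchored at the origin, each heatmap cell covers 2 × 2 pixels, so the point (56, 0) falls into column 28 of the first row.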
According to equation (3), the input label is converted to a tensor with dimension [N, C0, P]. Correspondingly, the size of the output heatmap of the modified key point prediction branch is [N, C0, 56, 56].
The improved key point branch network uses the key point category as the basic unit. For each key point category of a spatial object, the number of valid key points, i.e., those with a status greater than 0, is determined. If the number is 0, the key point category is considered invalid, and its status is set to 0; otherwise, it is set to 2. On this basis, the labels of the key point categories with a status greater than 0 and their corresponding prediction heatmaps are selected. If the number of valid key point categories participating in training after screening is C, the size of the key point label after screening is [N, C, P], and the size of the prediction heatmap is [N, C, 56, 56], as shown in Figure 1.
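The screening step above can be sketched for one object as follows, assuming the label is a nested list of `(x, y, status)` entries per category; the function name `screen_categories` is illustrative.

```python
def screen_categories(label):
    """Return the indices of key point categories with at least one valid point.

    label: [C0][P] nested list of (x, y, status) entries for one object.
    A category is kept when any of its points has status > 0.
    """
    valid = []
    for category, points in enumerate(label):
        visible = sum(1 for _x, _y, status in points if status > 0)
        if visible > 0:
            valid.append(category)
    return valid
```

Only the categories returned here, together with their prediction heatmaps, would take part in the loss calculation.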
To compute the multi-hot loss, the key point label is converted from size [N, C, P] to size [N, C, 56, 56]. Specifically, each key point category $i$ defines a 56 × 56 input heatmap:
where $p_{ij}$ is the indicator of key point category $i$ at grid cell $j$ in the input heatmap. The initial value of $p_{ij}$ is 0; each key point in the category is then traversed to obtain its position index, and the value of the corresponding grid cell $p_{ij}$ is set to 1. In contrast to equation (1), the prediction loss of the key points in our model is expressed as follows:
where $D$ is the total number of key points that actually participate in the prediction, which can be calculated by the following formula:
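One plausible reading of the multi-hot loss is sketched below: the target heatmap of a category marks every cell that contains a key point, each cell is scored with a sigmoid, and the summed binary cross-entropy is divided by D, the number of marked cells. The per-cell binary cross-entropy form and the normalization by D are assumptions of this sketch, not a confirmed statement of equations (6) and (7); all function names are illustrative.

```python
import math

def multi_hot_target(indices, cells):
    """Build the 0/1 target for one category from its key point indices."""
    target = [0] * cells
    for idx in indices:
        target[idx] = 1
    return target

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def multi_hot_loss(heatmaps, targets):
    """Binary cross-entropy summed over all cells, divided by D (assumed form).

    heatmaps: one flat list of raw values per valid category.
    targets: matching multi-hot 0/1 lists.
    """
    total = 0.0
    count = 0  # D: number of key points taking part in the prediction
    for values, target in zip(heatmaps, targets):
        for v, p in zip(values, target):
            g = min(max(sigmoid(v), 1e-12), 1.0 - 1e-12)  # avoid log(0)
            total -= p * math.log(g) + (1 - p) * math.log(1.0 - g)
        count += sum(target)
    return total / max(count, 1)
```

On a 2 × 2 toy grid with all raw values at 0 and two marked cells, every cell contributes log 2 and D = 2, so the loss is 2 log 2. Because the target does not depend on the order of the points within a category, swapping two key points leaves the loss unchanged.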
3.2.3 Key point prediction based on MHCELF
For each element $v_{ij}$ of the prediction heatmap, the sigmoid function is used to calculate the predicted score:
$$g_{ij} = \frac{1}{1 + \exp(-v_{ij})} \qquad (8)$$
where $g_{ij}$ is the predicted score of grid cell $j$ in the prediction heatmap of category $i$. The way multiple key points are predicted from one heatmap based on the predicted scores is crucial to key point prediction based on MHCELF. In common key point prediction based on the one-hot cross-entropy loss, only the maximum value of the prediction heatmap needs to be found, and the position of the maximum value is converted into the key point position. In our method, the predicted scores of the 56 × 56 elements of the heatmap are sorted in descending order, and the candidate positions are traversed from high to low. The key point sequence needs to conform to the following formula:
where $O_i$ is the set of key points predicted in key point category $i$, and each element in the set is a grid cell $o_{ij}$ of the prediction heatmap. $P_i$ is the maximum number of key points of category $i$, $\varepsilon$ is the minimum distance (in pixels) allowed between key points, and $\delta_i$ is the threshold of the predicted score for category $i$.
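The traversal described above can be sketched as a greedy selection: candidates are visited in descending score order, and a cell is accepted when its score reaches the threshold and it lies at least ε away from every cell already accepted, until the category's maximum number of points is reached. Measuring the distance on the heatmap grid and the row-major cell layout are assumptions of this sketch; the function name is illustrative.

```python
import math

def decode_category(scores, size, max_points, min_distance, threshold):
    """Pick up to max_points cells from one category heatmap.

    scores: flat list of size * size predicted scores.
    Returns the accepted cells as (column, row) tuples, best score first.
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    picked = []
    for j in order:
        # Scores are descending, so the first one below the threshold ends the scan.
        if len(picked) == max_points or scores[j] < threshold:
            break
        col, row = j % size, j // size
        if all(math.hypot(col - c, row - r) >= min_distance for c, r in picked):
            picked.append((col, row))
    return picked
```

The distance test is what prevents several high-scoring cells clustered around the same key point from being reported as separate key points.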
4 Experiments and results
4.1 Selection of spatial objects
The Mask R-CNN model readily extracts objects with relatively stable morphological and spectral features. To improve the ability and accuracy of machine perception of the Earth’s surface, we focus on spatial objects that have the following characteristics:
Identifiable. A spatial object can be automatically and accurately extracted using a computer with existing technology. Therefore, it is necessary to select objects with relatively stable spectral and morphological characteristics.
Relatively stable. The location and form of the selected spatial objects remain stable, which reduces the difficulty of defining and extracting KPSOs of spatial objects.
Ubiquitous. The selected spatial objects exist widely on the Earth’s surface. Under current technological conditions, it is important to improve the ability of machine intelligence to perceive the Earth’s surface by extracting a limited number of types of spatial objects and their key points.
In high-resolution remote sensing images, the morphological and spectral characteristics of artificial structures, such as sports fields, cross-river bridges, and urban intersections, are relatively stable, as shown in Figure 2. Their location and form are also relatively stable. Therefore, these widely distributed artificial structures are selected as typical cases to study how to extract KPSOs from high-resolution remote sensing images. At the same time, object screening rules are established so that only spatial objects with typical morphological characteristics are labeled, which reduces the difficulty of object prediction. For instance, a sports field must have an elliptical boundary, a cross-river bridge must carry a wide road for motorized traffic, and an intersection must have at least two crosswalks.
4.1.1 Definition of location and morphological feature points
The blue rectangles in Figure 2 mark the location points of three typical spatial objects. The point that is easily located is used to indicate the location of each artificial structure. Specifically, the anchor point of a sports field, the midpoint of the bridge along the centerline, and the intersection of the centerlines of roads are defined as their location points, as shown in Figure 2a–c, respectively.
The blue dots in Figure 2 mark the morphological feature points of these spatial objects. The mask area of a sports field is the area within the edge of the runway, and its morphological feature points are located at the two points on the inner edges of the runway where the runway crosses the extended main axis, as shown in Figure 2a. The mask area of a cross-river bridge is the part of the road that crosses the river, and its morphological feature points are the two points where the edges of the bridge intersect with the centerline of the road, as shown in Figure 2b.
Urban intersections are relatively more complex, and most urban intersections have zebra crossings to guide pedestrians, which is a very good reference. The urban intersection area is outlined as the mask area. Therefore, the morphological feature points are defined as the internal points where the centerline of each road intersects with the boundaries of the intersection. If there is a zebra crossing on a branch road, the intersection point formed by the centerline of the zebra crossing and the road centerline is directly taken as the feature point. Otherwise, the feature points are drawn on the inside of the road centerline, as shown in Figure 2c.
4.1.2 Classification system of KPSOs
Table 1 shows the classification system of the key points of sports fields, cross-river bridges, and urban intersections. In this system, all types of spatial objects have only one location point and multiple morphological feature points. In our model, the element P in the tensor [N, C0, P, 3] is set to 4 to support the prediction of road intersections with up to four branches.
| Object type | Subcategory of objects | Key points: LP | Key points: MFP |
|---|---|---|---|
| SF | - | 1 | 2 |
| CB | - | 1 | 2 |
| UI | Three-way intersection | 1 | 3 |
| UI | Crossroad | 1 | 4 |
| UI | Complex intersection | 1 | >4 |
SF: sports field; CB: cross-river bridge; UI: urban intersection; LP: location point; MFP: morphological feature points.
4.2 Model training and testing
First, level-18 Google and Tianditu online remote sensing imagery is used as the experimental data source, and images of appropriate size are cropped from different regions as the training samples. After that, all the sports fields, cross-river bridges, urban intersections, and their key points are marked on each sample image. A total of 1,300 samples (1,574 objects) are collected as the training data set, and 260 samples (294 objects) are collected as the test data set.
The stochastic gradient descent method is used to train the Mask R-CNN model. The learning rate is set to 0.001, and the weight decay is set to 0.0005. After 600 epochs of training, the loss value dropped to a relatively low level and stabilized.
To increase the prediction precision, a higher category probability threshold needs to be set. Through repeated experiments, the category probability thresholds of sports fields, cross-river bridges, and urban intersections are set to 0.98, 0.96, and 0.96, respectively, and the probability threshold of mask pixels is set to 0.5.
The object extraction results are shown in Table 2. For the trained model, the recall rate is 82.7%, the category precision is 93.5%, and the mask precision is 82.4%. In general, the trained Mask R-CNN model can extract spatial objects that have relatively stable morphological and spectral features from remote sensing images with high precision.
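| Object type | NTRS | NTES | CPT | CP: recall rate (%) | CP: precision (%) | MP (%) |
|---|---|---|---|---|---|---|
| SF | 495 | 93 | 0.98 | 88.2 | 97.6 | 84.2 |
| CB | 477 | 88 | 0.96 | 78.4 | 85.2 | 82.6 |
| UI | 602 | 113 | 0.96 | 81.4 | 96.8 | 80.4 |
| Total | 1,574 | 294 | - | 82.7 | 93.5 | 82.4 |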
NTRS: number of training samples; NTES: number of test samples; CPT: category probability threshold; CP: category precision; and MP: mask precision.
The key point extraction network branch shown in Figure 1 is added to the object segmentation model for training key point extraction. After 350 epochs of training, the loss value drops to a stable level. Figure 3 displays three typical heatmaps for predicting morphological feature points. For clarity, the spatial object is cropped from the image according to the prediction box, and the cropped image and its corresponding heatmap are then compared in one graph. Since the size of the heatmap is 56 × 56, the local image of the spatial object is also scaled to 56 × 56. In the prediction heatmap, the predicted scores near the key points are higher than those at other locations; thus, the spatial aggregation effect is obvious, as shown in Figure 3. Given a reasonable parameter ε, i.e., the minimum tolerance distance among the key points, all morphological feature points of each spatial object can be extracted according to equation (9).
By superimposing the positions of the marked key points on the sorted prediction heatmap, the action radius of each key point on the heatmap can be obtained and used to calculate the optimal value of parameter ε. Specifically, the position of the maximum value of the heatmap is taken as the predicted position of the first key point, and the subsequent candidate positions are then visited in descending order of predicted score. As long as a candidate position is closest to the first key point, its pixel distance to the predicted position of the first key point is calculated; this continues until a candidate position points to the next key point. The maximum of these distances is taken as the action radius of the first key point. The action radii of all key points except the last one are obtained in the same way.
A test data set of 294 samples is used to test the prediction accuracy of KPSOs of our improved model. To evaluate the prediction ability of the improved model more accurately, the maximum action radius and minimum predicted score of the morphological feature points of the sports fields, cross-river bridges, and urban intersections are calculated. As listed in Table 3, the maximum action radius of the predicted morphological feature points in the test data set is 7.16 pixels, and the minimum predicted scores for the location point and morphological feature points are 0.52 and 0.05, respectively. Accordingly, the value of the parameter ε is set to 10, and the threshold scores of the location point and feature points are set to 0.5 and 0.05, respectively.
| Object type | MPSLP | MARMFP (pixels) | MPSMFP |
|---|---|---|---|
| SF | 0.98 | 4.98 | 0.18 |
| CB | 0.98 | 4.52 | 0.13 |
| UI | 0.52 | 7.16 | 0.05 |
| Total | 0.52 | 7.16 | 0.05 |
MPSLP: minimum predicted score of location point; MARMFP: maximum action radius of morphological feature points; and MPSMFP: minimum predicted score of morphological feature points.
Table 4 lists the prediction accuracy rate and error of the KPSOs. For each correctly extracted spatial object, the key points are considered correctly predicted if all predicted key points lie within the distance tolerance ε of the corresponding marked key points. The proportion of such objects among the correctly extracted objects (NCS in Table 4) is taken as the prediction accuracy of key points. Here, the prediction error refers to the pixel distance between the predicted key point and the marked key point on the original image. It can be seen from Table 4 that the prediction accuracy of the model based on MHCELF is 81.9%, and the average prediction errors of the location point and morphological feature points are 7.5 and 7.1 pixels, respectively.
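| Object type | NTES | NCS | NCSAKP | Accuracy rate (%) | MPE: LP (pixels) | MPE: MFP (pixels) |
|---|---|---|---|---|---|---|
| SF | 93 | 82 | 72 | 87.8 | 6.9 | 6.6 |
| CB | 88 | 69 | 58 | 84.1 | 6.8 | 6.9 |
| UI | 113 | 92 | 69 | 75.0 | 8.7 | 7.8 |
| Total | 294 | 243 | 199 | 81.9 | 7.5 | 7.1 |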
NCS: number of correct samples; NCSAKP: number of correct samples with all key points predicted; and MPE: mean prediction error.
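The accuracy criterion behind Table 4 can be sketched as follows. Because the morphological feature points are unordered, the sketch pairs each marked point with the nearest unused predicted point before applying the tolerance ε; this greedy pairing and the requirement that both sets have the same size are assumptions of the sketch, and the function names are illustrative.

```python
import math

def all_keypoints_correct(predicted, marked, tolerance):
    """True when every marked point has its own predicted point within tolerance."""
    if len(predicted) != len(marked):
        return False
    remaining = list(predicted)
    for mx, my in marked:
        # Greedy nearest-neighbour pairing (assumed); order of points is irrelevant.
        best = min(remaining, key=lambda p: math.hypot(p[0] - mx, p[1] - my))
        if math.hypot(best[0] - mx, best[1] - my) > tolerance:
            return False
        remaining.remove(best)
    return True

def accuracy_rate(correct_objects, extracted_objects):
    """Share of extracted objects whose key points are all correct, in percent."""
    return 100.0 * correct_objects / extracted_objects
```

For the sports field row of Table 4, 72 of the 82 correctly extracted objects meet the criterion, which gives the listed accuracy rate of 87.8%.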
4.3 Reliability evaluation of the improved model
Taking level-18 Tianditu online imagery as the data source and an area of Huai’an, Jiangsu Province (approximately 36 km^2) as the experimental area, the objects on the image, 160 in total, were marked by manual interpretation in advance. Then, the multi-hot method (algorithm A) is compared with the traditional one-hot method (algorithm B) to evaluate the reliability of our algorithm. Since the order of the feature points of spatial objects cannot be defined reasonably, a simple sorting rule for algorithm B is established as follows: First, the position index of each key point is calculated according to equation (3), and the key points are sorted by their position indices. Then, the sorted key points are fed into the traditional model for training, and the prediction is performed after the loss value stabilizes.
To analyze the effect of different key point branches on the object extraction accuracy, the numbers of predicted objects and correctly predicted objects are counted under different category probability thresholds (CPT) for algorithms A and B. The statistical results are shown in Table 5. The lower the CPT, the higher the recall rate and the lower the precision. Compared with algorithm B, algorithm A has a higher recall rate at the same CPT value and higher precision at a similar recall rate. This indicates that how the key point disorder of spatial objects is handled has a great impact on object extraction accuracy. The key points of spatial objects cannot be reasonably ordered, so algorithm B may backpropagate errors inconsistently, which reduces the accuracy of object extraction.
| IVCPT | A: NPO | A: NCPO | A: recall rate (%) | A: precision (%) | B: NPO | B: NCPO | B: recall rate (%) | B: precision (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | 144 | 129 | 80.6 | 89.6 | 124 | 116 | 72.5 | 93.5 |
| −0.02 | 160 | 133 | 83.1 | 83.1 | 144 | 121 | 75.6 | 84.0 |
| −0.04 | 175 | 134 | 83.8 | 76.6 | 161 | 125 | 78.1 | 77.6 |
| −0.06 | 198 | 139 | 86.9 | 70.2 | 175 | 128 | 80.0 | 73.1 |
| −0.08 | 216 | 142 | 88.8 | 65.7 | 188 | 131 | 81.9 | 69.7 |
IVCPT: the incremental value of CPT with reference training values of 0.98, 0.96, and 0.96 for sports fields, cross-river bridges, and urban intersections, respectively; NPO: number of predicted objects; NCPO: number of correctly predicted objects.
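The recall rates and precisions in Table 5 follow directly from the object counts: recall divides the correctly predicted objects by the 160 manually marked objects, and precision divides them by the number of predicted objects. A short sketch with an illustrative function name:

```python
def recall_precision(predicted, correct, labeled):
    """Return (recall, precision) in percent from object counts.

    predicted: NPO, correct: NCPO, labeled: number of manually marked objects.
    """
    recall = 100.0 * correct / labeled
    precision = 100.0 * correct / predicted
    return recall, precision
```

For algorithm A at IVCPT = 0, `recall_precision(144, 129, 160)` gives approximately 80.6% recall and 89.6% precision; for algorithm B at IVCPT = −0.06, `recall_precision(175, 128, 160)` gives 80.0% and approximately 73.1%, which is the pair of operating points with similar recall.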
Meanwhile, the accuracy of object prediction affects the effectiveness of key point prediction. Figure 4 depicts several typical objects that our method missed or mispredicted with an IVCPT value of 0. The sports field in Figure 4a has an unusual appearance because of the bridge above it; it is not recognized by the Mask R-CNN model, and thus its key points cannot be predicted. The roof in Figure 4b is incorrectly identified as a cross-river bridge because of the shadows on both sides, resulting in invalid key points. As another example, the key points shown in Figure 4c are not predicted correctly because the object is extracted incompletely.
To evaluate the effectiveness of the two algorithms for key point prediction objectively, the IVCPT values of algorithms A and B in Table 5 are set to 0 and −0.06, respectively, so that the two algorithms have similar recall rates. For the spatial objects correctly predicted by each algorithm, the number of objects with a correct location point within the distance ε and the number of objects with all morphological feature points predicted correctly are counted.
The statistical results in Table 6 show that both algorithms (A and B) achieve about 80% prediction accuracy for location points, because each object has only one location point, and there is no key point sorting problem. Our algorithm has a clear advantage in the prediction of morphological feature points, especially for urban intersections with multiple feature points. Overall, the accuracy of morphological feature point extraction with our method is 80.6%, while that of the traditional one-hot method is only 18.8%. For clarity, two local regions in the experimental area are selected to compare the prediction results of algorithm A and algorithm B, as shown in Figure 5. The extraction results of our method are clearly better than those of the traditional method, which lays a foundation for intelligent machine perception of spatial objects on the Earth’s surface.
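| Object type | NCPO (A) | NCPO (B) | NOCLP (A) | NOCLP (B) | NOAMFP (A) | NOAMFP (B) | PALP (%) (A) | PALP (%) (B) | PAMFP (%) (A) | PAMFP (%) (B) |
|---|---|---|---|---|---|---|---|---|---|---|
| SF | 27 | 29 | 23 | 22 | 25 | 14 | 85.2 | 75.9 | 92.6 | 48.3 |
| CB | 14 | 12 | 12 | 12 | 13 | 8 | 85.7 | 100.0 | 92.9 | 66.7 |
| UI | 88 | 87 | 69 | 68 | 66 | 2 | 78.4 | 78.2 | 75.0 | 2.3 |
| Total | 129 | 128 | 104 | 102 | 104 | 24 | 80.6 | 79.7 | 80.6 | 18.8 |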
NOCLP: number of objects with a correct location point; NOAMFP: number of objects with all morphological feature points predicted correctly; PALP: prediction accuracy of the location points; PAMFP: prediction accuracy of the morphological feature points.
5 Discussion
This article presents a key point extraction method for spatial objects based on MHCELF. The comparison with the traditional one-hot method shows that the treatment of the disorder among the key points of a spatial object has a great impact on model accuracy. The traditional one-hot method may not backpropagate the errors of different key points consistently; compared with our method, this results in a lower recall rate at the same CPT value and a lower precision at a similar recall rate in the prediction of spatial objects. In the key point extraction experiments, the recall rates of our method and the traditional one-hot method are both about 80% when the IVCPT values are set to 0 and −0.06, respectively. Under these settings, the object extraction precision of our method is 89.6% and its prediction accuracy of morphological feature points is 80.6%, whereas the corresponding values for the traditional one-hot method are 73.1% and 18.8%. That is, our method extracts KPSOs effectively and with high accuracy. Note that three identifiable and relatively stable types of artificial targets are chosen here to study key point extraction, mainly because they and their key points are easily recognized by machines under current technological conditions. Since these targets are widely distributed on the Earth’s surface, the ability to sense the Earth’s surface automatically can be improved rapidly by applying existing techniques to a limited number of spatial object types. When necessary, the method can be extended to other types of spatial objects.
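The effect of key point disorder on the loss can be illustrated with a toy example. The Python sketch below is not the MHCELF definition used in this article; it assumes a generic softmax cross-entropy over a flattened heat map, with the multi-hot variant averaging the log-probabilities of all positive cells. The function names and the toy scores are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of heat-map scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def one_hot_loss(channel_logits, ordered_targets):
    """One heat map per key point: channel k must fire at ordered_targets[k]."""
    return sum(-log_softmax(logits)[cell]
               for logits, cell in zip(channel_logits, ordered_targets))


def multi_hot_loss(category_logits, target_cells):
    """One heat map per category: every cell in target_cells is a positive.

    Assumed form: mean negative log-probability over the positive cells.
    """
    log_probs = log_softmax(category_logits)
    return -sum(log_probs[cell] for cell in target_cells) / len(target_cells)


# Toy 1 x 4 heat map; two interchangeable feature points sit at cells 0 and 3.
channels = [[2.0, 0.0, 0.0, 0.1], [0.1, 0.0, 0.0, 2.0]]
category = [2.0, 0.0, 0.0, 2.0]

# Swapping the annotation order changes the one-hot loss ...
print(one_hot_loss(channels, [0, 3]), one_hot_loss(channels, [3, 0]))
# ... but leaves the multi-hot loss unchanged.
print(multi_hot_loss(category, [0, 3]), multi_hot_loss(category, [3, 0]))
```

In this sketch, the one-hot loss penalizes a prediction that is correct up to the arbitrary ordering of the two points, whereas the multi-hot loss depends only on the set of positive cells.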
Meanwhile, the accuracy of object prediction affects the effectiveness of key point prediction. As shown in Table 5, 31 of the 160 objects are still not recognized correctly, so their key points cannot be extracted. In addition, our method has difficulty extracting all the morphological feature points of spatial objects accurately. For instance, key point extraction for many urban intersections falls short of our expectations because of shadow occlusion or image blur. Mask R-CNN employs convolutional neural networks to extract features, which can ignore multiscale features and environmental characteristics of spatial objects and thus reduce the extraction accuracy of spatial objects [24].
Another limitation of this method is the minimum allowable distance (in pixels) between key points. To distinguish different key points of the same spatial object correctly, the distance between them must be greater than the minimum allowable distance; otherwise, they are treated as the same key point.
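The role of the minimum allowable distance can be sketched as a greedy decoding of one category heat map. The Python example below is a minimal illustration under stated assumptions, not the implementation used in this article: the function name `decode_keypoints`, the Euclidean distance, and the `score_threshold` parameter are hypothetical choices.

```python
import math


def decode_keypoints(heatmap, min_distance, score_threshold):
    """Greedy peak picking on one category heat map.

    Cells are visited in descending score order; a cell is kept only if it is
    farther than min_distance (in pixels) from every key point kept so far.
    """
    cells = [(score, row, col)
             for row, line in enumerate(heatmap)
             for col, score in enumerate(line)
             if score >= score_threshold]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    kept = []
    for _score, row, col in cells:
        if all(math.hypot(row - r, col - c) > min_distance for r, c in kept):
            kept.append((row, col))
    return kept


# Toy heat map with one peak near (0, 0) and two adjacent peaks at (2, 3) and (2, 4).
heatmap = [
    [0.9, 0.6, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(decode_keypoints(heatmap, min_distance=2.0, score_threshold=0.3))
# → [(0, 0), (2, 4)]
```

With a minimum distance of 2 pixels, the adjacent peaks at (2, 3) and (2, 4) collapse into a single key point, which is the limitation described above: two true key points closer than the minimum allowable distance cannot both be recovered.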
6 Conclusion
To extract spatial targets from remote sensing images efficiently and accurately, a KPSO extraction method based on MHCELF is proposed. It solves the problem that the disorder of KPSOs makes accurate prediction difficult for traditional extraction models. The method classifies KPSOs and builds a classification system with disorder within categories and order between categories. For each key point category, a single heat map can predict multiple key points at the same time, which removes the need to sort KPSOs. On this basis, the traditional one-hot classification is converted to a multi-hot classification. During prediction, the heat map scores are sorted in descending order, and all the key points of a category are obtained one by one. Experiments demonstrate that the predicted scores near the key points are higher than those at other locations on the heat map. Note that a reasonable minimum allowable grid distance (in pixels) between key points must be defined so that different key points can be distinguished well.
A key statistical result is that, with a certain urban area taken as the verification area, all the key points are predicted correctly for 80.6% of the spatial objects extracted by this method. In future work, we will further improve the model to raise the recognition rate of spatial objects and their key points and to reduce the limitation that the minimum allowable distance imposes on the accuracy of key point extraction.
Acknowledgments
This work was supported by the National Key R&D Plan (2018YFB0505300) and the Science and Technology Projects of Sichuan Province (2020YFG0146).

Funding information: The authors have no relevant financial or non-financial interests to disclose.

Conflict of interest: Authors state no conflict of interest.

Data availability statement: The data that support the findings of this study are openly available at https://pan.baidu.com/s/1puvao_EXDC_xJEl8OUJNIQ (password: tn7y). After extracting the data folder, a remote sensing image of the experimental area named “urban.tif” can be obtained. The spatial objects and their key points extracted by our method and the one-hot method are saved as “urban.shp”, “urban_kps.shp”, “urban_old.shp”, and “urban_old_kps.shp”. To evaluate the accuracy of the model, we vectorized all the objects and their key points on the image according to the same rules as the training data set and saved them as “urban_m.shp” and “urban_kps_m.shp”, respectively. Each layer of spatial objects has the following fields: “state”, “CPNLP”, “TNMFP”, “PNMFP”, and “CPNMFP”. “state” indicates whether each detected object is valid. “CPNLP” indicates whether the location point of each spatial object is within the distance ε of the manually interpreted position. “TNMFP”, “PNMFP”, and “CPNMFP” represent the total number of morphological feature points, the predicted number of morphological feature points, and the number of correctly predicted morphological feature points, respectively. The project document “experiment.MXD” can be opened directly with ArcGIS version 10.3 or above, and it stores all the above layers and legends.
References
[1] Kovacs A, Sziranyi T. Improved Harris feature point set for orientation-sensitive urban-area detection in aerial images. IEEE Geosci Remote Sens Lett. 2013;10(4):796–800. doi:10.1109/LGRS.2012.2224315
[2] Ettarid M. Automatic subpixel coregistration of remote sensing images using phase correlation and Harris detector. Remote Sens. 2021;13(12):2314. doi:10.3390/rs13122314
[3] Lowe DG. Object recognition from local scale-invariant features. Proceedings of International Conference on Computer Vision; 1999. p. 1150–7. doi:10.1109/ICCV.1999.790410
[4] Hasan M, Jia X, Robles-Kelly A, Zhou J. Multispectral remote sensing image registration via spatial relationship analysis on SIFT keypoints. Geoscience & Remote Sensing Symposium. IEEE; 2010. p. 1011–4. doi:10.1109/IGARSS.2010.5653482
[5] Etezadifar P, Farsi H. A new sample consensus based on sparse coding for improved matching of SIFT features on remote sensing images. IEEE Trans Geosci Remote Sens. 2020;58(99):1–10. doi:10.1109/TGRS.2019.2959606
[6] Chang HH, Chan WC. Automatic registration of remote sensing images based on revised SIFT with trilateral computation and homogeneity enforcement. IEEE Trans Geosci Remote Sens. 2021;59(99):1–16. doi:10.1109/TGRS.2021.3052926
[7] Bay H, Tuytelaars T, Gool LV. SURF: Speeded up robust features. European conference on computer vision. Berlin, Heidelberg: Springer; 2006;1:404–17. doi:10.1007/11744023_32
[8] Zhi LS, Zhang J. Remote sensing image registration based on retrofitted SURF algorithm and trajectories generated from Lissajous figures. IEEE Geosci Remote Sens Lett. 2010;7(3):491–5. doi:10.1109/LGRS.2009.2039917
[9] Zhang T, Zhao R, Chen Z. Application of migration image registration algorithm based on improved SURF in remote sensing image mosaic. IEEE Access. 2020;8:163637–45. doi:10.1109/ACCESS.2020.3020808
[10] Rosten E, Porter R, Drummond T. Faster and better: A machine learning approach to corner detection. IEEE Trans Pattern Anal Mach Intell. 2008;32(1):105–19. doi:10.1109/TPAMI.2008.275
[11] Rublee E, Rabaud V, Konolige K, Bradski G. ORB: an efficient alternative to SIFT or SURF. 2011 International conference on computer vision. IEEE; 2011. p. 2564–71. doi:10.1109/ICCV.2011.6126544
[12] Zhang Y, Zou Z. Automatic registration method for remote sensing images based on improved ORB algorithm. Remote Sens Land Resour. 2013;25(3):20–4.
[13] Ma D, Lai HC. Remote sensing image matching based improved ORB in NSCT domain. J Indian Soc Remote Sens. 2019;47(5):801–7. doi:10.1007/s12524-019-00958-y
[14] Wang S. Accurate registration of remote sensing images based on optimized ORB algorithms. Transactions on Computer Science and Technology. 2019;7(1):4.
[15] He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV); 2017. p. 2980–8. doi:10.1109/ICCV.2017.322
[16] Mahmoud A, Mohamed S, El-Khoribi R, AbdelSalam H. Object detection using adaptive Mask R-CNN in optical remote sensing images. Int J Intell Eng Syst. 2020;13(1):65–76. doi:10.22266/ijies2020.0229.07
[17] Wu Q, Feng D, Cao C, Zeng X, Feng Z, Wu J, et al. Improved Mask R-CNN for aircraft detection in remote sensing images. Sensors. 2021;21(8):2618. doi:10.3390/s21082618
[18] Yu C, Hu Z, Li R, Xia X, Zhao Y, Fan X, et al. Segmentation and density statistics of mariculture cages from remote sensing images using Mask R-CNN. Inf Process Agric. 2021;9(3):417–30. doi:10.1016/j.inpa.2021.04.013
[19] Allaberdiev RS, Jiang H, Odongo RO. Apparel keypoints localization by Mask R-CNN and attribute recognition. Int J Adv Res Eng & Technol. 2019;6(10):8.
[20] Zhang W, Fu C, Zhu M. Joint object contour points and semantics for instance segmentation. arXiv e-prints. 2020. arXiv:2008.00460.
[21] Wong CC, Yeh LY, Liu CC, Tsai CY, Aoyama H. Manipulation planning for object reorientation based on semantic segmentation keypoint detection. Sensors. 2021;21(7):2280. doi:10.3390/s21072280
[22] Hardy P, Dasmahapatra S, Kim H. Super resolution in human pose estimation: pixelated poses to a resolution result? arXiv e-prints. 2021. arXiv:2107.02108.
[23] Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39:1137–49. doi:10.1109/TPAMI.2016.2577031
[24] Liu Y, Liu J, Ning X, Li J. MS-CNN: multiscale recognition of building rooftops from high spatial resolution remote sensing imagery. Int J Remote Sens. 2022;43(1):270–98. doi:10.1080/01431161.2021.2018146
© 2022 Jun Chen et al., published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.