The reconstruction and analysis of building models are crucial for the construction of smart cities. A refined building model can provide reliable data support for the data analysis and intelligent management of smart cities. The colors, textures, and geometric forms of building elements, such as building outlines, doors, windows, roof skylights, roof ridges, and advertisements, are diverse; therefore, it is challenging to accurately identify the various details of buildings. This article proposes the Multi-Task Learning AINet (MTL-AINet) method, which considers features such as color, texture, direction, and roll angle for building element recognition. AINet is used as the basis function; the semantic projection maps of color and texture and of direction and roll angle are used for multi-task learning, and the complex building facade is divided into patches with similar semantics. Thereafter, the multi-semantic features are combined using hierarchical clustering with a region adjacency graph and a nearest neighbor graph to achieve accurate recognition of building elements. The experimental results show that the proposed method attains higher accuracy on detailed building edges and can accurately extract detailed elements.
The reconstruction and analysis of building models are important aspects of smart city construction. Refined building models can provide reliable data support for the data analysis and intelligent management of smart cities. However, owing to the diverse colors, textures, and forms of details such as building contours, doors and windows, roofs, and advertisements, it is difficult to accurately identify the various details of buildings. Effectively improving the accuracy and completeness of building facade element extraction is therefore imperative for the construction and development of smart cities. In recent years, semantic segmentation technology has developed rapidly in the field of building facade information extraction.
Semantic segmentation is crucial for understanding images in image processing and computer vision tasks. Its basic idea is to classify every pixel in the image and determine the category of each point (such as belonging to the background, edge, or subject). Point cloud segmentation divides points according to features such as space, geometry, and texture, so that point clouds in the same group have similar features. In recent years, methods based on image semantic segmentation and point cloud segmentation have been increasingly used in building detail recognition. (a) Image-based building information extraction: traditional methods include those based on K-means clustering, pixel-level and region-level analysis, and morphological operations. These mostly use the low-level features of the image, such as shadows, structures, edges, and light–dark contrast. Huang et al. proposed using a group of morphological operations to represent the inherent structural characteristics of buildings (such as brightness, contrast, and size) and automatically detect buildings from images. Zheng et al. proposed the image edge regularity and shadow line indices as new features for the boundary recognition of specific buildings. However, owing to the increasing amount of data and the complexity of real buildings, these methods based on manually designed low-level image features cannot achieve automatic and accurate building recognition. With the development of deep learning in recent years, convolutional neural networks (CNNs) have replaced the manual and tedious feature design process by learning the semantic hierarchy of image data, thereby automating image feature extraction. Long et al. converted CNNs into fully convolutional networks (FCNs); however, the segmentation of fine details was poor. Encoder–decoder model structures, such as SegNet and U-Net, can effectively alleviate this problem.
To further improve accuracy, the DeepLab series increased segmentation accuracy by extending the receptive field, learning multi-scale context information, and adding post-processing structures. Although these attempts and improvements across different networks have increased segmentation accuracy, they still suffer from low edge accuracy when extracting the details of complex building structures. (b) Point cloud-based building information extraction: the semantic features of building facades are extracted from noisy building point clouds. Traditional methods use generated geo-referenced images, coordinate information, shape features, and prior knowledge to extract windows, doors, billboards, and other elements. Other traditional approaches include point cloud clustering and segmentation-based methods, similarity matching-based methods, and plane fitting-based methods. In contrast, deep learning-based methods achieve higher segmentation accuracy and do not require the manual design of feature extraction operators; therefore, using deep learning to process point cloud data of building facades is of substantial research significance. For example, VoxNet, Kd-Net, OCT-NET, and others first convert irregular point clouds into regular voxel grids, but this significantly increases the amount of data, and their computational efficiency is low. Multi-view CNNs (MV-CNNs) perform semantic segmentation by projecting a point cloud onto a 2D plane, but lose spatial information during dimension reduction. PointNet, an end-to-end network that directly processes point clouds, greatly improved the accuracy of point cloud segmentation. Thereafter, some scholars improved it by considering the local features of point clouds and proposed PointNet++.
PointNet++ divides the entire point cloud into a series of local regions and then runs PointNet in each region to extract its local features, which better handles the local and global structural information of point cloud data. Although this method improved the segmentation accuracy, it still ignored many details, resulting in under- or over-segmentation. Deep learning-based methods are driven by massive amounts of data. However, the current lack of a point cloud annotation reference dataset for large-scale urban building information extraction causes considerable inconvenience in the extraction of building details based on point cloud data [27,28].
With regard to building detail recognition, the results of image-based segmentation are still not ideal owing to the lack of 3D geometric information, while point cloud-based segmentation methods are ineffective for color semantic extraction when there are no obvious geometric features. For example, architectural images are affected by external elements such as shadows, occlusions, and shading, which make 2D color and texture semantics susceptible to interference. In the case of building facades with different directions and roll angles, it is difficult to extract all the details based on color and texture features alone, and it is difficult to accurately extract information with obvious texture features, such as advertisements on flat walls, using only a point cloud. Therefore, accurate recognition of building elements is difficult using only 2D or 3D information. For most deep network architectures, the standard convolution operations are defined on regular grids, which largely limits the processing efficiency for irregular grids. As an end-to-end trainable network, superpixel segmentation with FCNs introduced nondifferentiable modules; moreover, both the skip-connect operation and the low-level pixel–pixel relationship adversely affect the segmentation results. AINet directly predicts the pixel–superpixel relationship by integrating the association implantation (AI) module into the FCN, which effectively improves segmentation efficiency. In addition, a loss function incorporating a boundary-perceiving loss helps to improve the edge consistency of superpixels. Based on the aforementioned analysis, this study proposes a building element recognition method, called MTL-AINet (Multi-Task Learning AINet), based on 2D color and texture semantics and 3D direction and roll angle semantics.
The proposed building facade segmentation method based on semantic projection correlation graphs of color, texture, direction, and roll angle addresses the shortcomings of traditional semantic segmentation methods based on image texture and improves the accuracy of building element recognition.
In this study, the association learning of 2D color and texture features and 3D direction and roll angle features was realized to enable detail recognition and extraction of building facades in different scenes. A flowchart of the proposed method is shown in Figure 1.
First, a dense point cloud model is generated through a multi-view stereo (MVS) reconstruction technique using building sequence images, and the direction and roll angle semantics are then calculated. Considering the building facade orientation, the model is divided into different viewpoints for planar image projection. Finally, the point cloud model is projected onto a 2D image to obtain pixel-level correlated direction, roll angle, and color maps. The fusion probability Q is obtained through multi-task integrated learning, yielding multi-semantic homogeneous patches. According to the different requirements of building element extraction tasks, a multi-semantic hierarchical clustering strategy is adopted to obtain the clustering results of building facades and achieve fine recognition and extraction of building elements under different scenarios.
2.1 Projections from multiple-view perspectives
Generally, both color and texture are used for superpixel segmentation. A building surface has rich color and texture semantics. Building texture refers to the similar structures presented by the color and material of the building surface, and color and texture can characterize the 2D features of different objects on the building surface, such as walls, windows, and doors [31,32]. For example, in Figure 2(a), the color and texture features of the building facade, windows, and roofs differ significantly, and building detail elements such as windows, facades, and roofs can be accurately identified. However, as shown in Figure 2(b), the color and texture of the roof are similar to those of the facade, and it is difficult to accurately identify the building elements using only color and texture. The dilapidated building facade in Figure 2(c) also shows an inconsistency between the 2D color and texture and the 3D features of the facade. In conclusion, although 2D color and texture semantics are important for building detail recognition, they alone are insufficient for the accurate recognition of complex building details.
It is difficult to achieve accurate recognition of complex building details using only 2D color and texture semantics. Adding 3D features can overcome the limitations of inaccurate 2D features and increase the recognition accuracy. Figure 3 shows the geometric schematics and projection interpretation of multiple building features. The color feature is displayed in Figure 3(a). Figure 3(b) shows the normal vectors of the roof and the facade of a building. The geometric relationship between the orientation and roll angle features is described in Figure 3(c): the roll angle is the included angle between the normal vector of a plane (e.g., a roof or facade of the building in Figure 3) and the horizontal plane, and the direction is the angle between the projection of the normal vector onto the horizontal plane and the true north direction. In this study, the normal vectors of the point cloud model are calculated based on the planar-fitting method. The combined use of direction, roll angle, color, and texture features enables an accurate, quick, and automatic characterization of the detailed elements of buildings. Figure 4(a–d) shows schematic diagrams of the projection of building facades under multiple-view perspectives. Each building facade is filled with a distinct color.
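The direction and roll angle semantics described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes a unit-normalizable plane normal, takes true north as the +y axis, and defines the roll angle as the elevation of the normal above the horizontal plane. The function name is hypothetical.

```python
import math

def direction_and_roll(normal):
    """Return (direction_deg, roll_deg) for a plane normal (nx, ny, nz).

    direction: angle between the horizontal projection of the normal and
               true north (taken here as the +y axis), in [0, 360).
    roll:      angle between the normal and the horizontal plane, in [0, 90].
    """
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    # Roll angle: elevation of the (unit) normal above the horizontal plane.
    roll = math.degrees(math.asin(abs(nz)))
    # Direction: azimuth of the horizontal projection, measured from north (+y).
    direction = math.degrees(math.atan2(nx, ny)) % 360.0
    return direction, roll

# A vertical facade facing east has normal (1, 0, 0): direction 90, roll 0.
# A flat roof has normal (0, 0, 1): roll 90.
```

Under these conventions, a facade and a roof with near-identical colors still receive clearly different (direction, roll) pairs, which is exactly the cue the multi-task branch exploits.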
The point cloud obtained from the sequence images is used to calculate the direction and roll angle semantics of the building; however, buildings in different scenes have different facade orientations and structural postures. To establish the correlation mapping between 2D and 3D features while considering the overall shape of the building, this study combines the facade orientation of the actual building with a multi-angle projection design: the building is divided into different perspectives, and multiple perspectives are selected for projection to establish the mapping between the direction and roll angle of each perspective and the color and texture semantics (Figure 4(d)), which provides the basic data for multi-task learning.
2.2 Multi-task learning and segmentation based on MTL-AINet
AINet adopts an encoder–decoder architecture. In the encoding stage, deeper features are acquired as the receptive field over the input image increases, and feature maps with superpixel embeddings are output; these are fed into the decoding stage to generate the pixel–superpixel association maps. Meanwhile, the superpixel and pixel embeddings generated in the encoding and decoding stages, respectively, realize the direct interaction between a pixel and its neighborhood through the AI module: the superpixel embeddings are implanted around pixel p so that the network can capture the association between p and its neighboring grids, which is more in line with the goal of superpixel segmentation. A matrix Q of size h × w × 9 is finally obtained as the relationship between p and the surrounding nine superpixels, i.e., the probability that p belongs to each of them. The central information of each superpixel is calculated using the association matrix, and the image is reconstructed from the superpixel features using the association between pixels and superpixels. The specific calculation is as follows:
where f(p) is the feature of pixel p, the association weight of pixel p with superpixel s is taken from Q, the summation runs over the superpixels around the pixel, and the two components of the feature represent the feature information and the position information, respectively.
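Since the original equations are not reproduced here, the following pure-Python sketch illustrates the computation under stated assumptions: each superpixel center is the Q-weighted average of the features of the pixels that list it among their nine candidates, and each pixel is then reconstructed as the Q-weighted mixture of its candidate centers. All names are illustrative.

```python
def centers_and_reconstruction(features, q, neighbors):
    """features: per-pixel feature values; q: per-pixel lists of nine
    association weights (each row summing to 1, as in Q); neighbors:
    per-pixel lists of the ids of the nine candidate superpixels."""
    num, den = {}, {}
    # Accumulate Q-weighted features into each candidate superpixel.
    for f, qs, ns in zip(features, q, neighbors):
        for w, s in zip(qs, ns):
            num[s] = num.get(s, 0.0) + w * f
            den[s] = den.get(s, 0.0) + w
    centers = {s: num[s] / den[s] for s in num if den[s] > 0}
    # Reconstruct each pixel from the centers of its candidate superpixels.
    recon = [sum(w * centers[s] for w, s in zip(qs, ns))
             for qs, ns in zip(q, neighbors)]
    return centers, recon
```

If two pixels both belong entirely to one superpixel, that superpixel's center is their mean feature, and both pixels reconstruct to that mean, matching the intent of the reconstruction step.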
To make the superpixels fit more closely to object edges, a boundary-perceiving loss is constructed and applied to the pixel-by-pixel embeddings to improve boundary accuracy. It comprises two components: the first draws pixels of the same category closer to their mean, i.e., closer to each other, and the second increases the variability between pixels belonging to different categories; the overall form follows the cross-entropy (CE) function. Finally, the loss function is constructed from the CE of the semantic labels and position vectors, the L2 reconstruction loss, and the boundary-perceiving loss using the following equations:
where the ground-truth label and the semantic label reconstructed from the estimated association matrix Q enter the CE term, B is the sampling block, the third term is the boundary-perceiving loss, and α and β are two weighting factors.
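A hedged sketch of how the terms combine, assuming the common form of a weighted sum; the CE helper is a standard cross-entropy over a one-hot truth vector and is not taken from the paper's code:

```python
import math

def cross_entropy(truth, pred, eps=1e-12):
    # -sum_i t_i * log(p_i) for a one-hot truth vector and a predicted
    # probability distribution; eps guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(truth, pred))

def total_loss(ce, l2_recon, boundary, alpha, beta):
    # Assumed shape of equation (3): CE of the semantic labels plus the
    # L2 reconstruction loss and boundary-perceiving loss, weighted by
    # the factors alpha and beta.
    return ce + alpha * l2_recon + beta * boundary
```

With alpha = 0.5 and beta = 0.25, for example, a CE of 1.0, a reconstruction loss of 2.0, and a boundary loss of 3.0 combine to 2.75.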
Because the AINet superpixel segmentation network is only applicable to the 2D color and texture features of images, it is difficult for it to handle 3D feature data such as roll angle and direction. Therefore, a multi-task learning segmentation model, MTL-AINet, is designed based on texture, color, direction, and roll angle semantics. AINet is used as the basis function, and the mapping association maps obtained from the projection of the texture and color features and the direction and roll angle features are input into the model for multi-task association learning, so that multi-task learning considering texture, color, direction, and roll angle can be performed. The specific steps are as follows: for building facade information segmentation in different scenes, the dense point cloud model of the building is first reconstructed from the image sequence using the MVS algorithm; the direction and roll angle semantics are calculated; the model is divided into different viewpoints based on its overall structure and form to obtain the multiple views; and the views are then projected onto 2D images to obtain the pixel-level association of the direction and roll angle maps and the color and texture maps, which serve as the input of the model. The fused pixel–superpixel association matrix (shown in Figure 5) is obtained via multi-task integrated learning through the AINet network.
In the proposed algorithm, two sets of pixel embeddings, corresponding to the color and texture semantics and to the direction and roll angle semantics, respectively, are obtained through the deep neural network, giving two embeddings for each pixel p. Let S be the sampling interval. The input image is compressed through multiple convolution and max-pooling operations to generate two grid-cell feature maps with multidimensional semantics.
Each grid-cell feature map is then transformed into a new feature map through a 3 × 3 convolution. Thus, the embedding of the nine grid cells around pixel p is defined using equation (5), which directly associates a pixel with a semantic block:
The association graph can be predicted through a 3 × 3 convolution and equation (6):
The proposed method uses the same loss function as the AINet superpixel segmentation method, including the three losses: CE loss, pixel reconstruction loss, and boundary-perceiving loss (as in equation (3)).
A new set of embedded pixels can be computed using equations (5) and (6), which directly reflect the pixel–superpixel semantic associations regarding color and roll angle. In the proposed method, AINet is used as the base learner, and the multi-feature semantic association projection maps, i.e., the color and texture feature map and the direction and roll angle map, are used as the multiple inputs of MTL-AINet. Two soft association maps are obtained through MTL-AINet, each describing the probability of a pixel belonging to its adjacent superpixels. The two maps are then integrated to calculate the fused association map via equation (7), which is the multi-feature semantic output of MTL-AINet. Subsequently, a set of semantic blocks is extracted according to the fused soft association map:
where the weight factors assigned to the color–texture and direction–roll-angle association maps, respectively, are constrained to sum to one.
Figure 6 shows the detailed computation process of the fused soft association map. One branch gives the probability distribution of each pixel over its nine adjacent grid cells with respect to color and texture, and the other gives the corresponding distribution with respect to direction and roll angle; these pixel attributions reflect the similarity between a pixel and its nine adjacent grid cells in 2D and 3D features, respectively, and their weighted combination yields the fused probability distribution over the multi-feature semantics. Taking the maximum of the nine multi-feature probabilities for each pixel yields a label mapping, which corresponds to the semantic block segmentation result. In summary, the feature maps of color and texture and of direction and roll angle are calculated, their soft association mappings are computed using the AINet basis function, the optimal association matrix is obtained using a multi-task learning strategy, and the value with the highest probability is taken as the final homogeneous semantic block segmentation result.
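The fusion and labelling steps above can be sketched as follows. This is an assumed minimal form of equation (7): the two per-pixel probability vectors are mixed by weights that sum to one, and the pixel is assigned to the candidate superpixel with the highest fused probability. The function name is hypothetical.

```python
def fuse_and_label(qc, qd, w1, w2):
    """qc, qd: lists of nine probabilities for one pixel (color/texture and
    direction/roll-angle branches); w1, w2: fusion weights with w1 + w2 = 1.
    Returns the fused probabilities and the index (0..8) of the winning
    neighboring superpixel."""
    assert abs(w1 + w2 - 1.0) < 1e-9, "fusion weights must sum to one"
    fused = [w1 * a + w2 * b for a, b in zip(qc, qd)]
    return fused, max(range(9), key=lambda k: fused[k])
```

Note that the 3D branch can overturn the 2D branch: even if color and texture favor one candidate, a strong direction/roll-angle preference for another candidate can win after fusion, which is how the method separates facades with near-identical textures.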
2.3 Building information extraction based on semantic clustering
On the basis of the initial segmentation results of each building facade, the lowest level of homogeneous regions is obtained. For recognition tasks with different requirements, a region adjacency graph based on the 2D color and texture features and the 3D direction and roll angle features is constructed. Region merging is treated as an approximation problem of the image, and the final clustering results are obtained by a stepwise iterative optimization based on the nearest neighbor graph, in which the nodes connected by the smallest edge weight are merged.
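The greedy merging loop can be illustrated with a simple sketch, not the paper's implementation: regions carry a scalar feature and an area, adjacency is a set of edges, and the pair with the smallest cost is merged until a target count remains. The area-weighted squared-difference cost used here is a common approximation-error criterion and is an assumption.

```python
def merge_regions(features, areas, adjacency, n_target):
    """features: {region_id: feature}; areas: {region_id: area};
    adjacency: set of frozenset({a, b}) edges; merges until n_target
    regions remain, always taking the cheapest adjacent pair."""
    def cost(edge):
        a, b = tuple(edge)
        wa, wb = areas[a], areas[b]
        # Area-weighted squared feature difference (approximation error).
        return wa * wb / (wa + wb) * (features[a] - features[b]) ** 2

    while len(features) > n_target and adjacency:
        a, b = tuple(min(adjacency, key=cost))
        wa, wb = areas[a], areas[b]
        # Merge b into a: area-weighted mean feature, summed area.
        features[a] = (wa * features[a] + wb * features[b]) / (wa + wb)
        areas[a] = wa + wb
        del features[b], areas[b]
        # Redirect b's edges to a and drop the collapsed self-edge.
        adjacency = {frozenset(a if r == b else r for r in e) for e in adjacency}
        adjacency.discard(frozenset({a}))
    return features, areas
```

In the paper's setting, the scalar feature would be replaced by the multi-semantic distance (texture histogram plus direction/roll-angle heterogeneity), but the control flow of the region adjacency graph merge is the same.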
The texture characteristics of the image are measured using the joint probability distribution histogram of the local binary pattern (LBP) and local contrast (LC), which combines the structure and intensity of the image texture. The G-statistic method is used to perform the analysis. Let x and y, respectively, represent random variable sets, and let g denote the probability density function; the G-statistic formula is then as follows:
where the pixel value is taken at each position in the image, the histogram entry gives the number of pixels (pixel frequency) with gray level i in the image, and t represents the number of gray levels in the image. The resulting statistic has a strong capability for texture description and differentiation.
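Since the equation itself is not reproduced here, the following sketch implements the standard two-sample G-statistic (log-likelihood ratio) between two frequency histograms, as commonly used with LBP/C texture histograms; it is an assumption that the paper uses this standard form. Identical histograms score 0, and the score grows with dissimilarity.

```python
import math

def g_statistic(hist_a, hist_b):
    """Two-sample G-statistic between two non-empty frequency histograms
    of equal length; lower values mean more similar textures."""
    g = 0.0
    for h in (hist_a, hist_b):
        g += sum(f * math.log(f) for f in h if f > 0)   # sum f*log f per sample
        g -= sum(h) * math.log(sum(h))                  # sample totals
    for fa, fb in zip(hist_a, hist_b):                  # per-bin totals
        if fa + fb > 0:
            g -= (fa + fb) * math.log(fa + fb)
    total = sum(hist_a) + sum(hist_b)
    g += total * math.log(total)                        # grand total
    return 2.0 * g
```

For example, two identical histograms yield 0, while two histograms with all mass in different bins yield a strictly positive value.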
Buildings are geographic entities with definite shape information, and during pixel merging, it is often difficult to appropriately distinguish the contour edges of buildings using only texture information; therefore, a geometric orientation constraint is adopted to obtain objects with well-fitting edges. During merging, neighboring regions with longer common edges are prioritized to obtain more compact objects. The influence of the common edge length is introduced on the basis of equation (9), and the geometric directional heterogeneity is defined as H, where the common edge length of neighboring regions x and y is weighted by the influence factor i of the common edge. When i = 0, the common-edge term equals 1, i.e., the common edge has no influence on the regional heterogeneity, whereas when i ≠ 0, the longer the common edge, the smaller the heterogeneity.
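One functional form consistent with the behavior described above, offered purely as a hypothesis since the equation is not reproduced here, is to scale the heterogeneity by the common edge length raised to the power -i: the term equals 1 when i = 0, and a longer common edge reduces the heterogeneity when i > 0.

```python
def edge_weighted_heterogeneity(h, common_edge_len, i):
    # Hypothetical form matching the text: with i == 0 the common-edge
    # factor is 1 (no influence); with i != 0, a longer common edge
    # yields a smaller weighted heterogeneity.
    return h * common_edge_len ** (-i)
```

This biases the merge order toward neighbor pairs that share long boundaries, producing the more compact objects the text describes.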
In the region adjacency graph- and the nearest neighbor graph-based area merging, the problem of minimizing the adjacent-area approximation error is gradually transformed into that of calculating the area with the smallest merging cost. The merging cost is the weight of the edge of an adjacent area with a common edge and is defined as:
where the merging cost of regions x and y is computed from the areas of x and y and the heterogeneity of the two regions.
Based on the aforementioned clustering strategy for building element extraction, feature maps are combined with hierarchical clustering to extract building elements in different scenes.
3 Experiments and results
Real buildings were selected as the experimental objects, and datasets comprising RGB samples and 3D semantic samples were first constructed for model training. The RGB and 3D semantic datasets were divided equally, with a one-to-one match. The training dataset contained 500 samples, and the testing dataset contained 200 samples. Semantic segmentation and hierarchical clustering were performed on different facades of buildings in different scenarios. To validate the accuracy and completeness of the proposed MTL-AINet method, three traditional superpixel segmentation methods, namely, the SLIC-based, LSC-based, and AINet-based methods, were compared and analyzed against the proposed method. Moreover, superpixel evaluation metrics [40,41], including boundary recall (REC), undersegmentation error (UE), achievable segmentation accuracy (ASA), compactness (CO), and intra-cluster variation (ICV), were used to evaluate the reliability and robustness of the segmentation methods for each building facade.
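As an example of the metrics listed above, the achievable segmentation accuracy (ASA) has a simple standard definition: for each superpixel, count the pixels of its best-matching ground-truth class, sum over superpixels, and divide by the total pixel count. The sketch below follows that standard definition; it is not the paper's evaluation code.

```python
from collections import Counter

def asa(superpixel_labels, gt_labels):
    """Achievable segmentation accuracy. Both arguments are flat,
    equal-length sequences of per-pixel labels; returns a value in (0, 1],
    where 1 means every superpixel lies inside one ground-truth region."""
    overlap = Counter(zip(superpixel_labels, gt_labels))
    best = Counter()
    for (sp, gt), n in overlap.items():
        best[sp] = max(best[sp], n)  # best-matching class per superpixel
    return sum(best.values()) / len(gt_labels)
```

For instance, with superpixels [0, 0, 1, 1] against ground truth [0, 0, 0, 1], superpixel 1 straddles two classes and the ASA is 3/4.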
3.1 Segmentation extraction for building facades with similar texture
The Florentine Cathedral multi-view sequence image dataset (Figure 7(a)) was used in the experiment. A total of 105 images were selected with a resolution of 1,296 × 1,936 and a focal length of 29 mm. The sparse point cloud and camera parameters were first obtained using the structure from motion (SfM) technique, and the dense point cloud was then obtained using the MVS method. In total, 39.01 million points were obtained in the experiment, and the dense reconstruction results are shown in Figure 7(b).
Based on the geometric orientation of the building, the color and texture, and direction and roll angle projections were selected from the left, front, and right views of the building, and directional feature maps were obtained. Figure 8(a–c) shows the color and texture projections of the left, front, and right views of the building, respectively. Figure 8(d–f) shows the direction and roll angle projections of the left, front, and right views of the building, respectively. The figures show that the color and texture and the direction and roll angle projections express different details of the building facades.
Figures 9–11, respectively, show the segmentation results of the four segmentation methods under different viewing angles. From the regions marked with the two red rectangles, it can be seen that the proposed MTL-AINet method performs better than the three traditional methods in regions with similar color and texture but different facade orientations: the different facades could not be segmented successfully by the three traditional methods, whereas the proposed method can effectively distinguish different facades with similar textures, which facilitates the clustering task.
The clustering results of each facade shown in Figures 12–14 indicate that the AINet method incorrectly clusters the roofs into the same facade, which it cannot separate owing to their overly similar color and texture. The proposed MTL-AINet method effectively utilizes the directional features of each facade, distinguishes different facades with similar textures (such as the roofs and the edges of each facade), and accurately extracts the detailed structure of each building. Therefore, the MTL-AINet method is robust for buildings with similar color and texture features but large facade differences.
The experimental results show that for buildings with different facades but very similar texture features, the segmentation method relying only on single-texture semantics is prone to misclassification, and it is difficult to extract sufficient detail information. The segmentation method based on color and texture and direction and roll angle semantics considerably overcomes this limitation and can efficiently extract fine details of each building facade.
3.2 Segmentation extraction for complex buildings with occlusions and shadows
The multi-view sequence image dataset of Örebro Castle (Figure 15(a)) was used in this experiment. A total of 136 images were selected with an image resolution of 1,936 × 1,296 and a focal length of 29 mm. The sparse point cloud and camera parameters were first obtained through SfM, and the dense point cloud was then obtained through the MVS method. A total of 77.72 million points were obtained in the experiment, and the dense point cloud reconstruction result is shown in Figure 15(b).
The building was divided into four views (front, back, right, and left) according to its architectural characteristics, and a directional semantic feature map was constructed, as shown in Figure 16. In this experiment, the four views of the building were selected for the color and texture and direction and roll angle projections, and directional feature maps were obtained. Figure 16(a–d) shows the color and texture projections of the front, right, back, and left views of the building, respectively. Figure 16(e–h) shows the direction and roll angle projections of the front, right, back, and left views of the building, respectively. From the figures, it is evident that the texture and color and the direction and roll angle projection maps can express different details of the building facades. In particular, the direction and roll angle projection maps can significantly reduce the influence of shadows.
Because all the facades of the building are affected by shadows and occlusions to different degrees, the three traditional superpixel segmentation methods based on texture semantics can only segment the areas with obvious color–texture features and do not sufficiently distinguish the shadowed parts and occlusions, such as the boundaries of the column-shaped structures in Figures 17(a–c) and 18(a–c) and the roof and bottom fences in Figures 19(a–c) and 20(a–c). The proposed MTL-AINet method compensates for these shortcomings by weakening the influence of shadows and occlusions and improving the segmentation accuracy; for example, the top protrusion in Figure 17(d) is not mixed with the roof, the shadows and edges in Figures 18(d) and 20(d) are not split into separate segments, and the bottom fence in Figure 19(d) is not mixed with the background but is segmented into a homogeneous surface.
From Figures 21(a) and 22(a), it is evident that the clustering result maps based on the superpixel segmentation results generated by AINet erroneously divide the shadows into a separate facade. Moreover, because the shaded parts present texture features similar to those of other facades, they can easily be mixed into the same facade, as shown in Figures 23(a) and 24(a), which considerably impacts the extraction of building elements. In contrast, the proposed MTL-AINet method can effectively avoid the effects of occlusions and shadows and improve the accuracy of facade structure extraction, as shown in Figures 21(b), 22(b), 23(b) and 24(b), where the large areas of shadows and shaded regions are efficiently distinguished and extracted. This significantly weakens the influence of shadows and facilitates subsequent tasks.
The experimental results show that because of shadows and occlusions, building facades cannot be segmented accurately based only on texture. However, a segmentation method based on color and texture, and direction and roll angle semantics can efficiently extract detailed elements of building facades.
3.3 Segmentation extraction for old buildings with inconsistent textures
The Martenstroget multi-view sequence image dataset (Figure 25(a)) was used in this experiment. Twelve images with an image resolution of 2,592 × 3,872 and a focal length of 30 mm were selected. The sparse point cloud and camera parameters were first obtained through SfM, and the dense point cloud was then obtained through the MVS method. A total of 23.2 million points were obtained in the experiment, and the dense point cloud reconstruction results are shown in Figure 25(b).
The building was divided into three views according to its architectural characteristics and a directional semantic feature map was constructed, as shown in Figure 26. In this experiment, three views of the building – front, right, and left – were selected for the color and texture and direction and roll angle projections, and directional feature maps were obtained. Figure 26(a–c) shows the color and texture projections of the front, right, and left views of the building, respectively. Figure 26(d–f) shows the direction and roll angle projections of the front, right, and left views of the building, respectively. From the figures, it is evident that the texture and color and direction and roll angle projection maps can reveal protrusions and depressions.
The color and texture of the old building surface interfere with the recognition of building detail elements. It is evident from Figures 27(a–c) and 28(a–c) that the three traditional superpixel segmentation methods based on texture semantics mistakenly segment walls that have similar color and texture but do not belong to the same facade into the same blocks. From Figures 28(a–c) and 29(a–c), it can be seen that the three traditional methods also mistakenly divide regions belonging to the same facade but with dissimilar color and texture into different categories. As seen from Figures 27(d), 28(d) and 29(d), the proposed MTL-AINet method effectively overcomes these limitations by merging regions that belong to the same facade but differ in color and texture into the same semantic facets, and by distinguishing regions that do not belong to the same facade but have similar color and texture, which provides better basic data for the subsequent hierarchical clustering task.
From the clustering results shown in Figures 30(a) and 31(a), it can be concluded that the traditional clustering method can only extract areas with more evident color textures and produces more false extractions. In contrast, the proposed method incorporates the direction and roll angle features and achieves a finer extraction of detailed information on old buildings, such as protrusions and depressions, as shown in Figures 30(b) and 31(b).
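The hierarchical clustering stage can be illustrated with a greedy merge over a region adjacency graph (RAG): the pair of adjacent patches whose mean semantic features are closest is merged first, repeatedly, until no adjacent pair is within a threshold. This is a simplified, pure-numpy stand-in for the paper's RAG and nearest-neighbor-graph clustering; the `rag_merge` helper, its distance, and the threshold are illustrative assumptions:

```python
import numpy as np

def rag_merge(labels, feats, thresh):
    """Greedy hierarchical merging on a region adjacency graph:
    repeatedly merge the closest pair of *adjacent* patches (by mean
    feature distance) until no pair is closer than `thresh`."""
    # Adjacency from 4-connected label transitions
    edges = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        for p, q in zip(a.ravel(), b.ravel()):
            if p != q:
                edges.add((min(p, q), max(p, q)))
    parent = {k: k for k in np.unique(labels)}
    def find(k):                      # union-find root lookup
        while parent[k] != k:
            k = parent[k]
        return k
    feats = {k: np.asarray(v, float) for k, v in feats.items()}
    sizes = {k: (labels == k).sum() for k in parent}
    while True:
        cand = [(np.linalg.norm(feats[find(p)] - feats[find(q)]), find(p), find(q))
                for p, q in edges if find(p) != find(q)]
        if not cand:
            break
        d, rp, rq = min(cand)
        if d > thresh:
            break
        # Merge rq into rp with a size-weighted mean feature
        feats[rp] = (feats[rp] * sizes[rp] + feats[rq] * sizes[rq]) / (sizes[rp] + sizes[rq])
        sizes[rp] += sizes[rq]
        parent[rq] = rp
    return np.vectorize(find)(labels)

# Four quadrant patches: the top pair and bottom pair have similar
# features, so each pair merges; the two halves stay separate.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
feats = {0: [0.0], 1: [0.1], 2: [5.0], 3: [5.1]}
merged = rag_merge(labels, feats, thresh=1.0)
```

In the multi-semantic setting, the per-patch feature vector would concatenate color, texture, direction, and roll angle statistics, so patches with similar color but different roll angles stay unmerged.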
Figure 28 shows the segmentation and clustering results for the front view of the building, wherein it is evident that the surface textures of the old buildings differ considerably. Figure 28(a–c) shows the texture-based segmentation results of the three traditional methods; regions belonging to the same facade but with variations in brightness and color were wrongly segmented into separate facets. Better segmentation results are obtained by the proposed MTL-AINet method because it considers the direction and roll angle semantics, as shown in Figure 28(d). From Figure 32(a) and (b), it is evident that the clustering results based on AINet superpixel segmentation still cannot distinguish the edge details of old buildings well, whereas the proposed MTL-AINet performs better because it considers both the direction and roll angle constraints.
Thus, the proposed MTL-AINet method achieves higher accuracy and more accurately extracts details in depressions and protrusions. The experimental results show that, for building facades with inconsistent, dilapidated color and texture, or with similar textures but geometric depressions on the facade, the MTL-AINet superpixel segmentation method based on color and texture and direction and roll angle semantics offers better extraction accuracy.
3.4 Evaluation and discussion
To verify the reliability of the proposed method, the building facade segmentation results obtained in the experiment were evaluated using the superpixel evaluation metrics of boundary recall (REC), under-segmentation error (UE), compactness (CO), achievable segmentation accuracy (ASA), and intra-cluster variation (ICV), and compared with three traditional methods based on SLIC, LSC, and AINet. The quantitative evaluation of the building facade segmentation results across different scenes is shown in Tables 1–3.
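Two of these metrics can be made concrete in a few lines. ASA labels each superpixel with its best-overlapping ground-truth class and measures the resulting pixel accuracy, while UE (in one common corrected variant) penalizes superpixel "leakage" across ground-truth boundaries. The following is a small numpy sketch, not the paper's evaluation code:

```python
import numpy as np

def asa(sp, gt):
    """Achievable segmentation accuracy: assign each superpixel its
    best-overlapping ground-truth class, then measure pixel accuracy."""
    correct = 0
    for k in np.unique(sp):
        _, counts = np.unique(gt[sp == k], return_counts=True)
        correct += counts.max()
    return correct / gt.size

def undersegmentation_error(sp, gt):
    """UE, one common corrected variant: for each ground-truth segment,
    sum min(inside, outside) pixel counts of overlapping superpixels."""
    err = 0
    for g in np.unique(gt):
        mask = gt == g
        for k in np.unique(sp[mask]):
            inside = (sp[mask] == k).sum()
            outside = (sp == k).sum() - inside
            err += min(inside, outside)
    return err / gt.size

gt = np.zeros((4, 4), int)
gt[:, 2:] = 1                              # ground truth: vertical split
sp_aligned = np.array([[0, 0, 1, 1],       # superpixels respect the boundary
                       [0, 0, 1, 1],
                       [2, 2, 3, 3],
                       [2, 2, 3, 3]])
sp_crossing = np.zeros((4, 4), int)
sp_crossing[2:, :] = 1                     # superpixels cross the boundary
```

On the aligned toy case ASA is 1.0 and UE is 0.0; on the crossing case, every superpixel straddles the ground-truth boundary, so ASA drops to 0.5 and UE rises to 1.0.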
The bold values refer to the evaluation results for the MTL-AINet method proposed in this research.
As indicated by the results presented in Tables 1–3, in the detail segmentation experiments on each building facade in different scenes and the clustering for different recognition and extraction tasks, the indicators of the proposed MTL-AINet method are greatly improved in comparison with those of the other three traditional methods. From the quantitative comparative evaluation results, it can be concluded that AINet and MTL-AINet greatly improve the CO and reduce the under-segmentation of different building facades compared with the traditional SLIC and LSC. Compared with AINet, the average REC and UE of the segmentation results for the 11 building facades from MTL-AINet improved by approximately 26 and 3%, respectively. The building facade segmentation results from MTL-AINet achieve the highest average REC and ASA. Because MTL-AINet considers multiple semantic features for superpixel segmentation, its ICV increases only moderately compared with that of AINet but is clearly better than those of the SLIC and LSC methods. In conclusion, compared with the three traditional methods, the proposed MTL-AINet method achieves better results, which reflects its superiority.
These experimental results show that methods relying on a single semantic feature cannot adequately capture the detailed features of different building facades and suffer from many over- or under-segmentation problems. The MTL-AINet method proposed in this study integrates the color and texture and direction and roll angle semantics of buildings, which enables an effective and accurate segmentation of target objects. In particular, for building objects with similar colors and textures but different directions and angles, building objects with severe shadow and occlusion effects, and aging building objects whose old and new surfaces differ in color and texture, the proposed method can accurately extract building detail elements with higher robustness.
This study proposed the MTL-AINet algorithm, which considers color and texture as well as direction and roll angle semantics, to extract detailed information on building facades. First, a dense point cloud model of the building is generated using multi-view images, and the 3D direction and roll angle features are then computed. On this basis, these features are projected onto a 2D plane to generate color and texture as well as direction and roll angle feature maps. Subsequently, multi-task learning is used to obtain the fused soft association map of the multi-semantic features of each facade, according to which each facade can be divided into a series of semantic blocks. Finally, the detailed elements of the building are extracted using a semantic hierarchical clustering method. In this study, three kinds of building facades – with similar colors and textures, with shadows and occlusions, and with inconsistent texture distribution and dilapidation – were selected as research objects for the experiments. The experimental results show that the proposed MTL-AINet method achieved the best results in REC, UE, CO, ASA, and ICV and is also superior to the SLIC and LSC methods. Therefore, MTL-AINet has higher reliability and robustness, and its superpixel segmentation results provide a more accurate data basis for further cluster extraction.
For regions with similar or scarce textures, shadows, and occlusions, the proposed method considers multiple semantic features, including direction and roll angle, so it can not only improve the accuracy of 2D segmentation but also meet the requirements of different 3D clustering tasks. Moreover, compared with methods based directly on 3D point clouds, this study projects the 3D information onto a 2D plane for building element extraction, which significantly improves segmentation efficiency. The proposed method provides an approach and data support for the extraction of building details in smart city construction. The multimodal features of superpixels in this study refer to multiple 2D and 3D features. Future work will focus on exploring the intrinsic correlation of multimodal superpixels based on deep neural networks to achieve multimodal superpixel clustering automatically, quickly, and accurately.
The authors would like to thank the reviewers and editors for valuable comments and suggestions. The authors would also like to acknowledge Editage (www.editage.com) for English language editing.
Funding information: This research was funded by the Key Laboratory of Land Satellite Remote Sensing Application, Ministry of Natural Resources of the People's Republic of China (Grant No. KLSMNR-G202213, KLSMNR-G202214), the National Natural Science Foundation of China (Grant No. 41901401, 42271482, 42101070), the China Postdoctoral Science Foundation (Grant No. 2021M691653), the Natural Science Foundation of Jiangsu Province (Grant No. BK20190743), and the Knowledge Innovation Program of Wuhan-Shuguang Project (Grant No. 2022010801020284).
Author contributions: Conceptualization: R.Z., G.L., and L.L.; data curation: M.J.; methodology: R.Z., X.Y., G.L., S.S., and L.L.; validation: M.J., S.S., and Y.H.; writing – original draft: R.Z.; writing – review & editing: R.Z., M.J., G.L., X.Y., S.S., Y.H., and L.L. All authors have read and agreed to the published version of this manuscript.
Conflict of interest: The authors state no conflict of interest.
Su H, Maji S, Kalogerakis E. Multi-view convolutional neural networks for 3D shape recognition. Proceedings of the IEEE International Conference on Computer Vision; 2015. p. 945–53. doi: 10.1109/ICCV.2015.114.
Mostajabi M, Yadollahpour P, Shakhnarovich G. Feedforward semantic segmentation with zoom-out features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 3376–85. doi: 10.1109/CVPR.2015.7298959.
Chen LC, Yang Y, Wang J. Attention to scale: Scale-aware semantic image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 3640–9. doi: 10.1109/CVPR.2016.396.
Yin L, Ji X, Wu D. A building extraction method based on semantic segmentation and efficient conditional random fields optimization. Remote Sens. 2018;10(5):788.
Wang R, Du Q, Tao J, Yuan Z, Li T. Semantic segmentation of high-resolution remote sensing images based on joint feature learning and graph cut. Remote Sens. 2019;11(18):2152.
Huang T, Shengyong Y, Zhiqiang Z, Hongyun L. Model analysis of intelligent data mining based on semantic segmentation technology. Proceedings of the 2015 International Conference on Mechatronics, Electronic, Industrial and Control Engineering; 2015. doi: 10.2991/meic-15.2015.205.
Zheng C, Zhang Y, Wang L. Multilayer semantic segmentation of remote-sensing imagery using a hybrid object-based Markov random field model. Int J Remote Sens. 2016;37(23):5505–32. doi: 10.1080/01431161.2016.1244364.
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 3431–40. doi: 10.1109/CVPR.2015.7298965.
Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481–95. doi: 10.1109/TPAMI.2016.2644615.
Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. p. 234–41. doi: 10.1007/978-3-319-24574-4_28.
Te G, Hu W, Zheng A. RGCNN: Regularized graph CNN for point cloud segmentation. Proceedings of the 26th ACM International Conference on Multimedia; 2018. p. 746–54. doi: 10.1145/3240508.3240621.
Qi CR, Su H, Mo K, Guibas LJ. PointNet: Deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017;1(2):4.
Li Z, Zhong Y, Yang B. Building extraction from airborne LiDAR data using local structural similarity matching. ISPRS J Photogramm Remote Sens. 2020;161:120–33.
Liu Y, Huang X, Zhang L, Qiao Y. Extraction of buildings from LiDAR data with a rectangle model. ISPRS J Photogramm Remote Sens. 2015;101:89–98.
Maturana D, Scherer S. VoxNet: A 3D convolutional neural network for real-time object recognition. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE; 2015. p. 922–8. doi: 10.1109/IROS.2015.7353481.
Klokov R, Lempitsky V. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. 2017 IEEE International Conference on Computer Vision (ICCV); 2017. p. 863–72. doi: 10.1109/ICCV.2017.99.
Riegler G, Ulusoy AO, Geiger A. OctNet: Learning deep 3D representations at high resolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017. p. 6620–9. doi: 10.1109/CVPR.2017.701.
Zhang Y, Rabbat M. A graph-CNN for 3D point cloud classification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE; 2018. p. 6279–83. doi: 10.1109/ICASSP.2018.8462291.
Qi CR, Su H, Mo K. PointNet: Deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 652–60.
Giraud R, Ta VT, Papadakis N. Texture-aware superpixel segmentation. 2019 IEEE International Conference on Image Processing (ICIP), IEEE; 2019. p. 1465–9. doi: 10.1109/ICIP.2019.8803085.
Haris K, Efstratiadis SN, Maglaveras N, Katsaggelos AK. Hybrid image segmentation using watersheds and fast region merging. IEEE Trans Image Process. 1998;7(12):1684–99. doi: 10.1109/83.730380.
Yang F, Sun Q, Jin H. Superpixel segmentation with fully convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 13964–73. doi: 10.1109/CVPR42600.2020.01398.
Gao S, Li ZY, Yang M, Cheng M, Han J, Torr P. Large-scale unsupervised semantic segmentation. IEEE Trans Pattern Anal Mach Intell; 2022. doi: 10.1109/TPAMI.2022.3218275.
Wang Y, Wei Y, Qian X. AINet: Association implantation for superpixel segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. p. 7078–87. doi: 10.1109/ICCV48922.2021.00699.
Hu Z, Wu Z, Zhang Q, Fan Q, Xu J. A spatially-constrained color–texture model for hierarchical VHR image segmentation. IEEE Geosci Remote Sens Lett. 2013;10(1):120–4. doi: 10.1109/LGRS.2012.2194693.
Wang J, Luan Z, Yu Z. Superpixel segmentation with attention convolution neural network. 2021 International Conference on Image, Video Processing, and Artificial Intelligence. Vol. 12076. SPIE; 2021. p. 74–9. doi: 10.1117/12.2611692.
Wu ZC, Hu ZW, Zhang Q, Cui WH. Remote sensing image segmentation method combining spectral, texture and shape structural information. J Surveying Mapp. 2013;1:44–50 (in Chinese).
Achanta R, Susstrunk S. Superpixels and polygons using simple non-iterative clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 4651–60. doi: 10.1109/CVPR.2017.520.
Chen L, Shao L, Bai Q, Yang J, Jiang S, Miao Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 2021;13(22):4712. doi: 10.3390/rs13224712.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.