Three-dimensional (3D) city modeling is an essential component of 3D geoscience modeling, and window detection of building facades plays a crucial role in 3D city modeling. Windows can serve as structural priors for rapid building reconstruction. In this article, we propose a framework for detecting window lines. The framework consists of two parts: an improved stacked hourglass network and a point–line extraction module. This framework can output vectorized window wireframes from building facade images. Besides, our method is end-to-end trainable, and the vectorized window wireframe consists of point–line structures. The point–line structure contains both semantic and geometric information. Additionally, we propose a new dataset of real-world building facades for window-line detection. Our experimental results demonstrate that our proposed method has superior efficiency, accuracy, and applicability in window-line detection compared to existing line detection algorithms. Moreover, our proposed method presents a new idea for deep learning methods in window detection and other application scenarios in current 3D geoscience modeling.
Three-dimensional (3D) building modeling is an important part of 3D geoscience modeling and the foundation of smart city construction [1,2,3]. In recent years, with the development of virtual reality and digital twins, 3D urban reconstruction has become a research hotspot. For example, some researchers study the spatial accessibility between commercial and ecological spaces based on 3D urban reconstruction, while others study its use in household distributed photovoltaic systems. Building facades, as an important part of 3D urban models, are also attracting increasing attention. Images are commonly used as a reference for 3D modeling of buildings because they contain relevant information such as geometric features and depth information. Among this information, geometric features are the most important. For example, existing works [6,7,8,9] have focused on mining different implicit geometric features in building facades to assist with 3D reconstruction of urban buildings. With recent progress in object recognition [10,11,12,13] and large-scale datasets [14,15,16,17], it is possible to identify, extract, and exploit the high-level geometric features or global structure of a scene for 3D reconstruction. These high-level geometric features can provide more significant and complete information about the overall geometry of the scene. A series of studies have sparked interest in extracting geometric structures such as lines and junctions (wireframes), planes [19,20], surfaces, and room layouts. Among all the high-level geometric features, straight lines and their constituent surfaces (collectively referred to as line-surface structures) are the most basic elements that can be used to assemble the 3D structure of building facade scenes.
Recently, Wang and Yu developed a line-segment matching algorithm to rapidly reconstruct the 3D "line-surface" structure of urban building facades. Their algorithm mainly used the window line as a structural prior to reconstruct the line-surface structure. The window line is an important part of the building surface structure. Extracting window lines is pivotal for the reconstruction of building facade line-surface structures and is also of great significance for geovisualization-based 3D reconstruction [24,25]. Therefore, the aim of this article is to explore a simple and efficient method for extracting window lines on building facades.
There are some traditional studies related to window-line detection [26,27,28]. Jung et al. employed a constrained algorithm with laser-scanning instruments to extract and adjust the lines representing the four window edges. The window lines were then projected back to the original coordinates to obtain a complete 3D wireframe model of the window. Wang et al. first labeled the point cloud data obtained by backpack laser scanning as objects (such as walls and windows) and then extracted the line structure from the labeled points to construct the line frame of the building. Based on this, a point cloud building model was created. The building line frame is composed of lines with semantic information, among which the window line is the most important. Zhang et al. first detected all line segments in the building and then kept the longer line segments after further connection and selection. According to the color features on both sides of each long line segment, line segments were selected as potential window edges and labeled. The building was then divided into several subimages to filter out the final window lines. These traditional methods either detected window lines indirectly through costly instruments or used complex algorithms that were unsuitable for complicated scenes. To improve the efficiency and simplicity of line detection, Zhou et al. used deep learning to directly output a vectorized line frame containing junctions and lines from a building image. However, this method identifies all line structures in the scene and cannot single out specific structures such as windows. Some studies [30,31,32] have addressed the extraction of window structures. Kong et al. adopted a boundary-tracking algorithm based on topological structure analysis to identify qualified contours according to window contour characteristics. Then, based on the minimum circumscribed rectangle of these qualified contours, the window area was detected through the minimum spanning tree method. Ma and Ma and Sun and Chen [31,32] each used a neural network model to extract window features for window object recognition. However, their methods only focused on the approximate area of windows on the building facade for object recognition and classification, leading to inaccurate window localization and ignoring the abundant connectivity and geometric features of the window-line structure.
In conclusion, 1) traditional window-line detection methods mostly use complex algorithms, which are inefficient and not suitable for complex scenes. Moreover, expensive equipment is needed for point cloud structure data. 2) Most traditional methods regard the window line as a collection of independent pixels, so the connectivity of the line structure cannot be considered, which is not conducive to the reconstruction of the line-surface structure of the building facade. 3) Existing window detection methods with good performance only perform object recognition for the approximate area of the window, ignoring the semantic and geometric features of the window-line structure. 4) Existing line detection methods combined with deep learning cannot extract window lines.
Newell et al. first proposed the hourglass model, which was successfully applied to the detection of human joint poses. The hourglass model is suitable for detecting point–line features with mutual reference and connectivity, and the window line is a geometric structure with connectivity. Moreover, there is a strong reference between the point and line structures of window lines; that is, known window vertices and window lines can be used as the basis for inferring the positions of other window vertices and lines. Therefore, we choose to extract window lines based on the hourglass model. Following the introduction of the hourglass network, numerous investigations [34,35,36,37,38] have adapted it to enhance its efficacy in human pose estimation. Most of these methods add subnetworks or layers to the hourglass network to optimize performance. However, the original hourglass network already has a very large number of parameters and requires substantial storage and computational capacity, and adding subnetworks or layers exacerbates this problem. How to optimize the hourglass network with a minimal number of parameters without sacrificing its quality therefore becomes a problem. Kim and Lee utilized additional skip connections to improve performance while maintaining the number of network parameters, which addresses this problem well. Their network is called the lightweight hourglass model. Inspired by this, we apply the same idea to our hourglass model for window-line detection. Huang et al. and Zhou et al. [18,29] proposed a simple and fast training framework based on a point–line detection and verification module, which can directly output vectorized wireframes of all line structures in a scene, regardless of whether these lines are needed or repeated. The difference is that Zhou et al. used a sampling module to assist verification, while Huang et al. used a fusion algorithm to assist verification. Accordingly, we propose our window-line detection method as follows. Based on the lightweight hourglass model, our method utilizes the idea of additional skip connections, combined with the window point–line detection and verification module, to achieve end-to-end detection of window lines on building facades. Additionally, we propose a real-world dataset for window-line detection on building facades.
Our proposed method mainly has the following four contributions: 1) Our method offers a new approach to window detection by combining an improved stacked hourglass neural network with a point–line detection and verification module, which can output a vectorized window wireframe with a point–line structure that contains semantic and geometric features. Current window-line detection methods either rely on point cloud data collected by expensive equipment or generate a series of pixel sets based on traditional algorithms. Current window detection methods can only recognize the approximate window area as window objects and do not provide information about connections. The lack of information about how points and lines are connected to each other limits their application in 3D reconstruction, scene parsing, and understanding. In addition, our method applies the idea of additional skip connections, which improves the network performance while maintaining the number of network parameters, resulting in better learning results for window-line detection than other methods. 2) We construct a brand new real-world dataset for window-line detection. To the best of our knowledge, this is the first dataset targeting window lines. 3) We conduct detailed experiments on this dataset to evaluate the performance of our method in comparison with other window-line detection methods, and the experimental results show that our method is well suited to the detection of window lines in building facades. 4) Our framework aims to extract the accurate window structure from building facades in real time. This window structure helps to explore the 3D spatial information of the image, and our system can be used in many scenarios. For example, it allows architects to quickly calculate light rates. It contributes to the automatic construction of virtual city models, which can reduce duplication of effort for 3D reconstruction workers. It can also assist mapping engineers in calculating building heights or floor heights. Moreover, our approach can be extended to other domains if used appropriately. For example, it can assist the construction of sustainability models and the analysis of location-center selection.
2 Related work
2.1 Line detection
Line detection is a research hotspot in computer vision. It aims to generate vectorized line representation from images. Traditional methods such as those described by Stephens and Gioi et al. [42,43] detect lines based on local edge features, but the detected lines are just collections of pixels. Recently, Nan et al.  used the traditional line detection algorithm, line segment detector, proposed by Gioi et al.  to generate a line segment map represented by an attraction field and combined two deep learning semantic segmentation methods to achieve end-to-end line structure detection. However, this line structure does not contain the semantic vector information about how the junctions and lines are connected, which is not conducive to scene parsing. Huang et al.  combined deep learning and line vectorization algorithms to detect wireframes, which have a vectorized representation, but the detected wireframes include many useless line structures, and it is impossible to detect the windows of building façades separately. As a result, it is not conducive to the 3D reconstruction of the building facade.
2.2 Window object detection
Object detection has been a hot topic in the field of computer vision, and it is useful in a variety of scenarios. For example, Mazzeo et al. used a region-based CNN to detect aluminum profiles within images based on new synthetic data (aluminum CAD files). Their experiments show that the architecture achieves good performance in automatic detection and classification. Similar to Mazzeo et al., Ma and Ma and Sun and Chen [31,32] separately studied the detection of window structures from building facades. Ma and Ma used Faster R-CNN, while Sun and Chen used YOLOv3 to train their respective neural network models for window feature extraction and object recognition. Although their models can detect window objects well, such region-based recognition approaches ignore the rich geometric feature information of windows, resulting in inaccurate window area recognition and position deviation. In contrast, our method outputs a position-accurate vectorized window wireframe that consists of point–line structures and is rich in both semantic and geometric features.
2.3 Window-line detection
In terms of window-line detection models, our method is inspired by previous studies [18,29,33]. The hourglass network model of Newell et al. is used for the detection of human joint poses. The main contribution of the hourglass network is the use of multiscale features for pose recognition. Most previous networks designed for estimating human pose rely solely on the last layer of convolutional features, leading to information loss. In fact, for an association-type task such as pose estimation, different nodes of the whole body do not reach their best recognition accuracy on the same feature map, so the challenge of using multiple feature maps simultaneously arises. The hourglass network solves this challenge well. It exploits the property that joint points can be predicted by reference to each other, and it improves the recognition accuracy of individual joints by reusing multiple joint feature maps throughout the body with the help of residual modules. The window-line structure, which has strong connectivity and reference between its points and lines, is conceptually similar to the human pose in Newell et al. and can be extracted by our method. Inspired by this, we use the hourglass model as the foundation of our model. The difference is that in Newell et al., human pose detection is generated by a sliding window approach, while our window lines are generated by connecting significant intersections.
Huang et al.  were the first to introduce the notion and representation of a wireframe, utilizing a fully convolutional neural network for detecting wireframe intersections. Zhou et al.  used the line sampling verification module to output a vectorized wireframe containing all lines within the scene based on the image. Inspired by this, we have adopted the wireframe detection technique for window-line detection and have incorporated the idea of point–line detection verification to extract the window-line structure. The difference is that we extract the window-line structure, while the wireframe in Zhou et al.  does not distinguish different structures. As the wireframe in Zhou et al.  has a large number of repeated redundant and non-semantic lines, sampling is needed to reduce the number of lines during the training process. Namely, the method in Zhou et al.  cannot extract window lines independently, resulting in a loss of practical significance for assisting in building facade reconstruction.
Figure 1 shows the architecture of a single hourglass module in our improved stacked hourglass network. Figure 2 illustrates the overall network architecture of our method. It mainly consists of two parts. The first part is an improved stacked hourglass network, and the output feature map is used as the input of subsequent modules. The second part is the window point–line detection and verification module, which can output the window wireframe end-to-end.
3.1 A single hourglass module
As shown in Figure 1, a single hourglass model is composed of residual blocks arranged as an encoder and a decoder. The encoder extracts features by reducing the image resolution through continuous downsampling, while the residual block in the orange dotted box copies and saves the features extracted by the encoder at each downsampling layer. The decoder then increases the image resolution through upsampling and, in each upsampling layer, simply recombines the features copied by the corresponding residual block. By stacking the feature maps layer by layer, the feature map in the final layer retains the information from every layer and is equivalent in size to the original input image. This means that a heatmap representing the probability and mutual relationship of different window vertices can be generated through convolution, so that the global window information can be reused to improve the recognition accuracy of a single window vertex. This takes advantage of the fact that window vertices can be predicted by referencing each other. In addition, the studies of Sun et al. and Xiao et al. [46,47] point out that the feature extraction process of the encoder is more important than the simple feature combination and recovery process of the decoder, and that the simple recombination process causes information loss to a certain extent. Related research has addressed this issue [34,35,38], but most of these methods deploy extra subnetworks or layers. The hourglass network itself has a large number of parameters, and extra subnetworks or layers add more, which places higher requirements on the storage capacity and computing power of the device. In contrast, our method adds a simple parallel skip connection to the encoding process, which addresses these problems without increasing the computation volume. The specific connection method is introduced in Section 3.2.
3.2 Improved stacked hourglass network
The purpose of the improved stacked hourglass network is to extract high-level features with semantic and global spatial geometry information for the subsequent detection and verification module of the window point–line. First, all images of the dataset are resized to a uniform resolution of 512 × 512 before being fed into the network. After entering the network, the input image is first downsampled by a stride-2 convolution layer and a max pooling layer, and then it is fed into multiple hourglass networks stacked in series. We call this stack of hourglasses the improved stacked hourglass network. Its structure is shown in the middle black rectangular box in Figure 2. As introduced in Section 3.1, each hourglass model stacks features layer by layer through downsampling, upsampling, and residual blocks. We assume that the stacked hourglass network consists of n single hourglasses. In Figure 2, we only draw the structure of the (n − 1)th and nth hourglasses, which we refer to as Stack n − 1 and Stack n, respectively. The structure of the remaining hourglasses (such as Stack n − 2 and Stack n + 1) is not shown in Figure 2 for simplicity. The stacked hourglass network is connected like a chain, where the output of Stack n − 1 serves as the input of Stack n. As shown in Figure 2, the output arrow of Stack n − 1 points to the letter n, which represents Stack n. In this way, individual hourglasses are connected in series. We use this representation instead of a conventional one because it shows the skip connections more clearly.
As shown by the red arrow in the improved stacked hourglass network in Figure 2, while the previous hourglass network (Stack n − 1) extracts features through its downsampling layers, these downsampled features are also transferred to the current hourglass network (Stack n) through parallel skip connections. This connection hardly increases the amount of computation, and the size of the network itself remains basically unchanged. It highlights the importance of feature extraction through the downsampling of the encoder and allows the previous downsampling features to be used in the subsequent hourglass module, which greatly reduces the loss of feature information. After the features of the current hourglass network (Stack n) are fused with the features downsampled by the previous network (Stack n − 1), they continue to be superimposed layer by layer in Stack n to generate a feature map that is the same size as the input image. The feature map is divided into two branches through a convolution. The upper branch continues to undergo convolution, while the lower branch first goes through a convolution to generate a heatmap that represents the probability of the existence of points and lines and also encodes the relationship between them. As shown in Figure 2, the blue layer in the improved stacked hourglass network is the generated heatmap. The heatmap is then adjusted to the dimensions of the upper branch through a convolution, as shown by the black dotted arrow in the improved stacked hourglass network in Figure 2. Finally, the upper and lower branches in Stack n are merged with the output of the previous hourglass network (Stack n − 1, namely, the input of the current network), and together they form the input to the next hourglass network (Stack n + 1). Such mixed features mean that the existence probability and interrelation of the point–line can be reused in the next hourglass network while retaining feature information to the maximum extent. In terms of loss, traditional convolutional neural networks only compare the final prediction with the ground truth, while we use intermediate supervision. Each individual hourglass network of the improved stacked hourglass network outputs a heatmap as a point–line prediction, so we include the heatmap output by every hourglass in the calculation of the loss; that is, the intermediate outputs also participate in supervised training.
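The intermediate supervision described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy 2 × 2 heatmaps, the use of binary cross-entropy for the heatmap loss, and the equal weighting of all stacks are assumptions made for the example.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two equally sized 2D heatmaps."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def intermediate_supervision_loss(stack_heatmaps, target):
    """Sum the heatmap loss of every stack, not only that of the last one."""
    return sum(bce(heatmap, target) for heatmap in stack_heatmaps)


# Toy example: three stacks predicting the same 2 x 2 ground-truth heatmap.
target = [[1.0, 0.0], [0.0, 0.0]]
stack_outputs = [
    [[0.6, 0.4], [0.4, 0.4]],  # output of an early stack
    [[0.8, 0.2], [0.2, 0.2]],  # output of the middle stack
    [[0.9, 0.1], [0.1, 0.1]],  # output of the final stack
]
total_loss = intermediate_supervision_loss(stack_outputs, target)
final_only_loss = bce(stack_outputs[-1], target)
```

Because every stack contributes a loss term, gradients reach the early hourglasses directly instead of only through the final prediction.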
In general, our proposed lightweight stacked hourglass network has the following advantages. With the help of additional skip connections, the loss of features can be minimized without increasing the number of parameters. In addition, since the network contains multiple single hourglasses, the encoding and decoding operations can be performed repeatedly, so the network can better mix the global and local features of the image. The disadvantage is that increasing the number of stacked hourglasses or the depth of a single hourglass can theoretically improve the performance of the model but can easily cause an exponential increase in the number of parameters. A balance between network complexity and network performance is therefore needed, which is a focus of our future research. In addition, since the basic step of the network is to identify pixel-level intersections, the network places high requirements on the resolution of the input images. This is also a direction of our future research.
3.3 Detection and verification module of window point–line
3.3.1 Detection module of window vertex
As described in Section 4, we represent the window wireframe by its set of window vertices and the set of window lines that connect them. Similar to Huang et al. and Zhou et al. [18,29], the detection of window vertices first divides the input image into a mesh grid (Figure 3). Here, we set to . If a window vertex falls into a grid cell, that cell is responsible for detecting it. Thus, each cell predicts a confidence score reflecting how confident the model is that a window vertex exists in that cell. Meanwhile, to further locate window vertices, each cell also predicts the relative displacement of the window vertex in the cell, that is, the window vertex coordinates with reference to the center of the cell. As shown in the lower left corner of the window point–line detection and verification module in Figure 2, the two predictions, the confidence score and the relative displacement, are realized by feeding the output feature map from the improved stacked hourglass network into two convolutional layers. Finally, by screening the cells according to their confidence scores and applying the relative displacement of the corresponding cells, the positions of the window vertices can be obtained. (Note that the behavior of the grid cells resembles the so-called "anchors," which serve as regression references in the latest object detection pipelines, except that we do it by convolution.) Therefore, the final output of the window vertex detection module is the positions of the top T window vertices with the highest predicted probability.
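The decoding step described above (score every grid cell, shift each cell center by its predicted displacement, keep the top T) can be sketched as follows. This is a minimal sketch under assumptions: the function name `decode_vertices`, the toy 2 × 2 grid, and the convention that displacements are expressed in units of one cell relative to the cell center are all hypothetical choices for illustration.

```python
def decode_vertices(conf, offset, image_size, top_t):
    """Turn per-cell predictions into window-vertex coordinates.

    conf[i][j]   -- confidence of the cell in row i, column j
    offset[i][j] -- (dx, dy) displacement from the cell centre, measured in
                    units of one cell (an assumed convention)
    image_size   -- (width, height) of the input image in pixels
    top_t        -- number of highest-scoring vertices to keep
    """
    rows, cols = len(conf), len(conf[0])
    width, height = image_size
    cell_w, cell_h = width / cols, height / rows

    candidates = []
    for i in range(rows):
        for j in range(cols):
            dx, dy = offset[i][j]
            x = (j + 0.5 + dx) * cell_w  # cell centre plus displacement
            y = (i + 0.5 + dy) * cell_h
            candidates.append((conf[i][j], x, y))

    # Keep the top-T cells by confidence.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(x, y) for _, x, y in candidates[:top_t]]


# Toy example: a 2 x 2 grid laid over an 8 x 8 image.
conf = [[0.9, 0.1],
        [0.2, 0.8]]
offset = [[(0.25, -0.25), (0.0, 0.0)],
          [(0.0, 0.0), (-0.5, 0.25)]]
vertices = decode_vertices(conf, offset, image_size=(8, 8), top_t=2)
# vertices == [(3.0, 1.0), (4.0, 7.0)]
```

In a real network the two inputs would be the outputs of the two convolutional heads; plain nested lists are used here only to keep the example self-contained.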
Next, we introduce the loss function. Given a set of ground truth window vertices in an image, the loss function consists of two terms: a confidence loss of window vertices and a loss of relative displacement of window vertices. For the confidence loss, the network outputs a predicted score for each grid cell, indicating the probability that the cell contains a window vertex; this score is compared with the ground truth binary class label of the cell using the cross-entropy loss. For the relative displacement loss, we predict the relative position of a window vertex for each grid cell. However, only the cells that contain a ground truth window vertex enter the loss: for each ground truth window vertex, we take the prediction of the grid cell that the vertex falls into and compare it with the relative position of that ground truth vertex with respect to the cell center.
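For reference, one way to write the two loss terms described above is given below. This is a hedged reconstruction, not the authors' exact formulation: all symbols are assumed notation (an $H_b \times W_b$ grid, predicted and ground truth cell scores $\hat{c}_{ij}$ and $c_{ij}$, predicted and ground truth relative positions $\hat{x}$ and $x$, $N$ ground truth vertices, and $f(n)$ the index of the cell containing the $n$-th vertex). The weighted-sum form, the normalization constants, and the squared $\ell_2$ distance in the displacement term are likewise assumptions.

```latex
% Assumed notation; see the lead-in for what is hypothetical.
\begin{align}
  L &= \lambda_{\mathrm{conf}}\, L_{\mathrm{conf}}
       + \lambda_{\mathrm{loc}}\, L_{\mathrm{loc}}, \\
  L_{\mathrm{conf}} &= -\frac{1}{H_b W_b} \sum_{i,j}
       \Bigl[ c_{ij} \log \hat{c}_{ij}
            + (1 - c_{ij}) \log\bigl(1 - \hat{c}_{ij}\bigr) \Bigr], \\
  L_{\mathrm{loc}} &= \frac{1}{N} \sum_{n=1}^{N}
       \bigl\lVert \hat{x}_{f(n)} - x_{f(n)} \bigr\rVert_2^2 .
\end{align}
```

The confidence term runs over every grid cell, whereas the displacement term runs only over the cells that contain a ground truth window vertex, which matches the description in the text.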
3.3.2 Verification module of the window line
The aforementioned window vertex detection module outputs the T predicted window vertices. Similar to the method of generating wireframes in Huang et al., we connect these T points in pairs to generate a series of candidate window lines, each represented by the coordinates of its two endpoints. We then need to verify which of these connected lines are window lines. In object detection, researchers [48,49,50] use region of interest pooling to extract features for verifying candidate object regions. Inspired by this, we apply the line of interest (LoI) pooling method of Zhou et al. in the verification module of the window line. As shown in the upper left corner of the window point–line detection and verification module in Figure 2, this step also requires the output feature map from the improved stacked hourglass network. The connected lines of the window vertices, combined with the feature map of the hourglass network, are used as the input to the verification network, which determines whether each line is a window line. The verification network consists of three parts: an LoI pooling layer and two fully connected layers. The LoI pooling layer converts each line connected by window vertices into features that can be fed into the network for prediction and verification. Its input is the two endpoints of each connecting line, which are window vertices. The LoI pooling layer first divides the connecting line equally to generate uniformly spaced points and computes the coordinates of these points by linear interpolation between the two endpoints. Then, it uses bilinear interpolation to calculate the feature values at these points in the feature map of the hourglass network, which avoids quantization artifacts. The resulting feature has one value per feature-map channel at each sampled point. After that, the LoI pooling layer reduces the size of this feature along the point dimension with a strided 1D max-pooling layer. The reduced feature is then flattened into a vector as the final output of the LoI pooling layer. We pass this flattened feature vector through two fully connected layers to generate a logit and then use the sigmoid function to map the logit into the interval (0, 1), converting it into the probability that the line is a window line. A line with a probability greater than a threshold is verified as a window line. At this point, the predicted window wireframe has been output end-to-end.
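The LoI pooling steps described above (uniform sampling by linear interpolation, bilinear lookup in the feature map, strided 1D max pooling, flattening) can be sketched in plain Python. This is a simplified illustration rather than the implementation of Zhou et al. or of the authors: the function names, the nested-list feature map, and the choice to include both endpoints among the sampled points are assumptions.

```python
import math


def sample_points(p1, p2, n):
    """n >= 2 evenly spaced points from p1 to p2 (both endpoints included)."""
    (x1, y1), (x2, y2) = p1, p2
    return [(x1 + (x2 - x1) * k / (n - 1), y1 + (y2 - y1) * k / (n - 1))
            for k in range(n)]


def bilinear(channel, x, y):
    """Bilinearly interpolate one H x W channel at a real-valued (x, y)."""
    h, w = len(channel), len(channel[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the feature-map extent
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = channel[y0][x0] * (1 - ax) + channel[y0][x1] * ax
    bottom = channel[y1][x0] * (1 - ax) + channel[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def loi_pool(feature_map, p1, p2, n_points, stride):
    """Return a flat vector of length C * ceil(n_points / stride).

    feature_map is a list of C channels, each a list of H rows of W values.
    """
    points = sample_points(p1, p2, n_points)
    pooled = []
    for channel in feature_map:
        values = [bilinear(channel, x, y) for x, y in points]
        # Strided 1D max pooling along the sampled points.
        for start in range(0, n_points, stride):
            pooled.append(max(values[start:start + stride]))
    return pooled


# Toy feature map: channel 0 is a horizontal ramp, channel 1 is constant.
ramp = [[0.0, 1.0, 2.0, 3.0, 4.0] for _ in range(3)]
const = [[1.0] * 5 for _ in range(3)]
line_feature = loi_pool([ramp, const], (0.0, 1.0), (4.0, 1.0),
                        n_points=8, stride=4)
```

With 128 channels, 32 sampled points, and a pooling stride of 4, the same function yields a vector of 128 × 8 = 1,024 values, which would then be passed to the fully connected layers.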
4 Datasets and annotations
As part of our learning-based window-line detection work on building facades, we constructed a window-line dataset of real-world building facades. It contains 1,032 images, some examples of which are shown in Figure 4. These real building facade images were taken manually with imaging equipment. To ensure the diversity of the data, the images were taken in many different areas, and we invited team members in different cities to help us collect them. In addition, considering the influence of weather, time, light, shadow, and occlusion on building facade imaging, we tried to select buildings with open scenes and little occlusion. We also tried to shoot at different times of the day, such as 6 a.m., 12 noon, and 6 p.m., and in different weather conditions, such as sunny, rainy, and snowy days. Different weather scenes are thus included in the dataset, ensuring the diversity of the image data as much as possible. After collection, we manually screened and cropped the images that were not suitable for window-line annotation. For example, we eliminated images in which the windows were blocked. We also deleted images whose brightness was so low that the position of the window was difficult to identify accurately even by manual inspection. If the non-building area accounted for too large a proportion of an image, we cropped the image to retain the appropriate building facade area. After processing the images, we proceeded to the labeling work, using the labelme software for manual annotation. For each image, we manually annotated all windows in the building facade scene, but instead of annotating each window as a polygonal area, we annotated its four vertices and four window lines. It is worth noting that, to ensure the connectivity and integrity of a window, we labeled the window lines in the clockwise direction, so that the starting point of each window line is the end point of the previous one. If the four window lines were annotated separately, there would always be some manual errors, such as deviations of the window vertices. In addition, as shown in Figure 4, when there are metal frames on some windows, we treat the window and its metal frame as a whole and label the outer contour of the metal frame as the window lines. When annotating window lines, we also store the pixel length and width of the labeled image for later learning and processing. All annotations are stored in JSON format to facilitate the training of subsequent networks. In summary, for all the windows in each building facade, our annotations include a set of vertices, each given by its real coordinates in the image space, and a set of line segments, each represented by its two endpoints. Together, the vertices and line segments form the window wireframe, which records the connection and intersection relations between all window vertices and lines. In this respect, it can be regarded as a simplified window-line version of the wireframe definition in Huang et al.
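To make the annotation scheme concrete, the sketch below turns one window, annotated as four clockwise line segments, into a wireframe of shared vertices and index pairs. The JSON layout is an assumption for illustration (field names loosely modeled on labelme output, with a hypothetical label `window_line`), as is the function name `build_wireframe`; the actual files of the dataset may be organized differently.

```python
import json

# Hypothetical annotation record for one window on a 640 x 480 image.
annotation_json = """
{
  "imageWidth": 640,
  "imageHeight": 480,
  "shapes": [
    {"label": "window_line", "points": [[100, 100], [200, 100]]},
    {"label": "window_line", "points": [[200, 100], [200, 250]]},
    {"label": "window_line", "points": [[200, 250], [100, 250]]},
    {"label": "window_line", "points": [[100, 250], [100, 100]]}
  ]
}
"""


def build_wireframe(record):
    """Return (vertices, edges) from a list of annotated line segments.

    Endpoints shared by consecutive lines are stored once, so each edge is a
    pair of indices into the vertex list and connectivity is explicit.
    """
    vertices, index_of, edges = [], {}, []
    for shape in record["shapes"]:
        pair = []
        for x, y in shape["points"]:
            key = (float(x), float(y))
            if key not in index_of:  # first time this endpoint is seen
                index_of[key] = len(vertices)
                vertices.append(key)
            pair.append(index_of[key])
        edges.append(tuple(pair))
    return vertices, edges


vertices, edges = build_wireframe(json.loads(annotation_json))
# vertices has 4 entries; edges == [(0, 1), (1, 2), (2, 3), (3, 0)]
```

Because each line starts where the previous one ends, the four segments collapse to exactly four vertices, which is the connectivity property the clockwise labeling is meant to guarantee.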
5.1 Our methods
5.1.1 Implementation details
Every image in the dataset is resized to a uniform resolution before entering the network. After a convolutional layer and a max pooling layer, the features are fed into the improved stacked hourglass network, which consists of three stacked hourglass modules, each with a depth of 6. The features are then fed into the LoI pooling layer. The channel dimension of the feature map in the improved stacked hourglass network is set to 128. Each line is sampled at 32 equidistant points to extract line features, giving a line feature vector of shape 128 × 32. The stride of the 1D max pooling layer is set to 4, which reduces the line feature vector to 128 × 8. The threshold for determining a window line is set to 0.9. The two hyperparameters of the loss function are set to 0.9 and 0.2, respectively.
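The line-feature step can be sketched in plain Python to make the shapes concrete. This is a simplified illustration, not the authors' implementation: it assumes nearest-neighbour lookup in place of interpolation, a pooling kernel equal to the stride, and a toy 8 × 8 feature map; all function names are hypothetical.

```python
def sample_points(p1, p2, n=32):
    """Return n equidistant points from p1 to p2 (both endpoints included)."""
    (x1, y1), (x2, y2) = p1, p2
    return [(x1 + (x2 - x1) * t / (n - 1), y1 + (y2 - y1) * t / (n - 1))
            for t in range(n)]


def max_pool_1d(values, stride=4):
    """Non-overlapping 1D max pooling (kernel size assumed equal to stride)."""
    return [max(values[i:i + stride]) for i in range(0, len(values), stride)]


def line_feature(feature_map, p1, p2, n=32, stride=4):
    """feature_map: list of C channels, each a 2D list indexed as [y][x]."""
    points = sample_points(p1, p2, n)
    pooled = []
    for channel in feature_map:
        # Nearest-neighbour lookup stands in for interpolation in this sketch.
        samples = [channel[round(y)][round(x)] for x, y in points]
        pooled.append(max_pool_1d(samples, stride))
    return pooled


# Toy feature map with 128 channels of size 8 x 8.
feature_map = [[[float(c + x + y) for x in range(8)] for y in range(8)]
               for c in range(128)]
feat = line_feature(feature_map, (0, 0), (7, 7))
print(len(feat), len(feat[0]))  # 128 8
```

With 32 samples per line and a stride of 4, each channel is reduced from 32 values to 8, which matches the reduction of the line feature vector stated in the text.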
All experiments are conducted on our proposed window-line dataset, of which 132 images are used for testing. The experiments are run on a Linux server with 10 NVIDIA 2080Ti GPUs. We use the Adam optimizer with a learning rate of 0.0008 and a weight decay of 0.0001. The network is trained for 256 epochs with a batch size of 4, and the learning rate decays by a factor of 0.1 after 128 epochs.
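The learning-rate schedule stated above amounts to a single step decay. The following sketch writes it out in plain Python; it assumes zero-based epoch indices and that the decay takes effect from epoch index 128 onward, which the text does not specify. The constant and function names are illustrative.

```python
import math

BASE_LR = 0.0008       # initial learning rate
WEIGHT_DECAY = 0.0001  # passed to the optimizer, listed here for completeness
DECAY_EPOCH = 128      # assumed: decay applies from this zero-based epoch on
DECAY_FACTOR = 0.1
TOTAL_EPOCHS = 256
BATCH_SIZE = 4


def learning_rate(epoch: int) -> float:
    """Learning rate for a zero-based epoch index under one step decay."""
    return BASE_LR * DECAY_FACTOR if epoch >= DECAY_EPOCH else BASE_LR


schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]
print(schedule[0], math.isclose(schedule[-1], 0.00008))  # 0.0008 True
```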
5.1.2 Evaluation metric
Wireframes have previously been evaluated with a method based on line heatmaps [18,44], but this method is not suitable for window wireframes, so we adopt a newer method [29] for the evaluation of window lines. It comprises two metrics: structural average precision (sAP) for evaluating window lines and junction average precision (jAP) for evaluating window vertices. In this section, we explain why the previous heatmap-based metrics are not suitable for window wireframes and why sAP and jAP are.
The line-heatmap-based evaluation method first rasterizes the lines to generate a confidence heatmap. To compare it with the ground-truth heatmap, a bipartite matching that treats each pixel independently as a graph node is run between the two heatmaps. Precision and recall (PR) curves are then computed from the matching and the confidence of each pixel, and the area under the PR curve is the final quantitative result. This evaluation method was originally designed for boundary detection [51], and it has two problems when applied to window lines: 1) it cannot penalize overlapped window lines, and 2) it cannot evaluate the connectivity between the endpoints of window lines. Specifically, as shown in Figure 5a, if a window line is broken into several line segments, the resulting heatmap is almost the same as the ground-truth heatmap. Likewise, as illustrated in Figure 5b, if the window lines of two adjacent windows are mistakenly connected into a single long window line, the resulting heatmap may still be almost the same as the ground-truth heatmap. Distinguishing these cases is crucial for tasks that depend on correct window-line connectivity, such as the reconstruction of building facades from window lines. Therefore, evaluation metrics based on the line heatmap are not suitable for the evaluation of window wireframes.
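A small Python sketch makes the two failure cases concrete. It rasterizes axis-aligned segments into pixel sets, which is a deliberately simplified stand-in for a real heatmap; the function `rasterize` and the coordinates are illustrative only.

```python
def rasterize(segments):
    """Return the set of integer pixels covered by axis-aligned segments."""
    pixels = set()
    for (x1, y1), (x2, y2) in segments:
        if y1 == y2:      # horizontal segment
            for x in range(min(x1, x2), max(x1, x2) + 1):
                pixels.add((x, y1))
        elif x1 == x2:    # vertical segment
            for y in range(min(y1, y2), max(y1, y2) + 1):
                pixels.add((x1, y))
        else:
            raise ValueError("this sketch supports axis-aligned segments only")
    return pixels


# Case 1: one window line versus the same line broken into three pieces.
ground_truth = [((0, 0), (20, 0))]
broken = [((0, 0), (7, 0)), ((7, 0), (13, 0)), ((13, 0), (20, 0))]
print(rasterize(broken) == rasterize(ground_truth))  # True

# Case 2: top lines of two adjacent windows versus one merged long line.
two_windows = [((0, 0), (9, 0)), ((12, 0), (20, 0))]
merged = [((0, 0), (20, 0))]
extra = rasterize(merged) - rasterize(two_windows)
print(len(extra))  # 2
```

In the first case the pixel sets are identical, and in the second they differ by only the two pixels in the gap, so a pixel-level comparison barely registers either connectivity error.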
Structural average precision (sAP) solves these problems well. It is evaluated on vectorized wireframes without relying on a heatmap. For each test image, the network predicts a set of lines with scores, which are the window-line probabilities mentioned earlier. From these scored lines, precision and recall are calculated and the PR curve is plotted; sAP is defined as the area under this curve. Precision is the proportion of correctly predicted window lines among all predicted window lines, while recall is the proportion of correctly predicted window lines among all ground-truth window lines. A predicted window line is considered correct when the deviation between its two endpoints and the two vertices of a matched ground-truth window line does not exceed a user-defined threshold, which regulates the strictness of the metric. In our experiments, the threshold is set to 5, 10, and 15, and the resulting metrics are denoted by $\mathrm{sAP}^{5}$, $\mathrm{sAP}^{10}$, and $\mathrm{sAP}^{15}$, respectively. If a predicted window line satisfies this condition, the connection between its endpoints is correct. In addition, each ground-truth window line may be matched at most once in our experiments, which penalizes overlapped window lines.
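The matching rule for structural average precision can be sketched as follows. The sketch assumes the deviation is the sum of squared distances between corresponding endpoints, as is common for this metric in the wireframe-parsing literature, and it uses a plain (non-interpolated) area under the PR curve; the function names and the toy data are hypothetical, and the exact formulation used in the experiments may differ.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def label_lines(pred, gt, threshold):
    """pred: [(score, p1, p2)], gt: [(p1, p2)]. Returns [(score, is_tp)]."""
    matched = set()
    labels = []
    for score, p1, p2 in sorted(pred, key=lambda t: -t[0]):
        best, best_err = None, float("inf")
        for k, (g1, g2) in enumerate(gt):
            # Endpoint order is arbitrary, so test both assignments.
            err = min(sq_dist(p1, g1) + sq_dist(p2, g2),
                      sq_dist(p1, g2) + sq_dist(p2, g1))
            if err < best_err:
                best, best_err = k, err
        # A ground-truth line may be matched only once (penalizes overlaps).
        is_tp = best is not None and best_err <= threshold and best not in matched
        if is_tp:
            matched.add(best)
        labels.append((score, is_tp))
    return labels


def average_precision(labels, n_gt):
    """Area under the PR curve, accumulated step by step over recall."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in labels:
        tp += is_tp
        fp += not is_tp
        recall = tp / n_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


gt = [((0, 0), (10, 0)), ((10, 0), (10, 10))]
pred = [(0.95, (0, 0), (10, 1)),    # close to the first ground-truth line
        (0.90, (0, 1), (10, 0)),    # duplicate of the same line
        (0.80, (10, 0), (10, 10))]  # exact match of the second line
labels = label_lines(pred, gt, threshold=5)
print([is_tp for _, is_tp in labels])  # [True, False, True]
ap = average_precision(labels, n_gt=len(gt))
```

The duplicate prediction counts as a false positive even though it lies close to a ground-truth line, which is exactly the overlap penalty that a heatmap comparison cannot express.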
Junction average precision (jAP) is likewise evaluated on vectorized points without relying on a heatmap, because window points have geometric and spatial characteristics and are physically meaningful, and because the window wireframe representation in this article records the connectivity between window points and lines. Relying solely on a heatmap to evaluate window vertices would therefore lose the evaluation of connectivity. jAP is calculated similarly to sAP: a PR curve is plotted from the predicted set of points, and the area under this curve is jAP. A predicted window vertex is considered correct when its distance to the nearest ground-truth window vertex does not exceed a threshold. Similarly, to penalize repeated predictions of window vertices, each ground-truth window vertex may be matched at most once. In our experiments, the threshold is set to 1, 2, and 3, and the resulting metrics are denoted by $\mathrm{jAP}^{1}$, $\mathrm{jAP}^{2}$, and $\mathrm{jAP}^{3}$, respectively.
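The vertex-matching rule can be sketched in the same spirit. The sketch assumes Euclidean distance to the nearest ground-truth vertex and processes predictions in order of decreasing score; the function name `label_vertices` and the toy coordinates are hypothetical.

```python
import math


def label_vertices(pred, gt, threshold):
    """pred: [(score, (x, y))], gt: [(x, y)]. Returns [(score, is_tp)]."""
    matched = set()
    labels = []
    for score, p in sorted(pred, key=lambda t: -t[0]):
        is_tp = False
        if gt:
            # Index of the nearest ground-truth vertex.
            k = min(range(len(gt)), key=lambda i: math.dist(p, gt[i]))
            # Correct only if close enough and not already claimed.
            if math.dist(p, gt[k]) <= threshold and k not in matched:
                matched.add(k)
                is_tp = True
        labels.append((score, is_tp))
    return labels


gt_vertices = [(0.0, 0.0), (10.0, 0.0)]
pred_vertices = [(0.9, (0.5, 0.0)),   # near the first vertex
                 (0.8, (0.0, 0.5)),   # repeated prediction of the same vertex
                 (0.7, (10.0, 2.5))]  # 2.5 pixels from the second vertex
by_threshold = {t: [tp for _, tp in label_vertices(pred_vertices, gt_vertices, t)]
                for t in (1, 2, 3)}
print(by_threshold[1])  # [True, False, False]
print(by_threshold[3])  # [True, False, True]
```

The resulting true/false labels, ordered by score, are what a PR curve and its area would be computed from at each threshold.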
In summary, sAP and jAP penalize overlapped window lines and duplicate window vertices, and they evaluate the connectivity between window-line endpoints. They are therefore highly appropriate for the evaluation of window wireframes.
5.2 Comparison with other methods
Since there is currently no method that extracts window lines directly, we selected a state-of-the-art line detection method [44] and, combined with certain processing, used it as the baseline for our comparison experiment on window-line extraction. All evaluations are performed on our dataset, and sAP and jAP at different thresholds are used to quantitatively evaluate each method.
Nan et al. [44] proposed two models, attraction field map (AFM)-Unet and AFM-Atrous, which use a line partition map of the attraction field representation and are built on different neural networks. We compared our method with both models. Both were trained from scratch on our proposed dataset, and we added our own post-processing to produce their final results. The processing idea is as follows.
In Nan et al. [44], the aspect ratio is used to filter out false detections; however, we found that this filter is not applicable to window-line detection. As shown in Figure 6a, redundant and messy short line segments remain around the windows, mostly clustered near the window vertices. This may be because the AFM method detects line segments directly from the building facade rather than generating lines from detected points. To remove these cluttered short segments, we use segment length as the filtering criterion: for each facade, we compute the lengths of all detected line segments, take their average as the threshold, and remove every segment shorter than this threshold. This simple operation greatly improves the performance of AFM in extracting window lines. The results after our processing are shown in Figure 6b.
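The length-based filter described above is simple enough to sketch directly. The function name `filter_short_segments` and the sample coordinates are illustrative, and segments are assumed to be given as pairs of 2D endpoints.

```python
import math


def filter_short_segments(segments):
    """Keep segments at least as long as the mean segment length of the facade."""
    if not segments:
        return []
    lengths = [math.dist(p1, p2) for p1, p2 in segments]
    mean_length = sum(lengths) / len(lengths)
    # Segments strictly shorter than the mean are treated as clutter.
    return [seg for seg, n in zip(segments, lengths) if n >= mean_length]


detected = [((0, 0), (40, 0)),     # length 40, a real window line
            ((40, 0), (40, 60)),   # length 60, a real window line
            ((0, 0), (3, 0)),      # length 3, clutter near a vertex
            ((40, 0), (42, 0)),    # length 2, clutter near a vertex
            ((40, 60), (40, 55))]  # length 5, clutter near a vertex
kept = filter_short_segments(detected)
print(len(kept))  # 2
```

Because the threshold is recomputed per facade, it adapts to image scale, but it also means that a facade containing mostly short, valid window lines could lose some of them.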
The outcomes of our evaluations are presented in Table 1 and Figure 7. Note that, when evaluating AFM-Atrous and AFM-Unet, we treat the endpoints of each window line as the predicted window vertices. Both models produce output at the original image size, which we rescale to a common resolution for uniform evaluation.
All experiments are conducted on our dataset. The columns labeled "sAP" and "jAP" show the window-line accuracy with respect to the structural evaluation metrics.
Figure 7a–c shows the PR curves of sAP at the three thresholds 5, 10, and 15. Our method is clearly far better than the other two methods, and the overall performance of AFM-Atrous is slightly better than that of AFM-Unet; the gap between the two AFM models opens only at high recall. Figure 7d–f shows the PR curves of jAP at the three thresholds 1, 2, and 3, where our method is again clearly better than the other methods. Table 1 makes the advantage of our method more explicit. As defined above, each evaluation metric is the area under the PR curve, scaled so that its maximum value is 100. In sAP, our method outperforms AFM-Atrous by approximately 50 percentage points, and the gap to AFM-Unet is even larger. For example, our method reaches an sAP of 74.6%, compared with 26.2% for AFM-Atrous and 15.9% for AFM-Unet, an improvement of 48.4 points over AFM-Atrous and 58.7 points over AFM-Unet. In jAP, our method improves on AFM-Atrous by approximately 30 percentage points and on AFM-Unet by roughly 35 points. For example, our method reaches a jAP of 57.8%, whereas AFM-Unet reaches 21.6%, a gap of 36.2 points. We believe there are several reasons for such a large gap: 1) Our method penalizes overlapped window vertices and lines, while AFM-Atrous and AFM-Unet treat overlapped vertices and lines as correct predictions; the sAP and jAP evaluation, however, treats them as incorrect predictions. 2) The improved stacked hourglass network that we employ, together with the strategy of extracting points first and then verifying lines, is well suited to detecting windows, whose geometric spatial structure is characterized by very strong connectivity and reference. 3) The maximum recall of AFM-Atrous and AFM-Unet is low, only half that of our method, and the resulting short PR curve greatly reduces the area under it, i.e., the AP. 4) AFM-Atrous and AFM-Unet break lines to better fill the predicted line heatmap, which destroys the connectivity between points and lines, while sAP includes an evaluation of the correctness of window point–line connectivity. 5) AFM-Atrous and AFM-Unet extract window lines directly from the heatmap, without an intermediate step of extracting window vertices, which loses information about the connectivity of the window vertices. In general, this does not mean that the design and performance of AFM-Atrous and AFM-Unet are poor; rather, their target task is different, and our method is more suitable for window-line detection.
In addition, to evaluate network complexity, we counted the number of parameters of the three methods. Our network has 9,746k parameters, whereas AFM-Atrous and AFM-Unet have 9,549k and 8,650k, respectively, so our network is about 2% larger than AFM-Atrous and about 13% larger than AFM-Unet. Considering the experimental results, our method obtains a large improvement for a small increase in complexity. Because our network is composed of multiple stacked hourglass modules, it has many more layers, and some increase in the number of parameters is inevitable. Nevertheless, the additional skip connections improve the network without increasing the number of parameters, which further illustrates the advantage of our method.
Furthermore, we visualize the window lines detected by the different methods on our proposed dataset, and some examples are shown in Figure 8. The window vertices in the building facades are drawn in cyan, and the window lines are drawn in orange. Because AFM-Atrous and AFM-Unet detect window lines directly and do not generate window vertices, we mark the endpoints of their window lines as window vertices.
As shown in Figure 8, AFM-Atrous gives better results than AFM-Unet on the whole, but both generate many short lines, especially when the building facade has obvious textures such as wall tiles, which are likely to produce false window-line detections. As shown in the penultimate row of Figure 8, when the building facade is more complex and has a continuous arrangement of windows, AFM-Atrous and AFM-Unet generate a single long line that spans all the windows, even though the windows are separate. This error occurs because these methods ignore the connectivity of window vertices. In contrast, our method shows significant improvement. When the facade scene is relatively simple, we obtain complete and accurate results, and the false detection of textures is well suppressed. There are very few wrongly connected window lines, and the few missing window lines are caused by the failure to detect window vertices that are occluded by trees. In more complex scenes, the number of incorrectly connected and missing window lines increases, but these errors are partly caused by the metal frames installed outside the windows. For windows without metal frames, we detect a relatively complete window-line structure for each window, even when the windows are arranged consecutively, which is still better than the other two methods overall.
In this article, we propose a method for detecting window lines in building facades. The method consists of two parts: an improved stacked hourglass network and a point–line extraction module. It outputs vectorized window wireframes end to end. As part of our work, we also propose a new dataset of real-world building facades, on which we have conducted extensive experiments to show the advantages of our method.
The improved stacked hourglass model uses additional skip connections and is combined with the window point–line detection–verification model to realize end-to-end detection of window lines in building facades. Stacking multiple hourglass modules helps extract both global and local features, and the additional skip connections improve network performance without increasing the number of parameters. Existing methods for window detection are either restricted to point cloud data or simply perform object recognition on the windows of the building facade, so they cannot extract an accurate and effective window wireframe structure. In contrast, our method directly outputs accurate window lines for real-world building facades, which addresses these issues and provides a novel idea for window detection.
Moreover, we propose a new dataset for window-line detection, which, to the best of our knowledge, is the first dataset with window-line annotations on images of real-world building facades. To ensure the diversity and comprehensiveness of the dataset, factors such as location, architectural style, weather, lighting, and time were taken into account during collection. Several manual post-processing procedures were carried out to ensure the quality and accuracy of the dataset, and our team spent considerable time annotating the images. The annotation files are stored in JSON format to facilitate subsequent network processing. Based on this dataset, we conducted extensive experiments to evaluate our method objectively. We investigated different evaluation metrics and adopted a newer one to evaluate the methods more appropriately. The results show that our method is well suited to window-line detection in building facades. They also support the validity and stability of our dataset, and we believe this new dataset can provide strong data support and new ideas for research related to 3D modeling.
More importantly, because windows are a main component of the building facade, window-line detection is an effective attempt at capturing the main structure of the facade. We believe that, with appropriate datasets, our method can be extended to other geometric spatial structures that share similar connectivity and reference characteristics with windows, such as door–floor lines in building facades, urban road lines, and even underground pipe networks. With our method, accurate line structures can be detected and used as structural priors to rapidly construct wireframe models, which is of great significance for rapid 3D reconstruction in geographic space. In conclusion, our method provides support for 3D urban reconstruction and 3D geoscience modeling, although further training or improvement may be needed to meet the needs of different application scenarios.
Funding information: This research was funded by the National Natural Science Foundation of China (grant number 41471329) and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant number KYCX21_0763).
Author contributions: F.Y. performed the theory analysis, methodology, coding, and contributed to drafting the manuscript. Y.Z. analyzed the data and performed the investigation. D.J. performed the literature reviews, provided the background knowledge, and improved the writing. K.X. and D.W. collected the data. X.W. conducted the statistics. All authors have read and agreed to the published version of the manuscript.
Conflict of interest: The authors declare no conflict of interest.
[1] Wu X, Liu G, Weng Z, Tian Y, Zhang Z, Li Y, et al. Constructing 3D geological models based on large-scale geological maps. Open Geosci. 2021;13(1):851–66. doi:10.1515/geo-2020-0270.
[2] Cuca B, Brumana R, Oreni D, Iannaccone G, Sesana M. Geo-portal as a planning instrument: supporting decision making and fostering market potential of Energy efficiency in buildings. Open Geosci. 2014;6(1):121–30. doi:10.2478/s13533-012-0165-0.
[3] Liu Z-G, Li X-Y, Zhu X-H. A full-view scenario model for urban waterlogging response in a big data environment. Open Geosci. 2021;13(1):1432–47. doi:10.1515/geo-2020-0317.
[4] Xue Y. Spatial accessibility between commercial and ecological spaces: A case study in Beijing, China. Open Geosci. 2022;14(1):264–74. doi:10.1515/geo-2020-0333.
[5] Zhang W-H, Chou L-C, Chen M. Consumer perception and use intention for household distributed photovoltaic systems. Sustain Energy Technol Assess. 2022;51(1):101895. doi:10.1016/j.seta.2021.101895.
[6] Duan WT, Allinson NM, editors. Vanishing points detection and line grouping for complex building facade identification. 18th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzen, Czech Republic; 2010.
[7] Seo D, Kang H-D, Hernandez DC, Jo K-H, editors. Building facade detection using geometric planar constraints. 9th International Conference on Human System Interactions (HSI), Portsmouth, England; 2016. doi:10.1109/HSI.2016.7529663.
[8] Xiao H, Meng G, Wang L, Pan C. Facade repetition detection in a fronto-parallel view with fiducial lines extraction. Neurocomputing. 2018;273:435–47. doi:10.1016/j.neucom.2017.07.040.
[9] Lotte RG, Haala N, Karpina M, de Aragao LE, Shimabukuro YE. 3D facade labeling over complex scenarios: A case study using convolutional neural network and structure-from-motion. Remote Sens. 2018;10(9):1435. doi:10.3390/rs10091435.
[10] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90. doi:10.1145/3065386.
[11] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston; 2015. doi:10.1109/CVPR.2015.7298594.
[12] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Banff: International Conference on Learning Representations (ICLR); 2014.
[13] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas; 2016. doi:10.1109/CVPR.2016.90.
[14] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115(3):211–52. doi:10.1007/s11263-015-0816-y.
[15] Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Niessner M. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu; 2017. doi:10.1109/CVPR.2017.261.
[16] Chang AX, Funkhouser T, Guibas L, Hanrahan P, Qixing H, Li Z, et al. ShapeNet: an information-rich 3D model repository. Amsterdam: European Conference on Computer Vision (ECCV); 2016.
[17] Armeni I, Sax S, Zamir AR, Savarese S. Joint 2D-3D-semantic data for indoor scene understanding. Hawaii: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017.
[18] Huang K, Wang Y, Zhou Z, Ding T, Gao S. Learning to parse wireframes in images of man-made environments. IEEE/CVF Conference on Computer Vision & Pattern Recognition, Salt Lake City; 2018. doi:10.1109/CVPR.2018.00072.
[19] Yang F, Zhou Z. Recovering 3D planes from a single image via convolutional neural networks. European Conference on Computer Vision, Munich; 2018. doi:10.1007/978-3-030-01249-6_6.
[20] Liu C, Yang J, Ceylan D, Yumer E, Furukawa Y. PlaneNet: Piece-wise planar reconstruction from a single RGB image. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City; 2018. doi:10.1109/CVPR.2018.00273.
[21] Groueix T, Fisher M, Kim VG, Russell BC, Aubry M. A papier-mache approach to learning 3D surface generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City; 2018. doi:10.1109/CVPR.2018.00030.
[22] Zou C, Colburn A, Shan Q, Hoiem D. LayoutNet: Reconstructing the 3D room layout from a single RGB image. IEEE/CVF Conference on Computer Vision & Pattern Recognition, Salt Lake City; 2018. doi:10.1109/CVPR.2018.00219.
[23] Wang W, Yu L. Rapidly reconstructing 3D line-plane structures of urban building facades. Acta Electronica Sin. 2021;49(08):1551–60.
[24] Juřík V, Herman L, Šašinka Č, Stachoň Z, Chmelík J. When the display matters: A multifaceted perspective on 3D geovisualizations. Open Geosci. 2017;9(1):89–100. doi:10.1515/geo-2017-0007.
[25] Mora-Felix ZD, Sanhouse-Garcia AJ, Bustos-Terrones YA, Loaiza JG, Monjardin-Armenta SA, Rangel-Peraza JG. Effect of photogrammetric RPAS flight parameters on plani-altimetric accuracy of DTM. Open Geosci. 2020;12(1):1017–35. doi:10.1515/geo-2020-0189.
[26] Jung J, Hong S, Yoon S, Kim J, Heo J. Automated 3D wireframe modeling of indoor structures from point clouds using constrained least-squares adjustment for as-built BIM. J Comput Civ Eng. 2016;30(4):04015074. doi:10.1061/(ASCE)CP.1943-5487.0000556.
[27] Wang C, Hou S, Wen C, Gong Z, Li Q, Sun X, et al. Semantic line framework-based indoor building modeling using backpacked laser scanning point cloud. ISPRS J Photogramm Remote Sens. 2018;143:150–66. doi:10.1016/j.isprsjprs.2018.03.025.
[28] Zhang Y, Huo L, Li H. Automated recognition of a wall between windows from a single image. J Sens. 2017;2017:1–8. doi:10.1155/2017/7051931.
[29] Zhou YC, Qi HZ, Ma Y. End-to-end wireframe parsing. IEEE/CVF International Conference on Computer Vision, Seoul; 2019. doi:10.1109/ICCV.2019.00105.
[30] Kong Q, Zhao L, Zhang L. Indoor window detection based on image contour analysis. Comput Modernization. 2018;1(4):56–61.
[31] Ma W, Ma W. Deep window detection in street scenes. KSII Trans Internet Inf Syst. 2020;14(2):855–70. doi:10.3837/tiis.2020.02.022.
[32] Sun S, Chen H. Building windows detection based on enhanced YOLOv3. In Proceedings of the 2020 Chinese Simulation Conference, Beijing; 2020.
[33] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. 14th European Conference on Computer Vision, Amsterdam; 2016. doi:10.1007/978-3-319-46484-8_29.
[34] Wang R, Cao Z, Wang X, Liu Z, Zhu X. Human pose estimation with deeply learned multi-scale compositional models. IEEE Access. 2019;7:71158–66. doi:10.1109/ACCESS.2019.2919154.
[35] Chu X, Yang W, Ouyang W, Ma C, Yuille AL, Wang X, et al. Multi-context attention for human pose estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu; 2017. doi:10.1109/CVPR.2017.601.
[36] Bulat A, Tzimiropoulos G. Hierarchical binary CNNs for landmark localization with limited resources. IEEE Trans Pattern Anal Mach Intell. 2020;42(2):343–56. doi:10.1109/TPAMI.2018.2866051.
[37] Peng X, Tang Z, Yang F, Feris RS, Metaxas D. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City; 2018. doi:10.1109/CVPR.2018.00237.
[38] Tang W, Wu Y. Does learning specific features for related parts help human pose estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach; 2019. doi:10.1109/CVPR.2019.00120.
[39] Kim S-T, Lee HJ. Lightweight stacked hourglass network for human pose estimation. Appl Sciences-Basel. 2020;10(18):62–70. doi:10.3390/app10186497.
[40] Liu X, Pan Y, Zhang W, Ying L, Huang W. Achieve Sustainable development of rivers with water resource management - economic model of river chief system in China. Sci Total Environ. 2020;708:134657. doi:10.1016/j.scitotenv.2019.134657.
[41] Zhu X, Dai J, Wei H, Yang D, Huang W, Yu Z. Application of the fuzzy optimal model in the selection of the startup hub. Discret Dyn Nat Soc. 2021;2021:6672178. doi:10.1155/2021/6672178.
[42] Stephens RS. Probabilistic approach to the hough transform. Image Vis Comput. 1991;9(1):66–71. doi:10.1016/0262-8856(91)90051-P.
[43] Gioi RG, Jakubowicz J, Morel JM, Randall G. LSD: A fast line segment detector with a false detection control. IEEE Trans Pattern Anal Mach Intell. 2010;32(4):722–32. doi:10.1109/TPAMI.2008.300.
[44] Nan X, Song B, Fudong W, Gui-Song X, Tianfu W, Liangpei Z. Learning attraction field representation for robust line segment detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach; 2019.
[45] Mazzeo PL, Argentieri A, De Luca F, Spagnolo P, Distante C, Leo M, et al. Convolutional neural networks for recognition and segmentation of aluminum profiles. Multimodal Sens Technol Appl. 2019;11059:219–29. doi:10.1117/12.2525687.
[46] Sun K, Xiao B, Liu D, Wang J. Deep high-resolution representation learning for human pose estimation. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach; 2019. doi:10.1109/CVPR.2019.00584.
[47] Xiao B, Wu H, Wei Y. Simple baselines for human pose estimation and tracking. 15th European Conference on Computer Vision, Munich; 2018. doi:10.1007/978-3-030-01231-1_29.
[48] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, Columbus; 2014. doi:10.1109/CVPR.2014.81.
[49] He KM, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. IEEE International Conference on Computer Vision, Venice; 2017. doi:10.1109/ICCV.2017.322.
[50] Girshick R. Fast R-CNN. IEEE International Conference on Computer Vision, Santiago; 2015. doi:10.1109/ICCV.2015.169.
[51] Martin DR, Fowlkes CC, Malik J. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans Pattern Anal Mach Intell. 2004;26(5):530–49. doi:10.1109/TPAMI.2004.1273918.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.