This study examined the use of deep convolutional neural networks (DCNNs) for classifying, segmenting, and detecting surface defects in images of heritage buildings. Building surface defects on Gulang Island (a UNESCO World Cultural Heritage Site) were surveyed and classified into six categories according to relevant standards. A model based on Swin Transformer and YOLOv5 was then built for the automated detection of surface defects. Experimental results suggested that the proposed model classified plant penetration with 99.2% accuracy and achieved a mean intersection-over-union (mIoU) of over 92% for moss, cracking, alkalization, staining, and deterioration, outperforming CNN-based semantic segmentation networks such as FCN, PSPNet, and DeepLabv3plus. For the segmentation of building surface defect images, the Swin Transformer-based approach achieved the highest score on every evaluation metric (an mIoU of 90.96% and a mean accuracy, mAcc, of 95.78%) when compared with mainstream segmentation networks such as SegFormer, PSPNet, and DANet.
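For readers unfamiliar with the two segmentation metrics quoted above, the following is a minimal sketch of how mIoU and mAcc are commonly computed from a pixel-level confusion matrix. It is an illustration only, not the evaluation code used in the study: the function names `confusion_matrix` and `miou_and_macc` are hypothetical, and the choice to skip classes that never occur (rather than count them as zero) is an assumption.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels per (true class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    """
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = np.asarray(y_pred).astype(int).ravel()
    flat_index = y_true * num_classes + y_pred
    counts = np.bincount(flat_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_and_macc(cm):
    """Return (mIoU, mAcc) averaged over the classes that occur.

    Assumption: classes with an undefined score (zero denominator)
    are ignored in the mean instead of being scored as 0.
    """
    tp = np.diag(cm).astype(float)       # correctly labelled pixels per class
    gt = cm.sum(axis=1).astype(float)    # ground-truth pixels per class
    pred = cm.sum(axis=0).astype(float)  # predicted pixels per class
    union = gt + pred - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union                 # per-class intersection over union
        acc = tp / gt                    # per-class pixel accuracy (recall)
    return float(np.nanmean(iou)), float(np.nanmean(acc))


if __name__ == "__main__":
    # Toy two-class example with four pixels.
    truth = [0, 0, 1, 1]
    guess = [0, 1, 1, 1]
    cm = confusion_matrix(truth, guess, num_classes=2)
    miou, macc = miou_and_macc(cm)
    print(f"mIoU={miou:.4f}, mAcc={macc:.4f}")
```

In the toy example, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is about 0.583; the per-class accuracies are 0.5 and 1.0, giving an mAcc of 0.75. Because mAcc ignores false positives while IoU penalizes them, mAcc is typically the higher of the two, as in the figures reported above.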