Extraction of residential building instances in suburban areas from mobile LiDAR data

Shaobo Xia (b), Ruisheng Wang (a,b,*)

(a) School of Geographical Sciences, Guangzhou University, Guangzhou 510006, China
(b) Department of Geomatics Engineering, University of Calgary, T2N 1N4, Canada
(*) Corresponding author at: Department of Geomatics Engineering, University of Calgary, T2N 1N4, Canada. E-mail address: ruiswang@ucalgary.ca (R. Wang).

ISPRS Journal of Photogrammetry and Remote Sensing 144 (2018) 453-468. https://doi.org/10.1016/j.isprsjprs.2018.08.009
Received 4 April 2018; received in revised form 6 August 2018; accepted 8 August 2018; available online 22 August 2018. © 2018 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V.

Keywords: Mobile LiDAR; Individual buildings; Hypotheses and selection; Point cloud segmentation; Shape prior

Abstract

In recent years, mobile LiDAR data have become an important data source for building mapping. However, it is challenging to extract building instances in residential areas, where buildings of different structures are closely spaced and surrounded by cluttered objects such as vegetation. In this paper, we present a new "localization then segmentation" framework to tackle these problems. First, a hypothesis-and-selection method is proposed to localize buildings: rectangle proposals indicating building locations are generated from the projections of vertical walls obtained by region growing, and the selection of rectangles is formulated as a constrained maximization problem solved by linear programming. The point clouds are then divided into groups, each of which contains one building instance, and a foreground-background segmentation method is proposed to extract the building in each group from its complex surroundings. Based on a graph over the points, an objective function that integrates local geometric features and shape priors is minimized by graph cuts. Experiments are conducted in two large and complex scenes, residential areas in Calgary and Kentucky. The completeness and correctness of building localization are 87.2% and 91.34% in the former dataset, and 100% and 96.3% in the latter. Based on the tests, our binary segmentation method outperforms existing methods in terms of the F1 measure. These results demonstrate the feasibility and effectiveness of our framework for extracting instance-level residential buildings from mobile LiDAR point clouds in suburban areas.

1. Introduction

Building extraction from various remote sensing data is of great importance for many applications, such as cadastral surveying (Rutzinger et al., 2011), 3D reconstruction (Chen et al., 2017), change detection (Qin et al., 2015), urban analysis (Yu et al., 2010), energy management (Cheng et al., 2018), and visualization (Deng et al., 2016). Automatic or semi-automatic building extraction algorithms for aerial (Cote and Saeedi, 2013) or satellite images (Ok et al., 2013) have been widely studied in the past. These methods are useful for large-scale mapping, but image resolution and object occlusion limit their accuracy. Light Detection and Ranging (LiDAR) is an efficient and accurate tool for acquiring point clouds of object surfaces. In the work of Lafarge et al. (2008), airborne LiDAR is used for building extraction. However, similar to results from optical images, buildings extracted from airborne LiDAR are rooftops, and façade information is missing. Mobile LiDAR, which refers to vehicle-mounted laser scanning systems (also called mobile laser scanning, MLS), can acquire accurate and precise point clouds and has been widely used in mapping the environment along roads (Guan et al., 2016).
Mobile LiDAR is good at collecting detailed façade information, which is an important supplement to aerial data. In recent years, building extraction from mobile LiDAR point clouds has been studied in several works (Fan et al., 2014; Wang et al., 2016). However, problems remain in extracting buildings in residential areas. Compared with large buildings in downtown areas, residential houses are often smaller and have complex components such as porches. Besides, residential buildings are often closely spaced and surrounded by dense vegetation and cluttered objects, which further increases the difficulty of extracting individual buildings from the original point clouds.

In this paper, we propose a new framework for building instance extraction from mobile LiDAR point clouds in suburban residential areas where outer walls do not connect buildings. In our framework, buildings are first localized, and then the points of each building are extracted from the surroundings. The main contributions of this paper are threefold: (1) we propose a new "hypotheses and selection" method for independent building localization; (2) we propose a segmentation algorithm for separating buildings from overlapping objects by integrating local geometric features and shape priors; (3) our proposed framework achieves instance-level building extraction in dense suburban residential areas.

The paper is organized as follows. In Section 2, related works including building extraction methods and point cloud segmentation methods are reviewed. In Section 3, the proposed framework is illustrated in detail. Experiments and analysis are in Section 4, and the conclusion is given in Section 5.

2. Related works

In this section, building extraction methods from remote sensing data, with a focus on mobile LiDAR point clouds, are reviewed. As building point extraction can also be viewed as a specific case of point cloud segmentation, which aims at partitioning input data into individual objects, existing LiDAR point cloud segmentation methods are also discussed.

2.1. Building extraction methods

Building extraction has been widely studied with various data sources, including remote sensing images, airborne LiDAR data, and mobile LiDAR point clouds. When dealing with 2D images, corners and edges are often detected first and used to construct object boundaries. For example, if a closed outline consisting of corners in the images maintains a reasonable size, its covered area will be recognized as one building instance (Cote and Saeedi, 2013).
Similarly, building areas can be extracted using rectangular outline models generated from edges in aerial images (Lin and Nevatia, 1998). Xiao et al. (2012) first detect walls from oblique airborne images; by coupling adjacent walls, many building hypotheses are generated, which are then verified by checking the elevation discontinuity of roofs. The elevation discontinuity around roof boundaries can also be used to distinguish buildings from adjacent vegetated areas (Liu et al., 2013). Yang et al. (2013) propose a building extraction method for airborne LiDAR point clouds in which buildings are approximated by cubes and detected by optimizing the cube configurations using Reversible Jump Markov Chain Monte Carlo (RJMCMC). In the work of Awrangjeb et al. (2010), two kinds of aerial data, LiDAR and multispectral imagery, are combined to detect buildings in residential areas: building masks are obtained from the LiDAR data, line segments detected in the images indicate building boundaries, and buildings are finally detected by forming rectangles from neighboring line segments. In general, most building extraction methods for aerial data focus on utilizing roof boundary information and solve the extraction problem by identifying regular shapes in the input data.

Compared with airborne remote sensing data, mobile LiDAR mainly acquires detailed façade information instead of rooftops. Therefore, the key to building extraction from mobile LiDAR is utilizing the wall information. In the work of Hernández and Marcotegui (2009), 2D images are created by projecting 3D LiDAR point clouds onto horizontal grids, and grid cells with a large number of accumulated points are identified as building areas. Similarly, in the work of Yang et al. (2012), 2D geo-referenced feature images are first generated, classical edge detectors such as the Canny detector are applied to extract object contours, and buildings are then identified by analyzing the contour shapes. It should be pointed out that these methods only concern building regions in MLS point clouds; the extraction of independent building instances is not discussed.

Some building extraction methods are based on the spatial distribution patterns of buildings. For example, Fan et al. (2014) divide the input point clouds into three layers and group the points of each layer into clusters. They assume that building points are consecutive in the vertical direction; thus, if clusters exist at the same location in every layer, those clusters are extracted as building objects. Gao and Yang (2013) propose an independent building detection method for MLS point clouds. A histogram is calculated whose x-axis corresponds to the distance along the trajectory and whose y-axis corresponds to the number of points. If there are gaps between neighboring buildings, the corresponding bin values in the histogram will be much lower than the building bins, so adjacent houses can be separated at the gaps in the histogram. However, the method suffers from under-detection, over-detection, and miss-detection of buildings. The assumption that buildings are parallel to the trajectory may not always hold in the real world. Moreover, their method only detects building locations; building points are not identified from the surroundings.
As most building surfaces are flat, buildings can be detected from MLS point clouds by identifying flat regions. Pu and Vosselman (2009) apply region growing to segment mobile LiDAR point clouds and recognize buildings based on prior knowledge such as wall orientation. Wall flatness can also be observed at the super-voxel scale. Aijazi et al. (2013) propose a building recognition method based on super-voxels: super-voxels of the input point clouds are generated, neighboring voxels with similar properties (e.g., color, intensity) are merged into objects, and buildings are then recognized by analyzing the shape of each object. Yang et al. (2015) propose a hierarchical object extraction method based on multi-scale super-voxels, which are generated from Euclidean distance, color, and eigenvalue-based features at two scales. These super-voxels are first grouped into larger segments and then merged into meaningful objects based on specific rules, such as geometric similarity between neighboring segments. This bottom-up grouping can achieve good results when the geometric features are accurate and the predefined rules are correct, but these conditions may not always be satisfied for real-world point clouds. Similarly, the work of Wang et al. (2016) presents a category-oriented, voxel-based grouping scheme that groups point clouds into individual objects. They improve the geometric shape labeling of each segment and apply different rules to merge segments with different shape properties. They also propose an indicator named the horizontal hollow ratio, the ratio of the projected area of object points to the area enclosed by the object contours, to recognize individual building objects. However, this indicator does not work well for low-rise buildings (Wang et al., 2016).

In summary, existing methods have various problems in extracting buildings from mobile LiDAR point clouds. First, as buildings in the real world vary in size, shape, and detailed structure (e.g., balconies, porches), it is difficult to identify all building components with specific rules (Yang et al., 2012; Gao and Yang, 2013; Fan et al., 2014). Second, existing building extraction methods (Yang et al., 2012; Aijazi et al., 2013; Wang et al., 2016) often follow a "segmentation then recognition" route to identify buildings, so the extraction accuracy is highly restricted by the performance of scene segmentation, which is still an open problem, especially in complex scenes (Yang et al., 2015; Golovinskiy et al., 2009; Xu et al., 2017). Third, most studies focus on extracting building regions from MLS point clouds, and few methods are proposed to extract independent buildings. Yet instance-level building extraction is critical for many applications such as building reconstruction (Li et al., 2016) and land-use classification (Kang et al., 2018). In this study, we aim at extracting building instances from dense residential areas in mobile LiDAR point clouds.

2.2. Point cloud segmentation methods

Segmenting unorganized, incomplete, and uneven mobile LiDAR point clouds into individual objects is a challenging task. By using 2D grids (Serna and Marcotegui, 2014), point clouds can be segmented with 2D image processing methods such as morphological operations. However, 3D information is lost, and partitioning overlapping objects is difficult. In most cases, segmentation methods that deal with the original 3D point clouds are preferred.

Region growing is popular in LiDAR point cloud segmentation (Belton and Lichti, 2006). In general, it begins by initializing seed points and then groups adjacent points into one cluster according to some criteria. For example, if the normal difference between two points is smaller than a predefined threshold, they are grouped into one cluster (Nurunnabi et al., 2016). However, region growing is often sensitive to thresholds and relies on the accuracy of local features (Che and Olsen, 2018).
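For illustration, the normal-difference criterion described above can be sketched in a few lines of Python. This is not code from the paper (whose implementation is in C++): the fixed k-NN neighborhood and the 5° threshold are illustrative choices, and `points` and `normals` are assumed to be precomputed (N, 3) numpy arrays.

    import numpy as np
    from scipy.spatial import cKDTree

    def region_growing(points, normals, angle_thresh_deg=5.0, k=10):
        """Group points whose normals deviate by less than the angle threshold."""
        tree = cKDTree(points)
        cos_thresh = np.cos(np.radians(angle_thresh_deg))
        labels = np.full(len(points), -1, dtype=int)   # -1 = unvisited
        current = 0
        for seed in range(len(points)):
            if labels[seed] != -1:
                continue
            stack = [seed]
            labels[seed] = current
            while stack:
                i = stack.pop()
                # grow into k-NN neighbors whose normals agree with the current point
                for j in tree.query(points[i], k=k)[1]:
                    if labels[j] == -1 and abs(np.dot(normals[i], normals[j])) >= cos_thresh:
                        labels[j] = current
                        stack.append(j)
            current += 1
        return labels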
Euclidean clustering is a widely used algorithm for point cloud segmentation (Klasing et al., 2008). It groups points into one cluster if the distance between neighboring points is no larger than a given threshold. The distance threshold is difficult to determine, and the method also fails to separate close or overlapping objects. Maalek et al. (2018) present a clustering method to segment linear and planar structures in point clouds: points are first classified based on features derived from robust principal component analysis (PCA) over local neighborhoods and then grouped based on the feature similarity between adjacent points. In the work of Xu et al. (2017), a hierarchical clustering algorithm is proposed that measures the similarity between points based on Euclidean distances and normal differences; the clustering results are optimized by solving a bipartite matching. One important and common attribute of these methods is that further processing, such as rule-based segment merging, is required to achieve object-level segmentation. However, merging segments into instances is difficult because of the complexity and diversity of real-world objects.
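Euclidean clustering itself admits a compact formulation: the clusters are exactly the connected components of the graph whose edges join point pairs closer than the threshold. The following illustrative Python sketch (again not from the paper) exploits that equivalence; the 1.0 m default matches the threshold used later in Section 3.2.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import connected_components

    def euclidean_clustering(points, dist_thresh=1.0):
        """Label each point with the id of its distance-threshold cluster."""
        tree = cKDTree(points)
        pairs = np.array(list(tree.query_pairs(r=dist_thresh)))  # (M, 2) index pairs
        n = len(points)
        if len(pairs) == 0:
            return np.arange(n)          # every point is its own cluster
        adj = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
        _, labels = connected_components(adj, directed=False)
        return labels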
Several studies focus on the optimal segmentation of foreground and background objects in point clouds. Golovinskiy and Funkhouser (2009) present a foreground-background segmentation algorithm called MinCut. First, a k-NN (k nearest neighbor) graph of the original point clouds is built. Second, an objective function consisting of a data term and a smoothness term is designed: the data term for the foreground label is set to a constant, and that for the background label is calculated from the distances between points and a user-defined foreground center. Finally, the objective function on the k-NN graph is optimized by max-flow/min-cut (Boykov et al., 2001). MinCut often requires the foreground center and object radius to be input manually, and it is prone to over-segmentation and under-segmentation (Zheng et al., 2017). In the work of Yu et al. (2015), an extended normalized cut algorithm termed 3DNCut is proposed for the segmentation of mobile LiDAR point clouds. A group of points is first voxelized, a graph consisting of the non-empty voxels is built, and the graph is then partitioned into two independent segments by spectral clustering (Shi and Malik, 2000). 3DNCut is a promising tool for the binary segmentation of point clouds, but it has two limitations when applied to mobile LiDAR data. First, the number of objects in the input data must be known in advance. Second, the normalized cut tends to partition the input into two segments of similar size, which may not match reality (Zheng et al., 2017).

In summary, current segmentation methods for mobile LiDAR point clouds mainly rely on local geometric properties of points, including neighboring distances and local features such as normal differences. However, these local geometric features are unreliable, especially in uneven point clouds, and are highly dependent on the neighborhood size (Weinmann et al., 2015). Besides, existing methods often require specific rules and manual inputs, and the performance of overlapping-object partitioning still needs improvement.

3. The method

We propose a new framework for extracting building instances from mobile LiDAR point clouds. It consists of three main steps. First, independent buildings are localized using wall cues. Then, the original point clouds are divided into groups, each of which contains only one potential building. Finally, building points in each group are extracted by a newly proposed foreground/background segmentation method. To reduce the computational burden and eliminate the connectivity between different objects, the ground is first removed by a ground filtering algorithm (Zhang et al., 2016).

3.1. Building localization

Although buildings vary in shape and size, most buildings have two dominant directions and can be approximately represented by bounding rectangles, according to building footprint analysis (Zhang et al., 2006; Liqiang et al., 2013). Based on this fact, we propose a new method to detect buildings from MLS point clouds, which consists of three steps. In the first step, vertical wall segments are detected by region growing. Then, wall segments are projected horizontally and used to construct 2D rectangles. Finally, a subset of the rectangle hypotheses is selected to indicate building locations. In a word, we aim at localizing buildings by detecting 2D rectangles in the projected MLS point clouds.

3.1.1. Rectangle hypothesis

Region growing with an angle threshold θ (e.g., 5.0°) is used to segment the non-ground point clouds into groups. To improve efficiency, segments containing fewer than 50 points are discarded. This removal threshold is empirically selected and has little effect on the following steps, since only large segments are used for rectangle generation. To find potential vertical walls, two rules are applied to filter out non-wall objects such as fences and vehicles. First, only planes whose normals point horizontally are kept, i.e., if the dot product between a plane normal and (0, 0, 1) is larger than 0.05, the segment is removed. Second, wall segments should maintain a minimum height over the ground and should be large enough. Concretely, the minimum segment-to-ground height H_min and the minimum segment length L_min are used to filter out non-wall segments, where the segment height is measured from the nearest ground point to the highest point in the segment. In this paper, H_min and L_min are both set to 2.0 m. The remaining segments are recognized as potential walls.

To generate rectangle hypotheses, the potential walls are projected onto the ground as 2D line segments. Supposing there are n line segments and every two line segments can generate one rectangle, this results in C(n, 2) = n(n − 1)/2 hypotheses, which is a huge number. However, some pairs of line segments cannot form a rectangle. First, pairs that are neither parallel nor perpendicular are discarded. Second, collinear line segments, as in Fig. 1(a), and intersecting lines, as in Fig. 1(b), are not used to generate hypotheses. Fig. 1(c) and (d) give two examples of valid rectangle hypotheses. Note that if two selected line segments are not strictly parallel or perpendicular, the included angle is forcibly adjusted: the longer line segment remains unchanged while the shorter one is rotated. The number of hypotheses is further reduced by removing rectangles that are too small or too large to form a building outline. Given a rectangle, if its width is below B_min (e.g., 5.0 m) or its length is over B_max (e.g., 30.0 m), it is discarded; if its width-to-length ratio is smaller than a threshold R_wl (e.g., 0.3), the proposal is also removed. After filtering out unqualified hypotheses, only about 5% of the C(n, 2) candidates are kept in most cases.

Fig. 1. From line segments to rectangles. Proposed rectangles are formed by dashed red lines. (a) Collinear line segments. (b) Intersecting lines. (c) A rectangle formed by two perpendicular line segments. (d) A rectangle formed by two parallel lines.
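The pairing and size filters above can be sketched as follows. This is an illustrative Python reading of the procedure, not the authors' code: the intersection test of Fig. 1(b) is omitted for brevity, collinear pairs are rejected implicitly by the minimum-width check, and the rectangle is taken as the bounding box of the two segments in the frame of the longer one (which also realizes the angle snapping described above).

    import numpy as np
    from itertools import combinations

    B_MIN, B_MAX, R_WL = 5.0, 30.0, 0.3   # thresholds from Table 1
    ANG_TOL = np.radians(5.0)             # tolerance before angle snapping

    def direction(seg):
        d = seg[1] - seg[0]
        return d / np.linalg.norm(d)

    def rectangle_hypotheses(segments):
        """segments: list of (2, 2) arrays, each a projected wall [p0, p1]."""
        rects = []
        for a, b in combinations(segments, 2):
            da, db = direction(a), direction(b)
            cosang = abs(np.dot(da, db))
            parallel = cosang > np.cos(ANG_TOL)
            perpendicular = cosang < np.sin(ANG_TOL)
            if not (parallel or perpendicular):
                continue                  # neither (nearly) parallel nor perpendicular
            # snap to the frame of the longer segment (the shorter one is adjusted)
            u = da if np.linalg.norm(a[1] - a[0]) >= np.linalg.norm(b[1] - b[0]) else db
            v = np.array([-u[1], u[0]])
            pts = np.vstack([a, b])
            s, t = pts @ u, pts @ v
            length = max(s.max() - s.min(), t.max() - t.min())
            width = min(s.max() - s.min(), t.max() - t.min())
            if width < B_MIN or length > B_MAX or width / length < R_WL:
                continue                  # too small, too large, or too elongated
            rects.append((u, v, (s.min(), s.max()), (t.min(), t.max())))
        return rects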
3.1.2. Rectangle selection

The problem now becomes selecting a subset of rectangles that stand for coarse building outlines. A binary variable x_i is defined to indicate whether the i-th hypothesized rectangle is selected (x_i = 1) or not (x_i = 0). For a given line segment t, if the minimal distance between its midpoint and the four edges of the i-th rectangle is smaller than a threshold Rect_in (e.g., 0.5 m), the line segment t is identified as an inlier of the i-th hypothesis, or said to be covered by the i-th rectangle. Similarly, a variable y_t is introduced to indicate whether the t-th line segment is covered by some selected rectangle (y_t = 1) or not (y_t = 0).

Since buildings in the real world are always distributed at intervals, no building overlaps are allowed. Based on these facts, two constraints are introduced. The first is the overlapping constraint: the overlapping area a_{i,j} between the i-th and j-th rectangles is calculated, and if a_{i,j} > 0, the two rectangles cannot coexist, i.e., x_i + x_j ≤ 1. For example, in Fig. 2(a), the two blue rectangle proposals overlap and cannot be selected simultaneously. Also, buildings should not be too close to each other. The second constraint therefore controls the minimal distance D_{i,j} between parallel edges of adjacent rectangles i and j: if D_{i,j} is smaller than a predefined threshold D_n (e.g., 2.0 m), the i-th and j-th rectangles cannot be selected simultaneously, i.e., x_i + x_j ≤ 1. For example, in Fig. 2(b), the two blue rectangles are very close to each other and thus cannot both be selected. This rule reduces multiple-detection errors, i.e., one building being detected by multiple rectangles (Awrangjeb et al., 2010).

Besides the constraints on building distribution, we also want to select rectangles that cover as many wall points as possible. For example, no rectangle covers the wall points of the right building in Fig. 2(c) if the coverage of the data is not considered. To this end, the constraint is written as Σ_{i ⊢ t} x_i ≥ y_t, where i ⊢ t means that rectangle i covers wall segment t. The consequence of adding this rule is that y_t must be zero if no rectangle covering line segment t is selected, since the constraint then degenerates to 0 ≥ y_t. Combining the constraints above, the objective function is formed as Eq. (1), which accumulates the variables y_t weighted by the projected lengths L_t of the wall segments; N_l is the number of wall segments. In general, by maximizing the objective subject to the constraints, we prefer to select non-conflicting rectangles that cover as many wall segments as possible. This is a typical (integer) linear programming problem and can be solved by software such as Gurobi (Gurobi Optimization, 2016). An example of the optimal rectangle configuration is given in Fig. 2(d).

  maximize    Σ_{t=1}^{N_l} y_t · L_t
  subject to  x_i + x_j ≤ 1        for all {i, j} with a_{i,j} > 0,
              x_i + x_j ≤ 1        for all {i, j} with D_{i,j} < D_n,
              Σ_{i ⊢ t} x_i ≥ y_t  for all t,
              x_i, y_t ∈ {0, 1}.                                        (1)

Fig. 2. Optimal selection of rectangles. Black line segments stand for walls and dashed rectangles are generated hypotheses. (a) Overlapping rectangles (blue) cannot be selected simultaneously. (b) Two close rectangles (blue) are rejected and the red one is selected. (c) The building on the right is missed. (d) The optimal selection of rectangles.

Fig. 3 gives a step-by-step example of building localization in a scene consisting of four individual buildings. The input of our method is the non-ground point cloud, as in Fig. 3(a). The point cloud is first segmented into planes, and only vertical walls are kept, as in Fig. 3(b); there are 25 wall segments in this demo. Then 44 potential rectangles are generated from the projections of these walls, as in Fig. 3(c). By selecting a subset of the hypothesized rectangles, the building locations are estimated: four rectangles corresponding to the four houses are selected, as shown in Fig. 3(d). It should be pointed out that the fences of the rightmost building are not covered by the selected rectangle: the rectangle is used to find the main walls of one building, not to construct an accurate building outline.

Fig. 3. An example of building localization. (a) Non-ground point clouds. (b) Segmented wall segments. (c) Hypothesized rectangles, colored randomly. (d) Selected rectangles in red. Wire-frames derived from the rectangles are added for visualization.
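Eq. (1) maps directly onto an integer linear program. The following is a minimal sketch using the gurobipy API (the paper reports using Gurobi, though not necessarily through Python, and a license is required); `overlap_pairs`, `close_pairs`, `covers` (the list of rectangles covering each wall segment) and the projected lengths `L` are hypothetical names for quantities precomputed from the hypotheses of Section 3.1.1.

    import gurobipy as gp
    from gurobipy import GRB

    def select_rectangles(n_rects, L, covers, overlap_pairs, close_pairs):
        m = gp.Model("rectangle_selection")
        x = m.addVars(n_rects, vtype=GRB.BINARY, name="x")   # rectangle i selected
        y = m.addVars(len(L), vtype=GRB.BINARY, name="y")    # wall segment t covered
        m.setObjective(gp.quicksum(L[t] * y[t] for t in range(len(L))), GRB.MAXIMIZE)
        for i, j in overlap_pairs:   # overlapping rectangles cannot coexist
            m.addConstr(x[i] + x[j] <= 1)
        for i, j in close_pairs:     # rectangles closer than D_n cannot coexist
            m.addConstr(x[i] + x[j] <= 1)
        for t in range(len(L)):      # y_t may be 1 only if a covering rectangle is chosen
            m.addConstr(gp.quicksum(x[i] for i in covers[t]) >= y[t])
        m.optimize()
        return [i for i in range(n_rects) if x[i].X > 0.5]

For the subset reported in Section 4.2.1 (520 proposals, 141 wall segments), this model has 661 binary variables, matching the count given there.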
3.2. Dividing individual buildings into groups

After building localization, the non-ground points can be divided into groups, each of which contains only one building instance. First, the Euclidean clustering algorithm with a distance threshold (e.g., 1.0 m) is applied to group the non-ground points into clusters. Isolated buildings fall into individual clusters, while adjacent buildings may be merged into one cluster. For example, the points in Fig. 4(a) are all merged into one cluster after clustering because of the grass in the green circle. To tackle this problem, similar to the method in Gao and Yang (2013), a profile of point counts along the two localization rectangles is built, as illustrated in Fig. 4(b), and the merged cluster is divided at the lowest bin in this profile, as indicated by the red arrow. To reduce the noise from non-building points, only points belonging to planar segments are counted in the profile. Fig. 4(c) shows the grouping results guided by the localization rectangles. As the number and locations of the potential buildings are known, the drawbacks of the trajectory-based method (Gao and Yang, 2013), such as over-detection and miss-detection, are avoided in this step. Finally, the problem of building extraction turns into segmenting the building points from their surroundings in each group; Fig. 4(d) shows an example of the final building extraction result. The segmentation method is illustrated in the following sections.

Fig. 4. An example of dividing individual buildings into clusters. (a) Points are merged into one cluster by Euclidean clustering. (b) A histogram of point counts; red rectangles indicate the building localization results. (c) Point grouping results based on (b). (d) Illustration of building extraction from each group. Points in black and blue indicate two separated buildings; non-building points are colored green. Wire-frames derived from the rectangles are added for visualization.
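A sketch of this division step is given below. It reflects one reading of the "profile along the two rectangles": planar points are binned along the axis joining the two rectangle centers, and the cluster is split at the emptiest bin between them. The bin size is an illustrative assumption, as is the use of rectangle centers.

    import numpy as np

    def split_merged_cluster(points_xy, is_planar, center_a, center_b, bin_size=0.5):
        """Return a boolean mask selecting the sub-cluster around center_a."""
        axis = center_b - center_a
        span = np.linalg.norm(axis)
        axis = axis / span
        t = (points_xy - center_a) @ axis      # 1D coordinate along the axis
        # count only planar-segment points to suppress noise from vegetation
        hist, edges = np.histogram(t[is_planar],
                                   bins=np.arange(0.0, span + bin_size, bin_size))
        cut = edges[np.argmin(hist)] + bin_size / 2   # emptiest bin between buildings
        return t <= cut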
3.3. Segmentation-based building extraction

After dividing buildings into individual groups, the buildings need to be further extracted from their surroundings. This problem can be viewed as a foreground/background segmentation task and formulated as an energy minimization problem (Rother et al., 2004; Boykov and Kolmogorov, 2004). To this end, a k-NN graph (e.g., k = 10) of the points is first constructed, and the general form of the energy model is written as

  E(l) = Σ_{i∈P} D_i(l_i) + β Σ_{(i,j)∈N} V_ij(l_i, l_j),    (2)

where P is the input point set and N is the neighbor system in the k-NN graph. l_i indicates the label of point i: l_i equals 1 when point i is a foreground (building) point and 0 otherwise. Σ_{i∈P} D_i(l_i) is the data term, which accumulates the cost D_i(l_i) of assigning the foreground/background label to point i. Σ_{(i,j)∈N} V_ij(l_i, l_j) is the smoothness term, which accumulates the penalties for label differences between neighboring points i and j in the k-NN graph. The coefficient β is a non-negative number balancing the two terms. This function can be optimized by min-cut/max-flow algorithms (Boykov et al., 2001). In this study, the data term is set according to the geometric distribution of local points, and the smoothness term focuses on the similarity between neighbors. Besides, we extend the model in Eq. (2) by adding a shape term that encodes shape priors derived from planar segments.

3.3.1. Data term

According to Boykov and Jolly (2001), the data term should be set based on prior information about the background/foreground, which can either be defined beforehand or modeled from seeds. For example, a constant value is adopted to penalize the foreground in MinCut (Golovinskiy and Funkhouser, 2009), and Gaussian mixture models (GMMs) learned from seeds are applied to predict the penalties for background and foreground in Rother et al. (2004). In our problem, the "foreground" refers to buildings and the "background" mainly consists of cluttered vegetation. The prior knowledge adopted here is that buildings mainly consist of flat regions. To predict the data cost, three eigenvalue-based indexes, the linear (f_1D), planar (f_2D) and volumetric (f_3D) features (Demantké et al., 2011; Yang et al., 2015), are calculated as

  f_1D = (λ_1 − λ_2)/λ_1,  f_2D = (λ_2 − λ_3)/λ_1,  f_3D = λ_3/λ_1,    (3)

where λ_1, λ_2 and λ_3 (λ_1 ≥ λ_2 ≥ λ_3 > 0) are the three eigenvalues derived from PCA of the local neighborhood. f_1D reaches a high value in linear structures such as branches and railings. f_2D measures the local planarity and is large in flat regions such as walls. f_3D measures the degree of volumetric distribution of the local points and is large in scattered regions such as vegetation. Typically, local points with a large value of f_2D have a high probability of being labeled as building, while points with a large value of f_1D or f_3D are often located in non-building areas such as vegetation. The penalty for labeling one point as "foreground" (building) or "background" (non-building) is defined as

  D_i(l_i) = (1 − f_2D) · max{f_1D, f_3D}    if l_i = 1,
             (1 − max{f_1D, f_3D}) · f_2D    if l_i = 0.    (4)

Fig. 5 gives an example of how f_2D and max{f_1D, f_3D} behave in real data.

Fig. 5. Example of planar (f_2D) and max{f_1D, f_3D}. (a) Original point clouds of a building, tree and bushes. (b) The values of f_2D. (c) The values of max{f_1D, f_3D}.
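The features of Eq. (3) follow directly from the covariance of each point's neighborhood. A sketch with numpy/scipy is given below; the neighborhood size k is an illustrative choice, and the small epsilon guarding degenerate neighborhoods is our own addition.

    import numpy as np
    from scipy.spatial import cKDTree

    def dimensionality_features(points, k=10):
        """Return (N, 3) array of (f_1D, f_2D, f_3D) per point, as in Eq. (3)."""
        tree = cKDTree(points)
        _, idx = tree.query(points, k=k)
        feats = np.empty((len(points), 3))
        for i, nbrs in enumerate(idx):
            nb = points[nbrs] - points[nbrs].mean(axis=0)
            cov = nb.T @ nb / k
            lam = np.linalg.eigvalsh(cov)[::-1]    # sorted so that λ1 ≥ λ2 ≥ λ3
            lam = np.maximum(lam, 1e-12)           # guard against degenerate cases
            feats[i] = ((lam[0] - lam[1]) / lam[0],
                        (lam[1] - lam[2]) / lam[0],
                        lam[2] / lam[0])
        return feats

With `feats` in hand, the data costs of Eq. (4) are a two-line computation: `D1 = (1 - feats[:, 1]) * np.maximum(feats[:, 0], feats[:, 2])` for the foreground label and the symmetric expression for the background label.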
3.3.2. Smoothness term

The second term in Eq. (2) accumulates the penalty for label inconsistency between neighboring points. It is calculated as

  V_ij(l_i, l_j) = 0                              if l_i = l_j,
                   exp{−d_ij / min(d̄_i, d̄_j)}    if l_i ≠ l_j,    (5)

where d_ij is the Euclidean distance between points i and j, and d̄_i is the mean distance between point i and its neighbors, used to normalize the distance values. The closer two points are, the higher the penalty imposed for label disagreement.

3.3.3. Implementing shape prior

Point cloud segmentation based on geometric properties of the local neighborhood is not reliable. First, mobile LiDAR point clouds are often noisy, incomplete, and uneven, so features derived from PCA are easily affected; the choice of neighborhood size also affects the performance of local geometric features (Demantké et al., 2011). Second, real-world scenes are complicated and cluttered: objects are often close to each other or even overlapping, which makes it impossible to find the boundaries between neighboring objects using local features alone. A possible solution to these problems is introducing high-level geometric constraints to assist the segmentation.

In recent years, object extraction from 2D images with geometric constraints has been widely studied. In Freedman and Zhang (2005), the distances between pixels and a shape template of the foreground object are introduced as a shape prior: the penalty for discontinuity between two pixels is low if they are close to the object contour and becomes large for pixels away from the shape template, so minimizing the energy function facilitates boundary-preserving segmentations. Veksler (2008) proposes a general way, termed star-shape, to incorporate shape priors based on generic shape properties: if the center of the foreground object is known as c and a point p is labeled as foreground, all points on the line segment with endpoints c and p should also be labeled as foreground.

Shape prior from planar segments. Planar segments have already been retrieved in Section 3.1.1 by region growing. Most of these segments come from building structures and can be used to set shape constraints. However, for walls adjacent to vegetation, the estimated normal vectors may deviate from the true normals, so parts of the walls are missing after region growing. A 2D example is given in Fig. 6. To human perception, there is a potential linear structure consisting of the red and blue points, and two cluttered groups (green points). After region growing, the red points are grouped into one segment; the growing stops because the angle difference between points A and B is larger than θ, although the blue points in Fig. 6 may also be regarded as part of the linear structure. To recover such missing potential structures, a set of points termed "anchor points" is selected from the non-planar points. A non-planar point i is selected as an anchor point if it meets two criteria: (1) the distance between point i and a planar surface p_s is no larger than d_t (e.g., 0.05 m); (2) the angle difference between the normal at p_i and the surface normal of p_s is no larger than 3θ. For example, in Fig. 6, the blue points are selected as anchor points. Finally, the points of the planar segments and the anchor points are combined into one point set S, termed the "shape set" in this paper.

Fig. 6. A 2D example of anchor points. Normals at anchor points may not be accurate.
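The two anchor-point criteria can be sketched as below, under simplifying assumptions that are ours rather than the paper's: each planar segment is represented by its member points plus a per-segment normal, the point-to-surface distance is approximated along the normal of the nearest segment point, and the lateral proximity guard (5·d_t) is an illustrative heuristic to avoid matching against the infinite plane.

    import numpy as np
    from scipy.spatial import cKDTree

    def find_anchor_points(points, normals, seg_points, seg_normals,
                           d_t=0.05, angle_thresh_deg=15.0):   # 3θ with θ = 5°
        """points/normals: candidate non-planar points; returns a boolean mask."""
        tree = cKDTree(seg_points)
        cos_thresh = np.cos(np.radians(angle_thresh_deg))
        dists, idx = tree.query(points, k=1)
        n_s = seg_normals[idx]                 # normal of the nearest planar segment
        # distance measured along the segment normal, not the full 3D distance
        plane_dist = np.abs(np.einsum('ij,ij->i', points - seg_points[idx], n_s))
        close = (dists < 5 * d_t) & (plane_dist <= d_t)
        aligned = np.abs(np.einsum('ij,ij->i', normals, n_s)) >= cos_thresh
        return close & aligned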
Shape prior in the energy function. Fig. 7 gives an example of extracting the bottom linear segment from adjacent cluttered groups. First, a k-NN graph is built. By minimizing the energy function of Eq. (2), which does not consider the shape prior, a cut like Cut A in Fig. 7 may be obtained. However, the preferred cut is Cut B, which divides the graph into two parts where the clutter and the linear segment are well separated. In the preferred cut, cluttered points (e.g., point A) may still be labeled as foreground and anchor points may be labeled as background (e.g., point B), depending on the distances between points and the evidence from neighbors. To achieve the optimized cut with shape priors, we propose another term S_ij(l_i, l_j):

  S_ij(l_i, l_j) = 0                   if l_i = l_j,
                   φ((p_i + p_j)/2)    if l_i ≠ l_j,    (6)

where p_i and p_j stand for the positions (coordinates) of points i and j. The penalty φ((p_i + p_j)/2) for neighboring label inconsistency is defined in Eq. (7) and always returns a non-negative value:

  φ(p̄) = Dist(p̄)⁻¹                     if i, j ∈ S,
          1.0 − exp{−Dist(p̄)/d_t}       otherwise,    with p̄ = (p_i + p_j)/2,    (7)

where Dist(·) returns the positive distance between the mean point of the two neighbors i, j and the nearest planar segment, and Dist⁻¹(·) is its inverse. According to Eq. (7), if two adjacent points are both in the shape set S, the penalty is large, which means they are more likely to share the same label, either foreground or background. Otherwise, the label inconsistency penalty is small if the two adjacent points are close to the planar segments, and increases to a large value for points far from any planar segment. As a result, edges near planar structures carry lower weights and are more easily removed (cut). For example, in Fig. 7, the edges near the linear segment are much thinner than the edges along the segment and the edges far from the segment.

Fig. 7. A 2D example of segmentation on a k-NN graph. Black dashed line: a cut without shape prior. Red dashed line: a cut with shape prior. The thickness of an edge indicates the weight derived from the shape priors; a thicker edge has more weight.

Finally, the energy function with shape priors can be rewritten as Eq. (8), with the new term weighted by γ. This last term derived from the shape priors is also called the "shape term" in this paper. The final objective function is optimized by graph cuts (Kolmogorov and Zabin, 2004).

  E(l) = Σ_{i∈P} D_i(l_i) + β Σ_{(i,j)∈N} V_ij(l_i, l_j) + γ Σ_{(i,j)∈N} S_ij(l_i, l_j).    (8)

The point clouds in Fig. 8 are used to show the anchor points and the importance of the shape prior. Minimizing the energy function of Eq. (2) gives the segmentation result in Fig. 8(a): some points of the wall and stairs are mislabeled as non-building. In Fig. 8(b), points from region growing are colored red, anchor points are colored blue, and the remaining points are colored green. In this example, some planar segments, such as the parts of the roof indicated by the red arrow, are missed because of inaccurate normal estimates and the small sizes of the segments. Also, points near the contours may not be contained in planar regions, such as the regions near the window indicated by the black arrow. Anchor points may also come from non-building objects (e.g., the blue points in the bush). After minimizing the energy function of Eq. (8), a better segmentation result is achieved, as shown in Fig. 8(c).

Fig. 8. A 3D example of segmentation. (a) Segmentation without shape prior. (b) Points (red) from all segments after region growing, anchor points (blue) and other points (green). (c) Segmentation with shape prior. β = 1.0, γ = 1.0.
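For the binary building/background case, Eq. (8) can be minimized with a single min-cut. The sketch below uses the PyMaxflow library and is an illustration rather than the authors' C++ graph-cut implementation; the per-edge values of V_ij and S_ij are assumed to be precomputed from Eqs. (5)-(7).

    import numpy as np
    import maxflow

    def segment_building(data_fg, data_bg, edges, v_ij, s_ij, beta=1.0, gamma=1.0):
        """data_fg/data_bg: per-point costs D_i(1) and D_i(0);
        edges: (M, 2) k-NN index pairs; v_ij, s_ij: per-edge penalties."""
        g = maxflow.Graph[float]()
        nodes = g.add_nodes(len(data_fg))
        for i in range(len(data_fg)):
            # a point on the sink (building) side pays the source capacity D_i(1);
            # a point on the source (background) side pays the sink capacity D_i(0)
            g.add_tedge(nodes[i], data_fg[i], data_bg[i])
        for (i, j), v, s in zip(edges, v_ij, s_ij):
            w = beta * v + gamma * s           # combined pairwise weight of Eq. (8)
            g.add_edge(nodes[i], nodes[j], w, w)
        g.maxflow()
        # get_segment returns 1 for sink-side nodes, i.e., building points here
        return np.array([g.get_segment(nodes[i]) for i in range(len(data_fg))])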
4. Experiments and analysis

4.1. Datasets

Two datasets of residential areas acquired by different mobile LiDAR systems are used in the experiments. A RIEGL VMX-450 system collected LiDAR point clouds in a typical residential area in Calgary, Canada. This dataset contains more than 340 million points and covers a rectangular region about 1200 m long and close to 210 m wide. The scene is complex and contains buildings, vegetation, vehicles, power lines, pole lights and other objects (e.g., pedestrians). An Optech Lynx mobile mapper V200 acquired about 53 million LiDAR points in a residential region of Kentucky, USA. The shape of this region is irregular, and its bounding box is about 370 m in length and 140 m in width.

The ground truth for building localization and extraction is obtained manually. Specifically, as buildings in MLS point clouds are often incomplete, a valid building should contain at least one wall segment with a height over 2.0 m. By this criterion, 375 buildings in the Calgary dataset and 78 buildings in the Kentucky dataset are regarded as ground truth. Before further processing, the ground filtering algorithm (Zhang et al., 2016) is applied to classify the original data into ground and non-ground point sets. For convenience, the Calgary dataset is split into 17 subsets, each containing around 20 million points; the results of all subsets are combined after processing. The proposed algorithms are implemented in C++, and the experiments are conducted on a computer with an Intel Core i7-6700 3.4-GHz CPU. Currently, no parallel programming strategy is adopted in our implementation.
4.2. Building localization

4.2.1. Results and analysis

There are several parameters and thresholds in the building detection step. Based on our tests, these values are empirically set and fixed in all the experiments. Table 1 summarizes the eight parameters for building detection.

Table 1. Parameter description and settings for building localization.

  Notation | Description                                  | Setting
  θ        | Angle threshold in region growing            | 5°
  H_min    | Minimal wall height                          | 2.0 m
  L_min    | Minimal wall length                          | 2.0 m
  B_min    | Minimal building width                       | 5.0 m
  B_max    | Maximal building length                      | 30.0 m
  R_wl     | Minimal width-to-length ratio of a rectangle | 0.3
  D_n      | Minimal distance between rectangles          | 2.0 m
  Rect_in  | Threshold for rectangle inliers              | 0.5 m

Most of the execution time for building localization is spent in the region growing step, which costs nearly one hour for the whole scene. In contrast, the total time for rectangle generation and selection is about 150 s. Taking one subset as an example, 141 vertical wall segments are detected after region growing and 520 rectangle proposals are generated; Eq. (1) then contains 661 variables and 17,863 constraints, 26 rectangles are selected, and the time spent on rectangle generation and selection is around 11 s.

Overviews of the original point clouds overlaid with the detected building rectangles are drawn in Figs. 9 and 10. Correctly selected rectangles are colored red, detected non-building objects and multiple detections are colored blue, and missed buildings are indicated by black rectangles. According to these results, the building instance localization performs better in the Kentucky dataset than in the Calgary dataset.

Fig. 9. Overview of building localization results in the Calgary residential dataset. Ground points are colored yellow; non-ground points are colored by height (from low in blue to high in red). Red rectangles are correctly detected buildings. Blue rectangles are falsely detected buildings. Black rectangles indicate the locations of undetected buildings.

Fig. 10. Overview of building localization results in the Kentucky residential dataset. The color scheme is the same as in Fig. 9.

We further give some detailed examples of building localization in the Calgary dataset in Fig. 11. In Fig. 11(a), all buildings are correctly localized by rectangles. In the right middle of Fig. 11(b), one small building indicated by a black rectangle is missed by our method. In this case, only one side of the building was scanned by the MLS, so no rectangle can be formed from a single wall segment. In fact, the insufficiency of wall segments is the main reason for miss-detection. The problem of missing walls is caused by occlusions, some of which can be avoided by using multiple scans from different viewpoints. The two blue rectangles in Fig. 11(b) both belong to multiple-detection errors. This problem may occur when a building has a complex shape, and it can be alleviated by post-processing, such as removing multiple detections by analyzing the connectivity between detected buildings. However, due to the complexity of the real world, there is no universal solution, and multiple detections are difficult to avoid entirely. There are three false detections in Fig. 11(c): the two small blue rectangles are multiple detections, while the largest blue rectangle covers non-building objects, i.e., mainly bare ground. This rectangle is formed by wall segments from two different houses. This kind of false detection may be revised in post-processing by analyzing the point distribution within the region indicated by the localized rectangle. In some cases, false and miss detections are related: if a large rectangle covering two building instances is falsely selected, the correct rectangles of the two buildings cannot be selected because they have large overlaps with the large one. This problem may be alleviated by reducing some thresholds, such as the minimum rectangle width; however, changing thresholds may only solve specific cases while increasing the probability of multiple detections.

Fig. 11. Details of building localization. Point clouds and rectangles are colored as in Fig. 9. (a) All buildings are localized correctly. (b) Miss-detection and multiple detection of buildings. (c) Detection of non-building objects and multiple detections.
4.2.2. Quantitative analysis and comparisons

To quantitatively evaluate the performance of the proposed building detection algorithm, the completeness and correctness are defined as

  Completeness = TP / (TP + FN),  Correctness = TP / (TP + FP).    (9)

True positive (TP) is the number of buildings that are localized by our method and are also in the building reference set. False positive (FP) is the number of buildings that are recognized by our method but are not in the reference set. There are two types of FP: in the first type, the selected rectangle covers non-building objects; the second type is multiple detections, in which case only the rectangle covering the highest number of LiDAR points is counted as the correct detection and the other rectangles are counted as FP. False negative (FN) is the number of buildings that are missed by our method.

In the Calgary dataset, 358 rectangles are detected by our method; the number of FP is 31 and the number of FN is 48. Therefore, the completeness of our method is 87.2% and the correctness is 91.34%. In the Kentucky dataset, 81 rectangles are selected, three of which are false positives; thus the completeness in this dataset is 100% and the correctness is 96.3%. The main reason for the better performance in the Kentucky dataset lies in the different scene complexities: compared with the Calgary dataset, there is much less vegetation in the Kentucky dataset, which results in more complete building walls due to fewer occlusions.

In recent years, few methods have been proposed to localize buildings in mobile LiDAR point clouds. Fan et al. (2014) report that 32 out of 46 (69.57%) buildings are detected by their method in a residential scene; this detection rate is relatively low compared with other methods. In the work of Gao and Yang (2013), the average completeness and correctness of the proposed building detection method are 86.46% and 91.41%, respectively. However, their method requires an accurate trajectory, and only buildings that are parallel and close to the trajectory can be detected. A more critical problem in Gao and Yang (2013) is that their method falsely detects adjacent buildings as one object if there are trees or shrubs between them. This rarely happens in their test data, but in many residential areas, such as our test data, flourishing vegetation between buildings is common. Besides, it is difficult for their method to distinguish buildings from large trees, which also makes it infeasible in residential areas with lush vegetation. Wang et al. (2016) test their method in a typical urban scene with 192 buildings; the average completeness is 94.7% and the correctness is 91%, while the completeness for low-rise buildings is 86.3% according to their paper. It should be pointed out that only buildings containing at least two vertical walls are counted as ground truth in their evaluation. If the same criterion were used in our dataset, only 341 buildings would be identified as ground truth, and the completeness of our method would increase to 95.89%. In their research (Wang et al., 2016), independent buildings are detected by analyzing the horizontal hollow ratio of the projected point clouds: buildings are said to maintain a much lower horizontal hollow ratio than other objects because building roofs are often missed during scanning. However, this rule works badly in residential scenes, where most buildings are low and parts of the roofs can be fully scanned. In our dataset, only 24.3% of all buildings have a horizontal hollow ratio lower than 50%, which means that the completeness of the hollow-ratio-based method would be much lower than that of our proposed method. Besides, the buildings in Wang et al. (2016) are often far from each other, so the problem of separating neighboring buildings rarely arises.

In summary, our proposed method achieves state-of-the-art building detection results in a complex residential area with dense vegetation. Compared with the assumptions used in existing methods, such as no vegetation between adjacent buildings (Gao and Yang, 2013) or few roof points recorded by the MLS (Wang et al., 2016), our assumption that buildings consist of vertical walls is more general.
4.3. Building instance extraction

4.3.1. Experimental results

The performance of the proposed building extraction method is evaluated on seven typical residential building samples. The sample names, sizes, point counts, and execution times of our method are listed in Table 2.

Table 2. Summary of the seven building examples.

  Sample   | Figure     | Size (L/W/H, m) | Points  | Running time (s)
  Sample 1 | Fig. 12(a) | 14.5/13.7/6.8   | 40,893  | 4.26
  Sample 2 | Fig. 12(b) | 18.6/16.8/11.1  | 71,540  | 4.54
  Sample 3 | Fig. 12(c) | 17.2/7.5/9.9    | 31,231  | 3.31
  Sample 4 | Fig. 12(d) | 15.0/13.8/10.4  | 264,409 | 30.7
  Sample 5 | Fig. 12(e) | 30.9/14.6/11.4  | 187,641 | 19.49
  Sample 6 | Fig. 12(f) | 14.9/13.0/5.2   | 53,375  | 6.62
  Sample 7 | Fig. 12(g) | 16.2/16.1/13.5  | 46,025  | 4.95

According to the running times in the last column, the speed of our method is strongly related to the number of points. In fact, most of the time is spent on the initialization step, which includes feature calculation, finding anchor points, and graph construction, while the α-expansion itself costs little time. For instance, the initialization of Sample 4 takes around 26.0 s and only 4.2 s are used for optimization.

The manually labeled ground truth is shown in the first column of Fig. 12. These samples form a representative group of residential building point clouds acquired by mobile LiDAR. There are different kinds of surroundings in these scenes: shrubs and conifers stand against the walls in Fig. 12(a), while in Fig. 12(b) the walls are surrounded by bushes and the roof connects with high vegetation. The buildings also vary in size and structure: Fig. 12(c) shows a rectangular house with a porch, Fig. 12(d) a squared building, and Fig. 12(e) a one-story large-area house. Besides, the sample point clouds differ in point density and degree of incompleteness: large parts of the walls in Fig. 12(f) are missing, and the point density in Fig. 12(g) is lower than in the others.

Four existing methods are introduced for comparison. The first is 3DNcut (Yu et al., 2015); the initial implementation is provided by Shi and Malik (2000) and extended to 3D point clouds. To obtain better performance, the input point clouds are iteratively segmented into eight parts, and the parts consisting mostly of building points are combined manually as the extracted building points. The implementation of Mincut (Golovinskiy and Funkhouser, 2009) is taken from the Point Cloud Library (PCL) (Rusu and Cousins, 2011); the seeds and radius for Mincut are selected manually. Besides, we compare our method with the multi-scale super-voxel grouping method (MSG) (Yang et al., 2015) and the voxel grouping method (VG) (Wang et al., 2016). It should be pointed out that no color information is used with MSG (Yang et al., 2015) in our tests. The results of the four existing methods and our proposed method are shown in Fig. 12.

Fig. 12. Examples of building extraction. Sub-figures (a)-(g): Sample 1 - Sample 7. From left column to right column: ground truth, results of 3DNcut (Yu et al., 2015), results of Mincut (Golovinskiy and Funkhouser, 2009), results of MSG (Yang et al., 2015), results of VG (Wang et al., 2016), and our results (β = 1.0, γ = 1.0). Blue points are building points and green points are non-building points.

To quantitatively evaluate the performance of these methods, three indicators, completeness, correctness, and the F1 measure, are calculated. The completeness and correctness take the same form as Eq. (9), where TP is the number of building points that are correctly extracted, FN is the number of missed building points, and FP is the number of non-building points that are falsely labeled as building points. The F1 measure, which balances the completeness and correctness, is defined as

  F1 = 2TP / (2TP + FN + FP).    (10)
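Eqs. (9) and (10) translate directly into a small helper, sketched here for point-level evaluation with boolean per-point masks for the predicted and ground-truth building points.

    import numpy as np

    def evaluate(pred, truth):
        """pred, truth: boolean arrays; returns (completeness, correctness, F1)."""
        tp = np.sum(pred & truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        completeness = tp / (tp + fn)
        correctness = tp / (tp + fp)
        f1 = 2 * tp / (2 * tp + fn + fp)
        return completeness, correctness, f1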
The completeness of all results is shown in Fig. 13. Among the five methods, our proposed method achieves the highest completeness on all samples. High completeness indicates that most of the building points have been retrieved from the original point clouds. Compared with our algorithm, the other methods score much lower. The completeness of Mincut is the highest among the four compared methods, but its results depend heavily on the foreground seed and the predefined object radius. The average completeness of 3DNCut (85.75%) is a little lower than that of Mincut (89.32%), and it also needs manual work when merging multiple segments. In rectangle #6 of Fig. 12(f), shrubs are closely adjacent to the walls, which strongly affects the classification of voxel-based features (i.e., linear, planar, and volumetric): if most of the points in a super-voxel are vegetation, all the points in that super-voxel are labeled as non-building points. The accuracy of the feature calculation can be improved by scale selection as in VG (Wang et al., 2016), which yields better completeness on Sample 6, but this type of improvement is limited in complex scenes; for example, in rectangle #2 of Fig. 12(b), the roof points are still mistakenly labeled as non-building points by VG (Wang et al., 2016). Although our proposed method is also based on local geometric features, the graph-cut framework with shape priors is effective in overcoming the shortcomings of local features: the roof points in Fig. 12(b) and the incomplete wall points in Fig. 12(f) are all labeled correctly. Besides, the predefined grouping rules in MSG and VG are not always valid, especially in complex residential areas. For instance, linear structures and planar structures are merged in VG (Wang et al., 2016) if the normal or principal directions between them differ by less than a small threshold (e.g., 10°). This rule may not be applicable when merging some small components of buildings, such as the low steps in #1 in Fig. 12(a), the porch in rectangle #3 in Fig. 12(c), and the eaves in rectangle #6 in Fig. 12(f). In comparison, our method achieves better results on buildings containing detailed structures.

Fig. 13. Completeness of extraction results using different methods.

The correctness is shown in Fig. 14. High correctness indicates the precision of a building extraction method, i.e., the percentage of actual building points among the extracted points. The correctness of MSG and VG is over 95% in both cases. The main reason for the good correctness of the voxel-based methods is that, in most cases, only planar voxels are merged based on the rules, and planar voxels mainly come from buildings in residential areas. In fact, the correctness of our method is slightly worse than that of MSG and VG. This is mainly because our method can extract more of a building's non-planar structures (e.g., balustrades) than the voxel-based techniques, which also increases the probability of labeling adjacent non-building objects as buildings. In general, the correctness of 3DNCut and Mincut is much worse than that of the other three methods, for three main reasons. First, local geometric features are not used in either method, so the prior knowledge that most buildings consist of planar structures is not exploited. Second, the extraction results of both methods depend heavily on human-computer interaction. For example, in rectangle #4 of Fig. 12(d), a large part of the building is mistakenly labeled as background by Mincut with a foreground radius of 12.0 m; by increasing the radius to 15.0 m, the points in the rectangle may be marked as building points, but the tree in the middle is then also mistakenly labeled as building. Third, small objects close to a building are often merged into it by these two methods; for instance, the low grass in rectangle #4 of Fig. 12(d) is entirely recognized as building points.

Fig. 14. Correctness of extraction results using different methods.

The balanced accuracy, the F1 measure, is given in Fig. 15. Our proposed method achieves the highest average F1 value (around 97.89%) among the five methods. The second highest mean F1, over 92%, is achieved by VG (Wang et al., 2016). The F1 values of MSG (Yang et al., 2015) and Mincut (Golovinskiy and Funkhouser, 2009) also reach nearly 90% on these samples, and the lowest average F1, 87%, is obtained by 3DNcut (Yu et al., 2015). In general, our proposed method has the best performance on these residential building samples. Voxel-based methods can also achieve good results when the local features are accurately estimated and the predefined voxel grouping rules are correct. The performance of the 3DNcut and Mincut methods is not as good as that of the other three methods, but they are more flexible in handling various scenes and can always output acceptable results; in this building extraction task, the average F1 measures of these two methods are both over 86%.

Fig. 15. F1 measure of extraction results using different methods.

According to the experiments, our proposed method also has limitations. If the shape prior is wrong or unavailable, the extraction results will be affected. For instance, the planar walls in #7 in Fig. 12(g) fail to be extracted because no shape prior is available in that region. Besides, if non-building objects with flat surfaces (e.g., cars, bushes, trunks) are very close to the house, they may be mislabeled as buildings; for instance, although the roof points in #2 are correctly recognized as building parts, the regular hedges in #8 in Fig. 12(b) are mislabeled as parts of the building.
4.3.2. Parameter tuning

There are two parameters, β and γ, in the final formulation of the energy function in Eq. (8). In the examples above, the values of β and γ are both fixed to 1.0. To analyze the influence of the parameters as well as the importance of each term, the ranges of β and γ are both set to [0, 2.0] with an interval of 0.1; the upper limit of 2.0 is chosen because the segmentation accuracy decreases when larger values are used. The proposed building extraction algorithm is executed with all combinations of these parameters on the seven samples, and the results are evaluated against the ground truth. Finally, the average F1 measure over all samples is calculated for each combination and drawn as a mesh in Fig. 16.

Fig. 16. Evaluation of different parameter settings of β and γ. The Z axis is the mean F1 measure over the seven samples.

According to Fig. 16, the lowest F1 is close to 85% when no spatial relationship between points is considered, i.e., when β and γ are both zero. The F1 increases with β and γ and reaches a plateau of high values in the middle of the mesh. The highest F1 measure, 97.99%, is achieved when β and γ are both 0.9; beyond this point, the F1 value decreases as β and γ increase. For example, increasing β and γ to 2.0 drops the average F1 to 96.67%; with γ fixed to 0.9, the F1 drops to 97.5% when β is increased to 2.0. In this paper, we set β and γ to 1.0 for simplicity, and a high F1 measure of 97.89% is achieved.

In Fig. 16, the importance of the shape term (the last term in Eq. (8)) is obvious. If γ is set to zero, which means the last term in Eq. (8) is discarded, the mean F1 first increases steadily and then stays constant at about 94% as β increases. However, by increasing the weight of the shape term, the F1 value quickly exceeds 97%. Although the F1 increment is relatively small (about three percentage points), most of the improvement occurs at detailed structures such as porches and walls surrounded by other objects. This indicates that the performance of building extraction is improved considerably by the shape prior derived from segments, especially for incomplete buildings with complex surroundings. A more significant example can be found in Fig. 17, which shows how the F1 values of Sample 6 (Fig. 12(f)) change with different parameters: the highest F1 without the shape term (γ = 0) is 90.8%, achieved by setting β to 1.4, while it rapidly exceeds 98.0% once γ is set to 0.1.

Fig. 17. Performance of the proposed method using different parameters on Sample 6. The Z axis is the F1 measure of Sample 6.

Compared with the shape term, the second term in Eq. (8) appears less important. For example, with β set to zero, a high F1 of 97.4% is achieved with the shape term alone (γ = 0.9), and the F1 increases to 97.9% when β is set to 1.0. This improvement (around 0.5 percentage points) is not significant in point cloud processing and could be discarded to reduce model complexity. However, the existence of the second term in Eq. (8) guarantees a more reliable model when shape priors from segments are unavailable or even wrong. Therefore, we keep this term in our final formulation in Eq. (8).
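The sweep itself is a plain grid search and can be sketched as below; `run_extraction` and `samples` are hypothetical placeholders for the pipeline described in Section 3.3 and the seven samples of Table 2, and `evaluate` is the helper sketched in Section 4.3.1.

    import numpy as np

    betas = gammas = np.arange(0.0, 2.01, 0.1)        # [0, 2.0] with step 0.1
    mean_f1 = np.zeros((len(betas), len(gammas)))
    for bi, beta in enumerate(betas):
        for gi, gamma in enumerate(gammas):
            scores = [evaluate(run_extraction(s, beta=beta, gamma=gamma), s.truth)[2]
                      for s in samples]               # F1 per sample
            mean_f1[bi, gi] = np.mean(scores)
    best = np.unravel_index(np.argmax(mean_f1), mean_f1.shape)
    print(f"best mean F1 {mean_f1[best]:.4f} at "
          f"beta={betas[best[0]]:.1f}, gamma={gammas[best[1]]:.1f}")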
The performance of the 3DNcut and Mincut methods is not as good as that of the other three methods. However, they are more flexible in handling various scenes and can always output acceptable results. In this building extraction task, the averaged F1 measures of these two methods are both over 86%. According to the experiments, our proposed method also has limitations. If the shape prior is wrong or unavailable, the extraction results will be affected. For instance, the planar walls in rectangle #7 in Fig. 12(g) fail to be extracted, as no shape prior is available in that region. Besides, if non-building objects with flat surfaces (e.g. cars, bushes, trunks) are very close to a house, they may be mislabeled as building. For instance, although the roof points in rectangle #2 are correctly recognized as building parts, the regular hedges in rectangle #8 in Fig. 12(b) are mislabeled as parts of the building.

4.3.2. Parameter tuning

There are two parameters, β and γ, in the final formulation of the energy function in Eq. (8). In the above examples, the values of β and γ are both fixed to 1.0. To analyze the influence of the parameters as well as the importance of each term, the ranges of β and γ are both set to [0, 2.0] with an interval of 0.1. The upper limit of 2.0 is selected in this study because the segmentation accuracy decreases when larger parameters are used. The proposed building extraction algorithm is executed with all combinations of these parameters on the seven samples, and the results are evaluated against the ground truth. Finally, the averaged F1 measure over all samples is calculated and drawn as a mesh in Fig. 16.

According to Fig. 16, the lowest F1 is close to 85% when no spatial relationship between points is considered, i.e., when β and γ are both zero. The F1 increases with β and γ and reaches a plateau of high values in the middle of the mesh. The highest F1 measure, 97.99%, is achieved when β and γ are both 0.9. From this point, the F1 value decreases as β and γ increase further. For example, by increasing β and γ to 2.0, the averaged F1 drops to 96.67%. Besides, if γ is fixed to 0.9, the F1 drops to 97.5% when β is increased to 2.0. In this paper, we set β and γ to 1.0 for simplicity, and a high F1 measure of 97.89% is achieved.

In Fig. 16, the importance of the shape term (the last term in Eq. (8)) is obvious. If γ is set to zero, which means the last term in Eq. (8) is discarded, the mean F1 first increases steadily and then stays at around 94% as β increases. However, by increasing the weight of the shape term, the F1 value quickly exceeds 97%. Although the F1 increment is relatively small (around three percentage points), most of the improvements occur at detailed structures such as porches and walls surrounded by other objects. This indicates that the performance of building extraction is improved considerably by considering the shape prior derived from segments, especially for incomplete buildings with complex surroundings. A more significant example can be found in Fig. 17, which shows how the F1 value of Sample 6 (Fig. 12(f)) changes with different parameters. Without the shape term, i.e., γ = 0, the highest F1 of 90.8% is achieved by setting β to 1.4, while the F1 rapidly exceeds 98.0% by setting γ to 0.1.

Fig. 17. Performance of the proposed method using different parameters on Sample 6. The Z axis is the F1 measure of Sample 6.
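The surfaces in Figs. 16 and 17 come from a plain grid search over β and γ. A minimal sketch follows, under stated assumptions: `extract_buildings` is a hypothetical stand-in for the proposed segmentation, and `evaluate_extraction` is the helper sketched earlier.

```python
import numpy as np

# Grid from the paper: beta and gamma in [0, 2.0], sampled every 0.1.
BETAS = np.round(np.arange(0.0, 2.01, 0.1), 1)
GAMMAS = np.round(np.arange(0.0, 2.01, 0.1), 1)

def sweep_parameters(samples):
    """Mean F1 over all samples for every (beta, gamma) pair (cf. Fig. 16).

    samples: list of (points, truth) tuples, one per building sample.
    Returns a (21, 21) array; its argmax identifies the best pair.
    """
    mean_f1 = np.zeros((len(BETAS), len(GAMMAS)))
    for i, beta in enumerate(BETAS):
        for j, gamma in enumerate(GAMMAS):
            scores = [
                evaluate_extraction(extract_buildings(pts, beta, gamma), truth)[2]
                for pts, truth in samples  # extract_buildings is hypothetical
            ]
            mean_f1[i, j] = np.mean(scores)
    return mean_f1
```

Plotting the returned array as a surface reproduces the mesh style of Fig. 16.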
Compared to the shape term, the second term in Eq. (8) seems less important. For example, by setting β to zero, a high F1 of 97.4% is achieved with the shape term alone (γ = 0.9). The F1 increases to 97.9% by setting β to 1.0. This improvement (around 0.5 percentage points) is not significant in point cloud processing, and the term could be discarded to reduce the model complexity. However, the existence of the second term in Eq. (8) guarantees a more reliable model when shape priors from segments are unavailable or even wrong. Therefore, we keep this term in our final formulation in Eq. (8).

4.3.3. Large-scale results

Our building extraction method is also tested on both datasets. The point-by-point accuracy of dividing non-ground points into groups in Section 3.2 is not evaluated in this paper due to the difficulty of defining the ground truth. For example, it is hard to determine which building the fences shared by adjacent buildings belong to in point clouds. When the building dividing step is evaluated at the object level, the 327 correctly localized buildings are all separated correctly in the Calgary dataset. The remaining 48 buildings that failed in the detection step are manually clipped from the original data and serve as input for the building extraction algorithm. In this paper, we focus on the performance of the proposed segmentation-based building extraction method in Section 3.3. It took about 1.5 h to extract all buildings from the MLS point clouds in the Calgary dataset, and about 20 min were spent on the Kentucky dataset. The building extraction results of the Calgary dataset are shown in Fig. 18 and the results of the Kentucky dataset are shown in Fig. 19. By examining the two overviews and the enlarged details, we find that most building points are correctly extracted from their surroundings, regardless of the building shapes and cluttered vegetation.

Fig. 18. Instance-level building extraction overview for the Calgary dataset. Orange points are ground. Green points are non-building objects. Individual buildings are randomly colored. To reduce the huge data volume, the ground points shown in the overview are down-sampled.
As the evaluation of the whole dataset is extremely time-consuming and labor-intensive, two relatively small scenes in the Calgary dataset, named Scene1 and Scene2, are selected to estimate the performance of our building extraction method in large-scale scenes. Scene1 covers an area of 161 m by 55 m and consists of 22,618,943 points. Scene2 covers an area of 294 m by 103 m and consists of 21,423,155 points. There are 56 buildings in these two scenes: nine buildings are from Scene1 and 47 buildings are from Scene2. The results of Scene1 are shown in Fig. 20 and the Scene2 results are shown in Fig. 21. The average completeness is 99.1%, the mean correctness is 97.9%, and the mean F1 measure equals 98.6%. Furthermore, we also randomly select ten building instances from the Kentucky dataset and evaluate the accuracy. It turns out that the mean completeness is 98.83%, the mean correctness is 96.76%, and the F1 measure equals 97.78%. The accuracy on the Kentucky dataset is a little lower than on the Calgary dataset. The main reason is that many vehicles are parked close to the buildings in the Kentucky scene. In Fig. 20, the buildings are large and most vegetation is not adjacent to the walls. By visually inspecting the details, most of the extraction results are correct. In the scene of Fig. 21, the buildings vary in size and shape. Although the vegetation is much closer to the buildings than in Fig. 20, our method is still able to extract most buildings correctly.

There exist problems in the large-scale tests. For example, in the enlarged view in Fig. 20, the building points indicated by the dashed circle are mis-recognized as ground points during the ground filtering step. Another example lies in the dashed circle in the enlarged rectangle #1 in Fig. 21: the car very close (less than 0.5 m) to the building is mislabeled as a building part. One potential way to solve this problem is to further segment the building points into smaller clusters and then identify the class of each cluster. In summary, these experiments demonstrate that our proposed building extraction method is able to achieve high quality building extraction results at the instance level.

Fig. 19. Instance-level building extraction overview of the Kentucky dataset. Colors are the same as those in Fig. 18.

Fig. 20. Building extraction details in Scene1. Orange points are ground. Non-building points are colored green. Building points from different instances are randomly colored. Building points in the circle are mislabeled as ground after ground filtering.

Fig. 21. Building extraction details in Scene2. Orange points are ground. Non-building points are colored green. Building points from different instances are randomly colored. In the circled region, car points are mislabeled as building.

4.4. Discussion

In this study, retrieving building instances from mobile LiDAR point clouds is treated as a specific problem of instance-level object extraction. Generally, instance-level object extraction can be decomposed into two subtasks, object localization and semantic segmentation (Golovinskiy et al., 2009; Liang et al., 2017). Its final output should be separated objects with class labels. If our goal were merely extracting building regions from mobile LiDAR point clouds, the localization would only have adverse effects on the final results. However, the goal of our research is extracting building instances. Therefore, the localization is the key to dividing building regions into individual instances, or to merging discrete and unorganized building points into independent building objects.

There are other ways to group building points into individual building instances. For example, Wang et al. (2016) propose a rule-based method which merges potential building points into independent instances. However, their low-rise building detection rate (86.3%) is lower than ours (95.89% and 100% in the two datasets). More importantly, the performance of rule-based instance extraction methods depends on two factors. The first factor is the predefined rules, which are difficult to design because buildings in the real world have various structures. The second factor is the accuracy of building point recognition. Wrongly labeled points reduce the grouping accuracy. For example, if non-building points between two buildings are falsely marked as buildings, the two buildings have a high possibility of being merged into one instance. In fact, building point labeling is still a challenging task. The state-of-the-art accuracy of building point labeling in ground-based LiDAR point clouds is approximately between 85% and 95% (Wang et al., 2015; Weinmann et al., 2015), depending on many factors such as data quality and scene complexity.

Even if the semantic labeling of building points were infallible, i.e., the overall accuracy were 100%, dividing building points into instances would still be a challenging task. For example, a distance-based clustering method may be used for separating adjacent building instances. In Fig. 22(a), although the distance between the closest walls of two adjacent buildings is over 2.0 m, we have to use a much smaller clustering threshold (< 0.4 m) to separate these two buildings due to a small protruding component between them. However, for other buildings, a grouping threshold below 0.4 m may result in over-detection, which means that one building instance is divided into several clusters. For example, the building in Fig. 22(b) has a gap of 0.82 m due to occlusion; a minimal illustration of this threshold dilemma is sketched below.
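To make the dilemma concrete, the following sketch implements plain Euclidean clustering (single-linkage with a fixed radius) over point coordinates. It illustrates the generic technique discussed above, not a component of our pipeline; the radius values in the comment mirror the examples of Fig. 22.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def euclidean_clusters(points: np.ndarray, radius: float) -> np.ndarray:
    """Label points by single-linkage clustering with a fixed distance threshold."""
    tree = cKDTree(points)
    pairs = np.array(list(tree.query_pairs(r=radius)))  # point pairs within radius
    n = len(points)
    if len(pairs) == 0:
        return np.arange(n)  # every point is its own cluster
    rows = np.concatenate([pairs[:, 0], pairs[:, 1]])
    cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    return labels

# A 0.4 m radius separates the two houses of Fig. 22(a) despite the protrusion,
# yet the same radius splits the occluded building of Fig. 22(b) (0.82 m gap)
# into two clusters: no single threshold handles both scenes.
```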
Also, it is difficult to retrieve building instances when buildings are connected by building components such as enclosing walls or adjacent eaves. For example, the buildings in Fig. 22(c) are connected by enclosure walls (not building outer walls), which results in one merged building with the rule-based method or the clustering method. In contrast, our proposed building localization method can deal with all of these situations without tuning parameters or introducing specific rules.

Fig. 22. Challenges of building instance extraction in different scenes. (a) A protruding component between two adjacent building instances. (b) Gaps within one building due to occlusions. (c) Building instances connected by enclosure walls. Colors are the same as those in Fig. 18.

There are mainly three advantages of localizing buildings before building point extraction. First, knowing the positions of the building instances, non-building objects far away from the potential buildings, such as vehicles on the street, can be removed at an early step, which largely reduces the computational burden. Second, if the input to the graph-based segmentation were large-scale point clouds, a graph consisting of tens of millions of vertices would have to be constructed, which is not feasible on most desktops. To solve this problem, we divide the original point clouds into small groups based on the localization results, which reduces the problem size and improves the efficiency (a minimal sketch of this grouping step is given at the end of this section). Third, the shape prior is used in our proposed segmentation algorithm. As different buildings often have different structural priors, shape information from adjacent buildings can be excluded with the help of the localization.

One major limitation of the proposed framework is that it cannot separate buildings that are connected by outer walls, such as façades. These situations often appear in downtown areas, and such buildings may have only one wall scanned by mobile LiDAR due to occlusions.
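As a rough illustration of the second advantage, the grouping step can be as simple as cropping the non-ground points with each localized rectangle plus a buffer. The sketch below is not the paper's implementation: it assumes axis-aligned rectangles given as (xmin, ymin, xmax, ymax), whereas the rotated rectangles produced by the actual localization step would first be handled in their local frames, and the 2.0 m buffer is an arbitrary illustrative value.

```python
import numpy as np

def group_by_rectangles(points: np.ndarray, rects, buffer: float = 2.0):
    """Split an (N, 3) cloud into one candidate group per localized building.

    rects: iterable of (xmin, ymin, xmax, ymax) building rectangles (assumed).
    buffer: margin in meters kept around each rectangle so that eaves and
            attached structures are not cut off.
    """
    groups = []
    for xmin, ymin, xmax, ymax in rects:
        inside = (
            (points[:, 0] >= xmin - buffer) & (points[:, 0] <= xmax + buffer)
            & (points[:, 1] >= ymin - buffer) & (points[:, 1] <= ymax + buffer)
        )
        groups.append(points[inside])  # one small graph-cut problem per group
    return groups
```

Each group is then segmented independently, so the graph size is bounded by a single building's neighborhood rather than by the whole scene.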
5. Conclusion

Building instance extraction from MLS point clouds in residential areas faces several challenges, such as how to separate adjacent buildings and how to extract buildings from cluttered vegetation. In this paper, we propose a “localization then segmentation” framework which solves most of these problems and achieves instance-level building extraction results. The building localization is turned into a problem of finding rectangles formed by projected vertical wall segments. A hypothesis and selection strategy is proposed to approach this problem. First, hundreds of rectangle proposals are generated using vertical walls. Then, the selection of rectangle hypotheses is formulated as an energy maximization problem solved by linear programming. The building detection results demonstrate that our method can localize buildings in dense and complex residential areas with high accuracy. To extract building points from complex surroundings, we propose a foreground-background segmentation method which integrates local geometric features and planar shape priors derived from segments into an energy model. Finally, the model is minimized by graph cuts. The experimental results show the advantages of our method, especially when the walls are incomplete or intimately connected with other objects. This is mainly contributed by the proposed shape term in the objective function. Besides, we argue that the use of shape priors can also improve the performance of point cloud segmentation in other applications.

Our methods still have some limitations. In the building localization step, multiple detections occur, and buildings with only one detected wall segment are easily missed. Our building extraction method has difficulty in distinguishing planar non-building objects close to the walls. Also, for non-planar building structures, a shape prior may not be available. Besides, our methods cannot extract building instances whose walls are spatially connected, such as façades in urban areas. Therefore, future work will focus on reducing multiple detections, finding advanced shape priors for complex structures, and developing instance-level segmentation methods for connected buildings. In fact, partitioning connected façades into instances is a complicated problem, and a different approach such as a parsing-based algorithm may be needed. In the work of Martinovic et al. (2015), façade images are first classified into semantic regions such as windows, walls, and doors based on color and geometric features, and the façade separation is then turned into a multi-label optimization problem. In the field of mobile LiDAR point cloud processing, Hammoudi et al. (2010) try to divide façade point clouds into building instances with the help of existing cadastral maps. Serna et al. (2016) propose a city block-level façade segmentation method based on influence zone analysis and test their method on an urban building dataset acquired by mobile LiDAR (Vallet et al., 2015). However, the problem of how to divide block-level façades into building instances is not discussed. In summary, extracting instance-level buildings directly from façade point clouds is a challenging task and remains unsolved. Tackling these problems requires new methods, which may be similar to façade parsing studies (Shen et al., 2011).

In short, the building localization method is proposed to divide building regions into building instances. Existing methods such as rule-based and clustering methods do not perform well in many real-world situations, and they cannot process connected façades. Moreover, our building localization method also provides an approach to detect buildings from the original point clouds without supervised classification, and the building positions can be used for extracting instances from classified point clouds. To deal with connected building instances such as façades, methods based on façade parsing may be developed in the future.

Acknowledgments

The first author is supported by the China Scholarship Council and the University of Calgary. This work is partially supported by the Natural Sciences and Engineering Research Council (NSERC). We would like to thank the City of Calgary Council and Dr. Ruigang Yang for providing the mobile LiDAR data.
References

Aijazi, A.K., Checchin, P., Trassoudaine, L., 2013. Segmentation based classification of 3d urban point clouds: a super-voxel based approach with evaluation. Rem. Sens. 5 (4), 1624–1650.
Awrangjeb, M., Ravanbakhsh, M., Fraser, C.S., 2010. Automatic detection of residential buildings using lidar data and multispectral imagery. ISPRS J. Photogramm. Rem. Sens. 65 (5), 457–467.
Belton, D., Lichti, D.D., 2006. Classification and segmentation of terrestrial laser scanner point clouds using local variance information. Int. Arch. Photogram., Rem. Sens. Spatial Inform. Sci. 36 (Part 5), 44–49.
Boykov, Y., Kolmogorov, V., 2004. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26 (9), 1124–1137.
Boykov, Y., Veksler, O., Zabih, R., 2001. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23 (11), 1222–1239.
Boykov, Y.Y., Jolly, M.-P., 2001. Interactive graph cuts for optimal boundary & region segmentation of objects in nd images. In: Proceedings, Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 1. IEEE, pp. 105–112.
Che, E., Olsen, M.J., 2018. Multi-scan segmentation of terrestrial laser scanning data based on normal variation analysis. ISPRS J. Photogramm. Rem. Sens.
Chen, D., Wang, R., Peethambaran, J., 2017. Topologically aware building rooftop reconstruction from airborne laser scanning point clouds. IEEE Trans. Geosc. Rem. Sens. 55 (12), 7032–7052.
Cheng, L., Xu, H., Li, S., Chen, Y., Zhang, F., Li, M., 2018. Use of lidar for calculating solar irradiance on roofs and façades of buildings at city scale: methodology, validation, and analysis. ISPRS J. Photogramm. Rem. Sens. 138, 12–29.
Cote, M., Saeedi, P., 2013. Automatic rooftop extraction in nadir aerial imagery of suburban regions using corners and variational level set evolution. IEEE Trans. Geosci. Rem. Sens. 51 (1), 313–328.
Demantké, J., Mallet, C., David, N., Vallet, B., 2011. Dimensionality based scale selection in 3d lidar point clouds. Int. Arch. Photogramm., Rem. Sens. Spatial Inform. Sci. 38 (Part 5), W12.
Deng, H., Zhang, L., Mao, X., Qu, H., 2016. Interactive urban context-aware visualization via multiple disocclusion operators. IEEE Trans. Visual. Comput. Graphics 22 (7), 1862–1874.
Fan, H., Yao, W., Tang, L., 2014. Identifying man-made objects along urban road corridors from mobile lidar data. IEEE Geosci. Rem. Sens. Lett. 11 (5), 950–954.
Freedman, D., Zhang, T., 2005. Interactive graph cut based segmentation with shape priors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1. IEEE, pp. 755–762.
Gao, J., Yang, R., 2013. Online building segmentation from ground-based lidar data in urban scenes. In: 2013 International Conference on 3D Vision (3DV 2013). IEEE, pp. 49–55.
Golovinskiy, A., Funkhouser, T., 2009. Min-cut based segmentation of point clouds. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, pp. 39–46.
Golovinskiy, A., Kim, V.G., Funkhouser, T., 2009. Shape-based recognition of 3d point clouds in urban environments. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 2154–2161.
Guan, H., Li, J., Cao, S., Yu, Y., 2016. Use of mobile lidar in road information inventory: a review. Int. J. Image Data Fusion 7 (3), 219–242.
Gurobi Optimization, I., 2016. Gurobi optimizer reference manual. URL <http://www.gurobi.com>.
Hammoudi, K., Dornaika, F., Soheilian, B., Paparoditis, N., 2010. Extracting wire-frame models of street facades from 3d point clouds and the corresponding cadastral map. IAPRS 38 (Part 3A), 91–96.
Hernández, J., Marcotegui, B., 2009. Point cloud segmentation towards urban ground modeling. In: Urban Remote Sensing Event, 2009 Joint. IEEE, pp. 1–5.
Kang, J., Körner, M., Wang, Y., Taubenböck, H., Zhu, X.X., 2018. Building instance classification using street view images. ISPRS J. Photogramm. Rem. Sens.
Klasing, K., Wollherr, D., Buss, M., 2008. A clustering method for efficient segmentation of 3d laser data. In: IEEE International Conference on Robotics and Automation (ICRA 2008). IEEE, pp. 4043–4048.
Kolmogorov, V., Zabin, R., 2004. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26 (2), 147–159.
Lafarge, F., Descombes, X., Zerubia, J., Pierrot-Deseilligny, M., 2008. Automatic building extraction from dems using an object approach and application to the 3d-city modeling. ISPRS J. Photogramm. Rem. Sens. 63 (3), 365–381.
Li, M., Nan, L., Liu, S., 2016. Fitting boxes to manhattan scenes using linear integer programming. Int. J. Digital Earth 9 (8), 806–817.
Liang, X., Lin, L., Wei, Y., Shen, X., Yang, J., Yan, S., 2017. Proposal-free network for instance-level semantic object segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
Lin, C., Nevatia, R., 1998. Building detection and description from a single intensity image. Comput. Vis. Image Understanding 72 (2), 101–121.
Liqiang, Z., Hao, D., Dong, C., Zhen, W., 2013. A spatial cognition-based urban building clustering approach and its applications. Int. J. Geogr. Inform. Sci. 27 (4), 721–740.
Liu, C., Shi, B., Yang, X., Li, N., Wu, H., 2013. Automatic buildings extraction from lidar data in urban area by neural oscillator network of visual cortex. IEEE J. Sel. Top. Appl. Earth Observ. Rem. Sens. 6 (4), 2008–2019.
Maalek, R., Lichti, D.D., Ruwanpura, J.Y., 2018. Robust segmentation of planar and linear features of terrestrial laser scanner point clouds acquired from construction sites. Sensors 18 (3), 819.
Martinovic, A., Knopp, J., Riemenschneider, H., Van Gool, L., 2015. 3d all the way: Semantic segmentation of urban scenes from start to end in 3d. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4456–4465.
Nurunnabi, A., Belton, D., West, G., 2016. Robust segmentation for large volumes of laser scanning three-dimensional point cloud data. IEEE Trans. Geosc. Rem. Sens. 54 (8), 4790–4805.
Ok, A.O., Senaras, C., Yuksel, B., 2013. Automated detection of arbitrarily shaped buildings in complex environments from monocular vhr optical satellite imagery. IEEE Trans. Geosci. Rem. Sens. 51 (3), 1701–1717.
Pu, S., Vosselman, G., 2009. Knowledge based reconstruction of building models from terrestrial laser scanning data. ISPRS J. Photogramm. Rem. Sens. 64 (6), 575–584.
Qin, R., Huang, X., Gruen, A., Schmitt, G., 2015. Object-based 3-d building change detection on multitemporal stereo images. IEEE J. Sel. Top. Appl. Earth Observ. Rem. Sens. 8 (5), 2125–2137.
Rother, C., Kolmogorov, V., Blake, A., 2004. Grabcut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graphics (TOG) 23, 309–314.
Rusu, R.B., Cousins, S., 2011. 3d is here: Point cloud library (pcl). In: 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 1–4.
Rutzinger, M., Höfle, B., Oude Elberink, S., Vosselman, G., 2011. Feasibility of facade footprint extraction from mobile laser scanning data. Photogrammetrie-Fernerkundung-Geoinformation 2011 (3), 97–107.
Serna, A., Marcotegui, B., 2014. Detection, segmentation and classification of 3d urban objects using mathematical morphology and supervised learning. ISPRS J. Photogramm. Rem. Sens. 93, 243–255.
Serna, A., Marcotegui, B., Hernández, J., 2016. Segmentation of façades from urban 3d point clouds using geometrical and morphological attribute-based operators. ISPRS Int. J. Geo-Inform. 5 (1), 6.
Shen, C.-H., Huang, S.-S., Fu, H., Hu, S.-M., 2011. Adaptive partitioning of urban facades. ACM Trans. Graphics (TOG) 30, 184.
Shi, J., Malik, J., 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 (8), 888–905.
Vallet, B., Brédif, M., Serna, A., Marcotegui, B., Paparoditis, N., 2015. Terramobilita/iqmulus urban point cloud analysis benchmark. Comput. Graph. 49, 126–133.
Veksler, O., 2008. Star shape prior for graph-cut image segmentation. Comput. Vis.–ECCV 2008, 454–467.
Wang, Y., Cheng, L., Chen, Y., Wu, Y., Li, M., 2016. Building point detection from vehicle-borne lidar data based on voxel group and horizontal hollow analysis. Rem. Sens. 8 (5), 419.
Wang, Z., Zhang, L., Fang, T., Mathiopoulos, P.T., Tong, X., Qu, H., Xiao, Z., Li, F., Chen, D., 2015. A multiscale and hierarchical feature extraction method for terrestrial laser scanning point cloud classification. IEEE Trans. Geosc. Rem. Sens. 53 (5), 2409–2425.
Weinmann, M., Jutzi, B., Hinz, S., Mallet, C., 2015. Semantic point cloud interpretation based on optimal neighborhoods, relevant features and efficient classifiers. ISPRS J. Photogramm. Rem. Sens. 105, 286–304.
Xiao, J., Gerke, M., Vosselman, G., 2012. Building extraction from oblique airborne imagery based on robust façade detection. ISPRS J. Photogramm. Rem. Sens. 68, 56–68.
Xu, S., Wang, R., Zheng, H., 2017. Lidar point cloud segmentation via minimum-cost perfect matching in a bipartite graph. arXiv preprint arXiv:1703.02150.
Yang, B., Dong, Z., Zhao, G., Dai, W., 2015. Hierarchical extraction of urban objects from mobile laser scanning data. ISPRS J. Photogramm. Rem. Sens. 99, 45–57.
Yang, B., Wei, Z., Li, Q., Li, J., 2012. Automated extraction of street-scene objects from mobile lidar point clouds. Int. J. Rem. Sens. 33 (18), 5839–5861.
Yang, B., Xu, W., Dong, Z., 2013. Automated extraction of building outlines from airborne laser scanning point clouds. IEEE Geosci. Rem. Sens. Lett. 10 (6), 1399–1403.
Yu, B., Liu, H., Wu, J., Hu, Y., Zhang, L., 2010. Automated derivation of urban building density information using airborne lidar data and object-based method. Landscape Urban Plann. 98 (3), 210–219.
Yu, Y., Li, J., Guan, H., Wang, C., Yu, J., 2015. Semiautomated extraction of street light poles from mobile lidar point-clouds. IEEE Trans. Geosci. Rem. Sens. 53 (3), 1374–1386.
Zhang, K., Yan, J., Chen, S.-C., 2006. Automatic construction of building footprints from airborne lidar data. IEEE Trans. Geosci. Rem. Sens. 44 (9), 2523–2533.
Zhang, W., Qi, J., Wan, P., Wang, H., Xie, D., Wang, X., Yan, G., 2016. An easy-to-use airborne lidar data filtering method based on cloth simulation. Rem. Sens. 8 (6), 501.
Zheng, H., Wang, R., Xu, S., 2017. Recognizing street lighting poles from mobile lidar data. IEEE Trans. Geosci. Rem. Sens. 55 (1), 407–420.