

Introduction
The algorithm Partitioning Around Medoids (PAM, [5,6]), also known as k-medoids, is a popular clustering algorithm used as an alternative to k-means clustering when one wants to minimize distances other than the squared error. Similar to k-means, it aims at minimizing the sum of distances from a cluster center, but the cluster center in k-medoids is one of the data points, called a medoid, and the distance function may be arbitrary. This increases the flexibility over k-means, which uses the arithmetic mean as cluster center: the mean minimizes squared errors, and because of this k-means only minimizes Bregman divergences such as the squared Euclidean distance. Even on one-dimensional data it does not minimize the linear error, as easily seen from the difference between the arithmetic mean and the median. While k-means minimizes the sum of squared errors, k-medoids with k representative medoids m_i minimizes the absolute error criterion ("total deviation", TD):

TD = ∑_{i=1}^{k} ∑_{x_c ∈ C_i} d(x_c, m_i)

where d(x_c, m_i) is the distance between data point x_c of cluster C_i and medoid m_i; not necessarily the Euclidean distance, and not necessarily a metric. The difference between the arithmetic mean, the per-axis median, the geometric median, and the medoid of a data set is exemplified in Figure 1. It can be seen that the medoid is less sensitive to outliers than the arithmetic mean, and also that k-means does not minimize Euclidean distances (but the squared distances).
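To make the robustness difference concrete, here is a minimal sketch (our own illustrative code, not from the paper) comparing the arithmetic mean and a medoid on a tiny one-dimensional data set with an outlier:

```python
def medoid(points, dist):
    """Return the data point minimizing the sum of distances to all points."""
    return min(points, key=lambda m: sum(dist(x, m) for x in points))

data = [1.0, 2.0, 3.0, 100.0]                # one extreme outlier
mean = sum(data) / len(data)                 # 26.5, dragged towards the outlier
med = medoid(data, lambda a, b: abs(a - b))  # 2.0, stays within the bulk
```

The medoid remains one of the inner data points, while a single outlier pulls the mean far away from the majority of the data.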
In operations research, the k-medoids problem is also known as the (discrete) facility location problem. Several variants of this problem have been researched there. The variants differ mainly in the objective function to be minimized. For example, k-center instead minimizes the maximum distance of all points to their assigned cluster centers. There has been substantial research in the area of finding approximation algorithms for all these different problems.
Unfortunately, the algorithms commonly used for k-medoids are not very scalable to large problems, as we will discuss next.

Runtime Complexity of Partitioning Around Medoids
The k-medoids problem is NP-hard [4], hence we have to resort to approximate solutions, using greedy and local optimization techniques. The PAM algorithm is such an approach: its initialization (known as BUILD) is a greedy approximation to the k-medoids problem, which is afterwards refined using a local search (called SWAP). Greedy initialization chooses k times the point which reduces the error the most; local search then optimizes this solution by searching for the best way to swap one of the cluster centers with a non-center. While the name k-medoids resembles k-means, the standard PAM algorithm works differently from the standard k-means algorithm. A k-means-like strategy of alternating optimization for k-medoids has been proposed several times [2,9,11,12], but was shown to produce worse solutions than a swap-based approach such as PAM [13,14,17]. Kanungo et al. [3] proposed a swap-based approach to also improve the results of k-means, but it is rather expensive, as we will see next. The resulting runtime complexity of PAM is O(kN²i), where i is the number of iterations until convergence, for which little is known except that it usually is reasonably small, and likely has an unfavorably high worst case, just as with k-means.
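The swap-based local search can be sketched as follows. This is a deliberately naive version (it recomputes the full loss for every candidate exchange), not the optimized bookkeeping of PAM or FasterPAM, and all names are ours:

```python
import itertools

def total_deviation(D, medoids):
    """TD: sum over all points of the distance to their nearest medoid."""
    return sum(min(D[x][m] for m in medoids) for x in range(len(D)))

def pam_swap(D, medoids):
    """Naive SWAP: apply improving (medoid, non-medoid) exchanges
    until no exchange improves the total deviation."""
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        best = total_deviation(D, medoids)
        for i, x in itertools.product(range(len(medoids)), range(len(D))):
            if x in medoids:
                continue
            cand = medoids[:i] + [x] + medoids[i + 1:]
            td = total_deviation(D, cand)
            if td < best:
                best, medoids, improved = td, cand, True
    return medoids, best

# Two well-separated pairs on a line; the optimum places one medoid per pair.
pos = [0.0, 1.0, 10.0, 11.0]
D = [[abs(a - b) for b in pos] for a in pos]
medoids, td = pam_swap(D, [0, 1])   # poor start: both medoids in one pair
```

Starting with both medoids in the left pair, the SWAP phase moves one medoid to the right pair, reaching the optimal total deviation of 2.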
We have recently proposed improved versions of PAM named FastPAM [15] and FasterPAM [14], which provide a substantial speedup over PAM by eliminating the nested loop over the k medoids. By furthermore greedily performing the first swap that improves the loss (instead of the best swap) and using random initialization, this allowed us to decrease the runtime complexity to O(N²i) with an empirically much smaller i (but supposedly a similar theoretical worst case).
Because both methods use each pairwise distance several times (and they are particularly interesting to use with more complex and hence expensive distance functions), it is prohibitively slow not to precompute the pairwise distance matrix. Hence both methods also require O(N²) memory.

Sparse Partitioning Around Medoids
A large part of these pairwise distances may be unnecessary to know exactly. It is easy to see that, given some assignment of points to medoids and the maximum distance τ of this assignment, we could replace all values larger than τ in the input distance matrix with τ, and the solution would not change. Hence there is a natural "cut-off" to distances, and larger values do not contribute to the solution. If our distance function satisfies the triangle inequality, we may be able to omit computing some of these large distances (e.g., with the algorithm of Newling and Fleuret [10]).
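A small self-contained check of this cut-off property (our own illustrative data; ties exactly at τ aside, the assignment is unaffected by capping):

```python
def assign(D, medoids):
    """Assign every point to its nearest medoid (by index)."""
    return [min(medoids, key=lambda m: D[x][m]) for x in range(len(D))]

D = [[0, 2, 9], [2, 0, 8], [9, 8, 0]]
medoids = [0, 2]
orig = assign(D, medoids)
# tau: the maximum distance of any point to its assigned medoid
tau = max(min(D[x][m] for m in medoids) for x in range(len(D)))
D_cap = [[min(d, tau) for d in row] for row in D]   # cap all larger values
assert assign(D_cap, medoids) == orig               # assignment unchanged
```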
In this research, we want to focus on a different scenario, where the cut-off may be given in advance (and may differ for each point), but the distance is not necessarily metric. A real-world example of such a problem will be introduced in Section 4. While we can (and, effectively, will) treat distances considered uninteresting for the application as infinite (or sufficiently large) values, using sparse storage for the distances only immediately reduces the memory usage, not the runtime.
Unfortunately, this also easily breaks the optimization procedure, which relies on first finding a feasible initial solution and then performing local changes that improve the solution. A greedy strategy such as the one discussed above is usually not able to find a valid initial solution for a small k (and in particular, for a very small k the problem may become unsatisfiable with a finite loss). In such cases, the local optimization will also not help, as neighboring solutions will often still be invalid, and hence it makes no progress. This is most easily seen if the data set consists of many components that are not connected with edges of finite length.
Instead of searching directly for a solution with k centers, we can solve a second problem of k-medoids clustering at the same time: how should we choose k? Already with k-means clustering, choosing the "optimal" k has proven elusive to a general solution, and is mostly performed by some crude heuristic such as the infamous Elbow criterion, which is frequently misused.
If we allow the algorithm to vary k, we can much more easily find a valid initial solution (e.g., by choosing the best unconnected vertex until everything is covered). But of course this will usually yield a much higher number of clusters k than desired. If we instead perform a multi-criteria optimization in the refinement phase, we may be able to reduce the number of clusters along with minimizing our main objective.
When varying k, we will obtain a Pareto front of solutions that are all optimal in one way or another. This can be formalized as the set of solutions not "dominated" by any other solution in each criterion at the same time. To reduce the set of remaining candidate solutions, it is best if we have some additional constraints to satisfy, based on the particular problem to solve.
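Dominance filtering for such (k, loss) pairs can be sketched in a few lines (the numbers are illustrative only, not from our experiments):

```python
def pareto_front(solutions):
    """Keep (k, loss) pairs not dominated by another, distinct solution
    that is at least as good in both criteria."""
    return [s for s in solutions
            if not any(o != s and o[0] <= s[0] and o[1] <= s[1]
                       for o in solutions)]

# (k, loss) pairs: (5, 7.5) is dominated by (4, 7.0) and drops out
sols = [(3, 10.0), (4, 7.0), (5, 7.5), (6, 5.0)]
front = pareto_front(sols)
```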

Use Case: Simulation of Electrical Substation
We obtain networks using OSMOGrid, which implements the ideas of distribution network generation of Kays et al. [7] on the basis of public data (OpenStreetMap, OSM). The electrical grid is modeled to follow the streets, and the buildings are used to model consumers. Power consumption is estimated based on zoning and building size, and used to simulate the load flow in the grid. We made some graph simplifications in preparation for the problems presented below: we remove dead ends, and move the consumer locations (i.e., buildings) to the nearest point in the street network. Figure 2 shows the simulation based on the township Witten Stockum.
On the basis of this graph structure, there are different computational tasks for which resource-efficient clustering models are necessary. One of these tasks is the simulation of electrical substations within the graph. We want to identify the optimal positions of power substations, such that the electric losses in the network are minimized. As the electric loss is related to load, voltage, and the cable length, we approximate it using the distance between substations and their connected consumers, weighted by the load of the consumers. We describe this as a facility location problem, which originates from urban and public service planning. The objective function FL for facilities and demand points is

FL = ∑_j c(j) + ∑_i d(i, m(i))

with c(j) as the cost of opening facility j, and d(i, m(i)) as the distance between consumer i and the assigned center m(i). FL has strong similarities to the objective function of k-medoids. We take the facilities as the electrical substations and the consumers as the demand points. Figure 3 shows results of clustering the consumers with FasterPAM for k = 4 substations on the generated graph for Witten Stockum. We can observe that the cluster assignment follows the road network, and consumers are not necessarily assigned to the closest center "as the bird flies".
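A minimal sketch of evaluating FL (our own illustrative code; the load weighting mentioned above is omitted for brevity, and all names and numbers are ours):

```python
def fl_objective(open_facilities, opening_cost, dist, consumers):
    """FL = sum of opening costs of the open facilities
         + sum of distances of consumers to their assigned facility."""
    assignment = {i: min(open_facilities, key=lambda j: dist[i][j])
                  for i in consumers}
    total = sum(opening_cost[j] for j in open_facilities)
    total += sum(dist[i][assignment[i]] for i in consumers)
    return total, assignment

# Three consumers, two candidate substations 'a' and 'b'
dist = {0: {'a': 1.0, 'b': 5.0},
        1: {'a': 2.0, 'b': 2.0},
        2: {'a': 6.0, 'b': 1.0}}
cost, assignment = fl_objective(['a', 'b'], {'a': 3.0, 'b': 3.0},
                                dist, [0, 1, 2])
```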
Even with the FasterPAM improvements, the runtime complexity is O(N²i) for N nodes in the graph and i iterations of the optimization procedure. The underlying OSM planet file contains about 1.2 TB of data. Even though we are only interested in modeling smaller areas of the world, we need to reduce the complexity to solve the task for whole cities or regions in an acceptable runtime, as these will nevertheless contain several thousands of houses. We take advantage of some properties of a typical electrical network. We only consider nodes with at least 3 outgoing edges as possible optimal substation locations (except for disconnected points). The optimal position on a single edge is trivial to calculate and is neglected. Hence, it is beneficial to formulate this as an asymmetric problem, where demand points and facility locations are no longer the same set. The distance matrix then no longer has to be calculated for all node pairs, but only for the connections between demand points and substation locations. This reduces the complexity to O(Nmi) for N consumers and m possible substation locations, with m < N. If we further limit the maximum distance between a consumer and a substation (to limit the power losses), this distance matrix becomes sparse, i.e., we now have missing values that we can consider as infinite. If we do not store these missing values and iterate using appropriate sparse data structures, we expect to further reduce the runtime to O((e + N + m) · i) for e edges. Assuming a similar density of houses and roads everywhere, we can expect the number of edges e to be approximately linear in the area of the map we are processing.
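The sparse asymmetric matrix can be represented, for instance, as one dictionary of reachable facilities per consumer; entries missing from a row then act as infinite distances (our own sketch, not the ELKI data structure):

```python
def sparsify(dense, cutoff):
    """Keep only distances up to the cut-off; one dict per consumer row."""
    return [{j: d for j, d in enumerate(row) if d <= cutoff}
            for row in dense]

dense = [[1.0, 9.0],    # consumer 0: only facility 0 in range
         [8.0, 2.0],    # consumer 1: only facility 1 in range
         [9.0, 9.0]]    # consumer 2: nothing in range -> penalty candidate
sparse = sparsify(dense, 5.0)
```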

Sparse k-Medoids
To use k-medoids clustering for problems with asymmetric and sparse input data, we have to adapt the objective function of k-medoids. We still want to minimize the "total deviation" of all data points {x_1, ..., x_N} from the current set of medoids M ⊆ {y_1, ..., y_m}, but we no longer assume M ⊂ X as in traditional k-medoids. Furthermore, for some points there currently may be no closest reachable medoid m(x_i), and all distances from this x_i to the medoids in M are undefined. In such cases, we have to incorporate a penalty π(x_i) in our loss ℓ:

ℓ = ∑_{x_i assigned} d(x_i, m(x_i)) + ∑_{x_i unassigned} π(x_i)

Note that we allow the set M to change in size below. The penalty π(i) can be used to trade the loss of not covering all possible data points against having larger distances. We do not further consider tuning this parameter below, but instead use π(i) = π = const → ∞ to enforce complete coverage. Because such extreme values can cause numerical problems, our implementation always uses pairs (i, d) to store a loss (and a loss change): an integer i counting the number of unassigned points, and the sum of distances of the assigned points d, such that mathematically we have ℓ = i · π + d, but do not suffer from numerical problems.
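The (i, d) pair representation can be sketched as follows; Python's lexicographic tuple comparison directly implements the ordering induced by π → ∞ (our own illustrative code):

```python
def add_loss(a, b):
    """Add two (unassigned_count, distance_sum) loss pairs componentwise."""
    return (a[0] + b[0], a[1] + b[1])

# Tuple comparison is lexicographic: any solution leaving fewer points
# unassigned wins, and ties are broken by the summed distance d.
l1 = (0, 42.0)   # all points assigned, total distance 42
l2 = (1, 3.0)    # one unassigned point: worse despite the smaller d
assert l1 < l2
assert add_loss(l1, (0, 8.0)) == (0, 50.0)
```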
Based on this objective function, we introduce DynBUILD (Dynamic Asymmetric BUILD initialization) as an adaptation of the BUILD algorithm of Kaufman and Rousseeuw [5,6] to asymmetric sparse input data sets. The greedy BUILD approach is supplemented by a dynamic increase of k if, after choosing k medoids, some objects (still) are not reachable from the current set of medoids. The algorithm hence always chooses at least k medoids and covers all consumers. As a baseline, the strategy denoted Random simply uses a given percentage of points as initial cluster centers, and may hence yield an initial solution where constraints are violated, but our improved DynSWAP procedure will repair these while optimizing the assignment. Sparse++ is an adaptation of the well-known k-means++ [1] method to sparse data, where cluster centers are chosen proportionally to how many points they cover (again, we continue choosing additional centers until all constraints are satisfied).
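A simplified sketch of the DynBUILD idea on the sparse consumer → {facility: distance} structure (our own code, not the ELKI implementation; it assumes enough facilities exist to cover every consumer, and compares gains as (newly covered, distance reduction) pairs to emulate π → ∞):

```python
import math

def dyn_build(sparse, facilities, k):
    """Greedy BUILD on sparse consumer -> {facility: distance} data.
    Keeps adding medoids beyond k until every consumer is covered."""
    nearest = [math.inf] * len(sparse)
    medoids = []
    while len(medoids) < k or any(d == math.inf for d in nearest):
        def gain(f):
            covered, reduced = 0, 0.0
            for i, row in enumerate(sparse):
                if f in row and row[f] < nearest[i]:
                    if nearest[i] == math.inf:
                        covered += 1                  # newly covered consumer
                    else:
                        reduced += nearest[i] - row[f]
            return (covered, reduced)                 # coverage dominates
        best = max((f for f in facilities if f not in medoids), key=gain)
        medoids.append(best)
        for i, row in enumerate(sparse):
            if best in row:
                nearest[i] = min(nearest[i], row[best])
    return medoids
```

For k=1 on a data set where no single facility covers all consumers, the loop keeps adding medoids until full coverage is reached.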
We introduce DynSWAP (Dynamic SWAP for asymmetric sparse data) as a dynamic SWAP algorithm based on FasterPAM [14,15], adapted to dynamically reduce k while efficiently processing asymmetric and sparse input data. DynSWAP differs from FasterPAM's SWAP in two ways: to dynamically change k depending on the constraints, (i) we check after each swap whether we can reduce k without violating a constraint (line 30), and (ii) if the current object is not suitable for swapping but reduces the number of violated constraints when added as a new medoid (line 34), then we make it an additional medoid. We deliberately choose to only reduce k if we also perform a swap, so as to alternate between optimizing the existing medoids and learning the number of clusters k. Both checks are very efficient to implement, as we already know the removal loss change for all medoids (∆ℓ_{−m_1}, ..., ∆ℓ_{−m_k}, also needed by the FastPAM improvement over PAM), and we also have in ∆ℓ⁺ the loss change when adding a new medoid. We can remove the medoid m_i without breaking any constraint if the π component of its removal loss is zero, i.e., ∆ℓ_{−m_i,π} = 0, and making the current candidate y_j a new medoid is beneficial if ∆ℓ⁺_π < 0. Whenever adding, removing, or swapping a medoid, we need to update for all data points x_o the nearest medoid n_1(o), the distance to the nearest medoid d_{n1}(o), and the distance to the second nearest medoid d_{n2}(o). This can be done more efficiently by updating the previous values, exactly as in FasterPAM. Based on this information, we can also efficiently update ∆ℓ_{−m_1}, ..., ∆ℓ_{−m_k}, the loss change for removing each medoid: for each object, removing the nearest medoid incurs a loss change of (0, d_{n2}(o) − d_{n1}(o)) if there is a second nearest medoid, and (π(o), −d_{n1}(o)) otherwise. Removing any medoid except the nearest does not incur a loss change.
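The removal-loss bookkeeping described above can be sketched like this (our own simplified code; unassigned points are marked with a nearest medoid of None and skipped, since removing a medoid cannot affect them):

```python
import math

def removal_losses(nearest1, d1, d2):
    """Loss change (unassigned_delta, distance_delta) for removing each
    medoid, computed from the cached nearest medoid n1(o) and the
    distances d_n1(o), d_n2(o) of every point o."""
    loss = {}
    for o, m in enumerate(nearest1):
        if m is None:                # point is unassigned; unaffected
            continue
        dl = loss.setdefault(m, [0, 0.0])
        if d2[o] < math.inf:
            dl[1] += d2[o] - d1[o]   # falls back to its second nearest
        else:
            dl[0] += 1               # would become unassigned (costs pi)
            dl[1] -= d1[o]
    return {m: tuple(v) for m, v in loss.items()}
```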
When computing the loss change for adding a new candidate medoid y_c, we initialize an array ∆ℓ with the removal loss of each existing medoid, an optimization from FastPAM [15]. To avoid an inner loop over all k medoids, we also incorporate an idea from FasterPAM [14], namely to accumulate the loss change that applies to all medoids in the variable ∆ℓ⁺. An interesting property of ∆ℓ⁺ is that it is the loss change for adding a new medoid, which we also use for our logic of dynamically increasing the number of clusters. We benefit from sparsity in this approach because we do not have to consider objects that are not neighbors of the candidate y_c: the loss change from removing existing medoids has already been accounted for, and as these objects are not reachable from y_c, there is no loss change when adding the replacement medoid. Because of this, our loop only needs to iterate over the neighbors. For each neighbor x_o, we distinguish four cases: (1) the point is currently not yet covered, hence we gain π(o) but incur d(x_o, y_c) in line 12; (2) the new medoid is closer than all existing medoids, and hence we gain d_{n1}(o) − d(x_o, y_c) in line 14; for the case of removing the nearest medoid, we have already included d_{n1}(o), and hence we have to cancel this out (either with −π(o) or d_{n2}(o)). If the new medoid is only second nearest, and there is (3) no previous second nearest, only the loss of removing the nearest medoid needs to be updated in line 20. If (4) a previous second nearest exists, but it is farther than the new medoid, we also need to adjust the loss of removing the nearest medoid by the difference between assigning to the new medoid instead of the previous second closest in line 22. Similar case distinctions, except for handling the case of an undefined second closest, can already be found in FasterPAM [14].
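The four cases can be sketched as follows (our own simplified code, not the ELKI implementation; it returns the per-medoid swap loss including the shared term ∆ℓ⁺, and ∆ℓ⁺ itself as the loss of adding y_c as an extra medoid; losses are the (unassigned_count, distance_sum) pairs from above):

```python
import math

def candidate_gain(neighbors, nearest1, d1, d2, removal_loss):
    """Loss change for swapping candidate y_c with each existing medoid,
    plus the shared change dl_plus of merely adding y_c.
    neighbors: {point o: d(o, y_c)} -- only points reachable from y_c."""
    dl = {m: list(v) for m, v in removal_loss.items()}  # start from removal losses
    dl_plus = [0, 0.0]
    for o, d_oc in neighbors.items():
        m1 = nearest1[o]
        if d1[o] == math.inf:            # (1) point currently not covered
            dl_plus[0] -= 1              # gain the penalty pi(o) ...
            dl_plus[1] += d_oc           # ... but incur d(o, y_c)
        elif d_oc < d1[o]:               # (2) y_c becomes the nearest medoid
            dl_plus[1] += d_oc - d1[o]
            # cancel the reassignment already counted in m1's removal loss
            if d2[o] == math.inf:
                dl[m1][0] -= 1
                dl[m1][1] += d1[o]
            else:
                dl[m1][1] -= d2[o] - d1[o]
        elif d2[o] == math.inf:          # (3) y_c is the only fallback for m1
            dl[m1][0] -= 1               # removing m1 no longer loses coverage
            dl[m1][1] += d_oc
        elif d_oc < d2[o]:               # (4) y_c beats the old second nearest
            dl[m1][1] += d_oc - d2[o]
        # else: y_c is farther than the second nearest -- no change
    swap = {m: (v[0] + dl_plus[0], v[1] + dl_plus[1]) for m, v in dl.items()}
    return swap, tuple(dl_plus)

# One medoid 'A'; points 0, 1 are assigned to it, point 2 is uncovered.
res, plus = candidate_gain({0: 1.0, 1: 4.0, 2: 2.0},      # distances to y_c
                           ['A', 'A', None],              # nearest medoids
                           [2.0, 3.0, math.inf],          # d_n1
                           [math.inf, 5.0, math.inf],     # d_n2
                           {'A': (1, 0.0)})               # removal losses
```

Here ∆ℓ⁺ = (−1, 1.0): adding y_c covers the previously unassigned point and reduces one distance. Swapping out 'A' for y_c yields (−1, 2.0), which one can verify directly from the distances.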
[Algorithm listing: DynBUILD pseudocode, not reproduced here.] We observe that the two loops in lines 5 and 9 iterate over all edges, hence the complexity of the procedure is O((e + N + m) · i), where e is the number of edges and i the number of iterations. In the street network example, we can argue that e ∈ O(N) as we scale the approach to larger networks (as we would keep the maximum distance constant, but increase the area). Hence, this sparse k-medoids version scales linearly for this application. If we have a densely connected graph, then e ∈ O(N²) and the runtime matches that of standard FasterPAM.

Experiments
In our experiments, we expect to see a speedup compared to FasterPAM. We also want to check how the dynamic change of k works under the given constraints, and we evaluate how well Sparse k-Medoids is able to find the smallest possible k that still meets the constraints. Hence, we analyze three initialization methods: DynBUILD, Random, and Sparse++. Finally, we perform a qualitative evaluation by comparing our simulations with original substation locations from OSM.
Data Sets To verify the algorithm, we need sufficiently large test data sets, and we choose constraints that yield sparse distance matrices. In this work, we focus on the processing and evaluation of energy grids generated by OSMOGrid. For quality evaluation, we choose areas where many substations are documented in OSM. Figure 4 shows a cutout of the electrical grid generated by OSMOGrid for the city of Witten, using the 127 substation locations from OSM (although this list is likely not complete, as seen in Figure 4). We can then compare the quality of our calculated models to the model based on the real substations, but we need to remember that there may be additional substations missing in OSM, and that real power networks have grown historically, have to satisfy additional constraints, and are hence not optimal either. For the purpose of generating "realistic" networks, it is desirable to achieve a comparable quality without overfitting to the example data we have. On this data set, we evaluate the dynamic methods for choosing k. The "optimal" k depends on the constraints set, and thus on the sparsity of the distance matrix. Figure 5 shows the smallest k that meets the constraint of a maximum cable length in the grid. With increasing cable length, the number of missing distances in the matrix decreases, but the best number of substations decreases much faster.
We evaluate the algorithms in the ELKI open-source toolkit [16] in Java. For comparability, we perform all computations in ELKI and use the original implementations of FasterPAM as reference. This way we avoid side effects caused by different implementations [8]. We run 100 restarts on an AMD EPYC 7302 processor using a single thread, and evaluate the average, maximum, and minimum values.
Dynamic k To evaluate the quality with a variable k, we compare the solutions found by the algorithms to the best known solution of all runs. We also compare the different initialization algorithms and the variants of the dynamic SWAP. We measure the solution's k, the runtime of initialization and SWAP, and whether the result satisfies all constraints. Summarized results are shown in Table 1. Only Random5 initialization without dynamic increase of k fails to satisfy the constraints. This was to be expected, because it only uses 5% of the possible substations as medoids, and there does not appear to be a solution with just this many clusters. Because DynBUILD is deterministic, it always produced k=93 clusters after initialization. After the SWAP phase, the average k was 76.2, which is 2.2 more than the best known k=74 (we iterate in a randomized order in SWAP to avoid dependence on the input data order). Since the initial solution already satisfied all constraints, and SWAP preserves this property, k can only decrease. Among the random initializations, Random5 with DynSWAP↓↑ found the best results on average: with k=77.9 after the SWAP phase, the number of stations is on average 3.9 above the best k. Finally, the SWAP after DynBUILD needs on average 48.4% of the runtime of the SWAP after a random initialization. Due to random initialization, the average number of medoid changes during the SWAP increases significantly from 141 to 310, showing that DynBUILD finds superior starting conditions compared to both random sampling and Sparse++.
Runtime Speedup In order to evaluate the runtime of the different methods, we perform experiments for varying constraints and values of k. Figure 6 shows the total runtime (initialization and SWAP) for DynBUILD, Random5, Random10, and Sparse++ initialization. We again use the Witten data set and choose the distance constraint such that all methods can achieve the desired k. We compare the runtime with the FasterPAM implementation with a random initialization (as recommended for FasterPAM). For DynBUILD we evaluate the SWAP with dynamic decrease of k (Dyn↓), and for all random initializations the SWAP with both dynamic increase and decrease of k (Dyn↓↑). We use a log scale on this plot because of the huge differences: averaged over all initializations, the sparsity-optimized DynSWAP uses only 7% of the runtime of the original FasterPAM with random initialization; the combination of DynBUILD and DynSWAP↓ on average even uses only 4%. This was expected, as FasterPAM has to process the much larger dense matrix. With increasing k, we can use a sparser matrix here, which is why the DynSWAP approaches become faster while FasterPAM becomes slower due to the higher number of clusters. The various random initializations differ only slightly in runtime, but require on average about twice as long as DynBUILD. In addition to the fast runtime, DynBUILD with Dyn↓ SWAP also produces the lowest number of excess clusters, with an average of 2.2 stations more than the best known k = 74.
Quality In order to evaluate the resulting quality, we compare the optimized substation locations to the substations tagged in OSM (however, these are likely incomplete). Table 2 shows the results for a target k=127 compared to the loss of the 127 tagged substations shown in Figure 4. Sparse++ and Random10 initialization with Dyn↓↑ SWAP result in the lowest loss of 1.3149×10⁷, which is 29% lower than the loss of the tagged substations; but the quality difference between the randomized initializations is not significant (the SWAP does a good enough job to always reach a good solution); the main difference here is in the runtimes, where the strategy of sampling more centers than necessary and then decreasing seems to be superior to the others. The minimum loss over 100 restarts is obtained with Sparse++ and Random5 with 1.3083×10⁷. DynBUILD initialization with Dyn↓ SWAP finds a slightly higher loss of 1.3271×10⁷, but yields the fastest total runtime with 9551.0 ms, despite using the slowest initialization by far.
The facility location formulation also allows attaching a cost term to the opening or closing of locations, and to weight these costs differently. In this experiment, we used a strict requirement to cover all locations (i.e., π → ∞), but using a smaller weight may yield interesting approximations.
So far, we have considered the problem of optimal substation positions without a maximum capacity of substations. In reality, there is a maximum load that can be served by a substation. In densely populated areas, we hence may need more substations. This results in a capacitated facility location problem, which contains such an additional capacity constraint, and is worth exploring in future work.

Figure 1: Four different central points: the arithmetic mean, the per-axis median, the geometric median, and the Euclidean medoid.
Both the greedy initialization and the local search require all pairwise distances, stored in a distance matrix. Greedy initialization performs k iterations, each of cost O(N²), to find the best medoid to add. PAM's SWAP evaluates O(k(N − k)) potential swaps, each with a reduced effort of O(N − k) operations by computing only the change in the loss function. Hence each swap takes O(k(N − k)²) time to find, which already was an improvement over the naive approach in O(k²(N − k)²).

Figure 2: Simulation of an electrical grid based on OSM data of Witten Stockum.

Figure 3: Clustering of the demand points of the generated graph structure according to optimal substation locations with k = 4 using FasterPAM.

Figure 4: Grid simulation and known substation locations in OSM for Witten. The road network contains 37287 edges and 36844 nodes, N=35713 consumers and m=1130 possible places for substations. The locations of 127 substations are documented in OSM, but very likely several are missing, in particular in the east.

Figure 5: Sparsity of the distance matrix depending on a maximum cable length constraint between consumer and substation for the simulation of Witten, and the minimum number of substations k for which no constraint is broken.

Figure 6: Runtime of the initialization and SWAP for DynBUILD, Random5, Random10, and Sparse++ initialization depending on the best number of k for the simulation of the grid of Witten. For reference, the random initialization and SWAP runtime of the FasterPAM implementation is included, where the k chosen is the best one we know. The best k is controlled indirectly by the constraints set, see Figure 5. In addition to the runtime, the deviation from the best known k after SWAP is also shown.

Figure 7: Simulation of an electrical grid based on OSM data of Witten. 127 substations calculated with Sparse++ and Dyn↓↑ SWAP from Table 2. All consumers are color-coded by their nearest substation.

Table 1: Comparison of k depending on initialization and swap algorithms for generating a grid for Witten. All results are averaged over 100 restarts.