Unsupervised clustering with unknown number of clusters

I have a large set of 3-dimensional vectors. I need to cluster these data based on the Euclidean metric, such that the Euclidean distance between all vectors in any particular cluster is less than a threshold "t".

I do not know how many clusters exist. In the end, there may also be individual vectors that are not part of any cluster, because their Euclidean distance to every other vector in the space is not less than "t".

What existing algorithms/approaches should be used here?


You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is for example included in Python's scipy.

See for example the following script:

import matplotlib.pyplot as plt
import numpy
import scipy.cluster.hierarchy as hcluster


# generate 3 clusters of around 100 points each, plus one orphan point
N = 100
data = numpy.random.randn(3 * N, 2)
data[:N] += 5
data[-N:] += 10
data[-1:] -= 20


# clustering
thresh = 1.5
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")


# plotting
plt.scatter(*numpy.transpose(data), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()

This produces a result similar to the following image.

The threshold given as a parameter is the distance value used to decide whether points/clusters will be merged into another cluster. The distance metric to use can also be specified.

Note that there are various methods for computing the intra-/inter-cluster similarity, e.g. distance between the closest points, distance between the furthest points, distance to the cluster centers, and so on. Some of these methods are also supported by scipy's hierarchical clustering module (single/complete/average... linkage). Based on your post, I think you would want to use complete linkage, as shown below.
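For instance, a minimal sketch with scipy (the threshold value here is arbitrary):

import numpy
import scipy.cluster.hierarchy as hcluster

# stand-in for your 3-D vectors
data = numpy.random.randn(100, 3)

# with complete linkage and the "distance" criterion, all pairwise
# distances within a flat cluster are at most the threshold t
t = 1.5
clusters = hcluster.fclusterdata(data, t, criterion="distance",
                                 metric="euclidean", method="complete")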

Note that this approach also allows small (single-point) clusters if they don't meet the similarity criterion of the other clusters, i.e. the distance threshold.


There are other algorithms that will perform better, which will become relevant in situations with lots of data points. As other answers/comments suggest, you might also want to have a look at the DBSCAN algorithm.


For a nice overview of these and other clustering algorithms, also have a look at this demo page of Python's scikit-learn library:

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

As you can see, each algorithm makes some assumptions about the number and shape of the clusters that need to be taken into account, be they implicit assumptions imposed by the algorithm or explicit assumptions specified by parameterization.

Check out the DBSCAN algorithm. It clusters based on the local density of vectors, i.e. they must not be more than some ε distance apart, and it can determine the number of clusters automatically. It also considers outliers, i.e. points with an insufficient number of ε-neighbors, not to be part of any cluster. The Wikipedia page links to a few implementations.
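A minimal sketch using scikit-learn's DBSCAN (one available implementation; the eps and min_samples values here are arbitrary and need tuning for your data, with eps related to your threshold t):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 3)  # stand-in for your 3-D vectors

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster id per point; -1 marks outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)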

The answer by moooeeeep recommended using hierarchical clustering. I wanted to elaborate on how to choose the threshold of the clustering.

One way is to compute clusterings based on different thresholds t1, t2, t3, ... and then compute a metric for the "quality" of the clustering. The premise is that the clustering with the optimal number of clusters will maximize the quality metric.

An example of a good quality metric I've used in the past is Calinski-Harabasz. Briefly: you compute the average inter-cluster distances and divide them by the within-cluster distances. The optimal clustering assignment will have clusters that are separated from each other the most, and clusters that are "tightest".
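(Formally, for n points partitioned into k clusters, the score is CH = [tr(B_k) / (k - 1)] / [tr(W_k) / (n - k)], where B_k and W_k are the between- and within-cluster dispersion matrices; higher is better.)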

By the way, you don't have to use hierarchical clustering. You can also use something like k-means, precompute it for each k, and then pick the k that has the highest Calinski-Harabasz score.
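A rough sketch of that sweep, assuming scikit-learn is available (the range of k values here is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.randn(300, 3)  # stand-in for your data

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by Calinski-Harabasz:", best_k)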

Let me know if you need more references, and I'll scour my hard disk for some papers.

Use OPTICS, which works well with large data sets.

OPTICS (Ordering Points To Identify the Clustering Structure) is closely related to DBSCAN: it finds core samples of high density and expands clusters from them. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, which makes it better suited for large datasets than the current sklearn implementation of DBSCAN.

from sklearn.cluster import OPTICS

# X is your (n_samples, 3) array; in sklearn's OPTICS the radius parameter is max_eps
db = OPTICS(max_eps=3, min_samples=30).fit(X)

Fine-tune max_eps and min_samples as per your requirements.

You may have no solution: this is the case when the distance between any two distinct input data points is always greater than t. If you want to compute the number of clusters from the input data alone, you may look at MCG, a hierarchical clustering method with an automatic stop criterion: see the freely available paper at https://hal.archives-ouvertes.fr/hal-02124947/document (it contains bibliographic references).

I want to add to moooeeeep's answer on hierarchical clustering. The solution worked for me, but picking the threshold value felt quite "random". By referring to other sources and testing myself, I found a better method: the threshold can be easily picked from a dendrogram:

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt


ori_array = ["Your_list_here"]  # replace with your data, shape (n_samples, n_features)
ward_array = hierarchy.ward(pdist(ori_array))
dendrogram = hierarchy.dendrogram(ward_array)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

You will see a dendrogram plot. Then, by drawing a horizontal line, say at distance = 1, the number of vertical lines it crosses will be your desired number of clusters. So here I chose threshold = 1, which gives 4 clusters.

threshold = 1
clusters_list = hierarchy.fcluster(ward_array, threshold, criterion="distance")
print("Clustering list: {}".format(clusters_list))

Now each value in clusters_list will be the assigned cluster id of the corresponding point in ori_array.
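For instance, to group the original points by their assigned id (a small sketch, assuming ori_array holds your actual data points):

from collections import defaultdict

groups = defaultdict(list)
for point, cluster_id in zip(ori_array, clusters_list):
    groups[cluster_id].append(point)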

I needed a way to "fuzzy sort" lines from OCR output, where the output is sometimes out of order, although within blocks the lines are usually in order. In this case, the items to sort are dictionaries that describe words at a location 'x', 'y' and with size 'w', 'h'. The general clustering algorithms seemed like overkill, and I needed to maintain the order of the items during the sort. Here, I can set the tolerance tol to about 1/4 of the line spacing, and this is called with the field being 'y'.

import statistics


def fuzzy_lod_sort(lod, field, tol):
    # fuzzy sort lod (a list of dicts) into bins within +/- tol,
    # maintaining the original order.

    # first determine the bins.
    val_list = [d[field] for d in lod]
    vals_sorted = sorted(val_list)

    bins_lol = []
    i = 0
    for j, v in enumerate(vals_sorted):
        if not j:
            bins_lol.append([v])
            continue

        cur_bin_avg = statistics.mean(bins_lol[i])
        if abs(cur_bin_avg - v) <= tol:
            bins_lol[i].append(v)
            continue

        i += 1
        bins_lol.append([v])

    # now sort into the bins, maintaining the original order.
    # the bins will be the center of the range of 'y'.
    bins = [statistics.mean(binlist) for binlist in bins_lol]

    # initialize the list of bins
    lolod = []
    for _ in range(len(bins)):
        lolod.append([])

    for d in lod:
        bin_idx = closest_bin_idx(bins, d[field])
        lolod[bin_idx].append(d)

    # now join the bins.
    result_lod = []
    for bin_lod in lolod:
        result_lod.extend(bin_lod)

    return result_lod


def closest_bin(bins, val):
    return min(bins, key=lambda bin: abs(bin - val))


def closest_bin_idx(bins, val):
    return bins.index(closest_bin(bins, val))

The trouble is that the 'y' coordinates of the OCR output are based on the outline around each word, and a later word in the same line might have a 'y' coordinate that is lower than an earlier word. So a full sort by 'y' does not work. This is much like the clustering algorithms, but the intention is a bit different: I am not interested in the statistics of the data points, but in exactly which cluster each point is placed in, and it is also important to maintain the original order.

Maybe there is some way to fuzzy sort using the sorting built-ins; it might be an alternative to the clustering options for 1-D problems.
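As a rough sketch of that idea (the helper name is hypothetical): sort the 1-D values, then start a new bin whenever the gap to the previous value exceeds tol.

def gap_split(values, tol):
    # sort, then split into bins wherever consecutive values differ by more than tol
    bins = []
    for v in sorted(values):
        if bins and v - bins[-1][-1] <= tol:
            bins[-1].append(v)
        else:
            bins.append([v])
    return bins


print(gap_split([10.2, 9.8, 30.1, 29.7, 10.0], tol=1.0))
# [[9.8, 10.0, 10.2], [29.7, 30.1]]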