k-means algorithm: difference between revisions


Revision as of 01:54, 2 April 2015 (Thursday)

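$k$-means originated as a vector quantization method in signal processing and is now better known as a cluster analysis method in data mining. The goal of $k$-means clustering is to partition $n$ points (each of which may be one observation or one instance of a sample) into $k$ clusters so that every point belongs to the cluster whose mean (the cluster center) is nearest to it; this nearest-mean rule is the clustering criterion. The problem amounts to partitioning the data space into Voronoi cells.

The problem is computationally hard (NP-hard), but efficient heuristic algorithms exist and are what is normally used in practice; they converge quickly to a local optimum. These algorithms usually resemble the expectation-maximization (EM) algorithm for Gaussian mixture distributions, which also proceeds by iterative refinement. Both approaches model the data with cluster centers; however, $k$-means clustering tends to find clusters of comparable spatial extent, whereas the expectation-maximization technique allows clusters to have different shapes.

$k$-means clustering has no relationship to $k$-nearest neighbors, which is another popular machine learning technique.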

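Algorithm description

Given a set of observations $(x_{1},x_{2},...,x_{n})$, where each observation is a $d$-dimensional real vector, $k$-means clustering partitions the $n$ observations into $k$ sets $S=\{S_{1},S_{2},...,S_{k}\}$ with $k\leq n$ so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find

$$\underset{S}{\operatorname{arg\,min}}\sum _{i=1}^{k}\sum _{x\in S_{i}}\left\|x-\mu _{i}\right\|^{2}$$

where $\mu _{i}$ is the mean of all points in $S_{i}$.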
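As a minimal sketch of the objective above, the following Python snippet evaluates the WCSS of a given partition. The function names (`mean`, `sq_dist`, `wcss`) and the sample data are illustrative assumptions, not part of the original article.

```python
from typing import List, Sequence

Point = Sequence[float]


def mean(points: List[Point]) -> List[float]:
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wcss(clusters: List[List[Point]]) -> float:
    """Within-cluster sum of squares of a partition S_1, ..., S_k."""
    total = 0.0
    for cluster in clusters:
        mu = mean(cluster)  # mu_i, the mean of all points in S_i
        total += sum(sq_dist(x, mu) for x in cluster)
    return total


# Two ways of splitting the same four points into k = 2 sets (made-up data).
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(wcss(good), wcss(bad))  # 4.0 100.0
```

The partition that groups nearby points has the smaller WCSS, which is the quantity k-means tries to minimize over all partitions.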

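History

Although the idea goes back to Hugo Steinhaus in 1957 ^[1] , the term "k-means" was first used by James MacQueen in 1967 ^[2] . The standard algorithm was proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but it was not published in a journal until 1982 ^[3] . In 1965, E. W. Forgy published essentially the same method, which is why the algorithm is sometimes called the Lloyd-Forgy method ^[4] . A more efficient version was proposed and published in Fortran by Hartigan and Wong in 1975/1979 ^[5]^[6] .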

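Algorithms

Standard algorithm

The most common algorithm uses an iterative refinement technique. It is widely known simply as the k-means algorithm and is sometimes called Lloyd's algorithm, particularly in computer science. Given an initial set of $k$ means $m_{1}^{(1)},...,m_{k}^{(1)}$, the algorithm alternates between the following two steps ^[7] :

Assignment step: assign each observation to the cluster that yields the smallest within-cluster sum of squares (WCSS).

Because the sum of squares is the squared Euclidean distance, this simply means assigning each observation to its nearest mean ^[8] . (Mathematically, this partitions the observations according to the Voronoi diagram generated by the means.)

$$S_{i}^{(t)}=\left\{x_{p}:\left\|x_{p}-m_{i}^{(t)}\right\|^{2}\leq \left\|x_{p}-m_{j}^{(t)}\right\|^{2}\ \forall j,1\leq j\leq k\right\}$$

where each $x_{p}$ is assigned to exactly one cluster $S^{(t)}$, even if, in theory, it could be assigned to two or more of them.

Update step: compute the centroid of the observations in each cluster obtained in the previous step and use it as the new mean.

$$m_{i}^{(t+1)}={\frac {1}{\left|S_{i}^{(t)}\right|}}\sum _{x_{j}\in S_{i}^{(t)}}x_{j}$$

Because the arithmetic mean is a least-squares estimator, this step also reduces the objective function, the within-cluster sum of squares (WCSS).

The algorithm has converged when the assignments no longer change. Because neither of the two alternating steps increases the WCSS objective and only finitely many partitions exist, the algorithm must converge to a (local) optimum. Note that the algorithm is not guaranteed to find the global optimum.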
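The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `assign`, `update` and `lloyd` are hypothetical, ties are broken in favor of the lowest cluster index, and an empty cluster simply keeps its previous mean (a case the text does not address).

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], means: List[Point]) -> List[int]:
    """Assignment step: index of the nearest mean for every observation.
    min() returns the first minimum, so ties go to the lowest index."""
    return [min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
            for p in points]


def update(points: List[Point], labels: List[int],
           means: List[Point]) -> List[List[float]]:
    """Update step: every mean becomes the centroid of its cluster."""
    new_means = []
    for i, old in enumerate(means):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if members:
            new_means.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_means.append(list(old))  # assumption: keep an empty cluster's mean
    return new_means


def lloyd(points: List[Point], means: List[Point],
          max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and update until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, means)
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
        means = update(points, labels, means)
    return labels, means


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
initial_means = [(1.0, 1.0), (1.0, 2.0)]  # deliberately poor starting means
labels, means = lloyd(points, initial_means)
print(labels)  # [0, 0, 1, 1]
print(means)   # [[1.0, 1.5], [9.5, 9.0]]
```

On this made-up data the first pass puts three points into the second cluster; the update step then pulls that mean toward the far pair, and the next assignment settles on the two natural groups.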

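The algorithm is often described as "assigning each observation to the nearest cluster by distance". The objective of the standard algorithm is the within-cluster sum of squares (WCSS), and assigning observations by least sum of squares is indeed equivalent to assigning them by smallest Euclidean distance. Replacing (squared) Euclidean distance with a different distance function may prevent the algorithm from converging. Other distance functions do, however, give rise to variants of k-means such as spherical k-means and k-medoids.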

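Initialization methods

Commonly used initialization methods are Forgy and Random Partition ^[9] . The Forgy method randomly chooses $k$ observations from the data set and uses them as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then runs the update step, that is, it computes the centroid of each randomly assigned cluster and uses it as the initial mean. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al. ^[9] , Random Partition is generally preferable for k-harmonic means and fuzzy k-means, whereas the Forgy method performs better as an initialization for expectation-maximization (EM) and the standard k-means algorithm.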
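The two initialization methods can be sketched as below. The function names and the sample data are illustrative assumptions; the handling of an empty cluster in Random Partition (falling back to one random observation) is also an assumption, since the text does not cover that case.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def forgy_init(points: List[Point], k: int, rng: random.Random) -> List[List[float]]:
    """Forgy: choose k distinct observations and use them as the initial means."""
    return [list(p) for p in rng.sample(points, k)]


def random_partition_init(points: List[Point], k: int,
                          rng: random.Random) -> List[List[float]]:
    """Random Partition: give every observation a random cluster,
    then take the centroid of each cluster as its initial mean."""
    labels = [rng.randrange(k) for _ in points]
    means = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:  # assumption: an empty cluster falls back to one observation
            members = [rng.choice(points)]
        means.append([sum(c) / len(members) for c in zip(*members)])
    return means


rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(float(i), float(i % 3)) for i in range(30)]
print(forgy_init(points, 3, rng))
print(random_partition_init(points, 3, rng))
```

Forgy means are always actual observations, so they can land anywhere in the data. Random Partition means are averages over random subsets of the whole data set, which is why they tend to sit near its overall center.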

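Because this is a heuristic algorithm, it is not guaranteed to converge to the global optimum, and the result depends on the initial clusters. As the algorithm is usually very fast, it is common to run it several times from different starting conditions and keep the best result. In the worst case, however, k-means can converge very slowly: it has been shown that there exist point sets, even in 2 dimensions, on which k-means takes exponential time, $2^{\Omega (n)}$, to converge ^[10] . Such point sets hardly ever arise in practice, which is consistent with the fact that the smoothed running time of k-means is polynomial ^[11] .

Note: if the assignment step is viewed as the expectation step and the update step as the maximization step, the algorithm is seen to be a variant of the generalized expectation-maximization (GEM) algorithm.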

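Complexity

The computational complexity of finding the optimal solution to the k-means clustering problem in $d$-dimensional space is:

NP-hard in general Euclidean space, even when the number of clusters is only 2 ^[12]^[13]
NP-hard in the plane when the number of clusters $k$ is not restricted ^[14]
If $k$ and $d$ are both fixed, the problem can be solved exactly in time $O(n^{dk+1}\log n)$, where $n$ is the number of observations to be clustered ^[15]

By comparison, Lloyd's algorithm usually runs in $O(nkdi)$ time, where $k$ and $d$ are as defined above and $i$ is the number of iterations needed until convergence. On data that has a clustering structure, the number of iterations until convergence is often small, and results improve only slightly after the first few iterations. For this reason, Lloyd's algorithm is often considered to be of nearly linear complexity in practice.

Some recent results on the complexity of the algorithm:

Lloyd's k-means algorithm has polynomial smoothed running time. For an arbitrary set of $n$ points in $[0,1]^{d}$, if each point is independently perturbed by a normal distribution with mean $0$ and standard deviation $\sigma$, then the expected running time of k-means is bounded above by $O(n^{34}k^{34}d^{8}\log ^{4}(n)/\sigma ^{6})$, which is polynomial in $n$, $k$, $d$ and $1/\sigma$ ^[11] .
Better upper bounds hold in simpler cases. For example ^[16] , the running time of k-means is bounded by $O(dn^{4}M^{2})$ for $n$ points in the integer lattice $\left\{1,...,M\right\}^{d}$.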

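Variations of the algorithm

Relation to other statistical machine learning methods

k-means clustering, together with its connection to the EM algorithm, is a special case of a Gaussian mixture model, and the k-means problem is easily generalized into a Gaussian mixture model ^[19] . Another generalization of k-means is the K-SVD algorithm, which represents data points as sparse linear combinations of "codebook vectors". k-means corresponds to the special case of using a single codebook vector with a weight of 1 ^[20] .

Mean shift clustering

Basic mean shift clustering maintains a set of data points of the same size as the input data set. Initially, this set is a copy of the input set. Each point is then iteratively replaced by the mean of all points that lie within a given distance of it. By contrast, k-means restricts this iterative update to k points, usually far fewer than the number of points in the input data set, and replaces each of them by the mean of all input points that are closer to it than to any other of the k points (that is, the input points within its Voronoi cell). A mean shift algorithm similar to k-means, called likelihood mean shift, replaces each point of the set being updated by the mean of all points in the input set that lie within a given distance of it ^[21] . One advantage of mean shift over k-means is that the number of clusters need not be specified, because mean shift tends to find as few clusters as possible. However, mean shift can be much slower than k-means, and it still requires choosing a bandwidth parameter. Like k-means, mean shift has many variants.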
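The basic procedure described above can be sketched as follows, assuming a flat kernel (every point within `radius` counts equally) and that neighbors are taken from the working set itself. The function name, the `radius`, `max_iter` and `tol` parameters, and the sample data are illustrative assumptions.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_shift(points: List[Point], radius: float,
               max_iter: int = 100, tol: float = 1e-9) -> List[List[float]]:
    """Basic mean shift with a flat kernel: the working set starts as a copy
    of the input, and every point is repeatedly replaced by the mean of the
    working-set points that lie within `radius` of it."""
    current = [list(p) for p in points]
    for _ in range(max_iter):
        shifted = []
        for p in current:
            # p is always within distance 0 of itself, so `near` is never empty
            near = [q for q in current if sq_dist(p, q) <= radius ** 2]
            shifted.append([sum(c) / len(near) for c in zip(*near)])
        moved = max(sq_dist(a, b) for a, b in zip(current, shifted))
        current = shifted
        if moved <= tol:  # no point moved noticeably: stop
            break
    return current


data = [[0.0], [1.0], [10.0], [11.0]]
modes = mean_shift(data, radius=2.0)
print(modes)  # [[0.5], [0.5], [10.5], [10.5]]
```

The number of clusters is not passed in: the four points collapse onto two distinct locations because of the chosen radius, which plays the role of the bandwidth parameter mentioned above. Each pass compares every point with every other point, which hints at why mean shift can be much slower than k-means.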

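Principal component analysis (PCA)

Some studies ^[22]^[23] suggest that the relaxed solution of k-means clustering, specified by the cluster indicator vectors, is given by the principal components of PCA, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace. However, the fact that PCA is a useful relaxation of k-means clustering was not a new result (see, for example, ^[24] ), and other work gives counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions ^[25] .

Independent component analysis (ICA)

It has been shown ^[26] that, under sparsity assumptions and when the input data is pre-processed by whitening, the solution produced by k-means is the solution of independent component analysis. This result helps explain the successful application of k-means to feature learning.

Bilateral filtering

k-means implicitly assumes that the ordering of the input data does not affect the result. The bilateral filter resembles k-means and mean shift in that it also maintains a set of data points that are iteratively replaced by means. However, the bilateral filter restricts the calculation of each mean to points that are close in the ordering of the input data ^[21] . This makes it applicable to problems such as image denoising, where the spatial arrangement of the data points is of critical importance.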
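To illustrate the ordering restriction, here is a simplified one-dimensional sketch. It is an assumption-laden toy rather than a standard bilateral filter: it uses hard cut-offs (`window` in index, `value_radius` in value) instead of the Gaussian weights commonly used, and all names and data are hypothetical.

```python
from typing import List


def bilateral_pass(signal: List[float], window: int,
                   value_radius: float) -> List[float]:
    """One pass of a hard-kernel bilateral filter on a 1-D signal.
    Each sample is replaced by the mean of the samples that are both
    close in input order (index distance <= window) and close in value
    (absolute difference <= value_radius)."""
    out = []
    for i, x in enumerate(signal):
        lo, hi = max(0, i - window), min(len(signal), i + window + 1)
        near = [y for y in signal[lo:hi] if abs(y - x) <= value_radius]
        out.append(sum(near) / len(near))  # `near` always contains x itself
    return out


noisy_step = [0.0, 1.0, 0.0, 10.0, 11.0, 10.0]
smoothed = bilateral_pass(noisy_step, window=1, value_radius=2.0)
print(smoothed)
```

Samples on either side of the jump are averaged only with their own neighbors, so the noise is reduced while the step between the third and fourth samples stays sharp. Applying `bilateral_pass` repeatedly corresponds to the iterative replacement described above.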

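Similar problems

The family of algorithms whose objective is to minimize the squared error of the clustering also includes the k-medoids algorithm, which forces the center of each cluster to be one of the actual data points, that is, it uses medoids in place of centroids.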
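The difference between a centroid and a medoid can be shown with a short sketch. The function names and data are illustrative, and using squared Euclidean distance as the dissimilarity is an assumption made to match the squared-error objective mentioned above; k-medoids can be run with other dissimilarities.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    """Arithmetic mean of the points; usually not itself a data point."""
    return [sum(c) / len(points) for c in zip(*points)]


def medoid(points: List[Point]) -> Point:
    """The data point with the smallest total dissimilarity to all others."""
    return min(points, key=lambda c: sum(sq_dist(c, p) for p in points))


cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
print(centroid(cluster))  # [3.0, 0.0]
print(medoid(cluster))    # (2.0, 0.0)
```

The centroid (3, 0) is pulled toward the outlying point and does not coincide with any observation, while the medoid is always a member of the cluster.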

References

[1] Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1957, 4 (12): 801–804. MR 0090073. Zbl 0079.16403 (in French).
[2] MacQueen, J. B. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1. University of California Press: 281–297. 1967 [2009-04-07]. MR 0214227. Zbl 0214.46201.
[3] Lloyd, S. P. Least square quantization in PCM. Bell Telephone Laboratories Paper. 1957. Published in journal much later: Lloyd, S. P. Least squares quantization in PCM (PDF). IEEE Transactions on Information Theory. 1982, 28 (2): 129–137 [2009-04-15]. doi:10.1109/TIT.1982.1056489.
[4] E.W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics. 1965, 21: 768–769.
[5] J.A. Hartigan. Clustering algorithms. John Wiley & Sons, Inc. 1975.
[6] Hartigan, J. A.; Wong, M. A. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C. 1979, 28 (1): 100–108. JSTOR 2346830.
[7] MacKay, David. Chapter 20. An Example Inference Task: Clustering (PDF). Information Theory, Inference and Learning Algorithms. Cambridge University Press. 2003: 284–292. ISBN 0-521-64298-1. MR 2012999.
[8] Since the square root is a monotone function, this also is the minimum Euclidean distance assignment.
[9] Hamerly, G. and Elkan, C. Alternatives to the k-means algorithm that find better clusterings (PDF). Proceedings of the eleventh international conference on Information and knowledge management (CIKM). 2002.
[10] Vattani, A. k-means requires exponentially many iterations even in the plane (PDF). Discrete and Computational Geometry. 2011, 45 (4): 596–616. doi:10.1007/s00454-011-9340-1.
[11] Arthur, D.; Manthey, B.; Roeglin, H. k-means has polynomial smoothed complexity. Proceedings of the 50th Symposium on Foundations of Computer Science (FOCS). 2009.
[12] Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning. 2009, 75: 245–249. doi:10.1007/s10994-009-5103-0.
[13] Dasgupta, S. and Freund, Y. Random Projection Trees for Vector Quantization. Information Theory, IEEE Transactions on. July 2009, 55: 3229–3242. arXiv:0805.1390. doi:10.1109/TIT.2009.2021326.
[14] Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The Planar k-Means Problem is NP-Hard. Lecture Notes in Computer Science. 2009, 5431: 274–285. doi:10.1007/978-3-642-00202-1_24.
[15] Inaba, M.; Katoh, N.; Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. Proceedings of 10th ACM Symposium on Computational Geometry: 332–339. 1994. doi:10.1145/177424.178042.
[16] Arthur; Abhishek Bhowmick. A theoretical analysis of Lloyd's algorithm for k-means clustering (thesis). 2009. [dead link]
[17] Honarkhah, M and Caers, J, 2010, Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling, Mathematical Geosciences, 42: 487–517.
[18] Coates, Adam; Ng, Andrew Y. Learning feature representations with k-means (PDF). G. Montavon, G. B. Orr, K.-R. Müller (eds.). Neural Networks: Tricks of the Trade. Springer. 2012.
[19] Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP. Section 16.1. Gaussian Mixture Models and k-Means Clustering. Numerical Recipes: The Art of Scientific Computing 3rd. New York: Cambridge University Press. 2007. ISBN 978-0-521-88068-8.
[20] Template:Cite Journal
[21] Little, M.A.; Jones, N.S. Generalized Methods and Solvers for Piecewise Constant Signals: Part I (PDF). Proceedings of the Royal Society A. 2011.
[22] H. Zha, C. Ding, M. Gu, X. He and H.D. Simon. Spectral Relaxation for K-means Clustering (PDF). Neural Information Processing Systems vol.14 (NIPS 2001) (Vancouver, Canada). Dec 2001: 1057–1064.
[23] Chris Ding and Xiaofeng He. K-means Clustering via Principal Component Analysis (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2004). July 2004: 225–232.
[24] Drineas, P.; A. Frieze; R. Kannan; S. Vempala; V. Vinay. Clustering large graphs via the singular value decomposition (PDF). Machine learning. 2004, 56: 9–33 [2012-08-02]. doi:10.1023/b:mach.0000033113.59016.96.
[25] Cohen, M.; S. Elder; C. Musco; C. Musco; M. Persu. Dimensionality reduction for k-means clustering and low rank approximation (Appendix B). ArXiv. 2014 [2014-11-29].
[26] Alon Vinnikov and Shai Shalev-Shwartz. K-means Recovers ICA Filters when Independent Components are Sparse (PDF). Proc. of Int'l Conf. Machine Learning (ICML 2014). 2014.
