Data Mining Notes: Clustering

1. Concepts

Cluster: A collection of data objects, similar (or related) to one another within the same group, dissimilar (or unrelated) to the objects in other groups.

– Jiawei Han

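A cluster is a collection of data objects: objects within the same cluster are highly similar to one another, while objects in different clusters are highly dissimilar.
Dissimilarity is assessed from the attribute values that describe the objects, usually with a distance measure.
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects, called clusters.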
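
To make "distance measure" concrete, here is a minimal sketch of two common choices for numeric attributes, Euclidean and Manhattan distance. The function names are illustrative and not taken from the notes above.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance: the sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

A smaller distance means the two objects are more similar; which measure fits best depends on the attributes being compared.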

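Unsupervised learning: no predefined classes (i.e., learning by observations, as opposed to supervised learning, which is learning by examples).
Clustering learns without prior knowledge: there are no predefined classes. This is its biggest difference from classification.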

2. Application areas

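Typical applications:

  • As a stand-alone tool for gaining insight into the data distribution;
  • As a preprocessing step for other methods.
  • Market research, pattern recognition, data analysis, and image processing.
  • In some applications clustering is also called data segmentation, because it partitions a large data set into groups according to similarity.
  • Clustering can also be used for outlier detection, where outliers (values "far away" from any cluster) may be more interesting than the common cases.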

3. Clustering techniques

Partitioning approach:

Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
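
As an illustration of the partitioning idea, the following is a minimal k-means sketch in plain Python. It is not a production implementation: the initialization (sampling k data points with a fixed seed), the iteration cap, and all names are assumptions made for this example. It returns the sum of squared errors (SSE) mentioned above as the evaluation criterion.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Return (labels, centroids, sse) for a list of point tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no change means the partition has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse

data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids, sse = kmeans(data, k=2)
print(labels, centroids, sse)
```

On this toy data the two visually separate groups end up in different clusters; on real data the result depends on the initialization, so k-means is usually run several times.
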
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
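
A bottom-up (agglomerative, AGNES-style) decomposition can be sketched as follows. This assumes single-link distance as the merge criterion and stops once k clusters remain; both choices, and the function names, are assumptions for illustration only.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Distance between two clusters: that of their closest pair of points."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerative(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = (
            (i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], k=3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Recording the sequence of merges instead of stopping at k would give the full hierarchy (a dendrogram). DIANA works in the opposite, top-down direction.
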
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
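
The density-based idea can be illustrated with a compact DBSCAN-style sketch. It uses a brute-force neighbor search, and the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point, counting the point itself) follow common usage but are assumptions here, as are the function names.

```python
import math

NOISE = -1

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_neighbors)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (50, 50) is labelled as noise, which is also how a density-based method can support the outlier detection use mentioned earlier.
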
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
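
The core step shared by grid-based methods, quantizing the space into cells and working with per-cell statistics instead of individual points, can be sketched as below. This is a single-level, 2-D simplification and not an implementation of STING, WaveCluster, or CLIQUE; `cell_size` and `min_count` are hypothetical parameters.

```python
from collections import Counter

def dense_cells(points, cell_size, min_count):
    """Map 2-D points to grid cells and keep the cells holding >= min_count points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

pts = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.5, 5.5), (5.9, 5.1), (9.0, 1.0)]
print(dense_cells(pts, cell_size=1.0, min_count=2))  # {(0, 0): 3, (5, 5): 2}
```

A multi-level method would repeat this at several cell sizes and then connect adjacent dense cells into clusters.
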
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
Typical methods: EM, SOM, COBWEB
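
As a small example of the model-based approach, the sketch below runs EM on a mixture of two 1-D Gaussians. The initialization (smallest and largest value as starting means, overall variance for both components), the fixed iteration count, and the variance floor are all assumptions chosen to keep the example short.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture; return (weights, means, variances)."""
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    means = [min(xs), max(xs)]            # assumed crude initialization
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(xs)
print(weights, means, variances)
```

Unlike k-means, each point receives a soft membership (its responsibilities) rather than a hard label; a hard clustering is obtained by assigning each point to its most responsible component.
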
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus

4. What makes a good clustering algorithm

A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
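
One simple way to check these two properties on a labelled result is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below is only an illustrative check, not a standard validity index, and it assumes at least one same-cluster pair and one cross-cluster pair exist.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
intra, inter = mean_intra_inter(pts, [0, 0, 1, 1])
print(intra, inter)
```

A good clustering shows a small intra-cluster value (cohesive) relative to the inter-cluster value (distinctive).
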
End.

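Original article. To keep versions of this article consistent, up to date, and traceable, please credit the source when reposting: reposted from idouba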
