# Data Mining Notes: Clustering

### 1. Concepts

Cluster: A collection of data objects, similar (or related) to one another within the same group, dissimilar (or unrelated) to the objects in other groups.

(Jiawei Han)

Unsupervised learning: no predefined classes (i.e., learning by observations, as opposed to supervised learning, which is learning by examples).

### 2. Applications

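- As a stand-alone tool for gaining insight into the data distribution.
- As a preprocessing step for other methods.
- Market research, pattern recognition, data analysis, and image processing.
- In some applications, clustering is also called data segmentation, because it partitions a large data set into groups according to the similarity of the data.
- Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases.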

### 3. Clustering Techniques

#### Partitioning approach:

Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.

Typical methods: k-means, k-medoids, CLARANS
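
As an illustration of the partitioning idea, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It alternates between assigning points to the nearest centroid and recomputing centroids, then reports the sum of squared errors (SSE) as the evaluation criterion. The function names `sq_dist` and `kmeans`, the sample points, and the seed are assumptions made for this example, not part of the original notes.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct input points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Criterion: sum of squared distances from each point to its centroid.
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

# Illustrative data: two visually separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, round(sse, 3))
```

k-means only finds a local minimum of the SSE, so in practice it is usually run from several random initialisations and the lowest-SSE result is kept.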

#### Hierarchical approach:

Create a hierarchical decomposition of the set of data (or objects) using some criterion.

Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
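
To make the hierarchical decomposition concrete, the following is a rough sketch of bottom-up (agglomerative) clustering with single-link distance, in the spirit of AGNES. It is a naive implementation intended for small inputs only; the names `single_link` and `agglomerative`, the stopping rule based on `num_clusters`, and the sample points are assumptions for illustration.

```python
import math

def single_link(c1, c2):
    """Single-link distance: the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []  # merge distances in order, i.e. the heights of the dendrogram
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append(single_link(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters, merges

# Illustrative data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, num_clusters=3)
print(clusters)
print(merges)
```

A divisive method such as DIANA works in the opposite direction, starting from one cluster that contains everything and splitting it step by step.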

#### Density-based approach:

Based on connectivity and density functions.

Typical methods: DBSCAN, OPTICS, DENCLUE
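
The density and connectivity idea can be sketched with a compact version of DBSCAN. A point is a core point if at least `min_pts` points (itself included) lie within distance `eps`; clusters grow by following core points, and points reachable from no core point are marked as noise. The names `region_query`, `dbscan`, and `NOISE`, the brute-force neighbour search, and the sample data are assumptions for this example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster_id += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

# Illustrative data: two dense squares and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (5, 5),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, this approach does not need the number of clusters in advance, and the noise label doubles as a simple outlier detector.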

#### Grid-based approach:

Based on a multiple-level granularity structure.

Typical methods: STING, WaveCluster, CLIQUE
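
The sketch below shows only the core grid idea, at a single resolution: quantize 2-D points into cells, keep the cells that hold enough points, and join adjacent dense cells into clusters. It is not an implementation of STING, WaveCluster, or CLIQUE, which add multiple resolution levels, wavelet transforms, or subspace search on top of this. The name `grid_cluster` and the parameters `cell_size` and `min_count` are assumptions for illustration.

```python
from collections import Counter

def grid_cluster(points, cell_size, min_count):
    """Map each dense grid cell to a cluster id (single-resolution sketch)."""
    # Count how many points fall into each cell of the grid.
    cells = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {c for c, n in cells.items() if n >= min_count}
    labels = {}
    cluster_id = 0
    for cell in sorted(dense):
        if cell in labels:
            continue
        # Flood-fill over the 8-neighbourhood of dense cells.
        labels[cell] = cluster_id
        stack = [cell]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster_id
                        stack.append(nb)
        cluster_id += 1
    return labels

# Illustrative data: two adjacent dense cells, one distant dense cell, one stray point.
points = [
    (0.1, 0.1), (0.2, 0.3), (0.4, 0.2),
    (1.1, 0.2), (1.3, 0.4), (1.2, 0.1),
    (5.1, 5.2), (5.3, 5.4), (5.2, 5.1),
    (9.0, 0.5),
]
cell_labels = grid_cluster(points, cell_size=1.0, min_count=3)
print(cell_labels)
```

Because the work after binning depends on the number of cells rather than the number of points, grid methods tend to scale well to large data sets.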

#### Model-based approach:

A model is hypothesized for each cluster, and the method tries to find the best fit of the data to that model.

Typical methods: EM, SOM, COBWEB
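
As a small example of fitting a hypothesized model, the following sketch runs EM on a one-dimensional mixture of Gaussians. Each cluster is assumed to be a Gaussian with its own mean, variance, and mixing weight; the E-step computes soft memberships and the M-step re-estimates the parameters. The names `gaussian_pdf` and `em_gmm_1d`, the initial parameters, the variance floor of `1e-6`, and the sample data are assumptions for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture. Returns (means, variances, weights)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)  # effective number of points in component j
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
    return means, variances, weights

# Illustrative data: values scattered around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(m, 3) for m in means], [round(w, 3) for w in weights])
```

In contrast to k-means, each point here belongs to every cluster with some probability, which is what makes the assignment "soft".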

#### Frequent pattern-based approach:

Based on the analysis of frequent patterns.

Typical methods: p-Cluster

#### User-guided or constraint-based approach:

Clustering by considering user-specified or application-specific constraints.

Typical methods: COD (obstacles), constrained clustering

#### Link-based clustering:

Objects are often linked together in various ways.

### 4. Good Clustering Methods

A good clustering method will produce high-quality clusters:

- high intra-class similarity: cohesive within clusters
- low inter-class similarity: distinctive between clusters
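
One rough way to check these two properties is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below uses Euclidean distance as an inverse proxy for similarity, so a good clustering should show a small within-cluster value and a large between-cluster value. The function name `cohesion_and_separation` and the sample labelings are assumptions for illustration, and the function assumes at least one same-cluster pair and one cross-cluster pair exist.

```python
import math
from itertools import combinations

def cohesion_and_separation(points, labels):
    """Mean pairwise distance within clusters and between clusters."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: counts toward cohesion; different label: toward separation.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Illustrative data: two compact groups, labelled well and labelled badly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good_intra, good_inter = cohesion_and_separation(points, good_labels)
bad_intra, bad_inter = cohesion_and_separation(points, bad_labels)
print(round(good_intra, 2), round(good_inter, 2))
print(round(bad_intra, 2), round(bad_inter, 2))
```

Measures such as the silhouette coefficient combine both quantities into a single score per point.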