【基礎資料探勘】Ch3 Part1

Ch3 Classification & Model Overfitting

Classification

Def : 學習一模型，將每個Attribute Set $x$ 轉換成預定義的 Class Label $y$
Classification Model :
Classification Techniques :
- Base Classifiers : Decision Tree、Rule-based、Nearest-neighbor、NN、DL、Naïve Bayes、SVM
- Ensemble Classifiers(集成分類) : Boosting、Bagging、隨機森林
Decision Tree 之 Induction (歸納) — Hunt’s Algorithm :
- 假設 $D_t$ 是抵達 Node $t$ 的資料集 :
  1. 若 $D_t$ 的資料皆屬於 class $y_t$ : $t$ 即為 Leaf Node
  2. 若 $D_t$ 的資料屬於一個以上 class : 使用一 attribut e去切割資料成更小的 subsets
Expressing Test Conditions :
- Nominal Attributes : Multi-way split & Binary spilt
- Ordinal Attributes : Multi-way split & Binary spilt
- Continuous Attributes :
  - 分法 : Multi-way split & Binary spilt
  - 處理方式 :
    - Discretization : 建構出 Ordinal Categorical Attribute
      - 靜態 : 開始時進行一次
      - 動態 : 每個 Node 都做一次
    - Binary Decision : 切成 $(A \le v)$ 或 $(A \ge v)$
選擇更加純淨 (purer) 的分類節點 → 可決定出最優的 Spilt
Measures of Node Impurity (節點純淨度測量) :
- Information Gain 算法 : $$GAIN = P \text{(Spilt 前的純淨度)}- M \text{(Spilt 後的加權純淨度)}$$
  → $GAIN$越大越好
- GINI Index (GINI指數) :
  - 單一 Node : $GINI(t)=1-\sum_j {[p(j|t)]}^2$
  - GAIN 算法 : $GAIN_{split}=GINI(p)-\sum^k_{i=1} \frac{n_i}{n} GINI(i)$
  - 從 Continuous Attributes 轉成 Binary Attributes 之方式 :
    1. 依照 value 排序 Atrributes
    2. 兩兩 values 之間切割出一 Partition，並計算其 GINI Index
    3. 取 GINI Index 最小之 Partition 作為 Split position
- Entropy (熵) :
  - 單一 Node : $Entropy(t)=-\sum_j {[p(j|t)]}log_2{[p(j|t)]}$
  - GAIN 算法 : $GAIN_{split}=Entropy(p)-\sum^k_{i=1} \frac{n_i}{n} Entropy(i)$
  - GAIN Ratio : $GainRATIO_{split}=\frac{GAIN_{split}}{SplitINFO}$，其中$SplitINFO=-\sum_{i=1}^k\frac{n_i}{n}\cdot \log\frac{n_i}{n}$
    - Entropy partitioning 越高，懲罰越多
- Misclassification error (分類錯誤) :
  - 單一 Node : $Error(t)=1-max{P(i|t)}$
- 三種算法之差異 :
Decision Tree Based Classification :
- 優點 :
  - 建構成本低
  - 對未知資料進行分類速度快
  - 易解析較小顆的 Tree
  - 易抵擋 Noise
  - 易處理多餘或不相關的 Atrributes
- 缺點 :
  - 所佔空間可能很大
  - Greedy 方式可能找不到最好的樹
  - 沒有考慮 Attributes 之間的互相作用
  - 每個決策邊界只考慮單一 Attribute