1-nci60-pca-clustering-svm

Introduction

See [ISLR-nci60]: An Introduction to Statistical Learning.pdf, page 18

The workflow is pca -> clustering -> svm: a semi-supervised approach that first reduces the dimensionality of the data, then clusters the projected points, and finally trains an SVM classifier on the cluster labels.

1. load package

Code
import MLJ: transform, predict
using DataFrames, MLJ, CSV, MLJModelInterface, GLMakie, Random
Random.seed!(45454)
WARNING: using DataFrames.transform in module Main conflicts with an existing identifier.
TaskLocalRNG()

2. import data

Code
    df = CSV.File("./data/NCI60.csv") |> DataFrame |> dropmissing
    Xtr = df[:, 2:end]
    Xtr_labels = Vector(df[:, 1])
    # reuse every third row as a test set (note: these rows also remain in Xtr)
    Xte = df[1:3:end, 2:end]
    Xte_labels = Vector(df[1:3:end, 1])
    first(df, 10)
10×6831 DataFrame
6731 columns omitted
(first column `Column1` holds the String3 row labels V1–V10; the remaining 6830 columns are Float64 gene-expression values; full display truncated)
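The comment in the import block above says the data are split in half, but `Xtr` keeps every row and `Xte` is drawn from it, so the two sets overlap. A minimal sketch of a genuinely disjoint split under the same every-third-row scheme (plain Base Julia, index vectors only; the row count 64 is the NCI60 cell-line count):

```julia
# Rows 1, 4, 7, … go to the test set; the complement goes to training,
# so the two index sets share no rows.
n = 64                        # NCI60 has 64 cell lines
test_idx  = collect(1:3:n)
train_idx = setdiff(1:n, test_idx)
length(test_idx), length(train_idx)   # (22, 42)
```

These index vectors would then be used as `df[train_idx, 2:end]` and `df[test_idx, 2:end]`.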

3. MLJ Workflow

3.1 define models

Three models need to be defined:

  1. a PCA model
  2. a clustering model
  3. a classification model
Code
 PCA = @load PCA pkg=MultivariateStats
 KMeans = @load KMeans pkg=Clustering
 SVC = @load SVC pkg=LIBSVM

 model = PCA(maxoutdim=2)  # PCA model
 model2 = KMeans(k=3)      # clustering model
 model3 = SVC()            # SVM model
[ Info: For silent loading, specify `verbosity=0`. 
[ Info: For silent loading, specify `verbosity=0`. 
[ Info: For silent loading, specify `verbosity=0`. 
import MLJMultivariateStatsInterface ✔
import MLJClusteringInterface ✔
import MLJLIBSVMInterface ✔
SVC(
  kernel = LIBSVM.Kernel.RadialBasis, 
  gamma = 0.0, 
  cost = 1.0, 
  cachesize = 200.0, 
  degree = 3, 
  coef0 = 0.0, 
  tolerance = 0.001, 
  shrinking = true)

3.2 PCA

The PCA stage involves two steps:

  1. fit the PCA model (use 2 or 3 output dimensions if the goal is visualization)
  2. project the original data onto the reduced-dimensional space
Code
mach = machine(model, Xtr) |> fit!
Xproj = transform(mach, Xtr)
first(Xproj,10)
[ Info: Training machine(PCA(maxoutdim = 2, …), …).
10×2 DataFrame
Row x1 x2
Float64 Float64
1 -19.7958 0.115269
2 -21.5461 -1.45735
3 -25.0566 1.52609
4 -37.4095 -11.3895
5 -50.2186 -1.34617
6 -26.4352 0.462982
7 -27.3393 2.65031
8 -21.4897 4.95414
9 -20.8525 10.1631
10 -26.9529 21.4733
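Before trusting a 2-D projection of 6,830 features, it is worth checking how much variance the two retained components explain. A sketch, assuming the report fields (`principalvars`, `tvar`) exposed by MLJMultivariateStatsInterface's PCA:

```julia
# Inspect the fitted machine's report for the variance bookkeeping.
r = report(mach)
explained = r.principalvars ./ r.tvar   # fraction of total variance per component
println("variance explained per component: ", round.(explained; digits=3))
println("total retained: ", round(sum(explained); digits=3))
```

If the retained fraction is small, the 2-D scatter and decision boundary below should be read as a visualization aid rather than a faithful summary of the data.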

3.3 generate decision-boundary test data

Code
function boundary_data(df; n=200)
    n1 = n2 = n
    xlow, xhigh = extrema(df[:, :x1])
    ylow, yhigh = extrema(df[:, :x2])
    tx = range(xlow, xhigh; length=n1)
    ty = range(ylow, yhigh; length=n2)
    # every (x, y) pair on the grid, one point per column
    x_test = mapreduce(collect, hcat, Iterators.product(tx, ty))
    xtest = MLJ.table(x_test')
    return tx, ty, xtest
end
tx, ty, xtest = boundary_data(Xproj)  # xtest: grid points for drawing the decision boundary
(-50.21864152783379:0.5383986285981281:56.9226855631937, -51.362711708066826:0.3660101806028475:21.473314231899835, Tables.MatrixTable{LinearAlgebra.Adjoint{Float64, Matrix{Float64}}} with 40000 rows, 2 columns, and schema:
 :x1  Float64
 :x2  Float64)
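The grid construction inside `boundary_data` can be seen in miniature with plain Base Julia: `Iterators.product` pairs every x value with every y value, and `mapreduce(collect, hcat, …)` stacks the pairs as columns, giving n1*n2 points in column-major order.

```julia
# A 3×2 toy grid in place of the 200×200 one above.
tx = range(0.0, 1.0; length=3)
ty = range(0.0, 2.0; length=2)
pts = mapreduce(collect, hcat, Iterators.product(tx, ty))
size(pts)   # (2, 6): each column is one (x, y) grid point
```

The transpose `x_test'` in the function converts this column-per-point layout into the row-per-observation table that MLJ expects.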

3.4 Clustering and SVM training

Code
 mach2 = machine(model2, Xproj) |> fit!
 yhat = predict(mach2, Xproj)   # clustering result
 cat = yhat |> Array |> levels  # the distinct cluster labels

 mach3 = machine(model3, Xproj, yhat) |> fit!
 ypred = predict(mach3, xtest) |> Array |> d -> reshape(d, 200, 200)  # SVM predictions on the grid
[ Info: Training machine(KMeans(k = 3, …), …).
[ Info: Training machine(SVC(kernel = RadialBasis, …), …).
200×200 Matrix{Int64}
(predicted cluster labels 1–3 over the grid; full display truncated)
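Since the SVM is trained on the KMeans labels, a quick sanity check is its training agreement with those labels. A sketch assuming MLJ's `misclassification_rate` measure (the exact value depends on the random seed):

```julia
# How often does the fitted SVM disagree with the cluster labels
# it was trained on? 0.0 means the boundary separates the clusters perfectly.
yfit = predict(mach3, Xproj)
misclassification_rate(yfit, yhat)
```

A nonzero rate here only measures how cleanly the RBF boundary reproduces the clustering, not accuracy against the true `Xtr_labels`.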

3.5 plot results

Code
    function plot_model()
        fig = Figure()
        ax = Axis(fig[1, 1], title="NCI60 Machine Learning", subtitle="pca->clustering->svm")

        colors = [:red, :orange, :blue]
        contourf!(ax, tx, ty, ypred)  # SVM decision regions
        for (i, c) in enumerate(Array(yhat))
            data = Xproj[i, :]
            scatter!(ax, data.x1, data.x2; marker=:circle, markersize=12,
                     color=(colors[c], 0.3), strokewidth=1, strokecolor=:black)
            text!(ax, data.x1, data.x2; text="v$(i)")  # label each cell line
        end

        fig
        # save("NCI60 Machine Learning:pca->clustering->svm-with-tag.png", fig)
    end

    plot_model()