Data Distance

How similar two records are can be assessed by measuring the distance between them in the dataset.

Before doing so, let us draw a small sample from the data and label each record.

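# Load the first four rows of the leaf dataset and label them A-D
from pandas import DataFrame, read_csv
import itertools
import scipy.spatial.distance as spad

columns = ['Specimen Number', 'Eccentricity', 'Aspect Ratio', 'Elongation', 'Solidity']
df = read_csv('leaf.csv', nrows=4, usecols=columns)
# Map specimen numbers 1-4 to labels A-D; scale each feature by 2 and round to 2 decimals
labels = ["", "A", "B", "C", "D"]
data = [[labels[int(x[0])]] + [round(i * 2, 2) for i in x[1:]] for x in df.values.tolist()]
df = DataFrame(data, columns=['Class'] + columns[1:])
df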
  Class  Eccentricity  Aspect Ratio  Elongation  Solidity
0     A          1.45          2.95        0.65      1.97
1     B          1.48          3.05        0.72      1.96
2     C          1.53          3.15        0.78      1.96
3     D          1.48          2.92        0.71      1.95

Minkowski Distance

The Minkowski distance is the spatial distance between two records (x and y), where m is a real-valued parameter and n is the number of dimensions (features) of each record.

d_{\operatorname{minkowski}} = \left(\sum_{i=1}^{n}|x_{i}-y_{i}|^{m}\right)^{\frac{1}{m}},m\geq 1

Special cases:

  • If m = 1, it is called the Manhattan (Cityblock) distance.

d_{\operatorname{manhattan}} = \sum_{i=1}^{n}|x_{i}-y_{i}|

  • If m = 2, it is called the Euclidean distance.

d_{\operatorname{euclidean}} = \sqrt{\sum_{i=1}^{n}|x_{i}-y_{i}|^{2}}
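
To make the formula concrete, here is a minimal pure-Python sketch of the Minkowski distance, applied to records A and B from the sample table. The function name `minkowski_distance` is illustrative and not part of any library. The notebook itself relies on `scipy.spatial.distance` for these computations.

```python
def minkowski_distance(x, y, m):
    """Minkowski distance of order m (m >= 1) between two equal-length sequences."""
    if m < 1:
        raise ValueError("m must be >= 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum the m-th powers of the absolute differences, then take the m-th root
    return sum(abs(xi - yi) ** m for xi, yi in zip(x, y)) ** (1.0 / m)


# Feature values of records A and B from the sample table
a = [1.45, 2.95, 0.65, 1.97]
b = [1.48, 3.05, 0.72, 1.96]

manhattan = minkowski_distance(a, b, 1)  # m = 1: sum of absolute differences
euclidean = minkowski_distance(a, b, 2)  # m = 2: straight-line distance
print(round(manhattan, 6), round(euclidean, 6))
```

Setting m to 1 or 2 reproduces the two special cases without any separate code path, which is why a single function is enough here.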

The illustration shows the Cityblock (dashed line) and Euclidean (solid line) calculations visually:

# Compute the three Minkowski variants for every pair of labelled records
columns = ['v1-v2', 'Manhattan distance (M=1)', 'Euclidean distance (M=2)', 'Minkowski distance at M=3']
data2 = [(
    "{} - {}".format(a[0],b[0]),      # pair label, e.g. "A - B"
    spad.cityblock(a[1:],b[1:]),      # m = 1
    spad.euclidean(a[1:],b[1:]),      # m = 2
    spad.minkowski(a[1:],b[1:],3),    # m = 3
    )
    for a, b in itertools.combinations(data, 2)]
DataFrame(data2,columns=columns)
   v1-v2  Manhattan distance (M=1)  Euclidean distance (M=2)  Minkowski distance at M=3
0  A - B                      0.21                  0.126095                   0.111091
1  A - C                      0.42                  0.251794                   0.220426
2  A - D                      0.14                  0.076158                   0.065265
3  B - C                      0.21                  0.126886                   0.110275
4  B - D                      0.15                  0.130767                   0.130039
5  C - D                      0.36                  0.245764                   0.232918
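
As a check on the table, the A - B row can be reproduced by hand from the feature values of A (1.45, 2.95, 0.65, 1.97) and B (1.48, 3.05, 0.72, 1.96):

```latex
\begin{aligned}
|x_{i}-y_{i}| &: \; 0.03,\ 0.10,\ 0.07,\ 0.01 \\
d_{\operatorname{manhattan}} &= 0.03 + 0.10 + 0.07 + 0.01 = 0.21 \\
d_{\operatorname{euclidean}} &= \sqrt{0.0009 + 0.0100 + 0.0049 + 0.0001} = \sqrt{0.0159} \approx 0.126095 \\
d_{\operatorname{minkowski}}\,(m=3) &= \left(0.000027 + 0.001000 + 0.000343 + 0.000001\right)^{\frac{1}{3}} = 0.001371^{\frac{1}{3}} \approx 0.111091
\end{aligned}
```

The values shrink as m grows, which is consistent with the general property that the Minkowski distance between two fixed points does not increase with m.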

Average Distance


columns = ['v1-v2', 'Average Distance', 'Euclidean distance (M=2)', 'Minkowski M=3']
n = len(data[0]) - 1  # number of numeric features per record
data2 = [(
    "{} - {}".format(a[0],b[0]),
    # Assumed definition: root of the mean squared difference, i.e. Euclidean / sqrt(n)
    spad.euclidean(a[1:],b[1:]) / n**0.5,
    spad.euclidean(a[1:],b[1:]),
    spad.minkowski(a[1:],b[1:],3),
    )
    for a, b in itertools.combinations(data, 2)]
DataFrame(data2,columns=columns)
   v1-v2  Average Distance  Euclidean distance (M=2)  Minkowski M=3
0  A - B          0.063048                  0.126095       0.111091
1  A - C          0.125897                  0.251794       0.220426
2  A - D          0.038079                  0.076158       0.065265
3  B - C          0.063443                  0.126886       0.110275
4  B - D          0.065383                  0.130767       0.130039
5  C - D          0.122882                  0.245764       0.232918
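
The text does not state a formula for the average distance. The sketch below assumes one common definition, the square root of the mean squared difference between two records, which equals the Euclidean distance divided by √n. Both this definition and the function name `average_distance` are assumptions made for illustration, and the function is not a scipy routine.

```python
import math


def average_distance(x, y):
    """Square root of the mean squared difference (assumed definition)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / n)


# Feature values of records A and B from the sample table
a = [1.45, 2.95, 0.65, 1.97]
b = [1.48, 3.05, 0.72, 1.96]

d_ave = average_distance(a, b)
d_euclid = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(a, b)))
print(round(d_ave, 6), round(d_euclid / math.sqrt(len(a)), 6))
```

With n = 4 features, the result for A - B is half the Euclidean value in the table, roughly 0.126095 / 2.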

Weighted Distance

Chord Distance

Mahalanobis Distance

Cosine Measure

Pearson Correlation

Summary