VC: Multidimensional Scaling (MDS)
Intuition for MDS
MDS is a dimensionality reduction technique that maps pairwise distances to coordinates. It can turn non-geometric features into a geometric embedding. It is typically used for visualizing high-dimensional data and for preprocessing data before clustering.
As the figure above shows, knowing only the distances or differences between objects is enough to recover reasonable geometric coordinates. We only know the distances between the cities, yet MDS reconstructs almost perfect coordinates; at most we need to shift and rotate the result to match the US map.
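As a minimal sketch of this cities example, we can run scikit-learn's MDS on a precomputed distance matrix; the distance values below are rough, illustrative numbers, not exact figures.

```python
import numpy as np
from sklearn.manifold import MDS

# Rough, illustrative road distances (miles) between four US cities.
cities = ["New York", "Chicago", "Denver", "Los Angeles"]
D = np.array([
    [   0,  790, 1780, 2790],
    [ 790,    0, 1000, 2015],
    [1780, 1000,    0, 1015],
    [2790, 2015, 1015,    0],
], dtype=float)

# Metric MDS on a precomputed dissimilarity matrix, embedded into 2D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

for name, (x, y) in zip(cities, coords):
    print(f"{name:12s} ({x:8.1f}, {y:8.1f})")
```

The recovered layout matches the map only up to translation, rotation, and mirroring, which is exactly why we may need to shift and rotate it.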
Basic Concepts
The inputs of the algorithm are a symmetric dissimilarity (distance) matrix containing the pairwise dissimilarities between data samples (in the high-dimensional space) and the desired dimensionality k (k is smaller than the original dimension; for visualization k is usually 2 or 3). The output is an image of the data in a Euclidean frame. The data points in that frame are the best approximation of the original dissimilarities for the desired k dimensions.
The algorithm needs to find data points whose distances approximate the original dissimilarities in the high-dimensional space. The solution has to be invariant under rigid transformations, i.e. translation, rotation, and mirroring, because those do not change the distances. It also has to account for the n(n-1)/2 errors between the individual pairwise distances. Now we have to define the loss function to optimize, and we have three options:
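A common way to write these three criteria (the normalization constants below follow the usual textbook formulation and should be treated as a reconstruction; different sources scale them slightly differently):

$$
J_{ee} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2}, \qquad
J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2, \qquad
J_{ef} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}}
$$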
Here δij is the dissimilarity between input data points i and j in the high-dimensional space, and dij is the distance between points i and j in the desired k-dimensional embedding. Let's look at how the three error functions differ. Jee considers the absolute size of each deviation, so it punishes large absolute deviations. Jff takes the relative size into account, so it punishes large relative deviations. Jef, the so-called Sammon mapping, is a compromise between Jee and Jff. The distances in the output space are commonly L2 (Euclidean) distances.
MDS: Solution
The most common solution is gradient descent:
- Initialize the embedded points at random positions.
- Iteratively shift the points to reduce the stress (error function) until the overall configuration stabilizes.
- Shifts are obtained as some constant times the negative derivative of the stress with respect to the point positions (see the sketch after the derivatives below).
The derivatives of the error functions with respect to an embedded point yk:
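Written out (these follow the standard form consistent with the stress functions above; the exact constants are a reconstruction):

$$
\frac{\partial J_{ee}}{\partial \mathbf{y}_k} = \frac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \neq k} (d_{kj} - \delta_{kj})\, \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}, \qquad
\frac{\partial J_{ff}}{\partial \mathbf{y}_k} = 2 \sum_{j \neq k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}^2} \cdot \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}
$$

$$
\frac{\partial J_{ef}}{\partial \mathbf{y}_k} = \frac{2}{\sum_{i<j} \delta_{ij}} \sum_{j \neq k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}} \cdot \frac{\mathbf{y}_k - \mathbf{y}_j}{d_{kj}}
$$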
The derivatives determine where each embedded point (yk) moves in order to reduce the difference between the original distances and the embedded distances. Fix one point (yj) and let the other point (yk) move. When the embedded distance is longer than the desired (original) distance, the difference dkj − δkj is positive, and the negative sign of the derivative step pulls yk toward yj, making the embedded distance shorter. The opposite case follows the same logic: if the embedded distance is too short, yk is pushed away from yj.
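As a concrete illustration, here is a minimal NumPy sketch of gradient descent on the Jee stress; the function name, the internal rescaling of the dissimilarities, and the learning-rate and iteration defaults are my own choices, not part of the original algorithm description.

```python
import numpy as np

def mds_gradient_descent(delta, k=2, lr=0.2, n_iter=2000, seed=0):
    """Minimal sketch of metric MDS by gradient descent on the J_ee stress.

    delta : (n, n) symmetric dissimilarity matrix with a zero diagonal.
    Returns an (n, k) array of embedded coordinates, defined only up to
    translation, rotation, and mirroring.
    """
    n = delta.shape[0]
    scale = delta[delta > 0].mean()                 # rescale so the step size is unit-free
    D = delta / scale
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, k))                     # random initial embedded points
    norm = (D[np.triu_indices(n, 1)] ** 2).sum()    # denominator of J_ee

    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]        # y_k - y_j for every pair (k, j)
        d = np.linalg.norm(diff, axis=-1)           # embedded distances d_kj
        np.fill_diagonal(d, 1.0)                    # avoid division by zero on the diagonal
        w = (d - D) / d                             # (d_kj - delta_kj) / d_kj
        np.fill_diagonal(w, 0.0)
        grad = (2.0 / norm) * (w[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                              # shift points along the negative gradient
    return Y * scale                                # undo the rescaling
```

Running `mds_gradient_descent(D)` on the city distance matrix above should recover a layout similar to the scikit-learn result, again only up to a rigid transformation.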
MDS via Kernel PCA
We can compute MDS more easily via kernel PCA (check my post on it) if we assume the input (dissimilarity) matrix contains squared Euclidean distances between points in an unknown feature space. Kernel PCA works with dot products of the features.
We want to convert the matrix of squared distances into inner products between the underlying vectors, just as in kernel PCA, where the centered K consists of inner products.
In kernel PCA the centered K is computed from inner products of centered vectors; in MDS we instead assume that the (unknown) points behind the Euclidean distances are centered at 0.
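Concretely, if the (implicitly zero-centered) points xi generate the squared distances in Δ, then the spelled-out identity is (this expansion is my addition):

$$
\delta_{ij}^2 = \|x_i - x_j\|^2 = x_i^\top x_i + x_j^\top x_j - 2\, x_i^\top x_j,
\qquad
H = I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top
$$

where Kij = xi·xj is the matrix of inner products (the centered kernel) and H is the centering matrix used next.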
The only difference is the factor −1/2, and we can express the centering with the matrix H. Therefore, the centered K is −1/2 HΔH, where Δ is the input matrix of squared dissimilarities. Now we apply an eigenvalue decomposition to the centered K and read off the desired coordinates. We have to take care with the eigenvalues: they should be positive, and if we cannot eliminate the negative ones we need to fall back to a gradient descent method.
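Putting the steps together, here is a minimal NumPy sketch of this classical-MDS route; the function name and the clipping of negative eigenvalues are my own choices (the alternative, as noted above, is to fall back to gradient descent).

```python
import numpy as np

def classical_mds(delta_sq, k=2):
    """Minimal sketch of classical MDS via the double-centered matrix.

    delta_sq : (n, n) matrix of *squared* pairwise Euclidean dissimilarities.
    """
    n = delta_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - (1/n) 1 1^T
    K = -0.5 * H @ delta_sq @ H                # centered kernel K = -1/2 H Delta H
    eigval, eigvec = np.linalg.eigh(K)         # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:k]         # keep the k largest eigenvalues
    eigval, eigvec = eigval[idx], eigvec[:, idx]
    # Negative eigenvalues mean the dissimilarities are not exactly Euclidean;
    # here we simply clip them to zero (the alternative is gradient descent).
    eigval = np.clip(eigval, 0.0, None)
    return eigvec * np.sqrt(eigval)            # coordinates Y = U_k Lambda_k^{1/2}
```

For example, `classical_mds(D**2)` on the squared city distances from the first example should give essentially the same layout, up to a rigid transformation.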
We have seen how to compute MDS using two methods: gradient descent and kernel PCA.
This post was published on 9/8/2020.