MDSCUDA

Multidimensional Scaling

Multidimensional Scaling (MDS) is an algorithm that takes a set of points in M-dimensional space and attempts to embed them into a lower-dimensional space while preserving pairwise distances as accurately as possible. More precisely, if z_1, \dots, z_n denote the original points in M-dimensional space and x_1, \dots, x_n denote their embeddings into m-dimensional space, MDS tries to minimize:

    \[\sigma(X) = \sum_{i=1}^n\sum_{j = 1}^n (|z_i - z_j| - |x_i - x_j|)^2.\]
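
To make the loss concrete, here is a minimal NumPy sketch of the stress computation (the names Z and X are just placeholders for the matrices of original and embedded points):

    import numpy as np

    def stress(Z, X):
        # Raw stress: squared differences between the pairwise distances
        # of the original points Z (n x M) and of the embedding X (n x m)
        dZ = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return np.sum((dZ - dX) ** 2)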

Naturally, MDS has applications to dimension reduction for datasets with a large number of features, as well as to data visualization.

Aside: one interesting thing to note here is that because the loss function depends only on the pairwise distances, it is possible to start with just a matrix of pairwise distances \delta and use MDS to try to reconstruct an arrangement of points in space whose pairwise distance matrix is equal to or close to \delta.

The SMACOF algorithm

The MDS loss function can be minimized with a general-purpose algorithm like gradient descent. However, there is a much more efficient algorithm created specifically for MDS, called the SMACOF algorithm, which stands for “Scaling by MAjorizing a COmplicated Function” (see de Leeuw, J. (1977), “Applications of convex analysis to multidimensional scaling”).

If you aren’t interested in seeing the details of the algorithm, you can skip to the next section.

Before describing the SMACOF algorithm, we need some notation.

Let z_i denote a set of n points in \mathbb{R}^M, and let Z denote the corresponding n\times M matrix whose rows are z_i. Suppose we wish to apply MDS to Z to obtain an embedding z_i\mapsto x_i into \mathbb{R}^m. We will represent the image of the embedding as an n\times m matrix X.

For any matrix X, let d_{ij}(X) denote the Euclidean distance between row i and row j. Set \delta_{ij} = d_{ij}(Z).

Further, for any matrix X with n rows, define an n\times n matrix B(X) by

    \[b_{ij} = \left\{ \begin{array}{ll} \frac{\delta_{ij}}{d_{ij}(X)} & i \neq j, d_{ij}(X) \neq 0 \\ 0 & i \neq j, d_{ij}(X) = 0 \\ -\sum_{k \neq i}b_{ik} & i = j \\ \end{array} \right.\]
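
In code, B(X) can be assembled directly from this definition. Here is a NumPy sketch, where delta is the precomputed matrix of the \delta_{ij}:

    import numpy as np

    def b_matrix(delta, X):
        # d_ij(X): pairwise Euclidean distances between the rows of X
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(d != 0, delta / d, 0.0)   # delta_ij / d_ij(X), or 0 where d_ij(X) = 0
        np.fill_diagonal(B, 0.0)                   # clear the i = j entries first
        np.fill_diagonal(B, -B.sum(axis=1))        # b_ii = -sum_{k != i} b_ik
        return B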

The SMACOF algorithm:

  • Initialize an n\times m matrix X_0 randomly or with a specific first guess.
  • Iteratively set X_k = \frac{1}{n}B(X_{k-1})X_{k-1}.
  • Run for a set number of iterations, or until a stopping condition is met, such as the loss not decreasing by more than a given threshold. Set X equal to X_k where k is the final iteration. Then the rows of X give the image of the embedding into \mathbb{R}^m.

You can find a proof of convergence, and in fact of the monotone decrease of \sigma(X_k), in de Leeuw’s paper cited previously. Note: the paper covers a more general case and sets X_k = V^+B(X_{k-1})X_{k-1}, where V^+ is an n\times n matrix defined in the paper; however, one can check that with our setup, V^+B(X) = \frac{1}{n}B(X).
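
Putting the pieces together, here is a self-contained NumPy sketch of the iteration described above. This is only an illustration of the update rule, not mdscuda’s actual implementation:

    import numpy as np

    def pairwise_distances(X):
        # Euclidean distance matrix d_ij(X) for the rows of X
        return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    def smacof(Z, m=2, n_iter=50, seed=0):
        # SMACOF iteration X_k = (1/n) B(X_{k-1}) X_{k-1}
        rng = np.random.default_rng(seed)
        n = Z.shape[0]
        delta = pairwise_distances(Z)        # target distances delta_ij
        X = rng.normal(size=(n, m))          # random initial embedding X_0
        for _ in range(n_iter):
            d = pairwise_distances(X)
            with np.errstate(divide="ignore", invalid="ignore"):
                B = np.where(d != 0, delta / d, 0.0)
            np.fill_diagonal(B, 0.0)
            np.fill_diagonal(B, -B.sum(axis=1))
            X = (B @ X) / n                  # the update step
        return X

    # Example: embed 100 random 10-dimensional points into the plane
    Z = np.random.default_rng(1).normal(size=(100, 10))
    X = smacof(Z, m=2)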

CUDA implementation

Sklearn has an open-source implementation of MDS using the SMACOF algorithm, but I found it to be painfully slow on large datasets. However, because the SMACOF algorithm mainly consists of repeated matrix multiplication, it is a good candidate to be parallelized and run on the GPU. I’ve implemented a fast, parallelized version of the SMACOF algorithm using CUDA and released it on PyPI under the name “mdscuda” (pip install mdscuda).
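
To illustrate why this parallelizes well, here is a rough sketch of a single update step on the GPU using CuPy; mdscuda’s internals are not necessarily structured this way, and CuPy is used here only because it mirrors the NumPy code above:

    import cupy as cp  # GPU array library; assumes a CUDA-capable GPU

    def guttman_step(delta, X):
        # One SMACOF update X <- (1/n) B(X) X, with all work done on the GPU
        n = X.shape[0]
        diff = X[:, None, :] - X[None, :, :]
        d = cp.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances d_ij(X)
        B = cp.where(d != 0, delta / d, 0.0)
        cp.fill_diagonal(B, 0.0)
        cp.fill_diagonal(B, -B.sum(axis=1))
        return (B @ X) / n                      # dense matmul, well suited to the GPU

    # delta_gpu = cp.asarray(delta); X_gpu = cp.asarray(X0)
    # for _ in range(50): X_gpu = guttman_step(delta_gpu, X_gpu)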

As a basic benchmark, I ran mdscuda on a randomly generated dataset (numpy.random.normal) with 10,000 samples and 1000 features, on my AMD Ryzen 5 2600 CPU, Nvidia RTX 2080 Ti GPU, and 16 GB RAM. I reduced the dimensionality to 3, running 50 iterations of the SMACOF algorithm. mdscuda completed the task in 1.5 seconds, while sklearn took 125 seconds – a speedup of about 83 times. I think there is still room for improvement, but in practical terms, mdscuda runs fast enough for my use cases. Memory actually becomes the bigger issue, and I’ve taken steps to reduce the memory usage of mdscuda, such as storing symmetric matrices in “longform” (as a 1D array holding only the values of the upper triangle of the matrix).
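
For reference, the sklearn side of this benchmark can be set up roughly as follows (timings will of course vary with hardware); the mdscuda call is omitted here since its exact API is best taken from the package README:

    import time
    import numpy as np
    from sklearn.manifold import MDS

    # Random dataset matching the benchmark: 10,000 samples, 1000 features
    X = np.random.normal(size=(10_000, 1_000))

    mds = MDS(n_components=3, max_iter=50, n_init=1, dissimilarity="euclidean")
    t0 = time.time()
    embedding = mds.fit_transform(X)
    print(f"sklearn SMACOF: {time.time() - t0:.1f} s, final stress: {mds.stress_:.3e}")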

In terms of comparing the quality of the embedding between mdscuda and sklearn, the final loss values for the benchmark were 11,014,284,288 for mdscuda and 11,022,683,577 for sklearn – no real difference. Note that for this dataset we expect the losses to be high, since data randomly generated in a high-dimensional space should not be expected to reduce easily to a low dimension.

While a randomly generated dataset is great for benchmarks, it isn’t very interesting to visualize. Below I’ll cover applications of MDS to some real datasets with some nice visualizations made with plotly.

Iris

The Iris dataset is one of the simplest and best-known datasets for classification tasks. It consists of 150 samples falling into 3 classes (3 different types of Iris plants) with 4 features – sepal length, sepal width, petal length, and petal width. One of the three classes is linearly separable from the others, while the remaining two are not.

The Iris dataset is already low-dimensional, meaning the number of samples exceeds the number of features. However, it provides a nice dataset for testing out visualizations, and since it has 4 features, we still need to apply some sort of dimension reduction technique to visualize it. I ran MDS to reduce to 2 dimensions and used plotly to create a scatterplot of the MDS embeddings, with each point color coded by class. You can see that the three classes are clustered nicely, though the two classes that are not linearly separable blur together somewhat. If you click on the image, it will take you to an interactive version – you can zoom in, hover, and click the legend on the right to hide some of the classes.

https://sethebaldwin.github.io/iris-mdscuda/
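
For reference, a plot like this can be produced along the following lines; to keep the sketch self-contained, the embedding step uses sklearn’s MDS rather than mdscuda, which the linked plot was actually made with:

    import plotly.express as px
    from sklearn.datasets import load_iris
    from sklearn.manifold import MDS

    iris = load_iris()
    X2 = MDS(n_components=2).fit_transform(iris.data)   # 4 features -> 2 dimensions

    # Scatterplot of the 2D embedding, color coded by Iris species
    fig = px.scatter(x=X2[:, 0], y=X2[:, 1],
                     color=[iris.target_names[t] for t in iris.target],
                     labels={"x": "MDS 1", "y": "MDS 2", "color": "species"})
    fig.show()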

Khan

The Khan Gene dataset is a good example of a high-dimensional dataset. In this dataset, 63 patients each had 2308 gene expression measurements recorded, and each patient had one of 4 types of cancerous tumors. Since this dataset has a large number of features, I ran MDS to generate a 3-dimensional embedding and made a scatterplot with each point color coded by class. Again, if you click on the image, it will take you to an interactive version – you can zoom in, rotate, hover, and click the legend on the right to hide classes. Looking around, you can see that members of each class tend to cluster together, although where the boundary lines should be is not clear without looking at the color labels.

https://sethebaldwin.github.io/khan-mdscuda/

Digits

The last dataset we will cover is the Digits dataset. This is a sort of mini MNIST, consisting of 1797 images of handwritten digits 0 through 9. The images are 8×8, so the dataset has 64 features in total. I again ran MDS to generate a 3-dimensional embedding. This dataset has many more samples than the previous ones, so the resulting visualization is much busier, but you can again see the same phenomenon where each class tends to cluster together. Again, if you click on the image, you will be taken to an interactive version.

https://sethebaldwin.github.io/digits-mdscuda/
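
The Digits picture can be reproduced along similar lines (again using sklearn’s MDS in place of mdscuda to keep the sketch self-contained):

    import plotly.express as px
    from sklearn.datasets import load_digits
    from sklearn.manifold import MDS

    digits = load_digits()                                # 1797 samples, 64 features
    X3 = MDS(n_components=3).fit_transform(digits.data)  # embed into 3 dimensions

    # Interactive 3D scatterplot, color coded by digit class
    fig = px.scatter_3d(x=X3[:, 0], y=X3[:, 1], z=X3[:, 2],
                        color=digits.target.astype(str),
                        labels={"x": "MDS 1", "y": "MDS 2", "z": "MDS 3", "color": "digit"})
    fig.update_traces(marker_size=3)
    fig.show()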