PyMDE: Minimum-Distortion Embedding#

PyMDE is a Python library for computing vector embeddings of items, such as images, biological cells, nodes in a network, or any other type of abstract object. The embeddings are designed to minimally distort relationships between pairs of items, while possibly satisfying some constraints.

For more background about the methods that PyMDE uses, please see our monograph Minimum-Distortion Embedding, which introduces a simple but general framework for embedding.

You can use PyMDE to:

  • visualize datasets, small or large;

  • embed into arbitrarily large dimensions (much greater than 2 or 3);

  • generate feature vectors for supervised learning;

  • compress high-dimensional vector data;

  • draw graphs;

  • and more.

Unlike other popular embedding methods, PyMDE also lets you

  • compute large embeddings quickly, using a GPU;

  • fit embeddings that satisfy constraints, such as having uncorrelated feature columns;

  • design custom embeddings (and recreate well-known ones, like PCA, Laplacian eigenmap, LargeVis, and UMAP);

  • sanity-check embeddings using principled methods;

  • look for outliers in original data.

Example#

Here’s an example that shows how to use PyMDE to visualize high-dimensional vectors (which in this case are images from the MNIST dataset).

import pymde

mnist = pymde.datasets.MNIST()
embedding = pymde.preserve_neighbors(mnist.data, embedding_dim=2, verbose=True).embed()
pymde.plot(embedding, color_by=mnist.attributes['digits'])
_images/mnist.png

Each point represents an image in the dataset, and every point is colored by the digit represented by its underlying image. However, only the raw pixel data was used to create the embedding. This example embedded into two dimensions, allowing us to plot the result, but PyMDE supports embedding into much larger dimensions.

This is just one of the many things that PyMDE can do.

Why use PyMDE?#

There are many reasons to try out PyMDE. Here are some.

PyMDE is expressive. PyMDE is based on a simple but general framework for embedding, called Minimum-Distortion Embedding (MDE). The MDE framework generalizes well-known methods like PCA, spectral embedding, multi-dimensional scaling, LargeVis, and UMAP. With PyMDE, it is easy to recreate well-known embeddings and to create new ones, tailored to your particular application.

Unlike other embedding methods, PyMDE lets you impose constraints on the embedding. For example, the below GIFs show PyMDE computing three different kinds of embeddings of MNIST. The first and third embeddings are constrained to be standardized (which roughly means they have uncorrelated features), while the second is unconstrained. Standardized embeddings are favorably scaled for downstream machine learning tasks, such as supervised learning.

_images/mnist_quadratic.gif _images/mnist_centered.gif _images/mnist_std.gif

PyMDE is easy to use. It’s easy to get started with PyMDE. With just one line of code, you can embed high-dimensional vectors or the nodes of a graph into any target dimension. Creating custom embeddings is also easy, because PyMDE separates the description of an embedding from how it is computed: you say what kind of embedding you want, and PyMDE will compute it for you.

In addition to computing embeddings, PyMDE can plot them (if the embedding dimension is three or less), create GIFs of the embedding process, help you compare different embeddings, preprocess your data, and more.

PyMDE is principled. The MDE framework provides principled methods to debug and analyze embeddings. These methods can be used to make your embeddings more faithful to your original data, and to search for anomalies and outliers in the original data.

PyMDE is fast. PyMDE is competitive in runtime with more specialized embedding methods, like UMAP, and can scale to datasets with millions of items. With a GPU, it can be even faster. PyMDE scales well with the embedding dimension, meaning that you can easily embed into dimensions such as 50, 100, or 250.

Finally, PyMDE can draw graphs in orders of magnitude less time than libraries like NetworkX.