Getting started#

This page shows how to get started with PyMDE for four common tasks:

  • visualizing datasets in two or three dimensions;

  • generating feature vectors for supervised learning;

  • computing classical embeddings, like PCA and spectral embedding;

  • drawing graphs in two or three dimensions.

Note

To learn how to create custom embeddings (with custom objective functions and constraints), sanity check embeddings, identify possible outliers in the original data, embed new data, and more, see the MDE guide. (We recommend reading the getting started guide first.)

What is an embedding?#

An embedding of a finite set of items (such as biological cells, images, words, nodes in a graph, or any other abstract object) is an assignment of each item to a vector of fixed length; the original items are embedded or mapped into a real vector space. The length of the vectors is called the embedding dimension. An embedding is represented concretely by a matrix, in which each row is the embedding vector of an item.
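
For example, a toy embedding of three items into two dimensions is just a 3-by-2 matrix (the numbers below are made up for illustration):

import torch

embedding = torch.tensor([
    [0.0, 1.2],   # embedding vector of item 0
    [-0.5, 0.3],  # embedding vector of item 1
    [1.1, -0.7],  # embedding vector of item 2
])
# The number of rows is the number of items; the number of
# columns (2) is the embedding dimension.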

Embeddings provide concrete numerical representations of abstract items, for use in downstream computational tasks. For example, when the embedding dimension is 2 or 3, embeddings can be used to create a sort of chart or atlas of the items. In such a chart, each point corresponds to an item, and its coordinates in space are given by the embedding vector. These visualizations can help scientists and analysts identify patterns or anomalies in the original data, and more generally make it easier to explore large collections of data. PyMDE can embed into 2 or 3 dimensions, but it can also be used to embed into many more dimensions, which is useful when generating features for machine learning tasks.

For an embedding to be useful, it must be faithful to the original data (the items) in some way. To make it easy to get started, PyMDE provides two high-level functions for creating embeddings, based on related but different notions of faithfulness. These functions handle the common case in which each item is associated with either an original high-dimensional vector or a node in a graph. The functions are

  • pymde.preserve_neighbors

  • pymde.preserve_distances

The first creates embeddings that focus on the local structure of the data, putting similar items near each other and dissimilar items not near each other. The second focuses more on the global structure, choosing embedding vectors to respect some notion of original distance or dissimilarity between items.

We’ll see how to use these functions below.

Visualizing data#

When the embedding dimension is 2 (or 3), embeddings can be used to visualize large collections of items. These visualizations can sometimes lead to new insights into the data.

Preserving neighbors#

Let’s create an embedding that preserves the local structure of some data, using the pymde.preserve_neighbors function. This function is based on preserving the k-nearest neighbors of each original vector (where k is a parameter that by default is chosen on your behalf).

We’ll use the MNIST dataset, which contains images of handwritten digits, as an example. The original items are the images, and each item (image) is represented by an original vector containing the pixel values.

import pymde

mnist = pymde.datasets.MNIST()

Next, we embed.

mde = pymde.preserve_neighbors(mnist.data, verbose=True)
embedding = mde.embed(verbose=True)

The first argument to preserve_neighbors is the data matrix: there are 70,000 images, each represented by a vector of length 784, so mnist.data is a torch.Tensor of shape (70,000, 784). The optional keyword argument verbose=True turns on helpful messages about what the function is doing. The embedding dimension is 2 by default.

The function returns a pymde.MDE object, which can be thought of as describing the kind of embedding we would like. To compute the embedding, we call the embed method of the mde object. This returns a torch.Tensor of shape (70,000, 2), in which embedding[k] is the embedding vector assigned to the image mnist.data[k].
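
As a quick sanity check, we can confirm the type and shape of the result:

print(type(embedding))  # <class 'torch.Tensor'>
print(embedding.shape)  # torch.Size([70000, 2])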

We can visualize the embedding with a scatter plot. In the scatter plot, we’ll color each point by the digit represented by the underlying image.

pymde.plot(embedding, color_by=mnist.attributes['digits'])
[Figure: scatter plot of the 2D MNIST embedding, colored by digit]

We can see that similar images are near each other in the embedding, while dissimilar images are not.

It is also possible to embed into three or more dimensions. Here is an example with three dimensions.

mde = pymde.preserve_neighbors(mnist.data, embedding_dim=3, verbose=True)
embedding = mde.embed(verbose=True)
pymde.plot(embedding, color_by=mnist.attributes['digits'])
[Figure: the 3D MNIST embedding, colored by digit]

Customizing embeddings#

The pymde.preserve_neighbors function takes a few keyword arguments that can be used to customize the embedding. For example, you can impose a pymde.Standardized constraint: this causes the embedding to have uncorrelated columns, and prevents it from spreading out too much.

embedding = pymde.preserve_neighbors(mnist.data, constraint=pymde.Standardized()).embed()
pymde.plot(embedding, color_by=mnist.attributes['digits'])
[Figure: the standardized MNIST embedding, colored by digit]

To learn about the other keyword arguments, read the tutorial on MDE, then consult the API documentation.

For more in-depth examples of creating neighborhood-based visualizations, including 3D embeddings, see the MNIST and single-cell genomics example notebooks.

Accessing the underlying graph#

You can access the graph underlying the MDE problem returned by pymde.preserve_neighbors, using the following code.

edges = mde.edges
weights = mde.distortion_function.weights

The value weights[i] is the weight for the edge edges[i].
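
In a neighbor-based problem, pairs of similar items typically receive positive weights (pulling them together in the embedding), while sampled pairs of dissimilar items receive negative weights (pushing them apart); see the MDE guide for details. Here is a minimal sketch that counts the two kinds of pairs, reusing the mde object from above:

n_attractive = (weights > 0).sum().item()  # pairs encouraged to be close
n_repulsive = (weights < 0).sum().item()   # pairs encouraged to be far apart
print(n_attractive, n_repulsive)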

Preserving distances#

Next, we’ll create an embedding that roughly preserves the global structure of the original data, by preserving known original distances between some pairs of items. We will embed the nodes of an unweighted graph, using the length of the shortest path between two nodes as the original distance.

The specific graph we’ll use is an academic coauthorship graph from Google Scholar: the nodes are authors (with h-index at least 50), and two authors are connected by an edge if either has listed the other as a coauthor.

import pymde
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
google_scholar = pymde.datasets.google_scholar()
mde = pymde.preserve_distances(google_scholar.data, device=device, verbose=True)
embedding = mde.embed()

The data attribute of the google_scholar dataset is a pymde.Graph object, which encodes the coauthorship network. The pymde.preserve_distances function returns a pymde.MDE object, and calling the embed method computes the embedding.

Notice that we passed in a device to pymde.preserve_distances; this embedding approximately preserves over 80 million distances, so using a GPU can speed things up.

Next we plot the embedding, coloring each point by how many coauthors the author has in the network (normalized to be a percentile).

pymde.plot(embedding, color_by=google_scholar.attributes['coauthors'])
[Figure: embedding of the Google Scholar coauthorship graph, colored by coauthor percentile]

The most collaborative authors are near the center of the embedding, and less collaborative ones are on the fringe. It also turns out that the diameter of the embedding is close to the true diameter of the graph.

For a more in-depth study of this example, see the notebook on Google Scholar.

Customizing embeddings#

The pymde.preserve_distances function takes a few keyword arguments that can be used to customize the embedding.

To learn about the keyword arguments, read the tutorial on MDE, then consult the API documentation.

Accessing the underlying graph#

You can access the graph underlying the MDE problem returned by pymde.preserve_distances, using the following code.

edges = mde.edges
distances = mde.distortion_function.deviations

The value distances[i] is the weight (which should be interpreted as a distance) for the edge edges[i].
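
For example, here is a small sketch that inspects these values, reusing the distances tensor from above. Since each entry is a shortest-path distance between a sampled pair of nodes, the largest entry is a lower bound on the graph’s diameter:

print(distances.shape)         # one distance per edge
print(distances.max().item())  # a lower bound on the graph diameter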

Plotting#

Scatter plots#

The pymde.plot function can be used to plot embeddings with dimension at most 3. It takes an embedding as its argument, as well as a number of optional keyword arguments. For example, to plot an embedding and color each point by some attribute, use:

pymde.plot(embedding, color_by=attribute)

The attribute variable is a NumPy array of length embedding.shape[0], in which attribute[k] is a tag or numerical value associated with item k. For example, in the MNIST data, each entry in attribute is an int between 0 and 9 representing the digit depicted in the image; for single-cell data, each entry might be a string describing the type of cell. Typically the attribute is not used to create the embedding, so coloring by it is a sanity-check that the embedding has preserved prior knowledge about the original data.
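
For example, here is a toy illustration with a made-up embedding and hypothetical cell-type tags (none of these values come from a real dataset):

import numpy as np
import pymde
import torch

embedding = torch.randn(3, 2)                         # a made-up 2D embedding of 3 items
attribute = np.array(['T cell', 'B cell', 'T cell'])  # hypothetical cell-type tags
pymde.plot(embedding, color_by=attribute)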

This function can be configured with a number of keyword arguments, which can be seen in the API documentation.

Movies#

The pymde.MDE.play method can be used to create an animated GIF of the embedding process. To create a GIF, first call pymde.MDE.embed with the snapshot_every keyword argument, then call play:

mde.embed(snapshot_every=1)
mde.play(savepath='/path/to/file.gif')

The snapshot_every=1 keyword argument instructs the MDE object to take a snapshot of the embedding during every iteration of the solution algorithm. The play method generates the GIF, and saves it to savepath.

This method can be configured with a number of keyword arguments, which can be seen in the API documentation.

Generating feature vectors#

The embeddings made via pymde.preserve_neighbors and pymde.preserve_distances can be used as feature vectors for supervised learning tasks. You can choose the dimension of the vectors by specifying the embedding_dim keyword argument, e.g.,

embedding = pymde.preserve_neighbors(data, embedding_dim=50).embed()
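
For example, here is a minimal sketch of using such an embedding as features for a classifier. The use of scikit-learn, logistic regression, and a train/test split here is an illustrative choice, not part of PyMDE:

import numpy as np
import pymde
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

mnist = pymde.datasets.MNIST()
features = pymde.preserve_neighbors(mnist.data, embedding_dim=50).embed()

# Convert to NumPy arrays for scikit-learn.
X = features.detach().cpu().numpy()
y = np.asarray(mnist.attributes['digits'])

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out images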

Classical embeddings#

PyMDE provides a few implementations of classical embeddings, for convenience. To produce a PCA embedding of a data matrix, use the pymde.pca method, which returns an embedding:

embedding = pymde.pca(data_matrix, embedding_dim)

To create a Laplacian embedding based on the nearest neighbors of each row in a data matrix or each node in a graph, use the pymde.laplacian_embedding method, which returns an MDE problem:

mde = pymde.laplacian_embedding(data, embedding_dim, verbose=True)
embedding = mde.embed()

To create a spectral embedding based on a sequence of edges (a torch.Tensor of shape (n_edges, 2)) and weights, use pymde.quadratic.spectral. (These embeddings are called “quadratic embeddings” in the MDE monograph.)
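
For example, here is a minimal sketch of a spectral embedding of a small cycle graph. We assume here that the arguments are the number of items, the embedding dimension, the edges, and the weights, in that order; consult the API documentation for the exact signature:

import pymde
import torch

edges = torch.tensor([[0, 1], [0, 2], [1, 2]])  # a cycle graph on 3 nodes
weights = torch.ones(edges.shape[0])            # uniform edge weights

# Assumed argument order: n_items, embedding_dim, edges, weights.
embedding = pymde.quadratic.spectral(3, 2, edges, weights)
print(embedding.shape)  # (3, 2)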

Drawing graphs#

PyMDE can be used to draw graphs in 2 or 3 dimensions. Here is a very simple example that draws a cycle graph on 3 nodes.

import pymde
import torch

edges = torch.tensor([
    [0, 1],
    [0, 2],
    [1, 2],
])
triangle = pymde.Graph.from_edges(edges)
triangle.draw()
[Figure: drawing of the cycle graph on 3 nodes]

Here is a more interesting example, which embeds a ternary tree. The tree is created using the NetworkX package.

import networkx

ternary_tree = networkx.balanced_tree(3, 6)
graph = pymde.Graph(networkx.adjacency_matrix(ternary_tree))
embedding = graph.draw()
[Figure: drawing of the balanced ternary tree]

On a standard CPU, it takes PyMDE just 2 seconds to compute this layout; for comparison, it takes NetworkX 30 seconds to compute a similar layout.

You can embed into 3 dimensions by passing embedding_dim=3 to the draw method.
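
For example, reusing the graph object from the ternary tree example above:

embedding_3d = graph.draw(embedding_dim=3)
print(embedding_3d.shape)  # (number of nodes, 3)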

For more in-depth examples, see the notebook on drawing graphs, and the API documentation of pymde.Graph.

Using a GPU#

If you have a CUDA-enabled GPU, you can use it to speed up the optimization routine which computes the embedding.

The functions pymde.preserve_neighbors and pymde.preserve_distances, as well as the method Graph.draw, all take a keyword argument called device, which controls whether a GPU is used. Pass device='cuda' to use your GPU. (PyMDE computes embeddings on the CPU by default.)

For example, the code below shows how to create a neighbor-preserving embedding of MNIST using a GPU.

import pymde

mnist = pymde.datasets.MNIST()
mde = pymde.preserve_neighbors(mnist.data, device='cuda', verbose=True)
embedding = mde.embed(verbose=True)

On an NVIDIA GeForce GTX 1070, the embed method took just 5 seconds.

Reproducibility#

PyMDE’s optimization algorithm does not rely on randomness. However, some functions, such as pymde.preserve_neighbors, may use random sampling when generating the edges for an MDE problem. This means that if you call pymde.preserve_neighbors multiple times on the same dataset, you might obtain slightly different MDE problems and therefore slightly different embeddings. To prevent this, you can set the random seed via the pymde.seed function.

For example, the following code block will always create the same embedding.

import pymde

mnist = pymde.datasets.MNIST()

pymde.seed(0)
mde = pymde.preserve_neighbors(mnist.data, verbose=True)
embedding = mde.embed(verbose=True)