# Getting started

This page shows how to get started with PyMDE for four common tasks:

- visualizing datasets in two or three dimensions;
- generating feature vectors for supervised learning;
- computing classical embeddings, like PCA and spectral embedding;
- drawing graphs in two or three dimensions.

> **Note**
>
> To learn how to create custom embeddings (with custom objective functions and constraints), sanity check embeddings, identify possible outliers in the original data, embed new data, and more, see the MDE guide. (We recommend reading the getting started guide first.)

## What is an embedding?

An *embedding* of a finite set of items (such as biological cells, images,
words, nodes in a graph, or any other abstract object) is an assignment of each
item to a vector of fixed length; the original items
are *embedded* or mapped into a real vector space. The length of the vectors is
called the *embedding dimension*. An embedding is represented concretely by a
matrix, in which each row is the embedding vector of an item.
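As a minimal illustration (with made-up numbers), an embedding of four items into dimension 2 is just a 4 × 2 matrix:

```python
import numpy as np

# A hypothetical embedding of 4 items into 2 dimensions:
# row k is the embedding vector of item k.
embedding = np.array([
    [0.0, 1.0],   # item 0
    [0.1, 0.9],   # item 1
    [2.0, -1.0],  # item 2
    [2.1, -1.1],  # item 3
])

n_items, embedding_dim = embedding.shape  # 4 items, embedding dimension 2
```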

Embeddings provide concrete numerical representations of abstract items, for use in downstream computational tasks. For example, when the embedding dimension is 2 or 3, embeddings can be used to create a sort of chart or atlas of the items. In such a chart, each point corresponds to an item, and its coordinates in space are given by the embedding vector. These visualizations can help scientists and analysts identify patterns or anomalies in the original data, and more generally make it easier to explore large collections of data. PyMDE can embed into 2 or 3 dimensions, but it can also be used to embed into many more dimensions, which is useful when generating features for machine learning tasks.

For an embedding to be useful, it must be faithful to the original data (the items) in some way. To make it easy to get started, PyMDE provides two high-level functions for creating embeddings, based on related but different notions of faithfulness. These functions handle the common case in which each item is associated with either an original high-dimensional vector or a node in a graph. The functions are `pymde.preserve_neighbors` and `pymde.preserve_distances`.

The first creates embeddings that focus on the local structure of the data, putting similar items near each other and dissimilar items not near each other. The second focuses more on the global structure, choosing embedding vectors to respect some notion of original distance or dissimilarity between items.

We’ll see how to use these functions below.

## Visualizing data

When the embedding dimension is 2 (or 3), embeddings can be used to visualize large collections of items. These visualizations can sometimes lead to new insights into the data.

### Preserving neighbors

Let’s create an embedding that preserves the local structure of some data, using the `pymde.preserve_neighbors` function. This function is based on preserving the `k`-nearest neighbors of each original vector (where `k` is a parameter that by default is chosen on your behalf).
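To illustrate the idea of `k`-nearest neighbors (this is not PyMDE's actual implementation, just a brute-force sketch), here is how each vector's nearest neighbors could be found with NumPy:

```python
import numpy as np

def k_nearest_neighbors(data, k):
    """Brute-force k-NN: returns an (n, k) array of neighbor indices."""
    # Pairwise squared Euclidean distances between rows of data.
    sq_norms = (data ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2 * data @ data.T
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbors
    return np.argsort(dists, axis=1)[:, :k]

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(k_nearest_neighbors(data, k=1))  # -> [[1], [0], [3], [2]]
```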

We’ll use the MNIST dataset, which contains images of handwritten digits, as an example. The original items are the images, and each item (image) is represented by an original vector containing the pixel values.

```
import pymde
mnist = pymde.datasets.MNIST()
```

Next, we embed.

```
mde = pymde.preserve_neighbors(mnist.data, verbose=True)
embedding = mde.embed(verbose=True)
```

The first argument to `preserve_neighbors` is the data matrix: there are
70,000 images, each represented by a vector of length 784, so `mnist.data`
is a `torch.Tensor` of shape `(70000, 784)`. The optional keyword argument
`verbose=True` turns on helpful messages about what the function is
doing. The embedding dimension is 2 by default.

The function returns a `pymde.MDE` object, which can be thought of as
describing the kind of embedding we would like. To compute the embedding, we
call the `embed` method of the `mde` object. This returns a
`torch.Tensor` of shape `(70000, 2)`, in which `embedding[k]` is
the embedding vector assigned to the image `mnist.data[k]`.

We can visualize the embedding with a scatter plot. In the scatter plot, we’ll color each point by the digit represented by the underlying image.

```
pymde.plot(embedding, color_by=mnist.attributes['digits'])
```

We can see that similar images are near each other in the embedding, while dissimilar images are not.

It is also possible to embed into three or more dimensions. Here is an example with three dimensions.

```
mde = pymde.preserve_neighbors(mnist.data, embedding_dim=3, verbose=True)
embedding = mde.embed(verbose=True)
pymde.plot(embedding, color_by=mnist.attributes['digits'])
```

#### Customizing embeddings

The `pymde.preserve_neighbors` function takes a few keyword arguments
that can be used to customize the embedding. For example, you
can impose a `pymde.Standardized` constraint: this
causes the embedding to have uncorrelated columns, and prevents it from
spreading out too much.

```
embedding = pymde.preserve_neighbors(mnist.data, constraint=pymde.Standardized()).embed()
pymde.plot(embedding, color_by=mnist.attributes['digits'])
```

To learn about the other keyword arguments, read the tutorial on MDE, then consult the API documentation.

For more in-depth examples of creating neighborhood-based visualizations, including 3D embeddings, see the MNIST and single-cell genomics example notebooks.

#### Accessing the underlying graph

You can access the graph underlying the MDE problem returned by
`pymde.preserve_neighbors`, using the following code.

```
edges = mde.edges
weights = mde.distortion_function.weights
```

The value `weights[i]` is the weight for the edge `edges[i]`.
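The pairing between edges and weights can be illustrated with a hypothetical edge list (made-up indices and weights, mirroring the layout PyMDE uses):

```python
import numpy as np

# Hypothetical edge list and weights: edges[i] is a pair of item
# indices, and weights[i] is the weight attached to that edge.
edges = np.array([[0, 1], [0, 2], [1, 2]])
weights = np.array([1.0, 0.5, 2.0])

# Pair them up, e.g. to look up the weight of a given edge.
weight_of = {tuple(edge): w for edge, w in zip(edges.tolist(), weights.tolist())}
print(weight_of[(0, 2)])  # -> 0.5
```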

### Preserving distances

Next, we’ll create an embedding that roughly preserves the global structure of some original data, by preserving some known original distances between some pairs of items. We will embed the nodes of an unweighted graph. For the original distance between two nodes, we’ll use the length of the shortest path connecting them.

The specific graph we’ll use is an academic coauthorship graph, from Google Scholar: the nodes are authors (with h-index at least 50), and two authors have an edge between them if either has listed the other as a coauthor.

```
import pymde
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
google_scholar = pymde.datasets.google_scholar()
mde = pymde.preserve_distances(google_scholar.data, device=device, verbose=True)
embedding = mde.embed()
```

The `data` attribute of the `google_scholar` dataset is a
`pymde.Graph` object, which encodes the coauthorship network.
The `pymde.preserve_distances` function returns a `pymde.MDE`
object, and calling the `embed` method computes the embedding.

Notice that we passed in a `device` to `pymde.preserve_distances`;
this embedding approximately preserves over 80 million distances, so using a
GPU can speed things up.

Next we plot the embedding, coloring each point by how many coauthors the author has in the network (normalized to be a percentile).

```
pymde.plot(embedding, color_by=google_scholar.attributes['coauthors'])
```
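The `coauthors` attribute shipped with the dataset is already normalized to a percentile. As a hedged sketch with hypothetical counts, a percentile rank can be computed like this:

```python
import numpy as np

def to_percentile(values):
    """Map each value to its percentile rank (0 to 100) within the array."""
    values = np.asarray(values, dtype=float)
    ranks = values.argsort().argsort()  # 0-based rank of each entry
    return 100.0 * (ranks + 1) / len(values)

coauthor_counts = np.array([3, 10, 1, 25])  # hypothetical counts
print(to_percentile(coauthor_counts))  # -> [ 50.  75.  25. 100.]
```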

The most collaborative authors are near the center of the embedding, and less collaborative ones are on the fringe. It also turns out that the diameter of the embedding is close to the true diameter of the graph.

For a more in-depth study of this example, see the notebook on Google Scholar.

#### Customizing embeddings

The `pymde.preserve_distances` function takes a few keyword arguments
that can be used to customize the embedding.

To learn about the keyword arguments, read the tutorial on MDE, then consult the API documentation.

#### Accessing the underlying graph

You can access the graph underlying the MDE problem returned by
`pymde.preserve_distances`, using the following code.

```
edges = mde.edges
distances = mde.distortion_function.deviations
```

The value `distances[i]` is the weight (which should be interpreted as a
distance) for the edge `edges[i]`.

### Plotting

#### Scatter plots

The `pymde.plot` function can be used to plot embeddings with dimension
at most 3. It takes an embedding as the argument, as well as a number of optional
keyword arguments. For example, to plot an embedding and color each point
by some attribute, use:

```
pymde.plot(embedding, color_by=attribute)
```

The `attribute` variable is a NumPy array of length `embedding.shape[0]`,
in which `attribute[k]` is a tag or numerical value associated with item `k`.
For example, in the MNIST data, each entry in `attribute` is an `int`
between `0` and `9` representing the digit depicted in the image;
for single-cell data, each entry might be a string describing the type of
cell. Typically the attribute is not used to create the embedding, so coloring
by it is a sanity check that the embedding has preserved prior knowledge about
the original data.

This function can be configured with a number of keyword arguments, which can
be seen in the API documentation.

#### Movies

The `pymde.MDE.play` method can be used to create an animated GIF of the
embedding process. To create a GIF, first call `pymde.MDE.embed` with
the `snapshot_every` keyword argument, then call `play`:

```
mde.embed(snapshot_every=1)
mde.play(savepath='/path/to/file.gif')
```

The `snapshot_every=1` keyword argument instructs the `MDE` object to
take a snapshot of the embedding during every iteration of the solution
algorithm. The `play` method generates the GIF and saves it to `savepath`.

This method can be configured with a number of keyword arguments,
which can be seen in the API documentation.

## Generating feature vectors

The embeddings made via `pymde.preserve_neighbors` and
`pymde.preserve_distances` can be used as feature vectors for supervised
learning tasks. You can choose the dimension of the vectors by specifying the
`embedding_dim` keyword argument, e.g.,

```
embedding = pymde.preserve_neighbors(data, embedding_dim=50).embed()
```
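As a sketch of the downstream use (with synthetic stand-in features; any classifier would do, and a nearest-centroid rule is used here only for illustration), the rows of the embedding can be fed to a classifier as feature vectors:

```python
import numpy as np

# Synthetic stand-in for a 50-dimensional embedding, with class labels:
# two well-separated clusters of 20 items each.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(3, 1, (20, 50))])
labels = np.array([0] * 20 + [1] * 20)

# Nearest-centroid classifier: assign each point to the closest class mean.
centroids = np.stack([features[labels == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
predictions = dists.argmin(axis=1)
accuracy = (predictions == labels).mean()  # training accuracy on the toy data
```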

## Classical embeddings

PyMDE provides a few implementations of classical embeddings, for convenience.
To produce a PCA embedding of a data matrix, use the `pymde.pca`
function, which returns an embedding:

```
embedding = pymde.pca(data_matrix, embedding_dim)
```

To create a Laplacian embedding based on the nearest neighbors of each row in a
data matrix or each node in a graph, use the `pymde.laplacian_embedding`
function, which returns an MDE problem:

```
mde = pymde.laplacian_embedding(data, embedding_dim, verbose=True)
embedding = mde.embed()
```

To create a spectral embedding based on a sequence of edges (a `torch.Tensor`
of shape `(n_edges, 2)`) and weights, use `pymde.quadratic.spectral`.
(These embeddings are called “quadratic embeddings” in the MDE monograph.)
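To sketch what a spectral embedding computes, here is a NumPy version for a small unweighted cycle graph: the embedding vectors come from the bottom nonzero eigenvectors of the graph Laplacian. (This illustrates the idea, not PyMDE's implementation, which handles weights and much larger graphs.)

```python
import numpy as np

# Build the Laplacian L = D - A of a cycle graph on 6 nodes.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Eigenvectors of L, sorted by eigenvalue; skip the constant eigenvector
# (eigenvalue 0) and keep the next `embedding_dim` as the embedding.
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding_dim = 2
embedding = eigenvectors[:, 1:1 + embedding_dim]
print(embedding.shape)  # (6, 2): one 2-dimensional vector per node
```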

## Drawing graphs

PyMDE can be used to draw graphs in 2 or 3 dimensions. Here is a very simple example that draws a cycle graph on 3 nodes.

```
import pymde
import torch

edges = torch.tensor([
    [0, 1],
    [0, 2],
    [1, 2],
])
triangle = pymde.Graph.from_edges(edges)
triangle.draw()
```

Here is a more interesting example, which embeds a ternary tree. The tree is created using the NetworkX package.

```
import networkx
import pymde

ternary_tree = networkx.balanced_tree(3, 6)
graph = pymde.Graph(networkx.adjacency_matrix(ternary_tree))
embedding = graph.draw()
```

On a standard CPU, it takes PyMDE just 2 seconds to compute this layout; for comparison, it takes NetworkX 30 seconds to compute a similar layout.

You can embed into 3 dimensions by passing `embedding_dim=3` to the `draw` method.

For more in-depth examples, see the notebook on drawing graphs,
and the API documentation of `pymde.Graph`.

## Using a GPU

If you have a CUDA-enabled GPU, you can use it to speed up the optimization routine that computes the embedding.

The functions `pymde.preserve_neighbors` and
`pymde.preserve_distances`, as well as the method `Graph.draw`, all take a
keyword argument called `device`, which controls whether or not a GPU is
used. Pass `device='cuda'` to use your GPU. (PyMDE computes embeddings on the CPU
by default.)

For example, the code below shows how to create a neighbor-preserving embedding of MNIST using a GPU.

```
import pymde
mnist = pymde.datasets.MNIST()
mde = pymde.preserve_neighbors(mnist.data, device='cuda', verbose=True)
embedding = mde.embed(verbose=True)
```

On an NVIDIA GeForce GTX 1070, the `embed` method took just 5 seconds.

## Reproducibility

PyMDE’s optimization algorithm does not rely on randomness. However,
some functions, such as `pymde.preserve_neighbors`, may use random sampling
when generating the edges for an MDE problem. This means that if you call
`pymde.preserve_neighbors` multiple times on the same data set, you might
obtain slightly different MDE problems, and therefore slightly different
embeddings. To prevent this, you can set the random seed via
the `pymde.seed` function.

For example, the following code block will always create the same embedding.

```
import pymde
mnist = pymde.datasets.MNIST()
pymde.seed(0)
mde = pymde.preserve_neighbors(mnist.data, verbose=True)
embedding = mde.embed(verbose=True)
```