# Preprocessing

This part of the tutorial discusses ways to preprocess your original data in order to create the set of edges, and their associated weights or original deviations, needed to create an MDE problem. These preprocessing methods are analogous to feature engineering in machine learning, where raw data are converted to feature vectors. As in feature engineering, the preprocessing can have a strong effect on the final result, i.e., the embedding.

Preprocessing methods can be grouped into two types: those that are based on neighbors of the items, and those that are based on distances between items. The high-level functions `pymde.preserve_neighbors` and `pymde.preserve_distances` (which were used in the Getting Started guide) use neighbor-based and distance-based preprocessing behind-the-scenes, in order to create reasonable MDE problems. But you can just as well preprocess the data yourself, to create custom embeddings.

PyMDE provides a few preprocessing methods for original data that come as a data matrix (one row for each item) or a graph (one node for each item).

## Graph

The preprocessing methods often work with or return `pymde.Graph` instances, which package up a list of edges and their associated weights.

A graph can be created from a weighted adjacency matrix (a scipy.sparse matrix, numpy array, or torch Tensor), or a torch Tensor containing the edges and weights. For example

```adjacency_matrix = sp.csr_matrix(np.array([
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
]))
```

or

```edges = torch.tensor([
[0, 1],
[0, 2],
[1, 2],
])
weights = torch.ones(edges.shape)
graph = pymde.Graph.from_edges(edges, weights)
```

Given a graph, you can access the edges, weights, and adjacency matrix with

```graph.edges
graph.weights
```

The API documentation describes the `pymde.Graph` class in more detail.

## Neighbor-based preprocessing

### Similar items

A neighbor of an item can be thought of an item that is in some sense similar to it. One class of preprocessing methods computes some neighbors of each item, and uses the pairs of neighbors as edges, associating these edges with positive weights. These weights can then be used to create a distortion function based on weights, using one of the classes in `pymde.penalties`. A common preprocessing step is to compute the `k`-nearest neighbors of each item, where `k` is a parameter.

```knn_graph = pymde.preprocess.k_nearest_neighbors(data, k=10)
```

Preprocessing based on neighbors can be thought of as a “sparsifying” operation: they take data and return a sparse graph (the `knn_graph`).

pymde.preprocess.k_nearest_neighbors(data, k, max_distance=None, verbose=False)

Compute k-nearest neighbors, given data matrix or graph.

This function computes a k-nearest neighbor graph, given either a data matrix or an original graph.

When the input is a data matrix, each row is treated as the feature vector of an item, and the Euclidean distance is used to judge how close two items are to each other.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

In the returned graph, if i is a neighbor of j and j a neighbor of i, then the weight w_{ij} will be +2; if only one is a neighbor of the other, then w_{ij} will be +1; if neither are neighbors of each other, then (i, j) will not be in the graph.

Parameters
• data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape `(n_items, n_features)`, or a `pymde.Graph`

• k (int) – The number of nearest neighbors per item

• max_distance (float (optional)) – If not `None`, neighborhoods are restricted to have a radius no greater than max_distance.

• verbose (bool) – If `True`, print verbose output.

Returns

The k-nearest neighbor graph. Access the weights with `graph.weights`, and the edges with `graph.edges`

Return type

pymde.Graph

## Distance-based preprocessing

Another preprocessing step is to compute some distances between pairs of items. We can use the `pymde.preprocess.distances` function to compute distances between pairs of items.

```distance_graph = pymde.preprocess.distances(data)
```

When the original input is a graph, this function can be thought of as a “densifying” operation, since it returns a denser graph. You can control how dense the returned graph is with a keyword argument.

pymde.preprocess.distances(data, retain_fraction=1.0, verbose=False)

Compute distances, given data matrix or graph.

This function computes distances between some pairs of items, given either a data matrix or a `pymde.Graph` instance.

When the input is a data matrix, each row is treated as the feature vector of an item.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

The `retain_fraction` keyword argument can be used to compute only a fraction of the distances. This can be useful when there are many items, in which case storing all the distances may be intractable.

Parameters
• data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape `(n_items, n_features)`, or a `pymde.Graph`

• retain_fraction (float, optional) – A float between 0 and 1, specifying the fraction of all ```(n_items choose 2)``` to compute. For example, if `retain_fraction` is 0.1, only 10 percent of the edges will be stored.

• verbose – If `True`, print verbose output.

Returns

A graph object holding the distances and corresponding edges. Access the distances with `graph.distances`, and the edges with `graph.edges`.

Return type

pymde.Graph