Preprocessing

This part of the tutorial discusses ways to preprocess your original data in order to create the set of edges, and their associated weights or original deviations, needed to create an MDE problem. These preprocessing methods are analogous to feature engineering in machine learning, where raw data are converted to feature vectors. As in feature engineering, the preprocessing can have a strong effect on the final result, i.e., the embedding.

Preprocessing methods can be grouped into two types: those that are based on neighbors of the items, and those that are based on distances between items. The high-level functions pymde.preserve_neighbors and pymde.preserve_distances (which were used in the Getting Started guide) use neighbor-based and distance-based preprocessing behind-the-scenes, in order to create reasonable MDE problems. But you can just as well preprocess the data yourself, to create custom embeddings.

PyMDE provides a few preprocessing methods for original data that come as a data matrix (one row for each item) or a graph (one node for each item).

Graph

The preprocessing methods often work with or return pymde.Graph instances, which package up a list of edges and their associated weights.

A graph can be created from a weighted adjacency matrix (a scipy.sparse matrix, numpy array, or torch Tensor), or a torch Tensor containing the edges and weights. For example

adjacency_matrix = sp.csr_matrix(np.array([
   [0, 1, 1],
   [1, 0, 1],
   [1, 1, 0],
]))
graph = pymde.Graph(adjacency_matrix)

or

edges = torch.tensor([
     [0, 1],
     [0, 2],
     [1, 2],
])
weights = torch.ones(edges.shape[0])
graph = pymde.Graph.from_edges(edges, weights)

Given a graph, you can access the edges, weights, and adjacency matrix with

graph.edges
graph.weights
graph.adjacency_matrix

The API documentation describes the pymde.Graph class in more detail.

Neighbor-based preprocessing

Similar items

A neighbor of an item can be thought of an item that is in some sense similar to it. One class of preprocessing methods computes some neighbors of each item, and uses the pairs of neighbors as edges, associating these edges with positive weights. These weights can then be used to create a distortion function based on weights, using one of the classes in pymde.penalties. A common preprocessing step is to compute the k-nearest neighbors of each item, where k is a parameter.

knn_graph = pymde.preprocess.k_nearest_neighbors(data, k=10)

Preprocessing based on neighbors can be thought of as a “sparsifying” operation: they take data and return a sparse graph (the knn_graph). For dense data, neighbor computation uses a custom exact k-nearest neighbor algorithm implemented in Rust, accelerated with Rayon and BLAS.

Sparse data. When the data matrix is a scipy.sparse matrix, k_nearest_neighbors will use pynndescent for neighbor computation if it is installed, avoiding conversion to a dense matrix. If pynndescent is not installed, the sparse data will be converted to dense before computing neighbors. For very large sparse matrices the conversion can be memory-intensive. Install it with pip install pynndescent.

pymde.preprocess.k_nearest_neighbors(data, k, max_distance=None, knn_method=None, verbose=False)

Compute k-nearest neighbors, given data matrix or graph.

This function computes a k-nearest neighbor graph, given either a data matrix or an original graph.

When the input is a data matrix, each row is treated as the feature vector of an item, and the Euclidean distance is used to judge how close two items are to each other.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

In the returned graph, if i is a neighbor of j and j a neighbor of i, then the weight w_{ij} will be +2; if only one is a neighbor of the other, then w_{ij} will be +1; if neither are neighbors of each other, then (i, j) will not be in the graph.

Parameters:
  • data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape (n_items, n_features), or a pymde.Graph

  • k (int) – The number of nearest neighbors per item

  • max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than max_distance.

  • knn_method (str (optional)) – Which algorithm to use for k-nearest neighbor computation. 'exact' uses brute-force exact search, 'approximate' uses NN-Descent. If None (the default), the method is chosen automatically based on dataset size. Only applies to data matrices, not graphs.

  • verbose (bool) – If True, print verbose output.

Returns:

The k-nearest neighbor graph. Access the weights with graph.weights, and the edges with graph.edges

Return type:

pymde.Graph

Distance-based preprocessing

Another preprocessing step is to compute some distances between pairs of items. We can use the pymde.preprocess.distances function to compute distances between pairs of items.

distance_graph = pymde.preprocess.distances(data)

When the original input is a graph, this function can be thought of as a “densifying” operation, since it returns a denser graph. You can control how dense the returned graph is with a keyword argument.

pymde.preprocess.distances(data, retain_fraction=1.0, verbose=False)

Compute distances, given data matrix or graph.

This function computes distances between some pairs of items, given either a data matrix or a pymde.Graph instance.

When the input is a data matrix, each row is treated as the feature vector of an item.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

The retain_fraction keyword argument can be used to compute only a fraction of the distances. This can be useful when there are many items, in which case storing all the distances may be intractable.

Parameters:
  • data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape (n_items, n_features), or a pymde.Graph

  • retain_fraction (float, optional) – A float between 0 and 1, specifying the fraction of all (n_items choose 2) to compute. For example, if retain_fraction is 0.1, only 10 percent of the edges will be stored.

  • verbose – If True, print verbose output.

Returns:

A graph object holding the distances and corresponding edges. Access the distances with graph.distances, and the edges with graph.edges.

Return type:

pymde.Graph