Preprocessing#
This part of the tutorial discusses ways to preprocess your original data in order to create the set of edges, and their associated weights or original deviations, needed to create an MDE problem. These preprocessing methods are analogous to feature engineering in machine learning, where raw data are converted to feature vectors. As in feature engineering, the preprocessing can have a strong effect on the final result, i.e., the embedding.
Preprocessing methods can be grouped into two types: those that
are based on neighbors of the items, and those that are based on
distances between items. The high-level functions
pymde.preserve_neighbors
and pymde.preserve_distances
(which were used in the Getting Started guide) use neighbor-based and
distance-based preprocessing behind-the-scenes, in order to create reasonable
MDE problems. But you can just as well preprocess the data yourself, to
create custom embeddings.
PyMDE provides a few preprocessing methods for original data that come as a data matrix (one row for each item) or a graph (one node for each item).
Graph#
The preprocessing methods often work with or return pymde.Graph
instances,
which package up a list of edges and their associated weights.
A graph can be created from a weighted adjacency matrix (a scipy.sparse matrix, numpy array, or torch Tensor), or a torch Tensor containing the edges and weights. For example
adjacency_matrix = sp.csr_matrix(np.array([
[0, 1, 1],
[1, 0, 1],
[1, 1, 0],
]))
graph = pymde.Graph(adjacency_matrix)
or
edges = torch.tensor([
[0, 1],
[0, 2],
[1, 2],
])
weights = torch.ones(edges.shape[0])
graph = pymde.Graph.from_edges(edges, weights)
Given a graph, you can access the edges, weights, and adjacency matrix with
graph.edges
graph.weights
graph.adjacency_matrix
The API documentation describes the pymde.Graph
class
in more detail.
Neighbor-based preprocessing#
Similar items#
A neighbor of an item can be thought of an item that is in some sense
similar to it. One class of preprocessing methods computes some neighbors
of each item, and uses the pairs of neighbors as edges, associating these
edges with positive weights. These weights can then be used to create
a distortion function based on weights, using one of the classes
in pymde.penalties
. A common preprocessing step is to compute the
k
-nearest neighbors of each item, where k
is a parameter.
knn_graph = pymde.preprocess.k_nearest_neighbors(data, k=10)
Preprocessing based on neighbors can be thought of as a “sparsifying”
operation: they take data and return a sparse graph (the knn_graph
).
- pymde.preprocess.k_nearest_neighbors(data, k, max_distance=None, verbose=False)
Compute k-nearest neighbors, given data matrix or graph.
This function computes a k-nearest neighbor graph, given either a data matrix or an original graph.
When the input is a data matrix, each row is treated as the feature vector of an item, and the Euclidean distance is used to judge how close two items are to each other.
When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.
In the returned graph, if i is a neighbor of j and j a neighbor of i, then the weight w_{ij} will be +2; if only one is a neighbor of the other, then w_{ij} will be +1; if neither are neighbors of each other, then (i, j) will not be in the graph.
- Parameters:
data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape
(n_items, n_features)
, or apymde.Graph
k (int) – The number of nearest neighbors per item
max_distance (float (optional)) – If not
None
, neighborhoods are restricted to have a radius no greater than max_distance.verbose (bool) – If
True
, print verbose output.
- Returns:
The k-nearest neighbor graph. Access the weights with
graph.weights
, and the edges withgraph.edges
- Return type:
Distance-based preprocessing#
Another preprocessing step is to compute some distances between pairs of
items. We can use the pymde.preprocess.distances
function to compute
distances between pairs of items.
distance_graph = pymde.preprocess.distances(data)
When the original input is a graph, this function can be thought of as a “densifying” operation, since it returns a denser graph. You can control how dense the returned graph is with a keyword argument.
- pymde.preprocess.distances(data, retain_fraction=1.0, verbose=False)
Compute distances, given data matrix or graph.
This function computes distances between some pairs of items, given either a data matrix or a
pymde.Graph
instance.When the input is a data matrix, each row is treated as the feature vector of an item.
When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.
The
retain_fraction
keyword argument can be used to compute only a fraction of the distances. This can be useful when there are many items, in which case storing all the distances may be intractable.- Parameters:
data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape
(n_items, n_features)
, or apymde.Graph
retain_fraction (float, optional) – A float between 0 and 1, specifying the fraction of all
(n_items choose 2)
to compute. For example, ifretain_fraction
is 0.1, only 10 percent of the edges will be stored.verbose – If
True
, print verbose output.
- Returns:
A graph object holding the distances and corresponding edges. Access the distances with
graph.distances
, and the edges withgraph.edges
.- Return type: