API documentation#

Note

PyMDE is very young software. When updates are released, we will try to be backward compatible with earlier versions, but sometimes we may be unable to do so.

If a new version introduces a breaking change, its minor version number will be greater than the previous version’s (e.g., v0.2.0 may include changes that are incompatible with v0.1.5, but v0.1.5 will be fully compatible with v0.1.0).

MDE#

class pymde.MDE(n_items: int, embedding_dim: int, edges: Tensor, distortion_function: Callable | StochasticFunction, constraint: Constraint | None = None, device: str | None = None)#

An MDE problem.

An MDE instance represents a specific MDE problem, specified by the number of items, the embedding dimension, a list of edges, the vector distortion function, and possibly a constraint.

n_items#

The number of items.

Type:

int

embedding_dim#

The embedding dimension.

Type:

int

edges#

The list of edges.

Type:

torch.Tensor

distortion_function#

The vector distortion function.

Type:

Callable

constraint#

The constraint imposed, or None.

Type:

pymde.constraints.Constraint

device#

The device on which to compute the embedding and store data (such as ‘cpu’ or ‘cuda’).

solve_stats#

Summary statistics about the embedding, populated after calling the embed method.

Public Methods:

__init__(n_items, embedding_dim, edges, ...)

Constructs an MDE problem.

to(device)

Move MDE instance to another device.

__str__()

Return str(self).

differences(X)

Compute X[i] - X[j] for each row (i, j) in edges.

distances([X])

Compute embedding distances.

distortions([X])

Compute distortions.

average_distortion([X])

Compute average distortion.

high_distortion_pairs([X])

Compute distortions, sorted from high to low.

embed([X, eps, max_iter, memory_size, ...])

Compute an embedding.

forward([X, eps, max_iter, memory_size, ...])

Compute an embedding.

plot([color_by, color_map, colors, edges, ...])

Plot an embedding, in one, two, or three dimensions.

distortions_cdf([figsize_inches])

Plot a cumulative distribution function of the distortions.

play([color_by, color_map, colors, edges, ...])

Create a movie visualizing how the embedding was formed.

Inherited from Module

__init__(*args, **kwargs)

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(*input)

Defines the computation performed at every call.

register_buffer(name, tensor[, persistent])

Adds a buffer to the module.

register_parameter(name, param)

Adds a parameter to the module.

add_module(name, module)

Adds a child module to the current module.

register_module(name, module)

Alias for add_module().

get_submodule(target)

Returns the submodule given by target if it exists, otherwise throws an error.

get_parameter(target)

Returns the parameter given by target if it exists, otherwise throws an error.

get_buffer(target)

Returns the buffer given by target if it exists, otherwise throws an error.

get_extra_state()

Returns any extra state to include in the module's state_dict.

set_extra_state(state)

This function is called from load_state_dict() to handle any extra state found within the state_dict.

apply(fn)

Applies fn recursively to every submodule (as returned by .children()) as well as self.

cuda([device])

Moves all model parameters and buffers to the GPU.

ipu([device])

Moves all model parameters and buffers to the IPU.

xpu([device])

Moves all model parameters and buffers to the XPU.

cpu()

Moves all model parameters and buffers to the CPU.

type(dst_type)

Casts all parameters and buffers to dst_type.

float()

Casts all floating point parameters and buffers to float datatype.

double()

Casts all floating point parameters and buffers to double datatype.

half()

Casts all floating point parameters and buffers to half datatype.

bfloat16()

Casts all floating point parameters and buffers to bfloat16 datatype.

to_empty(*, device)

Moves the parameters and buffers to the specified device without copying storage.

to()

Moves and/or casts the parameters and buffers.

register_full_backward_pre_hook(hook[, prepend])

Registers a backward pre-hook on the module.

register_backward_hook(hook)

Registers a backward hook on the module.

register_full_backward_hook(hook[, prepend])

Registers a backward hook on the module.

register_forward_pre_hook(hook, *[, ...])

Registers a forward pre-hook on the module.

register_forward_hook(hook, *[, prepend, ...])

Registers a forward hook on the module.

__call__(*args, **kwargs)

Call self as a function.

__setstate__(state)

__getattr__(name)

__setattr__(name, value)

Implement setattr(self, name, value).

__delattr__(name)

Implement delattr(self, name).

register_state_dict_pre_hook(hook)

These hooks will be called with arguments: self, prefix, and keep_vars before calling state_dict on self.

state_dict()

Returns a dictionary containing references to the whole state of the module.

register_load_state_dict_post_hook(hook)

Registers a post hook to be run after module's load_state_dict is called.

load_state_dict(state_dict[, strict])

Copies parameters and buffers from state_dict into this module and its descendants.

parameters([recurse])

Returns an iterator over module parameters.

named_parameters([prefix, recurse, ...])

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

buffers([recurse])

Returns an iterator over module buffers.

named_buffers([prefix, recurse, ...])

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

children()

Returns an iterator over immediate children modules.

named_children()

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

modules()

Returns an iterator over all modules in the network.

named_modules([memo, prefix, remove_duplicate])

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.

train([mode])

Sets the module in training mode.

eval()

Sets the module in evaluation mode.

requires_grad_([requires_grad])

Change if autograd should record operations on parameters in this module.

zero_grad([set_to_none])

Sets gradients of all model parameters to zero.

share_memory()

See torch.Tensor.share_memory_()

extra_repr()

Set the extra representation of the module

__repr__()

Return repr(self).

__dir__()

Default dir() implementation.


__init__(n_items: int, embedding_dim: int, edges: Tensor, distortion_function: Callable | StochasticFunction, constraint: Constraint | None = None, device: str | None = None)#

Constructs an MDE problem.

Parameters:
  • n_items (int) – Number of things being embedded.

  • embedding_dim (int) – Embedding dimension.

  • edges (torch.Tensor(shape=(num_edges, 2), dtype=torch.int)) – Tensor, where each row is an edge (i, j) between two items; each edge should satisfy 0 <= i < j < n_items. In particular self-edges are not allowed.

  • distortion_function (Callable or pymde.functions.StochasticFunction) – The vectorized distortion function, typically an instance of a class from pymde.penalties or pymde.losses; however, this can be any Python callable that maps a torch.Tensor of embedding distances to a torch.Tensor of distortions.

  • constraint (pymde.constraints.Constraint, optional) – A Constraint object, such as pymde.Standardized(). Defaults to an unconstrained (centered) embedding.

  • device (str, optional) – Name of the device on which to store tensors and compute the embedding, such as ‘cpu’ or ‘cuda’ (for a GPU). By default, the device is inferred from edges and distortion_function.
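
For example, a small MDE problem can be constructed by hand (a minimal sketch; the edges and weights below are illustrative, not taken from the library's documentation):

import torch
import pymde

# Three items, fully connected by three (illustrative) edges.
edges = torch.tensor([[0, 1], [0, 2], [1, 2]])
weights = torch.ones(edges.shape[0])

mde = pymde.MDE(
    n_items=3,
    embedding_dim=2,
    edges=edges,
    distortion_function=pymde.penalties.Quadratic(weights),
    constraint=pymde.Standardized(),
)
embedding = mde.embed()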

average_distortion(X=None)#

Compute average distortion.

This method computes the average distortion of an embedding.

If X is None, this method computes the average distortion for the embedding computed by the last call to embed.

Parameters:

X (torch.Tensor(shape=(n_items, embedding_dim)), optional) – Embedding.

Returns:

The average distortion.

Return type:

scalar (float)

differences(X)#

Compute X[i] - X[j] for each row (i, j) in edges.

distances(X=None)#

Compute embedding distances.

This function computes the embedding distances corresponding to the list of edges (self.edges).

If X is None, computes distances for the embedding computed by the last call to embed.

Parameters:

X (torch.Tensor(shape=(n_items, embedding_dim)), optional) – Embedding.

Returns:

Vector of embedding distances. The k-th entry is the distance corresponding to the k-th edge (self.edges[k]).

Return type:

torch.Tensor(shape=(n_edges,))

distortions(X=None)#

Compute distortions.

This function computes the distortions for an embedding.

If X is None, computes distortions for the embedding computed by the last call to embed.

Parameters:

X (torch.Tensor(shape=(n_items, embedding_dim)), optional) – Embedding.

Returns:

Vector of distortions. The k-th entry is the distortion for the k-th edge (self.edges[k]).

Return type:

torch.Tensor(shape=(n_edges,))

distortions_cdf(figsize_inches=(8, 3))#

Plot a cumulative distribution function of the distortions.

The embed method must be called sometime before calling this method.

embed(X=None, eps=1e-05, max_iter=300, memory_size=10, verbose=False, print_every=None, snapshot_every=None)#

Compute an embedding.

This method stores the embedding in the X attribute of the problem instance (mde.X). Summary statistics related to the fitting process are stored in solve_stats (mde.solve_stats).

All arguments have sensible default values, so in most cases it suffices to type mde.embed() or mde.embed(verbose=True).

Parameters:
  • X (torch.Tensor, optional) – Initial iterate, of shape (n_items, embedding_dim). When None, the initial iterate is chosen randomly (and projected onto the constraints); otherwise, the initial iterate should satisfy the constraints.

  • eps (float) – Residual norm threshold; quit when the residual norm is smaller than eps.

  • max_iter (int) – Maximum number of iterations.

  • memory_size (int) – The quasi-Newton memory. Larger values may lead to more stable behavior, but will increase the amount of time each iteration takes.

  • verbose (bool) – Whether to print verbose output.

  • print_every (int, optional) – Print verbose output every print_every iterations.

  • snapshot_every (int, optional) – Snapshot embedding every snapshot_every iterations; snapshots saved as CPU tensors to self.solve_stats.snapshots. If you want to generate an animation with the play method after embedding, set snapshot_every to a positive integer (like 1 or 5).

Returns:

The embedding, of shape (n_items, embedding_dim).

Return type:

torch.Tensor
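
For example, given a constructed problem mde (a sketch; the keyword values are illustrative):

# Solve, saving a snapshot of the embedding every 5 iterations.
X = mde.embed(verbose=True, snapshot_every=5)
# Summary statistics are stored in mde.solve_stats.
print(mde.solve_stats.solve_time)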

high_distortion_pairs(X=None)#

Compute distortions, sorted from high to low.

Computes the distortions for an embedding, sorting them from high to low. Returns the edges and distortions in the sorted order.

This function can be used to debug an embedding or to search for outliers in the data. In particular, it can be instructive to manually examine the items that incurred the highest distortion.

For example:

edges, distortions = mde.high_distortion_pairs()
maybe_outliers = edges[:10]

You can then examine the items corresponding to the edges in maybe_outliers.

Parameters:

X (torch.Tensor(shape=(n_items, embedding_dim)), optional) – Embedding.

Returns:

  • torch.Tensor(shape=(n_edges, 2)) – edges, sorted from highest distortion to lowest

  • torch.Tensor(shape=(n_edges,)) – distortions, sorted from highest to lowest

play(color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), fps=None, tmpdir=None, savepath=None)#

Create a movie visualizing how the embedding was formed.

This method creates a GIF showing how the embedding was formed, starting with the initial iterate and ending with the final embedding. In other words, it visualizes how the embedding algorithm (the embed method) updated the embedding’s state over time. (The embedding dimension should be at most 3.)

In each frame, the embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.

The embed method must be called sometime before calling this method, with a non-None value for the snapshot_every keyword argument.

If you want to save the GIF (instead of just viewing it in a Jupyter notebook), make sure to supply a path via the savepath keyword argument.

Parameters:
  • color_by (np.ndarray(shape=mde.n_items), optional) –

    A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if n_items is 4,

    categorical_values = np.array(['dog', 'cat', 'zebra', 'cat'])
    other_categorical_values = np.array([0, 1, 1, 2])
    continuous_values = np.array([0.1, 0.5, 0.31, 0.99])
    

    are all acceptable values for this argument. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.

  • color_map (str or matplotlib colormap instance) – Color map to use when resolving color_by to colors; ignored when color_by is None.

  • colors (np.ndarray(shape=(mde.n_items, 4)), optional) –

    A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.

    Only one of color_by and colors should be non-None.

  • edges ({torch.Tensor/np.ndarray}(shape=(any, 2)), optional) – List of edges to superimpose over the scatter plot.

  • axis_limits (tuple(length=2), optional) – tuple (limit_low, limit_high) of axis limits, applied to both the x and y axis.

  • background_color (str, optional) – color of background

  • marker_size (float, optional) – size of each point in the scatter plot

  • figsize_inches (tuple(length=2)) – size of figures in inches: (width_inches, height_inches)

  • fps (float, optional) – the number of frames per second at which the movie should be shown

  • tmpdir (str, optional) – path to directory, where individual images comprising the GIF will be stored

  • savepath (str, optional) – path at which to save the GIF.
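
For example (a sketch; as noted above, embed must first be called with a non-None snapshot_every):

mde.embed(snapshot_every=1)
mde.play(savepath='movie.gif')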

plot(color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), savepath=None)#

Plot an embedding, in one, two, or three dimensions.

This method plots embeddings, with embedding dimension at most 3.

The embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.

The embed method must be called sometime before calling this method.

Parameters:
  • color_by (np.ndarray(shape=mde.n_items), optional) –

    A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if n_items is 4,

    np.array(['dog', 'cat', 'zebra', 'cat'])
    np.array([0, 1, 1, 2])
    np.array([0.1, 0.5, 0.31, 0.99])
    

    are all acceptable. The first two are treated as categorical, the third is continuous. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.

  • color_map (str or matplotlib colormap instance) – Color map to use when resolving color_by to colors; ignored when color_by is None.

  • colors (np.ndarray(shape=(mde.n_items, 4)), optional) –

    A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.

    Only one of color_by and colors should be non-None.

  • edges ({torch.Tensor/np.ndarray}(shape=(any, 2)), optional) – List of edges to superimpose over the scatter plot.

  • axis_limits (tuple(length=2), optional) – tuple (limit_low, limit_high) of axis limits, applied to both the x and y axis.

  • background_color (str, optional) – color of background

  • marker_size (float, optional) – size of each point in the scatter plot

  • figsize_inches (tuple(length=2)) – size of figures in inches: (width_inches, height_inches)

  • savepath (str, optional) – path to save the plot.

Returns:

Axis on which the embedding is plotted.

Return type:

matplotlib axis

to(device)#

Move MDE instance to another device.

class pymde.optim.SolveStats(average_distortions, residual_norms, step_size_percents, solve_time, times, snapshots, snapshot_every)#

Summary statistics for a solve.

average_distortions#

The average distortion at each iteration.

Type:

sequence

residual_norms#

The residual norm at each iteration.

Type:

sequence

step_size_percents#

The relative size of each step.

Type:

sequence

solve_time#

The time it took to create the embedding, in seconds.

Type:

float

snapshots#

Snapshots of the embedding.

Type:

sequence

snapshot_every#

The number of iterations between snapshots.

Type:

int

Preserve neighbors#

pymde.preserve_neighbors(data, embedding_dim=2, attractive_penalty=<class 'pymde.functions.penalties.Log1p'>, repulsive_penalty=<class 'pymde.functions.penalties.Log'>, constraint=None, n_neighbors=None, repulsive_fraction=None, max_distance=None, init='quadratic', device='cpu', verbose=False) → MDE#

Construct an MDE problem designed to preserve local structure.

This function constructs an MDE problem for preserving the local structure of original data. This MDE problem is well-suited for visualization (using embedding_dim 2 or 3), but can also be used to generate features for machine learning tasks (with embedding_dim = 10, 50, or 100, for example). It yields embeddings in which similar items are near each other, and dissimilar items are not near each other.

The original data can either be a data matrix, or a graph. Data matrices should be torch Tensors, NumPy arrays, or scipy sparse matrices; graphs should be instances of pymde.Graph.

The MDE problem uses distortion functions derived from weights (i.e., penalties).

To obtain an embedding, call the embed method on the returned MDE object. To plot it, use pymde.plot.

embedding = pymde.preserve_neighbors(data).embed()
pymde.plot(embedding)

Parameters:
  • data ({torch.Tensor, numpy.ndarray, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape (n_items, n_features) or a graph. Neighbors are computed using Euclidean distance if the data is a matrix, or the shortest-path metric if the data is a graph.

  • embedding_dim (int) – The embedding dimension. Use 2 or 3 for visualization.

  • attractive_penalty (pymde.Function class (or factory)) – Callable that constructs a distortion function, given positive weights. Typically one of the classes from pymde.penalties, such as pymde.penalties.Log1p, pymde.penalties.Huber, or pymde.penalties.Quadratic.

  • repulsive_penalty (pymde.Function class (or factory)) – Callable that constructs a distortion function, given negative weights. (If None, only positive weights are used.) For example, pymde.penalties.Log or pymde.penalties.InversePower.

  • constraint (pymde.constraints.Constraint (optional)) – Embedding constraint, like pymde.Standardized() or pymde.Anchored(anchors, values) (or None). Defaults to no constraint when a repulsive penalty is provided, otherwise defaults to pymde.Standardized().

  • n_neighbors (int (optional)) – The number of nearest neighbors to compute for each row (item) of data. A sensible value is chosen by default, depending on the number of items.

  • repulsive_fraction (float (optional)) – How many repulsive edges to include, relative to the number of attractive edges. 1 means as many repulsive edges as attractive edges. The higher this number, the more uniformly spread out the embedding will be. Defaults to 0.5 for standardized embeddings, and 1 otherwise. (If repulsive_penalty is None, this argument is ignored.)

  • max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than max_distance.

  • init (str) – Initialization strategy; ‘quadratic’ or ‘random’. If the quadratic initialization takes too much time, try a random initialization. When device==’cuda’ and init==’quadratic’, a torch-implemented version is used, which runs fully on the device but uses a CG-based technique (versus Lanczos).

  • device (str (optional)) – Device for the embedding (e.g., ‘cpu’, ‘cuda’).

  • verbose (bool) – If True, print verbose output.

Returns:

A pymde.MDE object, based on the original data.

Return type:

pymde.MDE
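
For example, to customize the penalties (a sketch; data is a placeholder for your data matrix or graph):

import pymde

mde = pymde.preserve_neighbors(
    data,
    embedding_dim=2,
    attractive_penalty=pymde.penalties.Log1p,
    repulsive_penalty=pymde.penalties.Log,
)
embedding = mde.embed(verbose=True)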

Preserve distances#

pymde.preserve_distances(data, embedding_dim=2, loss=<class 'pymde.functions.losses.Absolute'>, constraint=None, max_distances=50000000.0, device='cpu', verbose=False) → MDE#

Construct an MDE problem based on original distances.

This function constructs an MDE problem for preserving pairwise distances between items. This can be useful for preserving the global structure of the data.

The data can be specified with either a data matrix (a NumPy array, torch Tensor, or sparse matrix), or a pymde.Graph instance encoding the distances:

  • A NumPy array, torch tensor, or sparse matrix is interpreted as a collection of feature vectors: each row gives the feature vector for an item. The original distances are the Euclidean distances between the feature vectors.

  • A pymde.Graph instance is interpreted as encoding all (n_items choose 2) distances: the distance between i and j is taken to be the length of the shortest path connecting i and j.

When the number of items n_items is large, the total number of pairs will be very large. When this happens, instead of computing all pairs of distances, this function will sample a subset uniformly at random. The maximum number of distances to compute is specified by the parameter max_distances. Depending on how many items you have (and how much memory your machine has), you may need to adjust this parameter.

To obtain an embedding, call the embed method on the returned object. To plot it, use pymde.plot.

For example:

embedding = pymde.preserve_distances(data).embed()
pymde.plot(embedding)

Parameters:
  • data ({np.ndarray, torch.Tensor, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape (n_items, n_features) or a graph.

  • embedding_dim (int) – The embedding dimension.

  • loss (pymde.Function class (or factory)) – Callable that constructs a distortion function, given original distances. Typically one of the classes defined in pymde.losses, such as pymde.losses.Absolute, or pymde.losses.WeightedQuadratic.

  • constraint (pymde.constraints.Constraint (optional)) – Embedding constraint, such as pymde.Standardized() or pymde.Anchored(anchors, values) (or None). Defaults to no constraint. Note: when the constraint is pymde.Standardized(), the original distances will be scaled by a constant (because the standardization constraint puts a limit on how large any one distance can be).

  • max_distances (int) – Maximum number of distances to compute.

  • device (str (optional)) – Device for the embedding (e.g., ‘cpu’, ‘cuda’).

  • verbose (bool) – If True, print verbose output.

Returns:

A pymde.MDE instance, based on preserving the original distances.

Return type:

pymde.MDE
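
For example, to use a different loss (a sketch; data is a placeholder for your data matrix or graph):

import pymde

mde = pymde.preserve_distances(
    data,
    embedding_dim=2,
    loss=pymde.losses.SoftFractional,
)
embedding = mde.embed()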

Distortion functions#

Penalties#

Penalties: distortion functions derived from weights.

A vector distortion function \(f : \mathbf{R}^{p} \to \mathbf{R}^p\) derived from weights has component functions

\[f_k(d_k) = w_kp(d_k), \quad k=1, \ldots, p,\]

where \(w_k\) is a scalar weight, \(p\) is a penalty function, and \(d_k\) is an embedding distance. The penalty encourages distances to be small when the weights are positive, and encourages them to be not small when the weights are negative.

When an MDE problem calls a distortion function, \(d_k\) is the Euclidean distance between the items paired by the \(k\)-th edge, so \(w_k\) should be the weight associated with the \(k\)-th edge, and \(f_k(d_k)\) is the distortion associated with the edge.

Every penalty can be used with positive or negative weights. When \(w_k\) is positive, \(f_k\) is attractive, meaning it encourages the embedding distances to be small; when \(w_k\) is negative, \(f_k\) is repulsive, meaning it encourages the distances to be large. Some penalties are better suited to attracting points, while others are better suited to repelling them.

Negative weights. For negative weights, it is recommended to only use one of the following penalties:

pymde.penalties.Log
pymde.penalties.InvPower
pymde.penalties.LogRatio

These penalties go to negative infinity as the input approaches zero, and to zero as the input approaches infinity. With a negative weight, that means the distortion function goes to infinity at 0, and to 0 at infinity.

Using other penalties with negative weights is possible, but it can lead to pathological MDE problems if you are not careful.

Positive weights. Penalties that work well at attracting points are those that are \(0\) when the distance is \(0\) and grow as the distance grows. All the penalties in this module, other than the ones listed above (and the function described below), can safely be used with positive weights. Some examples include:

pymde.penalties.Log1p
pymde.penalties.Linear
pymde.penalties.Quadratic
pymde.penalties.Cubic
pymde.penalties.Huber

Combining penalties. The PushAndPull function can be used to combine two penalties, an attractive penalty for use with positive weights, and a repulsive penalty for use with negative weights. This leads to a distortion function of the form

\[\begin{split}f_k(d) = \begin{cases} w_k p_{\text{attractive}}(d_k) & w_k > 0 \\ w_k p_{\text{repulsive}}(d_k) & w_k < 0 \\ \end{cases}.\end{split}\]

For example:

weights = torch.tensor([1., 1., -1., 1., -1.])
attractive_penalty = pymde.penalties.Log1p
repulsive_penalty = pymde.penalties.Log

distortion_function = pymde.PushAndPull(
    weights,
    attractive_penalty,
    repulsive_penalty)

Example. Distortion functions are created in a vectorized or elementwise fashion. The constructor takes a sequence (torch.Tensor) of weights, returning a callable object. The object takes a sequence of distances of the same length as the weights, and returns a sequence of distortions, one for each distance.

For example:

weights = torch.tensor([1., 2., 3.])
f = pymde.penalties.Quadratic(weights)

distances = torch.tensor([2., 1., 4.])
distortions = f(distances)
# the distortions are 1 * 2**2 == 4, 2 * 1**2 == 2, 3 * 4**2 == 48
print(distortions)

prints

torch.tensor([4., 2., 48.])

class pymde.penalties.Linear(weights)#

\(p(d) = d\)

class pymde.penalties.Quadratic(weights)#

\(p(d) = d^2\)

class pymde.penalties.Cubic(weights)#

\(p(d) = d^3\)

class pymde.penalties.Power(weights, exponent)#

\(p(d) = d^\text{exponent}\)

class pymde.penalties.Huber(weights, threshold=0.5)#

\[\begin{split}p(d) = \begin{cases} 0.5 \cdot d^2 & d < \text{threshold} \\ \text{threshold}(d - 0.5 \cdot \text{threshold}) & d \geq \text{threshold} \end{cases}\end{split}\]

class pymde.penalties.Logistic(weights, threshold=0.0, alpha=3.0)#

\(p(d) = \log(1 + e^{\alpha(d - \text{threshold})})\)

class pymde.penalties.Log1p(weights, exponent=1.5)#

\(p(d) = \log(1 + d^{\text{exponent}})\)

class pymde.penalties.Log(weights, exponent=1.0)#

\(p(d) = \log(1 - \exp(-d^\text{exponent}))\)

class pymde.penalties.InvPower(weights, exponent=1)#

\(p(d) = 1/d^\text{exponent}\)

class pymde.penalties.LogRatio(weights, exponent=2)#

\(p(d) = \log\left(\frac{d^\text{exponent}}{1 + d^{\text{exponent}}}\right)\)

class pymde.penalties.PushAndPull(weights, attractive_penalty=<class 'pymde.functions.penalties.Log1p'>, repulsive_penalty=<class 'pymde.functions.penalties.LogRatio'>)#

Combine an attractive and repulsive penalty.

\[\begin{split}f_k(d) = \begin{cases} w_k p_{\text{attractive}}(d_k) & w_k > 0 \\ w_k p_{\text{repulsive}}(d_k) & w_k < 0 \\ \end{cases}\end{split}\]

Losses#

Losses: distortion functions derived from original deviations.

A vector distortion function \(f : \mathbf{R}^{p} \to \mathbf{R}^p\) derived from original deviations has component functions

\[f_k(d_k) = \ell(d_k, \delta_k), \quad k=1, \ldots, p,\]

where \(\ell\) is a loss function, \(\delta_k\) is a nonnegative deviation or dissimilarity score, and \(d_k\) is an embedding distance.

When an MDE problem calls a distortion function, \(d_k\) is the Euclidean distance between the items paired by the k-th edge, so \(\delta_k\) should be the original deviation associated with the k-th edge, and \(f_k(d_k)\) is the distortion associated with the edge.

The deviations can be interpreted as targets for the embedding distances: the loss function is 0 when \(d_k = \delta_k\), and positive otherwise. So a deviation \(\delta_k\) of 0 means that the items in the k-th edge are the same, and the larger the deviation, the more dissimilar the items are.

Distortion functions are created in a vectorized or elementwise fashion. The constructor takes a sequence (torch.Tensor) of deviations (target distances), returning a callable object. The object takes a sequence of distances of the same length as the deviations, and returns a sequence of distortions, one for each distance.

Some examples of losses include:

pymde.losses.Absolute
pymde.losses.Quadratic
pymde.losses.SoftFractional

Example.

deviations = torch.tensor([1., 2., 3.])
f = pymde.losses.Quadratic(deviations)

distances = torch.tensor([2., 5., 4.])
distortions = f(distances)
# the distortions are (2 - 1)**2 == 1, (5 - 2)**2 == 9, (4 - 3)**2 == 1
print(distortions)

prints

torch.tensor([1., 9., 1.])

class pymde.losses.Absolute(deviations)#

\(\ell(d, \delta) = |d - \delta|\)

class pymde.losses.Quadratic(deviations)#

\(\ell(d, \delta) = (d - \delta)^2\)

class pymde.losses.Cubic(deviations)#

\(\ell(d, \delta) = (d - \delta)^3\)

class pymde.losses.Power(deviations, exponent)#

\(\ell(d, \delta) = (d - \delta)^{\text{exponent}}\)

class pymde.losses.WeightedQuadratic(deviations, weights=None)#

\(\ell(d, \delta) = \frac{1}{\delta^2} (d - \delta)^2\)

If weights is not None, the coefficient \(1/\delta^2\) is replaced by the weights.

class pymde.losses.Fractional(deviations)#

\(\ell(d, \delta) = \max(\delta / d, d / \delta)\)

class pymde.losses.SoftFractional(deviations, gamma=10.0)#

\(\ell(d, \delta) = \frac{1}{\gamma}\log\left( \frac{\exp(\gamma \delta/d) + \exp(\gamma d/\delta)}{2\exp(\gamma)} \right)\)

The parameter gamma controls how close this loss is to the fractional loss. The larger gamma is, the closer to the fractional loss.

Constraints#

class pymde.constraints.Constraint#

A generic constraint.

To create a custom constraint, create a subclass of this class, and implement its abstract methods.

abstract initialization(n_items: int, embedding_dim: int, device=None) → Tensor#

Return a random embedding in the constraint set.

Parameters:
  • n_items (int) – The number of items.

  • embedding_dim (int) – The embedding dimension.

  • device (str) – Device on which to store the returned embedding.

Returns:

A tensor of shape (n_items, embedding_dim) satisfying the constraints.

Return type:

torch.Tensor

abstract name() → str#

The name of the constraint.

abstract project_onto_constraint(Z: Tensor, inplace=True) → Tensor#

Project Z onto the constraint set.

Returns a projection of Z onto the constraint set.

Parameters:
  • Z (torch.Tensor) – The point to project.

  • inplace (bool) – If True, stores the projection in Z.

Return type:

The projection of Z onto the constraints.

abstract project_onto_tangent_space(X: Tensor, Z: Tensor, inplace=True) → Tensor#

Project Z onto the tangent space of the constraint set at X.

Returns the Euclidean projection of Z onto the tangent space of the constraint set at X (where X is some matrix satisfying the constraints).

X and Z should have the same shape.

Parameters:
  • X (torch.Tensor) – A point satisfying the constraints.

  • Z (torch.Tensor) – The point to project.

  • inplace (bool) – If True, stores the projection in Z.

Returns:

The projection of Z onto the tangent space of the constraint set at X.
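
To illustrate the abstract interface above, here is a minimal sketch of a hypothetical custom constraint (not part of PyMDE) that keeps each embedding vector on the unit sphere:

import torch
import pymde

class UnitSphere(pymde.constraints.Constraint):
    def name(self):
        return 'unit_sphere'

    def initialization(self, n_items, embedding_dim, device=None):
        X = torch.randn(n_items, embedding_dim, device=device)
        return self.project_onto_constraint(X)

    def project_onto_constraint(self, Z, inplace=True):
        # Normalize each row to unit norm.
        norms = Z.norm(dim=1, keepdim=True)
        return Z.div_(norms) if inplace else Z / norms

    def project_onto_tangent_space(self, X, Z, inplace=True):
        # Remove the component of each row of Z along the corresponding
        # (unit-norm) row of X.
        coeffs = (X * Z).sum(dim=1, keepdim=True)
        return Z.sub_(coeffs * X) if inplace else Z - coeffs * X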

pymde.Centered()#

Centering constraint.

This function returns a centering constraint, which requires the embedding vectors to be centered around 0.

pymde.Standardized()#

Standardization constraint.

This function returns a standardization constraint, which constrains an embedding \(X\) to be centered (the rows have mean zero) and to satisfy \((1/n) X^T X = I\).

class pymde.Anchored(anchors, values)#

Anchor some vectors to specific values.

__init__(anchors, values)#

Constructs an anchor constraint, in which some embedding vectors (the anchors) are fixed to specific values.

Parameters:
  • anchors (torch.Tensor, shape (n_anchors,)) – A Tensor in which each entry gives the index of an anchored vertex.

  • values (torch.Tensor, shape (n_anchors, embedding_dim)) – A Tensor giving the value to which each anchor should be fixed.
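
For example, to pin two items of a 2D embedding to fixed locations (the indices and values are illustrative):

import torch
import pymde

anchors = torch.tensor([0, 9])
values = torch.tensor([[0.0, 0.0], [1.0, 1.0]])
constraint = pymde.Anchored(anchors, values)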

Preprocessing#

class pymde.Graph(adjacency_matrix)#

A weighted graph.

This class represents a weighted graph. It is backed by a scipy.sparse adjacency matrix, and can be constructed from either a dense distance matrix, a sparse adjacency matrix, or torch.Tensors of edges and weights (using the from_edges static method).

It is an error for the graph to contain self-edges (i.e., non-zero values on the diagonal).

This class is accepted as an argument to various preprocessing functions.

property adjacency_matrix#

The scipy.sparse adjacency matrix backing the graph.

property distances#

The distances associated with each edge.

draw(embedding_dim=2, standardized=False, device='cpu', verbose=False)#

Draw a graph in the Cartesian plane.

This method does some basic preprocessing, constructs an MDE problem that is often suitable for drawing graphs, and computes/returns an embedding by approximately solving the MDE problem.

Parameters:
  • embedding_dim (int) – The number of dimensions: 1, 2, or 3.

  • standardized (bool) – Whether to impose a standardization constraint.

  • device (str) – Device on which to compute/store embedding, ‘cpu’ or ‘cuda’.

  • verbose (bool) – Whether to print verbose output.

Returns:

The embedding, of shape (n_items, embedding_dim)

Return type:

torch.Tensor
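
For example (a sketch, using an illustrative triangle graph):

import torch
import pymde

graph = pymde.Graph.from_edges(torch.tensor([[0, 1], [1, 2], [2, 0]]))
embedding = graph.draw(embedding_dim=2)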

property edges#

A torch.Tensor of the edges in the graph.

static from_edges(edges, weights=None, n_items=None)#

Construct a graph from edges and weights.

Parameters:
  • edges (torch.Tensor) – Tensor of edges, of shape (n_edges, 2), with each edge represented by a row, i.e., by two integers.

  • weights (torch.Tensor, optional) – Tensor of weights, of shape (n_edges,), with each weight a float.

  • n_items (int, optional) – The number of items in the graph; if None, this is taken to be the maximum value in edges plus 1.

Returns:

A Graph object representing edges and weights.

Return type:

pymde.Graph
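
For example (a sketch; the edges and weights are illustrative):

import torch
import pymde

edges = torch.tensor([[0, 1], [1, 2], [2, 3]])
weights = torch.tensor([1.0, 2.0, 1.0])
graph = pymde.Graph.from_edges(edges, weights)
# n_items is inferred as 4, the maximum index plus 1.
print(graph.n_items, graph.n_edges)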

property n_all_edges#

The number of all possible edges, (n_items choose 2).

property n_edges#

The number of edges in the graph.

property n_items#

The number of items in the graph.

neighbor_distances(node) → ndarray#

The distances associated with the edges connected to node.

neighbors(node: int) → ndarray#

The indices of the neighbors of node.

pymde.preprocess.k_nearest_neighbors(data, k, max_distance=None, verbose=False)#

Compute k-nearest neighbors, given data matrix or graph.

This function computes a k-nearest neighbor graph, given either a data matrix or an original graph.

When the input is a data matrix, each row is treated as the feature vector of an item, and the Euclidean distance is used to judge how close two items are to each other.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

In the returned graph, if i is a neighbor of j and j is a neighbor of i, then the weight w_{ij} will be +2; if only one is a neighbor of the other, then w_{ij} will be +1; if neither is a neighbor of the other, then (i, j) will not be in the graph.

Parameters:
  • data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape (n_items, n_features), or a pymde.Graph

  • k (int) – The number of nearest neighbors per item

  • max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than max_distance.

  • verbose (bool) – If True, print verbose output.

Returns:

The k-nearest neighbor graph. Access the weights with graph.weights, and the edges with graph.edges.

Return type:

pymde.Graph
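
For example (a sketch; data is a placeholder for a data matrix or pymde.Graph):

import pymde

knn_graph = pymde.preprocess.k_nearest_neighbors(data, k=15)
print(knn_graph.edges.shape)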

pymde.preprocess.dissimilar_edges(n_items, similar_edges, num_edges=None, seed=None)#

Sample edges not in similar_edges.

Given a number of items, and a torch.Tensor containing pairs of items known to be similar, this function samples (uniformly at random) some edges in the complement of similar_edges. The returned number of edges will be approximately equal to (and no greater than) the number of edges in similar_edges (or num_edges, if provided).

Parameters:
  • n_items (int) – The number of items.

  • similar_edges (torch.Tensor) – Edges to exclude when sampling.

  • num_edges (int (optional)) – Number of edges to sample, defaults to approximately the same number as similar_edges.

  • seed (int (optional)) – Random seed for the sampling.

Returns:

Edges not in similar_edges.

Return type:

torch.Tensor

pymde.preprocess.distances(data, retain_fraction=1.0, verbose=False)#

Compute distances, given data matrix or graph.

This function computes distances between some pairs of items, given either a data matrix or a pymde.Graph instance.

When the input is a data matrix, each row is treated as the feature vector of an item.

When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.

The retain_fraction keyword argument can be used to compute only a fraction of the distances. This can be useful when there are many items, in which case storing all the distances may be intractable.

Parameters:
  • data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape (n_items, n_features), or a pymde.Graph

  • retain_fraction (float, optional) – A float between 0 and 1, specifying the fraction of all (n_items choose 2) distances to compute. For example, if retain_fraction is 0.1, only 10 percent of the edges (and their distances) will be stored.

  • verbose (bool) – If True, print verbose output.

Returns:

A graph object holding the distances and corresponding edges. Access the distances with graph.distances, and the edges with graph.edges.

Return type:

pymde.Graph

Classical embeddings#

pymde.pca(Y, embedding_dim)#

PCA embedding of a data matrix.

Parameters:
  • Y (torch.Tensor(shape=(n, k))) – data matrix, with n >= k

  • embedding_dim (int) – the number of eigenvectors to retrieve, the embedding dimension; must be <= k

Returns:

The top embedding_dim eigenvectors of YY^T, scaled by sqrt(n).

Return type:

torch.Tensor(shape=(n, embedding_dim))

pymde.laplacian_embedding(data, embedding_dim=2, n_neighbors=None, max_distance=None, init='quadratic', device='cpu', verbose=False) → MDE#

Constructs an MDE problem for computing a Laplacian embedding.

The embedding preserves the nearest neighbors of each item, using quadratic distortion functions and a standardization constraint.

Parameters:
  • data ({torch.Tensor, numpy.ndarray, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape (n_items, n_features) or a graph. Neighbors are computed using Euclidean distance if the data is a matrix, or the shortest-path metric if the data is a graph.

  • embedding_dim (int) – The embedding dimension. Use 2 or 3 for visualization.

  • n_neighbors (int (optional)) – The number of nearest neighbors to compute for each row (item) of data. A sensible value is chosen by default, depending on the number of items.

  • max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than max_distance.

  • init (str) – Initialization strategy; ‘quadratic’ or ‘random’. If the quadratic initialization takes too much time, try a random initialization.

  • device (str (optional)) – Device for the embedding (e.g., ‘cpu’, ‘cuda’).

  • verbose (bool) – If True, print verbose output.

Returns:

A pymde.MDE object, based on the original data.

Return type:

pymde.MDE

pymde.quadratic.spectral(n_items, embedding_dim, edges, weights, cg=False, max_iter=40, device='cpu')#

Compute a spectral embedding.

Solves the quadratic MDE problem

\[\begin{split}\begin{array}{ll} \mbox{minimize} & \sum_{(i, j) \in \text{edges}} w_{ij} d_{ij}^2 \\ \mbox{subject to} & (1/n) X^T X = I, \quad d_{ij} = \|x_i - x_j\|_2. \end{array}\end{split}\]

By default, the problem is solved using a Lanczos method. If cg=True, LOBPCG is used; LOBPCG is warm-started by running a projected quasi-Newton method for a small number of iterations. Use cg=True when the number of edges is very large and an approximate solution is acceptable (the Lanczos method typically gives much more accurate solutions, but can be slower).

Parameters:
  • n_items (int) – The number of items

  • embedding_dim (int) – The embedding dimension

  • edges (torch.Tensor(shape=(n_edges, 2))) – A list of edges (i, j), 0 <= i < j < n_items

  • weights (torch.Tensor(shape=(n_edges,))) – A list of nonnegative weights associated with each edge

  • cg (bool) – If True, uses a preconditioned CG method to find the embedding, which requires that the Laplacian matrix plus the identity is positive definite; otherwise, a Lanczos method is used. Use True when the Lanczos method is too slow (which might happen when the number of edges is very large).

  • max_iter (int) – Maximum number of iterations for the CG method.

  • device (str (optional)) – The device on which to allocate the embedding.

Returns:

A spectral embedding, projected onto the standardization constraint.

Return type:

torch.Tensor(shape=(n_items, embedding_dim))
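
For example (a sketch; the number of items and the weights are illustrative):

import torch
import pymde

edges = pymde.all_edges(4)
weights = torch.ones(edges.shape[0])
X = pymde.quadratic.spectral(
    n_items=4, embedding_dim=2, edges=edges, weights=weights)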

Utilities#

pymde.all_edges(n)#

Return a tensor of all (n choose 2) edges.

Constructs all possible edges among n items. For example, if n is 4, the return value will be equal to

torch.Tensor([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]])

pymde.align(source, target)#

Align a source embedding to a target embedding.

Aligns the source embedding to the target embedding, via rotation. The alignment is computed by solving an orthogonal Procrustes problem.

Parameters:
  • source (torch.Tensor) – The embedding to be aligned.

  • target (torch.Tensor) – The target embedding, of the same shape as source.

Returns:

The rotation of source best aligned to the target.

Return type:

torch.Tensor

pymde.center(X)#

Center an embedding.

Returns a new embedding, equal to the given embedding minus the mean of its rows.

pymde.rotate(X, degrees)#

Rotate a 2D or 3D embedding.

Rotates a 2D or 3D embedding by degrees. If X is a 2D embedding, degrees should be a scalar; if it is 3D, degrees should be a length-3 torch.Tensor, with one angle for each axis (the embedding will be rotated along the x-axis first, then the y-axis, then the z-axis).

Parameters:
  • X (torch.Tensor) – The embedding to rotate.

  • degrees (torch.Tensor) – The angles of rotation.

Returns:

The rotated embedding

Return type:

torch.Tensor
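
For example, these utilities can be composed (a sketch; X and Y are placeholder embeddings of the same shape):

X_centered = pymde.center(X)
X_rotated = pymde.rotate(X_centered, degrees=90.0)
X_aligned = pymde.align(source=X_rotated, target=Y)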

pymde.seed(seed: int)#

Set the random seed.

This function sets the random seed that PyMDE uses in various preprocessing methods.

pymde.plot(X, color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), savepath=None)#

Plot an embedding, in one, two, or three dimensions.

This function plots embeddings. The input embedding’s dimension should be at most 3.

The embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.

Parameters:
  • X (array-like) – The embedding to plot, of shape (n_items, embedding_dim). The second dimension should be 1, 2, or 3.

  • color_by (array-like, optional) –

    A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if n_items is 4,

    np.array(['dog', 'cat', 'zebra', 'cat'])
    np.array([0, 1, 1, 2])
    np.array([0.1, 0.5, 0.31, 0.99])
    

    are all acceptable. The first two are treated as categorical, the third is continuous. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.

  • color_map (str or matplotlib colormap instance) – Color map to use when resolving color_by to colors; ignored when color_by is None.

  • colors (array-like, optional) –

    A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.

    Only one of color_by and colors should be non-None.

  • edges (array-like, optional) – List of edges to superimpose over the scatter plot, shape (any, 2)

  • axis_limits (tuple, optional) – tuple (limit_low, limit_high) of axis limits, applied to both the x and y axis.

  • background_color (str, optional) – color of background

  • marker_size (float, optional) – size of each point in the scatter plot

  • figsize_inches (tuple) – size of figures in inches: (width_inches, height_inches)

  • savepath (str, optional) – path to save the plot.

Returns:

Axis on which the embedding is plotted.

Return type:

matplotlib.Axes
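
For example (a sketch; embedding and labels are placeholders):

ax = pymde.plot(
    embedding,
    color_by=labels,
    marker_size=2.0,
    savepath='embedding.png',
)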

Datasets#

class pymde.datasets.Dataset(data, attributes, other_data=None, metadata=None)#

Represents a dataset.

Each instance has two main attributes:

data:

either a data matrix, or an instance of pymde.Graph; the data to be embedded.

attributes:

a dictionary whose values give attributes associated with the items in the data, such as the digit labels in MNIST.

Other data that the dataset might carry is in the (dict) attribute other_data. Metadata about where the data came from is stored in metadata.

pymde.datasets.MNIST(root='./', download=True) → Dataset#

MNIST dataset (LeCun, et al.).

The MNIST dataset contains 70,000 28x28 images of handwritten digits.

  • data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).

  • attributes dict: the key digits holds an array in which each entry gives the digit depicted in the corresponding row of data.
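
For example, a short end-to-end sketch using the digits key described above:

import pymde

mnist = pymde.datasets.MNIST()
embedding = pymde.preserve_neighbors(mnist.data, verbose=True).embed()
pymde.plot(embedding, color_by=mnist.attributes['digits'])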

pymde.datasets.FashionMNIST(root='./', download=True) → Dataset#

Fashion-MNIST dataset (Xiao, et al.).

The Fashion-MNIST dataset contains 70,000 28x28 images of Zalando fashion articles.

  • data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).

  • attributes dict: the key class holds an array in which each entry gives the fashion article in the corresponding row of data.

pymde.datasets.google_scholar(root='./', download=True, full=False) → Dataset#

Google Scholar dataset (Agrawal, et al.).

The Google Scholar dataset contains an academic coauthorship graph: the nodes are authors, and two authors are connected by an edge if either author listed the other as a coauthor on Google Scholar. (Note that if two authors collaborated on a paper, but neither has listed the other as a coauthor on their Scholar profiles, then they will not be connected by an edge).

If full is False, obtains a small version of the dataset, on roughly 40,000 authors, each with h-index at least 50. If full is True, obtains the whole dataset, on roughly 600,000 authors. The full dataset is roughly 1GB in size.

  • data: a pymde.Graph, with nodes representing authors

  • attributes: the coauthors key holds an array giving the number of coauthors of each author, normalized to be a percentile.

  • other_data: holds a dataframe describing the dataset, under the key dataframe.

pymde.datasets.google_scholar_interests(root='./', download=True) → Dataset#

Google Scholar academic interests

This dataset contains a cooccurrence matrix of the 5,000 most popular academic interests listed by authors on Google Scholar.

  • data: a cooccurrence matrix, counting the number of times two interests appeared together among a single author’s listed interests, with data[i, j] giving the cooccurrence count of interests[i] and interests[j].

  • attributes: one key, interests, where interests[i] is the interest corresponding to the ith row/column of data.

pymde.datasets.covid19_scrna_wilk(root='./', download=True) → Dataset#

COVID-19 scRNA data (Wilk, et al.).

The COVID-19 dataset includes a PCA embedding of single-cell mRNA transcriptomes of roughly 40,000 cells, taken from some patients with COVID-19 infections and from healthy controls.

Instructions on how to obtain the full dataset are available in the paper by Wilk et al.: https://www.nature.com/articles/s41591-020-0944-y.

  • data: the PCA embedding

  • attributes: two keys, cell_type and health_status.

pymde.datasets.population_genetics(root='./', download=True) → Dataset#

Population genetics dataset (Nelson, et al.).

The population genetics dataset includes a PCA embedding (in R^20) of single nucleotide polymorphism data associated with 1,387 individuals thought to be of European descent. (The data is from the Population Reference Sample project by Nelson, et al.)

It also contains a “corrupted” version of the data, in which 154 additional points have been injected; the first 10 coordinates of these synthetic points are generated using a discrete uniform distribution on {0, 1, 2}, and the last 10 are generated using a discrete uniform distribution on {1/12, /18}.

A study by Novembre et al. (2008) showed that a PCA embedding in R^2 roughly resembles the map of Europe, suggesting that the genes encode geographical information. But PCA does not produce interesting visualizations of the corrupted data. If the distortion functions are chosen to be robust (e.g., using the Log1p or Huber attractive penalties), we can create embeddings that preserve the geographical structure, while placing the synthetic points to the side, in their own cluster.

  • data: the PCA embedding of the clean genetic data, in R^20

  • corrupted_data: the corrupted data, in R^20

  • attributes: two keys, clean_colors and corrupted_colors.

pymde.datasets.counties(root='./', download=True) → Dataset#

US counties (2013-2017 ACS 5-Year Estimates)

This dataset contains 34 demographic features for each of the 3,220 US counties. The features were collected by the 2013-2017 ACS 5-Year Estimates longitudinal survey, run by the US Census Bureau.

  • data: the demographic data, one row per county

  • county_dataframe: the raw ACS data

  • voting_dataframe: the raw 2016 voting data

  • attributes: one key, democratic_fraction, the fraction of voters who voted Democratic in each county in the 2016 presidential election.