API documentation#
Note
PyMDE is very young software. When updates are released, we will try to be backward compatible with earlier versions, but sometimes we may be unable to do so.
If a new version carries a breaking change, its minor version number will be greater than the previous version (e.g., v0.2.0 may have some changes that are incompatible with v0.1.5, but v0.1.5 will be fully compatible with v0.1.0).
MDE#
- class pymde.MDE(n_items: int, embedding_dim: int, edges: Tensor, distortion_function: Callable | StochasticFunction, constraint: Constraint | None = None, device: str | None = None)#
An MDE problem.
An MDE instance represents a specific MDE problem, specified by the number of items, the embedding dimension, a list of edges, the vector distortion function, and possibly a constraint.
- n_items#
The number of items
- Type:
int
- embedding_dim#
The embedding dimension
- Type:
int
- edges#
The list of edges.
- Type:
torch.Tensor
- distortion_function#
The vector distorton function.
- Type:
Callable
- constraint#
The constraint imposed (or None)
- device#
The device on which to compute the embedding and store data (like ‘cpu’ or ‘cuda’)
- solve_stats#
Summary statistics about the embedding, populated after calling the
embed
method.
Public Methods:
__init__
(n_items, embedding_dim, edges, ...)Constructs an MDE problem.
to
(device)Move MDE instance to another device.
__str__
()Return str(self).
differences
(X)Compute
X[i] - X[j]
for each row (i, j) in edges.distances
([X])Compute embedding distances.
distortions
([X])Compute distortions
average_distortion
([X])Compute average distortion.
high_distortion_pairs
([X])Compute distortions, sorted from high to low.
embed
([X, eps, max_iter, memory_size, ...])Compute an embedding.
forward
([X, eps, max_iter, memory_size, ...])Compute an embedding.
plot
([color_by, color_map, colors, edges, ...])Plot an embedding, in one, two, or three dimensions.
distortions_cdf
([figsize_inches])Plot a cumulative distribution function of the distortions.
play
([color_by, color_map, colors, edges, ...])Create a movie visualizing how the embedding was formed.
Inherited from
Module
__init__
(*args, **kwargs)Initializes internal Module state, shared by both nn.Module and ScriptModule.
forward
(*input)Defines the computation performed at every call.
register_buffer
(name, tensor[, persistent])Adds a buffer to the module.
register_parameter
(name, param)Adds a parameter to the module.
add_module
(name, module)Adds a child module to the current module.
register_module
(name, module)Alias for
add_module()
.get_submodule
(target)Returns the submodule given by
target
if it exists, otherwise throws an error.get_parameter
(target)Returns the parameter given by
target
if it exists, otherwise throws an error.get_buffer
(target)Returns the buffer given by
target
if it exists, otherwise throws an error.get_extra_state
()Returns any extra state to include in the module's state_dict.
set_extra_state
(state)This function is called from
load_state_dict()
to handle any extra state found within the state_dict.apply
(fn)Applies
fn
recursively to every submodule (as returned by.children()
) as well as self.cuda
([device])Moves all model parameters and buffers to the GPU.
ipu
([device])Moves all model parameters and buffers to the IPU.
xpu
([device])Moves all model parameters and buffers to the XPU.
cpu
()Moves all model parameters and buffers to the CPU.
type
(dst_type)Casts all parameters and buffers to
dst_type
.float
()Casts all floating point parameters and buffers to
float
datatype.double
()Casts all floating point parameters and buffers to
double
datatype.half
()Casts all floating point parameters and buffers to
half
datatype.bfloat16
()Casts all floating point parameters and buffers to
bfloat16
datatype.to_empty
(*, device)Moves the parameters and buffers to the specified device without copying storage.
to
()Moves and/or casts the parameters and buffers.
register_full_backward_pre_hook
(hook[, prepend])Registers a backward pre-hook on the module.
register_backward_hook
(hook)Registers a backward hook on the module.
register_full_backward_hook
(hook[, prepend])Registers a backward hook on the module.
register_forward_pre_hook
(hook, *[, ...])Registers a forward pre-hook on the module.
register_forward_hook
(hook, *[, prepend, ...])Registers a forward hook on the module.
__call__
(*args, **kwargs)Call self as a function.
__setstate__
(state)__getattr__
(name)__setattr__
(name, value)Implement setattr(self, name, value).
__delattr__
(name)Implement delattr(self, name).
register_state_dict_pre_hook
(hook)These hooks will be called with arguments:
self
,prefix
, andkeep_vars
before callingstate_dict
onself
.state_dict
()Returns a dictionary containing references to the whole state of the module.
register_load_state_dict_post_hook
(hook)Registers a post hook to be run after module's
load_state_dict
is called.load_state_dict
(state_dict[, strict])Copies parameters and buffers from
state_dict
into this module and its descendants.parameters
([recurse])Returns an iterator over module parameters.
named_parameters
([prefix, recurse, ...])Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
buffers
([recurse])Returns an iterator over module buffers.
named_buffers
([prefix, recurse, ...])Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
children
()Returns an iterator over immediate children modules.
named_children
()Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
modules
()Returns an iterator over all modules in the network.
named_modules
([memo, prefix, remove_duplicate])Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
train
([mode])Sets the module in training mode.
eval
()Sets the module in evaluation mode.
requires_grad_
([requires_grad])Change if autograd should record operations on parameters in this module.
zero_grad
([set_to_none])Sets gradients of all model parameters to zero.
share_memory
()See
torch.Tensor.share_memory_()
extra_repr
()Set the extra representation of the module
__repr__
()Return repr(self).
__dir__
()Default dir() implementation.
- __init__(n_items: int, embedding_dim: int, edges: Tensor, distortion_function: Callable | StochasticFunction, constraint: Constraint | None = None, device: str | None = None)#
Constructs an MDE problem.
- Parameters:
n_items (int) – Number of things being embedded.
embedding_dim (int) – Embedding dimension.
edges (torch.Tensor(shape=(num_edges, 2), dtype=torch.int)) – Tensor, where each row is an edge (i, j) between two items; each edge should satisfy 0 <= i < j < n_items. In particular self-edges are not allowed.
distortion_function (Callable or pymde.functions.StochasticFunction) – The vectorized distortion function, typically an instance of a class from
pymde.penalties
orpymde.losses
however, this can be any Python callable that maps a torch.Tensor of embedding distances to a torch.Tensor of distortions.constraint (pymde.constraints.Constraint, optional) – A Constraint object, such as
pymde.Standardized()
Defaults to an unconstrained (centered) embedding.device (str, optional) – Name of device on which to store tensors/compute embedding, such as ‘cpu’ or ‘cuda’ for GPU. Default infers device from
edges
anddistortion_function
- average_distortion(X=None)#
Compute average distortion.
This method computes the average distortion of an embedding.
If
X
is None, this method computes the average distortion for the embedding computed by the last call toembed
.- Parameters:
X (torch.Tensor(shape=(n_items, 2)), optional) – Embedding.
- Returns:
The average distortion.
- Return type:
scalar (float)
- differences(X)#
Compute
X[i] - X[j]
for each row (i, j) in edges.
- distances(X=None)#
Compute embedding distances.
This function computes the embedding distances corresponding to the list of edges (
self.edges
)If
X
is None, computes distances for the embedding computed by the last call toembed
- Parameters:
X (torch.Tensor(shape=(n_items, 2)), optional) – Embedding.
- Returns:
Vector of embedding distances. The k-th entry is the distance corresponding to the k-th edge (
self.edges[k]
- Return type:
torch.Tensor(shape=(n_edges,))
- distortions(X=None)#
Compute distortions
This function computes the distortions for an embedding.
If
X
is None, computes distortions for the embedding computed by the last call toembed
- Parameters:
X (torch.Tensor(shape=(n_items, 2)), optional) – Embedding.
- Returns:
Vector of distortions. The k-th entry is the distortion for the k-th edge (
self.edges[k]
- Return type:
torch.Tensor(shape=(n_edges,))
- distortions_cdf(figsize_inches=(8, 3))#
Plot a cumulative distribution function of the distortions.
The
embed
method must be called sometime before calling this method.
- embed(X=None, eps=1e-05, max_iter=300, memory_size=10, verbose=False, print_every=None, snapshot_every=None)#
Compute an embedding.
This method stores the embedding in the
X
attribute of the problem instance (mde.X
). Summary statistics related to the fitting process are stored insolve_stats
(mde.solve_stats
).All arguments have sensible default values, so in most cases, it suffices to just type
mde.embed()
ormde.embed(verbose=True)
- Parameters:
X (torch.Tensor, optional) – Initial iterate, of shape
(n_items, embedding_dim)
. When None, the initial iterate is chosen randomly (and projected onto the constraints); otherwise, the initial iterate should satisfy the constraints.eps (float) – Residual norm threshold; quit when the residual norm is smaller than
eps
.max_iter (int) – Maximum number of iterations.
memory_size (int) – The quasi-Newton memory. Larger values may lead to more stable behavior, but will increase the amount of time each iteration takes.
verbose (bool) – Whether to print verbose output.
print_every (int, optional) – Print verbose output every
print_every
iterations.snapshot_every (int, optional) – Snapshot embedding every
snapshot_every
iterations; snapshots saved as CPU tensors toself.solve_stats.snapshots
. If you want to generate an animation with theplay
method after embedding, setsnapshot_every
to a positive integer (like 1 or 5).
- Returns:
The embedding, of shape
(n_items, embedding_dim)
.- Return type:
torch.Tensor
- high_distortion_pairs(X=None)#
Compute distortions, sorted from high to low.
Computes the distortions for an embedding, sorting them from high to low. Returns the edges and distortions in the sorted order.
This function can be used to debug an embedding or search for outliers in the data. In particular it can be instructive to manually examine the items which incurred the highest distortion.
For example:
edges, distortions = mde.high_distortion_pairs() maybe_outliers = edges[:10]
You can then examine the items corresponding to the edges in
maybe_outliers
.- Parameters:
X (torch.Tensor(shape=(n_items, 2)), optional) – Embedding.
- Returns:
torch.Tensor(shape=(n_edges, 2)) – edges, sorted from highest distortion to lowest
torch.Tensor(shape=(n_edges,)) – distortions, sorted from highest to lowest
- play(color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), fps=None, tmpdir=None, savepath=None)#
Create a movie visualizing how the embedding was formed.
This method creates a GIF showing how the embedding was formed, starting with the initial iterate and ending with the final embedding. In other words it visualizes how the embedding algorithm (the
embed
method) updated the embedding’s state over time. (The embedding dimension should be at most 3.)In each frame, the embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.
The
embed
method must be called sometime before calling this method, with a non-None value for thesnapshot_every
keyword argument.If you want to save the GIF (instead of just viewing it in a Jupyter notebook), make sure to supply a path via the
savepath
keyword argument.- Parameters:
color_by (np.ndarray(shape=mde.n_items), optional) –
A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if
n_items
is 4,categorical_values = np.ndarray(['dog', 'cat', 'zebra', 'cat']) other_categorical_values = np.ndarray([0, 1, 1, 2] continuous_values = np.ndarray([0.1, 0.5, 0.31, 0.99]
are all acceptable values for this argument. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.
color_map (str or matplotlib colormap instance) – Color map to use when resolving
color_by
to colors; ignored whencolor_by
is None.colors (np.ndarray(shape=(mde.n_items, 4)), optional) –
A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.
Only one of
color_by
andcolors
should be non-None.edges ({torch.Tensor/np.ndarray}(shape=(any, 2)), optional) – List of edges to superimpose over the scatter plot.
axis_limits (tuple(length=2), optional) – tuple (limit_low, limit_high) of axis limits, applied to both the x and y axis.
background_color (str, optional) – color of background
marker_size (float, optional) – size of each point in the scatter plot
figsize_inches (tuple(length=2)) – size of figures in inches: (width_inches, height_inches)
fps (float, optional) – the number of frames per second at which the movie should be shown
tmpdir (str, optional) – path to directory, where individual images comprising the GIF will be stored
savepath (str, optional) – path at which to save the GIF.
- plot(color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), savepath=None)#
Plot an embedding, in one, two, or three dimensions.
This method plots embeddings, with embedding dimension at most 3.
The embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.
The
embed
method must be called sometime before calling this method.- Parameters:
color_by (np.ndarray(shape=mde.n_items), optional) –
A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if
n_items
is 4,np.ndarray(['dog', 'cat', 'zebra', 'cat']) np.ndarray([0, 1, 1, 2] np.ndarray([0.1, 0.5, 0.31, 0.99]
are all acceptable. The first two are treated as categorical, the third is continuous. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.
color_map (str or matplotlib colormap instance) – Color map to use when resolving
color_by
to colors; ignored whencolor_by
is None.colors (np.ndarray(shape=(mde.n_items, 4)), optional) –
A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.
Only one of
color_by
andcolors
should be non-None.edges ({torch.Tensor/np.ndarray}(shape=(any, 2)), optional) – List of edges to superimpose over the scatter plot.
axis_limits (tuple(length=2), optional) – tuple (limit_low, limit_high) of axis limits, applied to both the x and y axis.
background_color (str, optional) – color of background
marker_size (float, optional) – size of each point in the scatter plot
figsize_inches (tuple(length=2)) – size of figures in inches: (width_inches, height_inches)
savepath (str, optional) – path to save the plot.
- Returns:
Axis on which the embedding is plotted.
- Return type:
matplotlib axis
- to(device)#
Move MDE instance to another device.
- class pymde.optim.SolveStats(average_distortions, residual_norms, step_size_percents, solve_time, times, snapshots, snapshot_every)#
Summary statistics for a solve.
- average_distortions#
The average distortion at each iteration.
- Type:
sequence
- residual_norms#
The residual norm at each iteration.
- Type:
sequence
- step_size_percents#
The relative size of each step.
- Type:
sequence
- solve_time#
The time it took to create the embedding, in seconds.
- Type:
float
- snapshots#
Snapshots of the embedding.
- Type:
sequence
- snapshot_every#
The number of iterations between snapshots.
- Type:
int
Preserve neighbors#
- pymde.preserve_neighbors(data, embedding_dim=2, attractive_penalty=<class 'pymde.functions.penalties.Log1p'>, repulsive_penalty=<class 'pymde.functions.penalties.Log'>, constraint=None, n_neighbors=None, repulsive_fraction=None, max_distance=None, init='quadratic', device='cpu', verbose=False) MDE #
Construct an MDE problem designed to preserve local structure.
This function constructs an MDE problem for preserving the local structure of original data. This MDE problem is well-suited for visualization (using
embedding_dim
2 or 3), but can also be used to generate features for machine learning tasks (withembedding_dim
= 10, 50, or 100, for example). It yields embeddings in which similar items are near each other, and dissimilar items are not near each other.The original data can either be a data matrix, or a graph. Data matrices should be torch Tensors, NumPy arrays, or scipy sparse matrices; graphs should be instances of
pymde.Graph
.The MDE problem uses distortion functions derived from weights (i.e., penalties).
To obtain an embedding, call the
embed
method on the returnedMDE
object. To plot it, usepymde.plot
.embedding = pymde.preserve_neighbors(data).embed() pymde.plot(embedding)
- Parameters:
data ({torch.Tensor, numpy.ndarray, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape
(n_items, n_features)
or a graph. Neighbors are computed using Euclidean distance if the data is a matrix, or the shortest-path metric if the data is a graph.embedding_dim (int) – The embedding dimension. Use 2 or 3 for visualization.
attractive_penalty (pymde.Function class (or factory)) – Callable that constructs a distortion function, given positive weights. Typically one of the classes from
pymde.penalties
, such aspymde.penalties.log1p
,pymde.penalties.Huber
, orpymde.penalties.Quadratic
.repulsive_penalty (pymde.Function class (or factory)) – Callable that constructs a distortion function, given negative weights. (If
None
, only positive weights are used.) For example,pymde.penalties.Log
orpymde.penalties.InversePower
.constraint (pymde.constraints.Constraint (optional)) – Embedding constraint, like
pymde.Standardized()
orpymde.Anchored(anchors, values)
(orNone
). Defaults to no constraint when a repulsive penalty is provided, otherwise defaults topymde.Standardized()
.n_neighbors (int (optional)) – The number of nearest neighbors to compute for each row (item) of
data
. A sensible value is chosen by default, depending on the number of items.repulsive_fraction (float (optional)) – How many repulsive edges to include, relative to the number of attractive edges.
1
means as many repulsive edges as attractive edges. The higher this number, the more uniformly spread out the embedding will be. Defaults to0.5
for standardized embeddings, and1
otherwise. (Ifrepulsive_penalty
isNone
, this argument is ignored.)max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than
max_distance
.init (str) – Initialization strategy; ‘quadratic’ or ‘random’. If the quadratic initialization takes too much time, try a random initialization. When device==’cuda’ and init==’quadratic’ a torch-implemented version will be used, which is fully on device, but uses a CG-based technique (versus Lanczos).
device (str (optional)) – Device for the embedding (eg, ‘cpu’, ‘cuda’).
verbose (bool) – If
True
, print verbose output.
- Returns:
A
pymde.MDE
object, based on the original data.- Return type:
Preserve distances#
- pymde.preserve_distances(data, embedding_dim=2, loss=<class 'pymde.functions.losses.Absolute'>, constraint=None, max_distances=50000000.0, device='cpu', verbose=False) MDE #
Construct an MDE problem based on original distances.
This function constructs an MDE problem for preserving pairwise distances between items. This can be useful for preserving the global structure of the data.
The data can be specified with either a data matrix (a NumPy array, torch Tensor, or sparse matrix), or a
pymde.Graph
instance encoding the distances:A NumPy array, torch tensor, or sparse matrix is interpreted as a collection of feature vectors: each row gives the feature vector for an item. The original distances are the Euclidean distances between the feature vectors.
A
pymde.Graph
instance is interpreted as encoding all (n_items choose 2) distances: the distance between i and j is taken to be the length of the shortest path connecting i and j.When the number of items n_items is large, the total number of pairs will be very large. When this happens, instead of computing all pairs of distances, this function will sample a subset uniformly at random. The maximum number of distances to compute is specified by the parameter
max_distances
. Depending on how many items you have (and how much memory your machine has), you may need to adjust this parameter.To obtain an embedding, call the
embed
method on the returned object. To plot it, usepymde.plot
.For example:
embedding = pymde.preserve_distances(data).embed() pymde.plot(embedding)
- Parameters:
data ({np.ndarray, torch.Tensor, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape
(n_items, n_features)
or a graph.embedding_dim (int) – The embedding dimension.
loss (pymde.Function class (or factory)) – Callable that constructs a distortion function, given original distances. Typically one of the classes defined in
pymde.losses
, such aspymde.losses.Absolute
, orpymde.losses.WeightedQuadratic
.constraint (pymde.constraints.Constraint (optional)) – Embedding constraint, such as
pymde.Standardized()
orpymde.Anchored(anchors, values)
(orNone
). Defaults to no constraint. Note: when the constraint ispymde.Standardized()
, the original distances will be scaled by a constant (because the standardization constraint puts a limit on how large any one distance can be).max_distances (int) – Maximum number of distances to compute.
device (str (optional)) – Device for the embedding (eg, ‘cpu’, ‘cuda’).
verbose (bool) – If
True
, print verbose output.
- Returns:
A
pymde.MDE
instance, based on preserving the original distances.- Return type:
Distortion functions#
Penalties#
Penalties: distortion functions derived from weights.
A vector distortion function \(f : \mathbf{R}^{p} \to \mathbf{R}^p\) derived from weights has component functions
where \(w_k\) is a scalar weight, \(p\) is a penalty function, and \(d_k\) is an embedding distance. The penalty encourages distances to be small when the weights are positive, and encourages them to be not small when the weights are negative.
When an MDE problem calls a distortion function, \(d_k\) is the Euclidean distance between the items paired by the \(k\)-th edge, so \(w_k\) should be the weight associated with the \(k\)-th edge, and \(f_k(d_k)\) is the distortion associated with the edge.
Every penalty can be used with positive or negative weights. When \(w_k\) is positive, \(f_k\) is attractive, meaning it encourages the embedding distances to be small; when \(w_k\) is negative, \(f_k\) is repulsive, meaning it encourages the distances to be large. Some penalties are better suited to attracting points, while others are better suited to repelling them.
Negative weights. For negative weights, it is recommended to only use one of the following penalties:
pymde.penalties.Log
pymde.penalties.InvPower
pymde.penalties.LogRatio
These penalties go to negative infinity as the input approaches zero, and to zero as the input approaches infinity. With a negative weight, that means the distortion function goes to infinity at 0, and to 0 at infinity.
Using other penalties with negative weights is possible, but it can lead to pathological MDE problems if you are not careful.
Positive weights. Penalties that work well in attracting points are those that are \(0\) when the distance is \(0\), grows when the distance is larger than \(0\). All the penalties in this module, other than the ones listed above (and the function described below), can be safely used with attractive penalties. Some examples inlcude:
pymde.penalties.Log1p
pymde.penalties.Linear
pymde.penalties.Quadratic
pymde.penalties.Cubic
pymde.penalties.Huber
Combining penalties.
The PushAndPull
function can be used to combine two penalties, an attractive
penalty for use with positive weights, and a repulsive penalty for use with
negative weights. This leads to a distortion function of the form
For example:
weights = torch.tensor([1., 1., -1., 1., -1.])
attractive_penalty = pymde.penalties.Log1p
repulsive_penalty = pymde.penalties.Log
distortion_function = pymde.PushAndPull(
weights,
attractive_penalty,
repulsive_penalty)
Example. Distortion functions are created in a vectorized or elementwise fashion. The constructor takes a sequence (torch.Tensor) of weights, returning a callable object. The object takes a sequence of distances of the same length as the weights, and returns a sequence of distortions, one for each distance.
For example:
weights = torch.tensor([1., 2., 3.])
f = pymde.penalties.Quadratic(weights)
distances = torch.tensor([2., 1., 4.])
distortions = f(distances)
# the distortions are 1 * 2**2 == 4, 2 * 1**2 == 2, 3 * 4**2 = 48
print(distortions)
prints
torch.tensor([4., 2., 48.])
- class pymde.penalties.Linear(weights)#
\(p(d) = d\)
- class pymde.penalties.Quadratic(weights)#
\(p(d) = d^2\)
- class pymde.penalties.Cubic(weights)#
\(p(d) = d^3\)
- class pymde.penalties.Power(weights, exponent)#
\(p(d) = d^\text{exponent}\)
- class pymde.penalties.Huber(weights, threshold=0.5)#
- \[\begin{split}p(d) = \begin{cases} 0.5 \cdot d^2 & d < \text{threshold} \\ \text{threshold}(d - 0.5 \cdot \text{threshold}) & d \geq \text{threshold} \end{cases}\end{split}\]
- class pymde.penalties.Logistic(weights, threshold=0.0, alpha=3.0)#
\(p(d) = \log(1 + e^{\alpha(d - \text{threshold})})\)
- class pymde.penalties.Log1p(weights, exponent=1.5)#
\(p(d) = \log(1 + d^{\text{exponent}})\)
- class pymde.penalties.Log(weights, exponent=1.0)#
\(p(d) = \log(1 - \exp(-d^\text{exponent}))\)
- class pymde.penalties.InvPower(weights, exponent=1)#
\(p(d) = 1/d^\text{exponent}\)
- class pymde.penalties.LogRatio(weights, exponent=2)#
\(p(d) = \log\left(\frac{d^\text{exponent}}{1 + d^{\text{exponent}}}\right)\)
- class pymde.penalties.PushAndPull(weights, attractive_penalty=<class 'pymde.functions.penalties.Log1p'>, repulsive_penalty=<class 'pymde.functions.penalties.LogRatio'>)#
Combine an attractive and repulsive penalty.
\[\begin{split}f_k(d) = \begin{cases} w_k p_{\text{attractive}}(d_k) & w_k > 0 \\ w_k p_{\text{repulsive}}(d_k) & w_k < 0 \\ \end{cases}\end{split}\]
Losses#
Losses: distortion functions derived from original deviations.
A vector distortion function \(f : \mathbf{R}^{p} \to \mathbf{R}^p\) derived from original deviations has component functions
where \(\ell\) is a loss function, \(\delta_k\) is a nonnegative deviation or dissimilarity score, \(d_k\) is an embedding distance,
When an MDE problem calls a distortion function, \(d_k\) is the Euclidean distance between the items paired by the k-th edge, so \(\delta_k\) should be the original deviation associated with the k-th edge, and \(f_k(d_k)\) is the distortion associated with the edge.
The deviations can be interpreted as targets for the embedding distances: the loss function is 0 when \(d_k = \delta_k\), and positive otherwise. So a deviation \(\delta_k`\) of 0 means that the items in the k-th edge are the same, and the larger the deviation, the more dissimilar the items are.
Distortion functions are created in a vectorized or elementwise fashion. The constructor takes a sequence (torch.Tensor) of deviations (target distances), returning a callable object. The object takes a sequence of distances of the same length as the weights, and returns a sequence of distortions, one for each distance.
Some examples of losses inlcude:
pymde.losses.Absolute
pymde.losses.Quadratic
pymde.losses.SoftFractional
Example.
deviations = torch.tensor([1., 2., 3.])
f = pymde.losses.Quadratic(deviations)
distances = torch.tensor([2., 5., 4.])
distortions = f(distances)
# the distortions are (2 - 1)**2 == 1, (5 - 2)**2 == 9, (4 - 3)**2 == 1
print(distortions)
prints
torch.tensor([1., 9., 1.])
- class pymde.losses.Absolute(deviations)#
\(\ell(d, \delta) = |d - \delta|\)
- class pymde.losses.Quadratic(deviations)#
\(\ell(d, \delta) = (d - \delta)^2\)
- class pymde.losses.Cubic(deviations)#
\(\ell(d, \delta) = (d - \delta)^3\)
- class pymde.losses.Power(deviations, exponent)#
\(\ell(d, \delta) = (d - \delta)^{\text{exponent}}\)
- class pymde.losses.WeightedQuadratic(deviations, weights=None)#
\(\ell(d, \delta) = \frac{1}{\delta^2} (d - \delta)^2\)
If
weights
is not None, the coefficient then the coefficient \(1/\delta^2\) is replaced by the weights.
- class pymde.losses.Fractional(deviations)#
\(\ell(d, \delta) = \max(\delta / d, d / \delta)\)
- class pymde.losses.SoftFractional(deviations, gamma=10.0)#
\(\ell(d, \delta) = \frac{1}{\gamma}\log\left( \frac{\exp(\gamma \delta/d) + \exp(\gamma d/\delta)}{2\exp(\gamma)} \right)\)
The parameter
gamma
controls how close this loss is to the fractional loss. The largergamma
is, the closer to the fractional loss.
Constraints#
- class pymde.constraints.Constraint#
A generic constraint.
To create a custom constraint, create a subclass of this class, and implement its abstract methods.
- abstract initialization(n_items: int, embedding_dim: int, device=None) Tensor #
Return a random embedding in the constraint set.
- Parameters:
n_items (int) – The number of items.
embedding_dim (int) – The embedding dimension.
device (str) – Device on which to store the returned embedding.
- Returns:
a tensor of shape
(n_items, embedding_dim)
satisfying the constraints.- Return type:
torch.Tensor
- abstract name() str #
The name of the constraint.
- abstract project_onto_constraint(Z: Tensor, inplace=True) Tensor #
Project
Z
onto the constraint set.Returns a projection of
Z
onto the constraint set.- Parameters:
Z (torch.Tensor) – The point to project.
inplace (bool) – If True, stores the projection in
Z
.
- Return type:
The projection of
Z
onto the constraints.
- abstract project_onto_tangent_space(X: Tensor, Z: Tensor, inplace=True) Tensor #
Project
Z
onto the tangent space of the constraint set atX
.Returns the Euclidean projection of
Z
onto the tangent space of the constraint set atX
(whereX
is some matrix satisfying the constraints).X
andZ
should have the same shape.- Parameters:
X (torch.Tensor) – A point satisfying the constraints.
Z (torch.Tensor) – The point to project.
inplace (bool) – If True, stores the projection in
Z
.
- Returns:
The projection of
Z
onto the tangent space of the constraintset at
X
.
- pymde.Centered()#
Centering constraint.
This function returns a centering constraint, which requires the embedding vectors to be centered around 0.
- pymde.Standardized()#
Standardization constraint.
This function returns a standardization constraint, which constrains an embedding \(X\) to be centered (the rows have mean zero) and to satisfy \((1/n) X^T X = I\).
- class pymde.Anchored(anchors, values)#
Anchor some vectors to specific values.
- __init__(anchors, values)#
Constructs an anchor constraint, in which some embedding vectors (the anchors) are fixed to specific values.
- Parameters:
anchors (torch.Tensor, shape (n_anchors)) – a Tensor in which each entry gives the index of an anchored vertex
values (torch.Tensor, shape (n_anchors, embedding_dim)) – a Tensor which gives the value to which each anchor should be fixed
Preprocessing#
- class pymde.Graph(adjacency_matrix)#
A weighted graph.
This class represents a weighted graph. It is backed by a scipy.sparse adjacency matrix, and can be constructed from either a dense distance matrix, a sparse adjacency matrix, or torch.Tensor’s of edges and weights (using the from_edges static method).
It is an error for the graph to contain self-edges (i.e., non-zero values on the diagonal).
This class is accepted as an argument to various preprocessing functions.
- property adjacency_matrix#
The scipy.sparse adjacency matrix backing the graph.
- property distances#
The distances associated with each edge.
- draw(embedding_dim=2, standardized=False, device='cpu', verbose=False)#
Draw a graph in the Cartesian plane.
This method does some basic preprocessing, constructs an MDE problem that is often suitable for drawing graphs, and computes/returns an embedding by approximately solving the MDE problem.
- Parameters:
embedding_dim (int) – The number of dimemsions, 1, 2, or 3.
standardized (bool) – Whether to impose a standardization constraint.
device (str) – Device on which to compute/store embedding, ‘cpu’ or ‘cuda’.
verbose (bool) – Whether to print verbose output.
- Returns:
The embedding, of shape
(n_items, embedding_dim)
- Return type:
torch.Tensor
- property edges#
A torch.Tensor of the edges in the graph.
- static from_edges(edges, weights=None, n_items=None)#
Construct a graph from edges and weights.
- Parameters:
edges (torch.Tensor) – Tensor of edges, of shape
(n_edges, 2)
, with each edge represented by a row, i.e. by two integers.weights (torch.Tensor, optional) – Tensor of weights, of shape
(n_edges,)
, with each weight a float.n_items (int, optional) – The number of items in the graph; if
None
, this is taken to be the maximum value inedges
plus 1.
- Returns:
A Graph object representing
edges
andweights
.- Return type:
- property n_all_edges#
n_items choose 2.
- property n_edges#
The number of edges in the graph.
- property n_items#
The number of items in the graph.
- neighbor_distances(node) ndarray #
The distances associated with the edges connected to
node
.
- neighbors(node: int) ndarray #
The indices of the neighbors of
node
.
- pymde.preprocess.k_nearest_neighbors(data, k, max_distance=None, verbose=False)#
Compute k-nearest neighbors, given data matrix or graph.
This function computes a k-nearest neighbor graph, given either a data matrix or an original graph.
When the input is a data matrix, each row is treated as the feature vector of an item, and the Euclidean distance is used to judge how close two items are to each other.
When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.
In the returned graph, if i is a neighbor of j and j a neighbor of i, then the weight w_{ij} will be +2; if only one is a neighbor of the other, then w_{ij} will be +1; if neither are neighbors of each other, then (i, j) will not be in the graph.
- Parameters:
data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape
(n_items, n_features)
, or apymde.Graph
k (int) – The number of nearest neighbors per item
max_distance (float (optional)) – If not
None
, neighborhoods are restricted to have a radius no greater than max_distance.verbose (bool) – If
True
, print verbose output.
- Returns:
The k-nearest neighbor graph. Access the weights with
graph.weights
, and the edges withgraph.edges
- Return type:
- pymde.preprocess.dissimilar_edges(n_items, similar_edges, num_edges=None, seed=None)#
Sample edges not in
similar_edges
.Given a number of items, and a
torch.Tensor
containing pairs of items known to be similar, this function samples (uniformly at random) some edges in the complement ofsimilar_edges
. The returned number of edges will be approximately equal to (and no greater than) the number of edges insimilar_edges
(ornum_edges
, if provided).- Parameters:
n_items (int) – The number of items.
similar_edges (torch.Tensor) – Edges to exclude when sampling.
num_edges (int (optional)) – Number of edges to sample, defaults to approximately the same number as
similar_edges
.seed (int (optional)) – Random seed for the sampling.
- Returns:
Edges not in
similar_edges
.- Return type:
torch.Tensor
- pymde.preprocess.distances(data, retain_fraction=1.0, verbose=False)#
Compute distances, given data matrix or graph.
This function computes distances between some pairs of items, given either a data matrix or a
pymde.Graph
instance.When the input is a data matrix, each row is treated as the feature vector of an item.
When the input is a graph, each node is an item, and the distance between two items is taken to be the length of the shortest-path between them.
The
retain_fraction
keyword argument can be used to compute only a fraction of the distances. This can be useful when there are many items, in which case storing all the distances may be intractable.- Parameters:
data (torch.Tensor, np.ndarray, scipy.sparse matrix, or pymde.Graph) – A data matrix, shape
(n_items, n_features)
, or apymde.Graph
retain_fraction (float, optional) – A float between 0 and 1, specifying the fraction of all
(n_items choose 2)
to compute. For example, ifretain_fraction
is 0.1, only 10 percent of the edges will be stored.verbose – If
True
, print verbose output.
- Returns:
A graph object holding the distances and corresponding edges. Access the distances with
graph.distances
, and the edges withgraph.edges
.- Return type:
Classical embeddings#
- pymde.pca(Y, embedding_dim)#
PCA embedding of a data matrix.
- Parameters:
Y (torch.Tensor(shape=(n, k))) – data matrix, with n >= k
embedding_dim (int) – the number of eigenvectors to retrieve, the embedding dimension; must be <= k
- Returns:
The top embedding_dim eigenvectors of YY^T, scaled by sqrt(n)
- Return type:
torch.Tensor(shape=(n, embedding_dim))
- pymde.laplacian_embedding(data, embedding_dim=2, n_neighbors=None, max_distance=None, init='quadratic', device='cpu', verbose=False) MDE #
Constructs an MDE problem for computing a Laplacian embedding.
The embedding preserves the nearest neighbors of each item, using quadratic distortion functions and a standardization constraint.
- Parameters:
data ({torch.Tensor, numpy.ndarray, scipy.sparse matrix} or pymde.Graph) – The original data, a data matrix of shape
(n_items, n_features)
or a graph. Neighbors are computed using Euclidean distance if the data is a matrix, or the shortest-path metric if the data is a graph.embedding_dim (int) – The embedding dimension. Use 2 or 3 for visualization.
n_neighbors (int (optional)) – The number of nearest neighbors to compute for each row (item) of
data
. A sensible value is chosen by default, depending on the number of items.max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than
max_distance
.init (str) – Initialization strategy; ‘quadratic’ or ‘random’. If the quadratic initialization takes too much time, try a random initialization.
device (str (optional)) – Device for the embedding (eg, ‘cpu’, ‘cuda’).
verbose (bool) – If
True
, print verbose output.
- Returns:
A
pymde.MDE
object, based on the original data.- Return type:
- pymde.quadratic.spectral(n_items, embedding_dim, edges, weights, cg=False, max_iter=40, device='cpu')#
Compute a spectral embedding
Solves the quadratic MDE problem
\[\begin{split}\begin{array}{ll} \mbox{minimize} & \sum_{(i, j) in \text{edges}} w_{ij} d_{ij}^2 \\ \mbox{subject to} & (1/n) X^T X = I, \quad d_{ij} = |x_i - x_j|_2. \end{array}\end{split}\]By default, the problem is solved using a Lanczos method. If cg=True, LOBPCG is used; LOBPCG is warm-started by running a projected quasi-newton method for a small number of iterations. Use cg=True when the number of edges is very large, and when an approximate solution is satisfactory (the Lanczos method typically gives much more accurate solutions, but can be slower).
- Parameters:
n_items (int) – The number of items
embedding_dim (int) – The embedding dimension
edges (torch.Tensor(shape=(n_edges, 2))) – A list of edges (i, j), 0 <= i < j < n_items
weights (torch.Tensor(shape=(n_edges,))) – A list of nonnegative weights associated with each edge
cg (bool) – If True, uses a preconditioned CG method to find the embedding, which requires that the Laplacian matrix plus the identity is positive definite; otherwise, a Lanczos method is used. Use True when the Lanczos method is too slow (which might happen when the number of edges is very large).
max_iter (int) – max iteration count for the CG method
device (str (optional)) – The device on which to allocate the embedding
- Returns:
A spectral embedding, projected onto the standardization constraint
- Return type:
torch.Tensor(shape=(n_items, embedding_dim))
Utilities#
- pymde.all_edges(n)#
Return a tensor of all (n choose 2) edges
Constructs all possible edges among n items. For example, if
n
is 4, the return value will be equal totorch.Tensor([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]])
- pymde.align(source, target)#
Align an source embedding to a target embedding
Align the source embedding to another target embedding, via rotation. The alignment is done by solving an orthogonal Procrustes problem.
- Parameters:
source (torch.Tensor) – The embedding to be aligned.
target (torch.Tensor) – The target embedding, of the same shape as source.
- Returns:
The rotation of source best aligned to the target.
- Return type:
torch.Tensor
- pymde.center(X)#
Center an embedding
Returns a new embedding, equal to the given embedding minus the mean of its rows.
- pymde.rotate(X, degrees)#
Rotate a 2 or 3D embedding
Rotates a 2/3D embedding by
degrees
. IfX
is a 2D embedding,degrees
should be a scalar; if it is 3D,degrees
should be a length-3torch.Tensor
, with one angle for each axis (the embedding will be rotated along the x-axis first, then the y-axis, then the z-axis).- Parameters:
X (torch.Tensor) – The embedding to rotate.
degrees (torch.Tensor) – The angles of rotation.
- Returns:
The rotated embedding
- Return type:
torch.Tensor
- pymde.seed(seed: int)#
Set the random seed
This function sets the random seed that PyMDE uses in various preprocessing methods.
- pymde.plot(X, color_by=None, color_map='Spectral', colors=None, edges=None, axis_limits=None, background_color=None, marker_size=1.0, figsize_inches=(8.0, 8.0), savepath=None)#
Plot an embedding, in one, two, or three dimensions.
This function plots embeddings. The input embedding’s dimension should be at most 3.
The embedding is visualized as a scatter plot. The points can optionally be colored according to categorical or continuous values, or according to a pre-defined sequence of colors. Additionally, edges can optionally be superimposed.
- Parameters:
X (array-like) – The embedding to plot, of shape
(n_items, embedding_dim)
. The second dimension should be 1, 2, or 3.color_by (array-like, optional) –
A sequence of values, one for each item, which should be used to color each embedding vector. These values may either be categorical or continuous. For example, if
n_items
is 4,np.ndarray(['dog', 'cat', 'zebra', 'cat']) np.ndarray([0, 1, 1, 2] np.ndarray([0.1, 0.5, 0.31, 0.99]
are all acceptable. The first two are treated as categorical, the third is continuous. A finite number of colors is used when the values are categorical, while a spectrum of colors is used when the values are continuous.
color_map (str or matplotlib colormap instance) – Color map to use when resolving
color_by
to colors; ignored whencolor_by
is None.colors (array-like, optional) –
A sequence of colors, one for each item, specifying the exact color each item should be colored. Each row must represent an RGBA value.
Only one of
color_by
andcolors
should be non-None.edges (array-like, optional) – List of edges to superimpose over the scatter plot, shape
(any, 2)
axis_limits (tuple, optional) – tuple
(limit_low, limit_high)
of axis limits, applied to both the x and y axis.background_color (str, optional) – color of background
marker_size (float, optional) – size of each point in the scatter plot
figsize_inches (tuple) – size of figures in inches:
(width_inches, height_inches)
savepath (str, optional) – path to save the plot.
- Returns:
Axis on which the embedding is plotted.
- Return type:
matplotlib.Axes
Datasets#
- class pymde.datasets.Dataset(data, attributes, other_data=None, metadata=None)#
Represents a dataset.
Each instance has two main attrs:
- data:
either a data matrix, or an instance of pymde.Graph; the data to be embedded.
- attributes:
a dictionary whose values give attributes associated with the items in the data, such as the digit labels in MNIST.
Other data that the dataset might carry is in the (dict) attribute other_data. Metadata about where the data came from is stored in metadata.
- pymde.datasets.MNIST(root='./', download=True) Dataset #
MNIST dataset (LeCun, et al.).
The MNIST dataset contains 70,000, 28x28 images of handwritten digits.
data
:torch.Tensor
with 70,000 rows, each of length 784 (representing the pixels in the image).attributes
dict: the keydigits
holds an array in which each entry gives the digit depicted in the corresponding row ofdata
.
- pymde.datasets.FashionMNIST(root='./', download=True) Dataset #
Fashion-MNIST dataset (Xiao, et al.).
The Fashion-MNIST dataset contains 70,000, 28x28 images of Zalando fashion articles.
data
:torch.Tensor
with 70,000 rows, each of length 784 (representing the pixels in the image).attributes
dict: the keyclass
holds an array in which each entry gives the fashion article in the corresponding row ofdata
.
- pymde.datasets.google_scholar(root='./', download=True, full=False) Dataset #
Google Scholar dataset (Agrawal, et al.).
The Google Scholar dataset contains an academic coauthorship graph: the nodes are authors, and two authors are connected by an edge if either author listed the other as a coauthor on Google Scholar. (Note that if two authors collaborated on a paper, but neither has listed the other as a coauthor on their Scholar profiles, then they will not be connected by an edge).
If
full
is False, obtains a small version of the dataset, on roughly 40,000 authors, each with h-index at least 50. Iffull
is True, obtains the whole dataset, on roughly 600,000 authors. The full dataset is roughly 1GB in size.data
: apymde.Graph
, with nodes representing authorsattributes
: thecoauthors
key has an array holding the number of coauthors of each other, normalized to be a percentile.other_data
: holds a dataframe describing the dataset, keyed bydataframe
.
- pymde.datasets.google_scholar_interests(root='./', download=True) Dataset #
Google Scholar academic interests
This dataset contains a cooccurrence matrix of the 5000 most popular academic interests listed by authors on Google Scholar
data
: a cooccurrence matrix, counting the number of times two two interests appeared among a single author’s listed interests, withdata[i, j]
giving the cooccurrence count ofinterests[i]
andinterests[j]
attributes
: one key,interests
, whereinterests[i]
is the interest corresponding to the ith row/column ofdata
.
- pymde.datasets.covid19_scrna_wilk(root='./', download=True) Dataset #
COVID-19 scRNA data (Wilk, et al.).
The COVID-19 dataset includes a PCA embedding of single-cell mRNA transcriptomes of roughly 40,000 cells, taken from some patients with COVID-19 infections and from healthy controls.
Instructions on how to obtain the full dataset are available in the Wilk et al. paper: https://www.nature.com/articles/s41591-020-0944-y,
data
: the PCA embeddingattributes
: two keys,cell_type
andhealth_status
.
- pymde.datasets.population_genetics(root='./', download=True) Dataset #
Population genetics dataset (Nelson, et al)
The population genetics dataset includes a PCA embedding (in R^20) of single nucleotide polymorphism data associated with 1,387 individuals thought to be of European descent. (The data is from the Population Reference Sample project by Nelson, et al.)
It also contains a “corrupted” version of the data, in which 154 additional points have been injected; the first 10 coordinates of these synthetic points are generated using a discrete uniform distribution on {0, 1, 2}, and the last 10 are generated using a discrete uniform distritubtion on {1/12, /18}.
A study of Novembre et al (2008) showed that a PCA embedding in R^2 roughly resembles the map of Europe, suggesting that the genes encode geographical information. But PCA does not produce interesting visualizations of the corrupted data. If distortion functions are chosen to be robust (eg, using the Log1p or Huber attractive penalties), we can create embeddings that preserve the geographical structure, while placing the synthetic points to the side, in their own cluser.
data
: the PCA embedding of the clean genetic data, in R^20corrupted_data
: the corrupted data, in R^20attributes
: two keys,clean_colors
andcorrupted_colors
.
- pymde.datasets.counties(root='./', download=True) Dataset #
US counties (2013-2017 ACS 5-Year Estimates)
This dataset contains 34 demographic features for each of the 3,220 US counties. The features were collected by the 2013-2017 ACS 5-Year Estimates longitudinal survey, run by the US Census Bureau.
data
: the PCA embedding of the clean genetic data, in R^20county_dataframe
: the raw ACS datavoting_dataframe
: the raw 2016 voting dataattributes
: one key,democratic_fraction
, the fraction of of voters who voted Democratic in each county in the 2016 presidential election