Datasets
The pymde.datasets module provides functions that download and return some datasets. You can use these datasets to experiment with custom MDE problems, or while learning how to use PyMDE.
Each function returns a pymde.datasets.Dataset object. The data member of this object holds the raw data. The attributes member is a dictionary whose values are (held-out) attributes associated with the items; you can use these attributes to color your embeddings, when visualizing them. Other data related to the dataset is in the other_data dict, and metadata about the dataset (eg, its authors) is available in metadata.
The first time one of these functions is called, it will download the dataset and cache it locally, in the current directory (change the directory with the root keyword argument). Subsequent calls will use the cached data.
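As a rough illustration of the members described above, the sketch below summarizes any object that exposes data, attributes, other_data, and metadata. The summarize and load_mnist helpers and the ./pymde_data directory are hypothetical names chosen for this example, and the stand-in object only mimics a Dataset so that the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def summarize(dataset):
    """Return a small dict describing the members of a Dataset-like object."""
    return {
        "data_type": type(dataset.data).__name__,
        "attribute_keys": sorted(dataset.attributes),
        "other_data_keys": sorted(dataset.other_data),
        "metadata": dataset.metadata,
    }


def load_mnist(root="./pymde_data"):
    """Fetch MNIST; the first call downloads it, later calls reuse the cache under root."""
    import pymde  # imported lazily so summarize() can be tried without PyMDE installed

    return pymde.datasets.MNIST(root=root)


# Stand-in with the same member names as a Dataset (hypothetical values, no download).
stand_in = SimpleNamespace(
    data=[[0.0, 1.0], [1.0, 0.0]],
    attributes={"digits": [3, 7]},
    other_data={},
    metadata={"authors": "hypothetical"},
)
summary = summarize(stand_in)
print(summary)
```

With PyMDE installed and a network connection available, summarize(load_mnist()) would report the same fields for the real MNIST dataset.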
PyMDE currently provides the datasets below. If you would like to add an additional dataset, please reach out to us on GitHub.
MNIST
- pymde.datasets.MNIST(root='./', download=True) -> pymde.datasets.Dataset
MNIST dataset (LeCun, et al.).
The MNIST dataset contains 70,000 28x28 images of handwritten digits.
data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).
attributes dict: the key digits holds an array in which each entry gives the digit depicted in the corresponding row of data.
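Since each row of data is a flattened 28x28 image, a row has to be reshaped before it can be displayed. The sketch below does this for a plain Python sequence. It assumes the pixels are stored in row-major order, which is the usual MNIST convention but is not stated above, and row_to_image is a hypothetical helper rather than part of PyMDE.

```python
def row_to_image(row, side=28):
    """Split a flat sequence of side*side pixels into a list of `side` image rows."""
    if len(row) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(row)}")
    # Assumption: row-major layout, so pixels [r*side, (r+1)*side) form image row r.
    return [list(row[r * side:(r + 1) * side]) for r in range(side)]


# Synthetic stand-in for one row of data: pixel k simply holds the value k.
row = list(range(784))
image = row_to_image(row)
print(len(image), len(image[0]))  # 28 28
print(image[1][0])  # 28, the first pixel of the second image row
```

For a row taken from the torch.Tensor itself, row.reshape(28, 28) performs the same rearrangement.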
Fashion MNIST
- pymde.datasets.FashionMNIST(root='./', download=True) -> pymde.datasets.Dataset
Fashion-MNIST dataset (Xiao, et al.).
The Fashion-MNIST dataset contains 70,000 28x28 images of Zalando fashion articles.
data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).
attributes dict: the key class holds an array in which each entry gives the fashion article in the corresponding row of data.
Google Scholar
- pymde.datasets.google_scholar(root='./', download=True, full=False) -> pymde.datasets.Dataset
Google Scholar dataset (Agrawal, et al.).
The Google Scholar dataset contains an academic coauthorship graph: the nodes are authors, and two authors are connected by an edge if either author listed the other as a coauthor on Google Scholar. (Note that if two authors collaborated on a paper, but neither has listed the other as a coauthor on their Scholar profiles, then they will not be connected by an edge.)
If full is False, obtains a small version of the dataset, on roughly 40,000 authors, each with h-index at least 50. If full is True, obtains the whole dataset, on roughly 600,000 authors. The full dataset is roughly 1GB in size.
data: a pymde.Graph, with nodes representing authors
attributes: the coauthors key has an array holding the number of coauthors of each author, normalized to be a percentile.
other_data: holds a dataframe describing the dataset, keyed by dataframe.
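The edge rule described above (an edge exists if either author lists the other) amounts to symmetrizing a directed "lists as coauthor" relation. The sketch below shows that rule on a toy example. The coauthorship_edges helper and the author names are hypothetical, and this is not the code used to build the actual dataset.

```python
def coauthorship_edges(listed):
    """Build undirected edges from a mapping author -> set of listed coauthors."""
    edges = set()
    for author, coauthors in listed.items():
        for other in coauthors:
            if other != author:
                # Sorting the pair makes (a, b) and (b, a) the same undirected edge.
                edges.add(tuple(sorted((author, other))))
    return sorted(edges)


# Toy profiles: ada lists ben, cy lists ada, and dee lists nobody.
listed = {"ada": {"ben"}, "ben": set(), "cy": {"ada"}, "dee": set()}
edges = coauthorship_edges(listed)
print(edges)  # [('ada', 'ben'), ('ada', 'cy')]
```

Here ben and cy are not connected, even if they had written a paper together, because neither lists the other.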
Academic interests
- pymde.datasets.google_scholar_interests(root='./', download=True) -> pymde.datasets.Dataset
Google Scholar academic interests
This dataset contains a cooccurrence matrix of the 5000 most popular academic interests listed by authors on Google Scholar.
data: a cooccurrence matrix, counting the number of times two interests appeared among a single author’s listed interests, with data[i, j] giving the cooccurrence count of interests[i] and interests[j]
attributes: one key, interests, where interests[i] is the interest corresponding to the ith row/column of data.
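To make the definition of the cooccurrence matrix concrete, the sketch below counts cooccurrences for a handful of made-up author profiles. The cooccurrence helper and the sample interests are hypothetical. Leaving the diagonal at zero and ignoring interests outside the vocabulary are assumptions of this example, not documented properties of the dataset.

```python
from itertools import combinations


def cooccurrence(author_interests, interests):
    """Count, for each pair of interests, the authors who list both of them."""
    index = {name: i for i, name in enumerate(interests)}
    n = len(interests)
    counts = [[0] * n for _ in range(n)]
    for listed in author_interests:
        # Deduplicate and drop interests that are not in the vocabulary.
        present = sorted({index[name] for name in listed if name in index})
        for i, j in combinations(present, 2):
            counts[i][j] += 1
            counts[j][i] += 1  # keep the matrix symmetric
    return counts


interests = ["machine learning", "optimization", "statistics"]
profiles = [
    ["machine learning", "optimization"],
    ["machine learning", "statistics", "optimization"],
    ["statistics"],
]
counts = cooccurrence(profiles, interests)
print(counts)  # [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
```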
scRNA transcriptomes from COVID-19 patients
- pymde.datasets.covid19_scrna_wilk(root='./', download=True) -> pymde.datasets.Dataset
COVID-19 scRNA data (Wilk, et al.).
The COVID-19 dataset includes a PCA embedding of single-cell mRNA transcriptomes of roughly 40,000 cells, taken from some patients with COVID-19 infections and from healthy controls.
Instructions on how to obtain the full dataset are available in the Wilk et al. paper (https://www.nature.com/articles/s41591-020-0944-y).
data: the PCA embedding
attributes: two keys, cell_type and health_status.
Population genetics
- pymde.datasets.population_genetics(root='./', download=True) -> pymde.datasets.Dataset
Population genetics dataset (Nelson, et al.)
The population genetics dataset includes a PCA embedding (in R^20) of single nucleotide polymorphism data associated with 1,387 individuals thought to be of European descent. (The data is from the Population Reference Sample project by Nelson, et al.)
It also contains a “corrupted” version of the data, in which 154 additional points have been injected; the first 10 coordinates of these synthetic points are generated using a discrete uniform distribution on {0, 1, 2}, and the last 10 are generated using a discrete uniform distribution on {1/12, /18}.
A study of Novembre et al (2008) showed that a PCA embedding in R^2 roughly resembles the map of Europe, suggesting that the genes encode geographical information. But PCA does not produce interesting visualizations of the corrupted data. If distortion functions are chosen to be robust (eg, using the Log1p or Huber attractive penalties), we can create embeddings that preserve the geographical structure, while placing the synthetic points to the side, in their own cluster.
data: the PCA embedding of the clean genetic data, in R^20
corrupted_data: the corrupted data, in R^20
attributes: two keys, clean_colors and corrupted_colors.
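One way to see why robust attractive penalties can help with the injected points is to compare how quickly different penalties grow with distance. The functions below are textbook forms of the quadratic, Huber, and log-one-plus penalties, written here as an assumption for illustration. They are not copied from PyMDE and may differ from its definitions in scaling or defaults, and the threshold and exponent values are arbitrary.

```python
import math


def quadratic(d):
    """Non-robust baseline: grows with the square of the distance."""
    return d ** 2


def huber(d, threshold=1.0):
    """Quadratic up to the threshold, then linear, so large distances cost less."""
    d = abs(d)
    if d <= threshold:
        return d ** 2
    return threshold * (2 * d - threshold)


def log1p_penalty(d, exponent=1.5):
    """Grows only logarithmically for large distances."""
    return math.log1p(abs(d) ** exponent)


for d in (0.5, 1.0, 5.0, 10.0):
    print(f"d={d:5.1f}  quadratic={quadratic(d):7.2f}  "
          f"huber={huber(d):6.2f}  log1p={log1p_penalty(d):5.2f}")
```

At a distance of 10, the quadratic penalty is 100 while this Huber form gives 19, so a few far-away pairs influence the embedding much less under the robust penalties.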
US counties
- pymde.datasets.counties(root='./', download=True) -> pymde.datasets.Dataset
US counties (2013-2017 ACS 5-Year Estimates)
This dataset contains 34 demographic features for each of the 3,220 US counties. The features were collected by the 2013-2017 ACS 5-Year Estimates longitudinal survey, run by the US Census Bureau.
data: the demographic features for each county
county_dataframe: the raw ACS data
voting_dataframe: the raw 2016 voting data
attributes: one key, democratic_fraction, the fraction of voters who voted Democratic in each county in the 2016 presidential election