Datasets

The pymde.datasets module provides functions that download and return some datasets. You can use these datasets to experiment with custom MDE problems, or while learning how to use PyMDE.

Each function returns a pymde.datasets.Dataset object. The data member of this object holds the raw data. The attributes member is a dictionary whose values are (held-out) attributes associated with the items; you can use these attributes to color your embeddings when visualizing them. Other data related to the dataset is in the other_data dict, and metadata about the dataset (e.g., its authors) is available in metadata.

The first time one of these functions is called, it will download the dataset and cache it locally, in the current directory (change the directory with the root keyword argument). Subsequent calls will use the cached data.
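A minimal sketch of the workflow described above, using MNIST as the example: load a dataset, read its data and attributes members, and color an embedding by a held-out attribute. The function name embed_mnist is hypothetical, and the import is placed inside the function only so that defining it triggers neither an import nor a download.

```python
def embed_mnist(root="./"):
    """Load MNIST (downloading on first use) and embed it, colored by digit.

    `embed_mnist` is a hypothetical helper name used for illustration.
    """
    import pymde  # imported lazily; requires PyMDE to be installed

    # The first call downloads into `root`; later calls reuse the cached copy.
    mnist = pymde.datasets.MNIST(root=root, download=True)

    images = mnist.data                  # raw data: one row of 784 pixels per image
    digits = mnist.attributes["digits"]  # held-out attribute, one entry per row

    # Embed the images, then color each point by the digit it depicts.
    embedding = pymde.preserve_neighbors(images).embed()
    pymde.plot(embedding, color_by=digits)
    return embedding
```

Calling embed_mnist(root="some/other/dir") would cache the download in that directory instead of the current one.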

PyMDE currently provides the datasets below. If you would like to add an additional dataset, please reach out to us on GitHub.

MNIST

pymde.datasets.MNIST(root='./', download=True) → pymde.datasets.Dataset

MNIST dataset (LeCun, et al.).

The MNIST dataset contains 70,000 images of handwritten digits, each 28x28 pixels.

  • data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).

  • attributes dict: the key digits holds an array in which each entry gives the digit depicted in the corresponding row of data.

Fashion MNIST

pymde.datasets.FashionMNIST(root='./', download=True) → pymde.datasets.Dataset

Fashion-MNIST dataset (Xiao, et al.).

The Fashion-MNIST dataset contains 70,000 images of Zalando fashion articles, each 28x28 pixels.

  • data: torch.Tensor with 70,000 rows, each of length 784 (representing the pixels in the image).

  • attributes dict: the key class holds an array in which each entry gives the class of the fashion article depicted in the corresponding row of data.

Google Scholar

pymde.datasets.google_scholar(root='./', download=True, full=False) → pymde.datasets.Dataset

Google Scholar dataset (Agrawal, et al.).

The Google Scholar dataset contains an academic coauthorship graph: the nodes are authors, and two authors are connected by an edge if either author listed the other as a coauthor on Google Scholar. (Note that if two authors collaborated on a paper, but neither has listed the other as a coauthor on their Scholar profiles, then they will not be connected by an edge.)
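The edge rule can be illustrated with a small pure-Python sketch. It is not how PyMDE builds the graph; the function name coauthor_edges and the toy listings are hypothetical, and the only thing taken from the description is that a one-sided listing is enough to create an edge.

```python
def coauthor_edges(listed):
    """Build undirected edges from one-sided coauthor listings.

    `listed` maps each author to the set of authors they list as coauthors.
    An edge {a, b} exists if a lists b or b lists a.
    """
    edges = set()
    for author, coauthors in listed.items():
        for other in coauthors:
            if other != author:
                # Sort the pair so (a, b) and (b, a) collapse to one edge.
                edges.add(tuple(sorted((author, other))))
    return sorted(edges)


# Toy listings (hypothetical): "ann" lists "bob", but "bob" lists no one.
# "cy" and "dee" wrote a paper together, but neither lists the other.
listed = {"ann": {"bob"}, "bob": set(), "cy": set(), "dee": set()}
edges = coauthor_edges(listed)  # [("ann", "bob")]
```

Here "ann" and "bob" are connected even though only one of them lists the other, while "cy" and "dee" stay unconnected.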

If full is False, obtains a small version of the dataset, on roughly 40,000 authors, each with h-index at least 50. If full is True, obtains the whole dataset, on roughly 600,000 authors. The full dataset is roughly 1GB in size.

  • data: a pymde.Graph, with nodes representing authors

  • attributes: the coauthors key has an array holding the number of coauthors of each author, normalized to be a percentile.

  • other_data: holds a dataframe describing the dataset, keyed by dataframe.
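To make "normalized to be a percentile" concrete, here is a sketch of one common definition: an author's percentile is the share of authors whose coauthor count is no larger than their own. This definition is an assumption; the exact normalization PyMDE uses is not stated here, and the function name and counts are hypothetical.

```python
from bisect import bisect_right


def percentiles(counts):
    """Map each count to the percentage of counts that are <= it.

    Assumed definition for illustration; not necessarily PyMDE's.
    """
    ordered = sorted(counts)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, c) / n for c in counts]


# Hypothetical coauthor counts for four authors.
coauthor_counts = [3, 10, 1, 10]
pct = percentiles(coauthor_counts)  # [50.0, 100.0, 25.0, 100.0]
```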

Academic interests

pymde.datasets.google_scholar_interests(root='./', download=True) → pymde.datasets.Dataset

Google Scholar academic interests

This dataset contains a cooccurrence matrix of the 5,000 most popular academic interests listed by authors on Google Scholar.

  • data: a cooccurrence matrix, counting the number of times two interests appeared among a single author’s listed interests, with data[i, j] giving the cooccurrence count of interests[i] and interests[j]

  • attributes: one key, interests, where interests[i] is the interest corresponding to the ith row/column of data.
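The structure of such a matrix can be sketched on a toy example. The function name cooccurrence and the sample interests are hypothetical, and leaving the diagonal at zero is an assumption; the real dataset may treat the diagonal differently.

```python
from itertools import combinations


def cooccurrence(author_interests, interests):
    """Count, for each pair of interests, the authors who list both.

    Diagonal entries are left at zero (an assumption for this sketch).
    """
    index = {name: i for i, name in enumerate(interests)}
    n = len(interests)
    counts = [[0] * n for _ in range(n)]
    for listed in author_interests:
        # Deduplicate and ignore interests outside the chosen vocabulary.
        known = sorted({index[name] for name in listed if name in index})
        for i, j in combinations(known, 2):
            counts[i][j] += 1
            counts[j][i] += 1  # keep the matrix symmetric
    return counts


interests = ["machine learning", "optimization", "statistics"]
author_interests = [
    ["machine learning", "optimization"],
    ["machine learning", "statistics", "optimization"],
    ["statistics"],
]
data = cooccurrence(author_interests, interests)
# data[0][1] == 2: two authors list both "machine learning" and "optimization".
```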

scRNA transcriptomes from COVID-19 patients

pymde.datasets.covid19_scrna_wilk(root='./', download=True) → pymde.datasets.Dataset

COVID-19 scRNA data (Wilk, et al.).

The COVID-19 dataset includes a PCA embedding of single-cell mRNA transcriptomes of roughly 40,000 cells, taken from some patients with COVID-19 infections and from healthy controls.

Instructions on how to obtain the full dataset are available in the Wilk et al. paper: https://www.nature.com/articles/s41591-020-0944-y

  • data: the PCA embedding

  • attributes: two keys, cell_type and health_status.

Population genetics

pymde.datasets.population_genetics(root='./', download=True) → pymde.datasets.Dataset

Population genetics dataset (Nelson, et al.).

The population genetics dataset includes a PCA embedding (in R^20) of single nucleotide polymorphism data associated with 1,387 individuals thought to be of European descent. (The data is from the Population Reference Sample project by Nelson, et al.)

It also contains a “corrupted” version of the data, in which 154 additional points have been injected; the first 10 coordinates of these synthetic points are generated using a discrete uniform distribution on {0, 1, 2}, and the last 10 are generated using a discrete uniform distribution on {1/12, /18}.

A study by Novembre et al. (2008) showed that a PCA embedding in R^2 roughly resembles the map of Europe, suggesting that the genes encode geographical information. But PCA does not produce interesting visualizations of the corrupted data. If distortion functions are chosen to be robust (e.g., using the Log1p or Huber attractive penalties), we can create embeddings that preserve the geographical structure, while placing the synthetic points to the side, in their own cluster.
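To see why robust penalties help with injected outliers, the sketch below compares textbook forms of a quadratic, a Huber, and a log-based penalty as functions of an embedding distance d. These formulas, the threshold and exponent values, and the function names are illustrative assumptions; they are not necessarily the exact scaling or defaults of PyMDE's Log1p and Huber penalties.

```python
import math


def quadratic(d):
    """Non-robust baseline: grows quadratically with distance."""
    return d ** 2


def huber(d, threshold=1.0):
    """Quadratic up to `threshold`, then linear (textbook Huber form)."""
    if d <= threshold:
        return d ** 2
    return threshold * (2 * d - threshold)


def log1p_penalty(d, exponent=2.0):
    """Grows only logarithmically for large distances."""
    return math.log1p(d ** exponent)


# For small distances the penalties are similar; for large distances the
# robust ones charge far less, so a few outlying pairs dominate less.
small = (quadratic(0.5), huber(0.5), log1p_penalty(0.5))
large = (quadratic(10.0), huber(10.0), log1p_penalty(10.0))
```

At d = 10 the quadratic penalty is 100 and the Huber penalty is 19, while the log-based penalty stays below 5, which is why pairs involving the synthetic points pull much less on the rest of the embedding.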

  • data: the PCA embedding of the clean genetic data, in R^20

  • corrupted_data: the corrupted data, in R^20

  • attributes: two keys, clean_colors and corrupted_colors.

US counties

pymde.datasets.counties(root='./', download=True) → pymde.datasets.Dataset

US counties (2013-2017 ACS 5-Year Estimates)

This dataset contains 34 demographic features for each of the 3,220 US counties. The features were collected by the 2013-2017 ACS 5-Year Estimates longitudinal survey, run by the US Census Bureau.

  • data: the demographic features, with one row per county

  • county_dataframe: the raw ACS data

  • voting_dataframe: the raw 2016 voting data

  • attributes: one key, democratic_fraction, the fraction of voters who voted Democratic in each county in the 2016 presidential election