HDF5 Data Frame


A data frame object stored inside a group of an HDF5 file. Simple columns are stored as one-dimensional datasets in the data subgroup, each named by the column's 0-based position in the data frame. All such datasets should have the same length. Column names are stored in column_names, a one-dimensional string dataset with one entry per column. Row names, if present, are stored in a row_names dataset. For complex columns, the corresponding dataset is omitted and the contents are obtained from other files; a pointer to the resource should be stored in the corresponding entry of the data_frame.columns property.
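As a rough illustration of this layout, the sketch below lists the dataset paths one would expect to find inside the group. The function name, the shape of the `columns` argument, and the assumption that only columns of type "other" are complex (and so have no dataset) are illustrative choices, not part of the schema.

```python
def expected_dataset_paths(group, columns, has_row_names=False):
    """List the HDF5 dataset paths implied by the layout described above.

    `group` is the name of the group holding the data frame and `columns`
    is a list of dicts with a "type" key (hypothetical input shape).
    """
    paths = [f"{group}/column_names"]
    if has_row_names:
        paths.append(f"{group}/row_names")
    for index, column in enumerate(columns):
        # Assumption: "other" marks a complex column, whose dataset is omitted.
        if column["type"] != "other":
            paths.append(f"{group}/data/{index}")  # named by 0-based position
    return paths
```

For a three-column frame whose second column is complex, this yields data/0 and data/2 but no data/1, because each dataset keeps the position of its column.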

For any column represented by an integer dataset (including boolean columns), missing values are represented by -2147483648.
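A minimal sketch of how a reader might decode this placeholder, assuming the column has already been loaded into a plain Python list. The helper names are hypothetical, and the 0/1 encoding of booleans is an assumption not stated above.

```python
# Placeholder for missing values in integer (and boolean) columns.
INT_MISSING = -2147483648  # equal to -(2**31), the smallest 32-bit signed integer


def decode_integer_column(values):
    """Replace the placeholder with None and leave other values untouched."""
    return [None if v == INT_MISSING else v for v in values]


def decode_boolean_column(values):
    """Assumption: non-missing booleans are stored as the integers 0 and 1."""
    return [None if v == INT_MISSING else bool(v) for v in values]
```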

For any column represented by a string dataset, that dataset may have a missing-value-placeholder attribute. This should be a scalar string containing the string used to represent missing values. If the attribute is absent, all strings are assumed to be non-missing. Note that the row_names dataset, if present, should not contain any missing values.
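The string case can be sketched in the same spirit. Here `placeholder` stands for the value of the missing-value-placeholder attribute, passed as None when the attribute is absent; the function name and the use of plain lists are assumptions made for illustration.

```python
def decode_string_column(values, placeholder=None):
    """Map the placeholder string to None.

    If no placeholder is given (the attribute is absent), every string
    is treated as non-missing and returned unchanged.
    """
    if placeholder is None:
        return list(values)
    return [None if v == placeholder else v for v in values]
```

Note that with no placeholder, a literal "NA" is an ordinary string and is kept as-is.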

Derived from data_frame/v1.json: virtual data frame object stored in a yet-to-be-defined file format. Simple columns are stored directly in the file. For complex columns, their contents should be stored in other files, and a pointer to a resource is stored in the corresponding entry of columns (a placeholder column may be created in the file).

Type: object

Type: string

The schema to use.

Type: array of object

Authors of this resource.

Each item of this array must be:

Type: object

Type: string

Email of the author.

Must match regular expression: ^[^@]+@[^@]+$

Type: string

Name of the author.

Type: string

ORCID of the author.

Must match regular expression: ^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$
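The two patterns for authors can be checked with a few lines of code. This is only a sketch: the property names `email` and `orcid` are assumed here, and both fields are treated as optional.

```python
import re

# Patterns copied from the constraints above.
EMAIL_RE = re.compile(r"^[^@]+@[^@]+$")
ORCID_RE = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$")


def author_problems(author):
    """Return a list of human-readable problems for one author entry."""
    problems = []
    email = author.get("email")  # assumed property name
    if email is not None and EMAIL_RE.search(email) is None:
        problems.append("email does not match the expected pattern")
    orcid = author.get("orcid")  # assumed property name
    if orcid is not None and ORCID_RE.search(orcid) is None:
        problems.append("orcid does not match the expected pattern")
    return problems
```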

Type: object
No Additional Properties

Type: object

Location of additional metadata for each column, stored as another data_frame. Omitted if no additional per-column metadata is provided.

Type: object

Type: string

Relative path of the resource from the root of the project directory.

Type: enum (of string)

Type of file. Local files should be present in the same project directory.

Must be one of:

  • "local"

Type: array of object

Information about the columnar fields in the data frame. This follows the same order as the columns in the on-disk representation.

Each item of this array must be:


Type: object

If the following condition is met:

Type: enum (of string)

Must be one of:

  • "factor"
  • "ordered"

then the following must also hold:

Type: object

Levels for the categorical factor. This is stored as a single-column data_frame. For ordered factors, the order is respected in the saved data frame.

Type: object

Type: string

Relative path of the resource from the root of the project directory.

Type: enum (of string)

Type of file. Local files should be present in the same project directory.

Must be one of:

  • "local"

Type: object

If the following condition is met:

Type: const
Specific value: "other"

then the following must also hold:

Type: object

Type: string

Relative path of the resource from the root of the project directory.

Type: enum (of string)

Type of file. Local files should be present in the same project directory.

Must be one of:

  • "local"

Type: string

Name of the column. Each column must have a non-empty name. Column names should not be duplicated within columns.

Must be at least 1 character long

Type: enum (of string)

Type of the column. Factors and ordered factors have an additional levels property specifying the levels. Dates are stored in YYYY-MM-DD format. Date-times should follow RFC 3339 Section 5.6. Columns listed as other are assumed to be non-atomic and should contain a resource property pointing towards the file containing the column's contents.

Must be one of:

  • "integer"
  • "number"
  • "string"
  • "factor"
  • "ordered"
  • "boolean"
  • "date"
  • "date-time"
  • "other"
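The naming and type rules for columns can be sketched as a small checker. The property names `name` and `type` and the function names are assumptions, and the date helper covers only the YYYY-MM-DD form (date-times are not checked here).

```python
import re
from datetime import datetime

# Allowed column types, copied from the list above.
COLUMN_TYPES = ("integer", "number", "string", "factor", "ordered",
                "boolean", "date", "date-time", "other")


def column_problems(columns):
    """Report empty names, duplicated names, and unknown types."""
    problems = []
    seen = set()
    for i, column in enumerate(columns):
        name = column.get("name", "")
        if not isinstance(name, str) or len(name) < 1:
            problems.append(f"column {i}: name must be a non-empty string")
        elif name in seen:
            problems.append(f"column {i}: duplicated name {name!r}")
        else:
            seen.add(name)
        if column.get("type") not in COLUMN_TYPES:
            problems.append(f"column {i}: unknown type {column.get('type')!r}")
    return problems


def is_valid_date(text):
    """Check the YYYY-MM-DD shape, then that it is a real calendar date."""
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True
```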

Type: array of integer

Dimensions of a two-dimensional object.

Must contain a minimum of 2 items

Must contain a maximum of 2 items

Each item of this array must be:

Type: integer

Type: object

Location of additional metadata for this object, typically stored as a list (via the basic_list schema). Omitted if no other metadata is provided.

Type: object

Type: string

Relative path of the resource from the root of the project directory.

Type: enum (of string)

Type of file. Local files should be present in the same project directory.

Must be one of:

  • "local"

Type: boolean Default: false

Whether the data frame has row names. If true, these are stored in the row_names dataset.

Type: string

Description of the resource.

Type: array of object

UCSC, Ensembl or other genome builds involved in generating this resource.

Each item of this array must be:

Type: object

Type: string

Identifier for this genome build.


Examples:

"mm10"
"NCBIm37"

Type: enum (of string)

Source of the genome build identifier.

Must be one of:

  • "Ensembl"
  • "UCSC"
  • "Wormbase"
  • "Flybase"

Type: object
No Additional Properties

Type: string

Name of the group inside the HDF5 file that contains the contents of the data frame.

Type: boolean Default: false

Is this a child document, only to be interpreted in the context of the parent document from which it is linked? This may have implications for search and metadata requirements.

Type: string

MD5 checksum for the file.

Type: array of object

Origins of this resource.

Each item of this array must be:


Type: object

If the following condition is met:

Type: const
Specific value: "PubMed"

then the following must also hold:

Type: string
Must match regular expression: ^[0-9]+$

Type: object

If the following condition is met:

Type: const
Specific value: "GEO"

then the following must also hold:

Type: string
Must match regular expression: ^GSE[0-9]+$

Type: object

If the following condition is met:

Type: const
Specific value: "ArrayExpress"

then the following must also hold:

Type: string
Must match regular expression: ^E-MTAB-[0-9]+$

Type: object

If the following condition is met:

Type: const
Specific value: "DOI"

then the following must also hold:

Type: string
Must match regular expression: ^[0-9a-zA-Z\._-]+/[0-9a-zA-Z\._-]+$

Type: object

If the following condition is met:

Type: const
Specific value: "URI"

then the following must also hold:

Type: string
Must match regular expression: ^(http|ftp|https|s3|sftp)://

Type: string

Identifier for the resource in the specified type.

Type: enum (of string)

Source database or repository.

Must be one of:

  • "PubMed"
  • "GEO"
  • "ArrayExpress"
  • "DOI"
  • "URI"
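The per-source identifier rules for origins amount to a lookup from source name to pattern. The sketch below shows one way to apply them; the table and function names are hypothetical, and the patterns are copied from the constraints above.

```python
import re

# Source database or repository -> pattern its identifier must match.
ORIGIN_PATTERNS = {
    "PubMed": r"^[0-9]+$",
    "GEO": r"^GSE[0-9]+$",
    "ArrayExpress": r"^E-MTAB-[0-9]+$",
    "DOI": r"^[0-9a-zA-Z\._-]+/[0-9a-zA-Z\._-]+$",
    "URI": r"^(http|ftp|https|s3|sftp)://",
}


def origin_is_valid(source, identifier):
    """True if the source is known and the identifier fits its pattern."""
    pattern = ORIGIN_PATTERNS.get(source)
    if pattern is None:
        return False  # not one of the enumerated sources
    return re.search(pattern, identifier) is not None
```

The URI pattern constrains only the scheme prefix, so anything after `://` is accepted.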

Type: string

Path to the file in the project directory.

Type: array of integer

Each item of this array must be:

Type: integer

NCBI taxonomy IDs of the species involved in this resource.

Type: array of object

Terms from a controlled vocabulary, used to annotate this resource in a machine-readable manner.

Each item of this array must be:


No Additional Properties

Type: object

If the following condition is met:

Type: const
Specific value: "Experimental Factor Ontology"

then the following must also hold:

Must match regular expression: ^EFO:[0-9]{7}$

Type: object

If the following condition is met:

Type: const
Specific value: "Human Disease Ontology"

then the following must also hold:

Must match regular expression: ^DOID:[0-9]+$

Type: object

If the following condition is met:

Type: const
Specific value: "Cell Ontology"

then the following must also hold:

Must match regular expression: ^CL:[0-9]{7}$

Type: object

If the following condition is met:

Type: const
Specific value: "UBERON"

then the following must also hold:

Type: const
Specific value: "^UBERON:[0-9]{7}$"

Type: string

Identifier for the term.


Examples:

"EFO:0008913"
"DOID:13250"
"CL:0000097"
"UBERON:0005870"

Type: enum (of string)

Name of the vocabulary or ontology that is the source for this term.

Must be one of:

  • "Experimental Factor Ontology"
  • "Human Disease Ontology"
  • "Cell Ontology"
  • "UBERON"
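A similar lookup can be sketched for ontology terms. The names are hypothetical. The UBERON entry is listed above as a specific value, not a pattern; this sketch assumes it is meant to be applied as a pattern, since the example "UBERON:0005870" fits it.

```python
import re

# Vocabulary name -> pattern for term identifiers.
TERM_PATTERNS = {
    "Experimental Factor Ontology": r"^EFO:[0-9]{7}$",
    "Human Disease Ontology": r"^DOID:[0-9]+$",
    "Cell Ontology": r"^CL:[0-9]{7}$",
    # Assumption: treated as a pattern here (see the note above).
    "UBERON": r"^UBERON:[0-9]{7}$",
}


def term_is_valid(source, identifier):
    """True if the vocabulary is known and the identifier fits its pattern."""
    pattern = TERM_PATTERNS.get(source)
    if pattern is None:
        return False
    return re.search(pattern, identifier) is not None
```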

Type: string

Version of the vocabulary.

Type: string

Title of the resource.

Type: object

If the following condition is met:

Must not be:

Type: const
Specific value: true

then the following properties are required:

  • title
  • description
  • authors
  • species
  • genome
  • origin
  • terms
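The conditional requirement can be sketched as follows. The flag name `is_child` is an assumption, inferred from the child-document property described earlier, and an absent flag is treated here as its default of false; a strict schema validator may handle an absent flag differently.

```python
# Properties required for documents that are not child documents.
REQUIRED_WHEN_NOT_CHILD = ["title", "description", "authors", "species",
                           "genome", "origin", "terms"]


def missing_required(metadata):
    """Return the required properties absent from a metadata dict.

    Child documents (flag set to True) are exempt and yield an empty list.
    """
    # Assumption: the flag is named "is_child" and defaults to False.
    if metadata.get("is_child", False) is True:
        return []
    return [key for key in REQUIRED_WHEN_NOT_CHILD if key not in metadata]
```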