Output Formats
The GET and GRAPH web service commands return data in several formats explained below.
BioPAX (RDF/XML)
BioPAX is the default and most complete output format of Pathway Commons (PC); it gives access to all the details of the biological network model stored in the system. This format is ideal for users who need specific data that is not available in the simpler formats. Since BioPAX is defined using the standard OWL (RDF/XML) syntax, this format can also be used with RDF/OWL tools such as reasoners or triplestores. All pathways and interactions within the database are available in BioPAX Level 3. Because the BioPAX representation is so rich, reading and using a large BioPAX document requires knowledge of the format and of the software tools available for processing it, such as Paxtools (a Java library for working with BioPAX as an object model) or Jena and SPARQL.
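Because BioPAX is serialized as RDF/XML, a first look at a document does not require a full BioPAX library. The sketch below uses only the Python standard library to count top-level elements per BioPAX class. The sample document and the function name `count_biopax_classes` are hypothetical and heavily trimmed; for real work, a BioPAX-aware tool such as Paxtools or an RDF library is the better choice, since RDF/XML may also nest or reorder elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# BioPAX Level 3 namespace.
BP = "http://www.biopax.org/release/biopax-level3.owl#"

# Hypothetical, heavily trimmed BioPAX Level 3 document (illustrative only).
SAMPLE_OWL = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Pathway rdf:ID="pathway1">
    <bp:displayName>Example pathway</bp:displayName>
    <bp:pathwayComponent rdf:resource="#reaction1"/>
  </bp:Pathway>
  <bp:BiochemicalReaction rdf:ID="reaction1"/>
  <bp:Protein rdf:ID="protein1"/>
  <bp:Protein rdf:ID="protein2"/>
</rdf:RDF>
"""


def count_biopax_classes(owl_text):
    """Count the direct children of rdf:RDF, grouped by BioPAX class name."""
    root = ET.fromstring(owl_text)
    counts = Counter()
    prefix = "{" + BP + "}"
    for child in root:
        # Only elements in the BioPAX namespace are counted.
        if child.tag.startswith(prefix):
            counts[child.tag[len(prefix):]] += 1
    return counts


class_counts = count_biopax_classes(SAMPLE_OWL)
for class_name, n in sorted(class_counts.items()):
    print(class_name, n)
```

This counts only top-level elements, so it is a quick inventory rather than a faithful reading of the model.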
JSON-LD
JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. JSON-LD is an ideal data format for programming environments, REST Web services, and unstructured databases such as CouchDB and MongoDB. Paxtools' json-converter module, based on the Apache Jena libraries, helps convert a BioPAX model or element to JSON-LD format.
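Since JSON-LD is plain JSON, it can be read with any JSON parser. The following is a minimal sketch that groups node identifiers by their `@type`. The sample document, its `@context`, and the helper name `ids_by_type` are assumptions made for illustration; the exact shape of the JSON-LD produced by the Paxtools converter may differ (for example, it may or may not wrap nodes in `@graph`).

```python
import json
from collections import defaultdict

# Hypothetical JSON-LD document; not actual Pathway Commons output.
SAMPLE_JSONLD = """
{
  "@context": {"bp": "http://www.biopax.org/release/biopax-level3.owl#"},
  "@graph": [
    {"@id": "http://example.org/protein1", "@type": "bp:Protein"},
    {"@id": "http://example.org/protein2", "@type": "bp:Protein"},
    {"@id": "http://example.org/pathway1", "@type": "bp:Pathway"}
  ]
}
"""


def ids_by_type(jsonld_text):
    """Group node @id values by @type."""
    doc = json.loads(jsonld_text)
    # Fall back to treating the document itself as a single node.
    nodes = doc.get("@graph", [doc])
    grouped = defaultdict(list)
    for node in nodes:
        types = node.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        for type_name in types:
            grouped[type_name].append(node.get("@id"))
    return dict(grouped)


grouped_ids = ids_by_type(SAMPLE_JSONLD)
```

For anything beyond simple inspection, a JSON-LD processor that expands the `@context` is more robust than matching on compact type names.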
Gene Set Enrichment Format (GSEA - MSigDB GMT)
Over-representation analysis (ORA) is frequently used to assess the statistical enrichment of known gene sets (e.g. pathways) in a discrete or ranked list of genes. This type of analysis is useful for summarizing large gene lists and is commonly applied to genomics data sets. One popular tool for analyzing ranked gene lists is Gene Set Enrichment Analysis (GSEA). The gene sets used by GSEA are stored for convenience in the Molecular Signature Database (MSigDB) in the Gene Matrix Transposed file format (*.gmt). This is the main tab-delimited file format specified by the Broad Molecular Signature Database.
Each gene set is described by a name, a description, and the genes in the gene set: participants in a pathway are specified with one or several HGNC symbols (we can also provide another file that uses UniProt accession numbers instead). All participants (the corresponding BioPAX EntityReferences) of a pathway must come from the same species as the pathway. Participants from cross-species pathways, as well as those for which no identifier is found (i.e., when there are no Xrefs of the given type), are removed. Exporting to the MSigDB format enables computational biologists to use Pathway Commons data with gene set enrichment algorithms such as GSEA. This format is available for all pathways within Pathway Commons (only from pathway database sources, not interaction database sources). Full data format details are available at the Broad GSEA Wiki. We used the normalized and merged BioPAX Level 3 model and the simple GSEA converter from the Paxtools library to generate the GSEA (.gmt) archives. (Note: for the cross-species check to be effective, each BioSource must have a UnificationXref with the "taxonomy" database name and an id, and each Pathway and ProteinReference must have a non-empty 'organism' property value.)
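The name, description, and gene columns described above map directly onto a small parser. The sketch below reads GMT text into a dictionary; the sample rows, gene set names, and the function name `parse_gmt` are made up for illustration and are not taken from a Pathway Commons archive.

```python
# Hypothetical GMT content: name <TAB> description <TAB> gene1 <TAB> gene2 ...
SAMPLE_GMT = (
    "pathway_1\tExample pathway one\tGENE_A\tGENE_B\tGENE_C\n"
    "pathway_2\tExample pathway two\tGENE_B\tGENE_D\n"
)


def parse_gmt(text):
    """Parse GMT text into {name: {"description": str, "genes": [str, ...]}}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0].strip():
            continue  # skip blank or malformed lines
        name, description = fields[0], fields[1]
        # Every remaining non-empty column is a gene identifier.
        genes = [g.strip() for g in fields[2:] if g.strip()]
        gene_sets[name] = {"description": description, "genes": genes}
    return gene_sets


gene_sets = parse_gmt(SAMPLE_GMT)
for name, entry in gene_sets.items():
    print(name, len(entry["genes"]), "genes")
```

Rows have a variable number of columns, which is why the parser splits each line itself instead of relying on a fixed-width table reader.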
Simple Interaction Format (SIF)
SIF (or BINARY_SIF)
Many network analysis algorithms require pairwise interaction networks as input. A BioPAX network often contains more complex relationships with multiple participants, such as biochemical reactions. To make it easier to use all of the pathway information in Pathway Commons with typical network analysis tools, we developed a set of rules to reduce BioPAX interactions to pairwise (or binary) relationships. Because SIF interactions are always binary, they cannot fully represent all of BioPAX, so this translation is lossy in general. Nonetheless, the SIF network is useful for applications that require pairwise interaction input. The SIF format can be easily imported into popular network analysis tools, such as Cytoscape.
In this output format, all participants are specified as chemical or gene names or identifiers. This format does not contain any cross-species interactions and is available for all pathways and interactions within this database.
A note about identifiers: We uniquely mapped selected protein and gene identifiers, such as HGNC symbols, NCBI Gene, UniProt Isoform, Ensembl and RefSeq, to primary UniProt accession numbers, where possible, and then normalized and merged the original protein types into canonical UniProt ones, thus building a larger BioPAX network of all pathways, interactions and participants from different data sources. In some cases, mappings between identifiers cannot be made, so some information may be lost in this process. Also, in cases where the SIF format contains a non-UniProt identifier (e.g. HGNC Symbol), more than one identifier may map to the same UniProt identifier. In this case, a duplicate SIF interaction is created for each additional non-UniProt identifier.
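As an illustration of how pairwise SIF data can be consumed, the sketch below parses SIF text and groups edges by relation type. It assumes each line holds exactly three tab-separated columns (participant, relation, participant); the sample edges, identifiers, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical SIF content: source <TAB> relation <TAB> target
SAMPLE_SIF = (
    "GENE_A\tcontrols-state-change-of\tGENE_B\n"
    "GENE_B\tcontrols-expression-of\tGENE_C\n"
    "GENE_A\tin-complex-with\tGENE_B\n"
)


def parse_sif(text):
    """Return a list of (source, relation, target) tuples."""
    edges = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("\t")]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        edges.append(tuple(parts))
    return edges


def edges_by_relation(edges):
    """Group (source, target) pairs under their relation type."""
    grouped = defaultdict(list)
    for source, relation, target in edges:
        grouped[relation].append((source, target))
    return dict(grouped)


sif_edges = parse_sif(SAMPLE_SIF)
sif_grouped = edges_by_relation(sif_edges)
```

Grouping by relation makes it straightforward to keep only the relation types a given analysis needs before handing the network to another tool.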
TXT (or EXTENDED_BINARY_SIF)
This format is similar to the basic SIF output format, except that the output is organized in two sections. Each section starts with one row of column headings, and the two sections are separated by a single blank line. Each entry is multi-column and tab-delimited. The first section contains the SIF edges as described above, plus a PATHWAYS column. Current edge attributes include the interaction data source and PubMed ID. The second section contains the participant (molecule or gene) name followed by several node attributes. Current node attributes include PARTICIPANT_TYPE, PARTICIPANT_NAME(s), UNIFICATION_XREF(s) (e.g., one or more UniProt IDs in the case of a protein reference, or a ChEBI ID in the case of a Small Molecule reference), and RELATIONSHIP_XREF(s) (including RefSeq, Entrez Gene, and Gene Symbol). Xrefs are represented as NAME:VALUE pairs, for example PubMed:9136927. Multiple names or xrefs are separated by a semicolon ';'. This output format is suitable for the Cytoscape Attribute Table import and for loading into Excel. To prevent an unsuccessful import into Cytoscape due to missing attribute values, users should specify during import that all columns are strings. This format is available for all pathways and interactions within Pathway Commons.
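The two-section layout can be read by splitting the text at the blank line and treating the first row of each section as its header. The sketch below does this and also splits semicolon-separated NAME:VALUE xrefs. The sample rows are invented, and some header names in the sample (for example `PARTICIPANT_A`, `INTERACTION_TYPE`, `PARTICIPANT`) are assumptions; the parser takes whatever headings the file actually provides.

```python
# Hypothetical extended SIF content: edge section, blank line, node section.
SAMPLE_TXT = (
    "PARTICIPANT_A\tINTERACTION_TYPE\tPARTICIPANT_B\tPATHWAYS\n"
    "GENE_A\tin-complex-with\tGENE_B\tExample pathway\n"
    "\n"
    "PARTICIPANT\tPARTICIPANT_TYPE\tPARTICIPANT_NAME\tUNIFICATION_XREF\n"
    "GENE_A\tProteinReference\tName A;Alias A\tUniProt:P00001\n"
    "GENE_B\tProteinReference\tName B\tUniProt:P00002;UniProt:P00003\n"
)


def _rows(block):
    """Turn one header-plus-rows section into a list of dicts."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def parse_extended_sif(text):
    """Split at the first blank line; return (edge_rows, node_rows)."""
    # Assumes "\n" line endings; normalize "\r\n" first if needed.
    edge_block, _, node_block = text.partition("\n\n")
    return _rows(edge_block), _rows(node_block)


def parse_xrefs(cell):
    """Split 'NAME:VALUE;NAME:VALUE' into a list of (name, value) pairs."""
    pairs = []
    for item in cell.split(";"):
        if ":" in item:
            name, value = item.split(":", 1)
            pairs.append((name.strip(), value.strip()))
    return pairs


txt_edges, txt_nodes = parse_extended_sif(SAMPLE_TXT)
node_b_xrefs = parse_xrefs(txt_nodes[1]["UNIFICATION_XREF"])
```

All values are kept as strings, which mirrors the advice above to import every column as a string.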
Types of Binary Relations
Name | Description |
---|---|
controls-state-change-of | First protein controls a reaction that changes the state of the second protein. |
controls-transport-of | First protein controls a reaction that changes the cellular location of the second protein. |
controls-phosphorylation-of | First protein controls a reaction that changes the phosphorylation status of the second protein. |
controls-expression-of | First protein controls a conversion or a template reaction that changes expression of the second protein. |
catalysis-precedes | First protein controls a reaction whose output molecule is input to another reaction controlled by the second protein. |
in-complex-with | Proteins are members of the same complex. |
interacts-with | Proteins are participants of the same MolecularInteraction. |
neighbor-of | Proteins are participants or controllers of the same interaction. |
consumption-controled-by | The small molecule is consumed by a reaction that is controlled by a protein. |
controls-production-of | The protein controls a reaction of which the small molecule is an output. |
controls-transport-of-chemical | The protein controls a reaction that changes cellular location of the small molecule. |
chemical-affects | A small molecule has an effect on the protein state. |
reacts-with | Small molecules are input to a biochemical reaction. |
used-to-produce | A reaction consumes a small molecule to produce another small molecule. |
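To make the idea of inferring binary relations concrete, the sketch below applies the simplest description from the table, in-complex-with ("Proteins are members of the same complex"), to a toy input. It is an illustration only and not the rule set used to produce the SIF output: the input structure, the identifiers, and the function name are hypothetical, and a real conversion works on the BioPAX model.

```python
from itertools import combinations

# Hypothetical input: complex identifier -> member protein identifiers.
COMPLEXES = {
    "complex_1": ["PROT_A", "PROT_B", "PROT_C"],
    "complex_2": ["PROT_B", "PROT_C"],
}


def in_complex_with(complexes):
    """Emit one in-complex-with edge per unordered pair of co-members."""
    edges = set()
    for members in complexes.values():
        # Sorting gives each undirected pair a single canonical order,
        # so the same pair found in two complexes is emitted only once.
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, "in-complex-with", b))
    return sorted(edges)


complex_edges = in_complex_with(COMPLEXES)
for source, relation, target in complex_edges:
    print(source, relation, target, sep="\t")
```

A complex with n members yields n(n-1)/2 pairs, which shows why the binary form grows quickly for large complexes and why it cannot say which proteins shared a particular complex.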
SBGN
The Systems Biology Graphical Notation (SBGN) is a standard visual notation for network diagrams in biology. SBGN markup language (SBGN-ML) is an associated standard XML format that can be loaded into available software to visualize a diagram of a pathway. BioPAX can be converted to SBGN-ML format, following the process diagram paradigm, one of three paradigms (activity flow, process and entity relationship) available in SBGN.
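Because SBGN-ML is XML, its contents can be summarized with a standard XML parser before loading the file into a visualization tool. The sketch below counts glyphs and arcs by their `class` attribute. The sample is a trimmed, hypothetical fragment (real SBGN-ML files also carry layout information such as bounding boxes and arc coordinates), and the namespace shown is an assumption about the SBGN-ML version; the code ignores namespaces for that reason.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, trimmed SBGN-ML fragment; layout elements are omitted.
SAMPLE_SBGN = """<sbgn xmlns="http://sbgn.org/libsbgn/0.2">
  <map language="process description">
    <glyph id="g1" class="macromolecule"><label text="Protein A"/></glyph>
    <glyph id="g2" class="macromolecule"><label text="Protein B"/></glyph>
    <glyph id="g3" class="process"/>
    <arc id="a1" class="consumption" source="g1" target="g3"/>
    <arc id="a2" class="production" source="g3" target="g2"/>
  </map>
</sbgn>
"""


def local_name(tag):
    """Strip an '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_sbgn(text):
    """Count glyph and arc elements by their 'class' attribute."""
    root = ET.fromstring(text)
    glyphs, arcs = Counter(), Counter()
    for element in root.iter():
        name = local_name(element.tag)
        if name == "glyph":
            glyphs[element.get("class")] += 1
        elif name == "arc":
            arcs[element.get("class")] += 1
    return glyphs, arcs


glyph_counts, arc_counts = summarize_sbgn(SAMPLE_SBGN)
```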