Protein-Specific Information Matrix (PSIM)

This section describes the PSIM, a dataframe that contains all gene-specific features needed to generate the expression module reactions in the ME-Model.

Data Sources

We provide a "gold-standard" PSIM (file name "psim_gold.h5") which can be downloaded using make.

Sources:

  • Sequences were obtained from MANE Select, RefSeq Select, and APPRIS.
  • The number of exons was found using the Ensembl REST API for the specific transcript isoform.
  • The poly(A)-length was taken from here.
  • Metadata associated with the secretory pathway was taken from the human PSIM specified here. The direct PSIM for this is here.
  • mRNA degradation rates were taken from here.
  • Protein degradation rates were taken from here and here.
  • PTRs were taken from here.

Formatting

We explain the formatting and values for all columns in the gold-standard PSIM. Any user-provided PSIM can be generated following this formatting (column names must match exactly).

Legend for user-provided PSIM:

  • super required columns⁰: default values unavailable; the user must provide them
  • required columns¹: default values unavailable, but values that are missing or incorrect are filled in with the gold-standard PSIM values when available (otherwise the pipeline errors out)
  • semi-optional columns²: may or may not be required by the pipeline
  • optional columns³: used in the pipeline, but default values can be used when not provided
  • optional columns, secretory pathway only⁴: if provided, these are only used for proteins that will be processed via the secretory pathway (ER, Golgi, extracellular membrane, plasma membrane, lysosome)
  • additional information⁵: not used in the pipeline; additional metadata associated with each gene, specific to Recon2.2 and the gold-standard PSIM. A user-provided PSIM does not need these columns.
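
As a minimal sketch of how the legend could be applied to a user-provided PSIM, the snippet below reports which columns are absent in each category. The category assignments are copied from the column list in this section; the names `COLUMN_CATEGORIES` and `missing_columns` are hypothetical and are not part of the pipeline.

```python
# Column categories as listed in this section (the dict itself is illustrative).
COLUMN_CATEGORIES = {
    "super_required": ["HGNC_ID"],
    "required": ["PREMRNA_SEQ", "MRNA_SEQ", "PROTEIN_SEQ"],
    "semi_optional": ["LOCATION"],
    "optional": ["POLYA_LENGTH", "N_EXONS", "ALPHA_M", "ALPHA_P", "PTR"],
    "optional_secretory": ["TMD", "SP", "DSB", "GPI", "OG", "NG"],
}


def missing_columns(columns):
    """Return, per category, the expected column names that are absent.

    `columns` is any iterable of column names, e.g. `df.columns`.
    Names are compared exactly, since PSIM column names must match exactly.
    """
    present = set(columns)
    return {
        category: [name for name in names if name not in present]
        for category, names in COLUMN_CATEGORIES.items()
    }


report = missing_columns(["HGNC_ID", "MRNA_SEQ", "LOCATION"])
```

A non-empty `report["super_required"]` would mean the table cannot be used at all, whereas gaps in the other categories may be filled from the gold-standard PSIM or from defaults, as described in this section.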

The PSIM is read into the pipeline using the preprocess.correct_inputs.correct_psim function, which validates all values. If the fill_na argument is set to 'select', all NaN values in the user-provided PSIM are filled with values from the gold-standard PSIM when available, and with default values otherwise. If the fill_na argument is set to 'default', all NaN values are replaced with default values as defined by the expression.gene_expression.gene_information.GeneInformation class.
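
The two fill_na modes can be illustrated with a small pandas sketch. This is not the implementation of correct_psim: the function `fill_missing` and the `DEFAULTS` dict are hypothetical, both tables are assumed to be indexed by HGNC_ID, and only columns with a fixed default listed in this section are covered.

```python
import pandas as pd

# Fixed defaults copied from the column list in this section (illustrative only).
DEFAULTS = {
    "TMD": 0, "DSB": 0, "GPI": 0, "OG": 0, "NG": 0,
    "ALPHA_M": 0.06, "ALPHA_P": 0.02, "PTR": 65163.0,
}


def fill_missing(user: pd.DataFrame, gold: pd.DataFrame, fill_na: str) -> pd.DataFrame:
    """Fill NaN entries of `user`; both frames are assumed indexed by HGNC_ID."""
    if fill_na not in ("select", "default"):
        raise ValueError("fill_na must be 'select' or 'default'")
    filled = user.copy()
    if fill_na == "select":
        # Align the gold-standard table to the user's genes and columns,
        # then take gold values only where the user value is missing.
        gold_aligned = gold.reindex(index=filled.index, columns=filled.columns)
        filled = filled.where(filled.notna(), gold_aligned)
    # Anything still missing falls back to the per-column default.
    return filled.fillna(value=DEFAULTS)


# Placeholder gene IDs, not real HGNC entries.
user = pd.DataFrame(
    {"ALPHA_M": [float("nan"), 0.1], "PTR": [float("nan"), float("nan")]},
    index=["HGNC:1", "HGNC:2"],
)
gold = pd.DataFrame({"ALPHA_M": [0.5], "PTR": [float("nan")]}, index=["HGNC:1"])
selected = fill_missing(user, gold, "select")
defaulted = fill_missing(user, gold, "default")
```

With 'select', a gene present in the gold-standard table inherits its values before any default is applied; with 'default', the gold-standard table is ignored.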

Columns:

  • HGNC_ID⁰: The gene ID in HGNC format (HGNC:####). There should be an entry for all genes that are included in the M_Model GPR and in non-machinery.
    • Datatype: str
  • PREMRNA_SEQ¹: The gene pre-mRNA sequence. Requirements: values may only include 'A', 'C', 'G', 'U'; the sequence length must be >= the mRNA sequence length; and the sequence length must be >= 3 * the protein sequence length.
    • Datatype: str
    • Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
  • MRNA_SEQ¹: The gene mRNA sequence (isoform specific).
    • Datatype: str
    • Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
  • PROTEIN_SEQ¹: The gene protein sequence (isoform specific). Requirements: values may only include one-letter amino-acid codes, and the sequence length must be <= (mRNA sequence length)/3.
    • Datatype: str
    • Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
  • POLYA_LENGTH³: The length of the mature mRNA poly(A) tail.
    • Datatype: int
    • Default value: Randomly drawn from a Johnson SU (johnsonsu) distribution
  • N_EXONS³: The number of exons in the pre-mRNA (isoform specific). Used to estimate the number of introns (as number of exons - 1).
    • Datatype: int
    • Default value: Estimated as 1 + (pre-mRNA sequence length)/6700
  • TMD⁴: The number of transmembrane domains contained in the sequence.
    • Datatype: int
    • Default value: 0
  • SP⁴: Whether the protein contains a secretory pathway signal peptide (SP). This option is not currently implemented, as all proteins destined for secretory pathway compartments are assumed to have a signal peptide. In the future, this option will be used for non-canonical secretion.
    • Datatype: bool
    • Default value: True for secretory pathway destined proteins, False otherwise
  • DSB⁴: The number of disulfide bonds in the protein.
    • Datatype: int
    • Default value: 0
  • GPI⁴: Whether a GPI anchor is present in the protein: 1 if present, 0 otherwise.
    • Datatype: int
    • Default value: 0
  • OG⁴: The number of utilized O-linked glycosylation sites in the protein.
    • Datatype: int
    • Default value: 0
  • NG⁴: The number of utilized N-linked glycosylation sites in the protein.
    • Datatype: int
    • Default value: 0
  • ALPHA_M³: The mRNA degradation/turnover rate (hrs^-1). Used in calculating coupling constraints.
    • Datatype: float
    • Default value: 0.06 hrs^-1
  • ALPHA_P³: The protein degradation/turnover rate (hrs^-1). Used in calculating coupling constraints.
    • Datatype: float
    • Default value: 0.02 hrs^-1
  • PTR³: The protein-to-RNA ratio, as described in https://doi.org/10.15252/msb.20188513. Used in calculating coupling constraints.
    • Datatype: float
    • Default value: 65163
  • LOCATION²: The final location of the protein. Required for non-machinery, disregarded for machinery (the pipeline infers location from the reaction compartments).
    • Datatype: str, one of utils.parameters.compartments.keys()
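
The sequence requirements and the N_EXONS default listed above can be sketched in plain Python. This is an illustration rather than the checks performed by correct_psim: the function names are hypothetical, the amino-acid alphabet is assumed to be the 20 standard one-letter codes, and the N_EXONS estimate is assumed to round down to an integer.

```python
RNA_ALPHABET = set("ACGU")
# Assumption: only the 20 standard one-letter amino-acid codes are accepted.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequences(premrna: str, mrna: str, protein: str) -> list:
    """Return the violated requirements for one gene (empty list if none)."""
    problems = []
    if not set(premrna) <= RNA_ALPHABET:
        problems.append("PREMRNA_SEQ contains characters other than A, C, G, U")
    if not set(protein) <= AMINO_ACIDS:
        problems.append("PROTEIN_SEQ contains non amino-acid characters")
    if len(premrna) < len(mrna):
        problems.append("PREMRNA_SEQ is shorter than MRNA_SEQ")
    if len(premrna) < 3 * len(protein):
        problems.append("PREMRNA_SEQ is shorter than 3 * PROTEIN_SEQ length")
    if len(protein) > len(mrna) / 3:
        problems.append("PROTEIN_SEQ is longer than MRNA_SEQ length / 3")
    return problems


def default_n_exons(premrna: str) -> int:
    """Estimate N_EXONS as 1 + (pre-mRNA length)/6700, rounded down (assumption)."""
    return 1 + len(premrna) // 6700


def n_introns(n_exons: int) -> int:
    """Number of introns, taken as number of exons minus one."""
    return n_exons - 1


issues = check_sequences("AUGGCCUAA", "AUGGCCUAA", "MA")
```

An entry that returns a non-empty list here is the kind of value the section describes as "incorrect", which the pipeline replaces with the gold-standard PSIM value when one is available.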

Additional Information in Gold-Standard PSIM:

  • Machinery⁵: Whether a protein is considered machinery according to the full Recon2.2 ('Metabolic'), the GPRs for expression reactions ('Expression'), both ('Both'), or neither ('Non-Machinery').
  • Source⁵: The database from which the isoform sequences were obtained.
  • Status⁵: 1 for entries that should work with the pipeline, 0 for entries that will cause an error in the pipeline.
  • The remaining columns⁵ are various IDs for the gene: 'GENE_SYMBOL', 'GeneID', 'ENSG_ID', 'ENST_ID', 'ENSP_ID', 'REFT_ID', 'REFP_ID', 'UNIPROT_ID'.