PSIM

Protein-Specific Information Matrix (PSIM)

This section describes the PSIM, a dataframe that contains all gene-specific features needed to generate the expression module reactions in the ME-Model.

Data Sources

We provide a "gold-standard" PSIM (file name "psim_gold.h5") which can be downloaded using make.

Sources:

Sequences were generated from MANE Select, RefSeq Select, and APPRIS.
The number of exons was found using the Ensembl REST API for the specific transcript isoform.
The poly(A)-length was taken from here.
Metadata associated with the secretory pathway was taken from the human PSIM specified here. The direct PSIM for this is here.
mRNA degradation rates were taken from here.
Protein degradation rates were taken from here and here.
PTRs were taken from here.

Formatting

We explain the formatting and values for all columns in the gold-standard PSIM. Any user-provided PSIM can be generated following this formatting (column names must match exactly).

Legend for user-provided PSIM:

super required columns⁰: default values unavailable, user must provide
required columns¹: default values unavailable, but if a value is missing or incorrect, the pipeline fills it in with the gold-standard PSIM value when available (otherwise it errors out)
semi-optional columns²: may or may not be required by pipeline
optional columns³: used in pipeline, but default values can be used when not provided
optional columns, secretory pathway only⁴: if provided, these are only used for proteins that will be processed via the secretory pathway (ER, Golgi, extracellular membrane, plasma membrane, lysosome).
additional information⁵: not used in pipeline; additional metadata associated with each gene, specific to Recon2.2 and the gold-standard PSIM. A user-provided PSIM does not need these columns.
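
The legend can be read as a lookup from column name to tier. The sketch below is only an illustration of that reading: the tier numbers are copied from the superscripts in the Columns list later in this section, and missing_columns is a hypothetical helper, not a function of the pipeline.

```python
# Tier of each pipeline column, copied from the superscripts in this section.
# 0 = super required, 1 = required, 2 = semi-optional,
# 3 = optional, 4 = optional (secretory pathway only).
COLUMN_TIERS = {
    "HGNC_ID": 0,
    "PREMRNA_SEQ": 1, "MRNA_SEQ": 1, "PROTEIN_SEQ": 1,
    "LOCATION": 2,
    "POLYA_LENGTH": 3, "N_EXONS": 3, "ALPHA_M": 3, "ALPHA_P": 3, "PTR": 3,
    "TMD": 4, "SP": 4, "DSB": 4, "GPI": 4, "OG": 4, "NG": 4,
}


def missing_columns(user_columns, tier):
    """Return the columns of one tier that are absent from user_columns.

    Hypothetical helper for illustration; the pipeline's own checks live in
    preprocess.correct_inputs.correct_psim.
    """
    present = set(user_columns)
    return sorted(
        name for name, t in COLUMN_TIERS.items() if t == tier and name not in present
    )


# Example: a user-provided PSIM that lacks the pre-mRNA sequence column.
user_columns = ["HGNC_ID", "MRNA_SEQ", "PROTEIN_SEQ", "LOCATION"]
print(missing_columns(user_columns, 0))  # []
print(missing_columns(user_columns, 1))  # ['PREMRNA_SEQ']
```

A check like this is useful before building a user-provided PSIM, since a missing tier 0 column cannot be recovered from the gold-standard PSIM.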

The PSIM is read into the pipeline using the preprocess.correct_inputs.correct_psim function, which checks that all values make sense. If the fill_na argument is set to 'select', all NaN values in the user-provided PSIM are filled with values from the gold-standard PSIM when available, and with default values otherwise. If the fill_na argument is set to 'default', all NaN values are replaced with a default value according to the expression.gene_expression.gene_information.GeneInformation class.
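
The difference between the two fill_na modes can be sketched on a single row. This is a minimal stand-in written with plain dictionaries, not the code of correct_psim: the helper names (is_missing, fill_missing) are hypothetical, the defaults are the ones listed in this section, and the gold-standard values in the example are made up for illustration.

```python
import math

# Defaults listed in this section (only a few columns shown).
DEFAULTS = {"ALPHA_M": 0.06, "ALPHA_P": 0.02, "PTR": 65163.0, "TMD": 0}


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(user_row, gold_row, fill_na="select"):
    """Fill missing values in one row (hypothetical stand-in, not pipeline code).

    'select'  : use the gold-standard value when available, else the default.
    'default' : always use the default.
    Columns with neither a gold-standard value nor a default stay missing.
    """
    if fill_na not in ("select", "default"):
        raise ValueError("fill_na must be 'select' or 'default'")
    filled = dict(user_row)
    for column, value in user_row.items():
        if not is_missing(value):
            continue
        gold_value = gold_row.get(column)
        if fill_na == "select" and not is_missing(gold_value):
            filled[column] = gold_value
        elif column in DEFAULTS:
            filled[column] = DEFAULTS[column]
    return filled


# Made-up example rows; the gold-standard ALPHA_M of 0.11 is illustrative only.
user = {"HGNC_ID": "HGNC:5", "ALPHA_M": float("nan"), "PTR": float("nan")}
gold = {"HGNC_ID": "HGNC:5", "ALPHA_M": 0.11, "PTR": float("nan")}

print(fill_missing(user, gold, "select"))
# {'HGNC_ID': 'HGNC:5', 'ALPHA_M': 0.11, 'PTR': 65163.0}
print(fill_missing(user, gold, "default"))
# {'HGNC_ID': 'HGNC:5', 'ALPHA_M': 0.06, 'PTR': 65163.0}
```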

Columns:

HGNC_ID⁰: The gene ID in HGNC format (HGNC:####). There should be an entry for all genes that are included in the M_Model GPR and in non-machinery.
- Datatype: str
PREMRNA_SEQ¹: The gene pre-mRNA sequence. Values can only include 'A', 'C', 'G', 'U'; the sequence length must be >= the mRNA sequence length and >= 3 * the protein sequence length.
- Datatype: str
- Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
MRNA_SEQ¹: The gene mRNA sequence (isoform specific).
- Datatype: str
- Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
PROTEIN_SEQ¹: The gene protein sequence (isoform specific). Values can only include one-letter amino acid codes, and the sequence length must be <= (mRNA sequence length / 3).
- Datatype: str
- Default value: Technically none, but preprocess.correct_inputs.correct_psim will fill incorrect values with the gold-standard PSIM values.
POLYA_LENGTH³: The length of the mature mRNA polyA tail.
- Datatype: int
- Default value: Randomly drawn from a Johnson SU (johnsonsu) distribution
N_EXONS³: The number of exons in the pre-mRNA (isoform specific). Used to estimate the number of introns (as # of exons - 1).
- Datatype: int
- Default value: Estimated as 1 + (premrna sequence length)/6700
TMD⁴: The number of transmembrane domains contained in the sequence.
- Datatype: int
- Default value: 0
SP⁴: Whether the protein contains a secretory pathway signal peptide (SP). This option is not currently implemented, as all proteins destined for secretory pathway compartments are assumed to have a signal peptide. In the future, this option will be used for non-canonical secretion.
- Datatype: bool
- Default value: True for secretory pathway destined proteins, False otherwise
DSB⁴: The number of disulfide bonds in the protein.
- Datatype: int
- Default value: 0
GPI⁴: Whether a GPI anchor is present in the protein. 0 if not present, 1 otherwise.
- Datatype: int
- Default value: 0
OG⁴: The number of utilized O-linked glycosylation sites in the protein.
- Datatype: int
- Default value: 0
NG⁴: The number of utilized N-linked glycosylation sites in the protein.
- Datatype: int
- Default value: 0
ALPHA_M³: The mRNA degradation/turnover rate (hrs^-1). Used in calculating coupling constraints.
- Datatype: float
- Default value: 0.06 hrs^-1
ALPHA_P³: The protein degradation/turnover rate (hrs^-1). Used in calculating coupling constraints.
- Datatype: float
- Default value: 0.02 hrs^-1
PTR³: The protein-to-RNA ratio, as described in https://doi.org/10.15252/msb.20188513. Used in calculating coupling constraints.
- Datatype: float
- Default value: 65163
LOCATION²: The final location of the protein. Required for non-machinery, disregarded for machinery (pipeline infers location from the reaction compartments).
- Datatype: str, one of utils.parameters.compartments.keys()
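
The sequence requirements and the N_EXONS default listed above can be sketched as two small checks. This is an illustration under stated assumptions, not the pipeline's validation: sequence_problems and default_n_exons are hypothetical names, the mRNA is assumed to use the same 'A', 'C', 'G', 'U' alphabet as the pre-mRNA, the amino acid alphabet is assumed to be the 20 standard one-letter codes, and rounding the exon estimate down to an integer is also an assumption.

```python
RNA_LETTERS = set("ACGU")
# Assumption: the 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def sequence_problems(premrna, mrna, protein):
    """Return a list of violated requirements (empty if all pass).

    Hypothetical helper mirroring the requirements stated for PREMRNA_SEQ
    and PROTEIN_SEQ in this section.
    """
    problems = []
    if not set(premrna) <= RNA_LETTERS:
        problems.append("PREMRNA_SEQ contains letters other than A, C, G, U")
    if not set(mrna) <= RNA_LETTERS:  # assumed to share the pre-mRNA alphabet
        problems.append("MRNA_SEQ contains letters other than A, C, G, U")
    if not set(protein) <= AMINO_ACIDS:
        problems.append("PROTEIN_SEQ contains non-amino-acid letters")
    if len(premrna) < len(mrna):
        problems.append("PREMRNA_SEQ is shorter than MRNA_SEQ")
    if len(premrna) < 3 * len(protein):
        problems.append("PREMRNA_SEQ is shorter than 3 * PROTEIN_SEQ")
    if len(protein) > len(mrna) / 3:
        problems.append("PROTEIN_SEQ is longer than MRNA_SEQ / 3")
    return problems


def default_n_exons(premrna):
    """Estimate N_EXONS as 1 + (pre-mRNA length) / 6700, rounded down (assumption)."""
    return 1 + len(premrna) // 6700


# Toy sequences, far shorter than real transcripts.
print(sequence_problems("AUGGUAAGUGCCUAA", "AUGGCCUAA", "MA"))  # []
print(default_n_exons("A" * 13400))  # 3
```

With the exon estimate in hand, the number of introns follows as N_EXONS - 1, as described for the N_EXONS column.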

Additional Information in Gold-Standard PSIM:

Machinery⁵: Whether a protein is considered machinery according to the full Recon2.2 ('Metabolic'), the GPRs for expression reactions ('Expression'), both ('Both'), or neither ('Non-Machinery').
Source⁵: The database from which the isoform sequences were obtained.
Status⁵: 1 for entries that should work with the pipeline, 0 for entries that will cause an error in the pipeline.
The remaining columns⁵ are various IDs for the gene: 'GENE_SYMBOL', 'GeneID', 'ENSG_ID', 'ENST_ID', 'ENSP_ID', 'REFT_ID', 'REFP_ID', 'UNIPROT_ID'.