Frela is a web service for computing Functional relationships of protein pairs based on Gene Ontology (GO) annotations. It allows calculation of protein functional similarity based on their GO annotations for biological process (BP), molecular function (MF), and cellular component (CC) ontologies using various popular semantic similarity measures that can be combined arbitrarily with a set of widely used mixing strategies. The service supports combining ontologies, which was shown to improve scoring performance. Furthermore, in an attempt to overcome annotation bias, we compute z-scores when comparing proteins from human, mouse, and fly model organisms. We demonstrated that introduction of z-scores improves score performance in a large number of functional similarity measures in an orthologous/random gene pair setting, and the results can be explored interactively on this server.

## Functional relationships between protein pairs

Input information: interactive protein/protein comparison, or batch protein/protein comparison with file upload.

• Select analysis type
• Select organisms
• Use electronic annotations (IEA)
• Select ontology
• First identifier, Second identifier
• File
• Panel file
• Measures: Semantic similarity measures, Mixing strategies

## HELP TAB

### Calculation of protein similarity

Frela computes pairwise protein similarity from Gene Ontology (GO) terms in a two-step process. First, it computes a semantic similarity matrix M that contains all possible pairwise semantic similarity scores between the GO terms of the two proteins. The matrix is organized such that columns correspond to one protein's GO terms and rows to the other protein's GO terms. In the second step, a mixing strategy combines the matrix elements into a single functional similarity score.

We currently support six semantic similarity measures. Given a GO term c, its information content is defined as I(c)=-log p(c), where p(c) is the term probability computed from an annotation corpus. For any two GO terms s and t, let furthermore S(s,t) be the set of all common ancestors of these two GO terms.

• Resnik:

$$\mathit{simRes}(s,t) = \max_{c\in S(s,t)} I(c)$$

• Lin:

$$\mathit{simLin}(s,t) = \max_{c\in S(s,t)} \frac {2 \cdot I(c)} {I(s) + I(t)}$$

• Schlicker:

$$\mathit{simRel}(s,t) = \max_{c \in S(s,t)} \left( \frac{2 \cdot I(c)}{I(s) + I(t)} \cdot \left( 1 - p(c) \right) \right)$$

• Information coefficient:

$$\mathit{simIC}(s,t) = \frac{2\cdot\max_{c\in S(s,t)}I(c)}{I(s) + I(t)} \cdot \left( 1 - \frac 1 {1 + \max_{c\in S(s,t)}I(c)}\right)$$

• Jiang & Conrath:

$$\mathit{simJC}(s,t) = \frac 1 {1 + I(s) + I(t) - 2\cdot\max_{c\in S(s,t)}I(c) }$$

• Graph information content:

$$\mathit{simGIC}(s,t) = \frac {\sum_{c\in S(s,s) \cap S(t,t)}{I(c)}} {\sum_{c\in S(s,s) \cup S(t,t)}{I(c)}}$$
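The following is a minimal sketch of how some of these measures could be evaluated, not Frela's actual implementation. The toy ontology (`ANCESTORS`), the term probabilities (`PROB`), and all function names are hypothetical and chosen for illustration only. The sketch assumes that each term counts as its own ancestor, so that S(s,s) is the ancestor set of s, and it covers only the Resnik, Lin, Schlicker, Jiang & Conrath, and graph information content measures.

```python
import math

# Hypothetical toy ontology: each term maps to its ancestor set (term included),
# so S(s, t) is the intersection of two ancestor sets and S(s, s) is ANCESTORS[s].
ANCESTORS = {
    "root": {"root"},
    "a": {"root", "a"},
    "b": {"root", "a", "b"},
    "c": {"root", "a", "c"},
}
# Hypothetical term probabilities p(c); real values come from an annotation corpus.
PROB = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.125}


def info(term):
    """Information content I(c) = -log p(c)."""
    return -math.log(PROB[term])


def common_ancestors(s, t):
    """S(s, t): the set of common ancestors of s and t."""
    return ANCESTORS[s] & ANCESTORS[t]


def sim_res(s, t):
    """Resnik: information content of the most informative common ancestor."""
    return max(info(c) for c in common_ancestors(s, t))


def sim_lin(s, t):
    denom = info(s) + info(t)
    # The formula is undefined for I(s) + I(t) = 0; returning 0.0 is an assumption.
    return 2 * sim_res(s, t) / denom if denom > 0 else 0.0


def sim_rel(s, t):
    denom = info(s) + info(t)
    if denom <= 0:
        return 0.0  # same assumption as in sim_lin
    return max(2 * info(c) / denom * (1 - PROB[c]) for c in common_ancestors(s, t))


def sim_jc(s, t):
    return 1 / (1 + info(s) + info(t) - 2 * sim_res(s, t))


def sim_gic(s, t):
    shared = ANCESTORS[s] & ANCESTORS[t]
    union = ANCESTORS[s] | ANCESTORS[t]
    denom = sum(info(c) for c in union)
    # Undefined if all terms carry zero information; 0.0 is an assumption.
    return sum(info(c) for c in shared) / denom if denom > 0 else 0.0
```

For the terms `b` and `c` of this toy ontology, the only informative common ancestor is `a`, so `sim_lin("b", "c")` evaluates to 0.4 and `sim_gic("b", "c")` to 1/6, up to floating-point rounding.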

For a pair of proteins, any of the above semantic similarity measures can be used to derive the m×n semantic similarity matrix M = (s_ij), where s_ij is the semantic similarity between the GO term in row i and the GO term in column j. Based on this matrix, we offer five mixing strategies to compute the final pairwise protein functional similarity score:

• Maximum:

$$\mathit{fsMax} = \max_{i=1}^m \max_{j=1}^n s_{ij}$$

• Average:

$$\mathit{fsAvg} = \frac 1 {m \cdot n} \sum_{i=1}^m {\sum_{j=1}^n{s_{ij}}}$$

• Maximum averaged row & column best matches:

$$\mathit{fsBMM} = \max\left( \frac 1 m \sum_{i=1}^m{\max_{j=1}^n{s_{ij}}}, \frac 1 n \sum_{j=1}^n{\max_{i=1}^m{s_{ij}}} \right)$$

• Best match average:

$$\mathit{fsBMA} = \frac 1 2 \left( \frac 1 m \sum_{i=1}^m{\max_{j=1}^n{s_{ij}}} + \frac 1 n \sum_{j=1}^n{\max_{i=1}^m{s_{ij}}} \right)$$

• Averaged best match:

$$\mathit{fsABM} = \frac 1 {m+n} \left( \sum_{i=1}^m{\max_{j=1}^n{s_{ij}}} + \sum_{j=1}^n{\max_{i=1}^m{s_{ij}}} \right)$$
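As an illustration, the five mixing strategies can be written directly from their definitions. This is a sketch under the assumption that the semantic similarity matrix is given as a non-empty list of equally long rows; the function names are hypothetical and do not refer to Frela's code.

```python
def row_best(matrix):
    """Best match of every row: max over j of s_ij, for i = 1..m."""
    return [max(row) for row in matrix]


def col_best(matrix):
    """Best match of every column: max over i of s_ij, for j = 1..n."""
    return [max(col) for col in zip(*matrix)]


def mean(values):
    return sum(values) / float(len(values))


def fs_max(matrix):
    return max(row_best(matrix))


def fs_avg(matrix):
    return mean([s for row in matrix for s in row])


def fs_bmm(matrix):
    return max(mean(row_best(matrix)), mean(col_best(matrix)))


def fs_bma(matrix):
    return 0.5 * (mean(row_best(matrix)) + mean(col_best(matrix)))


def fs_abm(matrix):
    rows, cols = row_best(matrix), col_best(matrix)
    return (sum(rows) + sum(cols)) / float(len(rows) + len(cols))


# Made-up 2x3 matrix (m = 2 terms for one protein, n = 3 for the other).
M = [[0.2, 0.8, 0.4],
     [0.6, 0.1, 0.3]]
```

For this made-up matrix the row best matches are 0.8 and 0.6 and the column best matches are 0.6, 0.8, and 0.4, so the best match average is (0.7 + 0.6) / 2 = 0.65, while the averaged best match is (1.4 + 1.8) / 5 = 0.64. The two strategies differ whenever m and n differ.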

### Calculation modes and parameters

Frela can be used in two different modes: an interactive mode, which takes user input through web entry forms, and a batch mode, which performs many pairwise calculations from an uploaded file.

#### Parameters common to both calculation modes

• Select organisms:

Frela can be queried with identifiers from any organism stored in the GO database, since the backend is connected to a MySQL database downloaded from GO. However, computing z-scores involves CPU-intensive calculations, which we have pre-calculated for human, mouse, and fly only. For these three supported organisms, UniProt accession numbers are used as protein identifiers. For all other organisms, the identifier system of the corresponding GO annotations must be used. Please note that z-score statistics are available only if both proteins to be compared are from human, mouse, or fly.

For any pair of proteins (P, Q) to be scored, the organisms are determined in the following order:

1. If the identifier of P is a UniProt accession number from human, mouse, or fly, it is recognized as such, and its GO annotations are retrieved through an internal mapping of gene identifiers. (Human proteins are already provided by GO with UniProt accession numbers; mouse and fly UniProt accession numbers are internally mapped to the corresponding GO annotation identifiers.)
2. The same applies to Q.
3. If the organism pair formed by (P, Q) matches the one specified in the web form and both organisms are human, mouse, or fly, then z-scores are computed.
4. If a protein identifier was not recognized in the steps above, Frela falls back to querying the GO annotation database. In this case, no mapping of protein or gene identifiers is done; the provided identifier must already match what is stored in the GO MySQL database.

• Use electronic annotations (IEA):

The vast majority of GO annotations are assigned automatically. The GO evidence code "IEA" indicates that an annotation was "inferred from electronic annotation" and has not been reviewed by a curator. Such annotations can be excluded from Frela's semantic similarity and statistical calculations by selecting "No" in the dropdown menu.

• Select ontology:

Proteins can be compared using annotations from any of the three GO ontologies: biological process (BP), molecular function (MF), and cellular component (CC). In addition, we offer computation of a combined score from different ontologies, which has been shown to improve prediction accuracy. Pairs of ontologies (BP+MF, BP+CC, MF+CC) and all three ontologies (BP+MF+CC) can be combined into a single score by calculating the root mean square of the scores obtained from the individual ontologies.
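The root mean square combination can be sketched as follows. The function name and the example scores are hypothetical; the sketch only assumes that one functional similarity score per selected ontology is already available.

```python
import math


def combine_ontology_scores(scores):
    """Root mean square of the per-ontology scores (e.g. BP, MF, CC)."""
    if not scores:
        raise ValueError("at least one ontology score is required")
    return math.sqrt(sum(s * s for s in scores) / float(len(scores)))


# Made-up BP and MF scores for one protein pair.
bp_mf = combine_ontology_scores([0.6, 0.8])
```

With the made-up scores 0.6 and 0.8, the combined BP+MF score is the square root of (0.36 + 0.64) / 2 = 0.5, which is about 0.707. Compared with the arithmetic mean (0.7 here), the root mean square gives slightly more weight to the larger of the individual scores.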

• Measures:

It is possible to choose any combination of the currently supported six semantic similarity measures with the five mixing strategies. This choice will also be the default visualization for the "Show score performance" button, see also section "Score performance" below.

• Output

After pressing the "Submit" button, the job is added to the history of all previously processed jobs, indicating its current progress. Processing time depends on the job type and therefore the number of pairwise computations, and ranges from less than a second (single pairwise computation) to some minutes (two hundred and fifty thousand pairwise calculations with BP+MF+CC combined ontology).

Once the job has been completed, a button appears which, when pressed, opens the result list. This list is either sorted by z-score if the protein pairs are from human, fly, or mouse organisms annotated with BP, MF, or CC ontology, or otherwise by functional similarity raw score.

The result table always contains four fixed columns. These specify the UniProt accession numbers (if provided as input) and their corresponding identifiers in the GO annotation system (see "Select organisms"), named "UniProt" and "AnnCorp" (annotation corpus identifier system), respectively. UniProt accession numbers are linked to uniprot.org.

Depending on the choice of protein identifiers and ontology, each line of the result table contains one or more raw scores and z-scores. Each score is linked to its semantic similarity matrix, which consists of the pairwise semantic similarities of the GO terms associated with the protein in columns one and two (matrix rows) and the protein in columns three and four (matrix columns). Since z-scores are calculated from raw scores, both scores are linked to the same semantic similarity matrix. A color gradient visually encodes semantic similarity from white (no similarity) to red (high similarity). Each mixing strategy combines the elements of the semantic similarity matrix differently. Wherever possible, matrix cells are surrounded by borders as a visual cue that they have been taken into account in the calculation of the functional similarity.

#### Interactive protein/protein comparison

We support single pairwise protein similarity computation ("Select analysis type": "Protein/Protein") and a whole proteome scan ("Protein/All") against one of our model organisms, human, mouse, or fly. For pairwise calculations, two identifiers are required in the "First identifier" and "Second identifier" entry fields, respectively. Based on the supplied identifiers, Frela will automatically set the corresponding organisms in "Select organisms". The "Example" button fills in predefined values for a quick start. For a whole proteome scan, a single protein (or gene) identifier is scanned against the whole human, mouse, or fly proteome. Z-scores will not be available if the protein belongs to a different organism than human, mouse, or fly.

#### Batch protein/protein comparison with file upload

When choosing analysis type "Protein/Protein", a file with two columns is expected. Each column contains protein/gene identifiers from the organisms specified in the "Select organisms" dropdown menus (the order of the identifiers in the columns need not match the order given by the menus and may even vary between lines of the file). To prevent server overload, we currently limit the number of pairwise computations to two hundred and fifty thousand (250,000).

Analysis type "Protein/Panel" is a convenience function for comparing a list of proteins against a predefined panel of proteins, such as a disease gene panel. Two files need to be provided: the list of proteins (uploaded under "File") and the panel itself (uploaded under "Panel file"). Again, the organisms need to be specified in the "Select organisms" section, but the order is irrelevant. Each protein listed in the first file is compared to each protein in the panel, and the final list is then sorted by z-score or raw score. If the two files contain M and N proteins, respectively, the number of resulting pairwise calculations is limited to M×N < 250,000.
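Before uploading, it can be useful to check locally whether a protein list and a panel stay within this limit. The sketch below is a hypothetical helper, not part of Frela; the function name and the placeholder identifiers are assumptions made for illustration, and the scoring and sorting steps are left out.

```python
from itertools import product

MAX_PAIRS = 250000  # limit on pairwise computations stated in the text


def panel_pairs(proteins, panel, limit=MAX_PAIRS):
    """Return every (protein, panel member) pair, enforcing M x N < limit."""
    n_pairs = len(proteins) * len(panel)
    if n_pairs >= limit:
        raise ValueError(
            "%d pairwise calculations requested, limit is %d" % (n_pairs, limit)
        )
    return list(product(proteins, panel))


# Placeholder identifiers, not real accession numbers.
pairs = panel_pairs(["P1", "P2"], ["Q1", "Q2", "Q3"])
```

Here two proteins and a panel of three yield six pairs, starting with `("P1", "Q1")`. A list of 500 proteins against a panel of 500 would already reach 250,000 pairs and be rejected under the strict inequality.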

### Score performance

To compare raw functional similarity scores, z-scores, and the different semantic similarity measures and mixing strategies, we have created a test scenario in which each score is assessed for its ability to separate a set of orthologous gene pairs from an equally sized set of randomly chosen control pairs. For any score and some threshold h, we therefore define

| Threshold | Orthologous pair | Control pair |
|---|---|---|
| score > h | TP | FP |
| score ≤ h | FN | TN |

and select the optimal threshold h*, which minimizes the error rate $$\mathit{err} = \frac{\mathrm{FP} + \mathrm{FN}} {\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}}.$$
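The threshold search can be sketched as below. This is an illustrative brute-force version under the assumption that only observed score values (plus one threshold below all scores) need to be tried, since the classification "score > h" changes only at those values. The function name and the example scores are made up.

```python
def optimal_threshold(ortholog_scores, control_scores):
    """Return (h*, err) minimizing (FP + FN) / (TP + FP + TN + FN)."""
    total = len(ortholog_scores) + len(control_scores)
    # Candidate thresholds: below every score, then each observed score value.
    candidates = [float("-inf")] + sorted(set(ortholog_scores) | set(control_scores))
    best_h, best_err = None, None
    for h in candidates:
        tp = sum(1 for s in ortholog_scores if s > h)
        fn = len(ortholog_scores) - tp
        fp = sum(1 for s in control_scores if s > h)
        err = float(fp + fn) / total
        # Ties keep the smallest threshold; this tie-breaking is an assumption.
        if best_err is None or err < best_err:
            best_h, best_err = h, err
    return best_h, best_err


# Made-up scores for four orthologous pairs and four control pairs.
h_star, err = optimal_threshold([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1])
```

For these made-up scores, the thresholds 0.3 and 0.5 both misclassify one of the eight pairs, so the minimal error rate is 0.125 and the sketch reports the smaller threshold, 0.3.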

The results of this analysis are available on this web site. For the parameters currently selected in the "Frela" tab, pressing the "Show score performance" button at the bottom opens a new tab that visualizes a comparison of raw scores versus z-scores for the chosen ontology. Large red dots mark the parameters chosen in the main "Frela" tab; hovering over any data point shows the optimal score and error rate of the semantic similarity measure and mixing strategy for that data point. This gives an immediate overview of measure performance, highlights the performance of the selected measure, and helps in selecting a suitable combination of semantic similarity measure and mixing strategy.

### Citing Frela

If you find Frela useful for your work, please cite the following publication:

Weichenberger CX, Palermo A, Pramstaller PP, Domingues FS (2017) Exploring Approaches for Detecting Protein Functional Similarity within an Orthology-based Framework. Scientific Reports, 7:381. [PubMed] [DOI]

### EU General Data Protection Regulation

For information on the EU Regulation No. 2016/679 - General Data Protection Regulation (GDPR) with respect to this web site, we refer to the privacy policy statement on our main web site.

It is possible to download and run Frela on your local system, and in addition, set up a local mirror of the Frela web server running on an Apache server. Frela has been developed and tested on a Debian GNU/Linux testing system (stretch), but we expect it to run on any major Linux distribution.

Please install the required software/modules needed to run all the tools present in the package. The most important libraries are:

• python >= 2.7
• python-numpy
• python-igraph or python-networkx
• python-mysql.connector

When setting up the web interface, also include the following libraries:

• apache2
• libapache2-mod-wsgi

The package has been developed on the basis of the Dintor framework. Thus, in addition to protein functional similarity calculation, it includes a set of command line tools that export a full or partial GO graph as a GraphML XML document, export all annotated genes or gene products in GO Annotation File (GAF) 2.0 format, and generate a graph that serves as input to Frela. For optimal performance, however, this requires installing the GO MySQL database.

For complete documentation with examples and descriptions, please read the INSTALL file available in the package folder.