GreedyPlus: An Algorithm for the Alignment of Interface Interaction Networks

Brian Law and Gary Bader

Abstract

The increasing ease and accuracy of protein-protein interaction detection have made it possible to map the interactomes of multiple species. We now have an opportunity to compare species to better understand how interactomes evolve. Just as DNA and protein sequence alignment algorithms were required for comparative genomics, network alignment algorithms are required for comparative interactomics. A number of network alignment methods have been developed for protein-protein interaction networks, where proteins are represented as vertices linked by edges if they interact. Recently, protein interactions have been mapped at the level of amino acid positions, which can be represented as an interface-interaction network (IIN), where vertices represent binding sites, such as protein domains and short sequence motifs. However, current algorithms are not designed to align these networks and generally fail to do so in practice. We present a greedy algorithm, GreedyPlus, for IIN alignment, combining data from diverse sources, including network, protein and binding site properties, to identify putative orthologous relationships between interfaces in available worm and yeast data. GreedyPlus is fast and simple, allowing for easy customization of behaviour, yet is still capable of generating biologically meaningful network alignments.

Downloads

Latest Release

GreedyPlus: (May 02, 2015) Source: GreedyPlus_v0.2.zip
Note: Initial release.

Usage

Open a terminal and run: java -jar <file location>/GreedyPlus_v0.1.jar <parameters file (optional)>

By default, when no parameters file is specified, this implementation of GreedyPlus uses the file at <file location>/worm_yeast/best.reduced_max.params as the input parameters file. This file represents the trained, minimal set of parameters as described in the paper. The output is written to <file location>/GreedyPlus.worm_yeast.reduced_max.xgmml, which can then be imported directly into Cytoscape.

Description of parameters file:

The parameters file contains 6 sections. Lines are read in order from the top. Comment lines begin with an exclamation mark (!).

The first two lines are the locations of two adjacency lists, each representing one of the two input networks. The third line is the location of a file listing all pairs of orthologous proteins between the two input networks. The fourth line is the location of a file listing all pairs of orthologous nodes between the two input networks. The fifth line specifies where the output should be written. The .xgmml extension will be appended automatically. The remaining lines specify the input scoring matrix files for the algorithm. They should be listed, one per line, in tab-delimited format: file location, type (p for protein, d for domain, b for binding site/ligand), weight, and threshold (- for no threshold).
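
The layout described above can be illustrated with a small reader. This is only a sketch, not the parser GreedyPlus itself uses: the class and function names, the file names in the example, the skipping of blank lines, and the treatment of weight and threshold as floating-point numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoringMatrixEntry:
    path: str
    kind: str                   # 'p' (protein), 'd' (domain) or 'b' (binding site/ligand)
    weight: float               # assumed to be numeric
    threshold: Optional[float]  # None when the file gives '-'


@dataclass
class Parameters:
    network_a: str
    network_b: str
    ortholog_proteins: str
    ortholog_nodes: str
    output_prefix: str          # '.xgmml' is appended by the tool
    matrices: List[ScoringMatrixEntry]


def parse_parameters(text: str) -> Parameters:
    # Drop comment lines ('!' prefix); skipping blank lines is an assumption.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("!")]
    if len(lines) < 5:
        raise ValueError("expected at least five non-comment lines")
    matrices = []
    for ln in lines[5:]:
        path, kind, weight, threshold = ln.split("\t")
        if kind not in ("p", "d", "b"):
            raise ValueError(f"unknown matrix type: {kind!r}")
        matrices.append(ScoringMatrixEntry(
            path, kind, float(weight),
            None if threshold == "-" else float(threshold)))
    return Parameters(*[ln.strip() for ln in lines[:5]], matrices)


# Hypothetical file names, for illustration only.
example = "\n".join([
    "! worm vs. yeast",
    "worm.adj", "yeast.adj", "proteins.orth", "nodes.orth", "out/alignment",
    "blast.tsv\tp\t0.5\t-",
    "domains.tsv\td\t1.0\t0.3",
])
params = parse_parameters(example)
```

The sketch does not model an omitted orthologous proteins or nodes file, because the page does not say how an omission is written in the parameters file.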

Network Adjacency Files

Each line in the adjacency file lists two nodes in the network, representing an edge between them. Each node should be specified in the format <protein name>, <start position>, <end position>. The ordering of the nodes/edges is not relevant. These files do not need to contain a complete list of all nodes to be aligned; the algorithm uses the scoring matrix files (described below) as the definitive source for nodes, as there may be disconnected nodes within the input networks.
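
As a rough illustration of the adjacency format, the following sketch reads edges into a set of undirected pairs. The page does not state how the two nodes on a line are separated, so the tab separator here is an assumption, as are the integer positions, the function names and the protein names in the example.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple


class Node(NamedTuple):
    protein: str
    start: int
    end: int


def parse_node(field: str) -> Node:
    # "<protein name>, <start position>, <end position>"
    protein, start, end = (part.strip() for part in field.split(","))
    return Node(protein, int(start), int(end))


def parse_adjacency(text: str) -> Tuple[Set[Node], Set[FrozenSet[Node]]]:
    nodes: Set[Node] = set()
    edges: Set[FrozenSet[Node]] = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: the two nodes on a line are separated by a tab.
        left, right = (parse_node(f) for f in line.split("\t"))
        nodes.update((left, right))
        # frozenset makes the edge unordered, so A-B and B-A coincide.
        edges.add(frozenset((left, right)))
    return nodes, edges


# Hypothetical protein names; the second line repeats the first edge reversed.
sample = "ABC-1, 10, 60\tXYZ-2, 5, 40\nXYZ-2, 5, 40\tABC-1, 10, 60\n"
nodes, edges = parse_adjacency(sample)
```

Storing each edge as an unordered pair reflects the statement that node and edge ordering is not relevant. Protein names that themselves contain commas would need a different splitting rule.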

Orthologous Proteins File

Each line in this file should list two orthologous proteins, in tab-delimited format, with the first protein coming from the first species and the second protein coming from the second species (order as specified in the first two lines of the parameters file).

This file may be omitted, but the RPO statistic will not be calculated.

Orthologous Nodes File

Each line in this file should list two orthologous nodes, in tab-delimited format, with the first node coming from the first species and the second node coming from the second species (order as specified in the first two lines of the parameters file).

This file may be omitted, but the OVP statistic will not be calculated.

Scoring Matrix Files

Each scoring matrix file should be a tab-delimited file listing the similarity scores between every pair of proteins/domains/binding sites in the input networks. Those from the first species should be listed in the first row, while those from the second species should be listed in the first column.
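
A minimal sketch of reading such a matrix follows. It assumes the top-left corner cell is empty or can be ignored and that every score parses as a floating-point number; the function name and the labels in the example are hypothetical.

```python
from typing import Dict, Tuple


def read_scoring_matrix(text: str) -> Dict[Tuple[str, str], float]:
    """Return scores keyed by (species-1 label, species-2 label)."""
    rows = [ln.split("\t") for ln in text.splitlines() if ln.strip()]
    header = rows[0][1:]  # species-1 labels; the corner cell is ignored
    scores: Dict[Tuple[str, str], float] = {}
    for row in rows[1:]:
        label_2, values = row[0], row[1:]  # species-2 label, then its scores
        if len(values) != len(header):
            raise ValueError(
                f"row {label_2!r} has {len(values)} scores, "
                f"expected {len(header)}")
        for label_1, value in zip(header, values):
            scores[(label_1, label_2)] = float(value)
    return scores


# Hypothetical labels: W* from the first species, Y* from the second.
matrix_text = "\tW1\tW2\nY1\t0.9\t0.1\nY2\t0.2\t0.8\n"
scores = read_scoring_matrix(matrix_text)
```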

These files need to correspond with the input network files: every node listed in the network files must also be present in either every domain scoring matrix file or every binding site scoring matrix file, and the protein name for each node must be present in every protein scoring matrix file. The reverse is not required.
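
The correspondence rule can be checked before running an alignment. The sketch below is one possible reading of it, not the tool's own validation: it assumes each matrix has already been reduced to the set of labels it contains for one species, that node labels are (protein name, start, end) tuples, and that an empty list of domain or binding site matrices covers no nodes. All names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NodeKey = Tuple[str, int, int]  # (protein name, start, end)


def find_uncovered_nodes(
    network_nodes: Iterable[NodeKey],
    domain_matrices: List[Set[NodeKey]],
    binding_matrices: List[Set[NodeKey]],
    protein_matrices: List[Set[str]],
) -> List[Tuple[NodeKey, str]]:
    def in_all(label, matrices) -> bool:
        # Assumption: with no matrices of a given type, nothing is covered.
        return bool(matrices) and all(label in m for m in matrices)

    problems = []
    for node in sorted(network_nodes):
        # Node must be in every domain matrix or in every binding site matrix.
        if not (in_all(node, domain_matrices) or in_all(node, binding_matrices)):
            problems.append((node, "missing from domain and binding-site matrices"))
        # The node's protein name must be in every protein matrix.
        if not all(node[0] in m for m in protein_matrices):
            problems.append((node, "protein missing from a protein matrix"))
    return problems


# Hypothetical labels for one species.
net = {("ABC-1", 10, 60), ("XYZ-2", 5, 40)}
domain = [{("ABC-1", 10, 60)}]
binding = [{("XYZ-2", 5, 40)}, {("XYZ-2", 5, 40), ("QRS-3", 1, 20)}]
protein = [{"ABC-1"}]
problems = find_uncovered_nodes(net, domain, binding, protein)
```

Extra labels in a matrix, such as the hypothetical QRS-3 above, are not reported, which matches the statement that the reverse is not required.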
