Cellular location analysis in oral pathogens
Chi Yang and Chuan-Hsiung Chang
Introduction
The function of a protein is closely correlated with its subcellular location. Each cellular compartment has specific roles, and the proteins in a compartment are there to fulfill those roles, so knowing which compartment a protein is localized to allows possible functions to be inferred. Subcellular location prediction can therefore expedite functional determination, particularly when a BLAST homology search fails to reveal the function of a protein.
Cellular components of microorganisms are also important immunogenic determinants. However, only the components that are exposed to host immune surveillance are accessible to the host's immune system. Therefore, knowledge of the subcellular location of proteins in pathogens is important in designing potential drugs or subunit vaccines. Several methods are available to predict the subcellular location of proteins in Gram-positive and Gram-negative bacteria, including support vector machine (SVM), K-nearest neighbor (KNN), and other machine-learning approaches. Recent tools developed by various groups can classify proteins as cytoplasmic, periplasmic, inner membrane, outer membrane (Gram-negative bacteria), or extracellular. In this analysis, six tools (Table 1) were used to predict the cellular locations of proteins from 13 oral bacteria and 2 plasmids. One of them, SignalP, predicts only the presence of signal peptides, not cellular locations. Results from this analysis were integrated into the final prediction report.
All raw data, that is, the probability or reliability indices generated by the different tools, were stored in the database.
In addition, the most probable location(s) of each protein were summarized and presented in each gene record. The complete results for each genome are also available for bulk download.
Table 1. Comparison of protein cellular location prediction tools, including the six used in this analysis
| Used in this analysis | Program name | Standalone download available | Prediction method | Batch mode | Prediction speed | Reference | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Yes | psortb | Yes | Multiple classifier | Yes | Moderate | Gardy et al., 2005 (1) | |
| Yes | PSORT | Yes | Expert system | Yes | Fast | Nakai and Kanehisa, 1991 (2) | |
| Yes | SignalP | Yes | Predicts traditional N-terminal signal peptides | Up to 2000 sequences | Fast | Bendtsen et al., 2004 (3) | |
| No | Gpos/Gneg-PLoc | No | K-nearest neighbor-based classifier | No | Fast | Shen and Chou, 2007 (5); Chou and Shen, 2006 (6) | |
| Yes | CELLO | No | Two-level SVM | Yes | Fast | Yu et al., 2006 (4) | |
| No | PSLpred | No | SVM and PSI-BLAST | Yes | Fast | Bhasin et al., 2005 (7) | Gram-negative bacteria only |
| Yes | Proteome Analyst's Subcellular Localization Server (SubCell) | No | Machine learning | Yes | Moderate | Lu et al., 2004 (8) | |
| No | LOCtree | No | Ab initio prediction of subcellular localization using hierarchical SVMs | Up to 100 sequences | Slow | Nair and Rost, 2005 (9) | Non-membrane proteins only |
| Yes | SubLoc 1.0 | No | SVM | Yes | Fast | Hua and Sun, 2001 (10) | Non-membrane proteins only |
Key steps of the integrated cellular location prediction pipeline (Fig. 1)
- After all the web-based prediction tools have been run, their results are saved in HTML format.
- The results generated by the standalone tools are saved directly in text format.
- A Perl script parses all the prediction results into tab-delimited format.
- The tab-delimited results are uploaded into the Oralgen database.
- For each gene record, an HTML-formatted summary table (Fig. 2) is displayed.
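The parsing step above ends with one tab-delimited row per gene and tool. The pipeline itself used a Perl script that is not reproduced here; the following Python sketch only illustrates what writing such rows could look like once a tool's output has already been parsed. The record layout (`gene_id`, `description`, `tool`, `scores`), the location labels, and the function name `to_tab_delimited` are assumptions made for illustration; the column order follows the bulk-download description in "Results for download".

```python
import csv
import io

# Column order taken from the bulk-download description; the labels
# themselves are hypothetical names chosen for this sketch.
LOCATIONS = [
    "extracellular",
    "outer_membrane_or_cell_wall",
    "periplasmic",
    "inner_membrane_or_membrane",
    "cytoplasmic",
]


def to_tab_delimited(records):
    """Serialize already-parsed prediction records as tab-delimited text.

    Each record is a dict with the keys gene_id, description, tool and
    scores (a dict mapping a location label to a numeric value).
    A location the tool does not report is written as an empty field.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for rec in records:
        row = [rec["gene_id"], rec["description"], rec["tool"]]
        for loc in LOCATIONS:
            score = rec["scores"].get(loc)
            row.append("" if score is None else f"{score:g}")
        writer.writerow(row)
    return buf.getvalue()
```

Leaving unreported locations empty, rather than writing zero, keeps "not predicted by this tool" distinguishable from "predicted with low reliability".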

Figure 1. The integration of the prediction results from different tools.
Data presentation
- Figure 2 shows the summary results in each gene record.
- In this example, the protein was predicted to be extracellular. For each tool, the summary picks the result with the highest reliability score, provided that the score is larger than the cut-off (0.5 for PSORT and 2.5 for psortb). If more than one tool predicts the same location, that location is placed first.
- Because different tools use different scales for reliability or probability, each cell in the table has its own color scale. The more reliable the result, the darker the red color. A gray cell means that the tool does not make predictions for that cellular location.
- The SubLoc tool does not differentiate between Gram-positive and Gram-negative bacteria when predicting membrane proteins. Therefore, if other tools predict a protein to be a membrane, inner membrane, or outer membrane protein, the results from SubLoc are ignored.
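The summary rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code behind the Oralgen summary table: the input layout (a dict mapping tool name to a dict of location scores), the function names, and the location labels are hypothetical, and tools for which the text gives no cut-off are assumed here to need only a score above zero.

```python
from collections import Counter

# Cut-offs stated in the text. Other tools default to 0.0 (assumption).
CUTOFFS = {"PSORT": 0.5, "psortb": 2.5}

# Hypothetical labels for the membrane calls that cause SubLoc to be ignored.
MEMBRANE_LOCATIONS = {"membrane", "inner membrane", "outer membrane"}


def top_call(tool, scores):
    """Return the tool's highest-scoring location, or None if the score
    is not larger than the tool's cut-off."""
    if not scores:
        return None
    location, score = max(scores.items(), key=lambda item: item[1])
    return location if score > CUTOFFS.get(tool, 0.0) else None


def summarize(predictions):
    """Order the predicted locations for one protein.

    predictions maps a tool name to a dict of location -> score.
    Locations called by more tools come first; ties keep the order in
    which they were first seen.
    """
    calls = {}
    for tool, scores in predictions.items():
        location = top_call(tool, scores)
        if location is not None:
            calls[tool] = location

    # Ignore SubLoc when any other tool calls a membrane location.
    membrane_called = any(
        loc in MEMBRANE_LOCATIONS
        for tool, loc in calls.items()
        if tool != "SubLoc"
    )
    if membrane_called:
        calls.pop("SubLoc", None)

    counts = Counter(calls.values())
    return [location for location, _ in counts.most_common()]
```

For a protein that two tools call extracellular and one tool calls cytoplasmic, `summarize` would list extracellular first, matching the rule that agreement between tools moves a location to the front.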

Figure 2. The summary table of the prediction results from different tools in each gene record of the Oralgen database.
Results for download
- Summary page - This page provides a complete list of prediction for all proteins in each genome.
For each location, the ratio is the number of tools that predicted that location over the total number of tools used for predicting that particular cellular location. The ratio in each cell is colored from white to red, with darker red indicating the most likely cellular location predicted for that protein.
- Detailed page - This page displays the raw data for all proteins in a genome. Each row shows the values predicted by one tool. The intensity of the red color indicates the reliability.
- In the tab-delimited file (database dump) for bulk download, the data are in the following order:
  - Gene_id
  - Description of proteins
  - Tool name
  - The reliability of each protein predicted as extracellular
  - The reliability of each protein predicted as outer membrane (Gram-negative) or cell wall (Gram-positive)
  - The reliability of each protein predicted as periplasmic space
  - The reliability of each protein predicted as inner membrane (Gram-negative) or membrane (Gram-positive)
  - The reliability of each protein predicted as cytoplasmic
- In the tab-delimited file (summary results), the data are in the following order:
  - Gene_id
  - Summary results
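As a rough illustration of how the database dump could be consumed, the Python sketch below reads rows in the column order listed above and computes, per gene and location, a ratio of the kind shown on the summary page. Several points are assumptions rather than documented behavior: that a tool's prediction is its highest-valued location, that an empty field means the tool does not cover that location, that the file has no header row, and the location labels and function names themselves.

```python
import csv
import io

# Hypothetical labels for the five reliability columns, in file order.
LOCATIONS = [
    "extracellular",
    "outer membrane / cell wall",
    "periplasmic",
    "inner membrane / membrane",
    "cytoplasmic",
]


def read_dump(text):
    """Yield (gene_id, description, tool, scores) for each complete row.

    scores maps a location label to a float and omits empty fields.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3 + len(LOCATIONS):
            continue  # skip blank or truncated lines
        gene_id, description, tool = row[:3]
        scores = {
            loc: float(value)
            for loc, value in zip(LOCATIONS, row[3:])
            if value.strip()
        }
        yield gene_id, description, tool, scores


def location_ratios(rows):
    """Return {gene_id: {location: (tools_predicting, tools_covering)}}."""
    ratios = {}
    for gene_id, _description, _tool, scores in rows:
        if not scores:
            continue
        best = max(scores, key=scores.get)  # the tool's top location
        per_gene = ratios.setdefault(gene_id, {})
        for loc in scores:
            predicted, total = per_gene.get(loc, (0, 0))
            per_gene[loc] = (predicted + (loc == best), total + 1)
    return ratios
```

Calling `location_ratios(read_dump(text))` on the contents of a downloaded file would give one `(predicted, total)` pair per gene and location, which can then be mapped to a white-to-red scale if desired.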
References
- J. L. Gardy, M. R. Laird, F. Chen, S. Rey, C. J. Walsh, M. Ester and F. S. L. Brinkman (2005). PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21, 617-623.
- Nakai K, Kanehisa M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Proteins, 11, 95-110.
- Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak. (2004). Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783-795.
- Chin-Sheng Yu, Yu-Ching Chen, Chih-Hao Lu, Jenn-Kang Hwang. (2006). Prediction of protein subcellular localization. Proteins, 64, 643-651.
- Hong-Bin Shen and Kuo-Chen Chou. (2007). Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins. Protein Eng. Des. Sel., 20, 39-46.
- Kuo-Chen Chou and Hong-Bin Shen. (2006). Large-Scale Predictions of Gram-Negative Bacterial Protein Subcellular Locations. J. Proteome Res., 5, 3420-3428.
- Bhasin M., Garg A., Raghava GP. (2005). PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics, 21, 2522-2524.
- Z. Lu, D. Szafron, R. Greiner, P. Lu, D.S. Wishart, B. Poulin, J. Anvik, C. Macdonell and R. Eisner. (2004). Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics, 20, 547-556.
- Nair R and Rost B. (2005). Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol., 348, 85-100.
- Hua S, Sun Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.