Cellular location analysis in oral pathogens

Chi Yang and Chuan-Hsiung Chang

Introduction

The function of a protein is closely correlated with its subcellular location. Each cellular compartment has specific roles, and the proteins in a compartment are there to fulfill those roles, so possible functions can be inferred for a protein from the compartment it is localized to. Subcellular location prediction therefore expedites the functional determination of a protein, particularly when a BLAST homology search leaves its function unknown.

Cellular components of microorganisms are also important immunogenic determinants. However, only the components that are exposed to host immune surveillance are susceptible to the body's immune system. Knowledge of the subcellular location of proteins in pathogens is therefore important in designing potential drugs or subunit vaccines. Several methods are available to predict the subcellular location of proteins in Gram-positive and Gram-negative bacteria, including support vector machines (SVM), K-nearest neighbor (KNN) classifiers, and other machine-learning approaches. Recent tools developed by various groups can classify proteins as cytoplasmic, periplasmic, inner membrane, outer membrane (Gram-negative bacteria), or extracellular.

In this analysis, six tools (Table 1) were used to predict the cellular locations of proteins from 13 oral bacteria and 2 plasmids. One of the tools, SignalP, predicts only the presence of signal peptides, not cellular locations. Results from this analysis were integrated into the final prediction report. All the raw data, that is, the probability or reliability index generated by each tool, were stored in the database. In addition, the most probable location(s) of each protein were summarized and are presented in each gene record. The complete results for each genome are also available for bulk download.

Table 1. Comparison of protein cellular location prediction tools. The six tools marked "Yes" in the first column were used in this analysis.

Used in this analysis | Program name | Downloadable as standalone | Method used to predict location | Batch mode | Prediction speed | Reference | Notes
--- | --- | --- | --- | --- | --- | --- | ---
Yes | psortb | Yes | Multiple classifier | Yes | Moderate | Gardy et al., 2005 (1) |
Yes | PSORT | Yes | Expert system | Yes | Fast | Nakai and Kanehisa, 1991 (2) |
Yes | SignalP | Yes | Predicts traditional N-terminal signal peptides | Up to 2000 seqs. | Fast | Bendtsen et al., 2004 (3) |
No | Gpos/Gneg-PLoc | No | K-nearest neighbor-based classifier | No | Fast | Shen and Chou, 2007 (5); Chou and Shen, 2006 (6) |
Yes | CELLO | No | Two-level SVM | Yes | Fast | Yu et al., 2006 (4) |
No | PSLpred | No | SVM and PSI-BLAST | Yes | Fast | Bhasin et al., 2005 (7) | Gram-negative bacteria only
Yes | Proteome Analyst's Subcellular Localization Server (SubCell) | No | Machine learning | Yes | Moderate | Lu et al., 2004 (8) |
No | LOCtree | No | Ab initio prediction of subcellular localization using hierarchical SVMs | Up to 100 seqs. | Slow | Nair and Rost, 2005 (9) | Non-membrane proteins only
Yes | SubLoc 1.0 | No | SVM | Yes | Fast | Hua and Sun, 2001 (10) | Non-membrane proteins only

Key steps of the integrated cellular location prediction pipeline (Fig. 1)

  1. After all the web-based prediction tools have been run, their results are saved in HTML format.
  2. The results generated by the standalone tools are saved directly in text format.
  3. A Perl script parses all the prediction results into tab-delimited format.
  4. The tab-delimited results are uploaded into the Oralgen database.
  5. For each gene record, an HTML-formatted summary table (Fig. 2) is displayed.

Figure 1. The integration of the prediction results from different tools.
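
The serialization step of the pipeline can be sketched briefly. The original pipeline used a Perl script that is not shown here; the Python below is only an illustration under stated assumptions. The record keys (`gene_id`, `description`, `tool`, `scores`), the location labels, the function name `to_tab_delimited`, and the use of an empty field for a location that a tool does not predict are all hypothetical. Only the column order follows the database dump described under "Results for download".

```python
import csv
import io

# Column order of the location fields in the database dump
# (labels are illustrative; only the order comes from the text).
LOCATIONS = [
    "extracellular",
    "outer_membrane_or_cell_wall",
    "periplasmic",
    "inner_membrane_or_membrane",
    "cytoplasmic",
]

def to_tab_delimited(records):
    """Serialize already-parsed predictions as tab-delimited text.

    `records` is an iterable of dicts with the hypothetical keys
    'gene_id', 'description', 'tool' and 'scores' (location -> value).
    A location missing from 'scores' is written as an empty field.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for rec in records:
        scores = rec["scores"]
        row = [rec["gene_id"], rec["description"], rec["tool"]]
        row += ["" if scores.get(loc) is None else str(scores[loc])
                for loc in LOCATIONS]
        writer.writerow(row)
    return buffer.getvalue()

# Made-up record, for illustration only.
example = [
    {"gene_id": "GENE0001", "description": "hypothetical protein",
     "tool": "psortb",
     "scores": {"extracellular": 9.7, "cytoplasmic": 0.1}},
]
print(to_tab_delimited(example), end="")
```

Each output line carries one tool's values for one protein, so a protein analyzed by six tools would occupy six lines of the file.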

Data presentation

  1. Figure 2 shows the summary results in each gene record.
  2. In this example, the protein was predicted to be extracellular. In the summary, the result with the highest reliability score is picked for each tool, provided that the score exceeds the cut-off (0.5 for PSORT and 2.5 for psortb). If more than one tool predicts the same location, that location is placed first.
  3. Because different tools use different scales for reliability or probability, each cell in the table has its own color scale. The more reliable the result, the darker the red. Gray means that the tool does not predict that cellular location.
  4. SubLoc does not differentiate between Gram-positive and Gram-negative bacteria when predicting membrane proteins. Therefore, if other tools predict a protein to be a membrane, inner membrane, or outer membrane protein, the results from SubLoc are ignored.
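
The selection rules in items 2 and 4 can be sketched as follows. This is a minimal illustration rather than the code used for Oralgen. The cut-offs for PSORT and psortb come from the text; the default cut-off of 0.0 for the other tools, the location labels, the tie-breaking order, and the function names `best_call` and `summarize` are assumptions.

```python
from collections import Counter

# Cut-offs stated in the text. Tools without a stated cut-off
# fall back to 0.0, which is an assumption of this sketch.
CUTOFFS = {"PSORT": 0.5, "psortb": 2.5}
MEMBRANE_LOCATIONS = {"membrane", "inner membrane", "outer membrane"}

def best_call(tool, scores):
    """Return the tool's top-scoring location, or None if the
    top score does not exceed the tool's cut-off."""
    if not scores:
        return None
    location, score = max(scores.items(), key=lambda item: item[1])
    return location if score > CUTOFFS.get(tool, 0.0) else None

def summarize(predictions):
    """predictions: {tool: {location: score}} for one protein.
    Returns locations ordered by the number of supporting tools."""
    calls = {tool: best_call(tool, scores)
             for tool, scores in predictions.items()}
    calls = {tool: loc for tool, loc in calls.items() if loc is not None}
    # Ignore SubLoc when any other tool calls a membrane location.
    if any(loc in MEMBRANE_LOCATIONS
           for tool, loc in calls.items() if tool != "SubLoc"):
        calls.pop("SubLoc", None)
    votes = Counter(calls.values())
    # Stable sort: ties keep the order in which locations were first seen.
    return [loc for loc, _ in sorted(votes.items(), key=lambda kv: -kv[1])]

# Made-up scores, for illustration only.
predictions = {
    "psortb": {"extracellular": 9.7, "cytoplasmic": 0.1},
    "PSORT": {"extracellular": 0.3, "cytoplasmic": 0.2},  # below cut-off
    "CELLO": {"extracellular": 3.1, "periplasmic": 0.9},
    "SubLoc": {"cytoplasmic": 0.8},
}
print(summarize(predictions))  # ['extracellular', 'cytoplasmic']
```

Here two tools agree on extracellular, so it is listed first, while the PSORT result is dropped because its best score does not exceed 0.5.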

Figure 2. The summary table of the prediction results from different tools in each gene record of the Oralgen database.
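
The per-tool color scale can be illustrated with a small helper. This is a sketch under assumptions: the text does not give the actual colors or scale limits, so the linear white-to-red mapping, the gray value, the `scale_max` parameter, and the function name `cell_color` are hypothetical.

```python
GRAY = "#cccccc"  # assumed color for "tool does not predict this location"

def cell_color(score, scale_max):
    """Map a score on a tool-specific scale [0, scale_max] to a
    white-to-red hex color; None means the tool gives no prediction."""
    if score is None:
        return GRAY
    fraction = min(max(score / scale_max, 0.0), 1.0)
    # Red stays at 255; green and blue fade from 255 (white) to 0 (red).
    level = round(255 * (1.0 - fraction))
    return f"#ff{level:02x}{level:02x}"

print(cell_color(10, 10))    # '#ff0000'
print(cell_color(0, 1.0))    # '#ffffff'
print(cell_color(None, 10))  # '#cccccc'
```

Passing each tool's own maximum as `scale_max` keeps scores from tools with different ranges visually comparable within the same table.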

Results for download

  1. Summary page - This page provides a complete list of predictions for all proteins in each genome. For each location, the ratio is the number of tools that predicted that location over the total number of tools used for that particular cellular location prediction. The ratio in each cell is colored from white to red, with darker red indicating the most likely cellular location predicted for that protein.
  2. Detailed page - This page displays the raw data for all proteins in a genome. Each row shows the values predicted by one tool. The red color indicates the reliability.
  3. In the tab-delimited file (database dump) for bulk download, the data are in the following order:
    1. Gene_id
    2. Description of proteins
    3. Tool name
    4. The reliability of each protein predicted as extracellular
    5. The reliability of each protein predicted as outer membrane (Gram-negative), cell wall (Gram-positive)
    6. The reliability of each protein predicted as periplasmic space
    7. The reliability of each protein predicted as inner membrane (Gram-negative), membrane (Gram-positive)
    8. The reliability of each protein predicted as cytoplasmic
  4. In the tab-delimited file (summary results), the data are in the following order:
    1. Gene_id
    2. Summary results

Organism | Gram staining | Summary page | Detailed page | Tab-delimited file (database dump) | Tab-delimited file (summary data)
--- | --- | --- | --- | --- | ---
Actinobacillus actinomycetemcomitans HK1651 | - | Results_Act_summary.html | Results_Act.html | Results_Act.txt | Results_Act_short.txt
Actinomyces naeslundii MG1 | - | Results_Ana_summary.html | Results_Ana.html | Results_Ana.txt | Results_Ana_short.txt
Fusobacterium nucleatum ATCC 25586 | - | Results_Fnu_summary.html | Results_Fnu.html | Results_Fnu.txt | Results_Fnu_short.txt
Porphyromonas gingivalis | - | Results_Pgi_summary.html | Results_Pgi.html | Results_Pgi.txt | Results_Pgi_short.txt
Prevotella intermedia 17 | - | Results_Pin_summary.html | Results_Pin.html | Results_Pin.txt | Results_Pin_short.txt
Streptococcus agalactiae 2603V/R | + | Results_Sag_summary.html | Results_Sag.html | Results_Sag.txt | Results_Sag_short.txt
Streptococcus mitis NCTC 12261 | + | Results_Smi_summary.html | Results_Smi.html | Results_Smi.txt | Results_Smi_short.txt
Streptococcus mutans UA159, serotype C | + | Results_Smu_summary.html | Results_Smu.html | Results_Smu.txt | Results_Smu_short.txt
Streptococcus pneumoniae TIGR4 | + | Results_Spn_summary.html | Results_Spn.html | Results_Spn.txt | Results_Spn_short.txt
Streptococcus sanguinis SK36 | + | Results_Ssa_summary.html | Results_Ssa.html | Results_Ssa.txt | Results_Ssa_short.txt
Streptococcus thermophilus CNRZ1066 | + | Results_Sth_summary.html | Results_Sth.html | Results_Sth.txt | Results_Sth_short.txt
Treponema denticola ATCC 35405 | - | Results_Tde_summary.html | Results_Tde.html | Results_Tde.txt | Results_Tde_short.txt
Tannerella forsythensis ATCC 43037 | - | Results_Tfo_summary.html | Results_Tfo.html | Results_Tfo.txt | Results_Tfo_short.txt
Plasmid in Gram-positive bacteria | + | Results_ppPos_summary.html | Results_ppPos.html | Results_ppPos.txt | Results_ppPos_short.txt
Plasmid in Gram-negative bacteria | - | Results_ppNeg_summary.html | Results_ppNeg.html | Results_ppNeg.txt | Results_ppNeg_short.txt
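
As an illustration of how the database dump relates to the ratios on the summary page, the sketch below reads dump text in the eight-column order listed above and counts, for each protein and location, how many tools favor that location out of the tools that report a value for it. This is one possible reading of the ratio, not the published procedure. The assumptions are that an empty field means the tool does not cover a location and that a tool "predicts" the location with its highest value; the location labels and the function name `location_ratios` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Labels for columns 4-8 of the dump (illustrative names, documented order).
LOCATIONS = ["extracellular", "outer membrane / cell wall", "periplasmic",
             "inner membrane / membrane", "cytoplasmic"]

def location_ratios(dump_text):
    """Return {gene_id: {location: (tools_predicting, tools_covering)}}."""
    hits = defaultdict(lambda: defaultdict(int))
    covered = defaultdict(lambda: defaultdict(int))
    reader = csv.reader(io.StringIO(dump_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3 + len(LOCATIONS):
            continue  # skip blank or malformed lines
        gene_id = row[0]
        # Assumption: an empty field means the tool does not cover that location.
        scores = {loc: float(val)
                  for loc, val in zip(LOCATIONS, row[3:8]) if val.strip()}
        if not scores:
            continue
        top = max(scores, key=scores.get)  # the tool's favored location
        for loc in scores:
            covered[gene_id][loc] += 1
        hits[gene_id][top] += 1
    return {gene: {loc: (hits[gene][loc], total)
                   for loc, total in locs.items()}
            for gene, locs in covered.items()}

# Made-up dump lines, for illustration only.
dump = ("G1\tprotein A\tpsortb\t9.7\t0.1\t0.1\t0.0\t0.1\n"
        "G1\tprotein A\tSubLoc\t0.8\t\t0.1\t\t0.3\n")
print(location_ratios(dump)["G1"]["extracellular"])  # (2, 2)
```

In the example, both tools favor the extracellular location, giving a ratio of 2/2, while only one of the two tools reports a value for the outer membrane or cell wall column.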

References

  1. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FSL. (2005). PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21, 617-623.
  2. Nakai K, Kanehisa M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Proteins, 11, 95-110.
  3. Bendtsen JD, Nielsen H, von Heijne G, Brunak S. (2004). Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783-795.
  4. Yu CS, Chen YC, Lu CH, Hwang JK. (2006). Prediction of protein subcellular localization. Proteins, 64, 643-651.
  5. Shen HB, Chou KC. (2007). Gpos-PLoc: an ensemble classifier for predicting subcellular localization of Gram-positive bacterial proteins. Protein Eng. Des. Sel., 20, 39-46.
  6. Chou KC, Shen HB. (2006). Large-scale predictions of Gram-negative bacterial protein subcellular locations. J. Proteome Res., 5, 3420-3428.
  7. Bhasin M, Garg A, Raghava GP. (2005). PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics, 21, 2522-2524.
  8. Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R. (2004). Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics, 20, 547-556.
  9. Nair R, Rost B. (2005). Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol., 348, 85-100.
  10. Hua S, Sun Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.