Blood-brain barrier permeability peptide prediction



The workflow of DeepB3P

The workflow of DeepB3P is shown in five panels: data acquisition and processing (A); amino acid embedding (B); data augmentation using FBGAN (C); and the neural network architecture of DeepB3P (D, E).

  • (A). The positive dataset was obtained from B3Pred, BBPpred, BBPpredict, B3Pdb, Brainpeps and UniProt. The negative dataset was obtained from UniProt, following the same selection criteria as BBPpredict. Finally, a total of 329 BBBP and 6,851 Non-BBBP sequences were used for training the model, and 99 BBBP and 99 Non-BBBP sequences for testing it.
  • (B). Two feature encoding schemes, namely one-hot and ordinal encoding, were employed for FBGAN and DeepB3P, respectively.
  • (C). FBGAN comprises a generator, a discriminator, and a feedback analyzer.
  • (D). The architecture of the encoder and decoder in the transformer.
  • (E). The final classifier is a multilayer perceptron consisting of three fully connected layers, with batch normalization and ReLU activation applied between layers.
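
The two encoding schemes in (B) can be illustrated with a minimal sketch. The alphabet order, the padding value of 0, and the maximum length of 50 are assumptions made for illustration; they are not taken from the DeepB3P source code.

```python
# Illustrative sketch only: alphabet order, padding value and maximum length
# are assumptions, not details of the DeepB3P implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed order)
MAX_LEN = 50                          # assumed maximum peptide length


def ordinal_encode(seq, max_len=MAX_LEN):
    """Map each residue to an integer in 1..20 and right-pad with 0."""
    index = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
    codes = [index[aa] for aa in seq[:max_len]]
    return codes + [0] * (max_len - len(codes))


def one_hot_encode(seq, max_len=MAX_LEN):
    """Map each residue to a 20-dim indicator row; padding rows are all zeros."""
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq):
            row[AMINO_ACIDS.index(seq[pos])] = 1
        rows.append(row)
    return rows


peptide = "GRKKR"  # hypothetical example sequence
ordinal = ordinal_encode(peptide)   # integer codes, e.g. as input to an embedding layer
one_hot = one_hot_encode(peptide)   # 50 x 20 matrix, e.g. as input to a GAN
```

Ordinal codes are compact and pair naturally with a learned embedding, while one-hot rows give a GAN generator a fixed-size matrix to produce.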

Comparison of DeepB3P with existing methods

We compared DeepB3P with three baseline methods, namely B3Pred, BBPpred, and BBPpredict.

  • (A). DeepB3P demonstrated improvements of 9.09%, 4.55% and 9.41% in specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC), respectively.
  • (B). DeepB3P also surpasses all other methods in terms of AUC.
  • (C). Not all methods are suitable for peptides with lengths ranging from 5 to 50 amino acids. For example, B3Pred was designed to support only peptides of 5 to 30 amino acids. In our testing dataset, sequences of 5 to 30 amino acids accounted for 75.25% of the entire dataset.
  • DeepB3P outperformed the other methods for peptides of all lengths, except for BBPpredict on peptides in the length range of 11 to 20 amino acids.

Improved performance with data augmentation

The FBGAN training process consisted of 1,000 epochs. Every 100 epochs, the 6,522 sequences (Pse-BBBPs) generated by the generator were combined with the BBBPs as the positive dataset to train our proposed model, and the model was evaluated using the testing dataset.
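
The counts given in this document suggest that the generated sequences balance the classes: 329 BBBPs plus 6,522 Pse-BBBPs equals the 6,851 Non-BBBPs. A minimal sketch of that bookkeeping and of the evaluation schedule follows; `build_training_set` is a hypothetical helper, and the exact epochs at which sequences are sampled are an assumption.

```python
# Counts taken from the text; everything else is an illustrative assumption.
N_BBBP, N_NON_BBBP, N_PSE = 329, 6851, 6522
TOTAL_EPOCHS, EVAL_EVERY = 1000, 100

# After augmentation, positives match the number of negatives.
assert N_BBBP + N_PSE == N_NON_BBBP


def build_training_set(bbbps, pse_bbbps, non_bbbps):
    """Label real and generated BBBPs as 1 and Non-BBBPs as 0."""
    positives = [(seq, 1) for seq in list(bbbps) + list(pse_bbbps)]
    negatives = [(seq, 0) for seq in non_bbbps]
    return positives + negatives


# Epochs at which generated sequences are sampled and the classifier is retrained
# (assumed to be 100, 200, ..., 1000).
checkpoints = list(range(EVAL_EVERY, TOTAL_EPOCHS + 1, EVAL_EVERY))
```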

  • (A). The ACC of DeepB3P trained with the Pse-BBBPs improved as the number of FBGAN training epochs increased, and it stabilized at around 800 epochs.
  • (B). At the 1,000th epoch, BBPpredict classified more than 73% of the Pse-BBBPs as BBBPs.
  • (C). A comparison of the amino acid frequency distributions of BBBPs, Non-BBBPs and the Pse-BBBPs generated at the 1,000th epoch shows that the distribution of Pse-BBBPs more closely resembles that of BBBPs.
  • (D). The heatmap shows the performance of models built with different data augmentation methods and with the raw data.
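
The amino acid frequency comparison in (C) can be sketched as follows. The L1 distance used here is one simple way to quantify similarity between two frequency distributions and is an assumption, not necessarily the measure behind the figure; the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequences):
    """Relative frequency of each standard amino acid across a set of sequences."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def l1_distance(freq_a, freq_b):
    """Sum of absolute frequency differences; smaller means more similar."""
    return sum(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)


# Hypothetical toy sequences, not real peptides from the datasets.
bbbps = ["RRKK", "KRRW"]
pse_bbbps = ["RKRK", "RRKA"]
non_bbbps = ["DEDE", "EEDA"]

d_pse_pos = l1_distance(aa_frequencies(pse_bbbps), aa_frequencies(bbbps))
d_pse_neg = l1_distance(aa_frequencies(pse_bbbps), aa_frequencies(non_bbbps))
print(d_pse_pos < d_pse_neg)  # generated set is closer to the positives in this toy case
```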

Interpretability analysis

The t-SNE visualization of the features. (A) shows the t-SNE visualization of the initial features. (B)-(D) show the t-SNE visualizations of the features acquired from the transformer encoder, the transformer decoder and the penultimate layer of DeepB3P, respectively. The initial features were randomly distributed (A), whereas a more pronounced clustering effect becomes apparent after the encoder layer (B). The decoder layer additionally retained the crucial classification features, resulting in distinct clusters for BBBPs and Non-BBBPs (C). Ultimately, accurate identification was accomplished through the classification layer and softmax function (D).
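
As a small illustration of the final step, the softmax function turns the classifier's two output scores into class probabilities. The logits and the class ordering below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes (Non-BBBP, BBBP), in that assumed order.
probs = softmax([0.0, 2.0])
predicted_label = probs.index(max(probs))  # 1 means BBBP under this ordering
```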