deepb3p

Blood-brain barrier permeability peptide prediction

The source code and data used in this study can be downloaded from here.

The workflow of DeepB³P

The workflow of DeepB³P comprises five parts: data acquisition and processing (A); amino acid embedding (B); data augmentation using FBGAN (C); and the neural network architecture of DeepB³P (D, E).

(A). The positive dataset was obtained from B3Pred, BBPpred, BBPpredict, B3Pdb, Brainpeps and UniProt. The negative dataset was obtained from UniProt, following the same selection criteria as BBPpredict. In total, 329 BBBP and 6,851 Non-BBBP sequences were used for training the model, and 99 BBBP and Non-BBBP sequences for testing the model.

(B). Two feature encoding schemes, namely one-hot and ordinal encoding, were employed for FBGAN and DeepB³P, respectively.

(C). FBGAN comprises a generator, a discriminator, and a feedback analyzer.

(D). The architecture of the encoder and decoder in the transformer.

(E). The final classifier employs a multilayer perceptron consisting of three fully connected layers, with each layer connected using batch normalization and ReLU activation functions.
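
The two encoding schemes mentioned in (B) can be illustrated with a minimal sketch. The alphabet ordering, the padding index 0, and the maximum length of 50 are assumptions made for this example; they are not taken from the DeepB³P source code, which may encode sequences differently.

```python
# Assumption: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: index 0 is reserved for padding, so residues map to 1..20.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 50  # assumed maximum peptide length


def ordinal_encode(seq, max_len=MAX_LEN):
    """Map each residue to an integer index and right-pad with zeros."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    # A non-standard residue raises KeyError here.
    codes = [AA_TO_INDEX[aa] for aa in seq.upper()]
    return codes + [0] * (max_len - len(codes))


def one_hot_encode(seq, max_len=MAX_LEN):
    """Return a max_len x 20 matrix; padding positions are all-zero rows."""
    matrix = []
    for code in ordinal_encode(seq, max_len):
        row = [0] * len(AMINO_ACIDS)
        if code > 0:
            row[code - 1] = 1
        matrix.append(row)
    return matrix
```

In this sketch, the ordinal form is a flat list of integers suited to an embedding layer, while the one-hot form is a fixed-size binary matrix of the kind typically fed to a GAN.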

Comparison of DeepB³P with existing methods

We compared DeepB³P with three baseline methods, namely, B3Pred, BBPpred, and BBPpredict.

(A). DeepB³P demonstrated improvements of 9.09%, 4.55% and 9.41% for specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC), respectively.

(B). Furthermore, DeepB³P surpasses all other methods in terms of AUC.

(C). Not all methods are suitable for peptides with lengths ranging from 5 to 50 amino acids. For example, B3Pred was designed to support peptides with lengths ranging from 5 to 30 amino acids only. In our testing dataset, sequences ranging from 5 to 30 amino acids in length accounted for 75.25% of the entire dataset.

DeepB³P outperforms the other methods for peptides of all lengths, except for BBPpredict in the length range of 11 to 20 amino acids.
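
For reference, the SP, ACC and MCC values reported in (A) can be computed from binary predictions as sketched below. This uses the standard textbook definitions (specificity, accuracy, Matthews correlation coefficient). The function names are hypothetical and the evaluation code in the repository may differ.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = BBBP, 0 = Non-BBBP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return specificity (SP), accuracy (ACC) and MCC as a dict."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SP": sp, "ACC": acc, "MCC": mcc}
```

MCC is informative here because it accounts for all four cells of the confusion matrix, which matters when the positive and negative classes differ greatly in size, as in the training data.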

Improved performance with data augmentation

The FBGAN training process consisted of 1,000 epochs. Every 100 epochs, the 6,522 sequences (Pse-BBBPs) generated by the generator were combined with the BBBPs as the positive dataset to train our proposed model, and the model was evaluated on the testing dataset.

(A). The ACC of DeepB³P trained using the Pse-BBBPs improved with increasing training steps and stabilized at around 800 epochs of training.

(B). At the 1,000th epoch, BBPpredict classified more than 73% of the Pse-BBBPs as BBBPs.

(C). Through a comparison of the amino acid frequency distributions among BBBPs, Non-BBBPs and Pse-BBBPs generated in the 1,000th epoch, it becomes evident that the amino acid frequency distribution of Pse-BBBPs exhibits a greater similarity to that of BBBPs.

(D). The heatmap shows the performance of models built with different data augmentation methods and with the raw data.
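
The amino acid frequency comparison described in (C) could be reproduced along the following lines. This is a sketch only: the L1 distance is one simple way to quantify how close two frequency distributions are and is an assumption of this example, not necessarily the measure used in the study. The function names are hypothetical.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequences):
    """Return the relative frequency of each amino acid over all sequences."""
    counts = Counter(
        aa for seq in sequences for aa in seq.upper() if aa in AMINO_ACIDS
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def l1_distance(freq_a, freq_b):
    """Sum of absolute frequency differences (0 = identical, 2 = disjoint)."""
    return sum(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)
```

Under this measure, the observation in (C) would correspond to `l1_distance(aa_frequencies(pse_bbbps), aa_frequencies(bbbps))` being smaller than the same distance computed against the Non-BBBPs, where `pse_bbbps` and `bbbps` are lists of sequence strings loaded by the reader.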

Interpretability analysis

The t-SNE visualization of the features. (A) The t-SNE visualization of the initial features. (B)-(D) The t-SNE visualizations of the features acquired from the transformer encoder, the transformer decoder and the penultimate layer of DeepB³P, respectively. The initial features were randomly distributed (A), whereas a more pronounced clustering effect becomes apparent after the encoder layer (B). The decoder layer additionally retained the crucial classification features, resulting in distinct clusters for BBBPs and Non-BBBPs (C). Ultimately, accurate identification was accomplished through the classification layer and softmax function (D).