The catalytic residues for each enzyme were extracted from the CSA. Many studies
have used the SITE records defined in PDB files as the basis for defining functional
residues and sites. Unfortunately SITE records are not a homogeneous data set, and
there are no fixed rules on what may or may not be included in a SITE entry. Only
13 of the 159 PDB files in our data set contain SITE records, less than 10%. These
13 structures contain 50 catalytic residues, as defined above, and 94 SITE residues.
The overlap between these two groups contains 36 residues. We therefore find that,
in our data set, 28% of catalytic residues are not found in the SITE records and only
38% of SITE residues are catalytic.
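The two percentages follow directly from the counts quoted above. A quick arithmetic check is sketched below; the variable names are illustrative only and are not taken from the original analysis.

```python
# Counts reported in the text for the 13 structures that contain SITE records.
catalytic = 50   # catalytic residues (CSA definition)
site = 94        # residues listed in SITE records
overlap = 36     # residues present in both groups

# Fraction of catalytic residues that are missing from the SITE records.
missed = (catalytic - overlap) / catalytic
# Fraction of SITE residues that are also catalytic.
site_catalytic = overlap / site

print(f"catalytic residues not in SITE: {missed:.0%}")
print(f"SITE residues that are catalytic: {site_catalytic:.0%}")
```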
The following parameters were derived for each residue (catalytic and non-catalytic) in all 159 proteins:
• Conservation: The sequence of each chain in the protein was used to initiate
a PSI-BLAST search of the NCBI Non-Redundant Data Base (NRDB) with an
E-value cut-off of 10^-20 for inclusion in the next iteration. Each PSI-BLAST
search was run to convergence or a maximum of 20 iterations. The final
multiple alignment generated by PSI-BLAST was then scored for conservation
and Diversity Of Position Score (DOPS) as described by Valdar et al 86.
• Relative Solvent Accessibility (RSA): NACCESS 87 was used with standard parameters to calculate the RSA of each residue.
• Secondary Structure: DSSP 88 was used to extract the secondary structure for each residue. The DSSP classification was simplified to three categories: helix, sheet or coil/other.
• Cleft: Surfnet 89 was used to define in which, if any, cleft the [...]
[...] between 0 and 1 (0 for no conservation and 1 for perfect conservation) and so is
passed to the network as is. The RSA is a percentage and is scaled to between
0 and 1 before presentation to the network. Depth is scaled so that the deepest
residue in each structure is scored 1 and surface residues 0.
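The scaling of the continuous inputs can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, clipping RSA values above 100% is an assumption, and depth is assumed to be 0 for surface residues in the raw values.

```python
def scale_rsa(rsa_percent):
    """Map an RSA percentage onto [0, 1].

    Clipping values outside 0-100% is an assumption of this sketch.
    """
    return min(max(rsa_percent / 100.0, 0.0), 1.0)


def scale_depth(depths):
    """Scale the depths of one structure so the deepest residue scores 1.

    Assumes surface residues already have a raw depth of 0.
    """
    deepest = max(depths)
    if deepest == 0:
        # Every residue is on the surface; nothing to scale.
        return [0.0 for _ in depths]
    return [d / deepest for d in depths]


print(scale_rsa(15))
print(scale_depth([0.0, 2.0, 4.0]))
```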
The other parameters: residue type, secondary structure and cleft are categorical
in nature, and are encoded using 1-of-C encoding. Amino acid type is encoded as an
array of 20 inputs where one input is set to 1 and the rest to 0. Secondary structure
is encoded by three input parameters. Cleft size is divided into four categories: no
cleft, largest cleft, 2nd or 3rd largest cleft and 4th to 9th largest cleft.
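A 1-of-C encoding of this kind might be written as below. The ordering of the amino acids, the category labels, the order in which the inputs are concatenated, and the function names are all assumptions made for illustration; only the number of categories per parameter is taken from the text.

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")   # 20 residue types, order assumed
SECONDARY = ["helix", "sheet", "coil"]        # three simplified DSSP classes
CLEFT = ["none", "largest", "2-3", "4-9"]     # four cleft-size categories


def one_of_c(value, categories):
    """Return a vector with a single 1 at the position of `value`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec


def encode_residue(conservation, dops, depth, rsa_percent, aa, ss, cleft):
    """Concatenate the continuous and the 1-of-C encoded inputs."""
    continuous = [conservation, dops, depth, rsa_percent / 100.0]
    return (continuous
            + one_of_c(aa, AMINO_ACIDS)
            + one_of_c(ss, SECONDARY)
            + one_of_c(cleft, CLEFT))


# Illustrative residue: a serine in a coil region, not lying in any cleft.
vec = encode_residue(0.7, 0.9, 0.3, 15, "S", "coil", "none")
print(len(vec))
```

In this sketch the input vector has 4 continuous values plus 20 + 3 + 4 categorical inputs.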
An example encoding is shown in Figure 2.1 for a serine residue with conservation
0.7, DOPS score 0.9, depth 0.3, RSA 15%, in a coil region and lying in the [...]
[...] trained using a scaled conjugate gradients algorithm. A single-layer architecture is
used in all cases. In order to accurately measure the performance of the network
it is trained using a 10-fold cross validation experiment. The dataset is divided
into 10 equal subgroups, and then in each training run 9 of the groups are used for
training, whilst the network is tested on the single remaining group. The network
is run 10 times using a different subgroup as the test group each time. In this study
the dataset was divided by structure rather than residue, so each subgroup contains
the data for approximately 16 structures. The ratio of catalytic to non-catalytic
residues is approximately 1:60 in the training set. Presenting the data in this ratio
causes the net to predict every residue as non-catalytic. The best balanced training
set was found to have a ratio of 1:6. Each training group is balanced by discarding
a random selection of the non-catalytic residues prior to training. Training ran
for 100 epochs; in every case the network converged to a stable error level before
training was terminated. The number of training epochs was not optimised, and
in particular the performance of the test set was not used to optimise the stopping
point in any way.
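The cross-validation and balancing procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the code used in the study: the function names, the data layout (a mapping from structure identifier to a list of (features, is_catalytic) pairs) and the random seeds are hypothetical, and leaving the test group unbalanced is an assumption.

```python
import random


def structure_folds(structure_ids, k=10, seed=0):
    """Split structure identifiers into k roughly equal groups."""
    ids = sorted(set(structure_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def balance(residues, ratio=6, seed=0):
    """Keep every catalytic residue and a random subset of the others.

    The result has at most `ratio` non-catalytic residues per catalytic one.
    """
    rng = random.Random(seed)
    pos = [r for r in residues if r[1]]
    neg = [r for r in residues if not r[1]]
    keep = min(len(neg), ratio * len(pos))
    return pos + rng.sample(neg, keep)


def cross_validation_splits(data, k=10, seed=0):
    """Yield (balanced training set, test set) pairs, split by structure."""
    folds = structure_folds(data.keys(), k, seed)
    for i, test_ids in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i
                 for sid in fold for r in data[sid]]
        test = [r for sid in test_ids for r in data[sid]]
        # Only the training data is balanced (assumption: test set is left as is).
        yield balance(train, seed=seed + i), test
```

Splitting by structure keeps all residues of a protein in the same subgroup, so no structure contributes to both the training and the test data in a given run.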
2.2.5 Measuring Performance
In order to judge the neural network learning process, a suitable measure of performance is required. Total error (percentage of incorrect predictions) is not sufficient
because of the highly unbalanced nature of the dataset. All of the