
RESEARCH PRODUCT

Feature selection on a dataset of protein families: from exploratory data analysis to statistical variable importance

Serena Dotolo, Anna Marabotti, Eugenio Del Prete, Angelo Facchiano

subject

Quantitative Biology::Biomolecules, business.industry, Sparse PCA, Pattern recognition, Feature selection, Linear discriminant analysis, Cross-validation, Random forest, Exploratory data analysis, Statistical classification, Artificial intelligence, business, Cluster analysis, Mathematics

description

Proteins are characterized by several types of features (structural, geometrical, energetic). Most of these features are expected to be similar within a protein family. We are interested in detecting which features identify the proteins that belong to a family, and in defining the boundaries among families. Some features are redundant: they can add noise when identifying which variables are essential as a fingerprint and, consequently, whether or not they are related to a function of a protein family. We defined an original approach to analyzing protein features in order to define their relationships and peculiarities within protein families.

A multistep approach was carried out mainly in the R environment: getting and cleaning data, exploratory data analysis, and predictive modeling for classification. Ten protein families were chosen by their CATH classification (different architectures), with rules on the number of structures, the length of the sequence and the choice of the chain. The properties investigated are secondary structures, hydrogen bonds, accessible surface areas, torsion angles, packing defects, number of charged residues, free energy of folding, volume and salt bridges. Kernel density estimation helps in discovering unusual multimodal profiles. Pearson correlation highlights statistical links between pairs of variables, and Pearson distance provides a dendrogram that clusters the features. PCA clusters the protein families and detects outliers, while sparse PCA performs a feature selection. Several classification algorithms were used: decision trees (classical, boosting and bagging), SVMs (flexible discriminant analysis), centroid (nearest shrunken). The focus is on variable importance estimation. A 10-fold cross-validation, repeated 10 times, was applied to the training set. Accuracy, kappa coefficient, sensitivity and specificity were calculated for each method.
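
The feature-clustering step (Pearson distance followed by a dendrogram) can be illustrated with a minimal sketch. It is written in Python rather than R and is not the authors' code. It assumes that "Pearson distance" means d = 1 - r and that average linkage is used, neither of which is stated in the description. The data are synthetic, and the function names (`pearson_distance`, `average_linkage`) are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(columns):
    """Pairwise distance d = 1 - r between named feature columns (assumed definition)."""
    return {(a, b): 1.0 - pearson(columns[a], columns[b])
            for a in columns for b in columns}

def average_linkage(names, dist):
    """Agglomerative clustering; returns the merge history a dendrogram would draw."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[(a, b)] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Synthetic stand-in data (not the paper's dataset): two correlated feature pairs.
rng = random.Random(0)
helix = [rng.gauss(0, 1) for _ in range(50)]
phi = [rng.gauss(0, 1) for _ in range(50)]
features = {
    "helix": helix,
    "h_bonds": [2 * v + rng.gauss(0, 0.1) for v in helix],
    "phi": phi,
    "psi": [v + rng.gauss(0, 0.1) for v in phi],
}

merges = average_linkage(list(features), pearson_distance(features))
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

Strongly correlated features merge at low heights, so features that join early in the merge history are candidates for redundancy. With d = 1 - r, anti-correlated features end up far apart; using 1 - |r| instead would group them together.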
From the density plots, the percentage of mostly buried residues is significantly different in each family. The dissimilarity dendrogram shows separate clusters for secondary structures, torsion angles, defects and geometrical features. In the features network, torsion angles and surface variables are peripheral to the core of the graph (i.e. redundant). The PCA biplot gives a good clustering of the protein families, and sparse PCA confirms the dendrogram results. Taking all the results together, the following features are typical of our dataset: helix, strand, coil, turn, hydrogen bonds, polar and charged accessible surface area, volume and mostly buried residues. The random forest algorithm has the best performance values. Graphical multivariate procedures are good tools for characterizing possible fingerprints of the protein families. Predictive models for classification and variable importance estimation help in performing feature selection. The work can be improved by using multivariate regression models and by increasing the number of protein families.
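
The evaluation scheme described above (repeated 10-fold cross-validation, accuracy and kappa, variable importance) can be sketched as follows. This is a Python illustration under several assumptions, not the authors' R workflow. A plain nearest-centroid classifier (without shrinkage) stands in for the random forest, importance is estimated as the drop in held-out accuracy when one feature is permuted, "K coefficient" is taken to mean Cohen's kappa, and the data are synthetic. All function names are hypothetical.

```python
import random
from collections import defaultdict

def fit_centroids(X, y):
    """Per-class mean vector (nearest-centroid classifier, no shrinkage)."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}

def predict(centroids, row):
    """Label of the centroid closest to `row` in squared Euclidean distance."""
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(row, centroids[lab])))

def stratified_folds(y, k, rng):
    """Split sample indices into k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def cohen_kappa(truth, pred):
    """Agreement corrected for chance: (po - pe) / (1 - pe)."""
    n = len(truth)
    po = sum(t == p for t, p in zip(truth, pred)) / n
    labels = set(truth) | set(pred)
    pe = sum(truth.count(l) * pred.count(l) for l in labels) / n ** 2
    return (po - pe) / (1 - pe) if pe < 1 else float("nan")

def repeated_cv(X, y, k=10, repeats=10, seed=0, permute=None):
    """Mean accuracy and kappa over repeats x k folds.
    If `permute` is a column index, that feature is shuffled in each test fold."""
    fold_rng, perm_rng = random.Random(seed), random.Random(seed + 1)
    accs, kappas = [], []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, fold_rng):
            held = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held]
            model = fit_centroids([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
            rows = [list(X[i]) for i in test_idx]
            if permute is not None:
                col = [r[permute] for r in rows]
                perm_rng.shuffle(col)
                for r, v in zip(rows, col):
                    r[permute] = v
            truth = [y[i] for i in test_idx]
            pred = [predict(model, r) for r in rows]
            accs.append(sum(t == p for t, p in zip(truth, pred)) / len(truth))
            kappas.append(cohen_kappa(truth, pred))
    return sum(accs) / len(accs), sum(kappas) / len(kappas)

# Synthetic stand-in data: 3 "families", feature 0 informative, feature 1 noise.
rng = random.Random(42)
X, y = [], []
for family, mean in enumerate([0.0, 3.0, 6.0]):
    for _ in range(30):
        X.append([rng.gauss(mean, 0.5), rng.gauss(0.0, 0.5)])
        y.append(family)

base_acc, base_kappa = repeated_cv(X, y)
importance = [base_acc - repeated_cv(X, y, permute=j)[0] for j in range(2)]
print("accuracy", round(base_acc, 3), "kappa", round(base_kappa, 3))
print("importance (drop in accuracy):", [round(v, 3) for v in importance])
```

Because the fold generator is seeded identically in every call, the baseline and the permuted runs are scored on the same splits. The difference in accuracy therefore reflects only the shuffled feature. A feature whose permutation barely changes accuracy is a candidate for removal.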

DOI: 10.7287/peerj.preprints.2157v1 (http://dx.doi.org/10.7287/peerj.preprints.2157v1)