Abstract:
This thesis investigates the local characterization of protein families at both the structural and the sequence level. A formal framework is introduced to describe local relationships between the primary and tertiary structures of proteins.
Building on this formalism, we introduce contact fragments (CF) as portions of protein structure that reconcile spatial locality with sequential neighborhood. We show that CF are more predictable from the sequence than either contiguous fragments or structurally distant pairs of fragments. To compare CF structurally, we introduce ASD, a novel alignment-free dissimilarity based on the Fourier transform of the matrix of inter-atomic distances. This measure satisfies the triangle inequality while remaining tolerant to sequence shifts and indels. We show that ASD can also be used for standard fragment comparison and that it outperforms classical scores in practical experiments such as unsupervised classification and structural mining.
Finally, by integrating the identification of CF from the sequence into a statistical machine learning framework, we developed VIRALpro, a tool for detecting sequences of viral structural proteins.