Amyloid_PIT: An Amyloid Segment Identifier Program

A general aim of bioinformatics research is to use the tools and methods of mathematics and statistics to identify, with high precision, those entries of large protein or other biological databases that have certain predetermined properties and are therefore relevant to biologists or physicians. The Protein Information Technology Bioinformatics Group (Eötvös Loránd University) received a request of this kind from biologists and chemists: to find the amyloid-like entries in the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB).

Put informally, amyloids are “misfolded” proteins with a property that makes them similar to prions: if an amyloid comes into contact with a free chain of a non-amyloid protein, the latter protein also misfolds. The study of amyloids is considered very important from a medical perspective because their presence has been confirmed to be highly correlated with a number of neurodegenerative diseases (e.g. Alzheimer’s and Huntington’s disease).

Figure 1. PDB entry 2MXU, a 42-residue beta-amyloid fibril

For the reasons mentioned above, finding and identifying amyloids has become an essential area of bioinformatics in recent decades, and many amyloid finder programs have been developed. Most of the programs that work with PDB data share two general disadvantages:

1) They can find only complete amyloid structures, so entries with long amyloid-like segments are missing from their output even when only a short part of the protein chains fails to resemble an amyloid;

2) Their search algorithms rely on PDB annotations, keywords, precursors, etc. (i.e. the non-quantitative data of the database), which are inaccurate in many cases.

Our aim was to construct a program without these disadvantages: one that can find every PDB entry with long amyloid-like segments (even if the structure in the PDB file is not clearly amyloid as a whole), using only the geometric description of the chain atoms and avoiding text data as much as possible.
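As an illustration of what "using only the geometric description of the chain atoms" can mean in practice, the following minimal sketch reads the Cα coordinates of each chain from the fixed-column ATOM records of PDB-format text. The function name `ca_coordinates` and the choices to keep only the first model and the first alternate location are assumptions made for this example; they are not taken from Amyloid_PIT itself.

```python
from collections import defaultdict


def ca_coordinates(pdb_text):
    """Return {chain_id: [(x, y, z), ...]} for the C-alpha atoms in PDB-format text.

    Hypothetical helper for illustration; it relies only on the fixed
    column layout of ATOM records in the legacy PDB file format.
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model of the entry is used
        if not line.startswith("ATOM"):
            continue  # skips HETATM, SEQRES, REMARK, etc.
        if line[12:16].strip() != "CA":
            continue  # columns 13-16 hold the atom name
        if line[16] not in (" ", "A"):
            continue  # column 17: keep only the first alternate location
        chain_id = line[21]  # column 22 holds the chain identifier
        x = float(line[30:38])  # columns 31-38
        y = float(line[38:46])  # columns 39-46
        z = float(line[46:54])  # columns 47-54
        chains[chain_id].append((x, y, z))
    return dict(chains)
```

No annotation, keyword, or title field is read here; the only non-numeric information used is the atom name and the chain identifier.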

Our research builds on an article about constructing amyloid databases (Stanković, I. et al. (2017): Construction of Amyloid PDB Files Database. Transactions on Internet Research. 13 (1): 47-51.). In that article, amyloid structures were identified by the following method:

  1. Searching the PDB for “amyloid” and 38 other amyloid precursors;
  2. Excluding helical structures (using a Tcl script);
  3. Excluding non-parallel structures (using the torsion angles of the Ramachandran plot).

The criterion for parallelism between the chains of a PDB entry was the following: the distances between corresponding Cα atoms of two parallel fragments may differ by at most 1.5 Å, as found in amyloid-β structures resolved in previous works. This method identified 109 PDB structures as amyloids.


Figure 2. The criterion for parallelism between chains (source: Stanković, I. et al. (2017)).
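The parallelism criterion can be sketched in a few lines. The example below assumes one particular reading of the criterion: the two fragments are already aligned position by position, and the largest and smallest Cα-Cα distances along them may differ by no more than 1.5 Å. The function names and this reading are assumptions for illustration, not the implementation used in the referenced article.

```python
import math


def pair_distances(frag_a, frag_b):
    """Distances between position-wise corresponding C-alpha atoms."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same number of atoms")
    return [math.dist(p, q) for p, q in zip(frag_a, frag_b)]


def is_parallel_max_diff(frag_a, frag_b, tolerance=1.5):
    """True if the pairwise distances vary by at most `tolerance` (in Å).

    Assumed reading of the criterion: max distance minus min distance
    along the fragment pair must not exceed the tolerance.
    """
    distances = pair_distances(frag_a, frag_b)
    if not distances:
        return False  # empty fragments cannot be called parallel
    return max(distances) - min(distances) <= tolerance


# Two idealized straight strands 4.8 Å apart, and one that drifts away.
strand_a = [(3.8 * i, 0.0, 0.0) for i in range(5)]
strand_b = [(3.8 * i, 4.8, 0.0) for i in range(5)]
strand_c = [(3.8 * i, 4.8 + i, 0.0) for i in range(5)]

print(is_parallel_max_diff(strand_a, strand_b))
print(is_parallel_max_diff(strand_a, strand_c))
```

In this toy input, the first pair keeps a constant separation, while the second pair drifts apart by 1 Å per residue and so exceeds the tolerance.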

Our program went through several working versions before reaching its current form:

  • v1.0: This version essentially implemented the method described in the referenced article, with one difference: the parameter used to determine parallel fragments was changed from the maximal difference between chains to the deviation of the differences.

  • v1.5: Stanković’s method can be applied only to PDB files with at least two chains. In this step, we attempted to extend the algorithm to single-chain entries using a natural generalization. Because this modification drastically increased the running time of the program, we decided that, for the present, our research would focus only on PDB files with more than one chain.

  • v2.0: This was the first version able to recognize whole “amyloid” structures (i.e. PDB entries in which every chain has a part of significant length that is parallel to a long fragment of another chain). It uses our new method for locating parallel segments, which not only calculates the distances of the respective Cα atom pairs but also considers the general geometric structure of the chains.
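The change made in v1.0, from a maximal difference to a deviation, can be illustrated as follows. This sketch assumes that "deviation" means the population standard deviation of the Cα-Cα distances along an aligned fragment pair; the threshold `max_spread=0.5` is a purely hypothetical value chosen for the example, since the text does not state the one actually used.

```python
import math
import statistics


def distance_spread(frag_a, frag_b):
    """Population standard deviation of position-wise C-alpha distances (Å)."""
    if len(frag_a) != len(frag_b) or not frag_a:
        raise ValueError("fragments must be non-empty and of equal length")
    distances = [math.dist(p, q) for p, q in zip(frag_a, frag_b)]
    return statistics.pstdev(distances)


def is_parallel_by_spread(frag_a, frag_b, max_spread=0.5):
    """True if the distances scatter little around their mean.

    `max_spread` is a hypothetical threshold, not a value from the source.
    """
    return distance_spread(frag_a, frag_b) <= max_spread


# Same idealized strands as before: constant 4.8 Å gap vs. a drifting gap.
strand_a = [(3.8 * i, 0.0, 0.0) for i in range(5)]
strand_b = [(3.8 * i, 4.8, 0.0) for i in range(5)]
strand_c = [(3.8 * i, 4.8 + i, 0.0) for i in range(5)]

print(is_parallel_by_spread(strand_a, strand_b))
print(is_parallel_by_spread(strand_a, strand_c))
```

One plausible motivation for such a change is that a standard deviation is less sensitive to a single outlying atom pair than a max-minus-min range, although the text does not state the reason.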

Version v2.0 produced an output of 782 PDB files, including some obvious false positives (see Figure 3), so further modifications were required.

Figure 3. The geometric structures of PDB entries 1COS and 5EHB, which are clearly false positives of v2.0: the program identified them as “amyloid-like”.

In the current version of our program (v3.0), two main changes were made compared to v2.0:

  1. Recognizing not only pure amyloid structures but also sufficiently long amyloid-like segments (where calculating the angles between consecutive Cα atoms is an important step);
  2. Considering the secondary structure of the entries (based on the “SEQRES” lines of the PDB files): only the parallelism of β-sheets is examined (so that, e.g., collagens can be excluded).
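The angle calculation mentioned in the first change can be sketched as below. The example assumes that "the angles between consecutive Cα atoms" refers to the angle at each interior Cα formed with its two neighbours along the chain; the function name `ca_angles` is hypothetical, and how v3.0 actually uses these angles is not specified in the text.

```python
import math


def ca_angles(coords):
    """Angle in degrees at each interior C-alpha, formed with its two neighbours.

    For atoms (prev, mid, nxt) the angle is measured at `mid` between the
    vectors mid->prev and mid->nxt. Returns len(coords) - 2 values.
    """
    angles = []
    for prev, mid, nxt in zip(coords, coords[1:], coords[2:]):
        u = [a - b for a, b in zip(prev, mid)]
        v = [a - b for a, b in zip(nxt, mid)]
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0.0:
            raise ValueError("consecutive atoms must not coincide")
        cosine = sum(a * b for a, b in zip(u, v)) / norm
        # Clamp against rounding errors before taking the arccosine.
        cosine = max(-1.0, min(1.0, cosine))
        angles.append(math.degrees(math.acos(cosine)))
    return angles


# A straight run followed by a right-angle turn.
trace = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
print(ca_angles(trace))
```

In a fully extended stretch these angles stay close to 180°, so a run of large, similar angles is one simple geometric hint of a strand-like segment; the thresholds needed to act on that hint are left open here.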

The output of v3.0 consists of 534 entries, which can be viewed and downloaded from the program's webpage; the list is refreshed automatically every month with the new PDB files.