Abstract | This thesis examines alignments of five protein families, each divided into two subfamilies. For each family, the goal was to find the alignment positions that matter most for classifying the proteins of that family into its subfamilies, which are specialized for different functions. The protein families studied are: acyl transferases (i.e., AT-domains), the family of malate and lactate dehydrogenases (MDH/LDH), cyclases, kinases, and ketoreductases (i.e., KR-domains). Statistical analysis of the aligned sequences is possible because each amino acid (and gap) in the alignment was assigned a five-dimensional numeric vector. A split S-statistic was defined that sums the ratios of between-group to within-group variability over the coordinates of the amino acid vector. Noise drawn from the known average distribution of all amino acids was added to the data. Positions were ranked by their S-statistic values for each of the 5 alignments, while the distribution of the S-statistic was approximated by an F-distribution. In most cases, the F-distribution of the S-statistic could not be rejected with the KS test, so statistically significant positions were selected for each family at significance levels of 1, 5, or 10%. t-SNE plots are also shown, visualizing the original proteins from each alignment using only the 10 most significant positions of that alignment. These illustrative plots show that, for each family, the subfamilies form mutually separated clusters, with very few or no misclassified proteins. Finally, the ranking of positions was compared with rankings from some similar earlier studies. The significant positions obtained in this thesis potentially provide valuable information for future experimental biological research, especially regarding possible enzyme mutations at exactly those positions, with the aim of achieving a different, preferable enzyme function. |
Abstract (english) | In this thesis, alignments of five protein families were studied, where each family is split into two subfamilies. For each protein family, the goal was to find the alignment positions that matter most for separating that family into its subfamilies, which are specialized for different functions. The protein families studied are: acyl transferases (AT-domains), a family of malate and lactate dehydrogenases (MDH/LDH), cyclases, kinases, and ketoreductases (KR-domains). Statistical analysis of the aligned sequences is possible because each amino acid (and gap) in the alignment was assigned a five-dimensional numeric vector. A split statistic (S-statistic) was defined, which sums the ratios of between-group to within-group variability over the coordinates of the amino acid vector. Noise drawn from the known average distribution of all amino acids was added to the data. For each of the 5 alignments, the positions were ranked by their S-statistic values, while the distribution of the S-statistic was approximated by an F-distribution. In the majority of cases, the F-distribution of the S-statistic could not be rejected with the Kolmogorov-Smirnov (KS) test, so statistically significant positions were selected for each family at significance levels of 1, 5, or 10%. Also shown are t-SNE plots that visualize the original proteins from each alignment, using only their amino acid residues at the ten most important positions of that alignment. These illustrative plots show that, for each family, the corresponding subfamilies form mutually separated clusters, with very few or no protein misclassifications. Finally, the ranking of positions was compared with rankings from similar past research. The significant positions found in this thesis potentially provide valuable information for future experimental biological research, especially in the form of possible enzyme mutations at those exact positions, with the aim of achieving a different, more preferable enzyme function. |
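
The abstract describes the S-statistic only in words, so the following is a minimal sketch of one way such a statistic could be computed, not the thesis's actual implementation. It assumes that each ratio is a one-way ANOVA-style F ratio (between-group mean square over within-group mean square) computed per coordinate for two subfamilies, and that the ratios are summed without weights. The function names `s_statistic` and `rank_positions` and the array layout are hypothetical.

```python
import numpy as np


def s_statistic(group_a, group_b):
    """Sum over coordinates of between-group / within-group variability.

    group_a, group_b: arrays of shape (n_sequences, n_coords) holding the
    numeric vectors of one alignment position for each subfamily.
    Assumption: each ratio is the one-way ANOVA F ratio for two groups.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    grand_mean = np.vstack([a, b]).mean(axis=0)

    # Between-group mean square (two groups, so one degree of freedom).
    between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
    # Within-group mean square, pooled over both groups.
    # It must be nonzero in every coordinate for the ratio to be defined.
    within = (((a - mean_a) ** 2).sum(axis=0)
              + ((b - mean_b) ** 2).sum(axis=0)) / (n_a + n_b - 2)
    return float((between / within).sum())


def rank_positions(encoded, in_first_subfamily):
    """Rank alignment positions by decreasing S-statistic.

    encoded: array of shape (n_sequences, n_positions, n_coords).
    in_first_subfamily: boolean array of length n_sequences.
    Returns (position indices sorted by decreasing score, scores).
    """
    encoded = np.asarray(encoded, dtype=float)
    mask = np.asarray(in_first_subfamily, dtype=bool)
    scores = np.array([
        s_statistic(encoded[mask, p, :], encoded[~mask, p, :])
        for p in range(encoded.shape[1])
    ])
    return np.argsort(-scores), scores
```

In this sketch a position where the two subfamilies have clearly different mean vectors and little internal spread receives a large score and is ranked first. The noise step, the F-distribution approximation, and the KS test mentioned in the abstract are not reproduced here.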