## Modified Edit Distance on Global SARS-CoV-2 Analysis

#### Abstract:

In bioinformatics, sequence comparison is used to determine how closely related two sequences of nucleotides or amino acids are by finding the best alignment and taking into consideration insertions, deletions, and substitutions that can happen between two sequences, which is remarkably similar to the Edit Distance algorithm. To improve the accuracy of the algorithm applied to bioinformatics sequence comparison, many researchers have used different weight matrices to change the cost of various edits. A shortcoming of this approach is that it does not reflect random errors that may occur during genetic duplication. Thus, this study proposes and evaluates a new cost function that takes into account the number of consecutive differences. The data used to perform analyses are SARS-CoV-2 sequences from NCBI from several places around the world to determine their relationships. Accuracy is measured by comparing the results from the algorithm with analyses from BLASTP on NCBI and phylogeny trees drawn by MEGAX. The results show that by implementing this cost function, distances between related sequences are more consistent with analyses given by established tools. This suggests that the cost function is an effective alternative approach to improve the accuracy of Edit Distance as applied to bioinformatics sequence comparison.#### Bibliography/Citations:

Literature Cited

- Kleinberg, Jon, and Eva Tardos. Algorithm Design. Pearson Education, 2014.
- Berger, Bonnie & Waterman, Michael & Yu, Yun. “Levenshtein Distance, Sequence Comparison and Biological Database Search.” IEEE Transactions on Information Theory (2020): PP. 1-1. doi:10.1109/TIT.2020.2996543.
- Pearson, William R. “Selecting the Right Similarity-Scoring Matrix.” Current protocols in bioinformatics vol. 43 (2013): 3.5.1-3.5.9. doi:10.1002/0471250953.bi0305s43

#### Additional Project Information

## Project files

#### Research Plan:

Research Plan:

- Implement the Edit Distance algorithm in Python
- Obtain SARS-CoV-2 sequences from around the world, as well as a SARS sequence and a bat coronavirus from NCBI databases
- Compare a SARS-CoV-2 sequence, SARS, and bat coronavirus using BLASTP on NCBI to obtain the accurate percent identities to match with
- Run the same three sequences on the Edit Distance to see the initial percent identities it gives
- Edit the algorithm to add a cost function, including all the parameters that can be experimented with
- Experiment with different functions and parameters and run on the sequences, each time tweaking to find what gives the closest to BLASTP
- Once settled on a function and its parameters, then run all the SARS-CoV-2 sequences on the modified algorithm
- Run all the SARS-CoV-2 sequences on MEGA-X to produce phylogeny trees
- Compare the results from the modified algorithm and MEGA-X to see if they match (which would indicate that the algorithm is applicable not just to a particular set of sequences)

#### Questions and Answers

1. What was the major objective of your project and what was your plan to achieve it?

a. Was that goal the result of any specific situation, experience, or problem you encountered?

b. Were you trying to solve a problem, answer a question, or test a hypothesis?

The objective of my project was to see if the edit distance algorithm could be applied to bioinformatics sequence comparison by modifying it through adding a function that took into account the number of consecutive edits between two sequences to adjust the distance between them. This function was termed the cost function. I came up with this idea after attending several summer programs that introduced me to bioinformatics tools for sequence comparison and determining relationships between sequences, as well the Edit Distance algorithm in a linguistics context. After these programs, I saw how the algorithm was very similar to comparing sequences in bioinformatics, so I decided to research it and possibly find a way to make the algorithm work for bioinformatics by making its results match other bioinformatics tools.

2. What were the major tasks you had to perform in order to complete your project?

a. For teams, describe what each member worked on.

I worked on this entirely independently. I first had to implement the Edit Distance algorithm in Python and obtain many sequences from the NCBI databases. I then had to edit my code to add backtracking in order to count the number of consecutive edits in the best alignment of the sequences, which would then have to go into a function to adjust the distance between the sequences. I had to try many different functions and tweak the parameters of the functions, running the algorithm to see the changes in the results, and then analyze how I need to change the function in order to improve the results. This was done by graphing the functions and testing out different functions to see how the shape changed in places that I needed it to. That process took the longest time; the rest was just running the sequences on different tools or running different sequences on the same tools and analyzing if the results matched.

3. What is new or novel about your project?

a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?

b. Is your project's objective, or the way you implemented it, different from anything you have seen?

c. If you believe your work to be unique in some way, what research have you done to confirm that it is?

Edit distance is similar to sequence comparison in bioinformatics, as both are finding the best alignment and calculating distances between strings. The algorithms behind tools like BLAST that are commonly used today to search for similar sequences are based off these algorithms that align sequences (Berger, Waterman & Yu, 2020). Thus, research has been done to apply Edit Distance algorithm to bioinformatics by making certain modifications that attempt to encompass the complex factors in biology that go into determining how two sequences are related. A very common method is using substitution matrices, or scoring matrices, which in the simplest nature, change the weight of substituting between different amino acids according to how likely they are to be aligned with each other. Research has been done to find suitable matrices to find similarities across evolutionary distances (Pearson, 2013).

In DNA replication, random errors may happen during transcription and translation, but they may or may not affect the actual function of the protein. Thus, finding a single character difference between two sequences may not be significant, but longer chains of differences most likely indicates a significant deviation. A substitution matrix does not take into account this factor, so the idea behind this research was to modify the Edit Distance algorithm from this angle instead.

4. What was the most challenging part of completing your project?

a. What problems did you encounter, and how did you overcome them?

b. What did you learn from overcoming these problems?

The most challenging part of the project was trying to find the cost function, because it took a lot of patience to find a suitable function that made the results close enough to those given by bioinformatics tools. I spent a lot of time on Desmos graphing functions and changing constants to make a shape that suited the changes I hoped to see. Eventually, I settled into a shape for the function and only really had to change the constants to fine-tune the numbers to arrive at the final function.

5. If you were going to do this project again, are there any things you would you do differently the next time?

A couple different steps could be taken. Firstly, I might spend more time researching the cost function to give more accurate results, either through changing the actual function or tweaking the numbers further. As I was testing the function, I realized that it would make sense to have a function that would lower the cost of a single edit on its own, raise the cost for a certain number of consecutive edits, but after a certain point the cost wouldn’t change much. This might help improve the accuracy of the percent identity between the China and SARS sequences. I would also use more sets of sequences not related to the COVID/SARS family to further test the applicability.

6. Did working on this project give you any ideas for other projects?

Not as an entirely new project, but to completely revamp this project, it might be worthwhile to try to combine substitution matrices with this cost function to provide a more nuanced model, as well as research other factors to increase the accuracy of the algorithm. I might also try a different type of edit distance, because the Levenshtein distance only considers insertion, deletion, and substitution, but I later learned about the Damerau–Levenshtein distance also adds transposition as another type of edit, which might make the algorithm more accurate. All of these changes would completely change the code and are considerably more complex than the extent to which this project goes.

7. How did COVID-19 affect the completion of your project?

COVID-19 did not affect the completion of my project because the idea was created during the pandemic, so everything was able to be done from my own computer at home. All the resources I used were accessible either through the internet or by downloading an application. The pandemic actually allowed me to think of this idea, as I was able to attend two online programs over the summer that introduced me to the bioinformatics tools, the COVID sequence database on NCBI, and the Edit Distance algorithm. Through these programs I saw how the algorithm was very similar to bioinformatics sequence comparison and wanted to see if by modifying the algorithm, it could give similar results to real bioinformatics tools. In addition, I wanted to see how COVID sequences around the world were related, so those two ideas combined created the basis for my project.