Processing protein multiple sequence alignments

Processing protein multiple sequence alignments#


Download and submission instruction

Download:

  1. Click the download button on the top right of this page and select .ipynb to download the Jupyter notebook.

  2. Move the downloaded .ipynb file to a uv project, where the python package notebook has been added as a dependency.

  3. Edit and run the notebook using the .venv of the uv project. Install packages as needed by running uv add mypackage

Submission:

  1. Name the .ipynb file as main.ipynb.

  2. Make a new folder named assignment-1-protein-msa under the OneDrive folder that has been shared with you

  3. Upload main.ipynb and pyproject.toml files to the OneDrive folder assignment-1-protein-msa. Please not upload other files in the uv project.

Due date: January 27, 2025, 11:59 PM


Proteins are clustered into families based on sequence similarity. A protein family is a group of proteins that share a common evolutionary origin, reflected by their related functions and similarities in sequence or structure. Sequences of proteins in a family are aligned to identify the conserved regions and the variations in the family. Such an alignment is called a multiple sequence alignment (MSA).

In this assignment, you will write Python code to process the MSA of a protein family. The MSA is stored in a text file in the Stockholm format. The Stockholm formatted file looks like the following:

# STOCKHOLM 1.0
#=GF ID   EXAMPLE
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
//

The first line shows the version of the Stockholm format. Each line that starts with # is a comment and can be ignored. It is followed by the aligned sequences of the proteins in the family, one sequence per line. Each line contains the sequence name (including start and end positions) and the aligned sequence separated by spaces. The alignment is ended by //.

First, let us download a sample MSA file that we will use for this assignment. The following code downloads the MSA of the protein family PF00041 from the Pfam database and saves it to the file PF00041_seed.txt in the folder data. Within the protein sequence, letters represent the amino acids (e.g., A for Alanine, C for Cysteine, etc.), and - and . are gaps.

[1]:
## import required packages
import urllib3
import gzip
import math
import matplotlib.pyplot as plt
import os

## download the seed alignment for the PF00041 protein family
pfam_id = "PF00041"
http = urllib3.PoolManager()
r = http.request(
    "GET",
    f"https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/{pfam_id}/?annotation=alignment:seed&download",
)
data = gzip.decompress(r.data)
data = data.decode()
os.makedirs("data", exist_ok=True)
with open(f"./data/{pfam_id}_seed.txt".format(pfam_id), "w") as file_handle:
    print(data, file=file_handle)

You can open the file PF00041_seed.txt in a text editor to see the content of the MSA file. In the following, you will write Python code to read this file and process the MSA .

1. Reading the MSA file#

  1. Read the MSA file PF00041_seed.txt and store the sequences in a dictionary. The key of each item in the dictionary is the sequence name, and the value is the aligned sequence as a string. The sequence name should include the start and end positions of the sequence if provided. If the start and end positions are not provided, you can use the sequence name as it is. Keep the gaps in the aligned sequences.

  2. Write a function to get the names of sequences that are longer than 100 amino acids, excluding gaps. Use the dictionary created in the previous step as input to this function. The function should return a list of sequence names

[2]:
## msa is the dictionary that will store the MSA
msa = {}
with open(f"./data/{pfam_id}_seed.txt", "r") as file_handle:
    for line in file_handle:
        ######################################################################
        ## write code to parse the MSA file and store it in the dictionary msa
        None
        ######################################################################
[3]:
def get_names_of_long_sequences(msa: dict) -> int:
    """
    This function gets names of sequences that are longer than
    100 amino acids
    """
    ####################################################################
    ## write your code here

    return None
    ####################################################################
[4]:
names = get_names_of_long_sequences(msa)
print(f"Names of sequences longer than 100 amino acids: {names}")
Names of sequences longer than 100 amino acids: None

2. Processing the MSA#

The sequence EPHA1_HUMAN/334-437 is one of the sequences obtained from the previous step. In this part, you will used this sequence as a reference sequence to process the MSA.

Write a function that takes the reference sequence name and the msa dictionary as input and processes the MSA as follows. First, for each sequence, remove the positions that have a gap in the reference sequence. In other words, if the reference sequence has a gap at a position, remove that position from all the sequences. Second, for each position, remove the position from all the sequences if more than 80% of the sequences have a gap at that position. The function should return a new dictionary with the processed sequences.

[5]:
def process_msa(reference: str, msa: dict) -> dict:
    ####################################################################
    ## write your code here

    return None
[6]:
msa_processed = process_msa("EPHA1_HUMAN/334-437", msa)

3 Calculating Shannon Entropy#

For each position in the processed MSA, we are interested in knowing how conserved it is, i.e., the degree of variation at that position. A commonly used measure for this is the Shannon entropy. The Shannon entropy is calculated as follows:

\[H = -\sum_{i=1}^{21} p_i \log_2(p_i)\]

where \(p_i\) is the frequency of the \(i\)-th amino acid at that position. \(i\) ranges from 1 to 21 because there are 20 amino acids and we consider the gap as an additional character. When \(p_i = 0\) for an amino acid, the term \(p_i \log(p_i)\) is taken as 0. The entropy value ranges from 0 (no variation) to \(\log_2(21) \approx 4.39\) (all amino acids are equally likely).

Write a function that takes the processed MSA dictionary as input and calculates the Shannon entropy for each position in the MSA. The function should return a list of entropy values, one for each position in the MSA. You could write additional helper functions to be used in this function if necessary.

[7]:
####################################################################
## you could define helper functions to be used in the next steps if needed



####################################################################


def compute_entropy_mas(msa: dict) -> dict:
    entropy = []
    ####################################################################
    ## write your code here

    return entropy

entropy = compute_entropy_mas(msa_processed)

[8]:
## plot the entropy values
fig = plt.figure()
plt.clf()
plt.plot(entropy, '.')
plt.ylim([0, 4.5])
plt.xlabel("Position")
plt.ylabel("Entropy")
plt.show()
../../_images/homework_1-python-basics_main_14_0.png