Predict protein secondary structure#

Author: YOUR_NAME

Due date: February 17, 2025, 11:59 PM

In this assignment you will predict the secondary structure of proteins based on their amino acid sequences. The secondary structure of a protein is the local spatial arrangement of its backbone atoms. It arises from the hydrogen bonds between the backbone atoms and is the building block of proteins’ 3D structure. The most common types of secondary structures are alpha helices, beta sheets, and coils. Given a protein’s 3D structure, the secondary structure can be determined from the coordinates of its atoms. Depending on the determination method, the secondary structure can be classified into more or fewer types.

Here you will use a dataset of proteins with known secondary structures that were assigned by the DSSP method based on their 3D structures. Although DSSP assigns eight types of secondary structures, the assignment simplifies them into three types: helix (including types G, H, and I), strand (including types B and E), and coil (including types T, S, ., and P). The training and test datasets are provided in the text files train.txt and test.txt, respectively. Each line starting with a greater-than symbol > contains the name of the protein. In the training dataset, the two lines following the name contain the amino acid sequence and the secondary structure, respectively. In the test dataset, only the amino acid sequence is provided. The amino acid sequence is represented by the one-letter code. The secondary structure is represented as a string of 0 (helix), 1 (strand), and 2 (other).
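For example, a single training entry in this format might look like the following (the protein name here is hypothetical; the sequence and its secondary structure are the example used later in this assignment):

>example_protein
SIVAGYEVVGSSSASELLSAIEHVAEKA
2111111111222000000000220000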

The task is to train a machine learning model using the training dataset and predict the secondary structure of the proteins using their sequences in the test dataset.

Model#

Different proteins have different numbers of amino acids. If the model were to predict the secondary structure of all amino acids in a protein at once, it would have to handle sequences of different lengths. Although this is possible, this assignment will use a simpler approach. The model will predict the secondary structure of an amino acid based on the amino acids before it and after it. This way, the model will have a fixed-size input and output. Specifically, the model will use as input a sequence of 15 amino acids and predict the secondary structure of the central amino acid. The model will slide this window along the sequence to predict the secondary structure of all amino acids in a protein. For the amino acids near the beginning and end of the sequence where the window extends beyond the sequence, the model will use a special padding token * to complete the window.

As an example, consider the amino acid sequence SIVAGYEVVGSSSASELLSAIEHVAEKA. To predict the secondary structure of the starting amino acid S, the model will use as input the window *******SIVAGYEV. For the second amino acid I, the model will use the window ******SIVAGYEVV. For the residue A at position 14, the model will use the window EVVGSSSASELLSAI. This process will continue until the model predicts the secondary structure of the last amino acid A using the window IEHVAEKA*******.

Therefore, the input to the model will be a sequence of 15 amino acids and the output will be the secondary structure of the central amino acid. The secondary structure of an amino acid is a categorical variable \(y\) with three classes: 0 (helix), 1 (strand), and 2 (other), so the model performs multi-class classification. The input is a sequence of characters, so it needs to be converted to numerical values. The simplest way to do this is one-hot encoding: each amino acid is represented by a vector of zeros with a one at the position corresponding to that amino acid. For example, if all amino acids (including the padding token *) are ordered as ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '*'], then the amino acid A is represented by the 21-dimensional vector [1, 0, ..., 0], the amino acid C by [0, 1, 0, ..., 0], and so on, and the padding token * by [0, 0, ..., 1]. Because the input is a sequence of 15 amino acids, the input \(x\) to the model is a vector of length \(15 \times 21 = 315\).
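To make the window construction and the one-hot encoding concrete, the cell below is a minimal, self-contained sketch, separate from the graded functions later in the notebook; the helper names window_around and one_hot_window are invented purely for this illustration.

[ ]:
## Illustrative sketch only: build a padded 15-residue window and one-hot encode it.
## The helper names `window_around` and `one_hot_window` are invented for this example.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY*"  ## 20 amino acids plus the padding token

def window_around(seq, i, half_width=7):
    """Return the 15-residue window centered at position i (0-based), padded with '*'."""
    padded = "*" * half_width + seq + "*" * half_width
    return padded[i : i + 2 * half_width + 1]

def one_hot_window(window):
    """Concatenate the 21-dimensional one-hot vectors of the 15 residues (15 x 21 = 315)."""
    x = np.zeros((len(window), len(AMINO_ACIDS)))
    for j, aa in enumerate(window):
        x[j, AMINO_ACIDS.index(aa)] = 1.0
    return x.reshape(-1)

example_seq = "SIVAGYEVVGSSSASELLSAIEHVAEKA"
print(window_around(example_seq, 0))                        ## *******SIVAGYEV
print(one_hot_window(window_around(example_seq, 0)).shape)  ## (315,)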

The model will be a softmax regression model, as covered in the lectures.
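Concretely, writing \(\theta_k\) for the weight vector of class \(k\) (a constant bias feature will be appended during data processing below, giving 316 inputs in total), softmax regression models the class probabilities as

\[
P(y = k \mid x) = \frac{\exp\left(\theta_k^\top x\right)}{\sum_{j=0}^{2} \exp\left(\theta_j^\top x\right)}, \qquad k \in \{0, 1, 2\},
\]

and the parameters are fit by minimizing the mean negative log-likelihood of the training windows, as defined later in this notebook.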

[19]:
## Importing the necessary packages
## If any package is missing from your uv project, remember to add it using `uv add <package-name>`

import numpy as np
import jax.numpy as jnp
import jax
from tqdm import tqdm
import optax
import chex
import optax.tree_utils as otu
from typing import NamedTuple

jax.config.update("jax_enable_x64", True)

Process the training data#

Given the design of the model, the data in train.txt are not directly suitable for training. The data need to be processed to create the input and output pairs for the model. For each sequence in the training data, you need to extract all windows of 15 amino acids and the corresponding secondary structure of the central amino acid. Then you need to convert the amino acids to one-hot encoding. Take the sequence SIVAGYEVVGSSSASELLSAIEHVAEKA and its secondary structure 2111111111222000000000220000 as an example. The first window is *******SIVAGYEV and the corresponding secondary structure of the central amino acid S is 2. The second window is ******SIVAGYEVV and the secondary structure of the central amino acid I is 1. This process will continue until the last window IEHVAEKA******* and the secondary structure of the central amino acid A is 0. Each window and the corresponding secondary structure will be an input-output pair for the model. You need to process all sequences in the training data to create the input-output pairs.

[ ]:
## You need to download the files `train.txt` and `test.txt` mentioned above
## and place them in a directory called `data` under your uv project directory

path_to_train = "./data/train.txt"


## read data from train.txt
train_seq = {}

## finish the code below such that train_seq is a dictionary with the following structure:
## train_seq[protein_name] = (sequence, secondary_structure),
## where protein_name is the name of the protein, sequence is the amino acid sequence of the protein, and secondary_structure is the secondary structure of the protein as given in the train.txt file


with open(path_to_train, "r") as f:
    ############################################
    #### write your code (5 points) #######
    ############################################
    None

[21]:
def get_windows(seq, secondary_structure):
    """Extract windows of size 15 from the sequence and secondary structure of a protein

    Args:
        seq: str, the amino acid sequence of the protein
        secondary_structure: str, the secondary structure of the protein

    Returns:
        list of tuples: each tuple contains (x, y), where x (15x21=315 dimensional) is the one-hot encoding of the amino acid sequence of length 15, and y is the secondary structure of the central amino acid in the window
    """

    ## amino acid order for one-hot encoding
    amino_acids = "ACDEFGHIKLMNPQRSTVWY*"

    ############################################
    #### write your code (15 points) ############
    ############################################

    return None

Code in the cell below uses the function get_windows to process every sequence in the training data and stacks the x and y values into the NumPy arrays train_xs and train_ys.

[ ]:
#############################################################################
#### The following code is provided to you and you don't need to modify it
#### However, you need to understand it to complete the assignment
#############################################################################

## get windows for all proteins in the training set
## stack all windows in train_xs and train_ys

train_xs = []
train_ys = []

for name, (seq, ss) in tqdm(train_seq.items()):
    windows = get_windows(seq, ss)
    for x, y in windows:
        train_xs.append(x)
        train_ys.append(y)

train_xs = np.array(train_xs)
train_ys = np.array(train_ys, dtype=np.int64)

train_xs = np.hstack((np.ones((train_xs.shape[0], 1)), train_xs))
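## the column of ones prepended above is a constant bias (intercept) feature: 315 one-hot entries + 1 bias = 316 per sample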

assert train_xs.shape[0] == train_ys.shape[0]
assert train_xs.shape[1] == 315 + 1
assert np.all((train_xs == 0) | (train_xs == 1))

Code in the following cell builds an optimizer, fmin_lbfgs, using the optax library. You do not need to modify this code. The training step below will use this optimizer.

[23]:
def _run_opt(init_params, fun, opt, max_iter, tol):
    value_and_grad_fun = optax.value_and_grad_from_state(fun)

    def step(carry):
        params, state = carry
        value, grad = value_and_grad_fun(params, state=state)
        updates, state = opt.update(
            grad, state, params, value=value, grad=grad, value_fn=fun
        )
        params = optax.apply_updates(params, updates)
        return params, state

    def continuing_criterion(carry):
        _, state = carry
        iter_num = otu.tree_get(state, "count")
        grad = otu.tree_get(state, "grad")
        err = otu.tree_l2_norm(grad)
        return (iter_num == 0) | ((iter_num < max_iter) & (err >= tol))

    init_carry = (init_params, opt.init(init_params))
    final_params, final_state = jax.lax.while_loop(
        continuing_criterion, step, init_carry
    )
    return final_params, final_state


class _InfoState(NamedTuple):
    iter_num: chex.Numeric


def _print_info():
    def init_fn(params):
        del params
        return _InfoState(iter_num=0)

    def update_fn(updates, state, params, *, value, grad, **extra_args):
        del params, extra_args

        jax.debug.print(
            "Iteration: {i:>5}, Value: {v:>6.3e}, Gradient norm: {e:>6.3e}",
            i=state.iter_num,
            v=value,
            e=otu.tree_l2_norm(grad),
        )
        return updates, _InfoState(iter_num=state.iter_num + 1)

    return optax.GradientTransformationExtraArgs(init_fn, update_fn)


def fmin_lbfgs(initial_params, fun, max_iter, tol):
    opt = optax.chain(_print_info(), optax.lbfgs())
    return _run_opt(initial_params, fun, opt, max_iter, tol)

Define the loss function for softmax regression#

In the following, you need to complete a few functions that compute the loss for the softmax regression model. The parameter of the loss function loss_fn is theta, which has shape (2, 316) even though there are three classes. The first row of theta holds the weights for class 0 and the second row the weights for class 1. Why is there no row for class 2? Because of the degeneracy of the softmax function: the softmax is invariant to adding the same constant to all scores, so one of the scores can be fixed to zero. Here we fix the score of the last class to zero by setting its theta parameters to zero, which is why theta has shape (2, 316) rather than (3, 316). Given a theta of shape (2, 316), we can always append a row of zeros to recover the (3, 316) form.
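For reference, with the appended row of zeros (so that \(\theta_2 = 0\)), the log-likelihood of a single sample \((x, y)\) under the softmax model is

\[
\log P(y \mid x) = \theta_y^\top x - \log \sum_{k=0}^{2} \exp\left(\theta_k^\top x\right),
\]

which is the quantity that log_likelihood_per_sample should return; the second term is a log-sum-exp, which is where the hint about jax.scipy.special.logsumexp in the cell below comes in.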

[24]:
def expand_theta(theta):
    """Expand theta by appending a row of zeros at the end

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)

    Returns:
        jnp.ndarray: the expanded theta with shape (3, 316) where the last row is all zeros

    """

    ############################################
    #### write your code (5 points) ############

    return None


def log_likelihood_per_sample(theta, x, y):
    """Compute the log-likelihood of a single sample

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)
        x: jnp.ndarray, the input features of one sample
        y: int, the target label of the sample

    Returns:
        float: the log-likelihood of the sample

    """

    ############################################
    #### write your code (20 points) ############
    ############################################

    #### Hint: use the function jax.scipy.special.logsumexp

    return None


def log_likelihood_data(theta, xs, ys):
    """Compute the log-likelihood of the data by using log_likelihood_per_sample and jax.vmap

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)
        xs: jnp.ndarray, the input features of all samples with shape (n, 316)
        ys: jnp.ndarray, the target labels of all samples with shape (n,)

    Returns:
        jnp.ndarray: the log-likelihood of all samples with shape (n,)

    """

    ############################################
    #### write your code (5 points) ############
    ############################################

    return None


def loss_fn(theta):
    """the loss function is defined as the mean negative log-likelihood of the samples,

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)

    Returns:
        float: the mean negative log-likelihood of the samples

    """

    ############################################
    #### write your code (5 points) ############
    ############################################

    return None

Train the model#

After the data are processed properly and the loss function is defined, it is time to train the model. Thanks to the automatic differentiation provided by JAX, you do not need to write the function to compute the gradient of the loss function. You only need to provide the loss function to the optimizer along with the initial parameters and stopping criteria.
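If you have not used JAX's automatic differentiation before, the standalone toy example below (unrelated to the assignment data; the function name toy_loss is made up for this illustration) shows the basic pattern: jax.grad takes a scalar-valued function and returns a new function that evaluates its gradient.

[ ]:
## Standalone illustration of jax.grad; not needed for the assignment itself
import jax
import jax.numpy as jnp

def toy_loss(w):
    ## a simple quadratic: 0.5 * ||w - 1||^2
    return 0.5 * jnp.sum((w - 1.0) ** 2)

w0 = jnp.zeros(3)
print(toy_loss(w0))            ## 1.5
print(jax.grad(toy_loss)(w0))  ## [-1. -1. -1.], i.e. w0 - 1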

[ ]:
## initialize theta with zeros
m = 3
theta_init = jnp.zeros((m - 1, train_xs.shape[1]))

## check the initial value of the loss function and the norm of the gradient at the initial point
print(
    f"Initial value: {loss_fn(theta_init):.2e} "
    f"Initial gradient norm: {otu.tree_l2_norm(jax.grad(loss_fn)(theta_init)):.2e}"
)

## optimize the loss function using L-BFGS
final_theta, _ = fmin_lbfgs(theta_init, loss_fn, max_iter=300, tol=1e-4)
print(
    f"Final value: {loss_fn(final_theta):.2e}, "
    f"Final gradient norm: {otu.tree_l2_norm(jax.grad(loss_fn)(final_theta)):.2e}"
)

Make predictions and compute the accuracy on the training data#

After the model is trained, you need to make predictions on the training data and compute the accuracy of the predictions. For each residue in the training data, you need to compute the probability of each class and assign the class with the highest probability as the predicted class. Then you need to compare the predicted classes with the true classes and compute the accuracy. The accuracy is the number of correct predictions divided by the total number of predictions.
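Note that the softmax applies a strictly increasing transformation (exponentiation followed by a shared normalization) to the class scores, so the most probable class is also the class with the largest score:

\[
\hat{y} = \arg\max_{k \in \{0,1,2\}} P(y = k \mid x) = \arg\max_{k \in \{0,1,2\}} \theta_k^\top x,
\]

with \(\theta_2 = 0\) as before.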

[26]:
def predict_per_sample(theta, x):
    """Predict the class of a single sample

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)
        x: jnp.ndarray, the input features of one sample

    Returns:
        int: the predicted class of the sample

    """

    ############################################
    #### write your code (5 points) ############
    ############################################

    return None


def predict(theta, xs):
    """Predict the classes of multiple samples by using predict_per_sample and jax.vmap

    Args:
        theta: jnp.ndarray, the parameters of the model with shape (2, 316)
        xs: jnp.ndarray, the input features of multiple samples with shape (n, 316)

    Returns:
        jnp.ndarray: the predicted classes of the samples with shape (n,)
    """

    ############################################
    #### write your code (5 points) ############
    ############################################

    return None


def compute_accuracy(y_pred, y_true):
    """Compute the accuracy of the model

    Args:
        y_pred: jnp.ndarray, the predicted classes of the samples with shape (n,)
        y_true: jnp.ndarray, the true classes of the samples with shape (n,)

    Returns:
        float: the accuracy of the model
    """

    ############################################
    #### write your code (5 points) ############
    ############################################

    return None


y_pred = predict(final_theta, train_xs)

Compute the accuracy of the model on the training set using both the initial theta and the final theta. With the initial theta (all zeros), every class receives the same score, so the predictions carry no information about the input; the accuracy should therefore be roughly at the level of an uninformed baseline, around a third, depending on how the three classes are distributed in the training data. The accuracy with the final theta should be significantly higher, because the model has learned weights that minimize the loss function on the training data.

[ ]:
y_pred_init = predict(theta_init, train_xs)
accuracy_init = compute_accuracy(y_pred_init, train_ys)

y_pred_final = predict(final_theta, train_xs)
accuracy_final = compute_accuracy(y_pred_final, train_ys)

print(f"Accuracy of the model with initial theta: {accuracy_init:.2f}")
print(f"Accuracy of the model with final theta: {accuracy_final:.2f}")

Make predictions on the test data#

After the model is trained, you need to make predictions on the test sequences provided in the file test.txt. Use the trained model to predict the secondary structure of all residues in the test sequences. Write the predictions to a text file named test_prediction.txt with the same format as train.txt: the protein name line starting with >, followed by the amino acid sequence line and the predicted secondary structure line.

[ ]:
###########################################################################
#### write your code here for the task described above (10 points) ########
###########################################################################


Model interpretation#

You have trained the softmax regression model to predict the secondary structure of proteins. Specifically, the model has learned the values of the parameter theta.

Question 1.

Suppose someone asks you, “Based on your model, what is the order of preference of the amino acids for the helix secondary structure?” How would you answer this question based on the values of the parameter theta?

Type your answer in the cell below.

Your answer to Question 1:

Question 2.

Even though the model was trained to predict the secondary structure of proteins from their amino acid sequences, is it possible to use the model to design protein sequences so that they have a specific secondary structure? If yes, how would you do it? If no, why not?

Type your answer in the cell below.

Your answer to Question 2:

Submission instructions#

  1. You need to create a folder named assignment-2-protein-secondary-structure under the OneDrive folder that has been shared with you.

  2. Complete the code and answer the questions as described above in this Jupyter notebook. Make sure to save your work and name your Jupyter notebook as main.ipynb.

  3. Upload the main.ipynb, pyproject.toml, and test_prediction.txt files to the assignment-2-protein-secondary-structure folder that you created in step 1.

Note: Please make sure the names of the folder and files are exactly as instructed.