Project Euler #300

In a very simplified form, we can consider proteins as strings consisting of hydrophobic (H) and polar (P) elements, e.g. HHPPHHHPHHPH. For this problem, the orientation of a protein is important; e.g. HPP is considered distinct from PPH. Thus, there are 2n distinct proteins consisting of n elements.

When one encounters these strings in nature, they are always folded in such a way that the number of H-H contact points is as large as possible, since this is energetically advantageous. As a result, the H-elements tend to accumulate in the inner part, with the P-elements on the outside. Natural proteins are folded in three dimensions of course, but we will only consider protein folding in two dimensions.

Assuming that H and P elements are equally likely to occur in any position along the string, the average number of H-H contact points in an optimal folding of a random protein string of length 8 turns out to be 850 / 28 = 3.3203125.

What is the average number of H-H contact points in an optimal folding of a random protein string of length 15? Give your answer using as many decimal places as necessary for an exact result.

This was a rather cool one. Unfortunately it required a brute force method. Nevertheless without some optimizations (or – in my case – lazy dumb ass parallelization) it was impossible to compute without waiting long hours. Running time on a single CPU – 10 hours. On a 96-core grid – 6.25 minutes ๐Ÿ˜‰ Oh, and I used my mpi::map() helper ๐Ÿ˜€


  • Hi, Stanisล‚aw. brute force how? For this problem I don’t even have an idea to write a brute force program. Thanks for any hint.

  • You can imagine every protein chain in this problem as a free (i.e. rotation/translation doesn’t make it a different one) polyomino. If you find a way to generate all free polyominos of length 15 you’re halfway there. Then you can make a list of contact points for every generated polyomino and brute force your way through every possible HP combination of length 15. This way you evaluate 2^15 (32768) HP combinations against contact point lists of ~12500 polyominos (if you do it the proper way, I couldn’t generate less than ~100000 polyominos). For every HP combination you take the polyomino with contact list that gave the best score and the average of those best scores is what you’re supposed to give as an answer. There is also a subtle optimization possible during the contact point lists evaluation step but it’s not crucial so I leave it for you to figure out ๐Ÿ˜‰

  • Hi thanks for the hints. About the 2^15, I think we can half them due to the fact that the best configuration for HPPPHP is also the best for its reverse counterpart PHPPPH.

    Just a small question for my full understanding, what do you mean by free polyaminos? Do you mean all the possible 2D configuration of a 15-letters string? If so, I think I can try dynamic programming to pick up those clump configs and drop all those straight lines.

    I have read a paper that uses SOMs to run 12000 trials on a 20 aa string. Only 4 of them yielded the optimal answer: 4.

    Thank you very much.

  • Now that I think of it – ideally I mean all the possible self-avoiding walks of length 15 that produce different sets of contact points. No matter how you generate them, the most important thing it to limit their number, because then you will be testing them against 2^15/2 (obviously you’re right about HPHP having the same best folding as PHPH, etc.) proteins.

  • Several days later, I finally have solved it in python. Too many hard code. Reading the threads in the forum, I could only understand their ideas, not their codes.

    The only thing that I can’t live with is, I have to first build up a coordinate system and then the contact list. Did you have used a coordinate system? If not, could you tell me how to avoid it? Thanks.

    So here is the coordinate part:

    import copy
    import math
    contact = []
    n = 15
    x_action = [1,-1,0,0]
    y_action = [0,0,1,-1]
    proteins = []
    c = []
    c = []
    c = [(0,3)]
    c = []
    c = []
    #print proteins
    for step in range(4,n):
        print step, len(proteins)
        lastAA = step - 1
        temp = []
        temp_c = []
        for i in range(len(proteins)):
            lastAA_pos = proteins[i][lastAA]
            for j in range(len(x_action)):
                tentative_pos_x = lastAA_pos[0] + x_action[j]
                tentative_pos_y = lastAA_pos[1] + y_action[j]
                #print (tentative_pos_x, tentative_pos_y)
                found = False
                temp_contact = copy.deepcopy(contact[i])
                for AA in proteins[i].keys():
                    if proteins[i][AA] == (tentative_pos_x, tentative_pos_y):
                        found = True
                    if math.sqrt((proteins[i][AA][0] - tentative_pos_x)**2 + (proteins[i][AA][1] - tentative_pos_y)**2) == 1 and AA != lastAA:
                        #print "found"
                if found == False and abs(tentative_pos_x) < 6 and abs(tentative_pos_y) < 5:
                    #print (tentative_pos_x, tentative_pos_y)
                    temp_protein = copy.deepcopy(proteins[i])
                    temp_protein[step] = (tentative_pos_x, tentative_pos_y)
                    #print temp_protein,proteins[i]
                    #print temp
        proteins = temp
        contact = temp_c
    print len(contact)
  • Glad you’ve made it. I’ve also used coordinates so can’t help you there. How much time does it take to compute the result using your code?

Leave a Reply

Your email address will not be published. Required fields are marked *