RNA Folding
You may be quite familiar with DNA and protein but less so with RNA. In biology, RNA is arguably even more important than either DNA or protein, because RNA can play both the “informational role” (which DNA plays) and the “functional role” (which protein plays):
| molecule | role | information (idea) | function (real work) |
|---|---|---|---|
| DNA | CEO | only | NO |
| RNA | directors | messenger RNA | non-coding RNA |
| protein | workers | NO | only |
As you can see, DNA (CEO) can only pass information but can’t do real work, while protein is the other way around. RNAs, as directors, can both pass information and do real work. Therefore, it is widely believed that life started with the versatile RNAs only (the RNA world hypothesis), and that DNA and protein evolved later to specialize in the informational and functional roles, respectively.
RNA is also important because our world has just been turned upside down by an RNA virus, SARS-CoV-2 (in fact, most viruses bothering us are RNA viruses, including those behind the common cold, flu, HIV, Ebola, rabies, etc.), and this pandemic was contained partly by messenger RNA (mRNA) vaccines, whose enabling discoveries won the 2023 Nobel Prize in Physiology or Medicine.
RNAs fold in nature by forming Watson-Crick base pairs (GC, CG, AU, UA) and Wobble base pairs (GU, UG), for instance in the following transfer RNA (tRNA) structure:
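An RNA is a sequence over the alphabet {A, C, G, U}, e.g., ACAGU. You want to predict its secondary structure, written as a string of the same length in which each character is (, ., or ). Here a dot . means this nucleotide is unpaired, ( means this nucleotide is paired with a later one, and ) means this nucleotide is paired with an earlier one, e.g., ((.)). The parentheses must be well balanced, and each matching pair must be either a Watson-Crick pair or a Wobble pair. The following are all valid structures for ACAGU: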
ACAGU
.....
...()
..(.)
.(.).
(...)
((.))
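The pairing rules above can be checked mechanically. The following is a minimal sketch, not part of the original notes; the names `ALLOWED` and `is_valid` are hypothetical, and it uses a stack to match each `)` with the most recent open `(`.

```python
# Allowed pairs: Watson-Crick (GC, CG, AU, UA) plus wobble (GU, UG).
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def is_valid(seq, struct):
    """Return True if struct is a well-formed dot-bracket structure for seq."""
    if len(seq) != len(struct):
        return False
    stack = []  # positions of '(' still waiting for a partner
    for j, c in enumerate(struct):
        if c == "(":
            stack.append(j)
        elif c == ")":
            if not stack:
                return False  # ')' with no matching '('
            i = stack.pop()
            if seq[i] + seq[j] not in ALLOWED:
                return False  # matched positions are not an allowed pair
        elif c != ".":
            return False  # unknown character
    return not stack  # every '(' must have been closed


structures = [".....", "...()", "..(.)", ".(.).", "(...)", "((.))"]
print(all(is_valid("ACAGU", s) for s in structures))  # True
```

A structure such as `(.)..` is rejected for ACAGU because it would pair A with A.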
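Caveat: in real biology and in the CS textbooks that cover this problem (such as the KT book/slides), there is another “no sharp-turn” constraint, which says that if positions $i$ and $j$ are paired, then $i < j - 4$ (at least four bases lie between them). That constraint is not enforced in the examples here, which contain adjacent pairs such as ().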
Given an RNA sequence, find its best structure, i.e., the one with the most matched pairs. In the example above, ((.)) is the best structure for ACAGU, with 2 pairs. But for GCACG, the best is ().(), also with 2 pairs. Other (bigger) examples (tie-breaking is arbitrary):
GUUAGAGUCU
(.()((.))) # 4 pairs
AUAACCUUAUAGGGCUCUG
.(((..)()()((())))) # 8 pairs
AACCGCUGUGUCAAGCCCAUCCUGCCUUGUU
(((.(..(.((.)((...().))())))))) # 11 pairs
Actually the RNA folding problem is equivalent to the famous context-free parsing problem (in natural language processing or NLP), and is closely related to the following classical problems:
All these five (5) problems are famous DP instances in hypergraphs rather than graphs. Some textbooks (such as KT) call these problems “DP over intervals” (like spans $(i, j)$ here).
Now let’s solve it by DP. The subproblem in this DP is the best substructure of a span $(i, j)$, i.e., of the subsequence from position $i$ to position $j$, inclusive.
Now we can decompose the span $(i, j)$ in two different ways.
The first way is known as CKY (aka CYK) (~1964) from NLP. There are two cases:
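xxxxxxxx = (xxxxxx) | xxxx xxxx
i      j   i      j   i  k    j
Therefore, writing $\mathrm{best}(i, j)$ for the maximum number of pairs in span $(i, j)$ of sequence $x$:
$$\mathrm{best}(i, j) = \max \begin{cases} \mathrm{best}(i+1, j-1) + 1 & \text{if } x_i x_j \text{ is an allowed pair} \\ \max_{i \le k < j} \; \mathrm{best}(i, k) + \mathrm{best}(k+1, j) \end{cases}$$
Base cases:
$$\begin{aligned} \mathrm{best}(i, i) &= 0 \\ \mathrm{best}(i, i-1) &= 0 \end{aligned}$$
While the first line of base case (singletons) is intuitive, the second line (empty span) seems weird. Is it really necessary? Yes, for example in the first case when $j = i+1$ (an adjacent pair such as ()), the inner span $(i+1, j-1) = (i+1, i)$ is empty.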
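Interestingly, many years after the NLP field invented the CKY algorithm in the 1960s, the computational biology field independently invented something very similar called the Nussinov algorithm (1978) (important things always get re-invented many times; CKY itself was independently invented by C, K, and Y). Nussinov, however, has a slightly different way of decomposing span $(i, j)$: either the last position $j$ is unpaired, or it is paired with some position $k$:
xxxxxxxx = xxxxxxx. | xxxx(xxx)
i      j   i      j   i   k   j
Therefore (again with $\mathrm{best}(i, j)$ the maximum number of pairs in span $(i, j)$ of sequence $x$):
$$\mathrm{best}(i, j) = \max \begin{cases} \mathrm{best}(i, j-1) \\ \max_{i \le k < j,\; x_k x_j \text{ allowed}} \; \mathrm{best}(i, k-1) + \mathrm{best}(k+1, j-1) + 1 \end{cases}$$
Notice that in the second clause, either or both of the subproblems $(i, k-1)$ and $(k+1, j-1)$ can be empty: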
(xxxxxx)    # left subproblem empty
i=k    j
xxxxxx()    # right subproblem empty
i     kj
()          # both subproblems empty
ij
If you compare these two decomposition schemes, you can find a duality:
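Both are $O(n^3)$ time and $O(n^2)$ space.
Which one is better? Well, either one is good enough for the best structure problem. CKY is a bit easier to understand and is aesthetically more pleasing (thanks to symmetry), but Nussinov is a bit faster in practice since the split point $k$ is restricted to positions that can pair with $x_j$ (e.g., if $x_j$ is G, then only a C or U at position $k$ can match it).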
However, we will see below that they make a huge difference in the next two problems.
You can implement either CKY or Nussinov in either top-down or bottom-up style. Which order is easier? Well, as discussed in the DP chapter, top-down can figure out the topological order for you, but bottom-up (nested loops) requires a predefined topological order. What’s a good topological order for this RNA folding problem?
Note that we have three variables: $i$, $j$, and $k$. A first attempt at bottom-up loops might be:
for i = 1 to n-1
    for j = i+1 to n
        for k = i to j-1
            ...
Is this correct? Of course not! These loops traverse spans in this order:
(1,2) (1,3) ... (1,n) <-- this is the whole problem!
(2,3) ... (2,n)
(3,4) ... (3,n)
...
(n-1,n) <-- this is a smallest span!
Clearly, the whole problem (1, n) depends on (almost) all other subproblems and should be the last subproblem to be solved, but in the above order it is attempted far too early. If you use CKY and split (1, n) into (1, k) + (k+1, n), none of the (k+1, n) spans are ready yet!
The above order is ridiculous. Instead, you should start from the smallest spans (length 2) and go all the way to the largest span (length $n$):
for span = 2 to n
    for i = 1 to n-span+1
        j = i + span - 1
        for k = i to j-1
            ...
The order of spans is now:
(1,2) (2,3) (3,4) ... (n-1,n) -- spans of length 2
(1,3) (2,4) ... (n-2,n) -- spans of length 3
(1,4) ... (n-3,n) -- spans of length 4
... ...
(1,n) -- spans of length n
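The length-first order can be generated directly. This is a small illustrative sketch (the helper name `spans_by_length` is hypothetical) that yields the 1-based inclusive spans in exactly the order listed above, so every span is visited only after all of its strictly shorter subspans.

```python
def spans_by_length(n):
    """Yield 1-based inclusive spans (i, j), shortest first."""
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            yield (i, i + length - 1)


print(list(spans_by_length(4)))
# [(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)]
```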
Here is an example for CKY:
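The original listing is not available here, so what follows is a minimal bottom-up sketch of the CKY-style decomposition under stated assumptions: the names `ALLOWED` and `best_cky` are hypothetical, spans are 0-based and inclusive, and a `defaultdict` supplies the value 0 for both base cases (singletons and empty spans).

```python
from collections import defaultdict

ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def best_cky(seq):
    """Maximum number of pairs, using the CKY-style decomposition."""
    n = len(seq)
    # opt[i, j] for the 0-based inclusive span (i, j). Missing keys default
    # to 0, covering singletons (i, i) and empty spans (i, i-1).
    opt = defaultdict(int)
    for length in range(2, n + 1):          # shortest spans first
        for i in range(n - length + 1):
            j = i + length - 1
            best = 0
            # Case 1: x_i pairs with x_j, wrapping the inner span (i+1, j-1).
            if seq[i] + seq[j] in ALLOWED:
                best = opt[i + 1, j - 1] + 1
            # Case 2: split into (i, k) and (k+1, j).
            for k in range(i, j):
                best = max(best, opt[i, k] + opt[k + 1, j])
            opt[i, j] = best
    return opt[0, n - 1]


print(best_cky("ACAGU"))  # 2
print(best_cky("GCACG"))  # 2
```

This version returns only the score; recovering a structure would additionally need back-pointers.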
Here is another example for Nussinov, including backtracing for optimal structure:
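The original listing is not available here, so the following is a hedged sketch of the Nussinov-style decomposition with back-pointers. The names `best_nussinov`, `opt`, `back`, and `build` are hypothetical, and the code uses 0-based half-open spans `seq[i:j]` (an implementation choice, not the notation of the text) so that empty spans need no special casing.

```python
ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def best_nussinov(seq):
    """Return (max number of pairs, one optimal dot-bracket structure)."""
    n = len(seq)
    # opt[i][j] scores the half-open span seq[i:j]; empty spans have i == j.
    opt = [[0] * (n + 1) for _ in range(n + 1)]
    # back[i][j] is the partner k of the last base, or None if it is unpaired.
    back = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Case 1: the last base seq[j-1] is unpaired.
            opt[i][j] = opt[i][j - 1]
            # Case 2: seq[k] pairs with the last base seq[j-1].
            for k in range(i, j - 1):
                if seq[k] + seq[j - 1] in ALLOWED:
                    score = opt[i][k] + opt[k + 1][j - 1] + 1
                    if score > opt[i][j]:
                        opt[i][j] = score
                        back[i][j] = k

    def build(i, j):
        """Rebuild one optimal structure for seq[i:j] from the back-pointers."""
        if i == j:
            return ""
        k = back[i][j]
        if k is None:
            return build(i, j - 1) + "."
        return build(i, k) + "(" + build(k + 1, j - 1) + ")"

    return opt[0][n], build(0, n)


score, structure = best_nussinov("GUUAGAGUCU")
print(score, structure)  # score is 4; the structure depends on tie-breaking
```

Because ties are broken arbitrarily, the returned structure may differ from the ones shown in the examples while having the same number of pairs.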
Now we switch gears to counting the total number of possible
structures for a given sequence. For example, here
total(ACAGU)
should return 6:
ACAGU
.....
...()
..(.)
.(.).
(...)
((.))
This problem is very similar to the 1-best structure above, and you can solve it by replacing $\max$ with $\sum$ and $+$ with $\times$ in the recurrence.
But the big question is, can both CKY and Nussinov work for this problem?
Well, CKY doesn’t, due to overcounting. For example, there are at least two ways of deriving ()()():
()()() = ()()()
         ik   j
  () = ()
       ij
  ()() = ()()
         ik j
    () = ()
         ij
    () = ()
         ij

()()() = ()()()
         i  k j
  ()() = ()()
         ik j
    () = ()
         ij
    () = ()
         ij
  () = ()
       ij
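The overcounting can be observed directly. The sketch below (the name `naive_cky_total` is hypothetical) sums over CKY derivations with base cases equal to 1; since one structure can have several derivations, the result exceeds the number of distinct structures.

```python
from functools import lru_cache

ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def naive_cky_total(seq):
    """Count CKY derivations; this is NOT the number of distinct structures."""

    @lru_cache(maxsize=None)
    def count(i, j):  # 0-based inclusive span (i, j)
        if j <= i:  # singleton (j == i) or empty span (j == i - 1)
            return 1
        total = 0
        # Case 1: x_i pairs with x_j around the inner span.
        if seq[i] + seq[j] in ALLOWED:
            total += count(i + 1, j - 1)
        # Case 2: every split point k is counted separately.
        for k in range(i, j):
            total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(seq) - 1)


print(naive_cky_total("AAA"))    # 2, although ... is the only structure
print(naive_cky_total("ACAGU"))  # larger than the true count of 6
```

Even the all-dots structure is counted once per way of splitting it, which is why the numbers inflate so quickly.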
But in Nussinov, there is a unique (and shorter) derivation for this structure (and every structure):
()()() = ()()()
         i   kj
  ()() = ()()
         i kj
    () = ()
         kj (i=k)
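So we write the Nussinov equation for counting, with $\mathrm{total}(i, j)$ the number of structures for span $(i, j)$ of sequence $x$:
$$\mathrm{total}(i, j) = \mathrm{total}(i, j-1) + \sum_{i \le k < j,\; x_k x_j \text{ allowed}} \mathrm{total}(i, k-1) \cdot \mathrm{total}(k+1, j-1)$$
Note that the base cases are now $1$ instead of $0$, i.e., $\mathrm{total}(i, i) = \mathrm{total}(i, i-1) = 1$, since a singleton or an empty span has exactly one structure.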
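As a sketch of Nussinov-style counting (an illustration, not the original listing), the function below is written top-down with memoization, so no explicit span order is needed. The inner helper `count` and the half-open span convention `seq[i:j]` are assumptions of this sketch.

```python
from functools import lru_cache

ALLOWED = {"GC", "CG", "AU", "UA", "GU", "UG"}


def total(seq):
    """Number of distinct valid structures for seq (Nussinov decomposition)."""

    @lru_cache(maxsize=None)
    def count(i, j):  # half-open span seq[i:j]
        if j - i <= 1:  # empty span or singleton: exactly one structure
            return 1
        # Case 1: the last base seq[j-1] is unpaired.
        result = count(i, j - 1)
        # Case 2: seq[k] pairs with seq[j-1]; the two sides combine freely.
        for k in range(i, j - 1):
            if seq[k] + seq[j - 1] in ALLOWED:
                result += count(i, k) * count(k + 1, j - 1)
        return result

    return count(0, len(seq))


print(total("ACAGU"))  # 6
```

Each structure is counted exactly once because the last base is either unpaired or paired with exactly one position $k$. For long sequences the recursion depth grows with the sequence length, so a bottom-up table may be preferable.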
We now want the top-$k$ structures for a given sequence.
You can use either CKY or Nussinov to solve the
First, we need to replace the
Or if we expand the second clause:
Before we tackle this
Just like CKY/Nussinov are generalizations of Viterbi from DAGs to
DAHs (directed acyclic hypergraphs), now let’s generalize
Example: