I. Some observations/facts about diets and insertion into diets:
There are essentially three different cases that can occur
during an insertion of an integer x into a diet d:
(A) x is not adjacent to any other interval
=> Simply create a node [x,x] at the appropriate place in d
(B) x is adjacent to exactly one interval [i,j] (ie, x=i-1 or x=j+1)
=> Replace [i,j] by [x,j] or [i,x]
(C) x is adjacent to two intervals [i,j] and [i',j'].
Assume wlog that i* Replace one interval (the one that is located higher
in d) by [i,j'] and delete the other.
II. In the sequel I will use the following Haskell classes and types:
class Ix a => Discrete a where
pre, suc :: a -> a
adjacent :: a -> a -> Bool
adjacent x y = suc x==y
instance Discrete Int where
pre x = x-1
suc x = x+1
data Discrete a => Diet a = Empty | Node (Diet a) (a,a) (Diet a)
deriving (Eq)
{- for better debugging:
showsDiscrete :: (Show a,Discrete a) => (a,a) -> ShowS
showsDiscrete (i,j) = (" ["++) . shows i . (',':) . shows j . ("] "++)
showsDiet :: (Show a,Discrete a) => Diet a -> ShowS
showsDiet Empty = id
showsDiet (Node l i r) =
('(':) . showsDiet l . showsDiscrete i . showsDiet r . (')':)
instance (Show a,Discrete a) => Show (Diet a) where
showsPrec _ d = showsDiet d
-}
III. Derivation of insert in several steps:
(1) Formulation as two completely separate phases: insert & regroup
(2) Fusion of insert & regroup (to eliminate duplicate path traversal)
(3) Split regroup into two functions (to eliminate redundant tests)
(4) Fusion of auxiliary functions (to eliminate duplicate path traversal)
(1) First, insert is defined almost like for ordinary binary search trees:
insert :: Discrete a => a -> Diet a -> Diet a
insert x Empty = Node Empty (x,x) Empty
insert x d@(Node l (i,j) r) =
if x**j then Node l (i,j) (insert x r)
else d
Now regroup has to find out whether the newly inserted interval [x,x] is
adjacent to none, one, or two intervals, and it has to perform the appropriate
reorganization.
Important for the further development is the following
CLAIM: If regroup does not take the last inserted value x as a
parameter, its running time cannot be guaranteed to be
bounded by the height of the diet.
Actually, regroup follows (at least) the path that was taken by insert x
to find the possibly adjacent intervals. Without guidance by x regroup had
to branch at each node into two directions which would mean in the worst
case a scan of the complete diet. regroup works as follows: if x is found
without encountering an adjacent interval, the diet is left unchanged (case
(A), last "else"). Once an adjacent adjacent interval (say, [i,j] with
x+1=i) is found, it must be tested whether case (B) or (C) is present.
Therefore, the (old) predecessor of [i,j] is sought (note: the current
predecessor is [x,x]). Such a predecessor exists only if the right subtree
of l is not empty (note: l is definitely not empty because it contains at
least the interval [x,x]). If a predecessor [li,lj] exists, and if it is
also adjacent to x, it can be fused with [i,j] to [li,j], and [li,lj] and
[x,x] must be deleted from l. Otherwise, [x,x] (which is the maximum
interval in l) must be deleted and [i,j] is just extended by one element to
the left, ie, [x,j]. (The case for x>j is analogous.)
regroup :: Discrete a => a -> Diet a -> Diet a
regroup x Empty = Empty
regroup x d@(Node l (i,j) r) =
if x** l /= Empty
if adjacent x i then
if right l /= Empty && adjacent lj x then
Node (delPredMax l) (li,j) r
else
Node (delMax l) (x,j) r
else
Node (regroup x l) (i,j) r else
if x>j then -- => r /= Empty
if adjacent j x then
if left r /= Empty && adjacent x ri then
Node l (i,rj) (delSuccMin r)
else
Node l (i,x) (delMin r)
else
Node l (i,j) (regroup x r)
else d
where (li,lj) = predMax l
(ri,rj) = succMin r
The used auxiliary functions are:
left (Node l _ _) = l
right (Node _ _ r) = r
delMin :: Discrete a => Diet a -> Diet a
delMin (Node Empty _ r) = r
delMin (Node l i r) = Node (delMin l) i r
delSuccMin :: Discrete a => Diet a -> Diet a
delSuccMin (Node (Node Empty _ Empty) _ r) = r
delSuccMin (Node l i r) = Node (delSuccMin l) i r
succMin :: Discrete a => Diet a -> (a,a)
succMin (Node (Node Empty _ Empty) i _) = i
succMin (Node l _ _) = succMin l
delMax :: Discrete a => Diet a -> Diet a
delMax (Node l _ Empty) = l
delMax (Node l i r) = Node l i (delMax r)
delPredMax :: Discrete a => Diet a -> Diet a
delPredMax (Node l _ (Node Empty _ Empty)) = l
delPredMax (Node l i r) = Node l i (delPredMax r)
predMax :: Discrete a => Diet a -> (a,a)
predMax (Node _ i (Node Empty _ Empty)) = i
predMax (Node _ _ r) = predMax r
Now insertion can be simply defined as:
insertIt :: Discrete a => a -> Diet a -> Diet a
insertIt x d = regroup x (insert x d)
(2) It is striking in (1) that regroup initially follows the same path as
taken by insert x. The actual work of regroup begins not before an
adjacent interval has been found. It therefore seems to be reasonable to
move the first/outer adjacency tests from regroup to insert. This makes the
whole program more efficient, and it simplifies regroup a bit while insert
is still easy to understand.
insert x Empty = Node Empty (x,x) Empty
insert x d@(Node l (i,j) r) =
if x**j then
if adjacent j x then regroup x (Node l (i,x) r)
else Node l (i,j) (insert x r)
else d
In contrast to the approach (1), regroup does not see a diet in which
[x,x] has already been inserted. Instead, if x is adjacent to an interval,
this is the current interval which has already been extended by insert.
So the side of regrouping is determined by comparing x with i and j.
regroup x d@(Node l (i,j) r) =
if x==i then -- regroup left
if l /= Empty then
if adjacent lj x then
Node (delMax l) (li,j) r
else
Node l (x,j) r
else
d else
if x==j then
if r /= Empty then
if adjacent x ri then
Node l (i,rj) (delMin r)
else
Node l (i,x) r
else
d
else d
where (li,lj) = findMax l
(ri,rj) = findMin r
The fact that [x,x] is not in the diet to be regrouped also simplifies
the needed auxiliary functions. We need delMin and delMax from above and
the two simple functions:
findMin :: Discrete a => Diet a -> (a,a)
findMin (Node Empty i _) = i
findMin (Node l _ _) = findMin l
findMax :: Discrete a => Diet a -> (a,a)
findMax (Node _ i Empty) = i
findMax (Node _ _ r) = findMax r
(3) We see that in the above version of regroup x is only needed to find
out the proper side for regrouping. Since we have this information already
available in insert, we can fuse one step further by splitting regroup into
two functions joinLeft and joinRight. This also allows to replace the
second conditional by pattern matching. So instead of regroup we get the
following two functions:
joinLeft d@(Node Empty _ _) = d
joinLeft d@(Node l (i,j) r) =
if adjacent lj i then Node (delMax l) (li,j) r
else Node l (i,j) r
where (li,lj) = findMax l
joinRight d@(Node _ _ Empty) = d
joinRight d@(Node l (i,j) r) =
if adjacent j ri then Node l (i,rj) (delMin r)
else Node l (i,j) r
where (ri,rj) = findMin r
and insert has to be slightly changed to:
insert x Empty = Node Empty (x,x) Empty
insert x d@(Node l (i,j) r) =
if x**j then
if adjacent j x then joinRight (Node l (i,x) r)
else Node l (i,j) (insert x r)
else d
(4) A final source of inefficiency in (3) is that in case (C) the path to
the second adjacent interval is traversed twice (by the auxiliary functions
findMin and delMin, for instance). But these can be fused together into
one auxiliary function splitMin that retrieves and deletes the predecessor
interval (same for findMax, etc.) in one run. Thus, we finally obtain the
following version: insert remains unchanged, and joining and splitting is
defined as follows.
splitMin :: Discrete a => Diet a -> (Diet a,(a,a))
splitMin (Node Empty i r) = (r,i)
splitMin (Node l i r) = (Node l' i r,i') where (l',i') = splitMin l
splitMax :: Discrete a => Diet a -> (Diet a,(a,a))
splitMax (Node l i Empty) = (l,i)
splitMax (Node l i r) = (Node l i r',i') where (r',i') = splitMax r
joinLeft d@(Node Empty _ _) = d
joinLeft (Node l (i,j) r) =
if adjacent lj i then Node l' (li,j) r
else Node l (i,j) r
where (l',(li,lj)) = splitMax l
joinRight d@(Node _ _ Empty) = d
joinRight (Node l (i,j) r) =
if adjacent j ri then Node l (i,rj) r'
else Node l (i,j) r
where (r',(ri,rj)) = splitMin r
In my opinion this last version looks quite elegant. It is reasonably
concise and it is, in particular, the most efficient one. I don't believe
there is a simpler solution that is as efficient.
*