CIS 501
Spring 2003
Assignment 02L - Lessons to be Learned from Assignment 2
Assignment 2 (the merge-sort part) is a pretty difficult, complex
assignment.
The final code is not long, but it IS complicated. I just want to take
a few minutes here and point out some issues related to this assignment.
You may find this useful in guiding your study for the exam. Consider these
issues as things to remember for the future.
-
This is an assignment in which group discussion and group work can be highly
beneficial. So … if you are not in a group, I will help you get into one.
Let me know by e-mail, as soon as possible. If you have changed your group
in any way from the last time, also let me know.
-
We always run around like lunatics preaching top-down approaches to programming.
But the top-down approach does not work so well in this case. In fact,
it is often the case that the top-down approach, by itself, is difficult
to use. In this problem, and others, unless you have a clear picture of
how the bottom, most detailed pieces of a program will work, it is nearly
impossible to decide what the upper pieces will look like.
We started out today taking a top-down approach, writing out, in English,
what needed to be done in the outer two loops of the sort-merge function.
But then we spent the rest of the class talking about the lowest level
code – the function (called merge_two_runs) that merges two runs of length
k (one from each input file) and writes the resulting run to an output
file. If you take the time to fully understand how this function works,
I would bet that the remaining work needed to finish the entire sort-merge
problem will be minimal.
-
As we began working on the merge_two_runs function, we constantly kept
a focus on a sample set of data, partially sorted (worked through two Phases)
and we examined what needed to be done to get Phase 3 working. We started
small, with the very first step needed – namely reading a data item from
each of the two input files, and then comparing these two data items. As
we went along, we
-
carefully identified any input or output arguments we needed and wrote
down descriptive information about these arguments (we also named these
arguments);
-
named and described any local data (such as the counts) we would need in
writing the function;
-
mapped out in pseudo code descriptions of what needed to be done at each
step, using the most precise descriptions we could produce and leaving
some details (such as the copy of a run and the test to see if we were
at the end of a run) to be dealt with later;
-
traced our path through the merge-sort of one pair of runs, refining and
adjusting our algorithm as we went along.
-
When we felt confident that the basic algorithm for merge_two_runs looked
pretty good, we then began to look at the test for end of run and the copy
operations. We actually figured out the test for end of run mechanism,
but not the copy. Note that these two pieces of the sort-merge problem
would fall below the merge_two_runs subproblem. In other words, they are
subproblems of the subproblem.
-
(continuation of 4) If I asked you now to draw a hierarchy chart, the sort-merge
function problem would appear at the top. Underneath would be an initialization
box (including a distribution function that would appear underneath the
initialization box), a termination box, and a process box. The process
box, which contains steps to be repeated, might read something like this:
for k = 1 to k == log2n, repeat Phase k: merge runs of length
2 k-1 into runs of length 2k.
Underneath this step is another step to be repeated and it is
essentially
the merge_two_runs step. Under this step is at the very least, two more
steps: determine when the end of a run is reached, and copy the remaining
run to the output file. YOU SHOULD DRAW YOURSELF A COPY OF THIS HIERARCHY
CHART AND ADD WHATEVER ITEMS YOU SEE FIT TO ADD.
- FINALLY: Whenever you program, you should always try to anticipate
what we sometimes call the "end point cases" for your algorithms. For example,
in the handout e-mailed to you yesterday, it was suggested (see item 2ii)
that power_c and power_d should be equal. Well, there may be an end point
case for which this is not true. Consider for example, the case where count_c
= 16 and count_d = 17. Compute power_c and power_d. In theory, one of these
values is 4 and the other is 5. Which value do we want in this case? What
values of number and n do we need in this case? (Hint: n should be 64).
How would you rewrite the note at the end of 2ii to correct this problem?