Make Mine MAPPER #9
-------------------
by Rob Haeuser

Database Design: the Secret to Success
---------------------------------------------------------------

For smaller applications, an inefficient database design may hardly be noticeable to the few people using it. An occasional wait of a few minutes is tolerated, but certainly not appreciated. As applications grow larger and larger, these occasional waits become more "occursional" (add that one to your dictionaries), and the frequency of "slow response" complaints increases exponentially. The situation can rapidly deteriorate, and alternative software may be explored as a solution, all because, for some strange reason, MAPPER is blamed for any basically rotten egg put into its basket.

Being a self-contained, on-line, real-time database environment, MAPPER easily gives the impression that it "does it all" and is therefore directly responsible for everything done under its umbrella. But a bad run is a bad run, and that certainly isn't MAPPER's fault. Batch systems can mask inefficiency simply because they do most, if not all, of their processing at night, invisible to the user. Not many would bother to do the research to prove that the nightly update program could really run in two hours instead of five, if only there were enough time to re-write it...

But it isn't always faulty run code. Even the best of runs can have problems with a poorly designed database, and this is at the heart of a lot of response complaints. After all, if the data record I'm looking for is in a 500-line rid, it doesn't matter whether I use find or binary-find; it'll take about a second either way. But if I don't know the rid, and have to go hunt through a few hundred of them, you might have time to go refill your mug. Unfortunately, the boss has complained about the vast quantities of coffee everyone is consuming. It may be time to look at why response is being measured in minutes instead of seconds.

Let's back up to the beginning of the analysis phase for a minute and consider a series of questions that one might ask when first analyzing your customer's database needs. Take the following conversation, for example:

   : What... are... your... needs...?
   : What... are... needs...?

Believe me, it can happen. But after you overcome the initial impact of Robo-shock, you begin to establish a dialogue. This may sound extremely basic, but make sure that terms like 'enter', 'inquire', 'update', and 'delete' are completely understood. Of course, the first step is to identify all the required data fields. This is where I start playing not a 20-questions game, but usually at least a 200-questions game. What may end up being the most important questions of all are: What is the key? How do you need to get at the data?

There are two major system design categories here: either you can control the creation of the key, or you can't. If you can't control the creation of the key, it is usually because it is being fed to your system by external sources, or because the application already existed in some other form and the key technique is impossible to change.

The question that will probably determine how the database finally gets structured is this: Will the system be used primarily for on-line inquiry at the record level, for reporting bigger pictures on groups of data, or about equally for both? If a system leans very heavily toward one type of need versus the other, decisions become easier.

If record-level, on-line inquiry is what you want, no problem. We just figure out a way, given the level of control we have over the creation of the key, to spread the data so that the key will point to a rid of no more than a few hundred lines, up to about 1,000 lines as a maximum. Reports might only be created monthly, and then it doesn't matter that the entire database may be read a few times over, possibly taking hours to execute (there's that batch masking again).
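To make the idea of spreading concrete, here's a quick sketch of one way a numeric key can be hashed straight to the rid that should hold its record. It's written in Python purely to show the arithmetic; it is not run code, and the rid range and key value below are made up for illustration:

# Illustrative sketch only -- not MAPPER run code. Assumes a purely
# numeric key and a database spread across a fixed, hypothetical
# range of rids.

FIRST_RID = 100      # first rid of the database (made-up number)
RID_COUNT = 50       # spread the data across 50 rids

def rid_for_key(key: int) -> int:
    """Hash a numeric key to the rid that should hold its record."""
    return FIRST_RID + (key % RID_COUNT)

# The same key always lands in the same rid, so an inquiry run knows
# exactly which rid to search -- no hunting through hundreds of them.
print(rid_for_key(123456789))   # -> 139

Any scheme will do, as long as the same key always maps to the same rid and no rid is allowed to grow much past that thousand-line ceiling.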
For a heavily weighted reporting system, data organization would lean more toward the ability to directly display accumulated data. I call these accumulated-data rids report "grids". A system might contain dozens of grids, each representing a particular time or area slice of data. When the data records are entered, a process can be initiated that updates all the appropriate grids, maintaining running totals that are almost instantly available for user perusal. These reports can be in displayable format, or they might be pre-prepped for interfacing with the graphics runs.

If the process of updating all the report grids is a lengthy one, we have the option of passing data to a background run to do the actual work. This technique should only be used if the updates would take more than a few seconds to execute in the foreground, tying up the user's terminal for what can seem like an eternity. Should the process go on for minutes, however, batching the input and conducting periodic scheduled updates is the way to go. An update run can be scheduled to run once an hour, or whatever interval is most appropriate for management needs. If it's OK to live with data one day old, don't schedule the update run to execute every five minutes.

Something that works really well is to have graphic template grids set up with all the appropriate parameters pre-established. The update run adds numbers to the data lines in the template, and a menu run can then be used to select the grid and link to whatever graphics run is desired. From the menu selection to the display of the graph usually takes only two to five seconds.

Report grids, if well designed, can be rapidly combined and collapsed, so don't go to extremes when setting them up. Some reporting needs may require a few grids to be processed, but that can be far better than maintaining more grids than are absolutely necessary. It may sound as if grids are just piles of meaningless-looking numbers, to be un-snarled by some near-artificial-intelligence run that only a team of run designers can tackle. Not true. They can look very structured, with non-MAPPER-classic-looking headings, page breaks, etc. In fact, the more structured the better, when it comes to updating the numbers.

You can sometimes use an offset technique to "hash" the grid line number to be updated, thereby saving the need to find the right record first. For example, if I am accumulating numbers for any one of ten different regions (ingeniously numbered two through eleven), I can initialize a variable equal to the region number plus 4, giving me the exact line number to read, increment, and then write back again (see Figure 1). More grids can be added to the system as necessary, and the occasional ad hoc report will simply have to read the database.

----------------------------------------------------------------
                            FIGURE 1
                            --------
@. Some other run displayed a screen, solicited a region (V1)
@. and a quantity to add (V2), then started this run via @BR .
@CHG INPUT$ V1I2,V2I4
CHG V3I2 V1 +4                     \ v3 is the line # to update
LOK,M,T,R
RDL,M,T,R,V3 10-5 V4I5 INC,V2 V4   \ read # and increment
WRL,M,T,R,V3,Y 10-5 ],V4
ULK .                                write accumulated total
@. The lok, read, and write might be done for a few report grids
@GTO END .
----------------------------------------------------------------
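Since a grid can slice by time as well as by area, the same offset arithmetic extends to two dimensions. Here's the idea as a small Python sketch; again, this is logic only, not run code, and the layout (five heading lines, regions two through eleven, twelve monthly lines per region) is hypothetical:

# Illustrative Python sketch of the Figure 1 offset idea, extended to a
# hypothetical grid holding one line per region per month.
# Assumed layout: 5 heading lines, then regions 2-11, 12 lines each.

HEADING_LINES = 5        # lines 1-5 are titles and column headings
FIRST_REGION = 2         # regions are "ingeniously numbered" 2-11
LINES_PER_REGION = 12    # one line per month

def grid_line(region: int, month: int) -> int:
    """Return the grid line holding the running total for region/month."""
    return HEADING_LINES + (region - FIRST_REGION) * LINES_PER_REGION + month

# Region 3, June: the exact line to read, increment, and write back.
print(grid_line(3, 6))   # -> 23

As in Figure 1, the update never has to search for its target line; the line number falls straight out of the arithmetic.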
Let's talk a bit more about the other extreme: a system with lots of on-line inquiry and very little in the way of reporting needs. Again, the key is of utmost importance. If its creation is under your control, you can set up a key that points directly to a record by using the "MTRL" (sorry, I mean "CDRL") technique. It would be in the format MMMTRRRRLLLL, where "MMM" is the mode, "T" is the form type, "RRRR" is the rid, and "LLLL" is the record line number. Twelve characters sure makes for a big key, though. Maybe you don't need the mode, and can use the "TRL" technique instead. Nine digits is a key length I'm sure most of us with social security numbers can relate to.
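To make the pointer arithmetic concrete, here's a small Python sketch of packing and unpacking a TRL-style key. The field widths follow the format above, but the sample type, rid, and line are made up, and in practice this would of course be done with run variables rather than Python:

# Illustrative Python sketch of a TRL key (TRRRRLLLL) -- not run code.
# The key points straight at a record: form type, rid, and line number.

def pack_trl(form_type: str, rid: int, line: int) -> str:
    """Build the nine-character key that points directly at a record."""
    return f"{form_type}{rid:04d}{line:04d}"

def unpack_trl(key: str) -> tuple[str, int, int]:
    """Split a TRL key back into form type, rid, and line number."""
    return key[0], int(key[1:5]), int(key[5:9])

key = pack_trl("B", 123, 47)   # hypothetical record: type B, rid 123, line 47
print(key)                     # -> B01230047
print(unpack_trl(key))         # -> ('B', 123, 47)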
Using the line number in the key (or hashing it, for that matter) makes the database somewhat more sensitive to the addition or deletion of lines from a rid; I sure wouldn't allow that to be done by hand! Without the line number, the closest the key can get you is the rid, so the key must contain some other unique value to distinguish the record within that rid. If the database is spread properly, and no rid exceeds 500 to 1,000 lines, then a simple find will make a hit in about 5 I/Os. The need to maintain sorted key order for binary-find to work, and the associated overhead of sorting, usually precludes its use in my book. I use BFN mainly for large tables, where updating is minimal but reading is constant.

For massively large systems, where a form type or even an entire mode may be insufficient to store all the data, a combination of find and direct reads can be used. I will use a straight find for the "master" record, note the line number, and then read the same line number across various form types/modes to pull all the data together. After a "secondary" record is read, its key should be checked; if it is incorrect, a database pointer error has occurred. The update run must maintain the master and secondary rids in exactly the same order and size. If no more blank lines are available in the master rid, you had better add them to the bottom, and as you add lines to the master rid, all the secondary rids must be updated in the same manner. Don't run the risk of corrupting pointers even for a second. Even though it sounds sensitive, since all data is maintained via runs, problems should be extremely rare; I have had systems running with this key technique for years, with no pointer errors to date. The beauty of this type of organization is that inquiry is extremely fast, and reports only process relevant data, since there are no trailer lines in this type of structure.

Another approach is data paragraphing, using trailers for different record types. This technique uses fewer rids for storage, and inquiry is lightning quick, because all the data to be displayed is in a contiguous block of records. The downside is that reporting on data that lives exclusively on trailer lines can be somewhat painful, and many data lines are skipped over during processing, wasting I/O. Of course, there's also the one-header/multiple-record-type syndrome to deal with.

In reality, you will find that most systems balance out somewhere near the middle, with roughly equal amounts of on-line versus reporting needs.

When everything has to be available within milliseconds, a combination of transaction record access tricks and report grids can give you direct and immediate access to reports, graphs, and raw data. After all, isn't that what a good system is all about?