Make Mine MAPPER #9
-------------------
by Rob Haeuser

Database Design: the Secret to Success
---------------------------------------------------------------

For smaller applications, an inefficient database design may hardly be noticeable to the few people using it. An occasional wait of a few minutes is tolerated, but certainly not appreciated. As applications grow larger and larger, these occasional waits become more "occursional" (add that one to your dictionaries), and the frequency of "slow response" complaints increases exponentially. The situation can rapidly deteriorate, and alternative software may be explored as a solution, all because, for some strange reason, MAPPER is blamed for any basically rotten egg put into its basket.

Being a self-contained, on-line, real-time database environment, MAPPER easily gives the impression that it "does it all" and is therefore directly responsible for everything done under its umbrella. But a bad run is a bad run, and that certainly isn't MAPPER's fault. Batch systems can mask inefficiency simply because they do most, if not all, of their processing at night, invisible to the user. Not many would bother to do the research to prove that the nightly update program could really run in two hours instead of five, if only there were enough time to re-write it...

But it isn't always faulty run code. Even the best of runs can have problems with a poorly designed database, and this is at the heart of a lot of response complaints. After all, if the data record I'm looking for is in a 500-line rid, it doesn't matter whether I use find or binary-find; it'll take about a second either way. But if I don't know the rid, and have to go hunt through a few hundred of them, you might have time to go refill your mug. Unfortunately, the boss has complained about the vast quantities of coffee everyone is consuming. It may be time to look at why response is being measured in minutes instead of seconds.

Let's back up to the beginning of the analysis phase for a minute and consider a series of questions that one might ask when first analyzing your customer's database needs. Take the following conversation, for example:

   : What... are... your... needs...?
   : What... are... needs...?

Believe me, it can happen. But after you overcome the initial impact of Robo-shock, you begin to establish a dialogue. This may sound extremely basic, but make sure that terms like 'enter', 'inquire', 'update', and 'delete' are completely understood. Of course, the first step is to identify all the required data fields. This is where I start playing not a 20-questions game, but usually at least a 200-questions game. What may end up being the most important questions of all are: What is the key? How do you need to get at the data?

There are two major system design categories here: either you can control the creation of the key, or you can't. If you can't control the creation of the key, it is usually because it is being fed to your system by external sources, or because the application already existed in some other form and the key technique is impossible to change.

The question that will probably determine how the database finally gets structured is this: Will the system be used primarily for on-line inquiry at the record level, for reporting bigger pictures on groups of data, or about equally for both? If a system leans very heavily toward one type of need versus the other, decisions become easier.

If record-level, on-line inquiry is what you want, no problem. We just figure out a way, given the level of control we have over the creation of the key, to spread the data so that the key will point to a rid of no more than a few hundred lines, up to about 1,000 lines as a maximum. Reports might only be created monthly, and then it doesn't matter that the entire database may be read a few times over, possibly taking hours to execute (there's that batch masking again).
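To make the idea of spreading concrete, here's a quick sketch of one way a numeric key can be hashed straight to the rid that should hold its record. It's written in Python purely to show the arithmetic; it is not run code, and the rid range and key value below are made up for illustration:

# Illustrative sketch only -- not MAPPER run code. Assumes a purely
# numeric key and a database spread across a fixed, hypothetical
# range of rids.

FIRST_RID = 100      # first rid of the database (made-up number)
RID_COUNT = 50       # spread the data across 50 rids

def rid_for_key(key: int) -> int:
    """Hash a numeric key to the rid that should hold its record."""
    return FIRST_RID + (key % RID_COUNT)

# The same key always lands in the same rid, so an inquiry run knows
# exactly which rid to search -- no hunting through hundreds of them.
print(rid_for_key(123456789))   # -> 139

Any scheme will do, as long as the same key always maps to the same rid and no rid is allowed to grow much past that thousand-line ceiling.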
For a heavily weighted reporting system, data organization would lean more toward the ability to directly display accumulated data. I call these accumulated-data rids report "grids". A system might contain dozens of grids, each representing a particular time or area slice of data. When the data records are entered, a process can be initiated that updates all the appropriate grids, maintaining running totals that are almost instantly available for user perusal. These reports can be in displayable format, or they might be pre-prepped for interfacing with the graphics runs.

If the process of updating all the report grids is a lengthy one, we have the option of passing data to a background run to do the actual work. This technique should only be used if the updates would take more than a few seconds to execute in the foreground, tying up the user's terminal for what can seem like an eternity. Should the process go on for minutes, however, batching the input and conducting periodic scheduled updates is the way to go. An update run can be scheduled to run once an hour, or whatever interval is most appropriate for management needs. If it's OK to live with data one day old, don't schedule the update run to execute every five minutes.

Something that works really well is to have graphic template grids set up with all the appropriate parameters pre-established. The update run adds numbers to the data lines in the template, and a menu run can then be used to select the grid and link to whatever graphics run is desired. From the menu selection to the display of the graph usually takes only two to five seconds.

Report grids, if well designed, can be rapidly combined and collapsed, so don't go to extremes when setting them up. Some reporting needs may require a few grids to be processed, but that can be far better than maintaining more grids than are absolutely necessary. It may sound as if grids are just piles of meaningless-looking numbers, to be un-snarled by some near-artificial-intelligence run that only a team of run designers can tackle. Not true. They can look very structured, with non-MAPPER-classic-looking headings, page breaks, etc. In fact, the more structured the better, when it comes to updating the numbers.

You can sometimes use an offset technique to "hash" the grid line number to be updated, thereby saving the need to find the right record first. For example, if I am accumulating numbers for any one of ten different regions (ingeniously numbered two through eleven), I can initialize a variable equal to the region number plus 4, giving me the exact line number to read, increment, and then write back again (see Figure 1). More grids can be added to the system as necessary, and the occasional ad hoc report will simply have to read the database.

----------------------------------------------------------------
                            FIGURE 1
                            --------
@. Some other run displayed a screen, solicited a region (V1)
@. and a quantity to add (V2), then started this run via @BR .
@CHG INPUT$ V1I2,V2I4
CHG V3I2 V1 +4                     \ v3 is the line # to update
LOK,M,T,R
RDL,M,T,R,V3 10-5 V4I5 INC,V2 V4   \ read # and increment
WRL,M,T,R,V3,Y 10-5 ],V4
ULK .                                write accumulated total
@. The lok, read, and write might be done for a few report grids
@GTO END .
----------------------------------------------------------------
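Since a grid can slice by time as well as by area, the same offset arithmetic extends to two dimensions. Here's the idea as a small Python sketch; again, this is logic only, not run code, and the layout (five heading lines, regions two through eleven, twelve monthly lines per region) is hypothetical:

# Illustrative Python sketch of the Figure 1 offset idea, extended to a
# hypothetical grid holding one line per region per month.
# Assumed layout: 5 heading lines, then regions 2-11, 12 lines each.

HEADING_LINES = 5        # lines 1-5 are titles and column headings
FIRST_REGION = 2         # regions are "ingeniously numbered" 2-11
LINES_PER_REGION = 12    # one line per month

def grid_line(region: int, month: int) -> int:
    """Return the grid line holding the running total for region/month."""
    return HEADING_LINES + (region - FIRST_REGION) * LINES_PER_REGION + month

# Region 3, June: the exact line to read, increment, and write back.
print(grid_line(3, 6))   # -> 23

As in Figure 1, the update never has to search for its target line; the line number falls straight out of the arithmetic.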
Let's talk a bit more about the other extreme: a system with lots of on-line inquiry and very little in the way of reporting needs. Again, the key is of utmost importance. If its creation is under your control, you can set up a key that points directly to a record by using the "MTRL" (sorry, I mean "CDRL") technique. It would be in the format MMMTRRRRLLLL, where "MMM" is the mode, "T" is the form type, "RRRR" is the rid, and "LLLL" is the record line number. Twelve characters sure makes for a big key, though. Maybe you don't need the mode, and can use the "TRL" technique instead. Nine digits is a key length I'm sure most of us with social security numbers can relate to.
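To make the pointer arithmetic concrete, here's a small Python sketch of packing and unpacking a TRL-style key. The field widths follow the format above, but the sample type, rid, and line are made up, and in practice this would of course be done with run variables rather than Python:

# Illustrative Python sketch of a TRL key (TRRRRLLLL) -- not run code.
# The key points straight at a record: form type, rid, and line number.

def pack_trl(form_type: str, rid: int, line: int) -> str:
    """Build the nine-character key that points directly at a record."""
    return f"{form_type}{rid:04d}{line:04d}"

def unpack_trl(key: str) -> tuple[str, int, int]:
    """Split a TRL key back into form type, rid, and line number."""
    return key[0], int(key[1:5]), int(key[5:9])

key = pack_trl("B", 123, 47)   # hypothetical record: type B, rid 123, line 47
print(key)                     # -> B01230047
print(unpack_trl(key))         # -> ('B', 123, 47)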
Using the line number in the key (or hashing it, for that matter) makes the database somewhat more sensitive to the addition or deletion of lines from a rid; I sure wouldn't allow that to be done by hand! Without the line number, the closest the key can get you is the rid, so the key must contain some other unique value to distinguish the record within that rid. If the database is spread properly, and no rid exceeds 500 to 1,000 lines, then a simple find will make a hit in about 5 I/Os. The need to maintain sorted key order for binary-find to work, and the associated overhead of sorting, usually precludes its use in my book. I use BFN mainly for large tables, where updating is minimal but reading is constant.

For massively large systems, where a form type or even an entire mode may be insufficient to store all the data, a combination of find and direct reads can be used. I will use a straight find for the "master" record, note the line number, and then read the same line number across various form types/modes to pull all the data together. After a "secondary" record is read, its key should be checked; if it is incorrect, a database pointer error has occurred. The update run must maintain the master and secondary rids in exactly the same order and size. If no more blank lines are available in the master rid, you had better add them to the bottom, and as you add lines to the master rid, all the secondary rids must be updated in the same manner. Don't run the risk of corrupting pointers even for a second. Even though it sounds sensitive, since all data is maintained via runs, problems should be extremely rare; I have had systems running with this key technique for years, with no pointer errors to date. The beauty of this type of organization is that inquiry is extremely fast, and reports only process relevant data, since there are no trailer lines in this type of structure.

Another approach is data paragraphing, using trailers for different record types. This technique uses fewer rids for storage, and inquiry is lightning quick, because all the data to be displayed is in a contiguous block of records. The downside is that reporting on data that lives exclusively on trailer lines can be somewhat painful, and many data lines are skipped over during processing, wasting I/O. Of course, there's also the one-header/multiple-record-type syndrome to deal with.

In reality, you will find that most systems balance out somewhere near the middle, with roughly equal amounts of on-line versus reporting needs.

When everything has to be available within milliseconds, a combination of transaction record access tricks and report grids can give you direct and immediate access to reports, graphs, and raw data. After all, isn't that what a good system is all about?