Sunday, May 22, 2016

Quick! Get me member (Q47MMT1)!

Imagine a partitioned dataset with 8,000 members (or more).  This is getting into the range where finding the directory entry for a specific member is becoming a real chore and is chewing up cycles.  I heard of an imaginative way to speed up the process.

Define the partitioned dataset as a Generation Data Group and make the group large enough that, when the dataset is split, searching the directory of each is less of a chore (it will be even if only because each fragment of the whole is smaller).  Let's say, for the sake of argument, that we break it into 27 generations, one for each letter of the alphabet plus a catch-all.  Now copy all the members beginning with non-alphabetics into generation #1, all the "A"s into #2, all the "B"s into #3, etc.  When you access the group via its base-name (without specifying a generation) you get them all concatenated in 27-26-25...3-2-1 order.

When you look for member 'Q47MMT1', the directory of generation #27 is scanned, but member names are always in alphabetical order and this directory starts with 'Z...'.  That's not it; skip to G0026V00.  Its first entry starts with 'Y...'.  Nope.  G0025V00 starts with 'X...', G0024V00 starts with 'W...', G0023V00 starts with 'V...', G0022V00 starts with 'U...', G0021V00 starts with 'T...', G0020V00 starts with 'S...', G0019V00 starts with 'R...', G0018V00 starts with 'Q...'.  Got it!  You quickly find the required member and processing continues.  What's happening here is that instead of searching through 8,000+ directory entries and finding what you seek in (on average) 4000-or-so lookups, you looked at (on average) 13 + ~150 (8000 / 27 / 2).  As the original partitioned dataset gathers more members, this comparison gets more stark.  At some point, the comparison is so stark that someone will wonder if the quicker method failed because it just couldn't complete that fast.

No comments:

Post a Comment