One of the problems when sampling data out of raw files is that there may be consistency constraints on the picked lines.
For example, if you would like to extract representative sales slips from TLOG texts, you want all the positions belonging to one sales slip to be sampled as a whole, and you want that property to hold consistently across all processes/machines onto which the texts and the sampling have been distributed.
Conventionally seeded pseudo-random sequences will not work as expected here, because each line simply draws the next value of the sequence, regardless of which sales slip it belongs to. You would first have to aggregate the sales slip headers, sample at the level of headers and then join the resulting sample back with the original TLOG data.
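To make the cost of that conventional route concrete, here is a minimal Python sketch of the aggregate, sample and join steps. The function name `two_pass_sample`, the `key_of` callback and the `slip|position` line layout in the usage note are assumptions made for illustration; they are not part of any real TLOG format.

```python
import random

def two_pass_sample(lines, rate, key_of, seed=42):
    """Sample whole groups of lines by first sampling their keys.

    `lines` must be a list (it is traversed twice), `key_of` is a
    hypothetical callback that extracts the sales slip key from a line.
    """
    # Pass 1: aggregate the distinct keys (the "headers").
    keys = sorted({key_of(line) for line in lines})

    # Sample on the level of keys with an ordinary seeded generator.
    rng = random.Random(seed)
    sampled_keys = {key for key in keys if rng.random() < rate}

    # Pass 2: join the sampled keys back with the original lines.
    return [line for line in lines if key_of(line) in sampled_keys]
```

Called as `two_pass_sample(lines, 0.5, lambda l: l.split("|")[0])`, this keeps each slip intact, but it needs the complete key set in one place before any line can be emitted, which is exactly what becomes awkward once the data is distributed.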
A nice idea that circumvents that necessity with only minimal overhead is inspired by Chuck Lam’s method of simply constructing Bloom-filter hash functions. For each line, the pseudo-random generator is seeded with the hash code of a given key/object. Then the value at a fixed position of the resulting random sequence (the new “seed”) is read as the observed random value.
In our TLOG case, all the lines carrying the identical sales slip key get the same random boolean computed and are therefore all filtered out or all passed through alike.
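The key-seeded sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `keep` and `sample_lines`, the `key_of` callback and the use of CRC32 as the hash code are choices made here, not something prescribed by the method.

```python
import random
import zlib

def keep(key, rate, position=0):
    """Decide deterministically whether all lines with `key` are sampled.

    `position` is the fixed index in the random sequence that is read;
    it plays the role an ordinary seed would otherwise play.
    """
    # Assumption: CRC32 serves as a hash code that is stable across
    # processes and machines.
    seed = zlib.crc32(key.encode("utf-8"))
    rng = random.Random(seed)

    # Skip ahead to the fixed position of the sequence.
    for _ in range(position):
        rng.random()

    # The value at that position is the observed random value.
    return rng.random() < rate

def sample_lines(lines, rate, key_of, position=0):
    """Single-pass filter; `key_of` is a hypothetical key extractor."""
    return [line for line in lines if keep(key_of(line), rate, position)]
```

Because the decision depends only on the key, the rate and the position, every worker reaches the same verdict for a given sales slip without any coordination. Note that Python's built-in `hash()` is salted per process for strings by default, so it would not give that guarantee; this is why the sketch uses CRC32 instead. Changing `position` yields a different, but again consistent, sample.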