I'm creating a pseudo-random data set for application users to train on.
I'm surprised that if I seed the RAND() function with 1, 2, 3, etc., I get very nearly the same result from the seeded function. However, this seems to be followed by "properly random" but repeatable values when a seed is not supplied.
SELECT RAND(1) AS R1A, RAND() AS R1B, RAND(2) AS R2A, RAND() AS R2B, RAND(3) AS R3A, RAND() AS R3B, RAND(4) AS R4A, RAND() AS R4B 0.713591993212924 0.472241415009636 0.713610626184182 0.217821139260039 0.71362925915544 0.963400850719992 0.713647892126698 0.708980575436056
At first sight, it appears that I can evaluate RAND(@seed) and discard the result, then evaluate RAND() to get several really "random" numbers for my training data - so far I've planned to use four per record; I might need some more.
Is that plan going to work properly? And, what am I looking at, here? And, should it be in the documentation? I haven't found it.
Documentation says this, which may be meant as a clue:
The RAND function is a pseudorandom number generator that operates in a manner similar to the C run-time library rand function. If no seed is provided, the system generates its own variable seed numbers.
Does the rand function in C produce similar output for similar seed input?
I think the documentation could also state more clearly that RAND(@number) followed by RAND() always generates the same numbers. But that is what I wanted and what any experienced computer programmer will expect.
I suppose I could fill a table with random data keys obtained from https://www.random.org/ to use for the purpose - but that has drawbacks.
I have the following conclusions about RAND() and for now I think I'll continue with it, but keeping alternatives in mind.
RAND(@int) sets the seed of the random number generator using the integer value given, and returns a float result which is not statistically independent, inasmuch as RAND(@int) and RAND(@int+1) produce almost exactly the same result.
Of course, RAND(@int) always produces the same result.
RAND(-@int) and RAND(@int) produce the same result.
RAND(0) is an exception: there may be other exceptions. RAND(0) always produces the result 0.943597390424144, but it is not similar to the result of RAND(1).
RAND() called n times after RAND(@int) always produces the same n numbers. If we call the nth number "rand(@int, @n)" -
CREATE PROCEDURE sproc_rand(@seed int, @nth int, @rand float OUTPUT) AS SET @rand = RAND(@seed); WHILE ( @nth > 0 ) BEGIN SET @rand = RAND(); SET @nth = @nth - 1; END
The difference between rand(@int, @n) and rand(@int+1, @n) "modulo 1" -
(1.0 + rand(@int, @n) - rand(@int+1, @n) % 1
Is a constant or nearly constant; for @n = 1 it is approximately 0.75. For @n = 5 it is 0.991. For @n = 6 it is 0.91. For @n = 100 it is 0.83.
So, no, these are not good "random" numbers when a simple incremented seed is used - although they bounce around quite well for @n IN (1, 2, 3, 4).
What I'm now considering as a mitigation is using not RAND(row_id) but,
RAND(row_id * @factor_1 + @factor_2)
where @factor1 and @factor2 are constant terms and @factor_1 is approximately 10,000. And for different projects, different factors.
So far, if I want repeatable results, the other method available (besides the Multiplicative Inverse thing, which I haven't quite understood yet) is to create a table and fill it with non-repeating random numbers from one of the other methods, then draw the numbers from that table, when required.
The only reason to ever provide a "seed" for
RAND() is if you want the exact same sequence of (pseudo) random values for testing purposes.
Just call it without a seed, and that should be good enough for most purposes.
If you do need a predictable sequence, and you're simply unhappy with the starting (random) value,
RAND() takes seeds from -2,147,483,648 all the way up to 2,147,483,647 (full range of int), so try some seeds that differ by more than 1 or 2.
Finally, if you were on SQL 2008 or better, you'd also have the option of considering CRYPT_GEN_RANDOM instead (warning, it returns a varbinary, not a float, so you'd likely need to change your code a bit): see this article for some differences and discussion.
SELECT CRYPT_GEN_RANDOM (4, 0x01020304)SQL doc page doesn't explain that but Windows doc does: "If an application has access to a good random source, it can fill the pbBuffer buffer with some random data before calling CryptGenRandom. The CSP then uses this data to further randomize its internal seed." It occurs to me this may be an intentionally garbled way of saying "one-time pad" instead of "seed". Only the CIA know for sure. :-) — Jun 12, 2017 at 09:22
External links referenced by this document: