Visual Basic for Applications/Pseudo Random Repeated Substrings
Appearance
Summary
[edit | edit source]This page describes some matters that apply to the Rnd() function of VBA. In particular it illustrates that repeated substrings can result when the Randomize() function is wrongly positioned inside the same loop as Rnd(), instead of before it.
The VBA Rnd() Function
[edit | edit source]- The Rnd() function is pseudo random, as opposed to true random . True randomness is found rarely, one notable example being the sequence of data that can be derived from white noise. White noise, like the radio noise from the sun or perhaps the unintentional noise in a radio or other electronic device, has a fairly uniform distribution of frequencies, and can be exploited to produce random distributions of data; also known as linear probability distributions because their frequency distributions are straight lines parallel to the horizontal axis.
- Pseudo randomness can be obtained with a feedback algorithm, where a sequence of output values of a function is fed back and assists in the making of the next part of an output stream. These are referred to as pseudo random number generators (PRNG). Such a process, although complex, is nonetheless determinate, being based entirely on its starting value. Such generators, depending on their design can produce long sequences of values, all unique, before the entire stream eventually repeats itself.
- A PRNG output stream will always repeat itself if a long enough set of values is generated. The Rnd function in VBA can generate a sequence of up to 16,777,216 numbers before any one number is repeated, at which time the entire sequence itself is repeated. This is adequate in most cases. The Rnd() function has been described by Microsoft as belonging to the set of PRNGs called Linear Congruential Generators (LCG), though it is unclear as to whether or not the algorithm has since been modified.
- The Rnd function is not suitable for large tables or for cryptographic use, and VBA itself is quite insecure in its own right. For given starting values the generator always will produce the same sequence. Clearly, if any part of the stream is known, this allows other values in the sequence to be predicted, and this state of affairs is insecure for cryptographic use. Perhaps surprisingly, modeling methods that make much use of random values need even longer unique sequences than that produced by Rnd().
- The exact coding of the Microsoft Rnd() function is not available, and their descriptive material for it is quite sketchy. A recent attempt by me to implement the assumed algorithm in VBA code failed because of overflow, so those who intend to study such generators in VBA need to use another algorithm. Perhaps study of the Wichmann-Hill (1982) CLCG algorithm, that can be implemented in VBA would be a better choice. A VBA implementation, (by others), of the Wichmann-Hill (1982) algorithm is provided in another page of this series, along with some simpler generator examples.
Worst Case for Rnd() Substrings?
[edit | edit source]- A well designed PRNG stream consists of unique numbers, but this applies only to the designer's unfiltered set of numbers in the range from zero to unity, [0,1]. As soon as we start to take some values from the stream, and ignore others, say to make custom outputs, the new steams will take on different characteristics. The combination of cherry-picking elements from the natural sequence and the mapping of a large set to a very small set takes its toll. When observing the new set, the count of characters to the cycle repeat point shortens, and the number of repeated substrings increases throughout the set.
- The code listing below allows checking of a Rnd() stream for substrings, using preset filter settings, eg; capitals, lower case, integers, etc., and in addition, includes a similar generator based on a hash for those who would like to compare it.
- The repeat substring procedure is quite slow, depending as it does on the location of repeats. The worst case is for no repeats found where the number of cycles becomes maximum at (0.5*n)^2, the square of half the number of characters in the test string. Of course the smallest number of cycles is just one when a simple string is repeated, eg; abcabc. Clearly, an increase of string length by a factor of ten could increase the run time by a factor of one hundred. (1000 characters in about 2 seconds, 2000 in 4secs, 10000 in 200secs, is, so far, the best timing!).
- Coding layout can affect the length of repeated substrings too. The reader might compare the effect of placing the Randomize function outside the random number loop, then inside the loop, while making an output of only capitals. (See code in MakeLongRndStr). In the past the repeat strings have worsened considerably when placed within. The code as listed here to test Rnd() with 1000 samples of capitals, no use of DoEvents, and Randomize wrongly placed inside the loop, will return a repeat substring of up to four hundred characters for this author. Increasing the code lines in the loop, affecting the time (?) for each iteration of the loop to run, also affects the length of any substrings.
Option Explicit
Sub TestRndForRepeats()
'run this to make a pseudo random string
'and to test it for longest repeated substring
Dim strRnd As String, sOut As String
Dim nOut As Long, nLen As Long
strRnd = MakeLongRndStr(1000)
MsgBox strRnd,, "Long string..."
sOut = LongestRepeatSubstring(strRnd, nOut)
MsgBox "Repeated substring found is : " & _
vbCrLf & sOut & vbCrLf & _
"Number of these found : " & nOut & vbCrLf & _
"Length of each : " & Len(sOut),, "Repeat substring..."
End Sub
Sub TestHashForRepeats()
'run this to make a long hash-based output
'and to test it for longest repeated substring
Dim sOut As String, sHash As String, nOut As Long
sHash = LongHash("String to hash", 1000)
MsgBox "The following sha256-based hash has " & _
Len(sHash) & " characters." & _
vbCrLf & vbCrLf & sHash,, "Long hash..."
sOut = LongestRepeatSubstring(sHash, nOut)
MsgBox "Repeated substring found is : " & _
vbCrLf & sOut & vbCrLf & _
"Number of these found : " & nOut & vbCrLf & _
"Length of each : " & Len(sOut),, "Repeat substring..."
End Sub
Function MakeLongRndStr(nNumChr As Long) As String
'Makes a long capital letters string using rnd VBA function
Dim n As Long, sChr As String, nAsc As Long
Dim nSamp As Long, sRec As String
'========================================================================
' Notes and Conclusions:
' The VBA function rnd is UNSUITED to generation of long random strings.
' Both length and number of repeats increases rapidly near 256 charas.
' Reasonable results can be obtained by keeping below 128 characters.
' For longer strings, consider hash-based methods of generation.
'========================================================================
'Randomize 'right place
Do Until n >= nNumChr
'DoEvents
Randomize 'wrong place
nSamp = Int((122 - 48 + 1) * Rnd + 48) 'range includes all charas
sChr = Chr(nSamp)
'cherry-picks 10, 26, 36, 52, or 62 from a set of 75
Select Case nSamp 'chara filter
Case 65 To 90 'upper case letters
sRec = sRec & sChr
Case 48 To 57 'integers
'sRec = sRec & sChr
Case 97 To 122 'lower case letters
'sRec = sRec & sChr
Case Else
'disregard
End Select
n = Len(sRec)
Loop
'MsgBox sAccum
MakeLongRndStr = Left$(sRec, nNumChr)
End Function
Function LongHash(sIn As String, nReq As Long, Optional sSeed As String = "") As String
'makes a long sha256 hash - length specified by user
'Parameters: sIn; the string to hash
'nReq; the length of output needed
'sSeed; optional added string modifier
Dim n As Long, m As Long, c As Long, nAsc As Integer, sChr As String
Dim sF As String, sHash As String, sRec As String, sAccum As String
Do Until m >= nReq
DoEvents
n = n + 1 'increment
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
'you set your own cycle increment here
sF = sIn & sSeed & sAccum & (7 * n * m / 3)
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
'get a single hash of sF
sHash = HashSHA256(sF)
'filter output for chara type
For c = 1 To Len(sHash)
sChr = Mid$(sHash, c, 1)
nAsc = Asc(sChr)
'cherry-picks 10, 26, 36 ,52, or 62 from a set of 64
Select Case nAsc 'chara filter
Case 65 To 90 'upper case letters
sRec = sRec & sChr
Case 48 To 57 'integers
'sRec = sRec & sChr
Case 97 To 122 'lower case letters
'sRec = sRec & sChr
Case Else
'disregard
End Select
Next c
'accumulate
sAccum = sAccum & sRec
m = Len(sAccum)
sRec = "" 'delete line at your peril!
Loop
LongHash = Left$(sAccum, nReq)
End Function
Function HashSHA256(sIn As String) As String
'Set a reference to mscorlib 4.0 64-bit
'HASHES sIn string using SHA2-256 algorithm
'NOTE
'total 88 output text charas of base 64
'Standard empty string input gives : 47DEQpj8HBSa+/...etc,
Dim oT As Object, oSHA256 As Object
Dim TextToHash() As Byte, bytes() As Byte
Set oT = CreateObject("System.Text.UTF8Encoding")
Set oSHA256 = CreateObject("System.Security.Cryptography.SHA256Managed")
TextToHash = oT.GetBytes_4(sIn)
bytes = oSHA256.ComputeHash_2((TextToHash))
HashSHA256 = ConvB64(bytes)
Set oT = Nothing
Set oSHA256 = Nothing
End Function
Function ConvB64(vIn As Variant) As Variant
'used to produce a base-64 output
'Set a reference to mscorlib 4.0 64-bit
Dim oD As Object
Set oD = CreateObject("MSXML2.DOMDocument")
With oD
.LoadXML "<root />"
.DocumentElement.DataType = "bin.base64"
.DocumentElement.nodeTypedValue = vIn
End With
ConvB64 = Replace(oD.DocumentElement.Text, vbLf, "")
Set oD = Nothing
End Function
Function LongestRepeatSubstring(sIn As String, Optional nSS As Long) As String
'finds longest repeated non-overlapping substring (in function name) and number of repeats (in nSS)
'greatest number cycles = (0.5*n)^2 for when "none found", eg; abcdef (9)
'shortest number cycles = 1 for one simple duplicated string; eg abcabc
Dim s1 As String, s2 As String, X As Long
Dim sPrev As String, nPrev As Long, nLPrev As Long
Dim nL As Long, nTrial As Long, nPos As Long, vAr As Variant
nL = Len(sIn)
For nTrial = Int(nL / 2) To 1 Step -1
DoEvents
For nPos = 1 To (nL - (2 * nTrial) + 1)
X = 0
s1 = Mid(sIn, nPos, nTrial)
s2 = Right(sIn, (nL - nPos - nTrial + 1))
vAr = Split(s2, s1)
X = UBound(vAr) - LBound(vAr)
If X > 0 Then
If nPrev < X Then
sPrev = s1
nPrev = X
End If
End If
Next nPos
If nPrev <> 0 Then
LongestRepeatSubstring = sPrev
nSS = nPrev
Exit Function
End If
Next nTrial
End Function