Hello all. I want to split a text into words and tried to use Joe Strout’s SplitByRegEx for it.
The function is:
Protected Function SplitByRegEx(source As String, delimPattern As String) As String()
  // Split a string into fields delimited by a regular expression.
  // All offsets below are byte offsets (SubExpressionStartB, MidB, LenB).
  Dim out(-1) As String
  Dim re As New RegEx
  Dim rm As RegExMatch
  Dim startPos As Integer
  re.SearchPattern = delimPattern
  rm = re.Search( source )
  while rm <> nil
    'System.DebugLog rm.SubExpressionString(0) + " at " + str(rm.SubExpressionStartB(0)) + " matches " + delimPattern + " in " + source
    // Append the field between the previous delimiter and this one.
    out.Append MidB( source, startPos + 1, rm.SubExpressionStartB(0) - startPos )
    startPos = re.SearchStartPosition
    rm = re.Search
  wend
  // Append whatever remains after the last delimiter.
  if startPos < source.LenB then
    out.Append MidB( source, startPos + 1 )
  end if
  return out
End Function
The regex works correctly for most languages, but for some languages it splits words on certain characters. The trouble is that the source text can be in any language.
For example, the Polish word “pieścidełko” breaks as:
pie
cide
ko
I tried “\W” as the pattern, as well as “[^'[:alpha:]]+”, with and without a “\” to escape the quote.
Every pattern breaks at the same place.
But when I use String.ToArray(" ") it works and doesn’t split the word. The text, however, can also have commas, periods, question marks, etc. as separators.
When I try the same in PHP using mb_split, it works correctly, even without specifying the encoding.
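One possible explanation (an assumption, not verified against Xojo's RegEx class) is that “\W” and “[:alpha:]” are being evaluated with ASCII-only character classes, so accented letters such as “ś” and “ł” count as non-word characters and become delimiters. The sketch below is Python rather than Xojo, used only because its re module can switch between ASCII-only and Unicode-aware classes with the re.ASCII flag; it reproduces exactly the pie / cide / ko split shown above.

```python
import re

# "pieścidełko", written with escapes so ś and ł are single precomposed code points.
word = "pie\u015bcide\u0142ko"

# ASCII-only classes: \W matches anything outside [A-Za-z0-9_],
# so the two non-ASCII letters act as delimiters.
ascii_parts = re.split(r"\W+", word, flags=re.ASCII)

# Unicode-aware classes (the default for str patterns in Python 3):
# ś and ł are word characters, so nothing is split.
unicode_parts = re.split(r"\W+", word)

print(ascii_parts)    # ['pie', 'cide', 'ko']
print(unicode_parts)  # ['pieścidełko']
```

If this is indeed the cause, a Unicode letter property such as “\p{L}” in the delimiter pattern may be worth trying, assuming the regex engine in use supports Unicode properties.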
Is there something obvious that I don’t see, or should I loop through the text and use ToArray on every single space, comma, period, etc.?
Is there anyone who could help with this? It’s driving me nuts.
Thank you.