
Split a text into words with regex, language unknown

Hello all. I want to split a text into words and tried to use Joe Strout’s SplitByRegEx for it.
The function is:

Protected Function SplitByRegEx(source As String, delimPattern As String) As String()
  // Split a string into fields delimited by a regular expression.
  
  Dim out(-1) As String
  
  Dim re As New RegEx
  Dim rm As RegExMatch
  Dim startPos As Integer
  
  re.SearchPattern = delimPattern
  rm = re.Search( source )
  while rm <> nil
    'System.DebugLog rm.SubExpressionString(0) + " at " + str(rm.SubExpressionStartB(0)) + " matches " + delimPattern + " in " + source
    out.Append MidB( source, startPos + 1, rm.SubExpressionStartB(0) - startPos )
    startPos = re.SearchStartPosition
    rm = re.Search
  wend
  
  if startPos < source.LenB then
    out.Append MidB( source, startPos + 1 )
  end if
  
  return out
  
End Function

The regex works correctly for most languages, but for some languages it splits words on certain characters. The trouble is that the source text can be in any language.
For example, the Polish word “pieścidełko” is split into:

pie
cide
ko

I tried “\W” as the pattern, as well as “[^'[:alpha:]]+”, with and without a backslash to escape the quote.
Every pattern breaks the word at the same places.
But when I use String.ToArray(" ") it works and doesn’t split the word. The text, however, can also have commas, periods, question marks, etc. as separators.
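
For reference, here is a minimal sketch of how the problem can be reproduced. It assumes the SplitByRegEx function above is reachable from the calling code (it is declared Protected, so the call has to live in the same class or module). The variable names are only illustrative.

```xojo
// Reproduction sketch; assumes SplitByRegEx (above) is callable from here.
// The two patterns are the ones mentioned in this post.
Dim patterns() As String = Array("\W", "[^'[:alpha:]]+")

For Each pattern As String In patterns
  // Split the sample word and log the resulting fragments on one line.
  Dim words() As String = SplitByRegEx("pieścidełko", pattern)
  System.DebugLog pattern + " -> " + Join(words, " | ")
Next
```

With both patterns I get the three fragments listed above instead of the single word.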

When I try the same in PHP using mb_split, it works correctly, even without specifying the encoding.

Is there something obvious that I don’t see, or should I loop through the text and call ToArray for every single space, comma, period, etc.?

Is there anyone who could help with this? It’s driving me nuts.
Thank you.
