Channel: Xojo Programming Forum - Latest topics

Help with unicode-friendly regex whole word searches

My app has a function that changes the case of a string to Title Case (every word begins with a capital letter). It has a user-configurable stop list that contains words whose case is not to be changed (e.g. DNA). I use regex to search for such words in the string and ensure that they appear exactly as in the stop list. A user has reported an edge case where mistakes can occur when a word has accented Unicode characters. I can see where the problem lies, but I can’t come up with a solution. I’m using a search pattern that I’m pretty sure @Kem_Tekinay helped me with:

(?<!\w)\Q" + stopWord + "\E(?!\w)

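The intent is that only whole words will be found, and the characters between \Q and \E (the stop word) are treated as literals. The problem is that \w only matches ASCII word characters here, so an accented letter such as é is not treated as part of a word and the lookarounds see a word boundary next to it. This is a real-life example that fails: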

stopWord = CE
string = Gréce

The search identifies the “ce” after the é as a match, and the output becomes

GréCE

If the é is changed to e the output is correct (Grece).
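For anyone who wants to reproduce this outside Xojo, here is a minimal sketch in Python (not Xojo). The function name apply_stop_word and its ascii_only parameter are made up for illustration. Python has no \Q...\E, so re.escape stands in for it, and the re.ASCII flag is used to mimic an ASCII-only \w. With a Unicode-aware \w the same lookarounds leave Gréce alone. I assume the PCRE equivalent would be to put (*UCP) at the start of the pattern, or to replace \w with something like [\p{L}\p{N}_], but I have not confirmed that in Xojo.

```python
import re


def apply_stop_word(text: str, stop_word: str, ascii_only: bool = False) -> str:
    """Replace whole-word, case-insensitive matches of stop_word with its stop-list spelling.

    ascii_only=True restricts \\w to [A-Za-z0-9_], which mimics the behaviour
    described in the post. ascii_only=False uses Python's default
    Unicode-aware \\w.
    """
    flags = re.IGNORECASE
    if ascii_only:
        flags |= re.ASCII
    # re.escape plays the role of \Q...\E: the stop word is matched literally.
    pattern = re.compile(r"(?<!\w)" + re.escape(stop_word) + r"(?!\w)", flags)
    # A function replacement avoids any backslash processing of stop_word.
    return pattern.sub(lambda match: stop_word, text)


if __name__ == "__main__":
    print(apply_stop_word("Gréce", "CE", ascii_only=True))   # ASCII-only \w: the bug
    print(apply_stop_word("Gréce", "CE"))                    # Unicode-aware \w
    print(apply_stop_word("the dna sample", "DNA"))          # normal stop-word case
```

The only difference between the two calls on Gréce is whether é counts as a word character for the lookbehind, which is the whole of the problem.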

Suggestions on how to deal with such examples are appreciated.

7 posts - 3 participants