Please check out the DynaPDFParserMBS class in the MBS Xojo DynaPDF Plugin. This class allows you to:
- Parse a page
- Extract text
- Find text
- Replace text
- Find characters
- Delete text
- Write changes back to page
You can limit the search to a part of the page or search the whole page, and you can set options such as whether the text search is case insensitive.
Today we want to show you how to identify the exact position of any character in a PDF. The picture shows every character outlined with a box, even for mirrored or rotated text:
Let us show the code for this. You may review the example project Text Positions with parser and see where we load the PDF. Once it is loaded, we initialize the DynaPDFParserMBS object. We use the kstMatchAlways search type here, so the parser does not look for a particular text but reports the position of every character:
// find every character and record its position
Dim Parser As New DynaPDFParserMBS(p)
Dim area As DynaPDFRectMBS = Nil // whole page
Dim SearchType As Integer = DynaPDFParserMBS.kstMatchAlways
Dim ContentParsingFlags As Integer = DynaPDFParserMBS.kcpfEnableTextSelection
If Parser.ParsePage(1, ContentParsingFlags) Then
  Dim index As Integer = 0
  Dim found As Boolean = Parser.FindText(area, SearchType, "")
  While found
    Dim t As New PDFText
    t.Text = Parser.SelText
    t.rect = Parser.SelBBox // bounding rectangle
    t.index = index
    t.points = Parser.SelBBox2 // bounding box as an array of points
    texts.Append t
    index = index + 1
    // pass True to continue the search after the previous match
    found = Parser.FindText(area, SearchType, "", True)
  Wend
End If
The loop runs as long as there is more text. For each character, we get the selection text and the bounding box as an array of points. You could of course use just the rectangle, but that won’t handle rotated text. We continue the loop by calling FindText again and passing True to continue the search.
In the paint event of the window, we draw the PDF page first. Then we loop over the found text pieces and surround each character with a box drawn from the points we got:
For Each t As PDFText In texts
  Dim points() As DynaPDFPointMBS = t.points
  g.ForeColor = &c00FF00
  g.DrawLine points(0).X * factor, points(0).Y * factor, points(1).X * factor, points(1).Y * factor
  g.DrawLine points(1).X * factor, points(1).Y * factor, points(2).X * factor, points(2).Y * factor
  g.DrawLine points(2).X * factor, points(2).Y * factor, points(3).X * factor, points(3).Y * factor
  g.DrawLine points(3).X * factor, points(3).Y * factor, points(0).X * factor, points(0).Y * factor
Next
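The four DrawLine calls above connect each corner to the next one and finally close the box. As a minimal sketch, assuming the same g, factor and texts variables as in the paint event above, and assuming the points array always holds exactly four corners, the outline could also be drawn with an inner loop:

```xojo
// Sketch only: same drawing as above, written as a loop.
// Assumes g, factor and texts exist as in the paint event.
For Each t As PDFText In texts
  Dim points() As DynaPDFPointMBS = t.points
  If points.Ubound <> 3 Then Continue // assumption: four corners per character
  g.ForeColor = &c00FF00
  For i As Integer = 0 To 3
    Dim j As Integer = (i + 1) Mod 4 // wrap around to close the box
    g.DrawLine points(i).X * factor, points(i).Y * factor, points(j).X * factor, points(j).Y * factor
  Next
Next
```

The Mod 4 makes the last corner connect back to the first one, so the result matches the four explicit calls.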
As shown, you can find out exactly where each character is. You may use the DeleteText function to precisely cut text and remove individual characters from the PDF page. Or you can annotate the PDF page: for example, you could add WebLinks to specific words once you know the surrounding rectangle.
Please try the example project and let us know what questions you have. The SelBBox2 and SelText properties were added in v24.1 in response to customer requests.
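If you need a single surrounding rectangle for a character or a word, you can derive it from the corner points. The following is a minimal sketch, not part of the plugin: BoundsOfPoints is a hypothetical helper name, and it assumes only that DynaPDFPointMBS exposes X and Y as used in the drawing code above. The result is in the same coordinate space as the points passed in.

```xojo
// Hypothetical helper: smallest upright rectangle around a set of points.
Sub BoundsOfPoints(points() As DynaPDFPointMBS, ByRef minX As Double, ByRef minY As Double, ByRef maxX As Double, ByRef maxY As Double)
  If points.Ubound < 0 Then Return // empty array, nothing to measure
  // start with the first point
  minX = points(0).X
  maxX = points(0).X
  minY = points(0).Y
  maxY = points(0).Y
  // widen the bounds with every further point
  For i As Integer = 1 To points.Ubound
    minX = Min(minX, points(i).X)
    maxX = Max(maxX, points(i).X)
    minY = Min(minY, points(i).Y)
    maxY = Max(maxY, points(i).Y)
  Next
End Sub
```

To cover a whole word, you could collect the points of all its characters in one array and pass that to the helper. For rotated text, the upright rectangle will be larger than the text itself.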