How to Extract & Search for Text with Foxit PDF SDK (.NET)

Text Page

Foxit PDF SDK provides APIs to extract, select, search, and retrieve text in PDF documents. The text content of a PDF page is represented by a TextPage object, which is bound to that specific page. The TextPage class can retrieve information about the text on a page, such as a single character, a single word, or the text within a specified character range or rectangle. It can also be used to construct objects of other text-related classes that perform further operations on the text or access specific information in it:

To search for text in a PDF page, construct a TextSearch object from a TextPage object.
To access hypertext links that appear in the page text, construct a PageTextLinks object from a TextPage object.

Example:

How to extract text from a PDF page

using foxit.common;
using foxit.pdf;
...
// Assuming PDFPage page has been loaded and parsed, and that writer is an
// open System.IO.TextWriter (for example, a StreamWriter).
using (var text_page = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal))
{
    // Extract every character on the page in one call.
    int count = text_page.GetCharCount();
    if (count > 0)
    {
        string chars = text_page.GetChars(0, count);
        writer.Write(chars);
    }
}
...
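
The snippet above assumes the page is already loaded and parsed. As a minimal sketch of that missing step, the helper below walks every page of a document, parses it, and collects its text. The class and method names (TextExtraction, ExtractAllText) are hypothetical, and the sketch assumes the library has been initialized and the document loaded elsewhere.

```csharp
using System.Text;
using foxit.pdf;

static class TextExtraction
{
    // Hypothetical helper: returns the text of every page, one page per line group.
    // Assumes the Foxit library is initialized and doc has been loaded.
    public static string ExtractAllText(PDFDoc doc)
    {
        var sb = new StringBuilder();
        int page_count = doc.GetPageCount();
        for (int i = 0; i < page_count; i++)
        {
            using (PDFPage page = doc.GetPage(i))
            {
                // A page has to be parsed before a TextPage is built from it.
                page.StartParse((int)PDFPage.ParseFlags.e_ParsePageNormal, null, false);
                using (var text_page = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal))
                {
                    int count = text_page.GetCharCount();
                    if (count > 0)
                    {
                        sb.AppendLine(text_page.GetChars(0, count));
                    }
                }
            }
        }
        return sb.ToString();
    }
}
```

Disposing each PDFPage and TextPage inside the loop keeps native resources from accumulating on long documents.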

How to select text in a rectangular area of a PDF page

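using foxit.common;
using foxit.common.fxcrt;
using foxit.pdf;
...
// Assuming PDFPage page has been loaded and parsed.
// RectF arguments are given in the order left, bottom, right, top.
RectF rect = new RectF(100, 50, 220, 100);
TextPage text_page = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal);
// Returns the text that falls inside the rectangle.
string str_text = text_page.GetTextInRect(rect);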
...
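
Going the other way, from a character range to the areas it occupies, is useful for highlighting extracted text. The following is a rough sketch that relies on TextPage.GetTextRectCount and TextPage.GetTextRect; the wrapper names (TextGeometry, GetRangeRects) are hypothetical, and the text page is assumed to be constructed as in the examples above.

```csharp
using System.Collections.Generic;
using foxit.common.fxcrt;
using foxit.pdf;

static class TextGeometry
{
    // Hypothetical helper: collects the rectangles covering `count` characters
    // starting at character index `start` on the given text page.
    public static List<RectF> GetRangeRects(TextPage text_page, int start, int count)
    {
        var rects = new List<RectF>();
        // GetTextRectCount prepares the rectangles for the range and returns how many there are.
        int rect_count = text_page.GetTextRectCount(start, count);
        for (int i = 0; i < rect_count; i++)
        {
            rects.Add(text_page.GetTextRect(i));
        }
        return rects;
    }
}
```

A range that wraps across lines typically yields more than one rectangle, so callers should not assume a single result.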

Text Search

Foxit PDF SDK provides APIs to search for text in a PDF document, an XFA document, a text page, or a PDF annotation's appearance. It offers functions to configure a search, run it, and read the results:

To specify the search pattern and options, use TextSearch.SetPattern, TextSearch.SetStartPage and TextSearch.SetEndPage (both only meaningful when searching a PDF document), and TextSearch.SetSearchFlags.
To perform the search, use TextSearch.FindNext or TextSearch.FindPrev.
To get the search results, use the TextSearch.GetMatchXXX() functions, for example TextSearch.GetMatchRects.

Example:

How to search for a text pattern in a PDF document

using foxit.common;
using foxit.common.fxcrt;
using foxit.pdf;
...
// Assuming PDFDoc doc has been loaded.
using (TextSearch search = new TextSearch(doc, null))
{
    // Search every page of the document.
    int start_index = 0;
    int end_index = doc.GetPageCount() - 1;
    search.SetStartPage(start_index);
    search.SetEndPage(end_index);
    string pattern = "Foxit";
    search.SetPattern(pattern);
    int flags = (int)TextSearch.SearchFlags.e_SearchNormal;
    search.SetSearchFlags(flags);
    int match_count = 0;
    while (search.FindNext())
    {
        // Rectangles that cover the current match on its page.
        RectFArray rect_array = search.GetMatchRects();
        match_count++;
    }
}
...
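
A search can also be limited to a single page by constructing the TextSearch object from a TextPage. The sketch below shows that variant with combined search flags and reads a few more match details through the GetMatchXXX functions. The names PageSearch and CountMatches are hypothetical, and the flag combination is only an illustration; the text page is assumed to come from a loaded, parsed page.

```csharp
using System;
using foxit.pdf;

static class PageSearch
{
    // Hypothetical helper: counts case-sensitive, whole-word matches on one page
    // and prints where each match was found.
    public static int CountMatches(TextPage text_page, string pattern)
    {
        using (var search = new TextSearch(text_page))
        {
            search.SetPattern(pattern);
            // Flags are bit values, so they can be combined with a bitwise OR.
            int flags = (int)TextSearch.SearchFlags.e_SearchMatchCase
                      | (int)TextSearch.SearchFlags.e_SearchMatchWholeWord;
            search.SetSearchFlags(flags);

            int match_count = 0;
            while (search.FindNext())
            {
                Console.WriteLine("chars {0}-{1}: {2}",
                    search.GetMatchStartCharIndex(),
                    search.GetMatchEndCharIndex(),
                    search.GetMatchSentence());
                match_count++;
            }
            return match_count;
        }
    }
}
```

SetStartPage and SetEndPage are left out here because they only apply to document-level searches.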

Text Link

In a PDF page, text that represents a hypertext link to a website or other internet resource, or an email address, is stored the same way as ordinary text. Before processing a text link, first call PageTextLinks.GetTextLink to get a TextLink object.

Example:

How to retrieve hyperlinks in a PDF page

using foxit.common;
using foxit.pdf;
...
// Assuming PDFPage page has been loaded and parsed.
// Get the text page object.
TextPage text_page = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal);
PageTextLinks page_textlinks = new PageTextLinks(text_page);
// index is the zero-based position of the link to retrieve; the caller must supply it.
TextLink text_link = page_textlinks.GetTextLink(index);
string str_url = text_link.GetURI();
...
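
Because the example above needs a valid index, it is often more practical to enumerate every link on the page. The following is a minimal sketch that uses PageTextLinks.GetTextLinkCount to bound the loop; the names LinkExtraction and GetPageUris are hypothetical, and the page is assumed to be loaded and parsed.

```csharp
using System.Collections.Generic;
using foxit.pdf;

static class LinkExtraction
{
    // Hypothetical helper: returns the URI of every text link found on the page.
    public static List<string> GetPageUris(PDFPage page)
    {
        var uris = new List<string>();
        using (var text_page = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal))
        using (var page_textlinks = new PageTextLinks(text_page))
        {
            int link_count = page_textlinks.GetTextLinkCount();
            for (int i = 0; i < link_count; i++)
            {
                using (TextLink text_link = page_textlinks.GetTextLink(i))
                {
                    uris.Add(text_link.GetURI());
                }
            }
        }
        return uris;
    }
}
```

An empty list simply means no link-like text was detected on the page.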

Updated on October 23, 2019
