#include <TextExtractor.h>
Classes | |
class | Line |
TextExtractor::Line object represents a line of text on a PDF page. More... | |
class | Style |
A class representing predominant text style associated with a given Line, a Word, or a Glyph. More... | |
class | Word |
TextExtractor::Word object represents a word on a PDF page. More... | |
Public Types | |
enum | ProcessingFlags { e_no_ligature_exp = 1, e_no_dup_remove = 2, e_punct_break = 4, e_remove_hidden_text = 8, e_no_invisible_text = 16 } |
Processing options that can be passed in Begin() method to direct the flow of content recognition algorithms. More... | |
enum | XMLOutputFlags { e_words_as_elements = 1, e_output_bbox = 2, e_output_style_info = 4 } |
Flags controlling the structure of XML output in a call to GetAsXML(). More... | |
Public Member Functions | |
TextExtractor () | |
Constructor and destructor. | |
~TextExtractor () | |
void | Begin (Page page, const Rect *clip_ptr=0, UInt32 flags=0) |
Start reading the page. | |
int | GetWordCount () |
void | GetAsText (UString &out_str, bool dehyphen=true) |
Get all words in the current selection as a single string. | |
void | GetAsXML (UString &out_xml, UInt32 xml_output_flags=0) |
Get text content in the form of an XML string. | |
int | GetNumLines () |
Line | GetFirstLine () |
The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor include:
The main task of TextExtractor is to interpret PDF pages and offer a simple-to-use API to:
Note: TextExtractor analyzes only the textual content of the page. This means that rasterized text (e.g. in scanned pages) or vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.
In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.
A sample use case (in C++):
    ... Initialize PDFNet ...
    PDFDoc doc(filein);
    doc.InitSecurityHandler();
    Page page = *doc.PageBegin();
    TextExtractor txt;
    txt.Begin(page, 0, TextExtractor::e_remove_hidden_text);
    UString text;
    txt.GetAsText(text);

    // or traverse words one by one...
    TextExtractor::Line line = txt.GetFirstLine(), lend;
    TextExtractor::Word word, wend;
    for (; line!=lend; line=line.GetNextLine()) {
        for (word=line.GetFirstWord(); word!=wend; word=word.GetNextWord()) {
            text.Assign(word.GetString(), word.GetStringLen());
            cout << text << '\n';
        }
    }
A sample use case (in C#):
    ... Initialize PDFNet ...
    PDFDoc doc = new PDFDoc(filein);
    doc.InitSecurityHandler();
    Page page = doc.PageBegin().Current();
    TextExtractor txt = new TextExtractor();
    txt.Begin(page, 0, TextExtractor.ProcessingFlags.e_remove_hidden_text);
    string text = txt.GetAsText();

    // or traverse words one by one...
    TextExtractor.Word word;
    for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line=line.GetNextLine()) {
        for (word=line.GetFirstWord(); word.IsValid(); word=word.GetNextWord()) {
            Console.WriteLine(word.GetString());
        }
    }
For full sample code, see the TextExtract sample project.
Processing options that can be passed in Begin() method to direct the flow of content recognition algorithms.
Flags controlling the structure of XML output in a call to GetAsXML().
pdftron::PDF::TextExtractor::TextExtractor()
Constructor and destructor.
pdftron::PDF::TextExtractor::~TextExtractor()
void pdftron::PDF::TextExtractor::Begin(Page page, const Rect * clip_ptr = 0, UInt32 flags = 0)
Start reading the page.
page | Page to read. | |
clip_ptr | A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle. | |
flags | A bitwise OR of ProcessingFlags values used to control the text extraction algorithm. |
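Because the ProcessingFlags values are distinct powers of two, several options can be combined into the single flags argument with bitwise OR. The following is a minimal standalone sketch, not PDFNet code: the enum is a local mirror of the values listed above, and HiddenTextFilter and HasFlag are hypothetical helpers introduced only for illustration.

```cpp
#include <cstdint>

// Local mirror of the ProcessingFlags values listed in this reference.
// Assumption: defined here only so the sketch compiles without the PDFNet headers.
enum ProcessingFlags : std::uint32_t {
    e_no_ligature_exp    = 1,
    e_no_dup_remove      = 2,
    e_punct_break        = 4,
    e_remove_hidden_text = 8,
    e_no_invisible_text  = 16
};

// Hypothetical helper: the two flags that the overview suggests for
// removing hidden text, combined into one value.
constexpr std::uint32_t HiddenTextFilter() {
    return e_remove_hidden_text | e_no_invisible_text;
}

// Hypothetical helper: true if the given flag bit is set in a combined value.
constexpr bool HasFlag(std::uint32_t flags, ProcessingFlags flag) {
    return (flags & flag) != 0;
}
```

With the real header, a combined value of this kind would be passed as the third argument of Begin(), for example txt.Begin(page, 0, TextExtractor::e_remove_hidden_text | TextExtractor::e_no_invisible_text).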
int pdftron::PDF::TextExtractor::GetWordCount()
void pdftron::PDF::TextExtractor::GetAsText(UString & out_str, bool dehyphen = true)
Get all words in the current selection as a single string.
out_str | The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters. | |
dehyphen | If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files. |
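To make the effect of the dehyphen option concrete, here is a deliberately simplified sketch of the idea. It is not the detection logic PDFNet uses, which is not documented here: JoinLines is a hypothetical function that treats any trailing '-' as a line-break hyphen, so it would also merge words that are legitimately hyphenated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration of dehyphenation, not the PDFNet algorithm.
// Joins lines with '\n'. When dehyphen is true and a line ends with '-',
// the hyphen is dropped and the next line is appended without a separator.
std::string JoinLines(const std::vector<std::string>& lines, bool dehyphen) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        const bool has_next = i + 1 < lines.size();
        if (dehyphen && has_next && !line.empty() && line.back() == '-') {
            // Drop the trailing hyphen and merge with the following line.
            out.append(line, 0, line.size() - 1);
        } else {
            out += line;
            if (has_next) out += '\n';
        }
    }
    return out;
}
```

For the two lines "a compre-" and "hensive toolkit", the sketch yields "a comprehensive toolkit" with dehyphen set to true and keeps the hyphen and line break otherwise.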
void pdftron::PDF::TextExtractor::GetAsXML(UString & out_xml, UInt32 xml_output_flags = 0)
Get text content in the form of an XML string.
out_xml | - The string containing XML output. | |
xml_output_flags | - flags controlling XML output. For more information, please see TextExtractor::XMLOutputFlags. |
    <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
      <Flow id="1">
        <Para id="1">
          <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
            <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
            <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
            <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
            ...
          </Line>
        </Para>
      </Flow>
    </Page>
The above XML output was generated by passing the following combination of flags in the call to GetAsXML(): (TextExtractor::e_words_as_elements | TextExtractor::e_output_bbox | TextExtractor::e_output_style_info)
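When e_output_bbox is set, each Line and Word element in the sample carries a box attribute made of four comma-separated numbers. A consumer of the XML might read such an attribute as sketched below. ParseBox is a hypothetical helper, not part of PDFNet, and the sketch makes no claim about what the four numbers mean; the sample values merely suggest an origin followed by a width and a height.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: parses a box attribute value such as
// "72, 708.075, 30.7614, 10.02" into four doubles.
// Returns false if fewer than four comma-separated numbers are found.
bool ParseBox(const std::string& attr, std::array<double, 4>& out) {
    std::istringstream in(attr);
    for (std::size_t i = 0; i < 4; ++i) {
        if (!(in >> out[i])) return false;  // read the next number
        if (i < 3) {
            char comma = 0;
            // Expect a comma between numbers; leading whitespace is skipped.
            if (!(in >> comma) || comma != ',') return false;
        }
    }
    return true;
}
```

The function only validates the shape of the value, so any interpretation of the numbers is left to the caller.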
If 'xml_output_flags' is not specified, the default XML output looks as follows:
    <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
      <Flow id="1">
        <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
        <Line>levels. Using the PDFNet PDF library, ...</Line>
        ...
      </Flow>
    </Page>
int pdftron::PDF::TextExtractor::GetNumLines()
Line pdftron::PDF::TextExtractor::GetFirstLine()
To traverse the list of all words on a given line, use line.GetFirstWord().