C# tutorial: area-based text extraction and margin finding


Location-based text extraction and margin finding

The tutorial Extract text from a PDF file shows how to extract all the text from a PDF file. In this tutorial, I am going to show you how to extract text from a specific area of a page and how to find the margins of a page.

To extract text from an area of a PDF page, you use the FilteredTextRenderListener class, which implements the ITextExtractionStrategy interface. When creating a FilteredTextRenderListener object, you provide two arguments: an instance of LocationTextExtractionStrategy and an instance of RegionTextRenderFilter. When you create the RegionTextRenderFilter, you pass the rectangle that defines the area of the page from which the text is extracted.

When you extract text based on areas or regions of PDF pages, it is important to know the margins of the pages, so that the text at the correct location is extracted. You can obtain the margins of a page with the TextMarginFinder class. An instance of TextMarginFinder is returned from the ProcessContent method of PdfReaderContentParser. With this instance, you can get the x and y coordinates where the text starts on a page, and the width and height of the area that the text occupies.
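As a minimal sketch of the margin lookup described above, the snippet below prints the text boundaries of every page. It assumes iTextSharp 5.x and reuses the file path from this tutorial. The MarginReport class and PrintMargins method are made-up names for illustration, and pages that contain no text are not handled.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class MarginReport {
    // Hypothetical helper: prints where the text starts on each page and
    // how much space it occupies. Pages without text are not handled here.
    static void PrintMargins(string path) {
        PdfReader reader = new PdfReader(path);
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int i = 1; i <= reader.NumberOfPages; i++) {
                // ProcessContent returns the listener it was given, now filled in.
                TextMarginFinder finder = parser.ProcessContent(i, new TextMarginFinder());
                Console.WriteLine("Page {0}: llx={1}, lly={2}, width={3}, height={4}",
                    i, finder.GetLlx(), finder.GetLly(),
                    finder.GetWidth(), finder.GetHeight());
            }
        } finally {
            reader.Close();
        }
    }

    static void Main() {
        PrintMargins("D:/jmf_tutorial.pdf"); // path taken from this tutorial
    }
}
```

The values are expressed in PDF user-space units (1/72 inch by default), with llx and lly giving the lower-left corner of the text area.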

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Extract the text in the lower-left quarter of each page's text area.
PdfReader reader = new PdfReader("D:/jmf_tutorial.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
using (StreamWriter sw = new StreamWriter("D:/result.txt")) {
    for (int i = 1; i <= reader.NumberOfPages; i++) {
        // Find the bounding box of all the text on page i.
        TextMarginFinder finder = parser.ProcessContent(i, new TextMarginFinder());
        // Rectangle takes two corners (llx, lly, urx, ury), so add half the
        // text width and height to the lower-left corner.
        Rectangle area = new Rectangle(
            finder.GetLlx(), finder.GetLly(),
            finder.GetLlx() + finder.GetWidth() / 2,
            finder.GetLly() + finder.GetHeight() / 2);
        RenderFilter filter = new RegionTextRenderFilter(area);
        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filter);
        sw.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
    }
}
reader.Close();
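
Because iTextSharp's Rectangle is constructed from two corner points (llx, lly, urx, ury) rather than from a width and a height, turning margin values into a region takes a small conversion. The sketch below shows only that arithmetic, without any iTextSharp dependency. The Region struct and the QuarterRegion method are hypothetical names used for illustration, and the numbers in Main are example values.

```csharp
using System;

// Hypothetical value type holding the four corner coordinates that
// iTextSharp's Rectangle(llx, lly, urx, ury) constructor expects.
struct Region {
    public float Llx, Lly, Urx, Ury;
    public Region(float llx, float lly, float urx, float ury) {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }
}

static class RegionMath {
    // Returns one quarter of the text area given by its lower-left corner
    // (llx, lly) and its width and height. 'right' selects the right half
    // instead of the left, 'top' selects the top half instead of the bottom.
    public static Region QuarterRegion(float llx, float lly, float width, float height,
                                       bool right, bool top) {
        float halfW = width / 2, halfH = height / 2;
        float x = right ? llx + halfW : llx;
        float y = top ? lly + halfH : lly;
        return new Region(x, y, x + halfW, y + halfH);
    }
}

class Demo {
    static void Main() {
        // Example values: text starts at (72, 72) and spans 400 x 600 units.
        Region r = RegionMath.QuarterRegion(72f, 72f, 400f, 600f, false, false);
        Console.WriteLine("{0} {1} {2} {3}", r.Llx, r.Lly, r.Urx, r.Ury);
        // prints "72 72 272 372"
    }
}
```

The four values can then be passed to new Rectangle(r.Llx, r.Lly, r.Urx, r.Ury) before creating the RegionTextRenderFilter.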


