C# tutorial: extract text from a PDF file


Extracting text from a PDF file

If you want to extract text from a PDF file, this tutorial is for you. In iTextSharp, you can use the PdfReaderContentParser and SimpleTextExtractionStrategy classes to extract all the text from a PDF file. Both classes are in the iTextSharp.text.pdf.parser namespace.

The PdfReaderContentParser class helps you process the content of the pages of a PdfReader object. When you create a PdfReaderContentParser object, you pass a PdfReader object to its constructor. The SimpleTextExtractionStrategy class is a simple text extraction renderer. After a page has been processed, its object contains all the text of that page, and you can call the GetResultantText method to get that text. The ProcessContent method of the PdfReaderContentParser class takes a page number and a strategy object and returns that strategy object. Note that SimpleTextExtractionStrategy assumes the text is drawn from top to bottom. If the PDF draws its text in a different order, the extracted text will not match the way the text appears on the page.



The example code below extracts all text from the oop-software-development.pdf to a text file called result.txt.

using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader("D:/oop-software-development.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
// The using block flushes and closes the writer (and its file) even if extraction throws.
using (StreamWriter sw = new StreamWriter("D:/result.txt"))
{
    // Page numbers start at 1.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // ProcessContent returns the strategy passed to it, now holding the text of page i.
        SimpleTextExtractionStrategy strategy = parser.ProcessContent(i, new SimpleTextExtractionStrategy());
        sw.WriteLine(strategy.GetResultantText());
    }
}
reader.Close();
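If you only need the text of one page, the same two classes can be wrapped in a small helper. This is a minimal sketch, assuming iTextSharp 5.x; the class name PdfPageText and the method name Extract are made up for illustration and are not part of iTextSharp.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; not part of iTextSharp.
public static class PdfPageText
{
    // Returns the text of a single page. Page numbers start at 1.
    public static string Extract(PdfReader reader, int pageNumber)
    {
        if (reader == null)
            throw new ArgumentNullException("reader");
        if (pageNumber < 1 || pageNumber > reader.NumberOfPages)
            throw new ArgumentOutOfRangeException("pageNumber");

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // ProcessContent returns the strategy object passed to it.
        SimpleTextExtractionStrategy strategy =
            parser.ProcessContent(pageNumber, new SimpleTextExtractionStrategy());
        return strategy.GetResultantText();
    }
}
```

For example, PdfPageText.Extract(reader, 1) would return the text of the first page of an open PdfReader. The range check is there so that a bad page number fails with a clear exception before the parser is called.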


Alternatively, you can use the PdfTextExtractor class to extract all the text from a specific page of the PDF. This class has a static method called GetTextFromPage that takes two parameters, a PdfReader object and a page number, and returns all the text of that page. By using PdfTextExtractor instead of the PdfReaderContentParser and SimpleTextExtractionStrategy classes, the code above can be rewritten as shown below:

using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader("D:/oop-software-development.pdf");
// The using block flushes and closes the writer (and its file) even if extraction throws.
using (StreamWriter sw = new StreamWriter("D:/result1.txt"))
{
    // Page numbers start at 1.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // GetTextFromPage returns all the text of page i as a string.
        string text = PdfTextExtractor.GetTextFromPage(reader, i);
        sw.WriteLine(text);
    }
}
reader.Close();
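The ordering caveat noted for SimpleTextExtractionStrategy can sometimes be eased by a strategy that sorts text by its position on the page. The sketch below assumes iTextSharp 5.x, which provides the LocationTextExtractionStrategy class and a three-argument overload of GetTextFromPage; the class name PdfLocationText and the method name ExtractAll are made up for illustration.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; not part of iTextSharp.
public static class PdfLocationText
{
    // Returns the text of every page, one page after another.
    public static string ExtractAll(PdfReader reader)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Use a fresh strategy for each page so text from earlier pages is not carried over.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
        }
        return sb.ToString();
    }
}
```

In the 5.x releases, the two-argument GetTextFromPage appears to use a location-based strategy by default already, so passing the strategy explicitly mainly makes the choice visible. Check the behavior of the version you use, and pass a SimpleTextExtractionStrategy instead if you want the same ordering as the first example.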



Comments


 lorretadt

Questions on ASP.NET Annotate PDF is expected to relate to programming within the rasteredge page. Consider ASP.NET: Annotate PDF leaving comments for improvement can be answered on rasteredge page http://www.rasteredge.com/how-to/vb-net-imaging/pdf-html5-feature-annotate/


2016-06-29

 lorretadt

Here is the link for you to vb.net extract text from pdf. Hope this gives you a start on rasteredge page http://www.rasteredge.com/how-to/vb-net-imaging/pdf-convert-text/


2016-05-18

 lorretadt

The first and easiest way to vb.net read pdf text to expression web is to use the rasteredage page http://www.rasteredge.com/how-to/vb-net-imaging/pdf-html5-feature-annotate/.

The vb.net add comments to pdf reader is not static, so you'll need to create an instance of the clas


2016-05-14

 lorretadt

Here is the link for you to C# .NET extract text from PDF. Hope this gives you a start on the rasteredge page http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-text/


2016-05-12

 lorretadt

rasteredge can provide you C# add comments to PDF reader, and you can download it to try it free on the rasteredge page http://www.rasteredge.com/how-to/csharp-imaging/pdf-html5-feature-annotate/



2016-05-04

 Evelyn Vale

I think this can be done more easily with the GemBox.Document component for .NET:

http://www.gemboxsoftware.com/document/articles/c-sharp-vb-net-read-pdf-extract-text


2015-12-10


