C# tutorial: PDF fill in form


Fill in PDF form

In the two previous pages, you learnt how to add form fields to the PDF document. In this tutorial, I am going to show you how to retrieve all fields' names in an existing PDF document and fill data in the fields.

The PdfStamper class has the AcroFields property, which gives you access to all fields in the PDF document. To enumerate the fields, use the Fields property of the AcroFields object. The Fields property returns an IDictionary collection keyed by field name, so you can loop over it with foreach and output the name of each field. As you see in the picture below, the PDF document (pdfdoc.pdf) has three fields: txtname, txtmail, and combobox.

pdf blank form

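//namespaces used by this example (iTextSharp 5.x assumed)
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

//path to source file
String source = "d:/pdfdoc.pdf";
//create PdfReader object to read the source file
PdfReader reader = new PdfReader(source);
//PdfStamper object to modify the content of the PDF
PdfStamper stamp = new PdfStamper(reader, new FileStream("d:/formfilled.pdf", FileMode.Create));
//get form fields
AcroFields form = stamp.AcroFields;
IDictionary<string, AcroFields.Item> fs = form.Fields;
//print the name of each field
foreach (var f in fs)
{
    Console.WriteLine(f.Key);
}
//close the stamper (this also closes the output file) and the reader
stamp.Close();
reader.Close();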

pdf retrieve field names
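An IDictionary makes no promise about enumeration order, so the names may not print in the order in which the fields appear on the page. If you only need a stable listing, one option is to copy the keys into a list and sort them. The following is a minimal sketch under that assumption: the dictionary stands in for form.Fields, and FieldNames and Sorted are hypothetical names, not part of the PDF library.

```csharp
using System;
using System.Collections.Generic;

static class FieldNames
{
    // Returns the keys of 'fields' in ordinal (byte-wise) order.
    // 'fields' stands in for the dictionary returned by form.Fields.
    public static List<string> Sorted<TItem>(IDictionary<string, TItem> fields)
    {
        List<string> names = new List<string>(fields.Keys);
        names.Sort(StringComparer.Ordinal);
        return names;
    }

    static void Main()
    {
        // Stand-in data: only the keys matter for this sketch.
        var fields = new Dictionary<string, object>
        {
            { "txtname", null },
            { "combobox", null },
            { "txtmail", null }
        };

        // Prints combobox, txtmail, txtname (one per line).
        foreach (string name in Sorted(fields))
        {
            Console.WriteLine(name);
        }
    }
}
```

Sorting gives a repeatable order, not the visual order of the fields in the document.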

Now that you know the fields in the PDF document, let's move on to filling data in them. You can fill in the text fields by using the SetField(string name, string value) method of the AcroFields class. The name parameter specifies the name of the field, and the value parameter is the text to set in that field. To select a value in the combo box, you can use the SetListSelection(string name, string[] value) method. It is similar to the SetField method. However, instead of a single value, the SetListSelection method takes an array of values to select in the combo box.

//path to source file
String source = "d:/pdfdoc.pdf";
//create PdfReader object to read the source file
PdfReader reader = new PdfReader(source);
//PdfStamper object to modify the content of the PDF
PdfStamper stamp = new PdfStamper(reader, new FileStream("d:/formfilled.pdf", FileMode.Create));
//get form fields
AcroFields form = stamp.AcroFields;
//fill in text fields
form.SetField("txtname", "Yuk dara");
form.SetField("txtmail", "yuk.dara@gmail.com");
//select combo box
form.SetListSelection("combobox", new string[]{"Khmer"});
//close the stamper to write the output file, then release the reader
stamp.Close();
reader.Close();
//view the result pdf file
System.Diagnostics.Process.Start("d:/formfilled.pdf");

pdf filled in form


Comments

Pramod comment

 Pramod

I want to fill that PDF dynamically with database values.
Actually, my code returns the form fields in random order, not sequentially.
Please let me know the process.

Thank you.


2018-04-06
g comment

 g

A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a format devised by Adobe Systems, Inc. in 1993. It was proprietary to Adobe until it was published as the open standard ISO 32000 in 2008. It is derived from PostScript, a stack-based language in the style of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
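The effect of a longer value on the offsets can be seen with a toy example. The sketch below is not a real PDF file: it uses two made-up objects in an ASCII string (OffsetDemo and its members are hypothetical names) and shows that the second object moves by exactly the number of characters added to the first, which is what makes a stale offset table point to the wrong place.

```csharp
using System;

static class OffsetDemo
{
    // Two toy objects; a real PDF file contains many more, plus binary streams.
    public const string Original =
        "1 0 obj\n<< /T (Name) /V (Smith) >>\nendobj\n" +
        "2 0 obj\n<< /T (City) /V (Paris) >>\nendobj\n";

    // For pure ASCII text the character index equals the byte offset.
    public static int OffsetOfObject(string pdfText, int objectNumber)
    {
        return pdfText.IndexOf(objectNumber + " 0 obj", StringComparison.Ordinal);
    }

    static void Main()
    {
        // Naive text replacement: "Washington" is 5 characters longer than "Smith".
        string edited = Original.Replace("(Smith)", "(Washington)");

        int before = OffsetOfObject(Original, 2);
        int after = OffsetOfObject(edited, 2);

        Console.WriteLine("object 2 offset before: " + before);
        Console.WriteLine("object 2 offset after:  " + after);
        Console.WriteLine("shift: " + (after - before)); // shift: 5
    }
}
```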

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
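A minimal sketch of this truncate-or-pad workaround follows. FitToWidth is a hypothetical helper, not part of any library. It assumes plain ASCII values without parentheses or backslashes, since those would need escaping inside a PDF literal string and would change the byte count.

```csharp
using System;

static class FixedWidth
{
    // Returns 'value' cut or padded with spaces to exactly 'width' characters,
    // so that replacing a field value never changes the length of the file.
    public static string FitToWidth(string value, int width)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (width < 0) throw new ArgumentOutOfRangeException("width");

        if (value.Length > width)
        {
            return value.Substring(0, width); // too long: truncate
        }
        return value.PadRight(width, ' ');    // too short: pad on the right
    }

    static void Main()
    {
        // The field originally held "Smith", i.e. 5 characters.
        Console.WriteLine("[" + FitToWidth("Washington", 5) + "]"); // [Washi]
        Console.WriteLine("[" + FitToWidth("Doe", 5) + "]");        // [Doe  ]
    }
}
```

The truncation in the first call shows why the workaround is only acceptable when the reserved width is generous.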

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
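To make the first step and the append mechanism more concrete: a reader locates the newest cross reference table through the startxref keyword near the end of the file, which is followed by the byte offset of that table. Each appended update adds its own startxref entry, so the last one in the file is the one to start from. The sketch below only finds that offset. It is not taken from the article's parser, and XrefLocator and FindStartXref are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

static class XrefLocator
{
    // Returns the number that follows the last "startxref" keyword,
    // or -1 if the keyword or the number is missing.
    public static long FindStartXref(byte[] pdf)
    {
        // ASCII decoding yields one char per byte (non-ASCII bytes become '?'),
        // so string indexes equal byte offsets.
        string text = Encoding.ASCII.GetString(pdf);
        const string keyword = "startxref";

        int pos = text.LastIndexOf(keyword, StringComparison.Ordinal);
        if (pos < 0) return -1;

        // Skip the keyword and the line break(s) after it.
        pos += keyword.Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;

        // Collect the decimal digits of the offset.
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start) return -1;

        return long.Parse(text.Substring(start, pos - start), CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        // Toy data: the offsets are made up and the objects are omitted.
        string original = "%PDF-1.4\n% (objects omitted)\nstartxref\n1234\n%%EOF\n";
        string updated = original +
            "% (changed objects and new table omitted)\nstartxref\n2500\n%%EOF\n";

        Console.WriteLine(FindStartXref(Encoding.ASCII.GetBytes(original))); // 1234
        Console.WriteLine(FindStartXref(Encoding.ASCII.GetBytes(updated)));  // 2500
    }
}
```

A complete reader would then follow the link from the newest table back to the older ones; that part is left out here.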
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored: the file is written unchanged if "Name" cannot be set
}

// write the updated file back out; the using block closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser could not originally deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; version 1.2 added support for them (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
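The hexadecimal strings with possibly missing digits mentioned above follow a simple rule in the PDF Reference: whitespace between the angle brackets is ignored, and if the number of digits is odd, the final digit is assumed to be 0. The sketch below illustrates that rule only. It is not the article's implementation, and HexStrings and DecodeHexString are hypothetical names.

```csharp
using System;
using System.Text;

static class HexStrings
{
    // Decodes a PDF hexadecimal string token such as "<48656C6C6F>" into bytes.
    public static byte[] DecodeHexString(string token)
    {
        if (token == null) throw new ArgumentNullException("token");

        StringBuilder digits = new StringBuilder();
        foreach (char c in token)
        {
            // Skip the delimiters and any whitespace inside the string.
            if (c == '<' || c == '>' || char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c))
            {
                throw new FormatException("Not a hexadecimal digit: " + c);
            }
            digits.Append(c);
        }

        // An odd number of digits means the final digit is missing: assume 0.
        if (digits.Length % 2 == 1) digits.Append('0');

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Encoding.ASCII.GetString(DecodeHexString("<48656C6C6F>"))); // Hello
        Console.WriteLine(BitConverter.ToString(DecodeHexString("<901FA>")));         // 90-1F-A0
    }
}
```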
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

Article Copyright 2004 by Michael Ganss
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" with "Jones" you will find that the displayed field content has not changed; it changes only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross reference table) with the byte offsets of the start of all objects.

If you change the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file becomes corrupt. Acrobat can often repair minor errors in the offset table, so you will usually still see something, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
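
To make the fixed-length workaround concrete, here is a minimal sketch. The class and method names (FixedWidthValue, FitToLength) are made up for this illustration and are not part of the parser described below. It assumes plain ASCII values that contain no parentheses or backslashes, so that one character corresponds to exactly one byte in the file.

```csharp
using System;

public static class FixedWidthValue
{
    // Returns 'replacement' truncated or padded with spaces so that it has
    // exactly as many characters as 'original'. With plain ASCII text this
    // keeps the byte offsets of all later objects unchanged.
    public static string FitToLength(string original, string replacement)
    {
        int width = original.Length;
        if (replacement.Length > width)
        {
            return replacement.Substring(0, width);
        }
        return replacement.PadRight(width, ' ');
    }

    public static void Main()
    {
        Console.WriteLine("[" + FitToLength("Smith", "Washington") + "]"); // [Washi]
        Console.WriteLine("[" + FitToLength("Smith", "Doe") + "]");        // [Doe  ]
    }
}
```

The truncated "Washi" shows why this is only a stopgap: the value has to fit the space that was reserved when the form was designed.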

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special kind of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
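
As an illustration of where the first step starts, the sketch below locates the most recent cross reference table. A PDF file ends with the keyword startxref, the byte offset of the last cross reference section, and the %%EOF marker. The class and method names (XrefLocator, FindStartXref) are hypothetical and this is not the code from the download; it only assumes that the keyword sits within the last 1024 bytes of the file.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class XrefLocator
{
    // Returns the byte offset that follows the last "startxref" keyword,
    // or -1 if the keyword or the number cannot be found.
    public static long FindStartXref(byte[] pdfBytes)
    {
        // Only the tail of the file is inspected. ASCII decoding maps every
        // byte to exactly one character, so binary data cannot shift indexes.
        int tailLength = Math.Min(pdfBytes.Length, 1024);
        string tail = Encoding.ASCII.GetString(
            pdfBytes, pdfBytes.Length - tailLength, tailLength);

        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
        {
            return -1;
        }

        // Skip the line break(s) after the keyword, then collect the digits.
        int pos = keyword + "startxref".Length;
        while (pos < tail.Length && char.IsWhiteSpace(tail[pos]))
        {
            pos++;
        }
        int start = pos;
        while (pos < tail.Length && tail[pos] >= '0' && tail[pos] <= '9')
        {
            pos++;
        }
        if (pos == start)
        {
            return -1;
        }
        return long.Parse(tail.Substring(start, pos - start),
                          CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        byte[] sample = Encoding.ASCII.GetBytes(
            "%PDF-1.4\n...\nstartxref\n1234\n%%EOF\n");
        Console.WriteLine(FindStartXref(sample)); // 1234
    }
}
```

From that offset a parser can read the cross reference section and follow the /Prev entry in each trailer to reach older sections.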
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
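
The append mechanism can be sketched as follows. This is an illustration under stated assumptions, not the parser's actual WritePdf code: the names (IncrementalUpdate, Build) and all values in Main are made up, the object body is assumed to be plain ASCII, the original file is assumed to end with a line break, and a real trailer would also have to carry over any other entries of the previous trailer (for example /Info or /ID).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdate
{
    // Builds the text that is appended to an existing PDF to replace one
    // object (generation 0). 'appendOffset' is the length of the original
    // file, 'previousXrefOffset' is its old startxref value.
    public static string Build(int objectNumber, string objectBody,
                               long appendOffset, long previousXrefOffset,
                               int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder sb = new StringBuilder();

        // 1. the new version of the object
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // 2. a cross reference section with one subsection of one entry;
        //    with ASCII text the character count equals the byte count
        long xrefOffset = appendOffset + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        // each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        // the letter n, and a two-character line ending
        sb.Append(appendOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // 3. a trailer that links back to the old table through /Prev
        sb.Append("trailer\n");
        sb.Append("<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv));
        sb.Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n");
        sb.Append("%%EOF\n");
        return sb.ToString();
    }

    public static void Main()
    {
        // all numbers below are invented for the example
        Console.Write(Build(2774, "<< /T (Name) /FT /Tx /V (Doe) >>",
                            50000, 48000, 2775, "1 0 R"));
    }
}
```

Because nothing in the original bytes is touched, the old offsets stay valid, and a reader that follows /Prev finds the older entries for every object that was not rewritten.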
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
    // requires: using System.IO;

    // read the file and parse it
    PdfReader reader = new PdfReader(filename);

    // change one text field
    try
    {
        ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
    }
    catch
    {
        // ignore the error if there is no text field called "Name"
        // (note that this also hides any other exception)
    }

    // write the updated file back out; "using" closes the stream
    // even if WritePdf throws
    using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
    {
        reader.WritePdf(fileStream);
    }

Most properties of fields are accessible through properties in .NET as well, e.g.:

    // a radio button
    PdfRadioButtonField radio = ...;
    // set the selected button; "Off" means that no button is selected
    radio.SelectedItem = "MasterCard";
    // one button must be pressed
    radio.NoToggleToOff = true;

    // a check box
    PdfCheckBoxField checkBox = ...;
    // check it
    checkBox.Checked = true;

    // a text field
    PdfTXField textField = ...;
    // set the text
    textField.Text = "Hello, World.";
    // mark it as a password field
    textField.Password = true;

    // a combo or list box
    PdfCHField choice = ...;
    // render as combo box
    choice.Combo = true;
    // more than one item is selectable
    choice.MultiSelect = true;
    // select items 1 and 3
    choice.SetSelectedIndexes(1, 3);
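
In the PDF file itself, boolean properties such as Password, Combo or MultiSelect are stored as single bits of the /Ff (field flags) integer seen in the excerpt in the Background section. The sketch below shows how such properties could map onto that integer. The enum and helper names are invented for this example and I am not claiming the parser is implemented this way; the bit positions are the ones given in the PDF Reference, and only a subset is listed.

```csharp
using System;

[Flags]
public enum FieldFlags
{
    None = 0,
    ReadOnly = 1 << 0,         // bit 1, all field types
    Required = 1 << 1,         // bit 2, all field types
    Multiline = 1 << 12,       // bit 13, text fields
    Password = 1 << 13,        // bit 14, text fields
    NoToggleToOff = 1 << 14,   // bit 15, radio buttons
    Radio = 1 << 15,           // bit 16, button fields
    Pushbutton = 1 << 16,      // bit 17, button fields
    Combo = 1 << 17,           // bit 18, choice fields
    MultiSelect = 1 << 21,     // bit 22, choice fields
    DoNotSpellCheck = 1 << 22  // bit 23, text and choice fields
}

public static class FieldFlagsDemo
{
    // True if the given flag is set in the /Ff value.
    public static bool IsSet(int ff, FieldFlags flag)
    {
        return (ff & (int)flag) != 0;
    }

    // Returns the /Ff value with the given flag switched on or off.
    public static int Set(int ff, FieldFlags flag, bool on)
    {
        return on ? (ff | (int)flag) : (ff & ~(int)flag);
    }

    public static void Main()
    {
        int ff = 4194304;                                // /Ff from the excerpt
        Console.WriteLine((FieldFlags)ff);               // DoNotSpellCheck
        ff = Set(ff, FieldFlags.Password, true);
        Console.WriteLine(ff);                           // 4202496
        Console.WriteLine(IsSet(ff, FieldFlags.Combo));  // False
    }
}
```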
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences, and hexadecimal strings with a possibly missing final digit. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Up to version 1.1 the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; version 1.2 added support for them (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
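
To illustrate the string handling mentioned in the first point, here is a small sketch of decoding a hexadecimal string. According to the PDF Reference, a missing final digit is assumed to be 0, and a text string that starts with the bytes FE FF is UTF-16 (big-endian). The names (PdfStrings, DecodeHex, DecodeText) are made up for this example and this is not the parser's own code. Strings without the FE FF marker use PDFDocEncoding, which the sketch only approximates by mapping each byte to the character with the same code.

```csharp
using System;
using System.Text;

public static class PdfStrings
{
    // Decodes the characters between '<' and '>' of a hexadecimal string.
    // White space is ignored; a missing final digit is treated as 0.
    public static byte[] DecodeHex(string hex)
    {
        StringBuilder digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (Uri.IsHexDigit(c))
            {
                digits.Append(c);
            }
            else if (!char.IsWhiteSpace(c))
            {
                throw new FormatException("Not a hex digit: " + c);
            }
        }
        if (digits.Length % 2 == 1)
        {
            digits.Append('0');
        }

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return result;
    }

    // Turns the bytes of a text string into a .NET string.
    public static string DecodeText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            // UTF-16 big-endian, skipping the two-byte marker
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        // Approximation of PDFDocEncoding: one byte, one character
        StringBuilder sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(DecodeHex("901FA")));  // 90-1F-A0
        Console.WriteLine(DecodeText(DecodeHex("FEFF0044006F0065")));  // Doe
        Console.WriteLine(DecodeText(DecodeHex("536D697468")));        // Smith
    }
}
```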
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.


Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
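PDF was devised by Adobe Systems, Inc. in 1993 as a proprietary format with a publicly available specification (it has since been standardized as ISO 32000). It is derived from PostScript, a stack-based language in the tradition of Forth rather than a direct descendant of it. The specification for PDF is publicly available from the Adobe web site.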

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table (the cross reference table), the file gets corrupted. Acrobat can often repair minor errors in the offset table, so you will usually still see something, but clearly this is not the right approach to filling form fields.
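To make the offset problem concrete, here is a minimal sketch. The two-object "file" and the names `OffsetDrift`, `ToyFile` and `OffsetOf` are made up for illustration and are not part of the parser; the point is only that every object after the edited one moves by the difference in length.

```csharp
using System;

public static class OffsetDrift
{
    // A two-object toy file (hypothetical, not a valid PDF);
    // 'value' is the content of the first object's /V entry.
    public static string ToyFile(string value)
    {
        return "1 0 obj\n<< /V (" + value + ") >>\nendobj\n"
             + "2 0 obj\n<< >>\nendobj\n";
    }

    // Returns the position of 'marker' in 'pdfText'. For pure ASCII text,
    // character positions and byte offsets are the same.
    public static int OffsetOf(string pdfText, string marker)
    {
        return pdfText.IndexOf(marker, StringComparison.Ordinal);
    }
}
```

With "Smith" the second object starts at offset 32; with "Washington" it starts at offset 37, so an offset table that still says 32 now points into the middle of the first object.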

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
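The truncate-or-pad workaround can be sketched in a few lines. This is only an illustration under two assumptions: the field content is plain ASCII (one byte per character), and it contains no parentheses or backslashes, which would need escaping inside a PDF literal string. The name `FixedWidthValue.Fit` is hypothetical.

```csharp
using System;

public static class FixedWidthValue
{
    // Returns 'value' cut or right-padded with spaces to exactly 'length'
    // characters, so that the replacement occupies the same number of bytes
    // as the text it replaces (assuming plain ASCII content).
    public static string Fit(string value, int length)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (length < 0) throw new ArgumentOutOfRangeException("length");

        return value.Length > length
            ? value.Substring(0, length)
            : value.PadRight(length, ' ');
    }
}
```

For a five-character slot, `Fit("Washington", 5)` yields "Washi" and `Fit("Doe", 5)` yields "Doe" followed by two spaces, which shows both why the offsets stay intact and why the result is rarely what you want.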

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, without pulling in a whole library and its overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file, which it finds through the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library, which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special kind of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
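The first of these steps can be sketched as follows. This is not the code from the download; it is a simplified illustration that assumes a classic cross reference table (no cross reference streams, which PDF 1.5 introduced) and uses the hypothetical names `XrefSketch`, `FindStartXref` and `ParseXrefSection`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class XrefSketch
{
    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword or the number is missing.
    public static long FindStartXref(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start) return -1;

        return long.Parse(text.Substring(start, pos - start), CultureInfo.InvariantCulture);
    }

    // Maps object numbers to byte offsets for all in-use ("n") entries
    // of one cross reference section, stopping at the "trailer" keyword.
    public static Dictionary<int, long> ParseXrefSection(string section)
    {
        Dictionary<int, long> offsets = new Dictionary<int, long>();
        int nextObject = 0;

        string[] lines = section.Split(new char[] { '\r', '\n' },
                                       StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string[] parts = line.Trim().Split(new char[] { ' ' },
                                               StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 1 && parts[0] == "trailer")
            {
                break;
            }
            if (parts.Length == 2)
            {
                // subsection header: first object number, entry count
                nextObject = int.Parse(parts[0], CultureInfo.InvariantCulture);
            }
            else if (parts.Length == 3)
            {
                // entry: 10-digit offset, 5-digit generation, "n" (in use) or "f" (free)
                if (parts[2] == "n")
                {
                    offsets[nextObject] = long.Parse(parts[0], CultureInfo.InvariantCulture);
                }
                nextObject++;
            }
        }
        return offsets;
    }
}
```

A real parser would read the section starting at the offset returned by `FindStartXref` and then follow the /Prev entry in the trailer to reach older tables; the sketch leaves that out.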
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format called incremental update, which makes it possible to append newer versions of objects already contained in a PDF file to the end of that file. We simply write out all field objects that have changed and add an updated cross reference table that links back to the old cross reference table.

This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate superseded object versions.
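The appended part of the file might look roughly like what the following sketch produces. It is an illustration rather than the parser's actual writer: it handles a single changed object with generation number 0, and the names `IncrementalUpdate.Build`, `size`, `rootReference` and `previousXref` are assumptions. A complete writer would also carry over other trailer entries such as /Info and /ID.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdate
{
    // Builds the bytes to append to an existing PDF of 'originalLength' bytes:
    // one new version of an object, a one-entry cross reference section,
    // and a trailer whose /Prev points at the previous cross reference table.
    public static byte[] Build(long originalLength, int objectNumber, string objectBody,
                               int size, string rootReference, long previousXref)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder sb = new StringBuilder();

        // start on a fresh line, then write the new version of the object
        sb.Append('\n');
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // cross reference section with a single 20-byte entry
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // trailer linking back to the old table, then the new startxref
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXref.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");

        // one byte per character, so the lengths used above are byte counts
        return Encoding.GetEncoding("ISO-8859-1").GetBytes(sb.ToString());
    }
}
```

Because nothing in the original bytes is touched, all existing offsets stay valid, and a reader that starts from the last `startxref` sees the new object version first and falls back to the old table through /Prev for everything else.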
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
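using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception)
{
    // deliberately ignored: e.g. the form has no text field named "Name",
    // in which case there is nothing to change
}

// write the updated file back out; the using block closes the stream
// even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}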
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// select the button with this name; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed at all times
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
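Several of these boolean properties correspond to single bits of the field's /Ff entry in the PDF. The sketch below shows that mapping using the bit positions from the PDF Reference; how the parser stores the flags internally is not shown in the article, so the `FieldFlags` enum and the `FieldFlagsSketch.Set` helper are assumptions made for illustration.

```csharp
using System;

// Bit positions follow the PDF Reference, where bit 1 is the lowest bit.
[Flags]
public enum FieldFlags
{
    None            = 0,
    Password        = 1 << 13, // bit 14, text fields
    NoToggleToOff   = 1 << 14, // bit 15, radio buttons
    Combo           = 1 << 17, // bit 18, choice fields
    MultiSelect     = 1 << 21, // bit 22, choice fields
    DoNotSpellCheck = 1 << 22  // bit 23, text and choice fields
}

public static class FieldFlagsSketch
{
    // Returns 'flags' with 'flag' switched on or off.
    public static FieldFlags Set(FieldFlags flags, FieldFlags flag, bool on)
    {
        return on ? (flags | flag) : (flags & ~flag);
    }

    // Tells whether 'flag' is set in 'flags'.
    public static bool IsSet(FieldFlags flags, FieldFlags flag)
    {
        return (flags & flag) == flag;
    }
}
```

Read this way, the /Ff 4194304 in the text field excerpt from the Background section is 1 << 22, i.e. only the DoNotSpellCheck bit is set, and setting a Password property to true would amount to adding 8192 to that value.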
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Earlier versions of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; according to the history below, support was added in version 1.2. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
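As an illustration of the string handling mentioned above, here is a small sketch of decoding a hexadecimal string. It follows the rule from the PDF Reference that whitespace is ignored and that a missing final digit is taken to be 0. It is not the parser's own code; the names `PdfStringSketch`, `DecodeHex` and `DecodeText` are hypothetical, and treating non-Unicode text as ISO-8859-1 is a simplification of PDFDocEncoding.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes a hexadecimal string token such as "<48656C6C6F>" into bytes.
    public static byte[] DecodeHex(string token)
    {
        StringBuilder digits = new StringBuilder();
        foreach (char c in token)
        {
            if (c == '<' || c == '>' || char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c)) throw new FormatException("Not a hex digit: " + c);
            digits.Append(c);
        }

        // an odd number of digits means the last digit is missing; assume 0
        if (digits.Length % 2 == 1) digits.Append('0');

        byte[] bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Turns string bytes into text: UTF-16 big-endian if the bytes start
    // with the FE FF byte order mark, otherwise ISO-8859-1 (an approximation).
    public static string DecodeText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

For example, `<901FA>` decodes to the three bytes 90, 1F and A0, and `<FEFF0044006F0065>` decodes to the text "Doe".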
Tools used
I have written a number of unit tests using the NUnit unit testing framework; they are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
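The fixed-length workaround could look roughly like the following sketch. FitToLength is a hypothetical helper, not part of the parser described here, and it assumes plain ASCII values without parentheses or backslashes, which would need escaping and would change the length.

```csharp
public static class FixedWidth
{
    // Truncates or pads a value with spaces so that it occupies exactly
    // 'length' characters, keeping all later byte offsets in the file intact.
    public static string FitToLength(string value, int length)
    {
        if (value.Length > length)
        {
            return value.Substring(0, length);
        }
        return value.PadRight(length, ' ');
    }
}
```

With a field that originally holds five characters, FitToLength("Washington", 5) yields a truncated value and FitToLength("Doe", 5) a padded one, so either can overwrite "Smith" in place.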

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own small PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. To summarize, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
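The first of these steps can be sketched as follows. This is not the parser's actual code: XrefSketch and ParseXrefSection are hypothetical names, and the sketch assumes a classic, uncompressed cross reference section that has already been read into a string.

```csharp
using System;
using System.Collections.Generic;

public static class XrefSketch
{
    // Maps object numbers to byte offsets for all in-use ("n") entries of one
    // classic cross reference section. Free ("f") entries are skipped.
    public static Dictionary<int, long> ParseXrefSection(string section)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = section.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        int i = 0;
        if (i < lines.Length && lines[i].Trim() == "xref")
        {
            i++;
        }
        while (i < lines.Length)
        {
            // Each subsection starts with "<first object number> <entry count>".
            string[] header = lines[i].Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            int first, count;
            if (header.Length != 2 || !int.TryParse(header[0], out first) || !int.TryParse(header[1], out count))
            {
                break; // reached "trailer" or something this sketch does not handle
            }
            i++;
            for (int n = 0; n < count && i < lines.Length; n++, i++)
            {
                // Entry layout: 10-digit offset, 5-digit generation, "n" or "f".
                string[] entry = lines[i].Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                if (entry.Length == 3 && entry[2] == "n")
                {
                    offsets[first + n] = long.Parse(entry[0]);
                }
            }
        }
        return offsets;
    }
}
```

A full implementation would also follow the /Prev entry of each trailer to older sections and let newer entries win over older ones.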
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
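As a rough illustration of such an appended update, the sketch below builds the text that would be written after the end of the existing file for a single changed object. Everything in it is an assumption made for illustration: the class and parameter names are hypothetical, the object body is assumed to be ASCII, and the trailer carries only /Size, /Root and /Prev, whereas a real update has to repeat the other entries of the previous trailer as well.

```csharp
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds an update section: one new version of an object, a one-entry
    // cross reference section, and a trailer whose /Prev points to the old one.
    public static string Build(long originalLength, int objNum, int gen, string body,
                               long prevXrefOffset, int size, string rootRef)
    {
        var sb = new StringBuilder();
        sb.Append('\n'); // separate the update from the old end-of-file marker
        long objOffset = originalLength + sb.Length;
        sb.Append($"{objNum} {gen} obj\n{body}\nendobj\n");
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append($"{objNum} 1\n");
        // Each entry is exactly 20 bytes: offset, generation, "n", two-byte line end.
        sb.Append($"{objOffset:D10} {gen:D5} n \n");
        sb.Append($"trailer\n<< /Size {size} /Root {rootRef} /Prev {prevXrefOffset} >>\n");
        sb.Append($"startxref\n{xrefOffset}\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because nothing before originalLength is touched, all offsets in the old cross reference table stay valid, which is why this approach avoids the corruption described earlier.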
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception)
{
    // no text field called "Name" could be updated; leave the document unchanged
}

// write the updated file back out; the using block closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// the "..." placeholders stand for fields obtained from the reader, e.g. via FieldsByName

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
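Boolean properties such as Password, Combo, MultiSelect and NoToggleToOff correspond to individual bits of the /Ff (field flags) integer seen in the excerpt earlier. The article does not show how the parser implements them, so the following is only a sketch of one way to do it; the type and method names are hypothetical, and the bit positions reflect my reading of the field flag tables in the PDF Reference and should be checked against it.

```csharp
using System;

[Flags]
public enum FieldFlags
{
    None = 0,
    Password = 1 << 13,        // text fields, bit position 14
    NoToggleToOff = 1 << 14,   // radio buttons, bit position 15
    Combo = 1 << 17,           // choice fields, bit position 18
    MultiSelect = 1 << 21,     // choice fields, bit position 22
    DoNotSpellCheck = 1 << 22  // text and choice fields, bit position 23
}

public static class FieldFlagsSketch
{
    // Tests whether a flag is set in a raw /Ff value.
    public static bool IsSet(int ff, FieldFlags flag)
    {
        return (ff & (int)flag) != 0;
    }

    // Returns the /Ff value with the flag switched on or off.
    public static int With(int ff, FieldFlags flag, bool on)
    {
        return on ? ff | (int)flag : ff & ~(int)flag;
    }
}
```

Under this reading, the value /Ff 4194304 from the excerpt has only the DoNotSpellCheck bit set, and setting Password on it would add 8192.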
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Early versions of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; according to the History section, support for them was added in version 1.2. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
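The string handling mentioned in the first point above can be sketched like this. The names are hypothetical and the code is not taken from the parser; it shows hexadecimal decoding, where a missing final digit is treated as 0, and writing a string in literal format, where parentheses and backslashes are escaped. Non-ASCII text and the other escape sequences are left out.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes a hexadecimal string such as "<48656C6C6F>". Anything that is
    // not a hex digit (delimiters, whitespace) is ignored, and an odd number
    // of digits is completed with a trailing 0.
    public static byte[] DecodeHex(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (Uri.IsHexDigit(c))
            {
                digits.Append(c);
            }
        }
        if (digits.Length % 2 == 1)
        {
            digits.Append('0');
        }
        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Writes text as a literal string, escaping the characters that would
    // otherwise end the string or start an escape sequence.
    public static string ToLiteral(string text)
    {
        var sb = new StringBuilder("(");
        foreach (char c in text)
        {
            if (c == '(' || c == ')' || c == '\\')
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.Append(')').ToString();
    }
}
```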
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License.

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Last Updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // the field is missing or is not a text field; leave the document unchanged
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
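As an illustration of the "possibly missing digits" case, the following is a minimal sketch of decoding the text between the angle brackets of a hexadecimal string. It is not taken from the parser's sources; the class name `PdfHexString` and its methods are hypothetical. It follows the rule from the PDF Reference that white space inside the string is ignored and that a missing final digit is assumed to be 0.

```csharp
using System;
using System.Collections.Generic;

public static class PdfHexString
{
    // Decodes the characters between '<' and '>' of a PDF hexadecimal string.
    public static byte[] Decode(string hex)
    {
        List<int> digits = new List<int>();
        foreach (char c in hex)
        {
            if (char.IsWhiteSpace(c)) continue;   // white space is ignored
            digits.Add(HexValue(c));
        }
        // An odd number of digits means the final digit is missing; it counts as 0.
        if (digits.Count % 2 == 1) digits.Add(0);

        byte[] bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = (byte)((digits[2 * i] << 4) | digits[2 * i + 1]);
        }
        return bytes;
    }

    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new FormatException("Not a hexadecimal digit: " + c);
    }
}
```

With this sketch, `Decode("901FA")` yields the bytes 0x90, 0x1F, 0xA0, the same result as `Decode("901FA0")`.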
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Since version 1.2 the parser can deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Posted 19 Aug 2004
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a format devised by Adobe Systems, Inc. in 1993 and, at the time of writing, proprietary to Adobe (it was later standardized as ISO 32000). It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
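To show where that offset table is found, here is a minimal sketch that reads the number following the last `startxref` keyword, which is the byte offset of the most recent cross reference table. It is an illustration only, not the parser's code; the class name `XrefLocator` and the method `FindStartXref` are hypothetical, and the whole file is assumed to fit in memory.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class XrefLocator
{
    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword or the number after it cannot be found.
    public static long FindStartXref(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are the same as byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);

        int pos = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) return -1;

        // Skip the white space between the keyword and the number.
        int i = pos + "startxref".Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;

        // Collect the decimal digits of the offset.
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;
        if (i == start) return -1;

        return long.Parse(text.Substring(start, i - start), CultureInfo.InvariantCulture);
    }
}
```

Every entry in the table found at that offset is itself a byte offset, which is why inserting even a single character earlier in the file invalidates the entries of all objects that follow.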

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
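The truncate-or-pad workaround can be sketched in a few lines. The helper below is hypothetical (`FixedWidth.Fit` is not part of the parser) and assumes that the value contains only single-byte characters and none of the characters that need escaping in a PDF literal string, i.e. parentheses and backslashes.

```csharp
using System;

public static class FixedWidth
{
    // Returns a string of exactly 'width' characters: longer values are
    // truncated, shorter ones are padded on the right with spaces.
    public static string Fit(string value, int width)
    {
        if (value.Length > width)
        {
            return value.Substring(0, width);
        }
        return value.PadRight(width, ' ');
    }
}
```

For a field whose original content is five characters long, `Fit("Washington", 5)` gives "Washi" and `Fit("Doe", 5)` gives "Doe" followed by two spaces, so the byte offsets of all later objects stay unchanged.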

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under the BSD License.

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is an originally proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
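
To show how readable these entries are, here is a minimal sketch that pulls the /T, /V and /DV values out of such an excerpt with a regular expression. FieldPeek and ReadEntry are hypothetical names used only for illustration and are not part of the parser presented below. The sketch assumes the value is a plain literal string without nested parentheses or escape sequences, which real PDF files may well contain.

```csharp
using System.Text.RegularExpressions;

public static class FieldPeek
{
    // Returns the literal string stored under the given key (e.g. "T", "V", "DV"),
    // or null if the key is absent. Assumes a plain "(...)" value with no nested
    // parentheses or backslash escapes.
    public static string ReadEntry(string objectText, string key)
    {
        Match m = Regex.Match(objectText, @"/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling FieldPeek.ReadEntry(objectText, "T") on the excerpt above would give the field name, and the keys "V" and "DV" would give the current and default values.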

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
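
To make the offset table more concrete: in a classic cross reference section each entry is a fixed-width line holding a 10-digit byte offset, a 5-digit generation number and the letter n (in use) or f (free). The following is a rough sketch of reading such a section. XrefSketch and ParseSection are hypothetical names, not the code of the parser described below, and cross reference streams as used by newer PDF versions are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class XrefSketch
{
    // Parses one classic cross reference section (the text between "xref" and
    // "trailer") and returns object number -> byte offset for in-use entries.
    public static Dictionary<int, long> ParseSection(string section)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = section.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        int next = 0; // object number of the next entry line
        foreach (string raw in lines)
        {
            string[] parts = raw.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 2)
            {
                // subsection header: "<first object number> <entry count>"
                next = int.Parse(parts[0]);
            }
            else if (parts.Length == 3)
            {
                // entry: "<10-digit offset> <5-digit generation> <n|f>"
                if (parts[2] == "n")
                    offsets[next] = long.Parse(parts[0]);
                next++;
            }
        }
        return offsets;
    }
}
```

Because every offset in the table counts bytes from the start of the file, inserting even one extra byte into an earlier object invalidates the entries of all objects that follow it.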

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
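
The fixed-length workaround can be sketched as follows. FixedWidth and Fit are hypothetical names for illustration. The sketch assumes the replacement text is plain ASCII, so that one character corresponds to one byte in the file.

```csharp
public static class FixedWidth
{
    // Truncates 'value' or right-pads it with spaces so that the result
    // has exactly 'length' characters.
    public static string Fit(string value, int length)
    {
        if (value.Length >= length)
            return value.Substring(0, length);
        return value.PadRight(length, ' ');
    }
}
```

With a field whose original content is five characters long, "Washington" would be cut down to "Washi" and "Doe" would be padded with two trailing spaces, so the byte offsets of all later objects stay unchanged.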

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
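
The recursive descent idea in the last step can be illustrated with a heavily simplified sketch. MiniPdfParser is a hypothetical class written for this illustration and is not the parser shipped with the article. It assumes literal strings without nested parentheses or escapes, ignores hexadecimal strings and streams, and keeps names, numbers and strings as raw text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class MiniPdfParser
{
    private readonly List<string> tokens = new List<string>();
    private int pos;

    public MiniPdfParser(string text)
    {
        // Tokens: "<<", ">>", "[", "]", "(simple string)", "/Name", or a run of other characters.
        foreach (Match m in Regex.Matches(text, @"<<|>>|\[|\]|\([^()]*\)|/[^\s<>\[\]()/]*|[^\s<>\[\]()/]+"))
            tokens.Add(m.Value);
    }

    // value := dictionary | array | reference | atom
    public object ParseValue()
    {
        string t = tokens[pos++];
        if (t == "<<") return ParseDictionary();
        if (t == "[") return ParseArray();
        // "1996 0 R" is an indirect reference: two integers followed by R.
        if (pos + 1 < tokens.Count && tokens[pos + 1] == "R" && IsInt(t) && IsInt(tokens[pos]))
        {
            string reference = t + " " + tokens[pos] + " R";
            pos += 2;
            return reference;
        }
        return t; // name, number or string, kept as raw text
    }

    private Dictionary<string, object> ParseDictionary()
    {
        var dict = new Dictionary<string, object>();
        while (tokens[pos] != ">>")
        {
            string key = tokens[pos++];   // a name such as "/T"
            dict[key] = ParseValue();     // recursion handles nested values
        }
        pos++; // consume ">>"
        return dict;
    }

    private List<object> ParseArray()
    {
        var items = new List<object>();
        while (tokens[pos] != "]")
            items.Add(ParseValue());
        pos++; // consume "]"
        return items;
    }

    private static bool IsInt(string s)
    {
        int ignored;
        return int.TryParse(s, out ignored);
    }
}
```

Each grammar rule gets its own method, and nested dictionaries such as the /AP entry of a field are handled simply because ParseDictionary calls ParseValue, which may call ParseDictionary again.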
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
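
The append mechanism described above might look roughly like the following sketch. IncrementalUpdateSketch and its parameters are hypothetical and do not reproduce the parser's actual WritePdf code. The sketch assumes ASCII-only object content, a generation number of 0, and a trailer that needs only /Size, /Root and /Prev, whereas real files may require further trailer entries such as /Info or /ID.

```csharp
using System.IO;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Appends a new version of one object, a one-entry cross reference section
    // and a trailer to 'original'. 'prevXref' is the byte offset of the previous
    // cross reference section; 'size' and 'root' are the /Size and /Root values
    // taken from the previous trailer.
    public static byte[] Append(byte[] original, int objNum, string dictionary,
                                long prevXref, int size, string root)
    {
        var output = new MemoryStream();
        output.Write(original, 0, original.Length);
        Write(output, "\n");

        long objOffset = output.Position;
        Write(output, objNum + " 0 obj\n" + dictionary + "\nendobj\n");

        long xrefOffset = output.Position;
        // Each entry is exactly 20 bytes: offset, generation, "n", space, line feed.
        Write(output, "xref\n" + objNum + " 1\n" + objOffset.ToString("D10") + " 00000 n \n");
        Write(output, "trailer\n<< /Size " + size + " /Root " + root +
                      " /Prev " + prevXref + " >>\nstartxref\n" + xrefOffset + "\n%%EOF\n");
        return output.ToArray();
    }

    private static void Write(Stream stream, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

The original bytes are never modified, so all existing offsets stay valid. A reader starts at the last startxref, finds the new entry for the changed object first, and follows /Prev to the older table for everything else.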
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // the form has no text field called "Name"; leave it unchanged
}

// write the updated file back out; "using" closes the stream even if writing fails
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
As of version 1.2 the parser can deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
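
The hexadecimal strings with possibly missing digits mentioned in the first point can be decoded along these lines. In this format whitespace between digits is ignored and a missing final digit is treated as 0. HexStringSketch is a hypothetical helper for illustration, not the parser's own code, and it expects only the digits between the enclosing angle brackets.

```csharp
using System;
using System.Text;

public static class HexStringSketch
{
    // Decodes the digits between "<" and ">" of a PDF hexadecimal string.
    // Whitespace is skipped and an odd number of digits is completed with "0".
    public static byte[] Decode(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (!char.IsWhiteSpace(c))
                digits.Append(c);
        }
        if (digits.Length % 2 == 1)
            digits.Append('0');

        byte[] bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        return bytes;
    }
}
```

For example, the digits 901FA decode to the three bytes 0x90, 0x1F and 0xA0.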
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
The sources include a number of unit tests that I have written using the NUnit unit testing framework.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under the BSD License.

System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that resembles Forth in its execution model. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
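
To make the roles of /T, /V and /DV concrete, here is a minimal sketch that pulls simple string entries out of a field dictionary like the excerpt above. The class and method names (FieldPeek, GetSimpleString) are hypothetical and not part of the parser presented in this article, and the regular expression only copes with strings that contain no nested parentheses or escape sequences.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the article's parser.
public static class FieldPeek
{
    // Returns the literal string stored under the given key (e.g. "T", "V", "DV"),
    // or null if the key is absent. Assumption: the value is a simple literal
    // string without nested parentheses or backslash escapes.
    public static string GetSimpleString(string dictionaryText, string key)
    {
        string pattern = "/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)";
        Match match = Regex.Match(dictionaryText, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling FieldPeek.GetSimpleString(text, "T") on the excerpt would yield the field name, and the keys "V" and "DV" would yield the current and default values. This kind of text matching is exactly the naive approach whose limits are described next.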

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross-reference table) with the byte offsets of the start of all objects.

If you shift the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file is corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
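
The effect is easy to reproduce on a toy file. The sketch below (the names OffsetDemo, OffsetOf and XrefEntry are made up for illustration) computes the byte offset of an object and formats it the way an in-use cross-reference entry stores it: a 10-digit offset, a 5-digit generation number, and the letter n. It assumes the file content is plain ASCII.

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustrative only; assumes ASCII content so that characters map to single bytes.
public static class OffsetDemo
{
    // Byte offset of the first occurrence of marker (e.g. "2 0 obj"), or -1 if absent.
    public static int OffsetOf(string fileText, string marker)
    {
        int charIndex = fileText.IndexOf(marker, StringComparison.Ordinal);
        if (charIndex < 0) return -1;
        return Encoding.ASCII.GetByteCount(fileText.Substring(0, charIndex));
    }

    // Formats an in-use cross-reference entry: 10-digit offset, 5-digit generation, 'n'.
    public static string XrefEntry(int byteOffset, int generation)
    {
        return byteOffset.ToString("D10", CultureInfo.InvariantCulture) + " "
             + generation.ToString("D5", CultureInfo.InvariantCulture) + " n";
    }
}
```

With a first object containing (Smith), the second object might start at byte 32; after changing the value to (Washington) it starts at byte 37, while the table in the file still says 0000000032.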

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
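
A minimal sketch of that fixed-width workaround could look as follows. The helper name FitToWidth is hypothetical, and the sketch assumes every character occupies exactly one byte in the file, which holds for plain ASCII values only.

```csharp
using System;

// Illustrative helper for the fixed-width workaround; not part of the article's parser.
public static class FixedWidth
{
    // Truncates or pads value with spaces so the result has exactly width characters,
    // which keeps the byte offsets of all later objects unchanged.
    public static string FitToWidth(string value, int width)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (width < 0) throw new ArgumentOutOfRangeException("width");
        if (value.Length > width) return value.Substring(0, width);
        return value.PadRight(width, ' ');
    }
}
```

The obvious cost is visible in the truncation branch: a value that does not fit is silently cut off.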

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, without pulling in a whole library and its overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library, which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special kind of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
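
The third step can be sketched in a few dozen lines. The class below is not the parser shipped with this article; it is a simplified illustration under stated assumptions: it handles dictionaries, arrays, names, numbers and literal strings without escapes or nested parentheses, and it does not understand indirect references such as 1996 0 R. The name MiniPdfParser is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Simplified recursive descent sketch; hypothetical, not the article's class.
public sealed class MiniPdfParser
{
    private readonly string _s;
    private int _pos;

    public MiniPdfParser(string text) { _s = text; }

    // object := dictionary | array | literal string | name | number
    public object ParseObject()
    {
        SkipWhitespace();
        if (_pos >= _s.Length) throw new FormatException("Unexpected end of input");
        char c = _s[_pos];
        if (c == '<' && _pos + 1 < _s.Length && _s[_pos + 1] == '<') return ParseDictionary();
        if (c == '[') return ParseArray();
        if (c == '(') return ParseLiteralString();
        if (c == '/') return ParseName();
        return ParseNumber();
    }

    private Dictionary<string, object> ParseDictionary()
    {
        _pos += 2; // consume "<<"
        var dict = new Dictionary<string, object>();
        while (true)
        {
            SkipWhitespace();
            if (_pos + 1 < _s.Length && _s[_pos] == '>' && _s[_pos + 1] == '>')
            {
                _pos += 2; // consume ">>"
                return dict;
            }
            if (_pos >= _s.Length || _s[_pos] != '/') throw new FormatException("Expected a name key");
            string key = ParseName();
            dict[key] = ParseObject(); // recursion: a value may itself be any object
        }
    }

    private List<object> ParseArray()
    {
        _pos++; // consume "["
        var items = new List<object>();
        while (true)
        {
            SkipWhitespace();
            if (_pos >= _s.Length) throw new FormatException("Unterminated array");
            if (_s[_pos] == ']') { _pos++; return items; }
            items.Add(ParseObject());
        }
    }

    // Assumption: no escape sequences and no nested parentheses.
    private string ParseLiteralString()
    {
        int end = _s.IndexOf(')', _pos);
        if (end < 0) throw new FormatException("Unterminated string");
        string value = _s.Substring(_pos + 1, end - _pos - 1);
        _pos = end + 1;
        return value;
    }

    // Names are returned with their leading slash, e.g. "/Tx".
    private string ParseName()
    {
        int start = _pos++; // step over "/"
        while (_pos < _s.Length && !IsDelimiterOrSpace(_s[_pos])) _pos++;
        return _s.Substring(start, _pos - start);
    }

    private double ParseNumber()
    {
        int start = _pos;
        while (_pos < _s.Length && !IsDelimiterOrSpace(_s[_pos])) _pos++;
        return double.Parse(_s.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    private void SkipWhitespace()
    {
        while (_pos < _s.Length && char.IsWhiteSpace(_s[_pos])) _pos++;
    }

    private static bool IsDelimiterOrSpace(char c)
    {
        return char.IsWhiteSpace(c) || "()<>[]{}/%".IndexOf(c) >= 0;
    }
}
```

Each grammar rule maps to one method, and the nesting of dictionaries and arrays falls out of ParseObject calling itself through ParseDictionary and ParseArray. A real form field parser additionally has to resolve indirect references through the cross reference table.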
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
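
As a rough sketch of what such an appended update section looks like, the following helper builds the text that would be written after the end of the original file for a single changed object. It is not the code used by the parser; the names (IncrementalUpdateSketch, BuildAppendix) and all parameters are hypothetical, and it assumes ASCII-only content, generation number 0, and an original file that ends with a line break.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical illustration of an incremental update; not the article's implementation.
public static class IncrementalUpdateSketch
{
    // originalLength: size of the existing file in bytes (the new object starts there).
    // size: value for /Size (highest object number in the file plus one).
    // rootRef: reference to the document catalog, e.g. "1 0 R".
    // previousXrefOffset: byte offset of the old cross reference table, stored in /Prev.
    public static string BuildAppendix(int originalLength, int objectNumber, string objectBody,
                                       int size, string rootRef, int previousXrefOffset)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();

        // New version of the changed object, appended after the old content.
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // The new cross reference section starts right after the appended object.
        int xrefOffset = originalLength + Encoding.ASCII.GetByteCount(sb.ToString());

        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n"); // one entry, starting at objectNumber
        sb.Append(originalLength.ToString("D10", inv)).Append(" 00000 n \n");

        // /Prev links the new table to the old one, so unchanged objects are still found.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootRef);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because nothing in the original bytes is touched, all old offsets stay valid; a reader that follows /Prev simply prefers the newest entry for each object number.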
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception)
{
    // no text field called "Name" in this form: leave the document unchanged
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The first releases of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; version 1.2 added support for them (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
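
The "missing digits" rule for hexadecimal strings mentioned above is worth a small illustration: in the PDF syntax, white space inside a hexadecimal string is ignored, and if the number of digits is odd the final digit is assumed to be 0. The helper below is a sketch of that rule only; the names HexStrings and DecodeHexString are hypothetical and not taken from the parser's source.

```csharp
using System;
using System.Text;

// Illustrative decoder for PDF hexadecimal strings such as <901FA>; not the article's code.
public static class HexStrings
{
    public static byte[] DecodeHexString(string token)
    {
        var digits = new StringBuilder();
        foreach (char c in token)
        {
            // Skip the angle-bracket delimiters and any white space between digits.
            if (c == '<' || c == '>' || char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c)) throw new FormatException("Invalid hex digit: " + c);
            digits.Append(c);
        }

        // A missing final digit is treated as 0.
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }
}
```

For example, <901FA> decodes to the three bytes 90, 1F and A0, and <48 65 6C6C6F> decodes to the ASCII bytes of "Hello".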
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
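
To make the XFDF idea concrete, here is a minimal sketch that builds an XFDF document for a set of field values. It is written for current C# with System.Xml.Linq rather than the .NET 1.1 this article targets, and the class and method names (XfdfSketch, Build) are made up for illustration. The element layout follows the XFDF format as I understand it (an xfdf root in the http://ns.adobe.com/xfdf/ namespace, an f element naming the PDF, and a fields element holding field/value pairs).

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class XfdfSketch
{
    // XML namespace used by XFDF documents.
    private static readonly XNamespace Xfdf = "http://ns.adobe.com/xfdf/";

    // Builds an XFDF document carrying the given field values for the PDF named by pdfHref.
    // Hypothetical helper; it is not part of the parser presented in this article.
    public static XDocument Build(string pdfHref, IDictionary<string, string> values)
    {
        var fields = new XElement(Xfdf + "fields");
        foreach (KeyValuePair<string, string> pair in values)
        {
            // One <field name="..."><value>...</value></field> element per form field.
            fields.Add(new XElement(Xfdf + "field",
                new XAttribute("name", pair.Key),
                new XElement(Xfdf + "value", pair.Value)));
        }

        return new XDocument(
            new XDeclaration("1.0", "UTF-8", null),
            new XElement(Xfdf + "xfdf",
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                new XElement(Xfdf + "f", new XAttribute("href", pdfHref)),
                fields));
    }
}
```

Calling XfdfSketch.Build("form.pdf", values).Save("form.xfdf") would write the data next to the form; the PDF itself stays untouched, which is exactly the separation described above.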

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
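
The effect is easy to reproduce on a toy example. The sketch below (hypothetical names, current C#, not part of the parser) treats a tiny ASCII-only "PDF" as a string, so that character positions equal byte offsets, and measures how far a later object moves when a string value in an earlier object is replaced. The naive search for the object header is good enough for the toy input only.

```csharp
using System;

public static class OffsetShiftSketch
{
    // Returns the position of the header "<number> 0 obj", or -1 if it is not found.
    // Naive search: it would also match inside a longer number such as "12 0 obj".
    public static int FindObjectOffset(string pdfText, int objectNumber)
    {
        return pdfText.IndexOf(objectNumber + " 0 obj", StringComparison.Ordinal);
    }

    // Replaces the literal string (oldValue) by (newValue) and reports by how many
    // bytes the given object has moved. Any non-zero result means that the offset
    // table, which still holds the old position, no longer matches the file.
    public static int ShiftAfterReplace(string pdfText, string oldValue, string newValue, int objectNumber)
    {
        int before = FindObjectOffset(pdfText, objectNumber);
        string edited = pdfText.Replace("(" + oldValue + ")", "(" + newValue + ")");
        int after = FindObjectOffset(edited, objectNumber);
        return after - before;
    }
}
```

With one "Smith" value in object 1, replacing it by "Jones" leaves object 2 where it was, while "Washington" pushes it five bytes further into the file.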

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
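
A minimal sketch of this pad-or-truncate workaround, assuming single-byte characters and values that contain no parentheses or backslashes (those would need escaping and change the length). FitToLength is a hypothetical helper name, not something the parser provides.

```csharp
using System;

public static class FixedWidthSketch
{
    // Returns value cut or padded with spaces so that it is exactly length characters long.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        return value.Length > length
            ? value.Substring(0, length)      // too long: truncate
            : value.PadRight(length, ' ');    // too short: pad on the right
    }
}
```

Because the replacement always has the same length as the original, no object after it moves and the offset table stays valid.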

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
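
The first of these steps can be sketched as follows. This is not the code from the download but a simplified illustration in current C#, with hypothetical names. It assumes the text of one classic cross reference section has already been located, from the xref keyword up to the trailer keyword, and it ignores generation numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class XrefSketch
{
    // Parses one cross reference section and returns object number -> byte offset
    // for all entries that are marked as in use ('n').
    public static Dictionary<int, long> ParseXrefSection(string section)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = section.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        int nextObjectNumber = 0;

        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();
            if (line == "xref") continue;   // section keyword
            if (line == "trailer") break;   // the trailer dictionary follows, stop here

            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 2)
            {
                // Subsection header: number of the first object and entry count.
                nextObjectNumber = int.Parse(parts[0], CultureInfo.InvariantCulture);
            }
            else if (parts.Length == 3)
            {
                // Entry: 10-digit byte offset, 5-digit generation, 'n' (in use) or 'f' (free).
                if (parts[2] == "n")
                {
                    offsets[nextObjectNumber] = long.Parse(parts[0], CultureInfo.InvariantCulture);
                }
                nextObjectNumber++;
            }
        }

        return offsets;
    }
}
```

With the offsets in hand, the remaining two steps only need to seek to the AcroForm dictionary and to each field object instead of scanning the whole file.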
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
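
The append mechanism described above can be sketched roughly like this. The code is an illustration under stated assumptions, not the parser's WritePdf implementation: the names are hypothetical, the object body is assumed to be plain ASCII, the updated object keeps generation 0, and the caller already knows the offset of the previous cross reference section, the /Size value and the /Root reference from the original trailer.

```csharp
using System.Globalization;
using System.IO;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Appends a new version of one object, followed by a cross reference section and a
    // trailer whose /Prev entry links back to the previous cross reference section.
    public static byte[] AppendObject(byte[] original, int objectNumber, string objectBody,
                                      long previousXrefOffset, int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        using (var output = new MemoryStream())
        {
            // The original bytes are kept untouched; everything new goes after them.
            output.Write(original, 0, original.Length);
            Write(output, "\n");

            long objectOffset = output.Position;
            Write(output, string.Format(inv, "{0} 0 obj\n{1}\nendobj\n", objectNumber, objectBody));

            long xrefOffset = output.Position;
            // One subsection with a single 20-byte entry for the updated object.
            Write(output, string.Format(inv, "xref\n{0} 1\n{1:D10} 00000 n \n", objectNumber, objectOffset));
            Write(output, string.Format(inv, "trailer\n<< /Size {0} /Root {1} /Prev {2} >>\n",
                size, rootReference, previousXrefOffset));
            Write(output, string.Format(inv, "startxref\n{0}\n%%EOF\n", xrefOffset));

            return output.ToArray();
        }
    }

    private static void Write(Stream stream, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

A reader that starts at the end of the file finds the new cross reference section first, uses the new object version, and follows /Prev for everything that has not changed. The old version of the object stays in the file, which is why the file keeps growing.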
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// requires: using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // ignore the failure if the form has no text field named "Name"
}

// write the updated file back out; the using block closes the stream
// even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
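
Boolean properties such as Password, NoToggleToOff, Combo and MultiSelect correspond to bits in the /Ff entry of the field dictionary, as defined in the PDF Reference. The sketch below shows that mapping for a few flags. Whether the parser's properties are implemented exactly this way is an assumption on my part, and the type names are made up for illustration.

```csharp
using System;

// A few field flag bits from the PDF Reference; bit position n has the value 1 << (n - 1).
[Flags]
public enum FieldFlags
{
    None = 0,
    Multiline = 1 << 12,        // bit 13, text fields
    Password = 1 << 13,         // bit 14, text fields
    NoToggleToOff = 1 << 14,    // bit 15, radio buttons
    Radio = 1 << 15,            // bit 16, button fields
    Pushbutton = 1 << 16,       // bit 17, button fields
    Combo = 1 << 17,            // bit 18, choice fields
    MultiSelect = 1 << 21,      // bit 22, choice fields
    DoNotSpellCheck = 1 << 22   // bit 23, text and choice fields
}

public static class FieldFlagsSketch
{
    // True if the given flag is set in the raw /Ff value.
    public static bool IsSet(int ff, FieldFlags flag) => ((FieldFlags)ff & flag) == flag;

    // Returns the raw /Ff value with the given flag switched on or off.
    public static int With(int ff, FieldFlags flag, bool on) =>
        on ? ff | (int)flag : ff & ~(int)flag;
}
```

The value /Ff 4194304 from the text field excerpt in the Background section is 1 << 22, so only the DoNotSpellCheck bit is set there; marking that field as a password field would change the value to 4202496.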
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Up to version 1.1 the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; version 1.2 added support for them (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
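
To illustrate the string handling mentioned in the first point, here is a small sketch of decoding a hexadecimal string and recognizing a UTF-16 text string by its byte order mark. It is a simplified illustration with hypothetical names, not the parser's own code: strings without a byte order mark are decoded as plain ASCII here, whereas a complete implementation would apply PDFDocEncoding.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the content between '<' and '>' of a hexadecimal string into bytes.
    public static byte[] DecodeHex(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (!char.IsWhiteSpace(c)) digits.Append(c);   // white space is ignored
        }

        // A missing final digit is taken to be 0.
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Turns string bytes into text: FE FF at the start marks big-endian UTF-16.
    public static string ToText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        return Encoding.ASCII.GetString(bytes);   // simplification, see the note above
    }
}
```

For example, the hexadecimal string <901FA> has an odd number of digits and decodes to the three bytes 90, 1F and A0.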
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you shift the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file becomes corrupt. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
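The effect is easy to reproduce on a toy example. The following sketch (the helper name is hypothetical, and it assumes the file content is plain ASCII so that a character index equals a byte offset) locates the start of an object the way an offset table entry would point to it.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class OffsetShiftSketch
{
    // Returns the offset at which "<objectNumber> 0 obj" starts a line,
    // or -1 if no such object header exists.
    // Assumption: ASCII content, so character index == byte offset.
    public static int ObjectOffset(string fileText, int objectNumber)
    {
        string marker = objectNumber + " 0 obj";
        if (fileText.StartsWith(marker, StringComparison.Ordinal))
        {
            return 0;
        }
        int index = fileText.IndexOf("\n" + marker, StringComparison.Ordinal);
        return index < 0 ? -1 : index + 1;
    }
}
```

For the text "1 0 obj\n<< /V (Smith) >>\nendobj\n2 0 obj\n<< /V (x) >>\nendobj\n", object 2 starts at offset 32. Replacing "Smith" by "Jones" leaves it at 32, whereas replacing it by "Washington" moves it to 37, so an offset table that still says 32 now points into the middle of object 1.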

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
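A minimal sketch of this pad-or-truncate workaround might look as follows. The names are hypothetical, the code assumes ASCII content, and it assumes the new value contains no parentheses or backslashes that would need escaping.

```csharp
using System;

// Hypothetical helper illustrating the fixed-length workaround.
public static class FixedLengthReplaceSketch
{
    // Truncates or right-pads value with spaces so it is exactly length characters long.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (length < 0) throw new ArgumentOutOfRangeException("length");
        return value.Length > length ? value.Substring(0, length) : value.PadRight(length);
    }

    // Replaces every "/V (oldValue)" entry by a value of identical length,
    // so that no object offset in the file changes. "/DV (...)" entries are
    // left alone because "/DV (" does not contain the text "/V (".
    public static string ReplaceValueSameLength(string fileText, string oldValue, string newValue)
    {
        string fitted = FitToLength(newValue, oldValue.Length);
        return fileText.Replace("/V (" + oldValue + ")", "/V (" + fitted + ")");
    }
}
```

With an old value of "Smith", the new value "Washington" is cut down to "Washi" and "Doe" is padded to "Doe  ", which shows quite plainly why the workaround is of limited use. The cached appearance stream described above is not touched by this sketch either.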

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables, which contain the byte offsets of all objects, and then finds the AcroForm dictionary, which contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in recursive descent fashion. In summary, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
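The append mechanism described above can be sketched roughly like this. This is not the code from the download but an illustration under stated assumptions: all names and parameters are hypothetical, the content is plain ASCII, the object keeps generation number 0, and the caller supplies the /Size and /Root values taken from the original trailer.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical sketch of an incremental update, not the article's implementation.
public static class IncrementalUpdateSketch
{
    // Builds the text to append to a PDF of originalLength bytes in order to
    // replace one object. previousXrefOffset is the offset of the existing
    // cross reference table, which the new trailer links to via /Prev.
    public static string BuildUpdate(long originalLength, long previousXrefOffset,
        int objectNumber, string objectBody, int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder sb = new StringBuilder();
        sb.Append('\n'); // make sure the new object starts on its own line

        // New version of the object, appended after the original content.
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // Cross reference section with a single subsection of one entry.
        // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        // the keyword n and a two-character end of line.
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // Trailer that points back to the previous cross reference table.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

A reader that opens the resulting file starts at the last startxref entry, finds the new version of the object first and follows /Prev for everything else, so the original bytes never have to be touched.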
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored, e.g. the field is missing or is not a text field
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radioButton = ...;
// set the selected button; "Off" means no button is selected
radioButton.SelectedItem = "MasterCard";
// one button must be pressed at all times
radioButton.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField textField = ...;
// set the text
textField.Text = "Hello, World.";
// mark it as a password field
textField.Password = true;

// a combo or list box
PdfCHField choiceField = ...;
// render as combo box
choiceField.Combo = true;
// more than one item is selectable
choiceField.MultiSelect = true;
// select items 1 and 3
choiceField.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot deal with encrypted files. Linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat, could not be parsed originally but are supported as of version 1.2 (see History).
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
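To illustrate the string handling mentioned in the first point above, here is a small sketch of two of the cases: decoding a hexadecimal string whose last digit is missing, and escaping a string for output in literal format. It is not taken from the parser's sources, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical sketch, not the parser's actual code.
public static class PdfStringSketch
{
    // Decodes the text between "<" and ">" of a hexadecimal string.
    // Whitespace is ignored; if the number of digits is odd, the missing
    // final digit is taken to be 0.
    public static byte[] DecodeHexString(string hex)
    {
        StringBuilder digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c)) throw new FormatException("Not a hex digit: " + c);
            digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return result;
    }

    // Escapes backslashes and parentheses so the text can be written
    // between "(" and ")" as a literal string.
    public static string EscapeLiteral(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (c == '\\' || c == '(' || c == ')') sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

DecodeHexString("901FA") returns the three bytes 0x90, 0x1F and 0xA0. Balanced parentheses would not strictly need escaping in a literal string, but escaping all of them keeps the sketch simple and is still valid.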
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar.

Last updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
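
To make the first step more concrete, here is a rough sketch of reading one classic cross reference section. It is not taken from the parser's sources; the class and method names are hypothetical, and it assumes the section has already been located (a real reader finds it through the `startxref` entry at the end of the file and follows `/Prev` links to older sections):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class XrefSketch
{
    // Parses "xref ... trailer" text and returns object number -> byte offset
    // for all entries marked as in use ("n"). Free entries ("f") are skipped.
    public static Dictionary<int, long> ParseXrefSection(string text)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0 || lines[0].Trim() != "xref")
            throw new FormatException("expected 'xref' keyword");

        int i = 1;
        while (i < lines.Length && lines[i].Trim() != "trailer")
        {
            // Subsection header: "<first object number> <number of entries>"
            string[] header = lines[i].Trim().Split(' ');
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);
            i++;

            for (int k = 0; k < count; k++, i++)
            {
                // Entry: "<10-digit offset> <5-digit generation> <n|f>"
                string[] parts = lines[i].Trim().Split(' ');
                if (parts[2] == "n")
                    offsets[first + k] = long.Parse(parts[0], CultureInfo.InvariantCulture);
            }
        }
        return offsets;
    }

    public static void Main()
    {
        // Made-up sample section with one free and two in-use entries.
        string sample = "xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \ntrailer\n";
        foreach (KeyValuePair<int, long> entry in ParseXrefSection(sample))
            Console.WriteLine("object " + entry.Key + " starts at byte " + entry.Value);
    }
}
```

With the offsets known, steps 2 and 3 amount to seeking to the right object and parsing a dictionary there.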
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
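
The shape of such an appended update can be sketched as follows. This is an illustration under stated assumptions rather than the parser's actual `WritePdf` code: the method name and parameters are hypothetical, the appended text is assumed to be pure ASCII so that character counts equal byte counts, and the object body is abbreviated (a real update writes the complete field dictionary).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds the text appended after the last byte of the original file:
    // a new version of one object, a one-entry cross reference section,
    // and a trailer whose /Prev entry points at the previous section.
    public static string BuildUpdate(long originalLength, long previousXrefOffset,
                                     int objectNumber, string objectBody,
                                     int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();

        sb.Append('\n');
        long objectOffset = originalLength + sb.Length;   // where the new object starts
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        long xrefOffset = originalLength + sb.Length;     // where the new table starts
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        // Each entry is exactly 20 bytes: offset, generation, flag, two-byte line end.
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }

    public static void Main()
    {
        // All numbers below are invented for the demonstration.
        Console.Write(BuildUpdate(1000, 500, 2774, "<< /V (Doe) >>", 2775, "1 0 R"));
    }
}
```

Because nothing before the appended section is touched, all existing offsets stay valid, and a reader that follows `/Prev` finds the newest version of each object first.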
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
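using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception e)
{
    // e.g. the field is missing or is not a text field;
    // report the problem instead of swallowing it silently
    Console.Error.WriteLine("Could not set field \"Name\": " + e.Message);
}

// write the updated file back out; the using block closes
// the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}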
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat (see History below). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package, which includes a number of forms that exhibit most of the features of PDF forms.
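
As an illustration of the string handling mentioned in the first point, the sketch below decodes a hexadecimal string whose last digit is missing and recognizes UTF-16 text by its byte order mark. It is not the parser's own code; the names are hypothetical, and the non-Unicode branch approximates PDFDocEncoding with Latin-1, which is an assumption that does not hold for every code point.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the digits between '<' and '>' of a hexadecimal string.
    // Whitespace is ignored; a missing final digit is treated as 0.
    public static byte[] DecodeHexString(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
            if (!char.IsWhiteSpace(c)) digits.Append(c);
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        return bytes;
    }

    // Interprets string bytes as text: UTF-16BE if they start with the
    // byte order mark FE FF, otherwise (approximately) Latin-1.
    public static string DecodeTextString(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        // Odd number of digits: decoded as the bytes 90 1F A0.
        Console.WriteLine(BitConverter.ToString(DecodeHexString("901FA")));
        // UTF-16BE text with byte order mark.
        Console.WriteLine(DecodeTextString(DecodeHexString("FEFF 0044 006F 0065")));
    }
}
```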
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests, included with the sources, using the NUnit unit testing framework.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    - Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    - Now supports linearized files.
    - Now supports inherited fields.
    - Uses NAnt.
    - Uses log4net.
October 01, 2004: Version 1.3.
    - Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    - Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
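
For comparison, here is a rough sketch of what producing such an XFDF file could look like in C#. It is not part of the parser presented in this article; the class and method names are hypothetical, `form.pdf` is a placeholder, and only the basic `fields`/`field`/`value` structure of XFDF is emitted.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class XfdfSketch
{
    static readonly XNamespace Xfdf = "http://ns.adobe.com/xfdf/";

    // Builds a minimal XFDF document that assigns values to named fields
    // of the PDF form referenced by pdfHref.
    public static XDocument BuildXfdf(string pdfHref, IDictionary<string, string> values)
    {
        var fields = new XElement(Xfdf + "fields");
        foreach (KeyValuePair<string, string> pair in values)
        {
            fields.Add(new XElement(Xfdf + "field",
                new XAttribute("name", pair.Key),
                new XElement(Xfdf + "value", pair.Value)));
        }

        return new XDocument(
            new XElement(Xfdf + "xfdf",
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                new XElement(Xfdf + "f", new XAttribute("href", pdfHref)),
                fields));
    }

    public static void Main()
    {
        var values = new Dictionary<string, string> { { "Name", "Doe" } };
        Console.WriteLine(BuildXfdf("form.pdf", values));
    }
}
```

The data stays in a separate file here, so the viewer has to merge it with the form when the document is opened.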

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that in turn resembles the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) holds, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
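
The naive text-editor view of the excerpt above can be mimicked with a regular expression. This is only a sketch of that first impression, with a hypothetical helper name; it handles simple literal strings without nested or escaped parentheses and is not how the parser described later works.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveFieldReader
{
    // Returns the literal string stored under /key in a dictionary, or null if absent.
    // Deliberately naive: no nested parentheses, no escape sequences.
    public static string ReadStringEntry(string dictionaryText, string key)
    {
        Match m = Regex.Match(dictionaryText,
            "/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string excerpt = "/DA (/Helv 10 Tf 0 g)\n/T (Name)\n/FT /Tx\n/DV (Smith)\n/V (Smith)\n";
        Console.WriteLine("name:    " + ReadStringEntry(excerpt, "T"));
        Console.WriteLine("value:   " + ReadStringEntry(excerpt, "V"));
        Console.WriteLine("default: " + ReadStringEntry(excerpt, "DV"));
    }
}
```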

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
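The two string forms mentioned above can be illustrated with a short sketch. This is not the parser's own code: `PdfStringSketch`, `DecodeHex` and `DecodeLiteral` are hypothetical names, and the code uses generics and `var`, which the original .NET 1.1 sources could not. Both methods take the text between the delimiters and return raw bytes. Normalizing unescaped line ends and decoding UTF-16 text strings are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the body of a hexadecimal string (the text between '<' and '>').
    // White space is skipped; a missing final digit is assumed to be 0.
    public static byte[] DecodeHex(string body)
    {
        var digits = new StringBuilder();
        foreach (char c in body)
        {
            if (!char.IsWhiteSpace(c)) digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Decodes the body of a literal string (the text between the outer parentheses).
    public static byte[] DecodeLiteral(string body)
    {
        var result = new List<byte>();
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\') { result.Add((byte)c); continue; }
            if (++i >= body.Length) break;          // lone trailing backslash is dropped
            char e = body[i];
            switch (e)
            {
                case 'n': result.Add((byte)'\n'); break;
                case 'r': result.Add((byte)'\r'); break;
                case 't': result.Add((byte)'\t'); break;
                case 'b': result.Add((byte)'\b'); break;
                case 'f': result.Add((byte)'\f'); break;
                case '\r':                          // backslash + line end: line continuation
                    if (i + 1 < body.Length && body[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Octal escape with one to three digits.
                        int value = 0, count = 0;
                        while (count < 3 && i < body.Length && body[i] >= '0' && body[i] <= '7')
                        {
                            value = value * 8 + (body[i] - '0');
                            i++;
                            count++;
                        }
                        i--;                        // the for loop advances past the last digit
                        result.Add((byte)(value & 0xFF));
                    }
                    else
                    {
                        result.Add((byte)e);        // \( \) \\ and unknown escapes: drop the backslash
                    }
                    break;
            }
        }
        return result.ToArray();
    }
}
```

With these definitions, the hexadecimal body `901FA` decodes to the same three bytes as `901FA0`, and the literal body `Smith \(Jr.\)` decodes to the bytes of `Smith (Jr.)`.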
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework. The tests are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Last updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Last Updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
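
To make the XFDF route a little more concrete, here is a minimal sketch of how such a file might be generated with LINQ to XML. It is not part of the parser presented in this article. The helper name BuildXfdf and the sample data are made up for illustration, and only the fields/field/value elements of XFDF are covered.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper: builds an XFDF document from field name/value pairs.
static XDocument BuildXfdf(IEnumerable<KeyValuePair<string, string>> fields)
{
    XNamespace ns = "http://ns.adobe.com/xfdf/";
    var fieldsElement = new XElement(ns + "fields");
    foreach (var pair in fields)
    {
        // One <field name="..."><value>...</value></field> element per form field.
        fieldsElement.Add(
            new XElement(ns + "field",
                new XAttribute("name", pair.Key),
                new XElement(ns + "value", pair.Value)));
    }
    return new XDocument(
        new XElement(ns + "xfdf",
            new XAttribute(XNamespace.Xml + "space", "preserve"),
            fieldsElement));
}

// Sample data; in practice the values might come from a database query.
var data = new Dictionary<string, string> { ["Name"] = "Smith" };
XDocument xfdf = BuildXfdf(data);
Console.WriteLine(xfdf);
```

The resulting file only carries the data; the PDF with the visual representation stays a separate file, which is exactly the limitation the next paragraph addresses.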

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that is often compared to Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
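
The "text editor" view of a field can be imitated in a few lines of code. The following is only a rough sketch of that naive approach, not the parser described later: the helper TryReadLiteral is hypothetical, and the regular expression assumes that the string contains no nested or escaped parentheses.

```csharp
using System;
using System.Text.RegularExpressions;

// Part of the field dictionary from the excerpt above.
string fieldObject = "/T (Name)\n/FT /Tx\n/Ff 4194304\n/DV (Smith)\n/V (Smith)";

// Hypothetical helper: reads the literal string stored under a key such as "T" or "V".
// Naive on purpose: it does not handle nested or escaped parentheses.
static bool TryReadLiteral(string dictionaryText, string key, out string literal)
{
    Match m = Regex.Match(dictionaryText, "/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)");
    literal = m.Success ? m.Groups[1].Value : "";
    return m.Success;
}

if (TryReadLiteral(fieldObject, "T", out string fieldName) &&
    TryReadLiteral(fieldObject, "V", out string fieldValue))
{
    Console.WriteLine(fieldName + " = " + fieldValue); // Name = Smith
}
```

Reading values this way is harmless. The trouble starts when you write them back, as the following paragraphs explain.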

If you replace the string "Smith" with "Jones", you will find that the displayed field content does not change until you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with the longer string "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross reference table) with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
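
A tiny experiment shows the effect. The two objects below are a made-up stand-in for a real file, and OffsetOf is a hypothetical helper. Because the text is pure ASCII, character indexes equal byte offsets.

```csharp
using System;

// Two made-up objects standing in for a real PDF file body.
string original =
    "1 0 obj\n<< /T (Name) /V (Smith) >>\nendobj\n" +
    "2 0 obj\n<< /T (City) /V (Berlin) >>\nendobj\n";

// For ASCII-only text the character index is the same as the byte offset.
static int OffsetOf(string pdfText, string marker) =>
    pdfText.IndexOf(marker, StringComparison.Ordinal);

string edited = original.Replace("(Smith)", "(Washington)");

Console.WriteLine(OffsetOf(original, "2 0 obj")); // 42
Console.WriteLine(OffsetOf(edited, "2 0 obj"));   // 47
```

The second object has moved by five bytes, but an offset table written for the original text would still point to byte 42.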

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
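
As a sketch, the truncate-or-pad rule might look like this. FitToLength is a hypothetical helper name, and the sketch assumes the value contains no characters that would need escaping in a PDF string (parentheses or backslashes), since escaping would change the length again.

```csharp
using System;

// Hypothetical helper: forces a value to exactly the given number of characters.
// Values that are too long are truncated, values that are too short are padded with spaces.
static string FitToLength(string value, int length) =>
    value.Length >= length ? value.Substring(0, length) : value.PadRight(length);

Console.WriteLine("[" + FitToLength("Washington", 5) + "]"); // [Washi]
Console.WriteLine("[" + FitToLength("Doe", 5) + "]");        // [Doe  ]
```

The first line of output already shows why this is unsatisfying: the data is silently cut off.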

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which is a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
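
To give an idea of the first step, here is a simplified sketch and not the code from the download. It assumes a classic cross reference table (no cross reference streams, no chain of /Prev tables), the helper names ReadStartXref and ReadXrefSection are hypothetical, and Encoding.Latin1 requires .NET 5 or later. The sample file is a made-up miniature.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Finds the byte offset of the last cross reference table via the "startxref" keyword.
static long ReadStartXref(byte[] pdf)
{
    // Latin-1 maps every byte to one char, so string indexes equal byte offsets.
    int tailLength = Math.Min(1024, pdf.Length);
    string tail = Encoding.Latin1.GetString(pdf, pdf.Length - tailLength, tailLength);
    int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
    if (keyword < 0) throw new FormatException("startxref not found");
    string rest = tail.Substring(keyword + "startxref".Length).TrimStart();
    int end = 0;
    while (end < rest.Length && char.IsDigit(rest[end])) end++;
    return long.Parse(rest.Substring(0, end));
}

// Reads one classic cross reference section into a map of object number -> byte offset.
static Dictionary<int, long> ReadXrefSection(byte[] pdf, long xrefOffset)
{
    string text = Encoding.Latin1.GetString(pdf, (int)xrefOffset, pdf.Length - (int)xrefOffset);
    string[] lines = text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
    if (lines.Length == 0 || lines[0].Trim() != "xref") throw new FormatException("xref keyword expected");
    var offsets = new Dictionary<int, long>();
    int i = 1;
    while (i < lines.Length && !lines[i].StartsWith("trailer", StringComparison.Ordinal))
    {
        // Subsection header: first object number and number of entries.
        string[] header = lines[i++].Split(' ', StringSplitOptions.RemoveEmptyEntries);
        int first = int.Parse(header[0]);
        int count = int.Parse(header[1]);
        for (int n = 0; n < count; n++, i++)
        {
            // Entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
            string[] entry = lines[i].Split(' ', StringSplitOptions.RemoveEmptyEntries);
            if (entry[2] == "n") offsets[first + n] = long.Parse(entry[0]);
        }
    }
    return offsets;
}

// A made-up miniature file with a single object.
string body = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n";
int catalogOffset = body.IndexOf("1 0 obj", StringComparison.Ordinal);
string file = body +
    $"xref\n0 2\n0000000000 65535 f \n{catalogOffset:D10} 00000 n \n" +
    "trailer\n<< /Size 2 /Root 1 0 R >>\n" +
    $"startxref\n{body.Length}\n%%EOF\n";

byte[] pdfBytes = Encoding.Latin1.GetBytes(file);
long startXref = ReadStartXref(pdfBytes);
Dictionary<int, long> objectOffsets = ReadXrefSection(pdfBytes, startXref);
Console.WriteLine($"xref at {startXref}, object 1 at {objectOffsets[1]}");
```
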
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
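
The append mechanism can be sketched as follows. This is an illustration under several assumptions rather than the WritePdf implementation: AppendUpdate is a hypothetical helper, the updated object is assumed to keep generation number 0, and the object number, /Size, /Root reference and previous offset in the example are invented values that a real caller would take from the original file's trailer. Encoding.Latin1 requires .NET 5 or later.

```csharp
using System;
using System.IO;
using System.Text;

// Appends a new version of one object, followed by a cross reference section and a
// trailer whose /Prev entry links back to the previous cross reference table.
static byte[] AppendUpdate(byte[] original, int objectNumber, string objectBody,
                           long previousXrefOffset, int size, string rootReference)
{
    using var output = new MemoryStream();
    output.Write(original, 0, original.Length);
    void Write(string s) { byte[] b = Encoding.Latin1.GetBytes(s); output.Write(b, 0, b.Length); }

    // Make sure the new object starts on a fresh line.
    if (original.Length > 0 && original[^1] != (byte)'\n') Write("\n");

    long objectOffset = output.Position;
    Write($"{objectNumber} 0 obj\n{objectBody}\nendobj\n");

    long xrefOffset = output.Position;
    // One subsection with a single 20-byte entry for the updated object.
    Write($"xref\n{objectNumber} 1\n{objectOffset:D10} 00000 n \n");
    Write($"trailer\n<< /Size {size} /Root {rootReference} /Prev {previousXrefOffset} >>\n");
    Write($"startxref\n{xrefOffset}\n%%EOF\n");
    return output.ToArray();
}

// Stand-in for the bytes of an existing file; all numbers below are hypothetical.
byte[] originalFile = Encoding.Latin1.GetBytes("%PDF-1.4\n% ...rest of the original file...\n");
byte[] updatedFile = AppendUpdate(originalFile, 2774, "<< /T (Name) /V (Washington) >>",
                                  previousXrefOffset: 116, size: 2775, rootReference: "1 0 R");

// Show only what was appended.
Console.WriteLine(Encoding.Latin1.GetString(
    updatedFile, originalFile.Length, updatedFile.Length - originalFile.Length));
```

Nothing in the original bytes is touched, so all existing offsets stay valid, and the new value may be longer or shorter than the old one.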
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
using System.IO; // FileStream, FileMode

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // any failure (e.g. there is no text field called "Name") is silently ignored here
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
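
Boolean properties such as Password, NoToggleToOff, Combo and MultiSelect correspond to single bits of the /Ff (field flags) entry seen in the excerpt in the Background section. The following sketch shows the bit arithmetic. The bit positions are the ones given in the field flag tables of the PDF Reference; the helpers HasFlag and SetFlag are hypothetical, and how the classes in the download store the flags internally is an assumption on my part.

```csharp
using System;

// Bit positions are 1-based in the PDF Reference; bit n has the value 1 << (n - 1).
static bool HasFlag(int ff, int bitPosition) => (ff & (1 << (bitPosition - 1))) != 0;

static int SetFlag(int ff, int bitPosition, bool on) =>
    on ? ff | (1 << (bitPosition - 1)) : ff & ~(1 << (bitPosition - 1));

const int Password = 14;        // text fields
const int NoToggleToOff = 15;   // radio buttons
const int Combo = 18;           // choice fields
const int MultiSelect = 22;     // choice fields
const int DoNotSpellCheck = 23; // text and choice fields

// /Ff value of the text field in the Background excerpt.
int textFlags = 4194304;
Console.WriteLine(HasFlag(textFlags, DoNotSpellCheck)); // True
Console.WriteLine(HasFlag(textFlags, Password));        // False

// Flags for a multi-select combo box and for a radio button group.
int choiceFlags = SetFlag(SetFlag(0, Combo, true), MultiSelect, true);
int radioFlags = SetFlag(0, NoToggleToOff, true);
Console.WriteLine(choiceFlags); // 2228224
Console.WriteLine(radioFlags);  // 16384
```
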
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Early versions of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
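
The string handling mentioned in the first point can be illustrated with a short sketch. The helpers DecodeHex, ToText and ToLiteral are hypothetical and not taken from the download. The sketch treats strings without a byte order mark as Latin-1, which is only an approximation of PDFDocEncoding, and ToLiteral is only meant for text in that range. Encoding.Latin1 requires .NET 5 or later.

```csharp
using System;
using System.Text;

// Decodes the content between '<' and '>' of a hexadecimal string into bytes.
static byte[] DecodeHex(string hex)
{
    var digits = new StringBuilder();
    foreach (char c in hex)
        if (!char.IsWhiteSpace(c)) digits.Append(c);
    if (digits.Length % 2 == 1) digits.Append('0'); // a missing final digit counts as 0
    var result = new byte[digits.Length / 2];
    for (int i = 0; i < result.Length; i++)
        result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
    return result;
}

// Interprets string bytes as text: UTF-16BE if they start with the byte order mark FE FF,
// otherwise Latin-1 (an approximation of PDFDocEncoding).
static string ToText(byte[] raw) =>
    raw.Length >= 2 && raw[0] == 0xFE && raw[1] == 0xFF
        ? Encoding.BigEndianUnicode.GetString(raw, 2, raw.Length - 2)
        : Encoding.Latin1.GetString(raw);

// Writes text in literal format, escaping backslashes and parentheses.
static string ToLiteral(string text) =>
    "(" + text.Replace("\\", "\\\\").Replace("(", "\\(").Replace(")", "\\)") + ")";

Console.WriteLine(ToText(DecodeHex("536D697468")));               // Smith
Console.WriteLine(ToText(DecodeHex("FEFF 0044 006F 0065")));      // Doe
Console.WriteLine(BitConverter.ToString(DecodeHex("901FA")));     // 90-1F-A0
Console.WriteLine(ToLiteral(@"a(b)\c"));                          // (a\(b\)\\c)
```
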
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
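To make the XFDF idea concrete, here is a minimal sketch of how such a file could be generated from C#. It assumes the usual XFDF layout (an xfdf root element in the http://ns.adobe.com/xfdf/ namespace with one field element per form field); the class and method names are made up for illustration and are not part of the parser presented in this article.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's parser.
public static class XfdfSketch
{
    // XML namespace used by XFDF documents.
    public static readonly XNamespace Ns = "http://ns.adobe.com/xfdf/";

    // Builds an XFDF document with one <field name="..."><value>...</value></field>
    // element per entry in the dictionary.
    public static XDocument Build(IDictionary<string, string> values)
    {
        return new XDocument(
            new XElement(Ns + "xfdf",
                new XElement(Ns + "fields",
                    values.Select(kv =>
                        new XElement(Ns + "field",
                            new XAttribute("name", kv.Key),
                            new XElement(Ns + "value", kv.Value))))));
    }
}
```

Calling XfdfSketch.Build with the values queried from a database and saving the result with XDocument.Save gives a file that a PDF viewer can import into the matching form.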

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It builds on PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) is, not surprisingly, the field name you assign in the properties dialog of Acrobat. The "Smith" strings in parentheses are the content of the field: /V holds the current value, while /DV holds the default value that the field reverts to when it is reset.
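The naive text-editor view of these entries can be sketched in a few lines of C#. This is only an illustration of reading /T, /V and /DV from the raw dictionary text with a regular expression; the helper name is hypothetical, it handles only simple literal strings without escape sequences or nested parentheses, and it is not how the parser in this article works.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class FieldEntrySketch
{
    // Returns the literal string stored under a key such as "T", "V" or "DV",
    // or null if the key is not followed by a simple literal string.
    public static string ReadLiteral(string dictionaryText, string key)
    {
        // "/" + key, optional white space, then "(...)" without escapes or nesting
        Match m = Regex.Match(
            dictionaryText,
            @"/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the excerpt above, ReadLiteral(text, "T") yields the field name and ReadLiteral(text, "V") the current value.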

If you replace the string "Smith" with "Jones", you will find that the displayed content does not change until you click on the field in Acrobat. This is because Acrobat does not render the field from its value, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field does Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with the longer string "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you shift the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file becomes corrupt. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
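The effect of an edit on later offsets can be shown with a small sketch. It assumes an ASCII-only file body, so that character positions equal byte offsets, and the names used here are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class OffsetShiftSketch
{
    // Offset of the first occurrence of marker in an ASCII-only file body
    // (one char per byte, so the char index equals the byte offset).
    public static int OffsetOf(string asciiPdf, string marker)
    {
        return asciiPdf.IndexOf(marker, StringComparison.Ordinal);
    }

    // How far the object starting at marker moves when oldValue is
    // replaced by newValue earlier in the file.
    public static int ShiftAfterEdit(string asciiPdf, string marker,
                                     string oldValue, string newValue)
    {
        string edited = asciiPdf.Replace(oldValue, newValue);
        return OffsetOf(edited, marker) - OffsetOf(asciiPdf, marker);
    }
}
```

Replacing "(Smith)" with "(Jones)" gives a shift of 0, so the offset table stays valid; replacing it with "(Washington)" moves every later object by 5 bytes, and each of their entries in the offset table is then wrong.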

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
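A minimal sketch of this truncate-or-pad workaround follows. The helper name is hypothetical, and the sketch assumes values that need one byte per character and contain no characters that must be escaped inside a PDF string.

```csharp
// Hypothetical helper for illustration only.
public static class FixedWidthSketch
{
    // Returns value cut or padded with spaces to exactly width characters,
    // so that replacing the old string does not move any later object.
    public static string FitToWidth(string value, int width)
    {
        if (value.Length > width)
        {
            return value.Substring(0, width);   // too long: truncate
        }
        return value.PadRight(width);           // too short: pad with spaces
    }
}
```

With a field that was initialized with five characters, FitToWidth("Washington", 5) keeps only the first five letters, which shows why the approach is not satisfying for real data.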

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special kind of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
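The first of these steps can be sketched as follows. This is an illustration under stated assumptions, not the code shipped with the article: it reads one classic cross reference section given as text (the xref keyword, subsection headers, and fixed-width entries of offset, generation and an n or f flag), ignores cross reference streams, and uses a made-up class name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration only.
public static class XrefSketch
{
    // Parses one classic cross reference section and returns
    // object number -> byte offset for all in-use ("n") entries.
    public static Dictionary<int, long> Parse(string section)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = section.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0 || lines[0].Trim() != "xref")
        {
            throw new FormatException("Expected the 'xref' keyword.");
        }

        int i = 1;
        while (i < lines.Length && lines[i].Trim() != "trailer")
        {
            // Subsection header: "<first object number> <entry count>"
            string[] header = lines[i].Trim().Split(' ');
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);
            i++;

            for (int n = 0; n < count; n++, i++)
            {
                // Entry: 10-digit offset, 5-digit generation, "n" or "f"
                string[] parts = lines[i].Trim().Split(' ');
                if (parts[2] == "n")
                {
                    offsets[first + n] =
                        long.Parse(parts[0], CultureInfo.InvariantCulture);
                }
            }
        }
        return offsets;
    }
}
```

The resulting dictionary is what the later steps need: given the object number of the AcroForm dictionary or of a field, it tells the parser where in the file to start reading.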
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
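The append mechanism described above can be sketched like this. The sketch builds the text that would be appended for a single changed object of generation 0; it assumes an ASCII-only object body, takes the /Size value, the /Root reference and the old cross reference offset as parameters that a real writer would read from the existing trailer, and is not necessarily how WritePdf is implemented.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only.
public static class IncrementalUpdateSketch
{
    // Returns the text to append to a file of originalLength bytes so that
    // objectBody becomes the newest version of object objectNumber.
    public static string BuildAppendix(long originalLength, int objectNumber,
                                       string objectBody, int size,
                                       string rootRef, long previousXrefOffset)
    {
        var sb = new StringBuilder();
        sb.Append('\n');

        // New version of the changed object
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber).Append(" 0 obj\n")
          .Append(objectBody).Append("\nendobj\n");

        // Cross reference section with a single entry for that object
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", CultureInfo.InvariantCulture))
          .Append(" 00000 n \n");

        // Trailer: /Prev links back to the previous cross reference table
        sb.Append("trailer\n<< /Size ").Append(size)
          .Append(" /Root ").Append(rootRef)
          .Append(" /Prev ").Append(previousXrefOffset).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because nothing in the original bytes is touched, all old offsets stay valid; a reader starts at the last startxref, finds the new entry first, and follows /Prev for every object that was not updated.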
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// requires: using System.IO;
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // best effort: ignore a missing field or a field that is not a text field
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most PDF field properties are exposed as .NET properties as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Before version 1.2 (see History), the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
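As an illustration of the string handling mentioned in the first point, here is a hedged sketch of decoding a hexadecimal string. It follows the PDF Reference rules that white space inside the string is ignored and that a missing final digit is treated as 0, and it recognizes UTF-16 text by its leading FE FF byte order mark. The class is hypothetical, and the ASCII fallback is a simplification of PDFDocEncoding.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class HexStringSketch
{
    // Decodes the characters between '<' and '>' of a hexadecimal string.
    public static byte[] Decode(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (char.IsWhiteSpace(c)) continue;          // white space is ignored
            if (!Uri.IsHexDigit(c))
            {
                throw new FormatException("Not a hex digit: " + c);
            }
            digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');  // missing last digit is 0

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Interprets the bytes as UTF-16 (big endian) if they start with FE FF,
    // otherwise as ASCII (a simplification of PDFDocEncoding).
    public static string ToText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        return Encoding.ASCII.GetString(bytes);
    }
}
```

For example, the five digits 901FA decode to the three bytes 90, 1F and A0, because the missing sixth digit counts as 0.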
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) holds, not surprisingly, the field name you assign in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
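To make the offset problem concrete, here is a minimal sketch that is not part of the parser. The two-object fragment and the class name OffsetShiftDemo are made up for illustration; the fragment is not a complete PDF file. It measures where the second object starts before and after a value in the first object is replaced.

```csharp
using System;

// Illustration only: a made-up two-object fragment, not a complete PDF file.
public static class OffsetShiftDemo
{
    public const string Fragment =
        "1 0 obj\n<< /T (Name) /V (Smith) >>\nendobj\n" +
        "2 0 obj\n<< /T (City) /V (Berlin) >>\nendobj\n";

    // Returns the offset at which object 2 starts in the given text.
    // The fragment is pure ASCII, so character offsets equal byte offsets here.
    public static int OffsetOfSecondObject(string text)
    {
        return text.IndexOf("2 0 obj", StringComparison.Ordinal);
    }

    // How far object 2 moves when the parenthesized value is replaced.
    public static int ShiftAfterReplacing(string oldValue, string newValue)
    {
        string edited = Fragment.Replace("(" + oldValue + ")", "(" + newValue + ")");
        return OffsetOfSecondObject(edited) - OffsetOfSecondObject(Fragment);
    }
}
```

Replacing "Smith" with "Jones" leaves the second object where it was, while "Washington" pushes it five bytes further into the file, so an offset table written for the original file no longer points at it.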

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
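The truncate-or-pad workaround can be sketched in a few lines. The helper name FixedWidthValue.Fit is hypothetical, and the sketch assumes the value contains only single-byte characters and no parentheses or backslashes, which would need escaping inside a PDF literal string.

```csharp
using System;

public static class FixedWidthValue
{
    // Truncates 'value' or right-pads it with spaces so that the result is
    // exactly 'width' characters long and can replace the old string in place.
    public static string Fit(string value, int width)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (width < 0) throw new ArgumentOutOfRangeException("width");
        return value.Length > width ? value.Substring(0, width) : value.PadRight(width);
    }
}
```

With a field whose original content is five characters long, Fit("Washington", 5) yields "Washi" and Fit("Doe", 5) yields "Doe" followed by two spaces, so no byte offsets change.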

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, avoiding the overhead of a whole library. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser, you might want to look at the iText library, which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
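As an illustration of the first step, the following sketch reads one classic cross reference section and collects the byte offsets of all in-use objects. It is not the code from the download; the class and method names are hypothetical, and it assumes the section text has already been located in the file and uses the plain "xref ... trailer" layout rather than a compressed cross reference stream.

```csharp
using System;
using System.Collections.Generic;

public static class XrefSketch
{
    // Parses one cross reference section ("xref" ... "trailer") and returns
    // object number -> byte offset for every in-use ("n") entry.
    public static Dictionary<int, long> ParseSection(string section)
    {
        char[] newlines = { '\r', '\n' };
        char[] space = { ' ' };
        string[] lines = section.Split(newlines, StringSplitOptions.RemoveEmptyEntries);
        Dictionary<int, long> offsets = new Dictionary<int, long>();

        // Skip ahead to the "xref" keyword.
        int i = 0;
        while (i < lines.Length && lines[i].Trim() != "xref") i++;
        i++;

        while (i < lines.Length && lines[i].Trim() != "trailer")
        {
            // Subsection header: "<first object number> <entry count>"
            string[] header = lines[i].Split(space, StringSplitOptions.RemoveEmptyEntries);
            int first = int.Parse(header[0]);
            int count = int.Parse(header[1]);
            i++;

            for (int k = 0; k < count && i < lines.Length; k++, i++)
            {
                // Entry: "<10-digit offset> <5-digit generation> <n or f>"
                string[] entry = lines[i].Split(space, StringSplitOptions.RemoveEmptyEntries);
                if (entry[2] == "n")
                {
                    offsets[first + k] = long.Parse(entry[0]);
                }
            }
        }
        return offsets;
    }
}
```

Free entries (marked "f") are skipped because they do not point at an object. A file that has been updated incrementally contains several such sections, which is why the steps above speak of table(s).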
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
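The append-only update described above can be sketched as follows. This is a simplified illustration rather than the parser's actual writer: the names are hypothetical, the object body is assumed to be pure ASCII so that string lengths equal byte counts, and a real trailer may need further entries copied from the original file.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds the text to append to a file of 'originalLength' bytes in order to
    // supersede object 'objectNumber' (generation 0) with 'objectBody'.
    // 'previousXrefOffset' is the number after "startxref" in the original file;
    // 'size' and 'rootReference' are taken from the original trailer.
    public static string BuildAppendix(long originalLength, long previousXrefOffset,
        int objectNumber, string objectBody, int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder sb = new StringBuilder();
        sb.Append('\n');

        // The new version of the object starts right after the line break.
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // New cross reference section listing only the changed object.
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // /Prev links the new section to the old cross reference table.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because nothing in the original bytes is touched, every offset in the old table stays valid, and a reader that follows /Prev finds the newest version of each object first.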
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored: most likely the field is missing or is not
    // a text field, so the document is written back unchanged
}

// write the updated file back out; "using" closes the stream
// even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most field attributes are exposed as .NET properties as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences, and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
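To illustrate the hexadecimal strings mentioned in the first point above, here is a minimal decoding sketch. It is not taken from the parser's sources and the names are hypothetical. It follows the rules of the PDF Reference: white space between digits is ignored, and a missing final digit is assumed to be 0.

```csharp
using System;
using System.Collections.Generic;

public static class HexStringSketch
{
    // Decodes a PDF hexadecimal string token such as "<48656C6C6F>" into bytes.
    public static byte[] Decode(string token)
    {
        if (token == null || token.Length < 2 ||
            token[0] != '<' || token[token.Length - 1] != '>')
        {
            throw new FormatException("Not a hexadecimal string token.");
        }

        List<int> digits = new List<int>();
        for (int i = 1; i < token.Length - 1; i++)
        {
            char c = token[i];
            if (char.IsWhiteSpace(c)) continue;            // white space is ignored
            digits.Add(Convert.ToInt32(c.ToString(), 16)); // throws on non-hex characters
        }

        if (digits.Count % 2 == 1) digits.Add(0);          // missing final digit counts as 0

        byte[] result = new byte[digits.Count / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        }
        return result;
    }
}
```

For example, "<48656C6C6F>" decodes to the ASCII bytes of "Hello", and "<901FA>" decodes to the three bytes 90, 1F and A0.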
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under the BSD License.

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (each of which is a special form of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s), identifying byte offsets for all objects.
2. Parse the AcroForm dictionary object, identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
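The append mechanism described above can be sketched roughly as follows. This is not the WritePdf implementation from the download; BuildAppendix and its parameters are hypothetical, and the sketch assumes an ASCII-only object body, generation number 0, an existing file that ends with an end-of-line, and a trailer that needs only /Size, /Root and /Prev (a real trailer may carry further entries such as /Info or /ID that would have to be copied over).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds the text to append to an existing PDF so that object
    // `objectNumber` is superseded by a new version with body `objectBody`.
    // originalLength:     byte length of the existing file
    // previousXrefOffset: the number after the last "startxref" in that file
    // size, rootReference: /Size and /Root taken from the existing trailer
    public static string BuildAppendix(long originalLength, int objectNumber, string objectBody,
        long previousXrefOffset, int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();

        // The new object version starts right where the old file ends.
        long objectOffset = originalLength;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // The new cross reference section follows the object.
        long xrefOffset = originalLength + Encoding.ASCII.GetByteCount(sb.ToString());
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        // Each entry is exactly 20 bytes including its two-character line ending.
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // /Prev links the new table to the old one.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv))
          .Append(" /Root ").Append(rootReference)
          .Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

A reader that follows the last startxref sees the new entry first and falls back to the table referenced by /Prev for every object that was not rewritten, which is why the old bytes never have to move.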
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
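try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // e.g. the field is missing or is not a text field;
    // this example simply ignores the error
}

// write the updated file back out; "using" closes the stream even if writing fails
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}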
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
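In the PDF file itself, boolean options such as Password, Combo, MultiSelect and NoToggleToOff are stored as bits of the /Ff entry. How the parser maps its properties to these bits is not shown in the article, so the following is only a sketch under that assumption; the enum and helper names are invented, and the bit positions are the ones given in the PDF Reference.

```csharp
using System;

[Flags]
public enum FieldFlags
{
    None = 0,
    ReadOnly = 1 << 0,         // bit 1, all field types
    Required = 1 << 1,         // bit 2, all field types
    NoExport = 1 << 2,         // bit 3, all field types
    Multiline = 1 << 12,       // bit 13, text fields
    Password = 1 << 13,        // bit 14, text fields
    NoToggleToOff = 1 << 14,   // bit 15, radio buttons
    Radio = 1 << 15,           // bit 16, button fields
    Pushbutton = 1 << 16,      // bit 17, button fields
    Combo = 1 << 17,           // bit 18, choice fields
    MultiSelect = 1 << 21,     // bit 22, choice fields
    DoNotSpellCheck = 1 << 22  // bit 23, text and choice fields
}

public static class FieldFlagsSketch
{
    // True if the given flag is set in the raw /Ff value.
    public static bool Has(int ff, FieldFlags flag)
    {
        return (ff & (int)flag) != 0;
    }

    // Returns the raw /Ff value with the given flag switched on or off.
    public static int Set(int ff, FieldFlags flag, bool on)
    {
        return on ? ff | (int)flag : ff & ~(int)flag;
    }
}
```

Read this way, the /Ff 4194304 in the text field excerpt from the Background section is the single bit 23, i.e. the field is excluded from spell checking.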
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences, and hexadecimal strings with a possibly missing final digit. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; the History section lists support for linearized files as added in version 1.2. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
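As an illustration of the hexadecimal string rule mentioned in the first point above, here is a minimal decoding sketch. It is not the parser's own code, and the class name is made up; it relies on the PDF Reference rules that white space inside a hexadecimal string is ignored and that a missing final digit is treated as 0.

```csharp
using System;
using System.Collections.Generic;

public static class HexStringSketch
{
    // Decodes the content found between '<' and '>' of a PDF hexadecimal string.
    public static byte[] Decode(string hex)
    {
        var digits = new List<int>();
        foreach (char c in hex)
        {
            if (char.IsWhiteSpace(c)) continue;             // white space is ignored
            digits.Add(Convert.ToInt32(c.ToString(), 16));  // throws on a non-hex character
        }
        if (digits.Count % 2 == 1) digits.Add(0);           // missing final digit counts as 0

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        }
        return bytes;
    }
}
```

For example, the content 901FA decodes to the three bytes 90, 1F and A0, and 53 6D 69 74 68 decodes to the ASCII bytes of "Smith".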
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License.

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Last Updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
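using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field; the lookup or the cast fails if the form
// does not contain a text field called "Name"
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception ex)
{
    // report the problem instead of silently ignoring it
    Console.Error.WriteLine("Could not set field \"Name\": " + ex.Message);
}

// write the updated file back out; the using block closes the
// stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}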
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radioButton = ...;
// set the selected button; "Off" means that no button is selected
radioButton.SelectedItem = "MasterCard";
// one button must be pressed
radioButton.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField textField = ...;
// set the text
textField.Text = "Hello, World.";
// mark it as a password field
textField.Password = true;

// a combo or list box
PdfCHField choiceField = ...;
// render as combo box
choiceField.Combo = true;
// more than one item is selectable
choiceField.MultiSelect = true;
// select items 1 and 3
choiceField.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
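As an illustration of the hexadecimal string case mentioned above, the following sketch decodes the text between the angle brackets of a hexadecimal string into bytes. The PDF Reference states that white space inside such a string is ignored and that a missing final digit is assumed to be 0. The names HexStringSketch and Decode are hypothetical and are not taken from the parser's sources.

```csharp
using System;
using System.Collections.Generic;

public static class HexStringSketch
{
    // Decodes the content of a PDF hexadecimal string (without the
    // enclosing '<' and '>') into the bytes it represents.
    public static byte[] Decode(string hex)
    {
        var digits = new List<int>();
        foreach (char c in hex)
        {
            // White space between digits carries no meaning and is skipped.
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0')
                continue;

            int value = HexValue(c);
            if (value < 0)
                throw new FormatException("Invalid character in hexadecimal string: " + c);
            digits.Add(value);
        }

        // An odd number of digits means the last digit is missing; it counts as 0.
        if (digits.Count % 2 == 1)
            digits.Add(0);

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        return bytes;
    }

    // Returns the numeric value of a hexadecimal digit, or -1 for any other character.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

For example, Decode("901FA") yields the three bytes 0x90, 0x1F and 0xA0, because the missing sixth digit is read as 0.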
Tools used
I have written a number of unit tests with the NUnit unit testing framework; they are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
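The effect is easy to reproduce on a small fragment. The following sketch uses hypothetical names (OffsetShiftSketch, OffsetShift) and treats the file as text in a single-byte encoding, so that character positions equal byte offsets. It measures how far a later object moves when a value earlier in the file is replaced.

```csharp
using System;

public static class OffsetShiftSketch
{
    // Returns by how many bytes the object starting with laterObjectMarker
    // moves when oldValue is replaced by newValue earlier in the file.
    public static int OffsetShift(string pdf, string oldValue, string newValue,
                                  string laterObjectMarker)
    {
        int before = pdf.IndexOf(laterObjectMarker, StringComparison.Ordinal);
        string edited = pdf.Replace(oldValue, newValue);
        int after = edited.IndexOf(laterObjectMarker, StringComparison.Ordinal);
        if (before < 0 || after < 0)
            throw new ArgumentException("Marker not found", nameof(laterObjectMarker));
        return after - before;
    }
}
```

For the fragment "1 0 obj\n<< /T (Name) /V (Smith) >>\nendobj\n2 0 obj\n<< >>\nendobj\n", replacing "(Smith)" by "(Jones)" gives a shift of 0, while replacing it by "(Washington)" moves object 2 by 5 bytes. Every offset-table entry for an object behind the edit is then off by that amount.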

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
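A helper for this workaround could look like the following sketch. The names FixedWidthSketch and FitToLength are hypothetical, and the sketch assumes that the value contains no characters that need escaping in a PDF literal string (parentheses or backslashes), since escaping would change the number of bytes written.

```csharp
using System;

public static class FixedWidthSketch
{
    // Truncates or pads a value with spaces so that it occupies exactly
    // 'length' characters and therefore leaves all later offsets unchanged.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        return value.Length > length
            ? value.Substring(0, length)
            : value.PadRight(length);
    }
}
```

With a field that originally held five characters, FitToLength("Washington", 5) returns "Washi" and FitToLength("Doe", 5) returns "Doe" followed by two spaces, which shows the price of the workaround: long values are silently cut off.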

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;

A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) is, not surprisingly, the field name you assign in the properties dialog of Acrobat. It's also easy to see that the "Smith" strings in parentheses hold the content of the field: /V is the current value, while /DV is the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross reference table) with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
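To make the offset problem concrete, here is a minimal sketch. It is not part of the parser presented in this article: the class `OffsetDemo` and its methods are hypothetical names chosen for illustration, and the code assumes an ASCII-only buffer so that character positions equal byte offsets.

```csharp
using System;

// Illustrative only: OffsetDemo is a hypothetical helper, not part of the
// parser described in this article. Assumes ASCII text (1 char == 1 byte).
public static class OffsetDemo
{
    // Returns the offset of the header "<number> <generation> obj" when it
    // starts the buffer or follows a line feed; returns -1 if it is absent.
    public static int FindObjectOffset(string pdfText, int number, int generation)
    {
        string header = number + " " + generation + " obj";
        if (pdfText.StartsWith(header, StringComparison.Ordinal))
        {
            return 0;
        }
        int index = pdfText.IndexOf("\n" + header, StringComparison.Ordinal);
        return index < 0 ? -1 : index + 1;
    }

    // Replaces the literal string (oldValue) with (newValue) and reports how
    // far the given object moved. Assumes the object exists in the buffer.
    public static int ShiftCausedByEdit(string pdfText, string oldValue,
                                        string newValue, int number, int generation)
    {
        int before = FindObjectOffset(pdfText, number, generation);
        string edited = pdfText.Replace("(" + oldValue + ")", "(" + newValue + ")");
        int after = FindObjectOffset(edited, number, generation);
        return after - before;
    }
}
```

For a buffer in which object 1 contains a single (Smith) and object 2 follows it, `ShiftCausedByEdit(text, "Smith", "Washington", 2, 0)` returns 5: every object after the edited one moves by the five extra bytes, while the offset table still holds the old positions.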

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
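The truncate-or-pad workaround could be sketched as follows. `FixedWidthField.FitToWidth` is a hypothetical helper, not part of the article's sources. It counts characters, so it only preserves byte offsets for plain ASCII values that contain no parentheses or backslashes (those would need escaping inside a PDF literal string, which changes the byte count).

```csharp
using System;

// Illustrative only: a hypothetical helper for the fixed-width workaround.
public static class FixedWidthField
{
    // Truncates values that are too long and pads short ones with spaces,
    // so the result always has exactly "width" characters.
    public static string FitToWidth(string value, int width)
    {
        if (value == null)
        {
            throw new ArgumentNullException(nameof(value));
        }
        if (width < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(width));
        }
        return value.Length > width
            ? value.Substring(0, width)
            : value.PadRight(width, ' ');
    }
}
```

With a field that originally held five characters, `FitToWidth("Washington", 5)` yields "Washi" and `FitToWidth("Doe", 5)` yields "Doe" followed by two spaces, which shows why the approach is lossy.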

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, without pulling in a whole library and its overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library, which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
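The first of these steps can be sketched roughly as follows. This is not the code from the article's sources: `XrefSketch` and its methods are hypothetical, the input is assumed to be ASCII text holding a classic cross reference table, and following /Prev links to older tables, cross reference streams, and the AcroForm lookup are all left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Illustrative only: a hypothetical reader for one classic xref section.
public static class XrefSketch
{
    private static readonly char[] Blank = { ' ' };
    private static readonly char[] LineBreaks = { '\r', '\n' };

    // Finds the last "startxref" keyword and returns the offset after it.
    public static long FindStartXref(string pdfText)
    {
        int marker = pdfText.LastIndexOf("startxref", StringComparison.Ordinal);
        if (marker < 0) throw new FormatException("startxref not found");
        string[] tokens = pdfText.Substring(marker + "startxref".Length)
            .Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0) throw new FormatException("offset missing");
        return long.Parse(tokens[0], CultureInfo.InvariantCulture);
    }

    // Maps object numbers of in-use ("n") entries to their byte offsets.
    public static Dictionary<int, long> ParseXrefSection(string pdfText, long xrefOffset)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = pdfText.Substring((int)xrefOffset)
            .Split(LineBreaks, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0 || lines[0].Trim() != "xref")
            throw new FormatException("xref keyword expected");

        int i = 1;
        while (i < lines.Length && !lines[i].StartsWith("trailer", StringComparison.Ordinal))
        {
            // Subsection header: "<first object number> <entry count>".
            string[] header = lines[i].Split(Blank, StringSplitOptions.RemoveEmptyEntries);
            if (header.Length != 2) throw new FormatException("bad subsection header");
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);
            i++;
            for (int k = 0; k < count; k++, i++)
            {
                if (i >= lines.Length) throw new FormatException("truncated xref");
                // Entry: "<10-digit offset> <5-digit generation> <n|f>".
                string[] entry = lines[i].Split(Blank, StringSplitOptions.RemoveEmptyEntries);
                if (entry.Length != 3) throw new FormatException("bad xref entry");
                if (entry[2] == "n")
                    offsets[first + k] = long.Parse(entry[0], CultureInfo.InvariantCulture);
            }
        }
        return offsets;
    }
}
```

Calling `ParseXrefSection(text, FindStartXref(text))` gives the offsets at which the object parser would start reading; free ("f") entries are skipped because they do not point at an object.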
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
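The append-only update described above can be sketched as follows. This is an illustration under stated assumptions, not the WritePdf implementation from the sources: `IncrementalUpdateSketch.BuildUpdate` is a hypothetical name, the content is assumed to be ASCII, only one changed object is written, and a real trailer may need further entries copied from the original one (for example /Info or /ID).

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustrative only: builds the text appended to a PDF for one changed object.
public static class IncrementalUpdateSketch
{
    public static string BuildUpdate(long originalLength, int objectNumber,
        int generation, string objectBody, long previousXrefOffset,
        int size, string rootReference)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();
        sb.Append('\n'); // make sure the new object starts on its own line

        // New version of the object, same object number and generation.
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(' ')
          .Append(generation.ToString(inv)).Append(" obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // New xref section with a single one-entry subsection.
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", inv)).Append(' ')
          .Append(generation.ToString("D5", inv)).Append(" n \n");

        // The trailer links back to the old table through /Prev.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv))
          .Append(" /Root ").Append(rootReference)
          .Append(" /Prev ").Append(previousXrefOffset.ToString(inv))
          .Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

The returned text is written after the last byte of the existing file. The original bytes stay untouched, which is why none of the old offsets need fixing and why the file grows with every save.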
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
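using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored in this example, e.g. when the field is missing
    // or is not a text field; real code should catch specific exceptions
}

// write the updated file back out; "using" closes the stream even on error
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}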
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
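As an illustration of the "missing digits" case, here is a minimal sketch of hexadecimal string decoding. `HexStringSketch.Decode` is a hypothetical helper and not the parser's own code. It follows the rule from the PDF Reference that white space between digits is ignored and that a missing final digit is assumed to be 0.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: decodes a PDF hexadecimal string such as <901FA>.
public static class HexStringSketch
{
    public static byte[] Decode(string token)
    {
        if (token == null)
        {
            throw new ArgumentNullException(nameof(token));
        }
        if (token.Length < 2 || token[0] != '<' || token[token.Length - 1] != '>')
        {
            throw new FormatException("hex string must be enclosed in < and >");
        }

        // Collect the digit values, skipping white space between them.
        var digits = new List<int>();
        for (int i = 1; i < token.Length - 1; i++)
        {
            char c = token[i];
            if (char.IsWhiteSpace(c))
            {
                continue;
            }
            // Throws FormatException for characters that are not hex digits.
            digits.Add(Convert.ToInt32(c.ToString(), 16));
        }

        // An odd number of digits means the last one is missing: assume 0.
        if (digits.Count % 2 == 1)
        {
            digits.Add(0);
        }

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        }
        return bytes;
    }
}
```

`Decode("<901FA>")` therefore yields the three bytes 0x90, 0x1F and 0xA0, the same result as for `<901FA0>`.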
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; according to the History below, support for them was added in version 1.2. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
The sources include a number of unit tests that I have written using the NUnit unit testing framework.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under the BSD License.

Article Copyright 2004 by Michael Ganss
Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...



Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a format devised by Adobe Systems, Inc. in 1993; it started out as a proprietary format and has since been published as the ISO 32000 standard. It is derived from PostScript, a stack-based language that in turn resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
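
To illustrate how these entries can be picked out of the raw text, here is a minimal sketch. It is not how the parser in this article works; the class and method names are made up for the example, and the regular expression only copes with simple literal strings that contain no nested parentheses or backslash escapes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldEntryDemo
{
    // Collects the /T, /V and /DV entries whose values are simple literal strings.
    // Assumption: no nested parentheses or backslash escapes inside the strings.
    public static Dictionary<string, string> ReadSimpleEntries(string dictionaryText)
    {
        var entries = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(dictionaryText, @"/(T|V|DV)\s*\(([^()\\]*)\)"))
        {
            entries[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return entries;
    }

    public static void Main()
    {
        // A shortened version of the field dictionary shown above.
        string field = "<< /DA (/Helv 10 Tf 0 g) /T (Name) /FT /Tx /DV (Smith) /V (Smith) >>";
        foreach (KeyValuePair<string, string> entry in ReadSimpleEntries(field))
        {
            Console.WriteLine("/" + entry.Key + " = " + entry.Value);
        }
    }
}
```

Note that /DA and /FT are skipped: the pattern only accepts the three keys it names, and only when they are followed by a parenthesized string.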

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
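
The effect is easy to reproduce on a toy example. The following sketch (hypothetical names, and a made-up two-object fragment rather than a complete PDF) records the byte offset of each object header and shows how the second object moves once "Smith" is replaced by the longer "Washington". An offset table written for the original text would now point five bytes too early.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OffsetDemo
{
    // Returns the offset of every "N G obj" header, keyed by object number.
    // Assumes single-byte content, so a character index equals a byte offset.
    public static Dictionary<int, int> FindObjectOffsets(string pdfText)
    {
        var offsets = new Dictionary<int, int>();
        foreach (Match m in Regex.Matches(pdfText, @"(?m)^(\d+) (\d+) obj"))
        {
            offsets[int.Parse(m.Groups[1].Value)] = m.Index;
        }
        return offsets;
    }

    public static void Main()
    {
        string original = "1 0 obj\n<< /V (Smith) >>\nendobj\n2 0 obj\n<< /V (x) >>\nendobj\n";
        string edited = original.Replace("(Smith)", "(Washington)");

        int before = FindObjectOffsets(original)[2];
        int after = FindObjectOffsets(edited)[2];
        Console.WriteLine("object 2 before: " + before); // 32
        Console.WriteLine("object 2 after:  " + after);  // 37
    }
}
```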

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
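
A minimal sketch of that truncate-or-pad step might look like this. The helper name is hypothetical, and the sketch assumes plain ASCII values without parentheses or backslashes, since those would need escaping inside a PDF literal string and would change the byte count.

```csharp
using System;

public static class FixedWidthField
{
    // Truncates or right-pads the value with spaces so that it is
    // exactly `length` characters long.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (length < 0) throw new ArgumentOutOfRangeException("length");
        return value.Length >= length
            ? value.Substring(0, length)
            : value.PadRight(length, ' ');
    }

    public static void Main()
    {
        Console.WriteLine("[" + FitToLength("Washington", 5) + "]"); // [Washi]
        Console.WriteLine("[" + FitToLength("Doe", 5) + "]");        // [Doe  ]
    }
}
```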

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
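
The first of these steps begins at the end of the file: a PDF ends with the keyword startxref, followed by the byte offset of the most recent cross reference table and the %%EOF marker. The following sketch shows one way to read that offset. It is an illustration with hypothetical names, not code taken from the parser, and it converts the whole buffer to a string for simplicity where a real parser would only inspect the tail of the file.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StartXrefDemo
{
    // Returns the byte offset that follows the last "startxref" keyword.
    public static long FindStartXref(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        int pos = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) throw new FormatException("startxref keyword not found");

        int i = pos + "startxref".Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;
        if (i == start) throw new FormatException("no offset after startxref");

        return long.Parse(text.Substring(start, i - start), CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // Only the tail of a file matters for this sketch.
        byte[] tail = Encoding.ASCII.GetBytes("trailer\n<< /Size 3 >>\nstartxref\n116\n%%EOF\n");
        Console.WriteLine(FindStartXref(tail)); // 116
    }
}
```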
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
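
To give an idea of what such an appended section looks like, here is a rough sketch that builds the text for one changed object, a one-entry cross reference table, and a trailer whose /Prev entry points back to the old table. All names and parameters are hypothetical, the object body is assumed to be plain ASCII, and a real update would also carry over other trailer entries such as /Info and /ID. It is not the code used by the parser's WritePdf method.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateDemo
{
    // Builds the bytes (as ASCII text) to append to a file of originalLength bytes.
    public static string BuildUpdate(long originalLength, int objectNumber, int generation,
        string objectBody, int size, string rootRef, long previousXrefOffset)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();
        sb.Append('\n'); // make sure the new object starts on its own line

        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(' ')
          .Append(generation.ToString(inv)).Append(" obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        // Subsection header: first object number and number of entries.
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation,
        // the letter n (in use) and a two-byte end of line.
        sb.Append(objectOffset.ToString("D10", inv)).Append(' ')
          .Append(generation.ToString("D5", inv)).Append(" n \n");

        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv))
          .Append(" /Root ").Append(rootRef)
          .Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }

    public static void Main()
    {
        // Made-up numbers: a 1000-byte file whose old table starts at offset 900.
        Console.Write(BuildUpdate(1000, 2774, 0, "<< /V (Doe) >>", 2775, "1 0 R", 900));
    }
}
```

Because the old table is still reachable through /Prev, a reader finds the new version of object 2774 first and falls back to the old table for everything else.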
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// requires: using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // the field is missing or is not a text field:
    // leave the document unchanged
}

// write the updated file back out; the using block closes
// the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences, and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions of the parser before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
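
As an illustration of the hexadecimal and Unicode string handling mentioned above, here is a small sketch. It is not the parser's own code and the names are hypothetical. It follows two rules from the PDF Reference: a missing final hex digit is assumed to be 0, and a text string that starts with the bytes FE FF is UTF-16 (big-endian). Other strings are treated as ISO-8859-1 here, which is only an approximation of PDFDocEncoding.

```csharp
using System;
using System.Text;

public static class PdfStringDemo
{
    // Decodes a hexadecimal string such as <901FA> into its bytes.
    public static byte[] DecodeHexString(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            // Delimiters and whitespace between digits are ignored.
            if (c == '<' || c == '>' || char.IsWhiteSpace(c)) continue;
            if (!Uri.IsHexDigit(c)) throw new FormatException("Invalid hex digit: " + c);
            digits.Append(c);
        }
        // An odd number of digits means the final digit is missing; assume 0.
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Turns string bytes into text, honoring a UTF-16 big-endian byte order mark.
    public static string DecodeTextString(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding.
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(DecodeHexString("<901FA>")));     // 90-1F-A0
        Console.WriteLine(DecodeTextString(DecodeHexString("<FEFF0044006F0065>"))); // Doe
    }
}
```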
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.


A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
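
To make the XFDF route concrete, here is a minimal sketch that builds an XFDF document for a single field with System.Xml.Linq. The field name Name and the value Doe mirror the example used later in this article; the class name XfdfSketch and the file name form.pdf are hypothetical, and the element layout follows the published XFDF format as I understand it, so treat it as an illustration rather than a reference implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class XfdfSketch
{
    // Namespace used by XFDF documents.
    private static readonly XNamespace Ns = "http://ns.adobe.com/xfdf/";

    // Builds an XFDF document that assigns one value to each named field.
    // pdfHref names the PDF form the data belongs to (hypothetical here).
    public static XDocument Build(IDictionary<string, string> values, string pdfHref)
    {
        var fields = new XElement(Ns + "fields");
        foreach (KeyValuePair<string, string> pair in values)
        {
            fields.Add(new XElement(Ns + "field",
                new XAttribute("name", pair.Key),
                new XElement(Ns + "value", pair.Value)));
        }

        return new XDocument(
            new XElement(Ns + "xfdf",
                new XElement(Ns + "f", new XAttribute("href", pdfHref)),
                fields));
    }
}
```

Calling XfdfSketch.Build with a dictionary that maps "Name" to "Doe" and saving the result yields a small XML file that a PDF viewer can merge with the form. The PDF itself stays untouched, which is exactly the separation of data and representation described above.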

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that borrows ideas from Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
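
The effect is easy to reproduce on a toy example. The sketch below is not part of the parser; it assumes a pure ASCII, uncompressed file (so character positions equal byte offsets) and uses a made-up second object (City/Paris) purely for illustration. It locates the "N 0 obj" headers before and after an edit and reports how far an object has moved.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OffsetDriftSketch
{
    // Maps object number -> position of its "N 0 obj" header.
    // Only generation 0 objects are recognised in this sketch.
    public static Dictionary<int, int> FindObjectOffsets(string pdfText)
    {
        var offsets = new Dictionary<int, int>();
        foreach (Match m in Regex.Matches(pdfText, @"^(\d+) 0 obj", RegexOptions.Multiline))
        {
            offsets[int.Parse(m.Groups[1].Value)] = m.Index;
        }
        return offsets;
    }

    // Returns how many bytes the given object moved between two versions of a file.
    public static int Drift(string original, string edited, int objectNumber)
    {
        return FindObjectOffsets(edited)[objectNumber] - FindObjectOffsets(original)[objectNumber];
    }

    // Hypothetical two-object file used for the demonstration.
    public static int Demo()
    {
        string original =
            "1 0 obj\n<< /T (Name) /V (Smith) >>\nendobj\n" +
            "2 0 obj\n<< /T (City) /V (Paris) >>\nendobj\n";
        string edited = original.Replace("(Smith)", "(Washington)");

        // "Washington" is 5 characters longer than "Smith", so object 2 starts
        // 5 bytes later than an unchanged offset table claims.
        return Drift(original, edited, 2);
    }
}
```

An offset table written for the original file still points at the old position of object 2, which is why a viewer has to repair the file or give up.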

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
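
A helper for this workaround could look like the following sketch. The name FitToLength is hypothetical, and the code assumes plain ASCII values without parentheses or backslashes, which would need escaping inside a PDF literal string.

```csharp
using System;

public static class FixedWidthSketch
{
    // Returns value truncated or padded with spaces so that it occupies
    // exactly `length` characters, keeping all later byte offsets unchanged.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        return value.Length > length
            ? value.Substring(0, length)
            : value.PadRight(length, ' ');
    }
}
```

With a five-character slot, "Washington" becomes "Washi" and "Doe" becomes "Doe" followed by two spaces. The truncation is the obvious weakness of this approach.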

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (each of which is a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
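
As a rough illustration of the first step, the sketch below reads one classic cross reference section and collects the byte offsets of all in-use objects. It is not the code from the download: the class and method names are hypothetical, the section is assumed to be available as ASCII text already, and cross reference streams (introduced with PDF 1.5) are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class XrefSketch
{
    // Parses text of the form
    //   xref
    //   <first object number> <entry count>
    //   <10-digit offset> <5-digit generation> <n|f>
    //   ...
    //   trailer
    // and returns object number -> byte offset for entries marked "n" (in use).
    public static Dictionary<int, long> ParseSection(string section)
    {
        char[] space = { ' ' };
        string[] lines = section.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0 || lines[0].Trim() != "xref")
            throw new FormatException("missing xref keyword");

        var offsets = new Dictionary<int, long>();
        int i = 1;
        while (i < lines.Length && lines[i].Trim() != "trailer")
        {
            // Subsection header: first object number and number of entries.
            string[] header = lines[i].Split(space, StringSplitOptions.RemoveEmptyEntries);
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);
            i++;

            for (int k = 0; k < count; k++, i++)
            {
                if (i >= lines.Length)
                    throw new FormatException("truncated xref subsection");

                string[] parts = lines[i].Split(space, StringSplitOptions.RemoveEmptyEntries);
                if (parts[2] == "n")
                    offsets[first + k] = long.Parse(parts[0], CultureInfo.InvariantCulture);
            }
        }
        return offsets;
    }
}
```

Entries marked "f" are free object numbers and are skipped. A file that has been saved several times contains several such sections, chained together by the /Prev entries of their trailers, so a complete reader would follow that chain and let newer entries override older ones.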
After these steps we have a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents: incremental updates. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the end of the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
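
The appended part of such a file can be sketched as follows. This is an illustration of the mechanism, not the parser's actual WritePdf code: the names are hypothetical, only one changed object with generation 0 is written, and everything is assumed to be ASCII so that string lengths equal byte counts.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Builds the text to append to an existing PDF of originalLength bytes:
    // a new version of one object, a cross reference section for it, and a
    // trailer whose /Prev entry points at the previous cross reference table.
    public static string BuildUpdate(long originalLength, int objectNumber, string objectBody,
                                     int size, string rootRef, long prevXrefOffset)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();
        sb.Append('\n'); // make sure the new object starts on its own line

        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        // Each entry is exactly 20 bytes: offset, generation, "n", two-byte line end.
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootRef);
        sb.Append(" /Prev ").Append(prevXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Here size is the value of the /Size entry (one more than the highest object number in the file) and rootRef is the reference to the document catalog, both copied from the previous trailer. A reader that starts at the end of the file finds the new cross reference section first and therefore sees the new version of the object, while the old bytes remain in the file unchanged.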
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // best effort: ignore a missing or non-text "Name" field
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Before version 1.2 (see History), the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
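
To illustrate two of the string cases mentioned above, the following sketch decodes a hexadecimal string with a possibly missing final digit and recognises UTF-16 text by its byte order mark. It is a simplified stand-in, not the parser's code: the names are hypothetical, literal-string escape sequences are left out, and text without a byte order mark is decoded as ISO-8859-1 as an approximation of PDFDocEncoding.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the content between '<' and '>' of a hexadecimal string.
    // White space is ignored; a missing final digit is treated as 0.
    public static byte[] DecodeHexString(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (Uri.IsHexDigit(c)) digits.Append(c);
            else if (!char.IsWhiteSpace(c)) throw new FormatException("invalid hex digit: " + c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Turns the raw bytes of a text string into a .NET string.
    public static string DecodeTextString(byte[] bytes)
    {
        // A leading FE FF marks big-endian UTF-16.
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        // Approximation: PDFDocEncoding differs from ISO-8859-1 in some code points.
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

For example, the hexadecimal string 901FA decodes to the three bytes 90, 1F and A0, and FEFF0044006F0065 decodes to the text "Doe".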
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross-reference table) with the byte offsets of the start of all objects.

If you shift the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file becomes corrupt. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
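The location of the offset table is itself recorded as a byte offset: a PDF file ends with the keyword startxref, the offset of the most recent cross-reference section, and the %%EOF marker. The following is a minimal sketch of reading that number; the class and method names are invented for illustration, and for simplicity it converts the whole file to a string, whereas a real parser would look only at the tail of the file.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not the article's implementation.
public static class StartXrefSketch
{
    // Returns the byte offset that follows the last "startxref" keyword.
    public static long FindStartXref(byte[] pdf)
    {
        // ASCII decoding keeps exactly one char per byte (non-ASCII bytes become '?'),
        // so string indexes and byte offsets stay aligned.
        string text = Encoding.ASCII.GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
            throw new FormatException("startxref keyword not found");

        // Skip the whitespace between the keyword and the number.
        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;

        // Collect the decimal digits of the offset.
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start)
            throw new FormatException("no offset after startxref");

        return long.Parse(text.Substring(start, pos - start), CultureInfo.InvariantCulture);
    }
}
```

Every entry in the table found this way is again an absolute byte offset, which is why inserting even a single extra character earlier in the file invalidates the entries for everything after it.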

A workaround would be to always replace exactly the same number of characters, truncating strings that are too long and padding those that are too short with whitespace. If you control the design of the PDF form, you might choose as the initial content of each text field a fixed number of whitespace characters that is sure to extend past the right edge of the field's box.
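The truncate-or-pad workaround can be sketched in a few lines. The helper below is hypothetical and assumes the value consists of single-byte characters and contains no parentheses or backslashes, since those would have to be escaped and would change the byte count.

```csharp
// Hypothetical helper illustrating the fixed-length workaround.
public static class FixedLengthValue
{
    // Returns a string of exactly `length` characters: longer values are cut off,
    // shorter ones are padded on the right with spaces.
    public static string FitToLength(string value, int length)
    {
        if (value.Length > length)
            return value.Substring(0, length);
        return value.PadRight(length, ' ');
    }
}
```

For a field whose original content is five characters long, "Washington" would be cut to "Washi" and "Doe" padded to "Doe  ", so no object after the field moves.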

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is needed, instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser, you might want to look at the iText library, which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special kind of dictionary object) in a recursive descent fashion. To summarize, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
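What such an appended update looks like can be sketched as follows. This is not the article's WritePdf code: the class, method and parameter names are hypothetical, the values passed in are assumed to have been read from the original file, the object body is assumed to be pure ASCII (so characters and bytes coincide), and only the classic table-style cross reference format is covered.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, a sketch of a PDF incremental update section.
public static class IncrementalUpdateSketch
{
    // Builds the text to append to a file of `originalLength` bytes in order to
    // replace one object. `previousXrefOffset` is the old startxref value,
    // `size` and `rootReference` are copied from the old trailer.
    public static string Build(long originalLength, long previousXrefOffset, int size,
                               string rootReference, int objectNumber, int generation,
                               string objectBody)
    {
        var sb = new StringBuilder();
        sb.Append('\n'); // start on a fresh line after the old %%EOF

        // New version of the object; remember where it starts in the final file.
        long objectOffset = originalLength + sb.Length;
        sb.Append(objectNumber).Append(' ').Append(generation).Append(" obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // Cross reference section with a single subsection for the one changed object.
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber).Append(" 1\n");
        // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, "n", 2-byte line end.
        sb.Append(objectOffset.ToString("D10", CultureInfo.InvariantCulture)).Append(' ')
          .Append(generation.ToString("D5", CultureInfo.InvariantCulture)).Append(" n \n");

        // The new trailer points back to the old table through /Prev.
        // A real writer would also copy the remaining entries of the old trailer.
        sb.Append("trailer\n<< /Size ").Append(size)
          .Append(" /Root ").Append(rootReference)
          .Append(" /Prev ").Append(previousXrefOffset).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because the original bytes are never touched, none of the old offsets change; a reader starts at the last startxref, finds the new table first, and follows /Prev for every object that was not updated.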
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
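// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field; the lookup or the cast fails if the form
// has no text field called "Name", in which case nothing is changed
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored: leave the form as it is
}

// write the updated file back out; "using" closes the stream
// even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}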
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences, and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
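As an illustration of the "hexadecimal strings with possibly missing digits" mentioned in the first point: the PDF Reference allows whitespace between hex digits and says that a missing final digit is to be read as 0. The following is a minimal sketch of that rule, with invented names, and not the parser's actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper showing how a PDF hexadecimal string token is decoded.
public static class HexStringSketch
{
    // Decodes a token such as "<901FA>" into its bytes.
    public static byte[] Decode(string token)
    {
        if (token.Length < 2 || token[0] != '<' || token[token.Length - 1] != '>')
            throw new FormatException("expected a token of the form <...>");

        // Collect the hex digits, ignoring any whitespace between them.
        var digits = new List<int>();
        for (int i = 1; i < token.Length - 1; i++)
        {
            char c = token[i];
            if (char.IsWhiteSpace(c)) continue;
            digits.Add(Convert.ToInt32(c.ToString(), 16)); // FormatException on a non-hex character
        }

        // An odd number of digits means the last one is missing and is assumed to be 0.
        if (digits.Count % 2 == 1) digits.Add(0);

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        return bytes;
    }
}
```

With this rule, the token <901FA> decodes to the three bytes 90, 1F and A0.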
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License.

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // e.g. the field is missing or is not a text field; the file is left unchanged
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most attributes of PDF form fields are exposed as .NET properties as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
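
In the PDF format, Boolean attributes such as Password, NoToggleToOff, Combo and MultiSelect are stored as bits of a field's /Ff (field flags) integer. The sketch below shows that bit arithmetic using the 1-based bit positions from the PDF Reference. The class and method names are hypothetical, and whether the parser's properties are implemented exactly this way is an assumption.

```csharp
using System;

public static class FieldFlagsSketch
{
    // 1-based bit positions, as listed in the PDF Reference field flag tables.
    public const int Password = 14;       // text fields
    public const int NoToggleToOff = 15;  // radio buttons
    public const int Radio = 16;          // button fields
    public const int Pushbutton = 17;     // button fields
    public const int Combo = 18;          // choice fields
    public const int MultiSelect = 22;    // choice fields

    // Returns true if the given flag bit is set in the /Ff value.
    public static bool Has(int ff, int bit)
    {
        return (ff & (1 << (bit - 1))) != 0;
    }

    // Returns a copy of the /Ff value with the given flag bit set or cleared.
    public static int Set(int ff, int bit, bool on)
    {
        int mask = 1 << (bit - 1);
        return on ? (ff | mask) : (ff & ~mask);
    }

    // Distinguishes the three kinds of button field by their flags.
    public static string ButtonKind(int ff)
    {
        if (Has(ff, Pushbutton)) return "Pushbutton";
        return Has(ff, Radio) ? "Radio Button" : "Checkbox";
    }
}
```

For example, FieldFlagsSketch.Set(0, FieldFlagsSketch.Combo, true) turns on bit 18, which is what marks a choice field as a combo box rather than a list box.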
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser originally could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; the History section lists support for linearized files as added in version 1.2. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
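
The string forms mentioned above (literal strings with escape sequences, hexadecimal strings with a missing final digit, and UTF-16 text strings) can be decoded roughly as follows. This is a simplified sketch rather than the parser's own code: the names are hypothetical, the input is assumed to be the string content without its outer delimiters, each input character is assumed to stand for one byte, and ISO-8859-1 is used as a stand-in for PDFDocEncoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the inside of a hex string such as "48 65 6C 6C 6F" (without < >).
    public static byte[] DecodeHex(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (char.IsWhiteSpace(c)) continue; // whitespace is ignored
            if (!Uri.IsHexDigit(c)) throw new FormatException("Not a hex digit: " + c);
            digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0'); // a missing final digit counts as 0
        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        return bytes;
    }

    // Decodes the inside of a literal string (without the outer parentheses).
    public static byte[] DecodeLiteral(string s)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c != '\\') { bytes.Add((byte)c); continue; }
            if (++i >= s.Length) break; // a lone trailing backslash is ignored
            c = s[i];
            switch (c)
            {
                case 'n': bytes.Add(10); break;
                case 'r': bytes.Add(13); break;
                case 't': bytes.Add(9); break;
                case 'b': bytes.Add(8); break;
                case 'f': bytes.Add(12); break;
                case '\r': // backslash + end of line continues the string
                    if (i + 1 < s.Length && s[i + 1] == '\n') i++;
                    break;
                case '\n': break;
                default:
                    if (c >= '0' && c <= '7')
                    {
                        // octal escape with one to three digits
                        int value = 0, n = 0;
                        while (n < 3 && i < s.Length && s[i] >= '0' && s[i] <= '7')
                        {
                            value = value * 8 + (s[i] - '0');
                            i++; n++;
                        }
                        i--; // the for loop advances past the last digit
                        bytes.Add((byte)(value & 0xFF));
                    }
                    else bytes.Add((byte)c); // \( \) \\ and unknown escapes keep the character
                    break;
            }
        }
        return bytes.ToArray();
    }

    // Interprets decoded bytes as text: UTF-16BE if they start with the FE FF marker.
    public static string ToText(byte[] b)
    {
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(b, 2, b.Length - 2);
        return Encoding.GetEncoding("ISO-8859-1").GetString(b);
    }
}
```

Writing strings back out always in literal format, as the parser does, only needs the reverse of DecodeLiteral: escape backslashes and parentheses and emit everything else as is.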
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework; they are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
  - Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
  - Now supports linearized files.
  - Now supports inherited fields.
  - Uses NAnt.
  - Uses log4net.
October 01, 2004: Version 1.3.
  - Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
  - Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under the BSD License.

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Last Updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

You may also be interested in...

ASP Parser

Generate and add keyword variations using AdWords API

PDF Parser and FlateDecoder

Window Tabs (WndTabs) Add-In for DevStudio

SAPrefs - Netscape-like Preferences Dialog

OLE DB - First steps
Comments and Discussions

You must Sign In to use this message board.
Search Comments
Go
Spacing Layout Per page Update
First PrevNext

Question
Can not run pdf parser Pin member Member 11668163 10-May-15 23:04
General
My vote of 1 Pin member Paul Scholz 22-Oct-12 12:48
Question
Getting error. Pease help me Pin member nitin-aem 17-Aug-12 21:58
General
My vote of 5 Pin member manoj kumar choubey 15-Feb-12 23:07
Question
Adobe X Pin member vmullan 17-Jan-12 6:13
Answer
Re: Adobe X Pin member Paul Scholz 22-Oct-12 12:41
General
My vote of 5 Pin group Paul Coldrey 5-Jan-12 12:11
General
Tables Pin member priore 28-Oct-10 6:26
General
Parse pdf tables Re: Tables Pin member devvvy 22-Dec-10 16:20
General
Re: Parse pdf tables Re: Tables Pin member Gandalf - The White 22-Apr-11 1:37
General
Image Parser Pin member skg3264510 20-Oct-10 22:29
Question
AcroForm doubt! Pin member danielsantana 21-Jun-10 15:32
Question
create password for a pdf file Pin member PrgMaster 3-Jun-09 23:39
Question
Unable to Parse pdf file????? Pin member Adrien 4-Mar-09 12:11
Question
how to recognise hidden fields in pdf by itext Pin member rupkumar2006 20-Feb-09 7:36
General
Converting pdf to xml Pin member Rajshekar_Excelsoft 12-Dec-08 19:04
Question
SomeOne Help Me???? Pin member harsha318_ 27-Nov-08 22:03
Answer
Re: SomeOne Help Me???? Pin member Michael Ganss 27-Nov-08 23:00
General
Re: SomeOne Help Me???? Pin member harsha318_ 28-Nov-08 1:20
General
Re: SomeOne Help Me???? Pin member Member 3471270 15-Mar-10 11:43
General
Reading comments from PDF Pin member sunanth krishnan 22-Feb-08 1:08
General
header problem Pin member cadolfo_2000 22-Oct-07 5:00
Question
Radio buttons and comboboxes sintax problem Pin member Draculea5 10-Oct-07 4:45
General
Sweetness Pin member m_p_fontana 1-Jun-07 8:37
General
Re: Sweetness Pin member JCollum 7-Aug-07 12:20

Last Visit: 31-Dec-99 18:00 Last Update: 17-Jul-17 20:26 Refresh 1234567 Next »
General General News News Suggestion Suggestion Question Question Bug Bug Answer Answer Joke Joke Praise Praise Rant Rant Admin Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Go to top
Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170713.1 | Last Updated 22 Jun 2006
Select Language​▼
Article Copyright 2004 by Michael Ganss
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid

Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;

A PDF Forms Parser

Michael Ganss, 22 Jun 2006 (posted 19 Aug 2004)

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) is, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field: /V holds the actual value, while /DV holds the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross reference table) with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.
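
To make the offset problem concrete, here is a minimal sketch. It is an illustration only and not part of the parser: the two-object text and the OffsetOf helper are made up for this example, and the text is plain ASCII so that character positions equal byte offsets.

```csharp
using System;

static class OffsetShiftDemo
{
    // Returns the position of the first occurrence of marker, or -1.
    // The sample text is ASCII only, so a character index equals a byte offset.
    public static int OffsetOf(string pdfText, string marker)
    {
        return pdfText.IndexOf(marker, StringComparison.Ordinal);
    }

    static void Main()
    {
        // Hypothetical two-object excerpt; a real file also has a header,
        // a cross reference table and a trailer.
        string original =
            "1 0 obj\n<< /V (Smith) >>\nendobj\n" +
            "2 0 obj\n<< >>\nendobj\n";

        // The naive text-editor approach: swap the value in place.
        string edited = original.Replace("(Smith)", "(Washington)");

        int before = OffsetOf(original, "2 0 obj");
        int after = OffsetOf(edited, "2 0 obj");

        Console.WriteLine("object 2 starts at byte " + before + " before the edit");
        Console.WriteLine("object 2 starts at byte " + after + " after the edit");
        Console.WriteLine("stale offset table entries are off by " + (after - before));
    }
}
```

"Washington" is five characters longer than "Smith", so every object after the edited one moves by five bytes, while the offset table still points at the old positions.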

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
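
The padding and truncating workaround can be sketched in a few lines. The FixedWidth class and its Fit method are hypothetical names chosen for this illustration; the sketch assumes the value contains no parentheses or backslashes (which would need escaping in a PDF literal string) and it does not touch the appearance stream.

```csharp
using System;

static class FixedWidth
{
    // Truncates value, or pads it on the right with spaces,
    // so that the result has exactly width characters.
    public static string Fit(string value, int width)
    {
        if (value.Length >= width)
        {
            return value.Substring(0, width);
        }
        return value.PadRight(width, ' ');
    }

    static void Main()
    {
        // Hypothetical field whose initial content is ten characters wide.
        string original = "/V (Smith     )";
        string replaced = original.Replace("(Smith     )", "(" + Fit("Washington, George", 10) + ")");

        // Same length before and after, so no byte offsets move.
        Console.WriteLine(original.Length == replaced.Length);
        Console.WriteLine(replaced);
    }
}
```

Because the replacement always has the same length as the text it replaces, the offset table stays valid, at the price of silently cutting off long values.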

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary, instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library, which has been ported to several platforms including .NET.

The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (each one a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
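
As an illustration of the first step, the following sketch reads a classic (text) cross reference section. It is not the code from the download: XrefEntry and XrefSketch are hypothetical names, the input is assumed to be well formed, and cross reference streams (introduced with PDF 1.5) are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

sealed class XrefEntry
{
    public int ObjectNumber;
    public long Offset;      // byte offset for in-use entries; next free object number for free entries
    public int Generation;
    public bool InUse;       // true for "n" entries, false for "f" (free) entries
}

static class XrefSketch
{
    // Parses the text from the "xref" keyword up to (not including) "trailer".
    public static List<XrefEntry> Parse(string section)
    {
        char[] eol = { '\r', '\n' };
        char[] space = { ' ' };
        string[] lines = section.Split(eol, StringSplitOptions.RemoveEmptyEntries);

        int i = 0;
        if (lines.Length == 0 || lines[i++].Trim() != "xref")
        {
            throw new FormatException("expected the 'xref' keyword");
        }

        var entries = new List<XrefEntry>();
        while (i < lines.Length && lines[i].Trim() != "trailer")
        {
            // Subsection header: first object number and number of entries.
            string[] header = lines[i++].Split(space, StringSplitOptions.RemoveEmptyEntries);
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);

            for (int n = 0; n < count; n++)
            {
                // Entry: 10-digit offset, 5-digit generation, then 'n' or 'f'.
                string[] parts = lines[i++].Split(space, StringSplitOptions.RemoveEmptyEntries);
                entries.Add(new XrefEntry
                {
                    ObjectNumber = first + n,
                    Offset = long.Parse(parts[0], CultureInfo.InvariantCulture),
                    Generation = int.Parse(parts[1], CultureInfo.InvariantCulture),
                    InUse = parts[2] == "n"
                });
            }
        }
        return entries;
    }

    static void Main()
    {
        string section =
            "xref\n0 3\n" +
            "0000000000 65535 f \n" +
            "0000000017 00000 n \n" +
            "0000000081 00000 n \n" +
            "trailer\n";

        foreach (XrefEntry e in Parse(section))
        {
            Console.WriteLine(e.ObjectNumber + " -> " + e.Offset + (e.InUse ? " (in use)" : " (free)"));
        }
    }
}
```

With the offsets in hand, a parser can seek directly to the AcroForm dictionary and to each field object instead of scanning the whole file.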
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
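
The append mechanism described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser's actual WritePdf implementation: IncrementalUpdateSketch and its parameters are hypothetical, only one changed object with generation 0 is written, all text is assumed to be ASCII, and the caller supplies the offset of the previous cross reference table, the /Size value and the /Root reference taken from the original trailer.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

static class IncrementalUpdateSketch
{
    static void WriteAscii(Stream stream, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }

    // Appends a new version of object objNum to the original bytes, followed by
    // a one-entry cross reference section whose trailer links to the old one.
    public static byte[] Append(byte[] original, int objNum, string dictBody,
                                long prevXref, int size, string rootRef)
    {
        var output = new MemoryStream();
        output.Write(original, 0, original.Length);
        WriteAscii(output, "\n");

        // The new object version; remember where it starts.
        long objOffset = output.Position;
        WriteAscii(output, objNum + " 0 obj\n" + dictBody + "\nendobj\n");

        // New cross reference section: each entry is exactly 20 bytes.
        long xrefOffset = output.Position;
        WriteAscii(output,
            "xref\n" + objNum + " 1\n" +
            objOffset.ToString("D10", CultureInfo.InvariantCulture) + " 00000 n \n" +
            "trailer\n<< /Size " + size + " /Root " + rootRef + " /Prev " + prevXref + " >>\n" +
            "startxref\n" + xrefOffset + "\n%%EOF\n");

        return output.ToArray();
    }

    static void Main()
    {
        // Stand-in for a real file; only its length matters for this demo.
        byte[] original = Encoding.ASCII.GetBytes("%PDF-1.4\n% ... original content ...\n");
        byte[] updated = Append(original, 5, "<< /T (Name) /V (Jones) >>", 9, 6, "1 0 R");

        // Show only the appended part.
        Console.Write(Encoding.ASCII.GetString(updated, original.Length, updated.Length - original.Length));
    }
}
```

The original bytes are never modified, so all old offsets stay valid; a reader follows /Prev from the newest trailer back to the older tables and uses the most recent entry for each object.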
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // ignored in this example: a missing field or a field of another type ends up here
}

// write the updated file back out; the using block closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most attributes of PDF form fields are exposed as .NET properties as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField textField = ...;
// set the text
textField.Text = "Hello, World.";
// mark it as a password field
textField.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat, are supported as of version 1.2 (see History); earlier versions could not parse them. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
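
The string representations mentioned in the first point can be decoded roughly as shown below. This is a sketch written for this article and not the parser's own code: PdfStringSketch and its methods are hypothetical names, the input is the text between the delimiters (without the outer parentheses or angle brackets), and strings without a UTF-16 byte order mark are decoded as Latin-1, which only approximates PDFDocEncoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PdfStringSketch
{
    // Decodes the body of a literal string, resolving backslash escapes.
    public static byte[] DecodeLiteral(string body)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\') { bytes.Add((byte)c); continue; }
            if (++i >= body.Length) break;            // lone trailing backslash: ignore
            char e = body[i];
            switch (e)
            {
                case 'n': bytes.Add(10); break;
                case 'r': bytes.Add(13); break;
                case 't': bytes.Add(9); break;
                case 'b': bytes.Add(8); break;
                case 'f': bytes.Add(12); break;
                case '\r':                            // backslash + end of line: line continuation
                    if (i + 1 < body.Length && body[i + 1] == '\n') i++;
                    break;
                case '\n': break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Octal escape with one to three digits.
                        int value = 0, digits = 0;
                        while (digits < 3 && i < body.Length && body[i] >= '0' && body[i] <= '7')
                        {
                            value = value * 8 + (body[i] - '0');
                            i++; digits++;
                        }
                        i--;                          // the for loop advances again
                        bytes.Add((byte)(value & 0xFF));
                    }
                    else
                    {
                        bytes.Add((byte)e);           // \( \) \\ and unknown escapes: keep the character
                    }
                    break;
            }
        }
        return bytes.ToArray();
    }

    // Decodes the body of a hexadecimal string. Whitespace (and, in this sketch,
    // any other non-hex character) is skipped; a missing final digit counts as 0.
    public static byte[] DecodeHex(string body)
    {
        var digits = new StringBuilder();
        foreach (char c in body)
        {
            if (Uri.IsHexDigit(c)) digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return result;
    }

    // Interprets decoded bytes as text: UTF-16BE if they start with the
    // byte order mark FE FF, otherwise Latin-1 as an approximation.
    public static string ToText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }

    static void Main()
    {
        Console.WriteLine(ToText(DecodeLiteral(@"Smith \(Jr.\)")));
        Console.WriteLine(ToText(DecodeHex("FEFF 0048 0069")));
        Console.WriteLine(BitConverter.ToString(DecodeHex("901FA")));
    }
}
```

Writing strings back out in literal format, as the parser does, then only requires escaping backslashes and parentheses.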
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
    Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
    Now supports linearized files.
    Now supports inherited fields.
    Uses NAnt.
    Uses log4net.
October 01, 2004: Version 1.3.
    Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
    Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Tagged as: .NET1.1, VS.NET2003, C#, Windows, .NET, Visual-Studio, Dev, Intermediate
Stats: 532.7K views, 9.9K downloads, 157 bookmarked
Posted 19 Aug 2004
License: BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, a stack-based language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
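
As a rough illustration of how these entries can be located, the following sketch pulls the /T, /V, and /DV strings out of the object text shown above with a regular expression. The names FieldEntryReader and ReadStringEntry are made up for this example and are not part of the parser presented later; the pattern also assumes simple literal strings without nested or escaped parentheses.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FieldEntryReader
{
    // Returns the literal string stored under the given key (e.g. "T", "V", "DV"),
    // or null if the dictionary text has no such string entry.
    // Assumption: the value is a simple literal string such as (Smith),
    // with no nested or escaped parentheses.
    public static string ReadStringEntry(string objectText, string key)
    {
        // "/" + key, optional whitespace, then everything up to the next ")"
        string pattern = "/" + Regex.Escape(key) + @"\s*\(([^)]*)\)";
        Match m = Regex.Match(objectText, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Called with the excerpt above, ReadStringEntry(text, "T") would return Name and ReadStringEntry(text, "V") would return Smith. A real parser has to handle escape sequences and hexadecimal strings too, which is one reason a regular expression is not enough in general.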

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by a longer string such as "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you shift the offset of an object by lengthening an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
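
The fixed-length workaround described above might look like the following minimal sketch. FixedWidthValue.Fit is a hypothetical helper, not part of the parser; it assumes the value consists of single-byte characters that need no escaping inside a PDF literal string, so that the character count equals the byte count.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class FixedWidthValue
{
    // Truncates or right-pads the value with spaces so that it occupies
    // exactly 'width' characters, leaving all later byte offsets unchanged.
    public static string Fit(string value, int width)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (width < 0) throw new ArgumentOutOfRangeException(nameof(width));

        return value.Length > width
            ? value.Substring(0, width)     // too long: cut off the tail
            : value.PadRight(width, ' ');   // too short: pad with spaces
    }
}
```

With a field whose original content is five characters wide, Fit("Washington", 5) yields "Washi" and Fit("Doe", 5) yields "Doe" followed by two spaces, which shows why this approach loses data for long values.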

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which is a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
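
Both reading the existing cross reference table and appending an updated one come down to handling fixed-width entries. In a classic cross reference table (as opposed to a cross reference stream) each entry is 20 bytes long: a 10-digit byte offset, a space, a 5-digit generation number, a space, the letter n (in use) or f (free), and a two-byte end-of-line. The sketch below parses and formats such an entry; the XrefEntry class is an assumption made for this example and is not taken from the parser's sources.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class XrefEntry
{
    // Formats one 20-byte entry, e.g. "0000012345 00000 n \n".
    public static string Format(long offset, int generation, bool inUse)
    {
        return offset.ToString("D10", CultureInfo.InvariantCulture) + " "
             + generation.ToString("D5", CultureInfo.InvariantCulture) + " "
             + (inUse ? "n" : "f") + " \n";
    }

    // Parses the first 18 characters of an entry; the end-of-line is ignored.
    public static bool TryParse(string entry, out long offset, out int generation, out bool inUse)
    {
        offset = 0;
        generation = 0;
        inUse = false;

        if (entry == null || entry.Length < 18) return false;
        if (entry[10] != ' ' || entry[16] != ' ') return false;

        char kind = entry[17];
        if (kind != 'n' && kind != 'f') return false;

        // NumberStyles.None accepts decimal digits only (no sign, no whitespace)
        if (!long.TryParse(entry.Substring(0, 10), NumberStyles.None,
                           CultureInfo.InvariantCulture, out offset)) return false;
        if (!int.TryParse(entry.Substring(11, 5), NumberStyles.None,
                          CultureInfo.InvariantCulture, out generation)) return false;

        inUse = kind == 'n';
        return true;
    }
}
```

The fixed width is what lets a reader jump straight to the entry of a given object number without scanning the table line by line.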
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
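// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field; the lookup or the cast fails if the form
// has no text field called "Name"
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (System.Exception ex)
{
    // report the problem instead of silently swallowing it
    System.Console.Error.WriteLine("Could not set field \"Name\": " + ex.Message);
}

// write the updated file back out; using closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}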
Most attributes of a field are exposed as .NET properties as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button, "Off" means just that.
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Versions before 1.2 could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
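
To make the remark about hexadecimal strings with missing digits concrete: in a PDF hexadecimal string such as <901FA>, white space is ignored and a missing final digit is taken to be 0, so the example stands for the bytes 90 1F A0. The following is a minimal sketch of that rule, plus detection of the FE FF byte order mark that marks a UTF-16 text string. PdfHexString is a hypothetical class written for this article, not the parser's own code, and bytes without a byte order mark are simply mapped one-to-one to characters, which only approximates PDFDocEncoding outside the ASCII range.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class PdfHexString
{
    // Decodes a token such as "<48656C6C6F>" into its bytes.
    public static byte[] Decode(string token)
    {
        string t = token.Trim();
        if (t.Length < 2 || t[0] != '<' || t[t.Length - 1] != '>')
            throw new FormatException("Not a hexadecimal string.");

        var digits = new List<int>();
        foreach (char c in t.Substring(1, t.Length - 2))
        {
            if (char.IsWhiteSpace(c)) continue;   // white space is ignored
            digits.Add(HexValue(c));
        }
        if (digits.Count % 2 == 1) digits.Add(0); // missing last digit counts as 0

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = (byte)(digits[2 * i] * 16 + digits[2 * i + 1]);
        return bytes;
    }

    // Converts string bytes to text: UTF-16BE if they start with FE FF,
    // otherwise one character per byte (an approximation, see above).
    public static string ToText(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);

        var chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; i++) chars[i] = (char)bytes[i];
        return new string(chars);
    }

    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new FormatException("Invalid hexadecimal digit: " + c);
    }
}
```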
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
  Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
  Now supports linearized files.
  Now supports inherited fields.
  Uses NAnt.
  Uses log4net.
October 01, 2004: Version 1.3.
  Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
  Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.



A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.
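To make the XFDF route a little more concrete, here is a minimal sketch that builds an XFDF document with one `field` element per form field. The class and method names (`XfdfSketch`, `Build`) are made up for this illustration and are not part of the parser presented in this article; the element names and the namespace are those of Adobe's XFDF format.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's download.
public static class XfdfSketch
{
    // Namespace used by XFDF documents.
    private static readonly XNamespace Ns = "http://ns.adobe.com/xfdf/";

    // Builds an XFDF document that carries one <field> element per entry.
    public static XDocument Build(IDictionary<string, string> values)
    {
        var fields = new XElement(Ns + "fields");
        foreach (KeyValuePair<string, string> pair in values)
        {
            fields.Add(new XElement(Ns + "field",
                new XAttribute("name", pair.Key),
                new XElement(Ns + "value", pair.Value)));
        }
        return new XDocument(new XElement(Ns + "xfdf", fields));
    }
}
```

A call such as `XfdfSketch.Build(new Dictionary<string, string> { { "Name", "Smith" } }).Save("form.xfdf")` would write the data to a file (the file name is only an example) that can then be imported into the matching form.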

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.
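To give an idea of what this offset table looks like, here is a minimal sketch that reads one classic cross reference section and collects the byte offsets of all objects marked as in use. The names (`XrefSketch`, `Parse`) are hypothetical and the code is not taken from the parser in the download; it assumes the section arrives as ASCII text and ignores cross reference streams.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not the article's parser.
public static class XrefSketch
{
    // Parses one classic cross reference section and returns
    // object number -> byte offset for all entries marked in use ('n').
    public static Dictionary<int, long> Parse(string section)
    {
        var offsets = new Dictionary<int, long>();
        string[] lines = section.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        int next = 0; // object number of the next entry in the current subsection
        foreach (string raw in lines)
        {
            string[] parts = raw.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length > 0 && parts[0] == "trailer")
            {
                break; // the trailer dictionary follows the table
            }
            if (parts.Length == 2)
            {
                // subsection header: first object number and entry count
                next = int.Parse(parts[0]);
            }
            else if (parts.Length == 3)
            {
                // entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free)
                if (parts[2] == "n") offsets[next] = long.Parse(parts[0]);
                next++;
            }
            // anything else (the "xref" keyword itself) is skipped
        }
        return offsets;
    }
}
```

Fed with a section such as `"xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \ntrailer\n"`, the method maps object 1 to offset 17 and object 2 to offset 81.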

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something in Acrobat, but clearly this is not the right approach to filling form fields.
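The effect is easy to reproduce on a toy example. The following sketch (hypothetical names, ASCII-only input assumed) measures how far the start of a later object moves when a value earlier in the file is replaced naively:

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class OffsetShiftSketch
{
    // Returns the byte offset at which the given object header (e.g. "2 0 obj") starts, or -1.
    public static int OffsetOf(string pdfText, string objectHeader)
    {
        int charIndex = pdfText.IndexOf(objectHeader, StringComparison.Ordinal);
        if (charIndex < 0) return -1;
        return Encoding.ASCII.GetByteCount(pdfText.Substring(0, charIndex));
    }

    // Offset of objectHeader after a naive text replacement minus its offset before.
    public static int ShiftAfterReplace(string pdfText, string objectHeader,
        string oldValue, string newValue)
    {
        string edited = pdfText.Replace(oldValue, newValue);
        return OffsetOf(edited, objectHeader) - OffsetOf(pdfText, objectHeader);
    }
}
```

For a text like `"1 0 obj\n<< /V (Smith) >>\nendobj\n2 0 obj\n<< >>\nendobj\n"`, replacing `(Smith)` by `(Jones)` leaves object 2 where it was, while replacing it by `(Washington)` moves object 2 five bytes further down, so its entry in the offset table no longer matches.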

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
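A minimal sketch of this fixed-width workaround could look as follows. The helper name `Fit` is made up for this example, and it assumes plain single-byte characters that need no escaping inside a PDF literal string (parentheses and backslashes would change the byte count).

```csharp
// Hypothetical helper for the fixed-width workaround.
public static class FixedWidthSketch
{
    // Truncates or right-pads value with spaces so that it is exactly length characters long.
    public static string Fit(string value, int length)
    {
        if (value.Length > length) return value.Substring(0, length);
        return value.PadRight(length, ' ');
    }
}
```

With a field that originally held five characters, `Fit("Washington", 5)` yields `"Washi"` and `Fit("Doe", 5)` yields `"Doe  "`, so the byte offsets of all following objects stay the same.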

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
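To illustrate what "recursive descent" means for the last step, here is a deliberately small sketch. It is not the class from the download: `MiniPdfParser` and `PdfRef` are hypothetical names, and the sketch handles only dictionaries, arrays, names, numbers, indirect references and simple literal strings (no escape sequences, nested parentheses, hexadecimal strings, booleans or null).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical type for an indirect reference such as "1996 0 R".
public sealed class PdfRef { public int Number; public int Generation; }

// Hypothetical mini parser, not the class from the article's download.
public sealed class MiniPdfParser
{
    private readonly string _s;
    private int _pos;

    public MiniPdfParser(string source) { _s = source; }

    // Parses one object. Names keep their leading '/', literal strings lose their parentheses.
    public object ParseObject()
    {
        SkipWhitespace();
        if (_pos >= _s.Length) throw new FormatException("unexpected end of input");
        char c = _s[_pos];
        if (c == '<' && Peek(1) == '<') return ParseDictionary();
        if (c == '[') return ParseArray();
        if (c == '/') return ParseName();
        if (c == '(') return ParseLiteralString();
        return ParseNumberOrReference();
    }

    private Dictionary<string, object> ParseDictionary()
    {
        _pos += 2; // skip "<<"
        var dict = new Dictionary<string, object>();
        while (true)
        {
            SkipWhitespace();
            if (Peek(0) == '>' && Peek(1) == '>') { _pos += 2; return dict; }
            string key = ParseName();
            dict[key] = ParseObject(); // recursion: values may be nested
        }
    }

    private List<object> ParseArray()
    {
        _pos++; // skip "["
        var items = new List<object>();
        while (true)
        {
            SkipWhitespace();
            if (Peek(0) == ']') { _pos++; return items; }
            items.Add(ParseObject());
        }
    }

    private string ParseName()
    {
        if (Peek(0) != '/') throw new FormatException("name expected at " + _pos);
        int start = _pos++;
        while (_pos < _s.Length && !IsDelimiter(_s[_pos])) _pos++;
        return _s.Substring(start, _pos - start);
    }

    private string ParseLiteralString()
    {
        int start = ++_pos; // skip "("
        // simplification: no escape sequences or nested parentheses
        while (_pos < _s.Length && _s[_pos] != ')') _pos++;
        if (_pos >= _s.Length) throw new FormatException("unterminated string");
        return _s.Substring(start, _pos++ - start); // also skips ")"
    }

    private object ParseNumberOrReference()
    {
        string first = ReadToken();
        int save = _pos;
        string gen = ReadToken();
        string r = ReadToken();
        int objNum, genNum;
        if (r == "R" && int.TryParse(first, out objNum) && int.TryParse(gen, out genNum))
            return new PdfRef { Number = objNum, Generation = genNum };
        _pos = save; // not "<number> <generation> R": rewind the lookahead
        return double.Parse(first, CultureInfo.InvariantCulture);
    }

    private string ReadToken()
    {
        SkipWhitespace();
        int start = _pos;
        while (_pos < _s.Length && !IsDelimiter(_s[_pos])) _pos++;
        return _s.Substring(start, _pos - start);
    }

    private static bool IsDelimiter(char c) { return char.IsWhiteSpace(c) || "()<>[]{}/%".IndexOf(c) >= 0; }
    private char Peek(int ahead) { return _pos + ahead < _s.Length ? _s[_pos + ahead] : '\0'; }
    private void SkipWhitespace() { while (_pos < _s.Length && char.IsWhiteSpace(_s[_pos])) _pos++; }
}
```

Parsing the dictionary part of the text field excerpt shown earlier with `new MiniPdfParser(text).ParseObject()` returns a `Dictionary<string, object>` in which the key `/T` maps to `Name` and `/P` maps to a `PdfRef` for object 1996.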
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
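The append mechanism can be sketched roughly as below. This is an illustration under simplifying assumptions, not the code behind WritePdf: the names are hypothetical, only one object is updated, everything is ASCII, and the caller is assumed to know the offset of the previous cross reference section as well as the /Size and /Root values from the previous trailer. A real file may need further trailer entries (such as /Info or /ID) carried over.

```csharp
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper, not the article's WritePdf implementation.
public static class IncrementalUpdateSketch
{
    // Appends a new version of one object plus a one-entry cross reference
    // section whose trailer points back to the previous section via /Prev.
    public static byte[] Append(byte[] original, int objNum, int generation,
        string newBody, long prevXrefOffset, int size, string rootRef)
    {
        var output = new MemoryStream();
        output.Write(original, 0, original.Length);
        WriteAscii(output, "\n");

        long objectOffset = output.Position; // byte offset of the new object version
        WriteAscii(output, string.Format(CultureInfo.InvariantCulture,
            "{0} {1} obj\n{2}\nendobj\n", objNum, generation, newBody));

        long xrefOffset = output.Position; // byte offset of the new xref section
        WriteAscii(output, "xref\n");
        WriteAscii(output, string.Format(CultureInfo.InvariantCulture, "{0} 1\n", objNum));
        // each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, 'n', 2-byte EOL
        WriteAscii(output, string.Format(CultureInfo.InvariantCulture,
            "{0:D10} {1:D5} n \n", objectOffset, generation));
        WriteAscii(output, string.Format(CultureInfo.InvariantCulture,
            "trailer\n<< /Size {0} /Root {1} /Prev {2} >>\nstartxref\n{3}\n%%EOF\n",
            size, rootRef, prevXrefOffset, xrefOffset));
        return output.ToArray();
    }

    private static void WriteAscii(Stream stream, string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

The original bytes are never touched, which is why none of the existing offsets become invalid; a reader starts at the last `startxref`, finds the new entry for the updated object, and follows /Prev for everything else.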
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // deliberately ignored: the field is missing or is not a text field
}

// write the updated file back out; "using" closes the stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means that no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
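As an illustration of these two string representations, the sketch below decodes the body of a literal string (escape sequences and octal codes) and the body of a hexadecimal string, where a missing final digit is taken to be 0. The names are hypothetical and the code follows the rules of the PDF Reference rather than the parser's actual implementation; UTF-16 text strings are not handled here.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, not the article's implementation.
public static class PdfStringSketch
{
    // Decodes the body of a literal string (the text between the outer parentheses).
    public static string DecodeLiteral(string body)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\') { sb.Append(c); continue; }
            if (++i >= body.Length) break; // a lone trailing backslash is ignored
            char e = body[i];
            switch (e)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\r': // backslash followed by an end-of-line: line continuation
                    if (i + 1 < body.Length && body[i + 1] == '\n') i++;
                    break;
                case '\n': break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // octal escape: one to three digits
                        int value = 0, digits = 0;
                        while (digits < 3 && i < body.Length && body[i] >= '0' && body[i] <= '7')
                        {
                            value = value * 8 + (body[i] - '0');
                            i++;
                            digits++;
                        }
                        i--; // the for loop advances past the last digit
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        sb.Append(e); // covers \( \) \\ and unknown escapes
                    }
                    break;
            }
        }
        return sb.ToString();
    }

    // Decodes the body of a hexadecimal string (the text between '<' and '>').
    // Whitespace is ignored; a missing final digit is taken to be 0.
    public static byte[] DecodeHex(string body)
    {
        var digits = new StringBuilder();
        foreach (char c in body)
        {
            if (!char.IsWhiteSpace(c)) digits.Append(c);
        }
        if (digits.Length % 2 == 1) digits.Append('0');
        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = byte.Parse(digits.ToString(i * 2, 2),
                NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        }
        return bytes;
    }
}
```

For example, the literal body `Smith \(Jr\)` decodes to `Smith (Jr)`, and the hexadecimal body `901FA` decodes to the three bytes 0x90, 0x1F and 0xA0.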
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Early versions of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1. Added paragraph about appearance streams.
September 25, 2004: Version 1.2. Now supports linearized files. Now supports inherited fields. Uses NAnt. Uses log4net.
October 01, 2004: Version 1.3. Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it). Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior), UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.



Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by a longer string such as "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format: it contains an offset table (the cross-reference table) that records the byte offset at which each object starts.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
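
The arithmetic behind this is simple. In a classic cross-reference table each entry is a fixed-width line holding a 10-digit byte offset and a 5-digit generation number. The sketch below (names invented for illustration, and assuming one byte per character) formats such an entry and shows how far later objects move when a string earlier in the file changes length.

```csharp
using System.Globalization;

public static class XrefEntrySketch
{
    // Formats one in-use entry of a classic cross-reference table:
    // 10-digit byte offset, 5-digit generation number, "n", two-byte line end.
    public static string Format(long offset, int generation)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "{0:D10} {1:D5} n \n", offset, generation);
    }

    // Offset a later object would need after an earlier string was replaced.
    // Assumes every character occupies exactly one byte in the file.
    public static long Shift(long oldOffset, string oldText, string newText)
    {
        return oldOffset + (newText.Length - oldText.Length);
    }
}
```

Replacing "Smith" by "Washington" moves every following object by five bytes, so every table entry that points past the edit is off by five unless it is rewritten.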

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
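
A minimal sketch of that padding workaround follows. The helper name is hypothetical, and it counts characters rather than bytes, so it is only safe for plain single-byte text that needs no escaping.

```csharp
public static class FixedWidthSketch
{
    // Truncates or pads the value with spaces so that it occupies exactly
    // 'width' characters and leaves all later byte offsets unchanged.
    public static string Fit(string value, int width)
    {
        if (value.Length > width)
        {
            return value.Substring(0, width);
        }
        return value.PadRight(width, ' ');
    }
}
```

With a five-character slot, "Washington" is cut down to "Washi" and "Doe" is padded with two spaces, which shows why this approach loses data as soon as a value is longer than the slot.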

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
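
The first step starts at the end of the file: a PDF ends with the keyword startxref followed by the byte offset of the most recent cross reference table. The sketch below shows one way to read that offset. It is not taken from the parser's sources; the class name is invented, and converting the whole file to a string is a shortcut that a real implementation would avoid by reading only the tail of the file.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StartXrefSketch
{
    // Returns the byte offset that follows the last "startxref" keyword,
    // or -1 if the keyword or the number after it cannot be found.
    public static long Find(byte[] pdf)
    {
        // Latin-1 maps each byte to exactly one char, so string indexes
        // are the same as byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
        {
            return -1;
        }

        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start)
        {
            return -1;
        }
        return long.Parse(text.Substring(start, pos - start),
            CultureInfo.InvariantCulture);
    }
}
```

From that offset the table (and any older tables it links to) can be read to obtain the offsets of all objects, which is what the remaining two steps build on.
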
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
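
The append mechanism can be sketched as follows. This is an illustration under simplifying assumptions rather than the parser's actual writer: the class and parameter names are invented, only one object is updated, and the trailer carries just /Size, /Root and /Prev, whereas a real file may require further entries (such as /Info or /ID) to be copied from the previous trailer.

```csharp
using System.Globalization;
using System.IO;
using System.Text;

public static class IncrementalUpdateSketch
{
    // Appends a new version of one object, a one-entry cross reference
    // section, and a trailer whose /Prev points at the previous table.
    public static byte[] Append(byte[] original, int objectNumber, int generation,
        string objectBody, long previousXrefOffset, int size, string rootReference)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        var output = new MemoryStream();
        output.Write(original, 0, original.Length);
        Write(output, latin1, "\n");

        // The new object starts here; the new table records this offset.
        long objectOffset = output.Length;
        Write(output, latin1, string.Format(CultureInfo.InvariantCulture,
            "{0} {1} obj\n{2}\nendobj\n", objectNumber, generation, objectBody));

        // The new cross reference section starts here; startxref records it.
        long xrefOffset = output.Length;
        Write(output, latin1, string.Format(CultureInfo.InvariantCulture,
            "xref\n{0} 1\n{1:D10} {2:D5} n \n",
            objectNumber, objectOffset, generation));
        Write(output, latin1, string.Format(CultureInfo.InvariantCulture,
            "trailer\n<< /Size {0} /Root {1} /Prev {2} >>\nstartxref\n{3}\n%%EOF\n",
            size, rootReference, previousXrefOffset, xrefOffset));
        return output.ToArray();
    }

    private static void Write(Stream stream, Encoding encoding, string text)
    {
        byte[] bytes = encoding.GetBytes(text);
        stream.Write(bytes, 0, bytes.Length);
    }
}
```

Because the original bytes are copied unchanged, none of the existing offsets move; a reader that follows the last startxref finds the new version of the object first and falls back to the older table through /Prev for everything else.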
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
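using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field; this fails if "Name" is missing or is not
// a text field, so report the problem instead of silently ignoring it
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception ex)
{
    Console.Error.WriteLine("Could not set field \"Name\": " + ex.Message);
}

// write the updated file back out; using closes the stream even on error
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}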
Most PDF field properties are also exposed as .NET properties, e.g.:

// a radio button
PdfRadioButtonField radio = ...;
// set the selected button; "Off" means no button is selected
radio.SelectedItem = "MasterCard";
// one button must be pressed
radio.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField text = ...;
// set the text
text.Text = "Hello, World.";
// mark it as a password field
text.Password = true;

// a combo or list box
PdfCHField choice = ...;
// render as combo box
choice.Combo = true;
// more than one item is selectable
choice.MultiSelect = true;
// select items 1 and 3
choice.SetSelectedIndexes(1, 3);
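
In the PDF file itself, Boolean options such as Password, Combo, MultiSelect and NoToggleToOff are stored as single bits of the /Ff (field flags) entry seen in the excerpt earlier. The sketch below shows that mapping. The enum and helper names are my own and are not the parser's types; the bit positions are the ones given in the field flag tables of the PDF Reference, and how the parser's properties are implemented internally is an assumption.

```csharp
using System;

[Flags]
public enum FieldFlags
{
    None = 0,
    Password = 1 << 13,        // text fields, bit 14
    NoToggleToOff = 1 << 14,   // button fields, bit 15
    Radio = 1 << 15,           // button fields, bit 16
    Pushbutton = 1 << 16,      // button fields, bit 17
    Combo = 1 << 17,           // choice fields, bit 18
    MultiSelect = 1 << 21,     // choice fields, bit 22
    DoNotSpellCheck = 1 << 22  // text and choice fields, bit 23
}

public static class FieldFlagsSketch
{
    // Tests whether a flag is set in the integer stored under /Ff.
    public static bool IsSet(int ff, FieldFlags flag)
    {
        return (ff & (int)flag) != 0;
    }

    // Returns the /Ff value with the given flag switched on or off.
    public static int Set(int ff, FieldFlags flag, bool on)
    {
        return on ? ff | (int)flag : ff & ~(int)flag;
    }
}
```

The value /Ff 4194304 from the excerpt is 2 to the power of 22, so only the DoNotSpellCheck bit is set there; turning that field into a password field would mean writing 4202496 instead.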
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal strings including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
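
To illustrate the two string forms, the sketch below decodes a hexadecimal string, treating a missing final digit as 0 as the PDF Reference prescribes, and writes text out as a literal string with the necessary escapes. The class and method names are invented for this example and the code is not copied from the parser.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the digits between < and >. White space is skipped and a
    // missing final digit is treated as 0.
    public static byte[] DecodeHex(string hex)
    {
        var digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (Uri.IsHexDigit(c))
            {
                digits.Append(c);
            }
            else if (!char.IsWhiteSpace(c) && c != '<' && c != '>')
            {
                throw new FormatException("Unexpected character in hex string: " + c);
            }
        }
        if (digits.Length % 2 == 1)
        {
            digits.Append('0');
        }

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return bytes;
    }

    // Writes text as a literal string, escaping backslashes and parentheses.
    public static string EncodeLiteral(string text)
    {
        var sb = new StringBuilder("(");
        foreach (char c in text)
        {
            if (c == '\\' || c == '(' || c == ')')
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.Append(')').ToString();
    }
}
```

Balanced parentheses would not strictly need escaping inside a literal string, but escaping all of them keeps the writer simple and is always valid.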
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
As of version 1.2 (see History), the parser can deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
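The first of these steps might look roughly like the sketch below. It is not the parser's actual code; the names are hypothetical, and it handles only a single classic cross reference section (no cross reference streams, no following of earlier tables), assuming well-formed input.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class XrefSketch
{
    private static readonly char[] Whitespace = { ' ', '\r', '\n', '\t' };

    // Maps each byte to one char so that string indices equal byte offsets.
    public static string ToText(byte[] pdf)
    {
        return new string(Array.ConvertAll(pdf, b => (char)b));
    }

    // Reads the number that follows the last "startxref" keyword.
    public static long ReadStartXref(string text)
    {
        int pos = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) throw new FormatException("startxref not found");
        string[] tokens = text.Substring(pos + "startxref".Length)
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return long.Parse(tokens[0], CultureInfo.InvariantCulture);
    }

    // Parses the section at xrefOffset and returns
    // object number -> byte offset for all in-use ("n") entries.
    public static Dictionary<int, long> ReadXrefSection(string text, long xrefOffset)
    {
        var offsets = new Dictionary<int, long>();
        string[] tokens = text.Substring((int)xrefOffset)
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0 || tokens[0] != "xref")
            throw new FormatException("xref keyword expected");

        int i = 1;
        while (i + 1 < tokens.Length && tokens[i] != "trailer")
        {
            // Subsection header: first object number and entry count.
            int first = int.Parse(tokens[i], CultureInfo.InvariantCulture);
            int count = int.Parse(tokens[i + 1], CultureInfo.InvariantCulture);
            i += 2;
            // Each entry consists of offset, generation, and an n/f flag.
            for (int n = 0; n < count; n++, i += 3)
            {
                if (tokens[i + 2] == "n")
                    offsets[first + n] = long.Parse(tokens[i], CultureInfo.InvariantCulture);
            }
        }
        return offsets;
    }
}
```

With the offsets known, the AcroForm dictionary and the field objects it references can be located and parsed, which is what steps 2 and 3 describe.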
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
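As an illustration of the append mechanism, the following sketch builds the bytes that would be added to the end of a file for one changed object. It is not taken from the parser's sources; the names and parameters are hypothetical. It assumes that the original file ends with a line break, that the object has generation number 0, and that the object body is ASCII so that character counts equal byte counts.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class IncrementalUpdateSketch
{
    public static string BuildUpdate(long originalLength, int objectNumber, string objectBody,
                                     int size, string rootRef, long previousXrefOffset)
    {
        var inv = CultureInfo.InvariantCulture;
        var sb = new StringBuilder();

        // The new version of the object starts where the original file ends.
        long objectOffset = originalLength;
        sb.Append(objectNumber.ToString(inv)).Append(" 0 obj\n");
        sb.Append(objectBody).Append("\nendobj\n");

        // A cross reference section with a single entry for the updated object.
        long xrefOffset = originalLength + sb.Length;
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(objectOffset.ToString("D10", inv)).Append(" 00000 n \n");

        // /Prev links the new table to the old one.
        sb.Append("trailer\n<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootRef);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv)).Append(" >>\n");
        sb.Append("startxref\n").Append(xrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

A reader that starts at the last startxref finds the new entry first and falls back to the table referenced by /Prev for all objects that were not changed.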
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
    // the update failed, e.g. because the field is missing or is not
    // a text field; this example simply skips it
}

// write the updated file back out; the using block closes the stream
// even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create))
{
    reader.WritePdf(fileStream);
}
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField radioButton = ...;
// set the selected button; the value "Off" means that no button is selected
radioButton.SelectedItem = "MasterCard";
// one button must be pressed
radioButton.NoToggleToOff = true;

// a check box
PdfCheckBoxField checkBox = ...;
// check it
checkBox.Checked = true;

// a text field
PdfTXField textField = ...;
// set the text
textField.Text = "Hello, World.";
// mark it as a password field
textField.Password = true;

// a combo or list box
PdfCHField choiceField = ...;
// render as combo box
choiceField.Combo = true;
// more than one item is selectable
choiceField.MultiSelect = true;
// select items 1 and 3
choiceField.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Version 1.0 of the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; support for them was added in version 1.2 (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
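As a rough illustration of the string representations mentioned above, the sketch below decodes a hexadecimal string, interprets the bytes as text, and writes a value back out as a literal string. The class is hypothetical and not the parser's implementation. Two simplifications are assumed: bytes without a UTF-16 byte order mark are mapped one-to-one to characters rather than through PDFDocEncoding, and the literal writer expects characters that fit in one byte.

```csharp
using System;
using System.Text;

public static class PdfStringSketch
{
    // Decodes the digits found between '<' and '>' of a hexadecimal string.
    public static byte[] DecodeHex(string hexDigits)
    {
        var digits = new StringBuilder();
        foreach (char c in hexDigits)
        {
            if (char.IsWhiteSpace(c)) continue; // whitespace is ignored
            if (!Uri.IsHexDigit(c)) throw new FormatException("invalid hex digit: " + c);
            digits.Append(c);
        }
        // A missing final digit is treated as 0.
        if (digits.Length % 2 == 1) digits.Append('0');

        var bytes = new byte[digits.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        return bytes;
    }

    // UTF-16BE if the bytes start with the byte order mark FE FF,
    // otherwise one character per byte (a simplification).
    public static string ToTextString(byte[] bytes)
    {
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        return new string(Array.ConvertAll(bytes, b => (char)b));
    }

    // Writes a value in literal format, escaping backslashes and parentheses.
    public static string ToLiteral(string value)
    {
        var sb = new StringBuilder("(");
        foreach (char c in value)
        {
            if (c == '\\' || c == '(' || c == ')') sb.Append('\\');
            sb.Append(c);
        }
        return sb.Append(')').ToString();
    }
}
```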
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio .NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License.

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.


A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from PostScript, which in turn is a stack-based language in the tradition of Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.



Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
Up to version 1.1 the parser could not deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat; version 1.2 added support for them (see History). Encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
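The hexadecimal string handling mentioned in the first point above can be illustrated with a short sketch. This is not the parser's own code; `HexStringSketch`, `Decode` and `ToText` are hypothetical names. It relies on two rules from the PDF Reference: a missing final hex digit is assumed to be 0, and a text string that starts with the bytes FE FF is UTF-16 (big-endian). Treating all other bytes as Latin-1 is a simplification on my part, since PDFDocEncoding differs from Latin-1 for some codes.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of the article's sources).
public static class HexStringSketch
{
    // Decodes the text between '<' and '>' of a PDF hexadecimal string.
    public static byte[] Decode(string hex)
    {
        StringBuilder digits = new StringBuilder();
        foreach (char c in hex)
        {
            if (Uri.IsHexDigit(c))
            {
                digits.Append(c);
            }
            else if (!char.IsWhiteSpace(c))
            {
                throw new FormatException("Unexpected character in hex string: " + c);
            }
            // whitespace between digits is skipped
        }

        // An odd number of digits means the last digit is missing; it is assumed to be 0.
        if (digits.Length % 2 == 1)
        {
            digits.Append('0');
        }

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        }
        return result;
    }

    // Turns the decoded bytes into a .NET string.
    public static string ToText(byte[] bytes)
    {
        // A leading FE FF byte order mark marks a UTF-16 (big-endian) text string.
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(bytes, 2, bytes.Length - 2);
        }

        // Simplification: map every other byte directly to the same code point.
        StringBuilder text = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            text.Append((char)b);
        }
        return text.ToString();
    }
}
```

For example, `Decode("901FA")` yields the three bytes 90 1F A0, and `ToText(Decode("FEFF0044006F0065"))` yields the string "Doe".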
Tools used
I have written a number of unit tests with the NUnit unit testing framework; the tests are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Last Updated 22 Jun 2006
Article Copyright 2004 by Michael Ganss
A PDF Forms Parser

Michael Ganss, 22 Jun 2006 (posted 19 Aug 2004)

A parser for PDF Forms written in C#.NET.
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a format devised by Adobe Systems, Inc. in 1993 and, since 2008, standardized as ISO 32000. It is derived from PostScript, a stack-based language that resembles Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Acrobat can usually repair minor errors in the offset table, so you will still see something in Acrobat, but clearly this is not the right approach to filling form fields.
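The effect is easy to reproduce on a toy example. The following sketch is only an illustration under simplifying assumptions: the file content is treated as single-byte text held in a string, and `OffsetShiftSketch` and `OffsetOf` are hypothetical names that do not come from the parser.

```csharp
using System;

// Hypothetical helper (not part of the article's sources).
public static class OffsetShiftSketch
{
    // Returns the offset at which "<number> <generation> obj" starts, or -1.
    // Assumes one byte per character, so string indexes equal byte offsets.
    public static int OffsetOf(string pdfText, int number, int generation)
    {
        string header = number + " " + generation + " obj";
        int index = pdfText.IndexOf(header, StringComparison.Ordinal);

        // Skip matches inside a longer object number, e.g. "12 0 obj" for "2 0 obj".
        while (index > 0 && !char.IsWhiteSpace(pdfText[index - 1]))
        {
            index = pdfText.IndexOf(header, index + 1, StringComparison.Ordinal);
        }
        return index;
    }
}
```

With the two-object fragment `"1 0 obj\n<< /V (Smith) >>\nendobj\n2 0 obj\n<< >>\nendobj\n"`, replacing "Smith" by "Washington" moves the start of object 2 by five bytes, while an unmodified offset table would still point at the old position.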

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
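A minimal sketch of this truncate-or-pad idea, with `FixedWidthSketch` and `FitToWidth` as made-up names. It assumes plain ASCII values that contain no parentheses or backslashes, because those would have to be escaped in a literal string and the escaping would change the byte count again.

```csharp
using System;

// Hypothetical helper (not part of the article's sources).
public static class FixedWidthSketch
{
    // Returns a string of exactly "width" characters: longer values are
    // truncated, shorter ones are padded with spaces on the right.
    public static string FitToWidth(string value, int width)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (width < 0) throw new ArgumentOutOfRangeException("width");

        return value.Length > width
            ? value.Substring(0, width)
            : value.PadRight(width);
    }
}
```

Replacing a five-character "Smith" this way turns "Washington" into "Washi" and "Doe" into "Doe" followed by two spaces, so no object offset in the file changes.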

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
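To give an idea of what the first of these steps involves, here is a small sketch that reads one classic cross reference section into a map from object number to byte offset. It is not taken from the parser's sources: `XrefSketch` and `Parse` are hypothetical names, the input is assumed to be the text starting at the `xref` keyword, and cross reference streams as well as following `/Prev` links to older tables are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of the article's sources).
public static class XrefSketch
{
    // Returns object number -> byte offset for all in-use ("n") entries.
    public static Dictionary<int, long> Parse(string xrefText)
    {
        Dictionary<int, long> offsets = new Dictionary<int, long>();
        string[] lines = xrefText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0 || lines[0].Trim() != "xref")
        {
            throw new FormatException("Expected the 'xref' keyword.");
        }

        int i = 1;
        while (i < lines.Length)
        {
            // A subsection header has two numbers: first object number and entry count.
            string[] header = lines[i].Trim().Split(' ');
            if (header.Length != 2)
            {
                break; // e.g. the "trailer" keyword ends the table
            }
            int first = int.Parse(header[0], CultureInfo.InvariantCulture);
            int count = int.Parse(header[1], CultureInfo.InvariantCulture);
            i++;

            for (int k = 0; k < count && i < lines.Length; k++, i++)
            {
                // Entry layout: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
                string[] entry = lines[i].Trim().Split(' ');
                if (entry.Length == 3 && entry[2] == "n")
                {
                    offsets[first + k] = long.Parse(entry[0], CultureInfo.InvariantCulture);
                }
            }
        }
        return offsets;
    }
}
```

With the resulting map, the offset of the AcroForm dictionary and of every form field object can be looked up by object number, which is what steps 2 and 3 build on.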
Click here to Skip to main content
13,036,776 members (59,983 online) Sign in
Home
Click here to Skip to main content

Search for articles, questions, tips
Submit
homearticles
Chapters and Sections>
loading
Search
Latest Articles
Latest Tips/Tricks
Top Articles
Beginner Articles
Technical Blogs
Posting/Update Guidelines
Article Help Forum
Article Competition
Submit an article or tip
Post your Blog
quick answers
Ask a Question about this article
Ask a Question
View Unanswered Questions
View All Questions...
C# questions
ASP.NET questions
SQL questions
VB.NET questions
Javascript questions
discussions
All Message Boards...
Application Lifecycle>
Running a Business
Sales / Marketing
Collaboration / Beta Testing
Work Issues
Design and Architecture
ASP.NET
JavaScript
C / C++ / MFC>
ATL / WTL / STL
Managed C++/CLI
C#
Free Tools
Objective-C and Swift
Database
Hardware & Devices>
System Admin
Hosting and Servers
Java
.NET Framework
Android
iOS
Mobile
SharePoint
Silverlight / WPF
Visual Basic
Web Development
Site Bugs / Suggestions
Spam and Abuse Watch
features
Competitions
News
The Insider Newsletter
The Daily Build Newsletter
Newsletter archive
Surveys
Product Showcase
Research Library
CodeProject Stuff
community
Who's Who
Most Valuable Professionals
The Lounge
The Insider News
The Weird & The Wonderful
The Soapbox
Press Releases
Non-English Language >
General Indian Topics
General Chinese Topics
help
What is 'CodeProject'?
General FAQ
Ask a Question
Bugs and Suggestions
Article Help Forum
Site Map
Advertise with us
About our Advertising
Employment Opportunities
About Us
Articles » General Programming » Algorithms & Recipes » Parsers and Interpreters
Print
Article
Browse Code
Stats
Revisions
Alternatives
Comments (170)
Add your own
alternative version
Tagged as

.NET1.1
VS.NET2003
C#
Windows
.NET
Visual-Studio
Dev
Intermediate
Stats

532.7K views
9.9K downloads
157 bookmarked
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

4.60 (53 votes)
Rate this:
vote 1vote 2vote 3vote 4vote 5
A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It is derived from Postscript, which in turn is derived from the Forth language. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

Hide Copy Code
2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name of the field you assign to it in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.

If you replace the string "Smith" by "Jones" you will find that the field content has not actually changed, but will change only after you click on the field in Acrobat. This is because Acrobat does not use the value of the form field for the visual representation, but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field will Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" by "Washington" Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table with the byte offsets of the start of all objects.

If you change the offset of an object by extending an object earlier in the file but do not fix the offset table, the file gets corrupted. Usually Acrobat can fix minor errors in the offset table so you will usually still see something in Acrobat, but clearly this is not the right approach to filling form fields.

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.

While these workarounds may be appropriate in certain situations, I found them not to be satisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project where form field parsing is necessary instead of a whole library that adds a lot of overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser you might want to look at the iText library which has been ported to several platforms including .NET.
The parser is designed as a straight-forward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (which are a special form of dictionary object) in a recursive descent fashion. Summarizing, these are the steps to parse the whole PDF:

Parse cross reference table(s) identifying byte offsets for all objects.
Parse AcroForm dictionary object identifying form field object identifiers.
Parse all form field objects in recursive descent fashion.
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
Hide Copy Code
// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field
try
{
((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch
{
}

// write the updated file back out
FileStream fileStream = new FileStream(newFilename, System.IO.FileMode.Create);
reader.WritePdf(fileStream);
fileStream.Close();
Most properties of fields are accessible through properties in .NET as well, e.g.:

Hide Copy Code
// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked = true;

// a text field
PdfTXField f = ...;
// set the text
f.Text = "Hello, World.";
// mark it as a password field
f.Password = true;

// a combo or list box
PdfCHField f = ...;
// render as combo box
f.Combo = true;
// more than one item is selectable
f.MultiSelect = true;
// select items 1 and 3
f.SetSelectedIndexes(1, 3);
Points of Interest
The parser can deal with almost all string representations the PDF Reference document provides for, i.e. literal string including escape sequences and hexadecimal strings with possibly missing digits. It can also parse Unicode (UTF-16) encoded text strings. Language detection is not supported, however. Strings are always written out in literal format.
The parser supports all form field types except for signature fields. The supported types are Button (including Pushbutton, Checkbox, and Radio Button), Text, and Choice.
The parser cannot currently deal with linearized PDF files, i.e. files that were saved with the option "optimized for fast web view" in Acrobat. Also, encrypted files cannot be parsed.
For demo forms you might want to download the Adobe Acrobat Forms Samples package which includes a number of forms that exhibit most of the features of PDF forms.
Adobe, Acrobat, and Acrobat Reader are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
Tools used
I have written a number of unit tests using the NUnit unit testing framework which are included with the sources.
Class library documentation can be generated from the sources using the NDoc code documentation generator. The documentation can then be used from within Visual Studio.NET just like the .NET Framework class library documentation. An appropriate configuration file for NDoc is included with the sources.

Both NUnit and NDoc are open source software.

History
August 19, 2004: Version 1.0.
August 26, 2004: Version 1.1.
Added paragraph about appearance streams.
September 25, 2004: Version 1.2.
Now supports linearized files.
Now supports inherited fields.
Uses NAnt.
Uses log4net.
October 01, 2004: Version 1.3.
Fixed a bug parsing objects (thanks to Eddie Neal for helping me find it).
Fixed a number of FxCop issues, particularly regarding naming (thanks to Heath Stewart for making me aware).
License
This article, along with any associated source code and files, is licensed under The BSD License

Share
EMAIL
TWITTER
About the Author

Michael Ganss
Software Developer (Senior) UpdateStar
Germany Germany
Michael Ganss is Managing Director of UpdateStar. UpdateStar offers complete protection from PC vulnerability caused by outdated software. The award-winning UpdateStar offers comfortable software installation, uninstallation, and keeps all of your programs up-to-date. UpdateStar recognizes more than 135,000 software products and lets you know once an update is available for you - for optimized PC security.

Article Copyright 2004 by Michael Ganss
Posted 19 Aug 2004
BSD
A PDF Forms Parser


Michael Ganss, 22 Jun 2006

A parser for PDF Forms written in C#.NET.
Download source - 22.3 Kb
Introduction
Although PDF documents are most often used for static content, they can also be used to represent user-fillable forms, much like HTML forms. PDF forms can be created by taking an existing PDF document and placing form fields on it using e.g. Adobe® Acrobat®. In many scenarios the resulting PDF forms are filled out by human users using a PDF viewing tool such as Adobe Acrobat. The actual data can be separated from the PDF that contains the representation using FDF or XFDF files, the latter being an XML format that contains the content of the form fields of a particular document. By using FDF or XFDF it is easy to programmatically fill out PDF forms in scenarios where the content is generated or queried from a database.

However, in certain scenarios it is required to incorporate the actual content into the PDF itself in order to have just one file that contains both content and representation. The small parser presented in this article helps to do just that, i.e. parse an existing PDF document containing form fields, get and set form field contents programmatically, and write the resulting PDF document back out.

Background
PDF is a proprietary format devised by Adobe Systems, Inc. in 1993. It builds on PostScript, a stack-based language in the same family as Forth. The specification for PDF is publicly available from the Adobe web site.

When I first started out trying to fill a PDF form programmatically, I had no idea what the PDF format looked like. So I just opened a PDF file with a text editor and discovered that the contents were actually human readable (or so it seemed). It was easy to identify the form fields and replace their content. Here's an excerpt from a PDF file that shows how a text field is represented:

2774 0 obj
<<
/Type /Annot
/Subtype /Widget
/Rect [ 27.09381 776.96008 194.09021 789.76807 ]
/F 4
/P 1996 0 R
/AP << /N 14 6 R >>
/DA (/Helv 10 Tf 0 g)
/T (Name)
/FT /Tx
/Ff 4194304
/DV (Smith)
/V (Smith)
>>
endobj
Here, /T (Name) represents, not surprisingly, the name you assign to the field in the properties dialog of Acrobat. It's also easy to figure out that the "Smith" strings in parentheses represent the content of the field. /V stands for the actual value, while /DV represents the default value that the field content reverts to when the field is reset.
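
As a rough illustration of how readable these entries are, the following C# sketch pulls the literal string for a given key out of a dictionary excerpt like the one above. The class and method names (FieldExcerpt, GetLiteral) are hypothetical and not part of the parser presented in this article, and the sketch assumes plain literal strings without nested parentheses or backslash escapes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldExcerpt
{
    // Returns the literal string stored under a dictionary key,
    // e.g. GetLiteral(text, "T") for "/T (Name)" yields "Name".
    // Assumption: the value is a plain literal string with no nested
    // parentheses or backslash escapes, as in the excerpt above.
    public static string GetLiteral(string dictionaryText, string key)
    {
        string pattern = "/" + Regex.Escape(key) + @"\s*\(([^()\\]*)\)";
        Match match = Regex.Match(dictionaryText, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Calling FieldExcerpt.GetLiteral(excerpt, "V") would return the current value and FieldExcerpt.GetLiteral(excerpt, "DV") the default value. A pattern match like this is only good for inspection; it does not understand PDF syntax in general.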

If you replace the string "Smith" with "Jones", you will find that the displayed field content has not changed; it changes only after you click on the field in Acrobat. This is because Acrobat does not render the field from its value but "caches" the visual representation in an appearance stream object referenced from the /AP entry. Only after you click on the field does Acrobat regenerate the appearance stream and thus the visual representation. To work around this problem, you can try to find the appearance stream and change the string there as well.

But there are more problems. If you replace "Smith" with "Washington", Acrobat will report an error. This is because PDF is not in fact a text format but a binary format that contains an offset table (the cross reference table) with the byte offsets of the start of all objects.

If you shift an object by lengthening an object earlier in the file but do not fix the offset table, the file becomes corrupted. Acrobat can usually repair minor errors in the offset table, so you will often still see something, but clearly this is not the right approach to filling form fields.
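
The effect on byte offsets can be seen with a toy example. The following C# sketch builds a tiny two-object text (not a valid PDF, just enough to show the arithmetic) and measures how far the second object moves when the value in the first object is replaced. The names OffsetDemo, OffsetOf and ShiftAfterReplace are made up for this illustration.

```csharp
using System;

public static class OffsetDemo
{
    // Byte offset at which an object header such as "2 0 obj" starts.
    // For pure ASCII text the character index equals the byte offset.
    public static int OffsetOf(string pdfText, string objectHeader)
    {
        return pdfText.IndexOf(objectHeader, StringComparison.Ordinal);
    }

    // How far the second object moves when the value of the first is replaced.
    public static int ShiftAfterReplace(string oldValue, string newValue)
    {
        string original = "1 0 obj\n<< /V (" + oldValue + ") >>\nendobj\n"
                        + "2 0 obj\n<< >>\nendobj\n";
        string edited = original.Replace("(" + oldValue + ")", "(" + newValue + ")");
        return OffsetOf(edited, "2 0 obj") - OffsetOf(original, "2 0 obj");
    }
}
```

Replacing "Smith" with "Jones" leaves the second object where it was, while replacing it with "Washington" moves it by the five extra bytes, so any table entry that still records the old offset now points into the middle of other data.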

A workaround to this problem would be to always replace the exact same number of characters by truncating strings that are too long and padding with whitespace those that are too short. If you have control over the design of the PDF form you might choose as the initial content of each text field a fixed number of whitespace characters that definitely extend over the right edge of the field's box.
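
A minimal C# sketch of this truncate-or-pad workaround might look as follows. PdfPadding and FitToLength are hypothetical names, and the sketch assumes plain ASCII values without parentheses or backslashes, since those would need escaping inside a PDF literal string and would change the byte count.

```csharp
using System;

public static class PdfPadding
{
    // Forces a replacement value to occupy exactly 'length' characters:
    // longer values are truncated, shorter ones are padded with spaces.
    // Assumption: the value is plain ASCII with no characters that need
    // escaping in a PDF literal string.
    public static string FitToLength(string value, int length)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (length < 0) throw new ArgumentOutOfRangeException("length");
        return value.Length > length
            ? value.Substring(0, length)
            : value.PadRight(length);
    }
}
```

With a slot of five characters, "Washington" would be cut to "Washi" and "Doe" would be padded with two trailing spaces, so the byte offsets of all later objects stay unchanged.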

While these workarounds may be appropriate in certain situations, I found them unsatisfying and wrote my own little PDF parser.

The PDF Parser
The parser is not a full-fledged PDF parser but rather a small, one-class parser that can be dropped into any project that needs form field parsing, without adding a whole library and its overhead. Although the parser supports all types of PDF objects except for streams, it parses just the form fields of a PDF file by looking at the AcroForm dictionary. If you need a full-fledged PDF parser, you might want to look at the iText library, which has been ported to several platforms including .NET.

The parser is designed as a straightforward recursive descent parser. Since we are interested only in the form fields, the parser first parses the cross reference tables that contain the offsets of all objects and then finds the AcroForm dictionary that contains the identifiers of all form fields. Once we know the start and end offsets of all form fields, we can parse each form field object (a special form of dictionary object) in a recursive descent fashion. In summary, these are the steps to parse the form fields of a PDF:

1. Parse cross reference table(s) identifying byte offsets for all objects.
2. Parse AcroForm dictionary object identifying form field object identifiers.
3. Parse all form field objects in recursive descent fashion.
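
To give an idea of where the first of these steps begins: a PDF file ends with the keyword startxref followed by the byte offset of the most recent cross reference table. The following C# sketch finds that offset. It is an illustration under stated assumptions, not the code of the parser in this article; XrefLocator and FindStartXref are hypothetical names, and scanning only the last 1024 bytes is a simplification.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class XrefLocator
{
    // Returns the byte offset that follows the last "startxref" keyword,
    // or -1 if none is found in the final 1024 bytes of the file.
    public static long FindStartXref(byte[] pdfBytes)
    {
        int tailLength = Math.Min(pdfBytes.Length, 1024);
        string tail = Encoding.ASCII.GetString(
            pdfBytes, pdfBytes.Length - tailLength, tailLength);

        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        // The offset is the first run of digits after the keyword.
        string rest = tail.Substring(keyword + "startxref".Length);
        Match match = Regex.Match(rest, @"^\s*(\d+)");
        return match.Success
            ? long.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture)
            : -1;
    }
}
```

A caller would pass in the bytes of the file, for example from File.ReadAllBytes, and then seek to the returned offset to start reading the table.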
This leaves us with a list of (C#) objects whose contents can be programmatically queried and updated. In order to write a conformant PDF file, we make use of a feature of the PDF format that provides for easy extensibility of PDF documents. PDF objects provide a simple versioning mechanism that makes it possible to append newer versions of objects already contained in a PDF file to the file. We simply write out all field objects that have changed and add an updated cross reference table that links to the old cross reference table. This same mechanism is also used by Acrobat itself when you change a form field and press the "Save" button. That's why PDF files keep getting bigger although you don't actually add any new content. Only when you do a "Save as" does Acrobat reorganize the PDF and eliminate duplicate object entries.
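
The appended cross reference section described above has a simple fixed layout. The following C# sketch formats such a section for a single rewritten object, with a trailer whose /Prev entry points back to the previous table. The names IncrementalUpdate, FormatEntry and BuildSection and all parameter values are hypothetical; the sketch assumes the rewritten object keeps generation number 0 and is not taken from the article's source.

```csharp
using System.Globalization;
using System.Text;

public static class IncrementalUpdate
{
    // One in-use cross reference entry: 10-digit byte offset, 5-digit
    // generation number, the keyword "n" and a two-byte line ending,
    // which makes every entry exactly 20 bytes long.
    public static string FormatEntry(long offset, int generation)
    {
        return offset.ToString("D10", CultureInfo.InvariantCulture) + " "
             + generation.ToString("D5", CultureInfo.InvariantCulture) + " n \n";
    }

    // Cross reference section and trailer for one rewritten object.
    // Assumption: the object keeps generation number 0.
    public static string BuildSection(int objectNumber, long objectOffset,
        int size, string rootReference, long previousXrefOffset, long newXrefOffset)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder sb = new StringBuilder();
        sb.Append("xref\n");
        sb.Append(objectNumber.ToString(inv)).Append(" 1\n");
        sb.Append(FormatEntry(objectOffset, 0));
        sb.Append("trailer\n");
        sb.Append("<< /Size ").Append(size.ToString(inv));
        sb.Append(" /Root ").Append(rootReference);
        sb.Append(" /Prev ").Append(previousXrefOffset.ToString(inv));
        sb.Append(" >>\n");
        sb.Append("startxref\n").Append(newXrefOffset.ToString(inv)).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

A reader that starts from the new startxref offset finds the updated object first and follows /Prev to the older table for everything that has not changed.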
Using the code
The following example reads a PDF file, parses it, changes the value of a form field and writes an updated PDF file back out.
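using System;
using System.IO;

// read the file and parse it
PdfReader reader = new PdfReader(filename);

// change one text field; the lookup or the cast fails if the form
// has no text field called "Name"
try
{
    ((PdfTXField)reader.FieldsByName["Name"]).Text = "Doe";
}
catch (Exception)
{
    // field missing or not a text field: leave the form unchanged
}

// write the updated file back out; the using block closes the
// stream even if WritePdf throws
using (FileStream fileStream = new FileStream(newFilename, FileMode.Create))
{
    reader.WritePdf(fileStream);
}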
Most properties of fields are accessible through properties in .NET as well, e.g.:

// a radio button
PdfRadioButtonField f = ...;
// set the selected button, "Off" means just that.
f.SelectedItem = "MasterCard";
// one button must be pressed
f.NoToggleToOff = true;

// a check box
PdfCheckBoxField f = ...;
// check it
f.Checked =