Extracting images from PDFs in C#

TinyMCE is a JavaScript rich-text editor that is highly extensible. For example, this is a screenshot of what it looks like in my WordPress installation:

 

[Screenshot: wordpress-tinymce]

There are a lot of customization hooks, and because it’s used in WordPress, it gets a lot of maintenance. For instance, it supports pasting from Word pretty well. There is even a flag you can enable to allow people to paste in images (these are base64-encoded and embedded directly in the page).
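To make the base64 embedding concrete, here is a minimal sketch in C# of what an embedded image looks like in the resulting HTML. This is not TinyMCE's code; the `DataUri` class and its method names are made up for illustration, and it assumes you already have the raw image bytes and know the MIME type.

```csharp
using System;

public static class DataUri
{
    // Builds a data: URI that carries the image bytes inline as base64 text.
    public static string FromImage(byte[] imageBytes, string mimeType)
    {
        if (imageBytes == null) throw new ArgumentNullException(nameof(imageBytes));
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(imageBytes);
    }

    // Wraps the data URI in an <img> tag, roughly as it would sit in the page markup.
    public static string ToImgTag(byte[] imageBytes, string mimeType)
    {
        return "<img src=\"" + FromImage(imageBytes, mimeType) + "\" />";
    }
}
```

Base64 encodes every 3 bytes as 4 characters, so an image embedded this way takes about a third more space than the original file, which is one reason a real media library is still attractive.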

The toughest area to work on is adding a media library, since that typically depends on the backend services (e.g. you’d need a different implementation depending on whether you use PHP, C#, JavaScript, etc., and Postgres / MySQL / Mongo).

WordPress lets you drag and drop images onto the page, and then they go into a paged list of images that you can search later.

[Screenshot: wordpress-upload]

It surprises me that more tools don’t let you upload PDFs and automatically extract the images, but as I’ve found out, this is more difficult than I’d anticipated.
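To see why it is harder than it sounds, consider the simplest thing that could possibly work, sketched below with no PDF library at all. JPEG images in a PDF are often stored as raw JPEG data, so you can scan the file for the JPEG start marker (FF D8 FF) and cut at the next end marker (FF D9). This is a deliberately naive technique, not how any of the libraries discussed here work, and the `NaiveJpegCarver` name is made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveJpegCarver
{
    // Returns every byte run that starts with a JPEG start-of-image marker
    // (FF D8 FF) and ends at the next end-of-image marker (FF D9).
    public static List<byte[]> Carve(byte[] pdfBytes)
    {
        var images = new List<byte[]>();
        int i = 0;
        while (i + 2 < pdfBytes.Length)
        {
            bool isStart = pdfBytes[i] == 0xFF
                        && pdfBytes[i + 1] == 0xD8
                        && pdfBytes[i + 2] == 0xFF;
            if (!isStart) { i++; continue; }

            int end = FindEndOfImage(pdfBytes, i + 3);
            if (end < 0) break; // no end marker found: stop rather than guess

            var jpeg = new byte[end - i];
            Array.Copy(pdfBytes, i, jpeg, 0, jpeg.Length);
            images.Add(jpeg);
            i = end; // continue scanning after this image
        }
        return images;
    }

    // Returns the index just past the first FF D9 at or after 'from', or -1 if none.
    private static int FindEndOfImage(byte[] data, int from)
    {
        for (int j = from; j + 1 < data.Length; j++)
        {
            if (data[j] == 0xFF && data[j + 1] == 0xD9) return j + 2;
        }
        return -1;
    }
}
```

You would feed it something like `NaiveJpegCarver.Carve(File.ReadAllBytes(path))`. It misses every image that is not stored as a plain JPEG (for example, compressed bitmaps), it can cut an image short if the JPEG contains an embedded thumbnail with its own end marker, and it finds nothing in an encrypted file. Handling those cases properly is what a real PDF library is for.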

In .NET, there are a few options for PDF libraries, including PDFBox, Aspose, and iTextSharp. I believe that all three are originally Java libraries, which adds some complexity.

PDFBox is an Apache library, so it is the “cheapest”, but only if you place a low value on your time. To use PDFBox, you have to run the library through IKVM (or download a copy from someone who has). IKVM converts Java bytecode to .NET, and adds a ton of libraries to replace the JDK. Unfortunately interop is a big pain, since you need to write wrapper classes for things like streams. Image processing depends on AWT, which didn’t work in the versions of PDFBox I found, and after a few hours of poking at this I abandoned the approach.

iTextSharp is licensed under the AGPL, which lets you read the source, but if you want to use it without releasing your own source you need to purchase a commercial license. At one point it was licensed more openly. Someone ported it to C# under the old license and published it on NuGet, which is an option for testing. By now that port is probably missing a lot of bug fixes (you can certainly find many Stack Overflow posts where iText reps say this).

There are a lot of examples of how to do this already written, so for completeness, this is a good one to start from:

The biggest problem I’ve had with this is that images in PDFs may include transparency, and this information can get lost on the way out.
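One likely cause (I have not confirmed it is the one at work here) is that PDF does not store alpha inside the image itself. An image with transparency usually points to a second, grayscale image through an /SMask entry, and an extractor that only dumps the main image stream drops it. The sketch below shows the recombination step in plain C#. It assumes both images are already decoded to raw 8-bit samples with identical dimensions; the `SoftMask` class is made up for illustration and is not part of any PDF library.

```csharp
using System;

public static class SoftMask
{
    // Interleaves 8-bit RGB samples with an 8-bit grayscale mask to produce
    // RGBA pixels. Both inputs are assumed to be decoded, row-major, and to
    // share the same width and height.
    public static byte[] Apply(byte[] rgb, byte[] mask, int width, int height)
    {
        if (rgb == null) throw new ArgumentNullException(nameof(rgb));
        if (mask == null) throw new ArgumentNullException(nameof(mask));

        int pixels = checked(width * height);
        if (rgb.Length != pixels * 3)
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));
        if (mask.Length != pixels)
            throw new ArgumentException("Expected 1 byte per pixel.", nameof(mask));

        var rgba = new byte[pixels * 4];
        for (int p = 0; p < pixels; p++)
        {
            rgba[p * 4]     = rgb[p * 3];     // red
            rgba[p * 4 + 1] = rgb[p * 3 + 1]; // green
            rgba[p * 4 + 2] = rgb[p * 3 + 2]; // blue
            rgba[p * 4 + 3] = mask[p];        // alpha: 0 = transparent, 255 = opaque
        }
        return rgba;
    }
}
```

In real files the mask can have a different resolution than the image it belongs to, so it may need resampling first, and the result has to be saved in a format that supports alpha (PNG rather than JPEG) or the transparency is lost again.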