{"id":3095,"date":"2016-02-09T01:20:29","date_gmt":"2016-02-09T01:20:29","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=3095"},"modified":"2016-02-09T01:20:29","modified_gmt":"2016-02-09T01:20:29","slug":"extracting-images-from-pdfs-in-c-sharp","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/extracting-images-from-pdfs-in-c-sharp\/","title":{"rendered":"Extracting images from PDFs in C#"},"content":{"rendered":"<p>TinyMCE is a Javascript rich-text editor that allows for a lot of extensibility. For example, this is a screenshot of what it looks like in my WordPress installation:<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3102\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2016\/02\/wordpress-tinymce.png\" alt=\"wordpress-tinymce\" width=\"815\" height=\"229\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-tinymce.png 815w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-tinymce-300x84.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-tinymce-768x216.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/p>\n<p>There are a lot of customization hooks, and because it&#8217;s used in WordPress, it gets a lot of maintenance. For instance, it supports pasting from Word pretty well. There is even a flag you can enable to allow people to paste in images (this base64 encodes them in the page).<\/p>\n<p>The toughest area to work on is adding a media library, since that is typically dependent on the backend services (i.e. 
you&#8217;d need a different implementation depending on whether you use PHP, C#, Javascript, etc., and Postgres \/ MySQL \/ Mongo).<\/p>\n<p>WordPress lets you drag and drop images onto the page, and then they go into a paged list of images that you can search later.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3101\" src=\"http:\/\/172.104.26.128\/wp-content\/uploads\/2016\/02\/wordpress-upload.png\" alt=\"wordpress-upload\" width=\"790\" height=\"543\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-upload.png 790w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-upload-300x206.png 300w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2016\/02\/wordpress-upload-768x528.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/p>\n<p>It surprises me that more tools don&#8217;t let you upload PDFs and automatically extract the images, but as I&#8217;ve found out, this is a little more difficult than I&#8217;d anticipated.<\/p>\n<p>In .NET, there are a few options for PDF libraries, including PDFBox, Aspose, and iTextSharp. I believe that all three are originally Java libraries, which adds some complexity.<\/p>\n<p>PDFBox is an Apache library, so it is the &#8220;cheapest&#8221;, but only if you place a low value on your time. In order to use PDFBox, you have to run the library through IKVM (or download a copy from someone who has). IKVM converts bytecode from Java to .NET, and adds a ton of libraries to replace the JDK. Unfortunately, it&#8217;s a big pain to do interop, since you need to write wrapper classes for things like streams. 
Image processing depends on AWT, which didn&#8217;t work in the versions of PDFBox I found, and after a few hours of poking at this I abandoned the approach.<\/p>\n<p>iTextSharp is under a license that lets you read the source, but if you want to use it without releasing your own source, you&#8217;d need to purchase a license. At one point it was licensed more openly. Someone ported it to C# under the old license and added it to NuGet, which is an option for testing. By now that port is probably missing a lot of bug fixes (you can certainly find many Stack Overflow posts where iText reps say this).<\/p>\n<p>Many examples of how to do this have already been written; for completeness, this is a good one to start from:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/7shi\/805326.js\"><\/script><\/p>\n<p>The biggest problem I&#8217;ve had with this is that images in PDFs apparently may include transparency, and this information can get lost on the way out.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TinyMCE is a Javascript rich-text editor that allows for a lot of extensibility. For example, this is a screenshot of what it looks like in my WordPress installation: &nbsp; There are a lot of customization hooks, and because it&#8217;s used in WordPress, it gets a lot of maintenance. 
For instance, it supports pasting from Word &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/extracting-images-from-pdfs-in-c-sharp\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Extracting images from PDFs in C#&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[22],"tags":[96,299,419],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3095"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=3095"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3095\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=3095"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=3095"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=3095"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}