Extract Images from PDF with Apache Tika

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.

My use case: I want code that extracts the text content and, separately, the images from any document (not necessarily a PDF). The output then gets passed into an Apache UIMA pipeline.

I've been able to extract images from other document types by using a custom parser (built on an AutoDetectParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.

Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?

TIKA-1268 and TIKA-1396 were both marked as fixed in 1.6. Are you sure you're really using Tika 1.6 for this?
– Gagravarr
Sep 11 '14 at 10:37

Assuming that the one marked 1.6 on the website and that is called tika-app-1.6.jar is actually Tika 1.6, then yes I'm sure!
– James Baker
Sep 11 '14 at 12:31

And you're trying the Tika App with the --extract flag to test the image extraction?
– Gagravarr
Sep 11 '14 at 12:52

I'm trying to do it programmatically, but I've also tried the --extract flag and the GUI, and haven't managed to find the images in the document with either method.
– James Baker
Sep 11 '14 at 13:42

Sounds like you need to hop onto one of those bugs then, and flag up that it isn't properly fixed yet
– Gagravarr
Sep 11 '14 at 14:02

2 Answers

Try the code below; the returned ContentHandler holds your XML content.

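    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.extractor.EmbeddedDocumentExtractor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public ContentHandler convertPdf(byte[] content, String path, String filename)
            throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        ContentHandler handler = new ToXMLContentHandler();
        PDFParser parser = new PDFParser();

        // Inline image extraction is off by default, so switch it on.
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(true);
        config.setExtractUniqueInlineImagesOnly(true);
        parser.setPDFParserConfig(config);

        // Called once per embedded resource; writes each one under 'path'.
        EmbeddedDocumentExtractor embeddedDocumentExtractor = new EmbeddedDocumentExtractor() {
            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                    Metadata metadata, boolean outputHtml) throws SAXException, IOException {
                Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
                Files.copy(stream, outputFile);
            }
        };

        context.set(PDFParser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

        try (InputStream stream = new ByteArrayInputStream(content)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler;
    }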
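
One thing to watch in parseEmbedded above: Files.copy(InputStream, Path) throws FileAlreadyExistsException if the target already exists, and the resource name from the metadata may be missing for some embedded resources. Below is a minimal sketch of a more defensive save step using only the JDK. EmbeddedImageSaver, safeName and the "embedded-N" fallback name are hypothetical names made up for this illustration, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Tika class): writes embedded streams into one folder. */
final class EmbeddedImageSaver {
    private final Path outputDir;
    private int saved = 0; // number of streams handled so far

    EmbeddedImageSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Keeps only the last path segment; falls back to a generated name if nothing usable remains. */
    static String safeName(String resourceName, int index) {
        String name = resourceName == null ? "" : resourceName.trim();
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(cut + 1);
        boolean unusable = name.isEmpty() || name.equals(".") || name.equals("..");
        return unusable ? "embedded-" + index : name;
    }

    /** Copies the stream into outputDir without overwriting an existing file. */
    Path save(InputStream stream, String resourceName) throws IOException {
        Files.createDirectories(outputDir);
        String name = safeName(resourceName, saved);
        saved++;
        Path target = outputDir.resolve(name);
        // Prefix a counter until the name is free, e.g. "1-image0.png".
        for (int n = 1; Files.exists(target); n++) {
            target = outputDir.resolve(n + "-" + name);
        }
        Files.copy(stream, target);
        return target;
    }
}
```

Inside parseEmbedded this would be called as something like saver.save(stream, metadata.get(Metadata.RESOURCE_NAME_KEY)), with saver created once per document. Stripping directory components from the name also keeps a crafted resource name from writing outside the output folder.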

It is possible to use an AutoDetectParser to extract images, without relying on PDFParser directly. This code works just as well for extracting images from docx, pptx, etc.

Here I have a parseDocument() and a setPdfConfig() function, which make use of an AutoDetectParser.

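In outline: create an AutoDetectParser; attach an EmbeddedDocumentExtractor to a ParseContext; attach the AutoDetectParser to the same ParseContext; attach a PDFParserConfig to the same ParseContext; then pass that ParseContext to AutoDetectParser.parse().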

The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.
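
As a rough illustration of that naming scheme, here is a small JDK-only sketch that derives the output folder from the source path. ImageOutputDirs and forSource are hypothetical names, and the sketch assumes the full file name (extension included) is kept in front of the underscore, which the answer does not spell out.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: computes the "<sourceFile>_" folder next to a source file. */
final class ImageOutputDirs {
    static Path forSource(Path sourceFile) {
        Path abs = sourceFile.toAbsolutePath();
        Path fileName = abs.getFileName();
        if (fileName == null) {
            // A filesystem root has no file name to build on.
            throw new IllegalArgumentException("Not a file path: " + sourceFile);
        }
        // Assumption: the extension stays in the folder name, e.g. "report.pdf_".
        return abs.resolveSibling(fileName.toString() + "_");
    }

    public static void main(String[] args) {
        System.out.println(forSource(Paths.get("docs", "report.pdf")));
    }
}
```

The folder is only computed here, not created; Files.createDirectories can be called on the result before the first image is written.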

    private static void setPdfConfig(ParseContext context) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);
        context.set(PDFParserConfig.class, pdfConfig);
    }

    private static String parseDocument(String path) {
        ...
        } catch (... TikaException e) {
            e.printStackTrace();
        }
        return xhtmlContents;
    }
