Extract Images from PDF with Apache Tika




Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.



My use case: I want code that extracts the text content and, separately, the images from any document (not necessarily a PDF). The results are then passed into an Apache UIMA pipeline.



I've been able to extract images from other document types by using a custom parser (built on an AutoDetectParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.



Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?





TIKA-1268 and TIKA-1396 were both marked as fixed in 1.6, are you sure you're really using Tika 1.6 for this?
– Gagravarr
Sep 11 '14 at 10:37





Assuming that the one marked 1.6 on the website and that is called tika-app-1.6.jar is actually Tika 1.6, then yes I'm sure!
– James Baker
Sep 11 '14 at 12:31





And you're trying the Tika App with the --extract flag to test the image extraction?
– Gagravarr
Sep 11 '14 at 12:52







I'm trying to do it programmatically, but I've also tried the --extract flag and the GUI, and haven't managed to find the images in the document with either method.
– James Baker
Sep 11 '14 at 13:42





Sounds like you need to hop onto one of those bugs then, and flag up that it isn't properly fixed yet
– Gagravarr
Sep 11 '14 at 14:02




2 Answers



Try the code below. The returned ContentHandler holds your XML content.


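import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public ContentHandler convertPdf(byte[] content, final String path, String filename)
        throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();

    // Turn on inline image extraction; each unique image is extracted only once.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);

    parser.setPDFParserConfig(config);

    // Called for every embedded resource: copy each one to a file under `path`.
    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.copy(stream, outputFile);
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}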
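
One thing to watch in the code above: the output file is built by concatenating path with the embedded resource name. If the metadata has no resource name, the concatenation yields a file literally ending in "null", and Files.copy without options throws FileAlreadyExistsException when the target already exists. The sketch below shows one way to guard against both using only java.nio. The class EmbeddedFileSaver, its methods safeName, uniqueTarget and save, and the fallback name "embedded.bin" are all made up for illustration; they are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmbeddedFileSaver {

    /** Keeps only the last path segment; returns the fallback when nothing usable remains. */
    static String safeName(String resourceName, String fallback) {
        if (resourceName == null || resourceName.trim().isEmpty()) {
            return fallback;
        }
        // Treat both separators as path separators, whatever the platform.
        String normalized = resourceName.replace('\\', '/');
        String last = normalized.substring(normalized.lastIndexOf('/') + 1);
        boolean unusable = last.isEmpty() || last.equals(".") || last.equals("..");
        return unusable ? fallback : last;
    }

    /** Returns a path inside dir that does not exist yet: name, then name-1, name-2, ... */
    static Path uniqueTarget(Path dir, String name) {
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        Path candidate = dir.resolve(name);
        for (int i = 1; Files.exists(candidate); i++) {
            candidate = dir.resolve(base + "-" + i + ext);
        }
        return candidate;
    }

    /** Copies the stream into dir under a sanitized, non-clashing name and returns the target. */
    static Path save(InputStream stream, Path dir, String resourceName) throws IOException {
        Files.createDirectories(dir);
        Path target = uniqueTarget(dir, safeName(resourceName, "embedded.bin"));
        Files.copy(stream, target);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tika-images");
        byte[] fakeImage = {1, 2, 3};
        // The second call reuses the same name, so it lands in a suffixed file.
        System.out.println(save(new ByteArrayInputStream(fakeImage), dir, "image0.png"));
        System.out.println(save(new ByteArrayInputStream(fakeImage), dir, "../image0.png"));
    }
}
```

Inside parseEmbedded, a call such as save(stream, Paths.get(path), metadata.get(Metadata.RESOURCE_NAME_KEY)) could then replace the two lines that build the path and copy the stream.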



It is possible to use an AutoDetectParser to extract images, without relying on PDFParser directly. This code works just as well for extracting images from docx, pptx, etc.





Here I have two functions, parseDocument() and setPdfConfig(), which make use of an AutoDetectParser.


The steps are: create an AutoDetectParser; attach an EmbeddedDocumentExtractor to a ParseContext; attach the AutoDetectParser to the same ParseContext; attach a PDFParserConfig to the same ParseContext; then pass the ParseContext to AutoDetectParser.parse().



The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.


private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}

private static String parseDocument(String path) {
    [...]
    } catch ([...] TikaException e) {
        e.printStackTrace();
    }

    return xhtmlContents;
}
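
Since most of the parseDocument() body is missing above, here is a rough sketch of how such a method could look, pieced together from the description in this answer and the Tika 1.x API. It is not the answerer's original code. The class name ImageExtractionSketch, the helper outputDirFor, the "embedded-N" fallback names and the choice to overwrite existing files are all assumptions. It also assumes Tika 1.x, where Metadata.RESOURCE_NAME_KEY is available.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ImageExtractionSketch {

    /** Hypothetical helper: maps docs/a.pdf to the sibling folder docs/a.pdf_ */
    static Path outputDirFor(Path source) {
        return source.resolveSibling(source.getFileName() + "_");
    }

    static String parseDocument(String path) {
        String xhtmlContents = "";
        final Path source = Paths.get(path);
        final Path outputDir = outputDirFor(source);

        AutoDetectParser parser = new AutoDetectParser();
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        // Saves every embedded resource (inline images included) into outputDir.
        EmbeddedDocumentExtractor extractor = new EmbeddedDocumentExtractor() {
            private int counter = 0;

            @Override
            public boolean shouldParseEmbedded(Metadata embeddedMetadata) {
                return true;
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler embeddedHandler,
                                      Metadata embeddedMetadata, boolean outputHtml)
                    throws SAXException, IOException {
                Files.createDirectories(outputDir);
                String name = embeddedMetadata.get(Metadata.RESOURCE_NAME_KEY);
                if (name == null || name.isEmpty()) {
                    name = "embedded-" + counter; // assumed fallback naming
                }
                counter++;
                // Keep only the last path segment so files stay inside outputDir.
                Path target = outputDir.resolve(Paths.get(name).getFileName());
                Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
            }
        };

        context.set(EmbeddedDocumentExtractor.class, extractor);
        // Registered as the answer describes; this extractor does not itself re-parse the stream.
        context.set(Parser.class, parser);

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(true);
        context.set(PDFParserConfig.class, pdfConfig);

        try (InputStream stream = Files.newInputStream(source)) {
            parser.parse(stream, handler, metadata, context);
            xhtmlContents = handler.toString();
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }

        return xhtmlContents;
    }
}
```

Because the PDFParserConfig is placed on the ParseContext rather than set on a PDFParser instance, the same method can be pointed at docx or pptx files and the PDF settings are simply ignored for those formats.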





