Extract Images from PDF with Apache Tika
Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.
My use case is that I want some code that will extract the content and, separately, the images from any document (not necessarily a PDF). This then gets passed into an Apache UIMA pipeline.
I've been able to extract images from other document types by using a custom parser (built on an AutoParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.
Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?
Assuming that the download marked 1.6 on the website, the one called tika-app-1.6.jar, is actually Tika 1.6, then yes, I'm sure!
– James Baker
Sep 11 '14 at 12:31
And you're trying the Tika App with the --extract flag to test the image extraction?
– Gagravarr
Sep 11 '14 at 12:52
I'm trying to do it programmatically, but I've also tried the --extract flag and the GUI, and haven't managed to find the images in the document with either method.
– James Baker
Sep 11 '14 at 13:42
Sounds like you need to hop onto one of those bugs then, and flag up that it isn't properly fixed yet
– Gagravarr
Sep 11 '14 at 14:02
2 Answers
Try the code below; the returned ContentHandler holds your XML content.
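import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public ContentHandler convertPdf(byte[] content, final String path, String filename)
        throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();
    // Turn on inline image extraction, which is off by default.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);
    parser.setPDFParserConfig(config);
    // Called once per embedded resource; here each one is written to disk.
    EmbeddedDocumentExtractor embeddedDocumentExtractor = new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // path is concatenated as-is, so it must end with a file separator.
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            Files.copy(stream, outputFile);
        }
    };
    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }
    return handler;
}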
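One thing to watch in the parseEmbedded callback above: metadata.get(...) returns null when no resource name has been set, and Files.copy(InputStream, Path) throws FileAlreadyExistsException if the target file is already there. A minimal sketch of a helper that guards against both, using only the JDK, could look like the following. The class EmbeddedFileTargets, its method names and the fallback naming scheme are all hypothetical; nothing here is part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Tika): picks a safe, unused file for an embedded resource. */
final class EmbeddedFileTargets {
    private EmbeddedFileTargets() {}

    /**
     * Maps a resource name to a file directly inside outputDir.
     * Directory parts of the name are dropped, a missing name gets a
     * fallback based on index, and a numeric prefix is added on collisions.
     */
    static Path resolve(Path outputDir, String resourceName, int index) {
        String name = resourceName == null ? "" : resourceName.trim();
        // Keep only the part after the last '/' or '\' so the file stays in outputDir.
        int cut = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(cut + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "embedded-" + index + ".bin"; // assumed fallback naming
        }
        Path candidate = outputDir.resolve(name);
        for (int n = 1; Files.exists(candidate); n++) {
            candidate = outputDir.resolve(n + "-" + name);
        }
        return candidate;
    }

    /** Creates outputDir if needed and copies the stream to the resolved target. */
    static Path save(InputStream stream, Path outputDir, String resourceName, int index)
            throws IOException {
        Files.createDirectories(outputDir);
        Path target = resolve(outputDir, resourceName, index);
        Files.copy(stream, target);
        return target;
    }
}
```

Inside parseEmbedded, the two lines that build outputFile and copy the stream could then become a single call such as EmbeddedFileTargets.save(stream, Paths.get(path), metadata.get(Metadata.RESOURCE_NAME_KEY), counter++), where counter is a field you would add to the anonymous class.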
It is possible to use an AutoParser (Tika's AutoDetectParser) to extract images, without relying on PDFParser. This code works just as well for extracting images from docx, pptx, etc.
Here I have a parseDocument() and a setPdfConfig() function, which make use of an AutoParser. The classes involved are AutoParser, EmbeddedDocumentExtractor, ParseContext and PDFParserConfig, and the ParseContext is what gets handed to AutoParser.parse().
The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.
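The folder naming described above can be sketched with java.nio alone. This is only an illustration: the class ImageFolders and the method forSource are hypothetical names, and it assumes that <sourceFile> means the full file name including its extension, which the answer does not spell out.

```java
import java.nio.file.Path;

/** Hypothetical helper: derives the "<sourceFile>_/" image folder next to a source file. */
final class ImageFolders {
    private ImageFolders() {}

    /** Returns a sibling of sourceFile whose name is the file name plus "_". */
    static Path forSource(Path sourceFile) {
        Path fileName = sourceFile.getFileName();
        if (fileName == null) {
            throw new IllegalArgumentException("Not a file path: " + sourceFile);
        }
        // Assumption: the extension is kept, so report.pdf maps to report.pdf_
        return sourceFile.resolveSibling(fileName.toString() + "_");
    }
}
```

For a source file docs/report.pdf this yields docs/report.pdf_, which still has to be created (for example with Files.createDirectories) before any images are written into it.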
private static void setPdfConfig(ParseContext context) {
    // Turn on inline image extraction and register the config on the shared context.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);
    context.set(PDFParserConfig.class, pdfConfig);
}

private static String parseDocument(String path) {
    // ...
    } catch (... TikaException e) {
        e.printStackTrace();
    }
    return xhtmlContents;
}
TIKA-1268 and TIKA-1396 were both marked as fixed in 1.6, are you sure you're really using Tika 1.6 for this?
– Gagravarr
Sep 11 '14 at 10:37