Issue
I am developing standard alone Java batch process. I am trying to determine file attachment mimetype using Tika Jars. I am using Tika 1.4 Jar files.
My code look like
Parser parser= new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler =-1;
ContentHandler contentHandler= new BodyContentHandler(writerHandler);
Metadata metadata= new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: "+fileattachment.getName()+" MimeType is: "+mimeType);
This code is not working properly for the office 03 and 07 documents.
While running from eclipse I am getting correct mimetypes.
I build jar file and running from command its giving wrong mimetypes.
out put from command
------------
File Attachment: Testpdf.pdf MimeType is: application/pdf
File Attachment: Testpdf.tif MimeType is: image/tiff
File Attachment: Testpdf.xlsx MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls MimeType is: application/zip
File Attachment: Testpdf.doc MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt MimeType is: application/vnd.ms-excel
I tried with OfficePraser, OOXMLParser. Its not working. I have tried with tika 0.9 jar files. mimeTypes are coming correctly but if any one of my file attachment is "editable pdf" my batch process is dying (like "exit(0);" in code). If I have new tika jars its giving wrong mimeTypes.
Please help me in this. Thanks in advance.
CVSR Sarma
Solution
Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (javadocs) directly, eg:
TikaConfig tika = new TikaConfig();
Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
String mimetype = tika.getDetector().detect(stream, metadata);
If you have only the tika-core
jar on your classpath, then the detection above will use Mime Magic and Filename hints. That'll let it get most files, especially if they have the right extension, but it'll struggle only wrongly named "container formats"
Container Formats are things like zip, ole2 etc, where one file format can hold many types (eg ods, xlsx, keynote all use .zip, .doc and .xls both use ole2). If you want to do detection that looks inside containers for more accurate results, you need to also include the tika-parsers-standard
jar and its dependencies.
Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This is so that Tika can read the first bit of your stream, look at it to work out what your file is, then return the stream to how it was ready for other uses (eg parsing). Most streams should, but if yours doesn't, the simplest way to fix it is to wrap it in a TikaInputStream via TikaInputStream.get, which sorts all that out for you
Answered By - Gagravarr
Answer Checked By - Mary Flores (JavaFixing Volunteer)