Hadoop Binary Streaming and PDF File Inclusion
In a previous post I talked about Hadoop Binary Streaming for the processing of Microsoft Office Word documents. However, given the popularity of Adobe PDF documents, I thought adding support for them would be beneficial. To this end I have updated the source code to support processing of both “.docx” and “.pdf” documents.
iTextSharp
To support reading PDFs I have used the open source library provided by iText (https://itextpdf.com/). iText is a library that allows you to read, create and manipulate PDF documents (https://itextpdf.com/download.php). The original code was written in Java but a port for .NET, iTextSharp, is also available (https://sourceforge.net/projects/itextsharp/files/).
In using these libraries I only use the PdfReader class, from the Core library. This class allows one to derive the page count, and the Author from an Info property.
To use the library in Hadoop one just has to specify a file property for the iTextSharp core library:
-file "C:\Reference Assemblies\itextsharp.dll"
This assumes the downloaded and extracted DLL has been copied to and referenced from the “Reference Assemblies” folder.
Source Code Changes
To support the PDF document inclusion only two changes were necessary to the code.
Firstly, a new Mapper was defined that supports the processing of a PdfReader type and returns the author and pages for the document:
namespace FSharp.Hadoop.MapReduce

open System
open iTextSharp.text
open iTextSharp.text.pdf

// Calculates the pages per author for a Pdf document
module OfficePdfPageMapper =

    let authorKey = "Author"
    let unknownAuthor = "unknown author"

    let getAuthors (document:PdfReader) =
        // For PDF documents perform the split on a ","
        if document.Info.ContainsKey(authorKey) then
            let creators = document.Info.[authorKey]
            if String.IsNullOrWhiteSpace(creators) then
                [| unknownAuthor |]
            else
                creators.Split(',')
        else
            [| unknownAuthor |]

    let getPages (document:PdfReader) =
        // return page count
        document.NumberOfPages

    // Map the data from input name/value to output name/value
    let Map (document:PdfReader) =
        let pages = getPages document
        (getAuthors document)
        |> Seq.map (fun author -> (author, pages))
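One thing to be aware of with a plain split on "," is that an Author value such as "Alice Smith, Bob Jones" produces " Bob Jones" with a leading space, which then becomes a separate key in the output. The following is a minimal sketch of a trimming variant. The splitAuthors helper and the sample names are hypothetical and are not part of the original source:

```fsharp
open System

// Hypothetical helper (not in the original source): split an Author value
// on commas, trim each name, and drop empty entries.
let splitAuthors (unknownAuthor: string) (creators: string) =
    if String.IsNullOrWhiteSpace(creators) then
        [| unknownAuthor |]
    else
        let names =
            creators.Split(',')
            |> Array.map (fun name -> name.Trim())
            |> Array.filter (fun name -> name.Length > 0)
        // A value such as "," yields no usable names at all
        if names.Length = 0 then [| unknownAuthor |] else names

// Each author is paired with the full page count, as in Map above
let pairs =
    splitAuthors "unknown author" "Alice Smith, Bob Jones"
    |> Array.map (fun author -> (author, 12))
// pairs = [| ("Alice Smith", 12); ("Bob Jones", 12) |]
```

Whether trimming is wanted depends on how the Author values were written by the producing application, so treat this as an option and not a required change.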
Secondly, one has to call the correct mapper based on the document type, which is determined from the file extension:
let (|WordDocument|PdfDocument|UnsupportedDocument|) extension =
    if String.Equals(extension, ".docx", StringComparison.InvariantCultureIgnoreCase) then
        WordDocument
    else if String.Equals(extension, ".pdf", StringComparison.InvariantCultureIgnoreCase) then
        PdfDocument
    else
        UnsupportedDocument

// Check we do not have an empty document
if (reader.Length > 0L) then
    try
        match Path.GetExtension(filename) with
        | WordDocument ->
            // Get access to the word processing document from the input stream
            use document = WordprocessingDocument.Open(reader, false)
            // Process the word document with the mapper
            OfficeWordPageMapper.Map document
            |> Seq.iter (fun value -> outputCollector value)
            // close document
            document.Close()
        | PdfDocument ->
            // Get access to the pdf processing document from the input stream
            let document = new PdfReader(reader)
            // Process the pdf document with the mapper
            OfficePdfPageMapper.Map document
            |> Seq.iter (fun value -> outputCollector value)
            // close document
            document.Close()
        | UnsupportedDocument ->
            ()
    with
    | :? System.IO.FileFormatException ->
        // Ignore invalid file formats
        ()
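The extension dispatch can be tried on its own, without Hadoop or any document libraries. The sketch below repeats the active pattern from the post and adds a describe function, which is a hypothetical helper used only to show which branch a given file name would take:

```fsharp
open System
open System.IO

// Active pattern as defined in the post, with an explicit string annotation
let (|WordDocument|PdfDocument|UnsupportedDocument|) (extension: string) =
    if String.Equals(extension, ".docx", StringComparison.InvariantCultureIgnoreCase) then
        WordDocument
    elif String.Equals(extension, ".pdf", StringComparison.InvariantCultureIgnoreCase) then
        PdfDocument
    else
        UnsupportedDocument

// Hypothetical helper for illustration: name the branch a file would take
let describe (filename: string) =
    match Path.GetExtension(filename) with
    | WordDocument -> "word"
    | PdfDocument -> "pdf"
    | UnsupportedDocument -> "unsupported"

printfn "%s" (describe "Report.DOCX")   // word
printfn "%s" (describe "paper.pdf")     // pdf
printfn "%s" (describe "notes.txt")     // unsupported
printfn "%s" (describe "README")        // unsupported (empty extension)
```

Because the comparison ignores case, files named with upper-case extensions are routed to the same mapper as their lower-case equivalents, and anything else falls through to the branch that does nothing.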
And that is it.
Conclusion
For Microsoft Word documents, if one needs to process the actual text/words of a document, this is relatively straightforward:
document.MainDocumentPart.Document.Body.InnerText
Using iText, the text/word extraction code is a little more complex but still relatively easy. An example can be found here:
https://itextpdf.com/examples/iia.php?id=275
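As a rough idea of what this looks like from F#, here is a minimal sketch that assumes an iTextSharp 5.x release in which PdfTextExtractor exposes the static GetTextFromPage method. The getText function is a hypothetical helper and is not part of the code that accompanies this post:

```fsharp
open System.IO
open System.Text
open iTextSharp.text.pdf
open iTextSharp.text.pdf.parser

// Hypothetical helper: concatenate the text of every page of a PDF stream.
// Assumes iTextSharp 5.x, where PdfTextExtractor.GetTextFromPage is static.
let getText (stream: Stream) =
    let document = new PdfReader(stream)
    try
        let text = StringBuilder()
        // iText page numbers are 1-based
        for page in 1 .. document.NumberOfPages do
            text.AppendLine(PdfTextExtractor.GetTextFromPage(document, page)) |> ignore
        text.ToString()
    finally
        // Release the reader even if extraction fails part way through
        document.Close()
```

The quality of the extracted text depends on how the PDF was produced. Scanned documents that contain only images, for example, will not yield any text this way.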
Enjoy!
Comments
- Anonymous
March 20, 2013
Sir, Your idea is great, as I want same with some modification. The theme is, I have thousand of files pdf, txt, docx in a folder. I want to extracts most occurring top 10 words for each file using Hadoop/any software which gives quick results. I totally don't know C# & .NET, I try to understand the code, but I can't. I know little bit of Java. Can u tell me how to modify it into Java Program? I will be thankful, if u convert it completely into MapReduce form as many peoples are using Java for Hadoop programming. You can mail me also - sagarnikam123@gmail.com
- Anonymous
March 24, 2013
Hi Sagar you can use Hadoop for document processing in this fashion, provided you have sufficient volume. If you know Java you can use the Java binary reader that comes with this code as the reader for submitting a MR job written just in Java.
- Anonymous
May 08, 2014
HI Sir - I want to load pdf, word format data in the Hadoop systems. whats the best way to load the data in HDFS.