PDF Clown highlighting doesn't work for some PDF files
I am using the PDF Clown library to highlight text inside a PDF file. For some reason, I get a NullPointerException when I run TextHighlightSample.
[java] java.lang.NullPointerException
[java]     at java.util.Hashtable.hash(Hashtable.java:239)
[java]     at java.util.Hashtable.put(Hashtable.java:519)
[java]     at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:139)
[java]     at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
[java]     at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
[java]     at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
[java]     at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
[java]     at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
[java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
[java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
[java]     at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
[java]     at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
[java]     at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
[java]     at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
[java]     at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
[java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
[java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
[java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
[java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
[java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
[java]     at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
[java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
[java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
[java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
[java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
[java]     at org.pdfclown.samples.cli.TextHighlightSample.run(TextHighlightSample.java:56)
[java]     at org.pdfclown.samples.cli.SampleLoader.run(SampleLoader.java:140)
[java]     at org.pdfclown.samples.cli.SampleLoader.main(SampleLoader.java:56)
Does anyone know how to solve this problem?
The foreground issue

The foreground issue is that PDF Clown, in SimpleFont.onLoad() (while reading the Widths of the font dictionary into its own structures), assumes that it has a glyphIndexes entry for each of the codes values it looks up by the FirstChar-based indices in the Widths array:
    if(glyphWidthObjects != null)
    {
      ByteArray charCode = new ByteArray(
        new byte[]
        {(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()}
        );
      for(PdfDirectObject glyphWidthObject : glyphWidthObjects)
      {
        int glyphWidth = ((PdfNumber<?>)glyphWidthObject).getIntValue();
        if(glyphWidth > 0)
        {
          Integer code = codes.get(charCode);
          if(code != null)
          {
            glyphWidths.put(
              glyphIndexes.get(code), //<<<<<<<<<<<<<<<<<<<<<<
              glyphWidth
              );
          }
        }
        charCode.data[0]++;
      }
    }
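If you add a check for null here, e.g. by replacing

    if(code != null)

by

    if(code != null && glyphIndexes.get(code) != null)

you get rid of the NullPointerException. The exception is thrown because glyphIndexes.get(code) returns null and glyphWidths, a Hashtable according to the stack trace, does not accept a null key.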
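To illustrate the mechanism outside of PDF Clown, here is a minimal, self-contained sketch. It assumes plain java.util.Hashtable maps with Integer keys; the class name NullKeyGuardDemo, the helper putWidth and the sample values are hypothetical and not part of the library.

```java
import java.util.Hashtable;
import java.util.Map;

public class NullKeyGuardDemo {
    /**
     * Hypothetical helper mirroring the guarded version of the loop body:
     * stores the width only if both the code and its glyph index are known.
     * Returns true if the width was stored, false if it was skipped.
     */
    public static boolean putWidth(Map<Integer, Integer> glyphIndexes,
                                   Map<Integer, Integer> glyphWidths,
                                   Integer code, int glyphWidth) {
        if (code == null) {
            return false;
        }
        Integer glyphIndex = glyphIndexes.get(code);
        if (glyphIndex == null) {
            return false; // no glyph index for this code: skip instead of failing
        }
        glyphWidths.put(glyphIndex, glyphWidth);
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> glyphIndexes = new Hashtable<>();
        glyphIndexes.put(65, 3); // made-up entry: code 65 -> glyph index 3
        Map<Integer, Integer> glyphWidths = new Hashtable<>();

        // Unguarded: the lookup yields null, and Hashtable rejects null keys.
        try {
            glyphWidths.put(glyphIndexes.get(8364), 500);
        } catch (NullPointerException e) {
            System.out.println("unguarded put failed on a null key");
        }

        // Guarded: the unknown code is skipped, the known one is stored.
        System.out.println(putWidth(glyphIndexes, glyphWidths, 8364, 500));
        System.out.println(putWidth(glyphIndexes, glyphWidths, 65, 600));
    }
}
```

Note that the guard only silences the symptom: widths for the skipped codes are simply not recorded.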
Usually there are glyphIndexes entries for all the codes values. Thus, you don't get a NullPointerException here. PDF Clown, in an attempt to be able to extract as much as possible, uses a mixture of information from the PDF objects and from the embedded font objects, though, and there still seem to be shortcomings in the coordination of that information, e.g. in the case of your document:
The background issue
While constructing the TrueTypeFont object for the font SourceSansPro-Regular, PDF Clown

- (Font.load) tries to read the ToUnicode map, which maps character codes to Unicode, and put it into codes; unfortunately that font has no ToUnicode map; thus, codes remains null;
- (OpenFontParser construction in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) tries to read information from the embedded font file; among other data it retrieves a mapping 32..213 -> 0..44, mapping character codes to in-font glyph indices;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) sets the font object's glyphIndexes member to that map; if there were a codes mapping by now, it would be used here to change that mapping into a mapping Unicode -> 0..44; as codes is null (see above), glyphIndexes remains as is;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) as there is no codes mapping yet, creates one based on the MacRomanEncoding entry of the PDF font dictionary;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) if there were no glyphIndexes yet, would derive one from the current codes mapping and the Widths array; as we already have one, it remains as is;
- (SimpleFont.onLoad) tries to put the contents of the PDF font dictionary's Widths array into the glyphWidths map. This code (see above) assumes that glyphIndexes is a mapping of Unicode codes and, therefore, translates the character codes using codes first. Unfortunately the keys of glyphIndexes here are not Unicode codes but character codes. Thus, the failure you observed above occurs.
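The key mismatch described in the steps above can be reproduced with ordinary maps. The following is only a sketch under stated assumptions: the class KeyMismatchDemo, the method failingCharCodes and the two sample entries are hypothetical, and the glyph index values are made up. The one external fact it relies on is that MacRoman code 0xA5 is the bullet character U+2022.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyMismatchDemo {
    /**
     * Returns the character codes for which a lookup of the translated
     * (Unicode) value in glyphIndexes finds nothing. In the unguarded
     * library code such a miss produces the null key.
     */
    public static List<Integer> failingCharCodes(Map<Integer, Integer> codes,
                                                 Map<Integer, Integer> glyphIndexes) {
        List<Integer> failing = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : codes.entrySet()) {
            Integer unicode = entry.getValue();
            if (glyphIndexes.get(unicode) == null) {
                failing.add(entry.getKey());
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // codes: character code -> Unicode, as built from MacRomanEncoding.
        Map<Integer, Integer> codes = new TreeMap<>();
        codes.put(0x41, 0x0041); // 'A': character code equals Unicode value
        codes.put(0xA5, 0x2022); // bullet: character code differs from Unicode

        // glyphIndexes: still keyed by character code, because codes was
        // null when the map from the embedded font was installed.
        Map<Integer, Integer> glyphIndexes = new TreeMap<>();
        glyphIndexes.put(0x41, 1); // made-up glyph index
        glyphIndexes.put(0xA5, 2); // made-up glyph index

        // Only the code whose Unicode value differs from it is missed.
        System.out.println(failingCharCodes(codes, glyphIndexes)); // prints [165]
    }
}
```

Codes in the ASCII range happen to work because their MacRoman and Unicode values coincide, which may be why the problem shows up only for some documents.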
The font extraction in PDF Clown 0.1.3 is in need of a clean-up. It tries to make use of information from both the PDF objects and the embedded fonts (which is a good idea), but in situations like the one here it shoots itself in the foot.
But it's still a 0.x version after all, so issues are to be expected...