PDFClown highlighting doesn't work for some PDF files
I am using the PDFClown library to highlight text inside a PDF file. For some reason, I get a NullPointerException when I run TextHighlightSample.
    [java] java.lang.NullPointerException
    [java]     at java.util.Hashtable.hash(Hashtable.java:239)
    [java]     at java.util.Hashtable.put(Hashtable.java:519)
    [java]     at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:139)
    [java]     at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
    [java]     at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
    [java]     at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
    [java]     at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
    [java]     at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
    [java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
    [java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
    [java]     at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
    [java]     at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
    [java]     at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
    [java]     at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
    [java]     at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
    [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
    [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
    [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
    [java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
    [java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
    [java]     at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
    [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
    [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
    [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
    [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
    [java]     at org.pdfclown.samples.cli.TextHighlightSample.run(TextHighlightSample.java:56)
    [java]     at org.pdfclown.samples.cli.SampleLoader.run(SampleLoader.java:140)
    [java]     at org.pdfclown.samples.cli.SampleLoader.main(SampleLoader.java:56)

Does anyone know how to solve this problem?
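The foreground issue
The foreground issue is that PDFClown, in SimpleFont.onLoad() (while reading the Widths array from the font dictionary into its own structures), assumes that glyphIndexes has an entry for each codes value it looks up, where the keys into codes are derived from the FirstChar-based indices of the Widths array: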
    if(glyphWidthObjects != null)
    {
      ByteArray charCode = new ByteArray(
        new byte[]
        {(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()}
        );
      for(PdfDirectObject glyphWidthObject : glyphWidthObjects)
      {
        int glyphWidth = ((PdfNumber<?>)glyphWidthObject).getIntValue();
        if(glyphWidth > 0)
        {
          Integer code = codes.get(charCode);
          if(code != null)
          {
            glyphWidths.put(
              glyphIndexes.get(code), //<<<<<<<<<<<<<<<<<<<<<< may be null
              glyphWidth
              );
          }
        }
        charCode.data[0]++;
      }
    }

If you also check for null here, e.g. by replacing

    if(code != null)

by

    if(code != null && glyphIndexes.get(code) != null)

you get rid of the NullPointerException.
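As a rough illustration of that guard, here is a minimal, self-contained sketch. It is not PDFClown code: the class name GuardedWidthsSketch, the method loadWidths, and the use of plain Integer keys in place of ByteArray are assumptions made only for this example. It mirrors the loop above and skips any width whose glyph index cannot be resolved, instead of passing a null key to Hashtable.put.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class GuardedWidthsSketch {
    // Hypothetical stand-in for the Widths loop in SimpleFont.onLoad();
    // Integer character codes replace PDFClown's ByteArray keys.
    static Map<Integer, Integer> loadWidths(int firstChar, int[] widths,
            Map<Integer, Integer> codes, Map<Integer, Integer> glyphIndexes) {
        // Hashtable rejects null keys, like the map in the stack trace.
        Map<Integer, Integer> glyphWidths = new Hashtable<>();
        int charCode = firstChar;
        for (int glyphWidth : widths) {
            if (glyphWidth > 0) {
                Integer code = codes.get(charCode);
                // Resolve the glyph index once and guard against both nulls.
                Integer glyphIndex = (code != null) ? glyphIndexes.get(code) : null;
                if (glyphIndex != null) {
                    glyphWidths.put(glyphIndex, glyphWidth);
                }
            }
            charCode++;
        }
        return glyphWidths;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(32, 0x20);
        codes.put(33, 0x21);
        Map<Integer, Integer> glyphIndexes = new HashMap<>();
        glyphIndexes.put(0x20, 0); // only one of the two codes has a glyph index
        // The second width is skipped rather than causing an exception.
        System.out.println(loadWidths(32, new int[] {200, 300}, codes, glyphIndexes)); // prints {0=200}
    }
}
```

Note that the guard only suppresses the exception: widths whose glyph index cannot be found are silently dropped, so it treats the symptom rather than the underlying mapping problem.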
Usually there are glyphIndexes entries for those values, so you don't get a NullPointerException here. In an attempt to extract as much as possible, PDFClown uses a mixture of information from the PDF objects and from the embedded font objects, and there still seem to be some shortcomings in the coordination of this information, e.g. in the case of your document:
The background issue
While constructing a TrueTypeFont object for the font SourceSansPro-Regular, PDFClown
- (Font.load) tries to read a ToUnicode map for mapping character codes to Unicode and put it into codes; unfortunately that font has no ToUnicode map; thus, codes remains null;
- (OpenFontParser construction in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) tries to read information from the embedded font file; among other data it retrieves a mapping 32..213 -> 0..44 from character codes to in-font glyph indices;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) sets the font object's glyphIndexes member to this map; if there were a codes mapping by now, it would be used here to change the mapping into a mapping Unicode -> 0..44; as codes is null (see above), glyphIndexes remains as is;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) as there is no codes mapping yet, creates one based on the MacRomanEncoding entry from the PDF font dictionary;
- (still in TrueTypeFont.loadEncoding, called from SimpleFont.onLoad) if there were no glyphIndexes yet, would derive one from the current codes mapping and the Widths array; as there already is one, it remains as is;
- (SimpleFont.onLoad) tries to put the contents of the PDF font dictionary's Widths array into the glyphWidths map. The code (see above) assumes that glyphIndexes is keyed by Unicode codes and, therefore, translates the character codes using codes first. Unfortunately glyphIndexes here is keyed not by Unicode codes but by character codes. Thus, the failure observed above occurs.
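The key-space mismatch in the last step can be reproduced in isolation. The following is a toy model, not PDFClown code: the class KeySpaceMismatchSketch, the method putWidthFails, and the glyph index values are hypothetical and chosen only for illustration. It keys glyphIndexes by character code, as the embedded font delivered it, while codes maps character codes to Unicode as a MacRomanEncoding-based table would (character code 0xA5 is the bullet, U+2022, in Mac OS Roman).

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class KeySpaceMismatchSketch {
    // Returns true if storing a width for charCode throws a NullPointerException.
    static boolean putWidthFails(int charCode, Map<Integer, Integer> codes,
            Map<Integer, Integer> glyphIndexes) {
        Hashtable<Integer, Integer> glyphWidths = new Hashtable<>();
        Integer unicode = codes.get(charCode);           // character code -> Unicode
        Integer glyphIndex = glyphIndexes.get(unicode);  // wrong key space: map is keyed by character code
        try {
            glyphWidths.put(glyphIndex, 500);            // Hashtable does not accept a null key
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(0x41, 0x0041); // 'A': character code and Unicode coincide
        codes.put(0xA5, 0x2022); // bullet: character code and Unicode differ

        Map<Integer, Integer> glyphIndexes = new HashMap<>();
        glyphIndexes.put(0x41, 1); // keyed by character code, illustrative glyph indices
        glyphIndexes.put(0xA5, 7);

        System.out.println("0x41 fails: " + putWidthFails(0x41, codes, glyphIndexes)); // prints 0x41 fails: false
        System.out.println("0xA5 fails: " + putWidthFails(0xA5, codes, glyphIndexes)); // prints 0xA5 fails: true
    }
}
```

In this toy model the lookup only breaks for codes whose Unicode value differs from the character code, which would be one plausible reason why such a problem surfaces for some files and not others.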
Thus, the font extraction in PDFClown 0.1.3 is in need of some clean-up. It tries to make use of information from both the PDF objects and the embedded fonts (which is a good idea), but in situations like the one here it shoots itself in the foot.
But it's still a 0.x version after all, so issues are to be expected...