C# - Parsing Large List of Excel Files Failing
This is about a C#/VSTO program I've been working on for a data capture project. The scope is essentially "process Excel files sent in by a variety of third party companies." Practically, that means:
- Locate the columns that contain the data I want through a search method.
- Grab the data out of the workbooks.
- Clean the data, run calculations, etc.
- Output the cleaned data to a new workbook.
The program I have written works great for small-to-medium data sets: ~25 workbooks with a combined total of ~1,000 rows of relevant data. I'm grabbing 7 columns of data out of these workbooks. The one edge case I have, though, needs to run against a larger data set: ~50 workbooks with a combined total of ~8,000 rows of relevant data (and possibly ~2,000 rows of duplicate data I also have to remove).
I am putting the list of files through a Parallel.ForEach loop, inside of which I open a new Excel.Application() to process each file across multiple ActiveSheets. The parallel process runs much faster on the smaller data set than going through each file sequentially. On the larger data set, though, I seem to hit a wall.
I start getting the message "Microsoft Excel is waiting for another application to complete an OLE action", and eventually it just fails. Switching back to a sequential foreach does allow the program to finish, but it grinds along - going from 1-3 minutes for a parallel run on the medium data set to 20+ minutes for a sequential run on the large data set. If I mess with ParallelOptions.MaxDegreeOfParallelism, setting it to 10 completes the cycle, but it still takes 15 minutes. If I set it to 15, it fails. I don't like messing with the TPL settings if I don't have to. I've also tried inserting a Thread.Sleep to manually slow things down, but that only made the failure happen further out.
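For reference, capping the degree of parallelism looks like this - a minimal sketch, where `files` and `HandleFile` are placeholders for my actual file list and per-file processing logic:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Cap the number of simultaneous Excel instances at 10.
var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };

// "files" is the list of workbook paths; HandleFile opens one
// Excel.Application, processes the workbook, and tears it down.
Parallel.ForEach(files, options, path => HandleFile(path));
```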
I close the workbook, quit the application, and ReleaseComObject the Excel objects, then call GC.Collect and GC.WaitForPendingFinalizers at the end of each loop.
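A sketch of the open/close/release sequence described above, assuming the standard Microsoft.Office.Interop.Excel API; the "locate columns, grab data" step is elided since that logic lives in my real code:

```csharp
using System;
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

static void HandleFile(string path)
{
    // One Excel instance per file, as described in the question.
    var app = new Excel.Application { Visible = false, DisplayAlerts = false };
    Excel.Workbook book = null;
    try
    {
        book = app.Workbooks.Open(path, ReadOnly: true);
        // ... locate columns, grab data, clean, calculate ...
    }
    finally
    {
        // Tear-down at the end of each loop iteration.
        if (book != null)
        {
            book.Close(false);                 // close without saving
            Marshal.ReleaseComObject(book);
        }
        app.Quit();
        Marshal.ReleaseComObject(app);
        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
```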
My ideas at the moment are:
- Split the list in half and run the halves separately.
- Open some number of new Excel.Application() instances in parallel, but run the list of files sequentially inside each Excel instance (so kind of like #1, just using a different path).
- Separate the list by file size, run the small set of large files independently/sequentially, and run the rest as they have been.
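Idea #2 might be sketched like this: deal the files round-robin into one bucket per Excel instance, then run each bucket sequentially inside its own instance. This is a sketch under assumptions - `ProcessWorkbook` is a hypothetical placeholder for my extraction logic; the interop calls are the standard Microsoft.Office.Interop.Excel API:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using Excel = Microsoft.Office.Interop.Excel;

static void ProcessInPool(IList<string> files, int instanceCount)
{
    // Deal the files round-robin into one bucket per Excel instance.
    var buckets = files
        .Select((path, i) => new { path, i })
        .GroupBy(x => x.i % instanceCount, x => x.path)
        .Select(g => g.ToList())
        .ToList();

    // One Excel instance per bucket; files inside a bucket run sequentially.
    Parallel.ForEach(buckets, bucket =>
    {
        var app = new Excel.Application { Visible = false, DisplayAlerts = false };
        try
        {
            foreach (var path in bucket)
            {
                var book = app.Workbooks.Open(path, ReadOnly: true);
                try { /* ProcessWorkbook(book); */ }
                finally
                {
                    book.Close(false);
                    Marshal.ReleaseComObject(book);
                }
            }
        }
        finally
        {
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    });
}
```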
Things I am hoping for help with:
- Suggestions on making really sure the memory is getting cleared (maybe a Process.Id is getting twisted up in all the opening and closing?)
- Suggestions on ordering the parallel process - I'm wondering if I can throw the "big" guys in first, to make the longer-running process more stable.
I have been looking at http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/ , which says "with prior knowledge about your work, it may be possible to partition data more meaningfully than the default partitioner." I'm having a hard time knowing what partitioning, if any, makes sense here.
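One simple form of "prior knowledge" for this workload is file size: ordering the list largest-first means the long-running workbooks start early instead of straggling at the end of the parallel run. A minimal sketch:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Order workbook paths by on-disk size, largest first, before
// handing the list to Parallel.ForEach.
static List<string> OrderLargestFirst(IEnumerable<string> files) =>
    files.OrderByDescending(f => new FileInfo(f).Length).ToList();
```

File size is only a rough proxy for processing time, but it is cheap to compute and needs no changes to the processing code itself.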
I would really appreciate any insights!
Update

So, as a general rule I test against Excel 2010, since we have both 2010 and 2013 in use here. I ran against 2013 and it works fine - run time of about 4 minutes, which is what I would expect. Before I abandon 2010 compatibility, any other ideas? The 2010 machine is a 64-bit machine with 64-bit Office, and the 2013 machine is a 64-bit machine with 32-bit Office. Would that matter at all?
A few years ago I worked with Excel files and automation. We had problems with zombie processes lingering in Task Manager: although our program had ended and we thought we quit Excel properly, the processes would not quit.
The solution was not one I liked, but it was effective. I can summarize it like this:
1) Never use 2 dots consecutively, like:
Workbook.ActiveSheet.PageSetup
Instead, use variables. When done, release and null them.
For example, instead of doing this:
m_currentWorkbook.ActiveSheet.PageSetup.LeftFooter = str.ToString();
follow the practices in this function. (This function adds a barcode to the Excel footer.)
```csharp
private bool SetBarcode(string text)
{
    Excel._Worksheet sheet = (Excel._Worksheet)m_currentWorkbook.ActiveSheet;
    try
    {
        StringBuilder str = new StringBuilder();
        str.Append(@"&""IDAutomationHC39M,Regular""&22(");
        str.Append(text);
        str.Append(")");

        Excel.PageSetup setup = sheet.PageSetup;
        try
        {
            setup.LeftFooter = str.ToString();
        }
        finally
        {
            RemoveReference(setup);
            setup = null;
        }
    }
    finally
    {
        RemoveReference(sheet);
        sheet = null;
    }
    return true;
}
```
Here is the RemoveReference function (just setting the variable to null at the call site did not work):

```csharp
private void RemoveReference(object o)
{
    try
    {
        System.Runtime.InteropServices.Marshal.ReleaseComObject(o);
    }
    catch { }
    finally
    {
        o = null;
    }
}
```
If you follow this pattern everywhere, it guarantees no leaks, no zombie processes, etc.
2) In order to create Excel files you can use the Excel application, but for getting data out of Excel, I suggest using OleDb. You can treat Excel like a database and pull data out of it with SQL queries, DataTables, etc.
Sample code (instead of filling a DataSet, you can use a DataReader for better memory performance):
```csharp
private List<DataTable> GetMovieTables()
{
    List<DataTable> movieTables = new List<DataTable>();
    var connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + excelFilePath +
        ";Extended Properties=\"Excel 12.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text\"";
    using (var conn = new OleDbConnection(connectionString))
    {
        conn.Open();
        DataRowCollection sheets = conn.GetOleDbSchemaTable(
            OleDbSchemaGuid.Tables,
            new object[] { null, null, null, "TABLE" }).Rows;
        foreach (DataRow sheet in sheets)
        {
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT * FROM [" + sheet["TABLE_NAME"].ToString() + "]";
                var adapter = new OleDbDataAdapter(cmd);
                var ds = new DataSet();
                try
                {
                    adapter.Fill(ds);
                    movieTables.Add(ds.Tables[0]);
                }
                catch (Exception ex)
                {
                    // Debug.WriteLine(ex.ToString());
                    continue;
                }
            }
        }
    }
    return movieTables;
}
```