Node.js data processing distribution
I'm in need of a strategy to distribute data processing using Node.js. I'm trying to figure out whether using a worker pool and isolating groups of tasks in these workers is the best way, or whether a pipe/node-based system like http://strawjs.com/ is the way to go.
The steps I have are the following (for a single job; a sketch follows the list):
- Extract a zip file containing GIS shapefiles
- Convert the files to GeoJSON using ogr2ogr
- Denormalize the data in the GeoJSON file
- Transform the data into the format used in MongoDB
- Upsert the data into a MongoDB collection
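To make the steps concrete, a single job roughly amounts to something like the sketch below. It assumes the `extract-zip` and `mongodb` npm packages and ogr2ogr on the PATH; the file names, the upsert key, and the denormalize/transform step are placeholders.

```js
const { execFile } = require('child_process');
const { promisify } = require('util');
const fs = require('fs');
const path = require('path');
const extract = require('extract-zip'); // assumed: unzips the shapefiles

const run = promisify(execFile);

// `collection` is a connected mongodb driver Collection
async function processJob(zipPath, workDir, collection) {
  // 1. Extract the zip file containing the GIS shapefiles
  await extract(zipPath, { dir: workDir }); // dir must be an absolute path

  // 2. Convert the shapefile to GeoJSON using ogr2ogr
  const geojsonPath = path.join(workDir, 'out.geojson');
  await run('ogr2ogr', ['-f', 'GeoJSON', geojsonPath, path.join(workDir, 'input.shp')]);

  // 3 + 4. Denormalize and transform into the shape stored in MongoDB
  // (collapsed into one map here; the real logic is more involved)
  const geojson = JSON.parse(fs.readFileSync(geojsonPath, 'utf8'));
  const docs = geojson.features.map((f) => ({
    _id: f.properties.ID, // placeholder upsert key
    geometry: f.geometry,
    properties: f.properties,
  }));

  // 5. Upsert each document into the MongoDB collection
  for (const doc of docs) {
    await collection.updateOne({ _id: doc._id }, { $set: doc }, { upsert: true });
  }
}
```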
The main problem is that I don't know how to merge the data from the different GeoJSON files when using a pipe/node-based system like Straw.

I understand how the work would be done with worker pools, but I don't know how to distribute the workers across several machines.

I've tried the naive way of doing it in a single thread on a single machine using the async module. This works for small sets of data, but in production I need to be able to support millions of documents at a pretty frequent interval.
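The naive version just fans the jobs out with the async module inside one process, something like this (using the hypothetical `processJob()` from the sketch above):

```js
const path = require('path');
const async = require('async');

function runAllJobs(zipPaths, collection, done) {
  // Process up to 4 zips at a time in this one process. Fine for small
  // batches, but it never leaves this machine (or even this core).
  async.eachLimit(zipPaths, 4, (zipPath, cb) => {
    const workDir = path.resolve('work', path.basename(zipPath, '.zip'));
    processJob(zipPath, workDir, collection)
      .then(() => cb())
      .catch(cb);
  }, done);
}
```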
The reasons behind using Node.js are that we already have a solid infrastructure for scaling Node.js processes, and we use Node.js in every aspect of our production environment.
Author of Straw here.

You can run Straw pretty easily on multiple machines.

Set up a dedicated Redis server, and run Straw topologies on any number of separate worker machines, all of them using that Redis server (via the config you pass in to the topology).

By using named pipes in your topologies you can connect the separate machines together. It's exactly the same as if you were running on a single machine.
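A rough sketch of what that looks like, based on the Straw README (treat the exact option names and the node names as placeholders): each machine creates its topology against the same Redis server, and matching pipe names join the machines up.

```js
// machine-a.js -- extraction and conversion nodes
var straw = require('straw');

var topo = straw.create({
  nodes_dir: __dirname + '/nodes',
  redis: { host: 'redis.internal', port: 6379, prefix: 'gis' } // shared Redis
});

topo.add([
  { id: 'extract', node: 'extract-zip', output: 'shapefiles' },
  { id: 'convert', node: 'to-geojson', input: 'shapefiles', output: 'features' }
], function() {
  topo.start();
});

// machine-b.js -- upsert node on a second machine, same Redis config.
// Because the pipe name 'features' matches, the two machines are wired
// together just as they would be on a single box.
var straw = require('straw');

var topo = straw.create({
  nodes_dir: __dirname + '/nodes',
  redis: { host: 'redis.internal', port: 6379, prefix: 'gis' }
});

topo.add([
  { id: 'upsert', node: 'mongo-upsert', input: 'features' }
], function() {
  topo.start();
});
```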
A useful technique is to have multiple Straw nodes getting their input from the same pipe. This will load-balance automatically.
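For example (same hypothetical node names as above), adding several workers that read the same pipe spreads the messages across them:

```js
// Three copies of the upsert node all read the 'features' pipe; each
// message is delivered to exactly one of them, so this load-balances.
topo.add([
  { id: 'upsert-1', node: 'mongo-upsert', input: 'features' },
  { id: 'upsert-2', node: 'mongo-upsert', input: 'features' },
  { id: 'upsert-3', node: 'mongo-upsert', input: 'features' }
], function() {
  topo.start();
});
```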
Also, Straw uses a separate OS process per node, so on a multicore machine it will make better use of the cores than a single Node.js process would.
Let me know if you need more info or help.