web crawler - Nutch does not crawl a particular website

September 15, 2015

I'm using Apache Nutch version 2.2.1 to crawl websites. It works fine except for one website: http://eur-lex.europa.eu/homepage.html.

I also tried Apache Nutch version 1.8 and got the same behaviour: nothing is fetched. It fetches and parses the entry page, but after that it seems unable to extract any links.

I see the following:

------------------------------
-finishing thread fetcherthread5, activethreads=4
-finishing thread fetcherthread4, activethreads=3
-finishing thread fetcherthread3, activethreads=2
-finishing thread fetcherthread2, activethreads=1
0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 urls in 1 queues
-finishing thread fetcherthread0, activethreads=0
-----------------

Any idea?

This might be because the site's robots.txt file restricts the crawler's access to the site.

By default, Nutch checks the robots.txt file, located at http://yourhostname.com/robots.txt, and if it is not allowed to crawl the site, it will not fetch the page.
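One way to test this theory outside Nutch is to run the site's robots.txt rules through a standard parser and ask whether your crawler's agent name may fetch the URL. The sketch below uses Python's standard `urllib.robotparser`. The two robots.txt bodies, the agent name `MyNutchCrawler`, and the `example.com` URLs are made-up placeholders, not the real contents of any site; substitute the actual file and the agent name you configured for Nutch. Python's parser may also differ in small details from the one Nutch uses, so treat the result as a hint.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks every crawler from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that blocks only one directory.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    agent = "MyNutchCrawler"  # assumed agent name, replace with your own
    for name, rules in (("BLOCK_ALL", BLOCK_ALL), ("BLOCK_PRIVATE", BLOCK_PRIVATE)):
        for url in ("http://example.com/homepage.html",
                    "http://example.com/private/page.html"):
            print(name, url, is_allowed(rules, agent, url))
```

To check a live site instead of pasted text, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called. If the real file disallows the entry page or the paths it links to, that would be consistent with the fetcher finishing with 0 pages.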
