Web crawler - Nutch does not crawl a particular website


I'm using Apache Nutch version 2.2.1 to crawl websites. It works fine for every site except one: http://eur-lex.europa.eu/homepage.html.

I also tried Apache Nutch version 1.8 and saw the same behaviour: nothing is fetched. It fetches and parses the entry page, but after that it cannot extract any links.

In the logs I see the following:

-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread4, activeThreads=3
-finishing thread FetcherThread3, activeThreads=2
-finishing thread FetcherThread2, activeThreads=1
0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 1 queues
-finishing thread FetcherThread0, activeThreads=0

Any idea?

This might be because the site's robots.txt file restricts the crawler's access to the site.

By default, Nutch checks the robots.txt file, located at http://yourhostname.com/robots.txt, and if it is not allowed to crawl the site, it will not fetch any pages from it.
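To verify whether robots.txt is the culprit, you can reproduce the check outside Nutch. A minimal sketch using Python's standard-library robots.txt parser is shown below; the rules string and the agent name "MyNutchCrawler" are illustrative assumptions (use the actual contents of the site's robots.txt and the value of `http.agent.name` from your Nutch configuration):

```python
from urllib.robotparser import RobotFileParser

# Example rules -- NOT the actual robots.txt of eur-lex.europa.eu.
# Fetch the real file from http://eur-lex.europa.eu/robots.txt and
# paste it here to test against the site's actual policy.
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# "MyNutchCrawler" is a hypothetical agent name; substitute the
# http.agent.name configured for your crawl.
print(rp.can_fetch("MyNutchCrawler", "http://eur-lex.europa.eu/homepage.html"))
print(rp.can_fetch("MyNutchCrawler", "http://eur-lex.europa.eu/private/x"))
```

If `can_fetch` returns False for your entry page under the site's real rules and your configured agent name, that explains why the fetcher reports 0 pages.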

