Web crawler - Nutch does not crawl a particular website
I'm using Apache Nutch version 2.2.1 to crawl websites. It works fine except for one website: http://eur-lex.europa.eu/homepage.html.
I also tried Apache Nutch version 1.8 and got the same behaviour: nothing is fetched. It fetches and parses the entry page, but after that it cannot extract any links.
I see the following:
------------------------------
-finishing thread fetcherthread5, activethreads=4
-finishing thread fetcherthread4, activethreads=3
-finishing thread fetcherthread3, activethreads=2
-finishing thread fetcherthread2, activethreads=1
0/1 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 urls in 1 queues
-finishing thread fetcherthread0, activethreads=0
-----------------
Any idea?
This might be because the site's robots.txt file restricts the crawler's access to the site.
By default, Nutch checks the robots.txt file, located at http://yourhostname.com/robots.txt, and if it is not allowed to crawl the site, it will not fetch the page.
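One way to test this theory outside of Nutch is to run the site's robots.txt rules through a standard parser and ask whether your crawler's agent name may fetch a given URL. The following is only a minimal sketch using Python's standard `urllib.robotparser`; the sample rules, the agent name `my-nutch-crawler`, and the `example.com` URLs are made up for illustration and are not the actual rules of the site in question. Nutch's own robots.txt handling may also differ in details from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed once the robots.txt text has been downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    agent = "my-nutch-crawler"  # assumed agent name, replace with your own
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(url, "->", verdict)
```

To check a real site, download its robots.txt, pass the text to `is_allowed` together with the agent name your crawler sends (in Nutch this is normally configured through the `http.agent.name` property), and see whether the URLs you expect to be crawled come back as blocked.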