java - Algorithm of crawling Top10 PR/Alexa sites -

August 15, 2014

I'm trying to write a script that crawls the current top 10 PR/Alexa sites. Since the PR/Alexa rankings change, the script should take care of that: a site that is in the top 10 today might not be there tomorrow.

I don't know how to start. I know the crawling concepts, but this is where I'm stuck. It could just as well be the top 50 or top 500 sites; that can be made configurable, of course.

I have read that the Google spider is complicated and not a simple task. How do Google, Yahoo, and Bing crawl billions of sites around the web? I'm curious. What is the cursor point, I mean, how can Google identify a newly launched site?

OK, these are deep details; I will read about them later. Right now I'm more concerned with my own problem: how to crawl the top 10 PR sites.

Can anyone provide a sample program so I can understand better?

It's rather simple to fetch the top 25 sites (if I understood correctly what you wanted to do).

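Code:

    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    # Download the Alexa top sites page and parse the HTML.
    html = urlopen("http://www.alexa.com/topsites").read()
    b = BeautifulSoup(html, "html.parser")

    # Each ranked site sits in a <p class="desc-paragraph"> wrapping a link.
    paragraphs = b.find_all('p', {'class': 'desc-paragraph'})
    for p in paragraphs:
        print(p.a.text)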

Output:

    google.com
    facebook.com
    youtube.com
    yahoo.com
    baidu.com
    wikipedia.org
    (...)

But keep in mind that the law in some countries is stricter about this kind of scraping. Use it at your own risk.
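Since the question also asks how to cope with the ranking changing from day to day, here is a minimal sketch of one way to handle it, not a definitive implementation. It assumes you store the list scraped on the previous run and compare it with the fresh one. The helper names `top_n` and `diff_rankings` and the sample lists are hypothetical and only there for illustration; in practice the lists would come from the scraping step above.

```python
def top_n(sites, n=10):
    """Return the first n entries of a ranked list, lowercased and de-duplicated."""
    seen = []
    for site in sites:
        if len(seen) >= n:
            break
        name = site.strip().lower()
        if name and name not in seen:
            seen.append(name)
    return seen


def diff_rankings(previous, current):
    """Compare two top-N lists and report which sites entered or dropped out."""
    prev_set, curr_set = set(previous), set(current)
    entered = [s for s in current if s not in prev_set]
    dropped = [s for s in previous if s not in curr_set]
    return entered, dropped


# Hypothetical data standing in for two daily scrapes (n is configurable).
yesterday = top_n(["google.com", "facebook.com", "youtube.com", "yahoo.com"], n=3)
today = top_n(["Google.com", "youtube.com", "baidu.com", "facebook.com"], n=3)

entered, dropped = diff_rankings(yesterday, today)
print("entered:", entered)
print("dropped:", dropped)
```

With this approach the crawler never hard-codes the site list: each run re-reads the ranking, crawls whatever is currently in the top N, and can stop crawling the sites reported as dropped.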
