java - Algorithm for crawling the top 10 PR/Alexa sites
I'm trying to write a script that crawls the current top 10 PR/Alexa sites. Since PR/Alexa rankings change, the script should take care of that; I mean, a site that is not in the top 10 today might be there tomorrow.
I don't know how to start. I know the crawling concepts, but here I'm stuck. It could be the top 50 or the top 500 sites; I can configure that, of course.
I read that Google's spider is too complicated for such a simple task. How do Google, Yahoo, and Bing crawl billions of sites around the web? I'm curious. What is the starting point; I mean, how can Google identify a newly launched site?
OK, these are deep details, and I will read about them later. Right now I'm more concerned with my problem: how to crawl the top 10 PR sites.
Can you provide a sample program so I can understand it better?
It's rather simple to fetch the top 25 sites (if I understood correctly what you wanted to do).
code:
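from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the Alexa top sites page and parse its HTML
b = BeautifulSoup(urlopen("http://www.alexa.com/topsites").read(), "html.parser")
# Each listed site is a <p class="desc-paragraph"> whose link text is the domain
paragraphs = b.find_all('p', {'class': 'desc-paragraph'})
for p in paragraphs:
    print(p.a.text)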
output:
google.com
facebook.com
youtube.com
yahoo.com
baidu.com
wikipedia.org
(...)
But keep in mind that the law in some countries is stricter about this. Use it at your own risk.
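The question also asks how to cope with a ranking that changes from day to day and with a configurable list size (top 10, top 50, top 500). A minimal sketch of one way to do that, assuming you keep yesterday's list around and compare it with today's. The function names `top_n` and `diff_rankings` and the sample data are made up for illustration; in practice both lists would come from the scraping code above.

```python
def top_n(ranked_sites, n=10):
    """Return the first n entries of an already ranked list of domains."""
    return list(ranked_sites[:n])


def diff_rankings(previous, current):
    """Return (entered, dropped): sites new to the list and sites that left it."""
    previous_set, current_set = set(previous), set(current)
    entered = [site for site in current if site not in previous_set]
    dropped = [site for site in previous if site not in current_set]
    return entered, dropped


# Hypothetical sample data standing in for two daily scrapes.
yesterday = top_n(["google.com", "facebook.com", "youtube.com", "yahoo.com"], n=3)
today = top_n(["google.com", "youtube.com", "baidu.com", "facebook.com"], n=3)

entered, dropped = diff_rankings(yesterday, today)
print("start crawling:", entered)
print("stop crawling:", dropped)
```

With this approach the crawler only needs to react to the two small lists: add the sites that entered the top N and drop the ones that left it, while `n` stays a plain parameter you can set to 10, 50, or 500.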