html - Python error trying to parse webpage
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.animeplus.tv/anime-show-list/")
    content = html.read()
    soup = BeautifulSoup(content)
    print(soup.prettify())

The script works fine on other webpages, but when I run the program on the targeted website I get:

    <meta .$_server["request_uri"]."'"="" content="0;url='" http-equiv="refresh"/>

I do not understand this HTML code.
I assume it's some sort of redirect or a way to prevent web scraping.
Is there a way for Python to access the code after the redirect, or in some way to get the source code the browser would return?
Thank you!
The trick here is that the page redirects and sets a cookie header that is important: without it, you do not get the HTML you see in the browser.
Here's a solution using requests, opening the same page twice in the same session:
    import requests
    from bs4 import BeautifulSoup

    url = "http://www.animeplus.tv/anime-show-list/"

    session = requests.Session()
    session.get(url)  # first request: the session stores the cookie the site sets
    response = session.get(url)  # open the page a second time, now with the cookie

    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title.text)  # prints: "watch anime | anime online | free anime | english anime | watch anime online - animeplus.tv"

Alternatively, you can use mechanize, but it doesn't support Python 3 at the moment. Here's how it works:
    >>> import mechanize
    >>> browser = mechanize.Browser()
    >>> browser.open('http://www.animeplus.tv/anime-show-list/')
    >>> print browser.response().read()
    <!doctype html>
    <html>
    <head>
    <title>watch anime | anime online | free anime | english anime | watch anime online - animeplus.tv</title>
    ...
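If you would rather not add a dependency, the same idea can be sketched with the standard library alone: keep cookies in a jar across requests and retry while the response still looks like the meta-refresh stub from the question. This is only a rough sketch. The names `MetaRefreshDetector`, `is_refresh_stub`, `fetch_with_cookies` and the `max_attempts` parameter are made up for illustration, and it assumes the site behaves as described above (a cookie on the first response, the real page on a later request with that cookie). It has not been checked against the site itself.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


class MetaRefreshDetector(HTMLParser):
    """Hypothetical helper: sets `found` if a <meta http-equiv="refresh"> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        for name, value in attrs:
            if name == "http-equiv" and (value or "").lower() == "refresh":
                self.found = True


def is_refresh_stub(html_text):
    """Return True if the HTML contains a meta-refresh tag."""
    detector = MetaRefreshDetector()
    detector.feed(html_text)
    return detector.found


def fetch_with_cookies(url, max_attempts=3):
    """Fetch `url`, reusing cookies between attempts, until the stub is gone.

    Assumption: the server sets a cookie on the first response and serves
    the real page once that cookie is sent back.
    """
    # The opener sends back any cookie stored in the jar on later requests.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    text = ""
    for _ in range(max_attempts):
        with opener.open(url) as response:
            text = response.read().decode("utf-8", errors="replace")
        if not is_refresh_stub(text):
            break
    return text
```

You would call it as `fetch_with_cookies("http://www.animeplus.tv/anime-show-list/")` and pass the returned string to BeautifulSoup as before. If the stub is still returned after `max_attempts` tries, the function simply hands back the last response, so it is worth checking the result with `is_refresh_stub` before parsing.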