Python - Cleaning a text string after getting body text using BeautifulSoup
I'm trying to get the text of articles from various webpages and write them out as clean text documents. I don't want all of the visible text, because that often includes irrelevant links on the sides of the pages. I'm using BeautifulSoup to extract the information from the pages. However, links that are not on the side of the page but in the middle of the body text or at the bottom of the articles sometimes make it into the final product.
Does anyone know how to deal with the problem of links being converted to text that is not part of the real article's text?
    # Some of the imports are for other portions of the code not shown here.
    # I'm new to Python and bad at remembering which library has which functions.
    import os
    import sys
    import urllib2
    import webbrowser
    from bs4 import BeautifulSoup
    from os import path
    from cookielib import CookieJar

    # I made an opener to deal with proxies and put *** in place of my information.
    # cookielib helps me get articles from nytimes.
    proxy = urllib2.ProxyHandler({'http': '***' % '***'})
    auth = urllib2.HTTPBasicAuthHandler()
    cj = CookieJar()
    opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler, urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    # Uses a URL given as a string to open a webpage and pull out all the information.
    def baumeister(url):
        req = urllib2.Request(url)
        opened = urllib2.urlopen(req)
        html_doc = opened.read()
        soup = BeautifulSoup(html_doc)
        return soup

    # Gets the body text from the HTML information.
    def substanz(url):
        soup = baumeister(url)
        body = soup.find_all("p")
        # This is what I have tried to fix my problem, and it failed.
        result = ""
        for e in body:
            text = e.getText().replace("\t", "").replace("  ", " ").strip().encode(errors="ignore")
            result += text + "\r\n\r\n"
        return result
One article I have used to test substanz, which gets cleaned in exactly the way I want, is:
http://blogs.hbr.org/2014/06/do-you-really-want-to-be-yourself-at-work/
I'm trying to test more articles from different sites, and I'm trying to clean the result of substanz (the result is one big string). The one I have a problem with is a CNBC article.
I've used print substanz('url') to see what the result looks like. In the CNBC article, links get turned into text that is not part of the article, whereas the Harvard Business Review article works out fine because the included links are part of the actual text.
I'm not going to attach the full result for each article here, because each one is a full page of text.
If you try the code I have posted above, the opener is not going to work for you, so use whatever opener you need to access websites. I have to go through a proxy at work, and that's the format that works for me.
As a final note, I'm using Python 3.4 and writing the code in an IPython notebook.
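A side note on the opener: urllib2 and cookielib are Python 2 module names; in Python 3 the same functionality lives in urllib.request and http.cookiejar. Below is a minimal sketch of a cookie-aware opener under Python 3. The function name build_cookie_opener and the proxy_url parameter are made up for illustration, and the authentication handler from the code above is left out.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener(proxy_url=None):
    """Build an opener that keeps cookies and optionally goes through a proxy.

    proxy_url is a hypothetical parameter, e.g. "http://proxy.example:8080".
    """
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    return urllib.request.build_opener(*handlers)


# No request is sent here; the opener is only constructed.
opener = build_cookie_opener()
```

Calling urllib.request.install_opener(opener) afterwards would make urllib.request.urlopen use it, mirroring the install_opener call in the Python 2 version.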
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7cfinance%7cheadline%7cheadline%7cstory&par=yahoo&doc=101790001%7cthink%20college%20is%20expensiv#")
    soup = BeautifulSoup(r.content)
    # Collect the text of every <p> tag on the page.
    text = [''.join(s.find_all(text=True)) for s in soup.find_all('p')]
    print(text)

['>> view results ""', 'enter multiple symbols separated commas', 'london quotes available', 'interest rates on loans jump', "because federal student loans tied 10-year treasury note, cnbc's sharon epperson reports borrowers see impact of rise in treasury yields on past year.", ' congratulations, graduates, on diploma. $29,000 student loan debt? ', ' more 70 percent of graduates carry student debt real world, according institute college access , success. , average debt shy of $30,000. ', ' news worse next week when interest rates on student loans set rise again. ', ' though federal student loan rates fixed life of loan, these rates reset new borrowers every july 1, legislation ties rates performance of financial markets. ', ' interest rate on federal stafford loans go current fixed rate of under 4 percent 4.66 percent loans distributed between july 1 , june 30, 2015. ', ' read morestudent loan problem easy fix: sen. warren ', ' graduate students, rate on stafford loans rise on 5 percent 6.21 percent. ', ' direct plus loans graduates , parents still expensive, rates rising 7.21 percent.', 'which college major pays off most?', "cnbc's sharon epperson reports majoring in engineering lucrative. ", " increase in monthly federal student loan payments can add quickly, shouldn't burdensome students. every $10,000 in loans, new borrowers pay $4 more month based on 10-year repayment period. ", " read morewhy millennial women don't save retirement ", ' still, experts warn beginning. 
', ' "federal student loan rates continue increase in next few years , hit maximum rate caps high 10.5 percent loans," said mark kantrowitz, senior vice president , publisher of edvisors.com. ', ' sophomore student samantha cook, decision go george washington university big 1 financially. says had doubts it. ', ' "my parents wanted assure me no matter picked, we\'d find way make work," cook said. families, cook , parents making work combining household savings, scholarships , grants—and student loans. ', ' read morecramer: offset high cost of higher education ', ' despite rising tuition , borrowing costs, cook family decided against samantha transferring in-state university. ', ' despite debt load taking on, said, "the value of gw degree me @ least more valuable when looking jobs later on." ', " —by cnbc's sharon epperson ", 'hosting yard sale may not profitable way rid of old junk.', 'many americans debit cards tied checking accounts still confused how these programs work. ', "here's how avoid these deadly sins if you're contemplating or in divorce.", "the irs offers lot of students. problem is, educational tax breaks , how work -- or don't -- confusing.", 'get best of cnbc in inbox', 'tips home buyers find right home bank account.', 'complaints movers down. how find right one—and save.', "forget bathing suit season. why it's time join gym. ", 'drivers might see lower gas prices year, smart shopping tactics them save more.', 'data real-time snapshot *data delayed @ least 15 minutesglobal business , financial news, stock quotes, , market data , analysis', '© 2014 cnbc llc. rights reserved.', 'a division of nbcuniversal']
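The list above still contains sidebar teasers and footer lines because every p tag on the page is collected. One common heuristic for this, which is not part of the original answer, is link density: drop paragraphs whose text sits mostly inside a tags. The sketch below assumes made-up markup (SAMPLE_HTML) rather than the real CNBC page, and the helper names and the 0.5 threshold are illustrative choices.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page; not taken from CNBC.
SAMPLE_HTML = """
<div class="sidebar"><p><a href="/a">Get the best of the site in your inbox</a></p></div>
<div class="group">
  <p>Interest rates on student loans are set to rise.</p>
  <p><a href="/b">Read More: Student loan problem has easy fix</a></p>
  <p>Rates reset every <a href="/c">July 1</a> for new borrowers.</p>
</div>
"""


def link_density(tag):
    """Return the fraction of a tag's text that sits inside <a> elements."""
    total = len(tag.get_text(strip=True))
    if total == 0:
        return 1.0  # treat empty paragraphs as pure noise
    linked = sum(len(a.get_text(strip=True)) for a in tag.find_all("a"))
    return linked / total


def article_paragraphs(html, max_density=0.5):
    """Keep paragraphs whose text is mostly outside of links."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p")
        if link_density(p) <= max_density
    ]


paragraphs = article_paragraphs(SAMPLE_HTML)
```

Paragraphs that merely contain a link inside a sentence survive, while paragraphs that are nothing but a link are dropped. The threshold likely needs tuning per site.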
For the website in your link, this gets just the text of the main article:
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7cfinance%7cheadline%7cheadline%7cstory&par=yahoo&doc=101790001%7cthink%20college%20is%20expensiv#")
    soup = BeautifulSoup(r.content)
    # Only look inside the div elements with class "group", which hold the article body.
    text = [''.join(s.find_all(text=True)) for s in soup.find_all("div", {"class": "group"})]
    print(text)

['\n congratulations, graduates, on diploma. $29,000 student loan debt? \n more 70 percent of graduates carry student debt real world, according institute college access , success. , average debt shy of $30,000. \n news worse next week when interest rates on student loans set rise again. \n though federal student loan rates fixed life of loan, these rates reset new borrowers every july 1, legislation ties rates performance of financial markets. \n interest rate on federal stafford loans go current fixed rate of under 4 percent 4.66 percent loans distributed between july 1 , june 30, 2015. \n read morestudent loan problem easy fix: sen. warren \n graduate students, rate on stafford loans rise on 5 percent 6.21 percent. \n direct plus loans graduates , parents still expensive, rates rising 7.21 percent.\n', '\n increase in monthly federal student loan payments can add quickly, shouldn\'t burdensome students. every $10,000 in loans, new borrowers pay $4 more month based on 10-year repayment period. \n read morewhy millennial women don\'t save retirement \n still, experts warn beginning. \n "federal student loan rates continue increase in next few years , hit maximum rate caps high 10.5 percent loans," said mark kantrowitz, senior vice president , publisher of edvisors.com. \n sophomore student samantha cook, decision go george washington university big 1 financially. says had doubts it. \n "my parents wanted assure me no matter picked, we\'d find way make work," cook said. families, cook , parents making work combining household savings, scholarships , grants—and student loans. 
\n read morecramer: offset high cost of higher education \n despite rising tuition , borrowing costs, cook family decided against samantha transferring in-state university. \n despite debt load taking on, said, "the value of gw degree me @ least more valuable when looking jobs later on." \n —by cnbc\'s sharon epperson \n']
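Even the scoped output still carries "read more" teaser lines and stray whitespace. A rough post-processing sketch for the resulting string is shown below. It uses only the standard library, the names clean_article_text and TEASER_PREFIXES are made up for illustration, and the prefix list is an assumption based on the output above rather than anything the site documents.

```python
import re

# Hypothetical teaser prefixes; adjust to whatever the target site uses.
TEASER_PREFIXES = ("read more",)


def clean_article_text(raw, separator="\r\n\r\n"):
    """Normalize whitespace on each line and drop empty or teaser lines."""
    kept = []
    for line in raw.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue
        if line.lower().startswith(TEASER_PREFIXES):
            continue
        kept.append(line)
    return separator.join(kept)


sample = "\n  First paragraph   of the story. \n Read MoreSome other headline \n Second paragraph.\n"
cleaned = clean_article_text(sample)
```

The default separator matches the "\r\n\r\n" used in substanz. A prefix filter like this would also drop a genuine paragraph that happens to begin with "Read more", so it is best treated as a per-site tweak.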