C# - How to get text from HTML nodes and solve character encoding issue?
I'm trying to get the InnerText from the page http://www.hurriyet.com.tr/yazarlar/22933964.asp with HtmlAgilityPack. The HTML structure is:
<div class="detailtext">
<span class="yzrarticledate">30 mart 2014</span>
<h1 class="yazararticletitle">31 mart sabahı için acil ihtiyaç listesi</h1>
<p></p><p><p >akıl.<br />sağduyu.<br />barış.<br /> Özgürlük.<br />kardeşlik.<br />vicdan.<br />huzur.............
And my current code:
    string htmlcontent = getsource(s);
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(htmlcontent);
    var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailtext']").InnerText;
The problem is that this also returns the heading and the date, I mean "30 mart 2014" and "31 mart sabahı için acil ihtiyaç listesi".
I only want the part that begins with
<p></p><p><p >akıl.<br />
I tried different variations:
    var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailtext']").InnerHtml;
    var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailtext']").NextSibling.NextSibling.InnerText;
    var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailtext']").LastSibling.InnerText;
My second question: if I manage to get the text, I will be faced with a character encoding problem. How can I fix this?
The easiest solution is to remove the nodes you don't want and then read InnerHtml/InnerText. Removing nodes is covered in "Remove html node from htmldocument: HtmlAgilityPack".
    var noa = document.DocumentNode.SelectSingleNode("*//div[@class='detailtext']");
    noa.RemoveChild(noa.SelectSingleNode("span"));
    // remove the rest too...
    var result = noa.InnerText;
There should be no encoding problem unless the site reports an invalid encoding, because C# strings are Unicode (UTF-16).
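On the encoding question: if the Turkish characters still come out garbled, the usual cause is that the response bytes were decoded with a different charset than the one the page was actually written in. Below is a minimal sketch of decoding the raw bytes explicitly before handing the string to LoadHtml. PageDecoder and Decode are hypothetical names introduced here (they are not part of HtmlAgilityPack), and the charset name you pass in is an assumption you would need to check against the page's Content-Type header or meta tag.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HtmlAgilityPack: decode raw response
// bytes with an explicitly chosen charset instead of relying on a default.
public static class PageDecoder
{
    // charsetName is assumed to be the charset the page really uses,
    // for example "utf-8". Encoding.GetEncoding throws ArgumentException
    // if the name is not known to the runtime.
    public static string Decode(byte[] rawBytes, string charsetName)
    {
        if (rawBytes == null) throw new ArgumentNullException(nameof(rawBytes));

        Encoding encoding = Encoding.GetEncoding(charsetName);
        return encoding.GetString(rawBytes);
    }
}
```

To use it, download the page as bytes (WebClient.DownloadData returns a byte[]), call PageDecoder.Decode(bytes, "utf-8"), and pass the result to document.LoadHtml. Decoding the same bytes with the wrong charset is what produces garbled output for characters like "Ö" or "ı". On .NET Core and later, code-page charsets such as windows-1254 (a common Turkish code page) are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).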