regex - Using multiple regexes to capture matching nested XML tags
Suppose I have an XML file that contains tags nested inside themselves, e.g.
<tag>one<tag>two</tag>one</tag>
From this page, I have two examples of regular expressions that don't match the string correctly; e.g. I get
<tag>one<tag>two</tag>
which is not balanced. According to Google, it's not possible to find a regex that parses HTML correctly, e.g. here or here.
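To make the failure concrete, here is a minimal Ruby sketch. The pattern is my own illustration, not necessarily one of the two from the linked page; it shows how a lazy quantifier stops at the first closing tag it sees:

```ruby
xml = "<tag>one<tag>two</tag>one</tag>"

# Lazy quantifier (.*?): the match ends at the FIRST </tag>,
# so the outer element is cut short and the result is unbalanced.
lazy_match = xml[/<tag>.*?<\/tag>/]

puts lazy_match
```

The result contains two opening tags but only one closing tag, which is exactly the unbalanced fragment shown above.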
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag, which is not possible with regexps.
Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, but that will not work on every condition. It should be possible to present an HTML file that will be matched wrongly by any regular expression.
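As a small illustration of "heuristics that do not work on every condition", here is a Ruby sketch with a greedy pattern (an illustrative pattern of my own, not taken from the quoted answers). Greedy matching copes with an element nested inside itself, but it is fooled by two sibling elements:

```ruby
siblings = "<tag>a</tag> and <tag>b</tag>"

# Greedy quantifier (.*): the match runs to the LAST </tag>,
# so two separate sibling elements are swallowed as one.
greedy_body = siblings[/<tag>(.*)<\/tag>/, 1]

puts greedy_body
```

The captured body includes the closing tag of the first element and the opening tag of the second, so the heuristic that fixes nesting breaks on siblings.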
That's a nice, clear-cut theoretical answer, but it got me thinking: would it be possible to do this programmatically, using multiple regexes and/or loops?
Here's a simple recursive descent XML parser. I'm making it up right now, so it's rough and ready, and I'm writing it in Ruby since you didn't specify a language. Do not use it in production (or anywhere, really; it's just for curiosity's sake):
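    string = "<tag>one<other_tag>two</other_tag>one</tag>"

    # Declared up front so the lambda below can close over it;
    # the real definition follows further down.
    make_matches = nil

    regex_xml_parser = -> string {
      string = string.dup # work on a copy instead of mutating the caller's string
      stuff_before = []
      matches = []
      stuff_after = []
      while string =~ />/
        # Peel off any text before the first tag.
        stuff_before << string[/^[^<]*/]
        string.sub!(/^[^<]*/, '')
        # Grab the outermost <name>...</name> pair; the greedy (.*) keeps nested tags inside.
        match = string.match(/<([^>]+)>(.*)<\/\1>/)
        break unless match # unbalanced input: stop instead of looping forever
        matches << match
        string.sub!(/<([^>]+)>(.*)<\/\1>/, '')
        # Peel off any text after the last tag.
        stuff_after << string[/[^>]*$/]
        string.sub!(/[^>]*$/, '')
      end
      values = stuff_before + stuff_after + [string]
      matching_nodes = matches.map { |match| make_matches[match] }
      { values: values.reject { |x| x == "" }, nodes: matching_nodes }
    }

    # Recurse into the body of a matched element, keyed by its tag name.
    make_matches = -> match_item {
      { match_item[1] => regex_xml_parser[match_item[2]] }
    }

    regex_xml_parser[string]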
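The "regexes plus a loop" idea from the question can also be sketched without recursion: use one regex only to find the next opening or closing tag, and let a plain counter do the balancing that the regex cannot. This is a rough sketch under some assumptions: `first_balanced_element` is a hypothetical helper name, and it only handles bare tags with no attributes, comments, or self-closing forms.

```ruby
# Hypothetical helper: returns the first balanced <name>...</name> element
# found in text, or nil if the tags never balance.
def first_balanced_element(text, name)
  tag_pattern = /<(\/?)#{Regexp.escape(name)}>/ # matches <name> or </name>
  depth = 0
  start = nil
  offset = 0
  while (m = tag_pattern.match(text, offset))
    if m[1].empty?              # opening tag
      start ||= m.begin(0)      # remember where the outermost element starts
      depth += 1
    elsif depth > 0             # closing tag that pairs with an open one
      depth -= 1
      return text[start...m.end(0)] if depth.zero?
    end                         # stray closing tags (depth 0) are skipped
    offset = m.end(0)
  end
  nil # ran out of input with tags still open
end

puts first_balanced_element("<tag>one<tag>two</tag>one</tag>", "tag")
```

Because the counter tracks depth, the same helper returns only the first element for sibling input such as `<tag>a</tag> and <tag>b</tag>`, and `nil` for the unbalanced `<tag>one<tag>two</tag>`.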
Remember, we're building a parser here; I think it goes without saying that using a parser that already exists is easier.