I have a document stored as one large String. The String contains some inline XML tags, and I want to extract the words between those tags. The documents may also contain HTML tags, as they are often web pages.
Example Document:
"< tr > My name is < b >< PERSON >Bobby< /PERSON >< /b >, I live in the USA."
Current RegEx:
Pattern p = Pattern.compile("<(LOCATION|PERSON|ORGANIZATION)>[\\w[ '\"/\\!%$\\(\\)\\-\\+]]*</(LOCATION|PERSON|ORGANIZATION)>");
Matcher m = p.matcher("I'm <PERSON>Graham Brown</PERSON> I went to the <LOCATION>USA'S</LOCATION>");
while (m.find()) {
    System.out.println(m.group());
}
Result = < PERSON >Bobby< /PERSON > < LOCATION >USA< /LOCATION >
This works for most punctuation and grammar, but the regex should allow any sequence of characters between the tags. When I try using '.' (any character), as below, it returns the whole String.
"< tr > My name is < b >< PERSON >Bobby< /PERSON >< /b >, I live in the USA."
Pattern p = Pattern.compile("<(LOCATION|PERSON|ORGANIZATION)>.*</(LOCATION|PERSON|ORGANIZATION)>");
How do I return any characters between the opening and closing angle-bracket tags?
EDIT: Thanks for your responses and for helping me get to the correct answer. For clarification, I have marked named entities using NER (named-entity recognition). If you are unaware of what this is, please see some of the papers I have referenced at the bottom.
All I am interested in is getting the text between the three pairs of opening and closing tags. There are no other tags, the documents are not XML files, and I am neither parsing the HTML tags nor interested in them. I only want to parse the XML tags that I have created, hence I thought a regex would be the simplest way to do so.
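To make the goal concrete, here is a minimal sketch of the kind of extraction described above. It assumes the entity tags are never nested inside one another and are written without spaces inside the angle brackets. The class name EntityTagExtractor and the method extract are made up for illustration. The sketch uses a reluctant quantifier (.*?) so that matching stops at the first closing tag, and a backreference (\1) so that the closing tag must have the same name as the opening tag. This is one possible approach, not necessarily the best one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class EntityTagExtractor {

    // Group 1 captures the tag name, group 2 the text between the tags.
    // ".*?" is reluctant, so it stops at the first matching closing tag.
    // "\\1" requires the closing tag name to equal the opening tag name.
    // DOTALL lets '.' match line terminators as well (an assumption that
    // entity text may span lines in a large document).
    private static final Pattern ENTITY = Pattern.compile(
            "<(LOCATION|PERSON|ORGANIZATION)>(.*?)</\\1>", Pattern.DOTALL);

    // Returns one "TAG: text" entry per tagged entity, in document order.
    public static List<String> extract(String document) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(document);
        while (m.find()) {
            found.add(m.group(1) + ": " + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String doc = "<tr> My name is <b><PERSON>Bobby</PERSON></b>, "
                + "I live in the <LOCATION>USA'S</LOCATION>.";
        for (String entity : extract(doc)) {
            System.out.println(entity);
        }
    }
}
```

Using m.group(2) instead of m.group() returns only the text between the tags, which is what I am after; the surrounding HTML tags such as <tr> and <b> are simply ignored because they never match the three tag names.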
Papers to be added later...