tags:

views:

45

answers:

3

Hi, I need help with Python programming: I need a command that can find all the words between tags in a text file. For example, the text file contains <concept> food </concept>. I need to find all the words between <concept> and </concept> and display them. Can anybody help, please?

+2  A: 
  1. Load the text file into a string.
  2. Search the string for the first occurrence of <concept> using pos1 = s.find('<concept>')
  3. Search for </concept> using pos2 = s.find('</concept>', pos1)

The words you seek are then s[pos1+len('<concept>'):pos2]
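
The steps above find only the first occurrence. A minimal sketch of repeating them in a loop to collect every occurrence could look like this. The function name extract_between and the sample string are made up for illustration, and the code assumes the tags appear exactly as written (no attributes, no whitespace inside the tag itself):

```python
def extract_between(s, open_tag="<concept>", close_tag="</concept>"):
    """Return every substring of s enclosed by open_tag and close_tag."""
    results = []
    pos = 0
    while True:
        pos1 = s.find(open_tag, pos)
        if pos1 == -1:
            break  # no more opening tags
        start = pos1 + len(open_tag)
        pos2 = s.find(close_tag, start)
        if pos2 == -1:
            break  # opening tag without a matching closing tag
        # strip() drops the spaces around the words, e.g. " food " -> "food"
        results.append(s[start:pos2].strip())
        pos = pos2 + len(close_tag)  # continue searching after this pair
    return results


sample = "a <concept> food </concept> b <concept>drink</concept> c"
print(extract_between(sample))  # ['food', 'drink']
```

To run it on a file, read the file into a string first (for example with open(...).read()) and pass that string as s.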

Aaron Digulla
This method does not account for comments or for tags containing whitespace, if the question's author means XML.
nailxx
+1 for simplicity
jensgram
+1  A: 

Have a look at regular expressions. http://docs.python.org/library/re.html

If you want to extract the contents of, for example, the <i> tag, try

import re

text = "text to search. <i>this</i> is the word and also <i>that</i> end"
# Collect the text inside every <i>...</i> pair
print(re.findall(r"<i>(.*?)</i>", text))

Here's a short explanation of how findall works: it searches the given string for all non-overlapping matches of a regular expression and returns them as a list. The regular expression here is <i>(.*?)</i>:

  • <i> matches the literal opening tag <i>
  • (.*?) creates a group that matches as few characters as possible (the ? makes it non-greedy), so it stops at the first closing tag
  • </i> matches the literal closing tag
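
To see why the ? in (.*?) matters, the following sketch compares the non-greedy pattern with its greedy counterpart on the same sample text. The variable names lazy and greedy are just for illustration:

```python
import re

text = "text to search. <i>this</i> is the word and also <i>that</i> end"

# Non-greedy: each match stops at the nearest closing tag.
lazy = re.findall(r"<i>(.*?)</i>", text)

# Greedy: a single match that runs on to the last closing tag.
greedy = re.findall(r"<i>(.*)</i>", text)

print(lazy)    # ['this', 'that']
print(greedy)  # ['this</i> is the word and also <i>that']
```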

Note that the above solution does not match something like

<i> here's a line
break </i>

because . does not match a newline by default. That should not matter, since you just wanted to extract words.

However, it is of course possible to match across line breaks as well, by passing the re.DOTALL flag:

re.findall(r"<i>(.*?)</i>", text, re.DOTALL)
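
Applied to the <concept> tags from the question, a minimal sketch might look like this. The sample string is made up, and collapsing the whitespace with split/join is an extra step assumed here for tidier output:

```python
import re

# Hypothetical file contents: the second tag pair spans a line break.
text = "<concept> food </concept>\n<concept> fast\nfood </concept>"

pattern = r"<concept>(.*?)</concept>"

# Without re.DOTALL, "." stops at the newline, so the second pair is missed.
without_flag = re.findall(pattern, text)

# With re.DOTALL, "." also matches newlines.
with_flag = re.findall(pattern, text, re.DOTALL)

# Collapse runs of whitespace (including the newline) into single spaces.
cleaned = [" ".join(match.split()) for match in with_flag]

print(without_flag)  # [' food ']
print(cleaned)       # ['food', 'fast food']
```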
phimuemue
+2  A: 

There is a great library for traversing HTML/XML named BeautifulSoup. With it:

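from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

# 'html.parser' ships with Python; 'xml' can be used instead if lxml is installed
with open('myfile.xml', 'rt') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
for t in soup.find_all('concept'):
    print(t.string)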
nailxx