I have an XML file that contains hundreds of documents, each in its own <DOC> block. A block looks like this:
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<PARENT> FR940104-2-00001 </PARENT>
<TEXT>
<!-- PJG FTAG 4703 -->
<!-- PJG STAG 4703 -->
<!-- PJG ITAG l=90 g=1 f=1 -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
<!-- PJG ITAG l=90 g=1 f=1 -->
/ Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=01 g=1 f=1 -->
Vol. 59, No. 2
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG ITAG l=02 g=1 f=1 -->
Tuesday, January 4, 1994
<!-- PJG 0012 frnewline -->
<!-- PJG 0012 frnewline -->
<!-- PJG /ITAG -->
<!-- PJG /STAG -->
<!-- PJG /FTAG -->
</TEXT>
</DOC>
I want to load this XML file into a dictionary called Text, with the DOCNO as the key and the text inside the <TEXT> tags as the value. The value should not contain any of the comments. For example, Text['FR940104-2-00001']
should contain: Federal Register / Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices Vol. 59, No. 2 Tuesday, January 4, 1994
This is the code I wrote:
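from xml.dom import Node

# doc is assumed to be the already-parsed xml.dom.minidom document
docno = []
Text = {}

# collect every DOCNO value, in document order
L = doc.getElementsByTagName("DOCNO")
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            docno.append(node3.data)
            #print node2.data

# pair each TEXT element with the DOCNO at the same position
L = doc.getElementsByTagName("TEXT")
i = 0
for node2 in L:
    for node3 in node2.childNodes:
        if node3.nodeType == Node.TEXT_NODE:
            Text[docno[i]] = node3.data
    i = i+1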
Surprisingly, with this code Text['FR940104-2-00001'] comes out as u'\n'.
Why does that happen, and how can I get the result I want?
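
For reference, here is a minimal sketch of the behaviour I am after, under some assumptions. Every comment inside <TEXT> splits the content into separate text nodes, and the second loop assigns each one to the same key, so only the last node (the newline after the final comment) survives. The sketch instead joins all text nodes of an element and collapses the whitespace. The helper names node_text and load_docs, the shortened SAMPLE string, and the <ROOT> wrapper element are made up for illustration and are not part of my real file.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Tiny stand-in for the real file. The <ROOT> wrapper is an assumption:
# a well-formed XML document needs a single top-level element.
SAMPLE = """<ROOT>
<DOC>
<DOCNO> FR940104-2-00001 </DOCNO>
<TEXT>
<!-- PJG ITAG l=90 g=1 f=4 -->
Federal Register
<!-- PJG /ITAG -->
/ Vol. 59, No. 2 / Tuesday, January 4, 1994 / Notices
<!-- PJG 0012 frnewline -->
</TEXT>
</DOC>
</ROOT>"""


def node_text(element):
    """Join every direct text child of element, skipping comment nodes."""
    parts = [child.data for child in element.childNodes
             if child.nodeType == Node.TEXT_NODE]
    # split() followed by join() collapses the newlines left between comments
    return " ".join("".join(parts).split())


def load_docs(dom):
    """Return {DOCNO: cleaned TEXT} for every <DOC> element in dom."""
    result = {}
    for doc_el in dom.getElementsByTagName("DOC"):
        # Look up DOCNO and TEXT inside the same <DOC>, so the two
        # values cannot drift out of step the way a shared index can.
        docno = node_text(doc_el.getElementsByTagName("DOCNO")[0])
        text = node_text(doc_el.getElementsByTagName("TEXT")[0])
        result[docno] = text
    return result


Text = load_docs(parseString(SAMPLE))
print(Text["FR940104-2-00001"])
```

This only looks at direct text children, so it would miss text nested in child elements of <TEXT>, and it assumes every <DOC> has exactly one DOCNO and one TEXT.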