I am poking at XBRL documents, trying to get my head around how to extract and use the data effectively. One thing I have been struggling with is making sure I use the context information correctly. Below is a snippet from one of the documents I am playing with (this is from Mattel's latest 10-K).

I want to be able to collect the context key-value pairs efficiently, as they are important for aligning the 'real' data. Here is an example of a context element:

 <context id="eol_PE6050----0910-K0010_STD_0_20091231_0">
   <entity>
     <identifier scheme="http://www.sec.gov/CIK">0000063276</identifier>
   </entity>
   <period>
     <instant>2009-12-31</instant>
   </period>
 </context>

When I started this, I thought that if there was a parent-child relationship I should be able to get the attributes, keys, values and text of all the children directly by applying a method (?) to the parent. But the children retain their independence, even though they can be found from the parent. What I mean is that if the children have attributes, keys, values and/or text, those cannot be accessed directly from the parent. Instead, you have to identify the children and then access the data or metadata you need from each child.
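To make that observation concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree (lxml supports the same calls). The XML string is the context element from above, inlined so the example is self-contained; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The context element from the question, inlined for a self-contained demo.
context = ET.fromstring("""
<context id="eol_PE6050----0910-K0010_STD_0_20091231_0">
  <entity>
    <identifier scheme="http://www.sec.gov/CIK">0000063276</identifier>
  </entity>
  <period>
    <instant>2009-12-31</instant>
  </period>
</context>
""")

# The parent only exposes its OWN attributes and text.
print(context.attrib)       # just the 'id' attribute
print(repr(context.text))   # whitespace only, no child data

# Data that lives on descendants has to be fetched from the descendants.
# find()/findtext() accept a simple path, so grandchildren are reachable
# in one call without looping over the intermediate elements.
identifier = context.find("entity/identifier")
scheme = identifier.get("scheme")
cik = identifier.text
instant = context.findtext("period/instant")
print(scheme, cik, instant)
```

Note that real XBRL instance documents are usually namespaced, so the paths would likely need namespace-qualified tags; the plain tags here follow the snippet as displayed.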

I am not fully certain why this block of code is a good starting point:

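 from lxml import etree
 # parse the XBRL instance document into an element tree
 test_tree = etree.parse(r'c:\temp\test_xml\mat-20091231.xml')
 # collect every element in the document, in document order
 tree_list = [p for p in test_tree.getiterator()]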

So my tree_list is a list of all the elements found in my XML file.
Because there were only 664 items in my tree_list, I made the very bad assumption that all of the elements within a parent were subsumed in the parent, so I kept trying to access the entity, period and instant by referencing just the context elements (not their children).

contextlist = []
for each in tree_list:
    if 'context' in each.tag:
        contextlist.append(each)

That is, I kept applying different methods to the items in the contextlist and got really frustrated. Finally, while I was writing out this question to get some help figuring out what method would give me the entity and period, I just decided to try

children=[c for c in contextlist[0].iterchildren()]

So my list children has all of the children of the first item in my contextlist.

One of the children is the entity element; the other is the period element.

Now, each of those children should have a child of its own: the entity element has an identifier child element and the period element has an instant child element. This is getting much more complicated than it seemed this morning.

I have to know the details that are reported by the context elements to correctly evaluate and operate on the real data. It seems like I have to test each of the children of the context elements. Is there a faster, more efficient way to get those values? Rephrased: given some element, is there a way to create a data structure that contains all of its children, grandchildren and so on, without having to write a fair number of try/else statements?

Once I have them I can start building a data dictionary and assign data elements to particular entries based on the context. So getting the context elements efficiently and completely is critical to my task.

+1  A: 

Using the element-tree interface (which lxml also supports), getiterator iterates over all the nodes in the subtree rooted at the current element.

So, [list(c.getiterator()) for c in contextlist] gives you the list of lists you want (or you may want to keep c in the resulting list to avoid having to zip it with contextlist later, i.e. directly make a list of tuples [(c, list(c.getiterator())) for c in contextlist], depending on your intended use).
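As a rough sketch of how that subtree iteration could feed the data dictionary the question asks about: the example below uses the standard library's xml.etree.ElementTree, where the method is spelled iter() (getiterator was removed from the standard library in Python 3.9; lxml accepts iter() as well). The wrapping xbrl root, the helper names local_name and flatten_context, and the choice to key values by tag name are all assumptions made for illustration, not anything XBRL or lxml prescribes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document wrapping the context element from the question.
SAMPLE = """
<xbrl>
  <context id="eol_PE6050----0910-K0010_STD_0_20091231_0">
    <entity>
      <identifier scheme="http://www.sec.gov/CIK">0000063276</identifier>
    </entity>
    <period>
      <instant>2009-12-31</instant>
    </period>
  </context>
</xbrl>
"""

def local_name(tag):
    """Drop a leading '{namespace}' from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]

def flatten_context(context):
    """Map the tag of each descendant that carries text to that text."""
    values = {}
    for node in context.iter():          # walks the whole subtree, any depth
        text = (node.text or "").strip()
        if node is not context and text:
            values[local_name(node.tag)] = text
    return values

root = ET.fromstring(SAMPLE)
contexts = {
    c.get("id"): flatten_context(c)
    for c in root.iter()
    if local_name(c.tag) == "context"
}
print(contexts)
```

One design caveat: because the inner dictionary is keyed by tag name, a context that repeats a tag would keep only the last value, so a list of (tag, text) pairs may be the safer structure for such documents.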

Note in passing that a listcomp of the exact form [x for x in whatever] never makes much sense -- use list(whatever), instead, to turn whatever other iterable into a list.
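A tiny illustration of that last point, again using the standard library's xml.etree.ElementTree and a cut-down, hypothetical context element. Iterating an element directly yields its immediate children, so both spellings build the same list:

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical context element used only for this comparison.
context = ET.fromstring(
    "<context><entity/><period><instant>2009-12-31</instant></period></context>"
)

via_listcomp = [c for c in context]   # works, but the comprehension adds nothing
via_list = list(context)              # same result, stated directly

# A comprehension earns its keep once it transforms or filters the items.
child_tags = [c.tag for c in context]
print(via_listcomp == via_list, child_tags)
```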

Alex Martelli
Thanks again, Alex. Maybe I can get your direct line? Cheers
PyNEwbie
@PyNewbie, oh, I'm pretty active on SO as you may have noticed;-)
Alex Martelli