views:

110

answers:

2

I am trying to convert a 120 MB XML database of terrorist incidents (the first file available for download here: http://wits.nctc.gov/Export.do) to spreadsheet form so I can merge it with other data and do statistical analysis.

So far I have worked with Stata, which is useless now because it won't read XML. The site offers smaller files by month which can be opened via Excel, but Excel does not display them in the form I want, and there ought to be a better way to transform the complete file than opening over a hundred single files, manually saving them as tab-separated, and then merging them.

I am looking for a way to convert the complete WITS.xml file to a spreadsheet where one row represents a single terrorist incident, and no info from the XML should be missing. Even a differently structured XML file is probably fine. I have tried converters, but they are either not free, do not perform the way I want them to, or the file size is too large for them, and I have no idea how to use XSLT. I am studying economics, and my programming knowledge is virtually nonexistent, which is increasingly becoming a drawback. I have seen that there is a package for R that I could use; maybe now is the right moment to start learning R or some other language. However, if there is a quick and easy way to do it, I'd sure prefer it.

Thanks a lot,

Helge

+1  A: 

I have started using an open-source product called Talend Open Studio for this kind of extract/transform/load task. It is a GUI-based code-generation tool that outputs portable Perl or Java, and comes with gazillions of connectors for database and file types.

There is a learning curve; it is not entirely intuitive for some of the more complex tasks. However, I suspect that configuring it to read your XML and output to XLS would be fairly fast and easy.

ewall
This was indeed what I was looking for. It works fine on the small files. However, when I try to read the large file, at some point I receive "Unhandled event loop exception: Java heap space". After increasing memory, as suggested on the Talend forums, the error still persists.
ilprincipe
Hmm, that's frustrating, and surprising since it doesn't sound like your incoming data is so large that it would overflow the heap. Sadly, I have not encountered that and don't know where to start. Hopefully the note that ilprincipe left will help you get it going another way!
ewall
+4  A: 

You seem to need more help with XML concepts in general than with specific R packages and snippets to deal with XML files (although that may come later ;-) ). Also, you may find it preferable to convert the input file to some more palatable format before using it within R, Stata, or other statistical tools.

For illustration purposes, I'm reproducing below the first <Incident> record from the source mentioned in the question. We can assume that other incidents have a similar structure. By looking at the XSD file, we could verify whether the root contains nodes ("records") other than <Incident>, and whether these incidents all have exactly the same structure (or whether, for example, some incident types have an extra node, say <LocalWeatherConditions>, or whether, say, the <FacilityList> node is optional). For the purpose of this discussion it's OK to assume that all incident records have the same general structure.

Your request for a "spreadsheet where one row represents a single terrorist incident, and no info from the xml should be missing" may be hard to achieve because of cardinality issues. This is a fancy way of saying that some sub-elements of the incident records may be repeated. Most of the nodes whose name ends with "List" can typically contain more than one sub-record (BTW, this "List" naming is not an XML rule, merely a convention the custodians of this particular database are using). For example, there could be multiple <CityStateProvince> records, each with its own values for City and StateProvince, or there could be multiple <Victim> records, each with its long list of values.
It is possible to "flatten out" the data into a single row. The general process is one of "denormalization", whereby the single row includes columns with numbered labels:

  ..., City1, StateProv1, City2, StateProv2, City3, StateProv3 ... (btw where do we stop?)

Furthermore, aside from leading to wide records which may exceed the (absolute or practical) limits of the underlying language, this format is very cumbersome for aggregation and statistics in general: say you wish to get counts by StateProv; you now need to instruct the program to "look" into all possible locations where this info is found: "StateProv1", "StateProv2"...

An alternative format, more suitable for statistical treatment, is to export to multiple "spreadsheets", whereby a main spreadsheet contains one row per incident for all the non-repeatable properties of the incident record, and additional spreadsheets contain the "sub-records" that may repeat. These sub-records should include a "key" which can be used to relate them back to the underlying record in the main spreadsheet (probably the ICN, here), and they may also include duplicated info from the main spreadsheet, for example bringing in the IncidentDate, the Assassination flag, etc. The purpose of this denormalization [of another kind] is to make these extra spreadsheets self-sufficient for some of the targeted analysis.
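To make the main-sheet/sub-sheet split concrete, here is a minimal sketch in Python (the answer's advice applies equally to R; an external script is one way to do it). It parses a trimmed-down incident record, modeled on the sample below, and produces one "main" row plus victim sub-rows keyed by ICN. The function name and the chosen columns are my own illustrative assumptions, not part of the WITS format.

```python
import xml.etree.ElementTree as ET

# Default namespace declared in the WITS sample record.
NS = {"w": "http://wits.nctc.gov"}

# Trimmed-down version of the sample <Incident> record, for illustration.
SAMPLE = """<IncidentList xmlns="http://wits.nctc.gov">
  <Incident>
    <ICN>200458431</ICN>
    <IncidentDate>01/01/2004</IncidentDate>
    <VictimList>
      <Victim>
        <VictimType>Civilian</VictimType>
        <DeadCount>10</DeadCount>
        <WoundedCount>45</WoundedCount>
      </Victim>
    </VictimList>
  </Incident>
</IncidentList>"""

def split_incidents(xml_text):
    """Split incidents into a 'main' table (non-repeatable fields) and a
    'victims' table, with the ICN repeated as the join key."""
    root = ET.fromstring(xml_text)
    main_rows, victim_rows = [], []
    for inc in root.findall("w:Incident", NS):
        icn = inc.findtext("w:ICN", namespaces=NS)
        main_rows.append({
            "ICN": icn,
            "IncidentDate": inc.findtext("w:IncidentDate", namespaces=NS),
        })
        for v in inc.findall("w:VictimList/w:Victim", NS):
            victim_rows.append({
                "ICN": icn,  # key back to the main sheet
                "VictimType": v.findtext("w:VictimType", namespaces=NS),
                "DeadCount": v.findtext("w:DeadCount", namespaces=NS),
            })
    return main_rows, victim_rows
```

Each list of dicts can then be written out as its own CSV file; joining the two tables on ICN recovers the original per-incident structure.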

Where to go from there?

  • You need to define the precise format for the spreadsheet(s) to be produced from the XML input.
    You'll likely agree that the numbered-labels approach is impractical, and hence you'll need to look at the input data and decide how you wish to split it (again, possibly replicating some data).
  • You can use R, for example with the XML package, to parse the input into R variables (tables, lists, vectors...).
  • Alternatively, you can (and I think should) use an external program to perform this export of the XML input into tabular form (CSV format and the like), which is more readily ingested by R.
    Although I use the XML package mentioned for small files (and mostly for output purposes), I fear it may be inefficient, bug-prone (you lack the ability to easily inspect the effective input, as can be done with a text file), and generally clumsy.

Luckily, you can soon get over this conversion/import job, and focus on the stats at hand!

A few final pointers:

  • Even if you do not readily understand the XSD language, take a look at the XSD file, in particular the many <xs:enumeration ...> lists, which comprise the bulk of the file, as these will supply you with the factor (in R lingo) values. Of course R can infer these from the data as well, but you can use the info from the enumerations for cross-referencing purposes (to confirm that the data was loaded properly, etc.).

  • It is probably OK to infer the schema from several sample records (people unfamiliar with XML can more readily understand XML data than XSD files). To be sure, however, one needs to read the XSD file.
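Pulling those enumeration lists out of the XSD is itself a small parsing job; here is a sketch of how it might look. The `<xs:schema>` fragment is a made-up miniature standing in for WITS.XSD, and the simpleType name in it is hypothetical; only the `xs:` element names are real XML Schema vocabulary.

```python
import xml.etree.ElementTree as ET

# XML Schema namespace, in ElementTree's Clark notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Made-up miniature XSD fragment; the real WITS.XSD has many such lists.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="EventTypeValues">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Bombing"/>
      <xs:enumeration value="Armed Attack"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>"""

def enumerations(xsd_text):
    """Map each named simpleType to its list of allowed values
    (the candidate factor levels, in R terms)."""
    root = ET.fromstring(xsd_text)
    out = {}
    for st in root.iter(XS + "simpleType"):
        name = st.get("name")
        values = [e.get("value") for e in st.iter(XS + "enumeration")]
        if name and values:
            out[name] = values
    return out
```

The resulting dictionary can be compared against the distinct values actually seen in the converted data, confirming that nothing was mangled during import.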


<IncidentList xmlns="http://wits.nctc.gov" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://wits.nctc.gov WITS.XSD">

<Incident>
   <ICN>200458431</ICN>
   <Subject>10 civilians killed, at least 45 wounded by suspected GAM in Peureulak, Indonesia</Subject>
   <Summary>On 1 January 2004, in Peureulak, Aceh Province, Indonesia, a bomb exploded at a concert, killing ten civilians, wounding 45 others, and causing major damage to the stage area.  Many of the victims were Indonesian teenagers.  Police blamed the Free Aceh Movement (GAM), although the GAM denied responsibility.  No other group claimed responsibility.</Summary>
   <IncidentDate>01/01/2004</IncidentDate>
   <ApproximateDate>No</ApproximateDate>
   <MultipleDays>No</MultipleDays>
   <EventTypeList>
      <EventType>Bombing</EventType>
   </EventTypeList>
   <Assassination>No</Assassination>
   <Suicide>No</Suicide>
   <WeaponTypeList>
       <WeaponType>Explosive</WeaponType>
   </WeaponTypeList>
   <IED>No</IED>
   <Location>
      <Region>East Asia-Pacific</Region>
      <Country>Indonesia</Country>
      <CityStateProvinceList>
         <CityStateProvince>
            <City>Peureulak</City>
            <StateProvince>Aceh</StateProvince>
         </CityStateProvince>
      </CityStateProvinceList>
   </Location>
   <VictimList>
      <Victim>
      <VictimType>Civilian</VictimType>
      <Combatant>No</Combatant>
      <Nationality>Indonesia</Nationality>
      <DefiningCharacteristicList>
         <DefiningCharacteristic>None</DefiningCharacteristic>
      </DefiningCharacteristicList>
      <TargetedCharacteristicList>
         <TargetedCharacteristic>Unknown</TargetedCharacteristic>
      </TargetedCharacteristicList>
      <Indicator>Targeted</Indicator>
      <Child>No</Child>
      <DeadCount>10</DeadCount>
      <WoundedCount>45</WoundedCount>
      <HostageCount>0</HostageCount>
      </Victim>
   </VictimList>
   <FacilityList>
      <Facility>
         <FacilityType>Public Place/Retail</FacilityType>
         <Combatant>No</Combatant>
         <Nationality>Indonesia</Nationality>
         <DefiningCharacteristicList>
         <DefiningCharacteristic>None</DefiningCharacteristic>
         </DefiningCharacteristicList>
         <TargetedCharacteristicList>
         <TargetedCharacteristic>Unknown</TargetedCharacteristic>
         </TargetedCharacteristicList>
         <Indicator>Targeted</Indicator>
         <Damage>Light</Damage>
         <Quantity>1</Quantity>
      </Facility>
   </FacilityList>
   <PerpetratorList>
      <Perpetrator>
         <Nationality>Indonesia</Nationality>
         <Characteristic>Secular/Political/Anarchist</Characteristic>
      </Perpetrator>
   </PerpetratorList>
</Incident>
[...]
</IncidentList>
mjv
Thanks very much for the detailed explanation. I had thought of the repetition issue, but considered it to be a minor issue, if any at all. The program mentioned below seems suited to exporting to CSV, but apparently the size of the file is too much for it. I'll probably have to resort to converting the monthly files. Thanks a lot!
ilprincipe