You seem to need more help with XML concepts, in general, than with specific R packages and snippets to deal with XML files (although this may come later ;-) ). Also, you may find it preferable to convert the input file to some more palatable format before using it within R, Stata or other statistical tools.
For illustration purposes, I'm reproducing below the first <incident>
record from the source mentioned in the question. We can assume that other incidents will have a similar structure.
By looking at the DTD file, we could assert whether the root contains other nodes ("records") than <incidents>
and whether these incidents have exactly the same structure (or if for example some incident types may have say an extra, say, <LocalWeatherConditions>
node, or if, say, the <facilityList>
node is optional). For the purpose of this discussion it's OK to assume that all incidents records have the same general structure.
Your request of a "spreadsheet where one row represents a single terrorist incident, and no info from the xml should be missing" may be hard to achieve because of cardinality issues. This is a fancy way to say that some sub-elements of the incident records may be repeated. For example most of the nodes which name ends with "List" can typically contain more than one sub-record (BTW this "List in the name" thing is not an XML rule, merely a convention the custodians of this particular database are using). For example, there could be multiple <CityStateProvince>
records, each with its own values the for City and StateProvince, or there could be multiple ` records, each with its long list of values.
It is possible to "flatten out" the data, into a single row. The general process is one of "denormalization", whereby the single row includes columns with numbered labels:
..., City1, StateProv1, City2, StateProv2, City3, StateProv3 ... (btw where do we stop?)
Furthermore, aside from leading to wide records which possibly exceed the (absolute or practical) limits of the underlying language, this format is very cumbursome with regards to aggregating and performing statistics at large: Say you wish to get counts by StateProv: you now need to instruct the program to "look" into all possible locations where this info is found: "StateProv1", "StateProv2"...
An alternative format, more suitable for statistical treatment, is to export to multiple "spreadsheets". whereby a main spreadsheet contains one row per incident for all the non repeatable properties of the incident record, and additional spreadsheets contain the "sub-records" that may repeat. These sub-records should include a "key" which can be used to relate to the underlying record in the main spreadsheet (probably the ICN, here), and they may also include duplicated info from the main spreadsheet, for example bringing in the IncidateDate, the Assanination flag etc. The purpose of this denormalization [of another kind] is to possibly make these extra spreadsheet self sufficient for some of the targeted analysis.
Where to go from there?
- You need to define the precise format for the speadsheet(s) to be produced from the XML input.
You'll likely agree with the fact that the numbered-labels approach is impractical and hence you'll need to look at the input data and see how you wish to split it (again with ability to replicate data).
- You can use R for example this XML Package to parse the input into R variables (table, lists, vectors...)
- Alternatively, you can (I think should), use an external program, to perform this export of the XML input into tabular form (CSV format and the like), which is more readily ingested by R.
Although I use the XML package mentioned, for small files (and mostly for output purposes), I fear it may be inefficient, bug prone (you lack the ability to inspect, easily, the effective input, as can be done with a text file), and generally clumsy.
Luckily, you can soon get over this conversion/import job, and focus on the stats at hand!
A few final pointers:
Even if you do not readily understand the DTD language, take a look at the XTD file, in particular the many <xs:enumeration ...>
lists, which comprise the bulk of the file, as these will supply you the factor (in R lingo) values. Of course R can infer these as well, from the data, but you can use the info from the enumerations for cross referencing purposes (to confirm that the data was a priori loaded properly etc.)
It is probably ok to infer the schema from several record samples (people unfamiliar with XML can more readily understand XML data than XSD files). To be sure however one needs to read the XSD file.
<IncidentList xmlns="http://wits.nctc.gov" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://wits.nctc.gov WITS.XSD">
<Incident>
<ICN>200458431</ICN>
<Subject>10 civilians killed, at least 45 wounded by suspected GAM in Peureulak, Indonesia</Subject>
<Summary>On 1 January 2004, in Peureulak, Aceh Province, Indonesia, a bomb exploded at a concert, killing ten civilians, wounding 45 others, and causing major damage to the stage area. Many of the victims were Indonesian teenagers. Police blamed the Free Aceh Movement (GAM), although the GAM denied responsibility. No other group claimed responsibility.</Summary>
<IncidentDate>01/01/2004</IncidentDate>
<ApproximateDate>No</ApproximateDate>
<MultipleDays>No</MultipleDays>
<EventTypeList>
<EventType>Bombing</EventType>
</EventTypeList>
<Assassination>No</Assassination>
<Suicide>No</Suicide>
<WeaponTypeList>
<WeaponType>Explosive</WeaponType>
</WeaponTypeList>
<IED>No</IED>
<Location>
<Region>East Asia-Pacific</Region>
<Country>Indonesia</Country>
<CityStateProvinceList>
<CityStateProvince>
<City>Peureulak</City>
<StateProvince>Aceh</StateProvince>
</CityStateProvince>
</CityStateProvinceList>
</Location>
<VictimList>
<Victim>
<VictimType>Civilian</VictimType>
<Combatant>No</Combatant>
<Nationality>Indonesia</Nationality>
<DefiningCharacteristicList>
<DefiningCharacteristic>None</DefiningCharacteristic>
</DefiningCharacteristicList>
<TargetedCharacteristicList>
<TargetedCharacteristic>Unknown</TargetedCharacteristic>
</TargetedCharacteristicList>
<Indicator>Targeted</Indicator>
<Child>No</Child>
<DeadCount>10</DeadCount>
<WoundedCount>45</WoundedCount>
<HostageCount>0</HostageCount>
</Victim>
</VictimList>
<FacilityList>
<Facility>
<FacilityType>Public Place/Retail</FacilityType>
<Combatant>No</Combatant>
<Nationality>Indonesia</Nationality>
<DefiningCharacteristicList>
<DefiningCharacteristic>None</DefiningCharacteristic>
</DefiningCharacteristicList>
<TargetedCharacteristicList>
<TargetedCharacteristic>Unknown</TargetedCharacteristic>
</TargetedCharacteristicList>
<Indicator>Targeted</Indicator>
<Damage>Light</Damage>
<Quantity>1</Quantity>
</Facility>
</FacilityList>
<PerpetratorList>
<Perpetrator>
<Nationality>Indonesia</Nationality>
<Characteristic>Secular/Political/Anarchist</Characteristic>
</Perpetrator>
</PerpetratorList>
</Incident>
[...]
</IncidentList>