I need a regular expression that matches APA format references.

I currently have this:

/([A-Z][a-zA-Z\-\:\'\s\´]{3,}\, ([a-zA-Z]\.[\s|,|.]| &?){1,}){1,}\(\d\d\d\d(, [A-Z][a-z\- ]*\d\d?|)\)\.[a-zA-Z\-\:\'\s]{3,}\.[a-zA-Z\-\s]+\,[ ]*\d\d(\(\S\))*,\d+.\d+./

It only catches 10 of the references and is fragile as hell.

I only need journal articles, not books or non-English articles.

Any tips on how to make this regex more manageable would be appreciated.

I built it using Rubular.

This is the source data (I know about the missing spaces and international character issues):

Bre´dart, S., Valentine, T., Calder, A., & Gassi, L. (1995). An interactiveactivation model of face naming.Quarterly Journal of ExperimentalPsychology, 48(A),466–486.Bruce, V., & Young, A. (1986). Understanding face recognition.BritishJournal of Psychology, 77,305–327.Burton, A. M., & Bruce, V. (1992). I recognize your face but I can’tremember your name: A simple explanation?British Journal of Psy-chology, 83,45–60.Flude, B., Ellis, A., & Kay, J. (1990). Face processing and name retrievalin an anomic aphasic: Names are stored separately from semanticinformation about people.Brain and Cognition, 11,60–72.Gratton, G., Coles, M. G. H., Sirevaag, E. J., Eriksen, C. W., & Donchin,E. (1988). Pre- and poststimulus activation of response channels: Apsychophysiological analysis.Journal of Experimental Psychology: Hu-man Perception and Performance, 14,331–344.Hodges, J. R., & Greene, J. D. W. (1998). Knowing about people andnaming them: Can Alzheimer’s disease patients do one without theother?Quarterly Journal of Experimental Psychology, 51(A),121–134.Huynh, H., & Feldt, L. S. (1976). Estimation of the box correction fordegrees of freedom from sample data in the randomized block andsplit-plot designs.Journal of Educational Statistics, 1,69–82.Jasper, H. H. (1958). Report of the committee on methods of clinicalexamination in electroencephalography.Electroencephalography andClinical Neurophysiology, 10,370–375.Johnston, R. A., & Bruce, V. (1990). Lost properties? Retrieval differencesbetween name codes and semantic codes for familiar people.Psycho-logical Research 52,62–67.Kornhuber, H. H., & Deecke, L. (1965). Hirnpotentialaenderungen beiWillkuerbewegungen und passiven Bewegungen des Menschen: Be-reitschaftspotential und reafferente Potentiale [Brain potential changesfor voluntary and passive movements in humans: Readiness potentialand afferent potentials].Pfluegers Archiv fuer die Gesamte Physiologie,284,1–17.Kutas, M., & Donchin, E. (1974, November 8). 
Studies of squeezing:Handedness, responding hand, response force, and asymmetry of readi-ness potential.Science, 186,545–547.Kutas, M., & Donchin, E. (1980). Preparation to respond as manifested bymovement-related brain potentials.Brain Research, 202,95–115

Examples of book references that mess up mletterle's answer:

Lippold, O. C. J. (1967). Electromyography. In P. H. Venables & I. Martin
(Eds.), A manual of psychophysiological methods (pp. 245–298). Amsterdam:
North-Holland.
Low, K. A., & Miller, J. (1999). The usefulness of partial information:
Effects of go probability in the choice/nogo task. Psychophysiology, 36,
288–297.
+9  A: 

This regex should do what you want:

([^\.].*?[0-9])(?=\.|\Z)

It uses a positive lookahead to check for a digit followed by a period (or the end of the string), so the periods themselves are excluded from the captures. You can see the result here: http://www.rubular.com/regexes/6293
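For anyone who wants to try this outside Rubular, here is a minimal Ruby sketch of the same regex applied with `String#scan`. The sample text is two references copied from the question's data (missing spaces included); the variable names are only illustrative and are not part of the original answer.

```ruby
# Regex from the answer, unchanged: capture lazily up to a digit that is
# followed by a period or by the end of the string.
pattern = /([^\.].*?[0-9])(?=\.|\Z)/

# Two references from the question's source data, run together as in the original.
text = "Bruce, V., & Young, A. (1986). Understanding face recognition." \
       "BritishJournal of Psychology, 77,305–327." \
       "Huynh, H., & Feldt, L. S. (1976). Estimation of the box correction " \
       "fordegrees of freedom from sample data in the randomized block " \
       "andsplit-plot designs.Journal of Educational Statistics, 1,69–82."

# scan returns one array per match (one element per capture group),
# so flatten gives a plain list of reference strings.
references = text.scan(pattern).flatten

references.each_with_index do |ref, i|
  puts "#{i + 1}: #{ref}"
end
```

Each match ends on the last page number, because the year is followed by `)` rather than a period and so never satisfies the lookahead.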

mletterle
So much better than what I was preparing. Nice!
m104
I tried making a regex that would extract the various parts of the string into match-groups correctly, but I guess the format is just too ambiguous and error-prone (typos, etc.) to get a regex to do it. I guess yours is as close as it gets. +1
Tomalak
Your regex allowed me to understand the question :)
Lieven
Sweet. I knew there was a more elegant way. I just didn't realize I was so far off-base and over-complicated. Fortunately, my work won't be wasted because I'll still have to use the pieces I put together to pull apart the individual elements of each reference. Thanks.
srboisvert
It does fail in some cases. Unfortunately, the subset of a reference section that I put up didn't have the full range of what could be in a reference section. Book references don't end with numbers, so they end up sucking up the next article reference.
srboisvert
@srboisvert: I can only say that it is impossible to match all kinds of references with a single regex. The format is just too ambiguous, and no clear delimiters are defined. The reference format is all in the semantics, not in the syntax, and all you can see with regex is syntax.
Tomalak
I know, Tomalak. I just put the note up for future searchers. I'll be kludging a fix post-regex for my project.
srboisvert
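For future searchers, the book-reference failure described in the comments can be reproduced with the two entries from the question, here joined onto one line (an assumption, to match the run-together layout of the main data). The `suspicious?` check is a hypothetical post-regex heuristic, not the fix srboisvert actually used: it flags any chunk that contains more than one `(YYYY` year marker.

```ruby
# Regex from the answer, unchanged.
pattern = /([^\.].*?[0-9])(?=\.|\Z)/

# Hypothetical heuristic: a chunk with more than one "(YYYY" year marker
# probably holds several references glued together.
def suspicious?(chunk)
  chunk.scan(/\(\d{4}/).length > 1
end

# Book reference followed by a journal article, from the question.
text = "Lippold, O. C. J. (1967). Electromyography. In P. H. Venables & " \
       "I. Martin (Eds.), A manual of psychophysiological methods " \
       "(pp. 245–298). Amsterdam: North-Holland. " \
       "Low, K. A., & Miller, J. (1999). The usefulness of partial " \
       "information: Effects of go probability in the choice/nogo task. " \
       "Psychophysiology, 36, 288–297."

chunks = text.scan(pattern).flatten

chunks.each do |chunk|
  label = suspicious?(chunk) ? "CHECK" : "ok"
  puts "#{label}: #{chunk}"
end
```

The book entry never ends in a digit followed by a period, so the lazy match runs on into the Low & Miller article and both come back as a single chunk. Counting year markers only detects that case. Splitting the chunk correctly would still need separate logic.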