The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, and similar sources is quite useful for gathering citation information.

Is there a reliable way to identify a DOI in a block of text without assuming the 'doi:' prefix? (Any language is acceptable, regexes are preferred, and avoiding false positives is a must.)

+1  A: 

The following regex should do the job (Perl regex syntax):

/(10\.\d+\/\d+)/

You could do some additional sanity checking by opening the URLs

http://hdl.handle.net/<doi>

and

http://dx.doi.org/<doi>

where <doi> is the candidate DOI,

and testing that a) you get a 200 OK HTTP status, and b) the returned page is not the "DOI not found" page for the service.
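
For anyone who wants to try the check above, here is a rough Python sketch. The function names (candidate_urls, http_status, looks_resolvable) and the User-Agent string are my own placeholders, not part of any DOI tooling. It only tests the status code from step a); step b) would need an extra look at the response body, which depends on what the service's "not found" page actually contains.

```python
import urllib.error
import urllib.parse
import urllib.request

# The two resolver prefixes mentioned in the answer.
RESOLVERS = ("http://hdl.handle.net/", "http://dx.doi.org/")


def candidate_urls(doi):
    """Build one lookup URL per resolver, percent-encoding unsafe characters."""
    return [base + urllib.parse.quote(doi, safe="/") for base in RESOLVERS]


def http_status(url, timeout=10):
    """Return the final HTTP status after redirects, or None on a network error."""
    request = urllib.request.Request(url, headers={"User-Agent": "doi-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return error.code
    except OSError:
        # DNS failures, refused connections, timeouts, and similar problems.
        return None


def looks_resolvable(doi, fetch=http_status):
    """True if at least one resolver answers 200 for the candidate DOI.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    return any(fetch(url) == 200 for url in candidate_urls(doi))
```

Something like looks_resolvable("10.1000/182") would then hit both resolvers in turn. Some publisher sites may refuse scripted requests, so a non-200 answer is not proof that the DOI is bogus.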

Silas
+1  A: 

@Silas The sanity checking is a good idea. However, the regex doesn't cover all DOIs. The first element must (currently) be 10, and the second element must (currently) be numeric, but the third element is barely restricted at all:

"Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F..."

and that's where the real problem lies. In practice, I've never seen whitespace used, but the spec specifically allows for it. Basically, there doesn't seem to be a sensible way of detecting the end of a DOI.
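
To show how little the quoted rule rules out, here is a small Python sketch (the function name is hypothetical). It rejects only the two control ranges named in the quote and makes no attempt to test "graphic character" any further, so treat it as an illustration and not as a validator.

```python
def has_only_legal_characters(suffix):
    """Reject empty suffixes and the control ranges 0x00-0x1F and 0x80-0x9F."""
    return bool(suffix) and not any(
        0x00 <= ord(ch) <= 0x1F or 0x80 <= ord(ch) <= 0x9F for ch in suffix
    )
```

Note that a plain space (0x20) passes this check, which is exactly why the end of a DOI is so hard to pin down.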

Kai
+2  A: 

I'm sure it's not super-helpful for the OP at this point, but I figured I'd post what I am trying in case anyone else like me stumbles upon this:

(10\.(\d)+/(\S)+)

This matches: "10 dot number slash anything-not-whitespace"

But for my use (scraping HTML), this was finding false positives, so I had to match the above, plus get rid of quotes and greater-than/less-than:

(10\.(\d)+/([^\s>"<])+)

I'm still testing these out, but I'm feeling hopeful thus far.
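
In case it helps, here is roughly how that idea could look in Python. This is a sketch under my own assumptions: the pattern, the extract_dois name, and the trailing-punctuation cleanup are heuristics I made up, not anything from the DOI spec. It will still produce false positives on things like "10.5/2", so pair it with a resolver check if that matters.

```python
import re

# "10." + numeric registrant code + "/" + a run of characters that are not
# whitespace, double quotes, or angle brackets. The leading \b keeps the
# match from starting in the middle of a longer number such as "110.5/3".
DOI_PATTERN = re.compile(r'\b10\.\d+/[^\s"<>]+')

# Punctuation that usually ends a sentence and not a DOI (heuristic assumption).
TRAILING = ".,;:"


def extract_dois(text):
    """Return DOI-like strings found in text, with trailing punctuation trimmed."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        candidate = match.group(0).rstrip(TRAILING)
        # Drop a closing parenthesis only when it has no opening partner,
        # so suffixes like "S0140-6736(05)67700-1" stay intact.
        while candidate.endswith(")") and candidate.count(")") > candidate.count("("):
            candidate = candidate[:-1].rstrip(TRAILING)
        found.append(candidate)
    return found
```

For example, running it over '<a href="http://dx.doi.org/10.1000/182">' picks up 10.1000/182 and stops at the closing quote.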

agentj0n