I have a list of book titles:

  • "The Hobbit: 70th Anniversary Edition"
  • "The Hobbit"
  • "The Hobbit (Illustrated/Collector Edition)[There and Back Again]"
  • "The Hobbit: or, There and Back Again"
  • "The Hobbit: Gift Pack"

and so on...


I thought that if I normalised the titles somehow, it would be easier to automate working out which book each edition refers to.

import string

allowed = string.ascii_letters + string.digits
normalised = ''.join(char for char in title if char in allowed)

or

def normalise(title):
    # Keep everything up to the first separator character.
    normalised = ''
    for char in title:
        if char in ':/()|':
            break
        normalised += char
    return normalised

But neither works as intended: titles can contain special characters, and editions can have very different title layouts.


Help would be very much appreciated! Thanks :)

+1  A: 

It depends completely on your data. For the examples you gave, a simple normalization solution could be:

import re

book_normalized = re.sub(r':.*|\[.*?\]|\(.*?\)|\{.*?\}', '', book_name).strip()

This returns "The Hobbit" for all the examples. It removes everything from the first colon onward, any text in round, square, or curly brackets, and any leading or trailing whitespace.
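As a quick check, here is a small sketch that wraps the one-liner above in a function and runs it over the titles from the question. The name normalize_title is only an illustrative choice, not something from the original post.

```python
import re

# Same pattern as the one-liner above: drop everything from the first colon
# onward, plus any (...), [...] or {...} group.
_EDITION_NOISE = re.compile(r':.*|\[.*?\]|\(.*?\)|\{.*?\}')


def normalize_title(book_name):
    """Hypothetical helper: strip edition noise and surrounding whitespace."""
    return _EDITION_NOISE.sub('', book_name).strip()


titles = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]

# Collapse the list into the set of distinct normalized titles.
normalized = {normalize_title(t) for t in titles}
print(normalized)  # {'The Hobbit'}
```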

However, this is not a very good solution in the general case, as some books have colons or bracketed parts in the actual book name: for example, the name of the series, followed by a colon, followed by the name of the particular entry in the series.
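One possible way to soften that problem, sketched here under assumptions, is to strip a colon suffix or bracketed group only when it mentions an edition-like word. The keyword list, the sample series title, and the name strip_edition_markers are all made up for illustration and would need tuning against real data.

```python
import re

# Assumed, illustrative keyword list; extend it to match your own data.
EDITION_WORDS = r'(?:edition|anniversary|illustrated|collector|gift pack|boxed set)'

# A bracketed group is removed only if it contains one of the edition words.
_BRACKETED = re.compile(
    r'[\(\[\{][^\)\]\}]*' + EDITION_WORDS + r'[^\)\]\}]*[\)\]\}]',
    re.IGNORECASE,
)
# A colon suffix is removed only if it contains one of the edition words.
_COLON_SUFFIX = re.compile(r':[^:]*' + EDITION_WORDS + r'.*$', re.IGNORECASE)


def strip_edition_markers(title):
    """Hypothetical helper: remove only parts that look like edition info."""
    title = _BRACKETED.sub('', title)
    title = _COLON_SUFFIX.sub('', title)
    return title.strip()


print(strip_edition_markers("The Hobbit: 70th Anniversary Edition"))  # The Hobbit
print(strip_edition_markers("The Hobbit: Gift Pack"))                 # The Hobbit
print(strip_edition_markers("Star Wars: Heir to the Empire"))         # Star Wars: Heir to the Empire
```

The trade-off is that genuine subtitles survive too: "The Hobbit: or, There and Back Again" is left unchanged, so it would still need to be matched to "The Hobbit" by some other means.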

Max Shawabkeh
@Max thanks for your answer! You are right about books having **series numbers** and so on; that is also part of the confusion I'm facing.
RadiantHex
+1  A: 

I would suggest using a third-party web service such as LibraryThing, which I believe can do what you're asking. As a starting point, see their documentation:

http://www.librarything.com/services/rest/documentation/1.0/librarything.ck.getwork.php

Tom
@Tom thanks for this. That was quite useful!
RadiantHex
@Tom: LibraryThing isn't failure-free, unfortunately :)
RadiantHex