data-cleansing

Normalising book titles - Python

I have a list of book titles: "The Hobbit: 70th Anniversary Edition", "The Hobbit", "The Hobbit (Illustrated/Collector Edition)[There and Back Again]", "The Hobbit: or, There and Back Again", "The Hobbit: Gift Pack" and so on... I thought that if I normalised the titles somehow, it would be easier to implement an automated way to kno...
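As a rough illustration of what "normalising" could mean here, the sketch below reduces each title to a comparison key. The function name `normalise_title` and the rules it applies (drop bracketed qualifiers, keep the text before the first colon, lowercase) are assumptions made for this example, not something the question specifies, and they will merge titles too eagerly for some catalogues.

```python
import re

def normalise_title(title: str) -> str:
    """Reduce a book title to a rough comparison key (heuristic, hypothetical rules)."""
    # Drop parenthesised or bracketed qualifiers such as "(Illustrated/Collector Edition)".
    key = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", title)
    # Keep only the text before the first colon, which often introduces a subtitle or edition.
    key = key.split(":", 1)[0]
    # Lowercase and collapse runs of whitespace.
    return " ".join(key.lower().split())

titles = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]

# All five sample titles collapse to the same key under these rules.
keys = {normalise_title(t) for t in titles}
```

Titles that share a key can then be grouped for manual review rather than merged automatically.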

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements: PersonID (meaningless surrogate autonumber), FirstName, MiddleInitial, LastName, NameSuffix, DateOfBirth, AlternateID (like an SSN, Military ID, etc). I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of infor...
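One possible starting point for this kind of matching, sketched with Python's standard `difflib`: require an exact DateOfBirth match and compare names by similarity ratio. The helper names, the dictionary record layout, and the 0.85 threshold are all assumptions for illustration; a real system would also weigh MiddleInitial, NameSuffix and AlternateID.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def likely_same_person(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Hypothetical rule: identical date of birth and sufficiently similar full name."""
    if rec_a["DateOfBirth"] != rec_b["DateOfBirth"]:
        return False
    full_a = f'{rec_a["FirstName"]} {rec_a["LastName"]}'
    full_b = f'{rec_b["FirstName"]} {rec_b["LastName"]}'
    return name_similarity(full_a, full_b) >= threshold

# Made-up sample records, used only to exercise the functions.
a = {"FirstName": "Jon", "LastName": "Smith", "DateOfBirth": "1980-01-15"}
b = {"FirstName": "John", "LastName": "Smith", "DateOfBirth": "1980-01-15"}
c = {"FirstName": "John", "LastName": "Smith", "DateOfBirth": "1975-06-02"}

same_ab = likely_same_person(a, b)
same_ac = likely_same_person(a, c)
```

The threshold trades false merges against missed matches, so it is worth tuning against a hand-labelled sample of the incoming feeds.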

Problem looking at data between 0 and -1...

I'm trying to write a program that cleans data, using MATLAB. This program takes in the max and min that the data can be, and throws out data that is less than the min or greater than the max. There seems to be a small issue with the cleaning part. This case ONLY happens when the minimum range of the variable being checked is 0. If this i...
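The excerpt is cut off before the cause is stated, so the sketch below only shows the intended behaviour: keep values inside an inclusive [min, max] range, including the case where the minimum is 0. It is written in Python rather than MATLAB, and the function name `clean` and its parameter names are assumptions for this example.

```python
def clean(values, lo, hi):
    """Return the values v with lo <= v <= hi, preserving their order."""
    # Both bounds are compared explicitly, so lo == 0 is handled like any other minimum.
    return [v for v in values if lo <= v <= hi]

# Made-up sample data straddling a minimum of 0.
samples = [-1, -0.5, 0, 0.5, 1, 2]
kept = clean(samples, 0, 1)
```

Testing a boundary value such as 0 itself, plus one value just below it, is a quick way to confirm whether the bounds are meant to be inclusive.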

How to handle MySQL shutdown in Matlab?

Greetings all, I'm writing a program in MATLAB that parses and cleans a lot of data from one database to another, querying from MySQL. This would run continuously: new data come into the first db every minute, are cleaned, and are put into the clean db before the next data point comes in. I was wondering how, during this process, I could a...

Blocking '0000-00-00' from MySQL Date Fields

I have a database where old code likes to insert '0000-00-00' in Date and DateTime columns instead of a real date. So I have the following two questions: Is there anything that I could do on the db level to block this? I know that I can set a column to be not-null, but that does not seem to be blocking these zero values. What is the b...