tags:

views:

33

answers:

1

Currently I have a pysqlite db that I am using to store a list of road conditions. The source this list is generated from however is buggy and sometimes generates duplicates. Some of these duplicates will have the start and end points swapped but everything else the same.

The method i currently have looks like this:

def getDupes(self):
    '''This method is used to return a list of dupilicate entries 
    '''
    self.__curs.execute('SELECT * FROM roadCond GROUP BY road, start, end, cond, reason, updated, county, timestmp HAVING count(*)>1')
    result = self.__curs.fetchall()

    def getSwaps():
        '''This method is used to grab the duplicates with swapped columns
        '''
        self.__curs.execute('SELECT * FROM roadCond WHERE ')
        extra = self.__curs.fetchall()
        return extrac

    result.extend(getSwaps())

    return result

The the initial query works but I am suspicious of it (I think there is a better way, I just don't know) but I am not all to sure how to make the inner method work.

Thank you ahead of time. :-D

A: 

Instead of the first query, you could use

SELECT DISTINCT * FROM roadCond

which will retrieve all the records from the table, removing any duplicates.

As for the inner method, this query will return all the records which have "duplicates" with start and end swapped. Note that, for each record with "duplicates", this query will return both the "original" and the "copy".

SELECT DISTINCT * FROM roadCond WHERE EXISTS (
    SELECT * FROM roadCond rc2 WHERE
        roadCond.road = rc2.road AND
        roadCond.end = rc2.start AND roadCond.start = rc2.end AND
        roadCond.cond = rc2.cond AND
        ... AND
        roadCond.timestamp = rc2.timestamp)

Edit: To detect and remove "duplicates" with start and end swapped, you could make sure that your data always contains these values laid out in the same order:

UPDATE roadCond SET start = end, end = start WHERE end < start;

But this approach only works if it doesn't matter which is which.

Danilo Piazzalunga
So what I want is a list of the duplicates items to be returned rather than the list, with the duplicate items removed. Also ultimately, I would like the duplicate items to be removed from the table so that way they don't show up in future queries. Also, the data is semi-auto-generated. Some times however, it gets messed up and duplicate entries are created where I am scraping values from. So I don't have any way of knowing weather or not the duplicates even have swapped values before I put them in the DB because I'm trying to use a query to determine this and retrieve them Thank you.
Stephen Tanner