I have a huge number of files to sort all named in some terrible convention.
Here are some examples:
(4)_mr__mcloughlin____.txt
12__sir_john_farr____.txt
(b)mr__chope____.txt
dame_elaine_kellett-bowman____.txt
dr__blackburn______.txt
These names are supposed to be a different person (speaker) each. Someone in another IT department produced these from a ton of XML files using some script but the naming is unfathomably stupid as you can see.
I need to sort literally tens of thousands of these files with multiple files of text for each person; each with something stupid making the filename different, be it more underscores or some random number. They need to be sorted by speaker.
This would be easier with a script to do most of the work then I could just go back and merge folders that should be under the same name or whatever.
There are a number of ways I was thinking about doing this.
- parse the names from each file and sort them into folders for each unique name.
- get a list of all the unique names from the filenames, then look through this simplified list of unique names for similar ones and ask me whether they are the same, and once it has determined this it will sort them all accordingly.
I plan on using Perl, but I can try a new language if it's worth it. I'm not sure how to go about reading in each filename in a directory one at a time into a string for parsing into an actual name. I'm not completely sure how to parse with regex in perl either, but that might be googleable.
For the sorting, I was just gonna use the shell command:
`cp filename.txt /example/destination/filename.txt`
but just cause that's all I know so it's easiest.
I dont even have a pseudocode idea of what im going to do either so if someone knows the best sequence of actions, im all ears. I guess I am looking for a lot of help, I am open to any suggestions. Many many many thanks to anyone who can help.
B.