tags:

views:

256

answers:

6

I have an ITunes library XML file backup file - about 15 MB.

I have 20K music files on my C drive and about 25K files on E drive under exactly similar folder structures.

I am traversing the first location and going file by file and checking if the file exiss in the second location. That part works for me.

Now, for all such duplicate files, if the file path from E drive exists in the XML, but the C drive path does not exist in the XML, then I want to delete the file from the C drive.

What is my best way of checking if a string exists in the XML file (I have to do this atleast 20K times)?

+3  A: 
SimpleCoder
-1 for claiming that searching the whole of memory, repeatedly, for each string will be "fast".
ChrisW
@ChrisW: -1 to you for poor reading comprehension. I said, in my Edit, to load each node/line one by one and scan for each string in the line.
SimpleCoder
+2  A: 

A suggestion: load as text, use a regular expression to extract the desired strings (I suppose they are enclosed with a specific tag) and build a hash list with them. You can use the list to check the existence.

Eduardo Mauro
-1 for using regular expression instead of XML API to extract strings.
ChrisW
@ChrisW: The question does not impose the use of XML API. Also, the question was edited after my answer. In the original question I was said that was not necessary to read as XML. So I don't agree with yours -1 for my answer.
Eduardo Mauro
It's his having an XML file as input that suggests using one of the built-in XML APIs.
ChrisW
@ChrisW: As I said, in the original question he said it was not necessary to read as a XML file.
Eduardo Mauro
+1  A: 

Alphabetically sort your list of strings that you're matching on, then build an index array which tells you where the start of your list is for each character that is a starting character for one of the strings, maybe indexing to the second character depending on the breadth of variety and if your match is case sensitive or not.

Read the file character by character with a stream to minimize memory footprint, checking into the index array to see where that character starts and ends in the list of strings so you can pull out that characters page, if there's anything starting with those character combinations. Then continue filtering inside of the page until you have one match left and the next character makes matches 0.

Remove that string from the list of strings to match, put it in another list if you want. Then start checking your index on the next character and continue doing so each time you run into no matches.

The index gives you a more efficient aggregate to minimize number of items iterated against.

This could give you a two character depth index:

Dictionary<string,int> stringIndex = new Dictionary<char,int>();
for(int i = 0; i < sortedSearchStrings.Length; i++;)
{
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0])) stringIndex[sortedSearchStrings[i][0]] = i;
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0] + sortedSearchStrings[i][1])) stringIndex[sortedSearchStrings[i][0] + sortedSearchStrings[i][1]] = i;
}

Then to find the starting index in your list you just access:

int startOfCurrentCharPage = stringIndex[string.Format("{0}{1}", lastChar, currentChar)];
Jimmy Hoffa
-1 for taking the time to write/use a non-standard container without proof that that's needed.
ChrisW
@ChrisW: Seriously? -1 for giving a creative answer? Sorry I didn't just say to load it all into memory.. Nice job giving -1 to every answer that isn't yours..
Jimmy Hoffa
It's creative but I disagree that it's good. Good includes simplest, least maintenance, least effort/expense, least debugging, good-enough performance, understandable to the nest guy, etc.
ChrisW
@ChrisW: I don't think you downvoted the other answers because they were wrong. They may not represent the best solutions in your mind, but it's the asker's opinion that counts. I think you downvoted them because they are not your answer.
SimpleCoder
FYI the accepted answer to [The answer to tactical downvoting problem?](http://meta.stackoverflow.com/questions/22771/the-answer-to-tactical-downvoting-problem/22776#22776), from a moderator, says that you can flag this "instance of possible tactical downvoting" for moderator attention (if you want to).
ChrisW
A: 

Read every string from the XML and write them into a HashSet<string>. When you want to lookup a string, look it up in the HashSet. The cost will be O(n) to read the XML, and O(n) to do the n lookups fro the HashSet. Don't try to search repeatedly in the XML (instead do your 20,000 searches in the HashSet), because the XML isn't indexed/optimized for searching.

ChrisW
+1  A: 

Would it be possible to work directly out of the xml document and skip the first step?

If so, you could just make use of Xml.XmlDocument, and from there, Xml.XmlNode.SelectNodes( string ), using xpath to navigate the document. I don't know what kind of information is present in the document, but the way you have the second stage worded gives the idea that at times both the path on C:\ and the path on E:\ are present? If so, it would be as simple as two IO.File.Exists checks and then a IO.File.Delete( ).

What I mean to say is that rather than searching your XML document N times for a string, do your search through the document and delete duplicates as you go so that you only run through the document once.

I don't use iTunes or have one of its library backups on hand to say if it could work or not, though.

Kogitsune
+1  A: 

Here is a simple solution using Linq. Runs sufficiently fast enough for one-time use:

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class ITunesChecker
{
    static void Main(string[] args)
    {
        // retrieve file names
        string baseFolder = @"E:\My Music\";
        string[] filesM4a = Directory.GetFiles(baseFolder, "*.m4a", SearchOption.AllDirectories);
        string[] filesMp3 = Directory.GetFiles(baseFolder, "*.mp3", SearchOption.AllDirectories);
        string[] files = new string[filesM4a.Length + filesMp3.Length];
        Array.Copy(filesM4a, 0, files, 0, filesM4a.Length);
        Array.Copy(filesMp3, 0, files, filesM4a.Length, filesMp3.Length);

        // convert to the format used by iTunes
        for (int i = 0; i < files.Length; i++)
        {
            Uri uri = null;
            if (Uri.TryCreate(files[i], UriKind.Absolute, out uri))
            {
                files[i] = uri.AbsoluteUri.Replace("file:///", "file://localhost/");
            }
        }

        // read the files from iTunes library.xml
        XDocument library = XDocument.Load(@"E:\My Music\iTunes\iTunes Music Library.xml");
        var q = from node in library.Document.Descendants("string")
                where node.ElementsBeforeSelf("key").Where(n => n.Parent == node.Parent).Last().Value == "Location"
                select node.Value;

        // do the set operations you are interested in
        var missingInLibrary = files.Except(q, StringComparer.InvariantCultureIgnoreCase);
        var missingInFileSystem = q.Except(files, StringComparer.InvariantCultureIgnoreCase);
        var presentInBoth = files.Intersect(q, StringComparer.InvariantCultureIgnoreCase);
    }
}
0xA3