ansaurus

Question

Answer 1

+3 A:

SimpleCoder 2010-09-10 20:13:02

-1 for claiming that searching the whole of memory, repeatedly, for each string will be "fast".

ChrisW 2010-09-10 20:28:38

@ChrisW: -1 to you for poor reading comprehension. I said, in my Edit, to load each node/line one by one and scan for each string in the line.

SimpleCoder 2010-09-10 20:30:38

Answer 2

+2 A:

A suggestion: load as text, use a regular expression to extract the desired strings (I suppose they are enclosed with a specific tag) and build a hash list with them. You can use the list to check the existence.

Eduardo Mauro 2010-09-10 20:14:14

-1 for using regular expression instead of XML API to extract strings.

ChrisW 2010-09-10 20:27:43

@ChrisW: The question does not impose the use of XML API. Also, the question was edited after my answer. In the original question I was said that was not necessary to read as XML. So I don't agree with yours -1 for my answer.

Eduardo Mauro 2010-09-10 20:37:02

It's his having an XML file as input that suggests using one of the built-in XML APIs.

ChrisW 2010-09-10 20:43:49

@ChrisW: As I said, in the original question he said it was not necessary to read as a XML file.

Eduardo Mauro 2010-09-10 20:48:02

Answer 3

+1 A:

Alphabetically sort your list of strings that you're matching on, then build an index array which tells you where the start of your list is for each character that is a starting character for one of the strings, maybe indexing to the second character depending on the breadth of variety and if your match is case sensitive or not.

Read the file character by character with a stream to minimize memory footprint, checking into the index array to see where that character starts and ends in the list of strings so you can pull out that characters page, if there's anything starting with those character combinations. Then continue filtering inside of the page until you have one match left and the next character makes matches 0.

Remove that string from the list of strings to match, put it in another list if you want. Then start checking your index on the next character and continue doing so each time you run into no matches.

The index gives you a more efficient aggregate to minimize number of items iterated against.

This could give you a two character depth index:

Dictionary<string,int> stringIndex = new Dictionary<char,int>();
for(int i = 0; i < sortedSearchStrings.Length; i++;)
{
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0])) stringIndex[sortedSearchStrings[i][0]] = i;
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0] + sortedSearchStrings[i][1])) stringIndex[sortedSearchStrings[i][0] + sortedSearchStrings[i][1]] = i;
}

Then to find the starting index in your list you just access:

int startOfCurrentCharPage = stringIndex[string.Format("{0}{1}", lastChar, currentChar)];

Jimmy Hoffa 2010-09-10 20:15:00

-1 for taking the time to write/use a non-standard container without proof that that's needed.

ChrisW 2010-09-10 20:27:05

@ChrisW: Seriously? -1 for giving a creative answer? Sorry I didn't just say to load it all into memory.. Nice job giving -1 to every answer that isn't yours..

Jimmy Hoffa 2010-09-10 20:36:26

It's creative but I disagree that it's good. Good includes simplest, least maintenance, least effort/expense, least debugging, good-enough performance, understandable to the nest guy, etc.

ChrisW 2010-09-10 20:38:28

@ChrisW: I don't think you downvoted the other answers because they were wrong. They may not represent the best solutions in your mind, but it's the asker's opinion that counts. I think you downvoted them because they are not your answer.

SimpleCoder 2010-09-10 20:43:02

FYI the accepted answer to [The answer to tactical downvoting problem?](http://meta.stackoverflow.com/questions/22771/the-answer-to-tactical-downvoting-problem/22776#22776), from a moderator, says that you can flag this "instance of possible tactical downvoting" for moderator attention (if you want to).

ChrisW 2010-09-10 21:02:27

Answer 4

A:

Read every string from the XML and write them into a HashSet<string>. When you want to lookup a string, look it up in the HashSet. The cost will be O(n) to read the XML, and O(n) to do the n lookups fro the HashSet. Don't try to search repeatedly in the XML (instead do your 20,000 searches in the HashSet), because the XML isn't indexed/optimized for searching.

ChrisW 2010-09-10 20:24:38

Answer 5

+1 A:

Would it be possible to work directly out of the xml document and skip the first step?

If so, you could just make use of Xml.XmlDocument, and from there, Xml.XmlNode.SelectNodes( string ), using xpath to navigate the document. I don't know what kind of information is present in the document, but the way you have the second stage worded gives the idea that at times both the path on C:\ and the path on E:\ are present? If so, it would be as simple as two IO.File.Exists checks and then a IO.File.Delete( ).

What I mean to say is that rather than searching your XML document N times for a string, do your search through the document and delete duplicates as you go so that you only run through the document once.

I don't use iTunes or have one of its library backups on hand to say if it could work or not, though.

Kogitsune 2010-09-10 20:57:15

Answer 6

+1 A:

Here is a simple solution using Linq. Runs sufficiently fast enough for one-time use:

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class ITunesChecker
{
    static void Main(string[] args)
    {
        // retrieve file names
        string baseFolder = @"E:\My Music\";
        string[] filesM4a = Directory.GetFiles(baseFolder, "*.m4a", SearchOption.AllDirectories);
        string[] filesMp3 = Directory.GetFiles(baseFolder, "*.mp3", SearchOption.AllDirectories);
        string[] files = new string[filesM4a.Length + filesMp3.Length];
        Array.Copy(filesM4a, 0, files, 0, filesM4a.Length);
        Array.Copy(filesMp3, 0, files, filesM4a.Length, filesMp3.Length);

        // convert to the format used by iTunes
        for (int i = 0; i < files.Length; i++)
        {
            Uri uri = null;
            if (Uri.TryCreate(files[i], UriKind.Absolute, out uri))
            {
                files[i] = uri.AbsoluteUri.Replace("file:///", "file://localhost/");
            }
        }

        // read the files from iTunes library.xml
        XDocument library = XDocument.Load(@"E:\My Music\iTunes\iTunes Music Library.xml");
        var q = from node in library.Document.Descendants("string")
                where node.ElementsBeforeSelf("key").Where(n => n.Parent == node.Parent).Last().Value == "Location"
                select node.Value;

        // do the set operations you are interested in
        var missingInLibrary = files.Except(q, StringComparer.InvariantCultureIgnoreCase);
        var missingInFileSystem = q.Except(files, StringComparer.InvariantCultureIgnoreCase);
        var presentInBoth = files.Intersect(q, StringComparer.InvariantCultureIgnoreCase);
    }
}

0xA3 2010-09-10 21:43:31

ansaurus

tags:

views:

answers:

Does string exist check 20k times

related questions