views:

499

answers:

4

This problem is similar to blind SQL injections. The goal is to determine the exact value of a string, and the only test you can do is to see if a DOS-style wildcard (? = any character, * = any number of any characters) you specify is matched by the string. (So practically you only have access to a bool DoesWildcardMatch(string wildcard) function).

The straight-forward way is to test against a*, b*, c*... until you find the first letter, then repeat. Some optimizations I can think of:

  • search for *a*, *b* etc. to determine the character set
  • when a match on *x* is found, perform divide-et-impera (*a*x*, *b*x*, ...)
+1  A: 

If a specific number of ? works, you can also check "?", "??", "???" etc. to get the length of the string, but I doubt this will help much as you can also check if you've got the right length with just one additional check without any wildcards after each round.

I think the divide method with a character set check before is almost optimal, there are some additional details, for example if you matched *a*b*, you should check *ab* afterwards to know if there are letters in between and of course as stated above, check *ab and "ab" after this to know if you've finished on the right side or completely.

schnaader
A: 

Why not convert your DOS style wildcard string into a regular expression? e.g.:

?a*

becomes:

.a.*

Then just perform a simple regular expression match comparing that to your test string.

Matt Huggins
I'm sorry, I just don't understand how regular expressions would help solve this problem. From a code perspective, imagine that you can only call a function that takes a DOS wildcard and returns a boolean stating if the string we need to determine matches the given wildcard. We can't test the unknown string against a regular expression.
CyberShadow
+2  A: 

As for the divide-et-impera, be sure to keep track of value that you known are not present. Also I'd not go with a, b, c, but with frequency order. Some sort of markov chain from that might make it even faster.

One thing to watch out for is that you can't assume that a given literal will always match the same location in the input. This will be of particular interest regarding removing the wild cards at the end.

c a b a
--------
* a *     match
  * b*a*  woops!
BCS
Regarding the frequency - yeah that's obvious, I just forgot to mention that. Good point about the divide-et-impera.
CyberShadow
+2  A: 

A first thought. You can determin the length n of the string in O(log2(n)).

  • Check Z* where Z represents k question marks starting with 0, then 1, and then doubling the number of question marks with every check until no match occurs. n must be between k / 2 and k
  • Find the exact length using the same pattern changing k in the same way as binary search does.

Knowing the exact length might help to perform a kind of divide-et-impera in the spatial domain.

UPDATE

If you know the length, you can use the same pattern to correctly locate a symbol.

Example:

    ..X. ..XX (spaces added for readability)

                              + symbol may be X
                              - symbol is not X
                              X symbol is X

    *X*         => MATCH      ++++ ++++
    *X*   ????  => MATCH      ++++ ++++
    *X*?? ????  => NO MATCH   --++ ++++
    ??X?  ????  => MATCH      --X+ ++++
    ??XX  ????  => NO MATCH   --X- ++++
    ??X?  *X*?? => NO MATCH   --X- --++
    ??X?  ??X?  => MATCH      --X- --X+
    ??X?  ??XX  => MATCH      --X- --XX

For string length n and alphabet size m this will take about O(log2(n)) to find the length of the string, about O(n • log2(n)) to correctly place n symbols, and O(m) to find the used symbols - summing all together yields O(n • log2(n) + m).

I could imagine that it is possible to speed this up by merging several steps - maybe test for used symbols while determining the string length or simultaneously locating two (or even more?) symbols in the first and second half of the string. This will require to recheck the merged steps in isolation if the check fails in order to determine which check faild. But as long as the merged check succeeds, you gain information on both.

Maybe I will calculate that tomorrow in order to see if it will really speed the thing up.

Daniel Brückner
Let's see. If we have an alphabet of size m and strings of fixed size n, then a string contains n * log(m) bits of information. Each query can only get at most 1 bit of information. Hence you need at least O(n log(m)) queries. This is can be bigger than O(n log(n) + m). Therefore your answer must be wrong.
Accipitridae
No, I thought that, too. But we can obtain more than one bit per check. For example, if *A* fails, we know that none of the n symbols equals A and we obtained more than one bit of information with a single check.
Daniel Brückner
More precisely it depends on the length of the string how much information is obtained. A failing *A* check reduces the search space from m^n to (m - 1)^n, hence from n * log2(m) to n * (log2(m - 1)) bits. For n > 1 / log2(m / m - 1) we obtain more than one bit of information. In the case of m = 26 this requires n > 18.
Daniel Brückner
A nice possibility to speed up the match would be kind of a wordlist - you said it's a hostname, so for example searching for suffixes like ".com", ".org" etc. would be possible
schnaader