ansaurus

Question

Fastest way to bruteforce a string using a DOS wildcard

Answer 1

+1 A:

If a specific number of ? works, you can also check "?", "??", "???" etc. to get the length of the string, but I doubt this will help much as you can also check if you've got the right length with just one additional check without any wildcards after each round.

I think the divide method with a character set check before is almost optimal, there are some additional details, for example if you matched *a*b*, you should check *ab* afterwards to know if there are letters in between and of course as stated above, check *ab and "ab" after this to know if you've finished on the right side or completely.

schnaader 2009-05-14 19:20:03

Answer 2

A:

Why not convert your DOS style wildcard string into a regular expression? e.g.:

?a*

becomes:

.a.*

Then just perform a simple regular expression match comparing that to your test string.

Matt Huggins 2009-05-14 19:23:20

I'm sorry, I just don't understand how regular expressions would help solve this problem. From a code perspective, imagine that you can only call a function that takes a DOS wildcard and returns a boolean stating if the string we need to determine matches the given wildcard. We can't test the unknown string against a regular expression.

CyberShadow 2009-05-14 19:48:22

Answer 3

+2 A:

As for the divide-et-impera, be sure to keep track of value that you known are not present. Also I'd not go with a, b, c, but with frequency order. Some sort of markov chain from that might make it even faster.

One thing to watch out for is that you can't assume that a given literal will always match the same location in the input. This will be of particular interest regarding removing the wild cards at the end.

c a b a
--------
* a *     match
  * b*a*  woops!

BCS 2009-05-14 19:34:17

Regarding the frequency - yeah that's obvious, I just forgot to mention that. Good point about the divide-et-impera.

CyberShadow 2009-05-14 19:45:23

Answer 4

+2 A:

A first thought. You can determin the length n of the string in O(log2(n)).

Check Z* where Z represents k question marks starting with 0, then 1, and then doubling the number of question marks with every check until no match occurs. n must be between k / 2 and k
Find the exact length using the same pattern changing k in the same way as binary search does.

Knowing the exact length might help to perform a kind of divide-et-impera in the spatial domain.

UPDATE

If you know the length, you can use the same pattern to correctly locate a symbol.

Example:

    ..X. ..XX (spaces added for readability)

                              + symbol may be X
                              - symbol is not X
                              X symbol is X

    *X*         => MATCH      ++++ ++++
    *X*   ????  => MATCH      ++++ ++++
    *X*?? ????  => NO MATCH   --++ ++++
    ??X?  ????  => MATCH      --X+ ++++
    ??XX  ????  => NO MATCH   --X- ++++
    ??X?  *X*?? => NO MATCH   --X- --++
    ??X?  ??X?  => MATCH      --X- --X+
    ??X?  ??XX  => MATCH      --X- --XX

For string length n and alphabet size m this will take about O(log2(n)) to find the length of the string, about O(n • log2(n)) to correctly place n symbols, and O(m) to find the used symbols - summing all together yields O(n • log2(n) + m).

I could imagine that it is possible to speed this up by merging several steps - maybe test for used symbols while determining the string length or simultaneously locating two (or even more?) symbols in the first and second half of the string. This will require to recheck the merged steps in isolation if the check fails in order to determine which check faild. But as long as the merged check succeeds, you gain information on both.

Maybe I will calculate that tomorrow in order to see if it will really speed the thing up.

Daniel Brückner 2009-05-14 19:39:04

Let's see. If we have an alphabet of size m and strings of fixed size n, then a string contains n * log(m) bits of information. Each query can only get at most 1 bit of information. Hence you need at least O(n log(m)) queries. This is can be bigger than O(n log(n) + m). Therefore your answer must be wrong.

Accipitridae 2009-05-14 21:08:02

No, I thought that, too. But we can obtain more than one bit per check. For example, if *A* fails, we know that none of the n symbols equals A and we obtained more than one bit of information with a single check.

Daniel Brückner 2009-05-14 22:08:26

More precisely it depends on the length of the string how much information is obtained. A failing *A* check reduces the search space from m^n to (m - 1)^n, hence from n * log2(m) to n * (log2(m - 1)) bits. For n > 1 / log2(m / m - 1) we obtain more than one bit of information. In the case of m = 26 this requires n > 18.

Daniel Brückner 2009-05-15 10:36:08

A nice possibility to speed up the match would be kind of a wordlist - you said it's a hostname, so for example searching for suffixes like ".com", ".org" etc. would be possible

schnaader 2009-05-21 09:51:37

ansaurus

tags:

views:

answers:

Fastest way to bruteforce a string using a DOS wildcard

related questions