There's a 1-gigabyte string of arbitrary data, which you can assume to be equivalent to something like:

    import os
    one_gb_string = os.urandom(1024**3)  # 1 GiB of arbitrary bytes
We will be searching this string, one_gb_string, for an infinite number of fixed-width, 1-kilobyte patterns, one_kb_pattern. Every search uses a different pattern, so there are no obvious caching opportunities; the same 1-gigabyte string, however, is searched over and over. Here is a simple generator to describe what's happening:
    def findit(one_gb_string):
        while True:
            one_kb_pattern = get_next_pattern()   # a different 1 KB pattern every time
            yield one_gb_string.find(one_kb_pattern)
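
(For illustration only, here is how that generator gets consumed; get_next_pattern is stubbed out with random 1 KB blocks and the string is scaled down, so nearly every search returns -1.)

    import os

    def get_next_pattern():
        # Hypothetical stand-in: the real 1 KB patterns come from elsewhere.
        return os.urandom(1024)

    small_string = os.urandom(16 * 1024**2)        # scaled down from 1 GB for the demo

    for _, offset in zip(range(5), findit(small_string)):
        print(offset)                              # -1 unless a pattern happens to occur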
Note that only the first occurrence of each pattern needs to be found; after that, no other major processing is required.
What can I use that's faster than Python's built-in find for matching 1 KB patterns against data strings of 1 GB or more?
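
For concreteness, the baseline I am trying to beat is just bytes.find, roughly as in this scaled-down timing sketch (the needle is sliced from the end of the haystack so the scan covers nearly the whole buffer):

    import os
    import time

    haystack = os.urandom(16 * 1024**2)    # scaled-down stand-in for the 1 GB string
    needle = haystack[-1024:]              # a 1 KB pattern that matches near the end

    t0 = time.perf_counter()
    offset = haystack.find(needle)         # the built-in scan in question
    t1 = time.perf_counter()
    print(offset, f"{t1 - t0:.4f} s")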
(I am already aware of how to split up the string and search the pieces in parallel, roughly as in the sketch below, so you can disregard that basic optimization.)
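
To be explicit about what I'm setting aside: a chunked parallel search along these lines is what I mean. The worker count, the scaled-down sizes, and the initializer-based data hand-off are illustrative choices, not part of the requirement; the chunks overlap by len(pattern) - 1 bytes so a match straddling a chunk boundary is still found.

    import os
    from concurrent.futures import ProcessPoolExecutor

    def _init_worker(data):
        # Hand each worker the haystack once, at startup, instead of re-pickling
        # it per task.  For the real 1 GB case you would point workers at a
        # memory-mapped file or shared memory block rather than copying the buffer.
        global DATA
        DATA = data

    def _search_chunk(bounds):
        # Search one chunk of DATA; return an absolute offset, or -1 if absent.
        start, end, pattern = bounds
        pos = DATA[start:end].find(pattern)
        return start + pos if pos >= 0 else -1

    def parallel_find(data, pattern, workers=8):
        # Chunks overlap by len(pattern) - 1 bytes; the earliest hit across all
        # chunks is the first occurrence in the whole string.
        chunk = len(data) // workers + 1
        jobs = [(i * chunk, (i + 1) * chunk + len(pattern) - 1, pattern)
                for i in range(workers)]
        with ProcessPoolExecutor(max_workers=workers,
                                 initializer=_init_worker,
                                 initargs=(data,)) as pool:
            hits = [h for h in pool.map(_search_chunk, jobs) if h >= 0]
        return min(hits) if hits else -1

    if __name__ == "__main__":
        big = os.urandom(64 * 1024**2)             # scaled-down stand-in for 1 GB
        needle = big[10_000_000:10_001_024]        # a 1 KB pattern known to exist
        print(parallel_find(big, needle))          # almost certainly prints 10000000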
Update: Please bound memory requirements to 16GB.