views: 119
answers: 5
Hi all,

I have a file which I use to centralize all the strings used in my application. Let's call it Strings.txt:

TITLE="Title"
T_AND_C="Accept my terms and conditions please"
START_BUTTON="Start"
BACK_BUTTON="Back"
...

This helps me with I18n. The issue is that my application is now a lot larger and has evolved, so many of these strings are probably no longer used. I want to eliminate the dead ones and tidy up the file.

I want to write a Python script. Using regular expressions I can get all of the string aliases, but how can I search all files in a Java package hierarchy for an occurrence of a string? If there is a reason I should use Perl or bash instead, let me know, but I'd prefer to stick to one scripting language.

Please ask for clarification if this doesn't make sense, hopefully this is straightforward, I just haven't used python much.

Thanks in advance,

Gav

A: 

You might consider using ack.

% ack --java 'search_string'

This will search under the current directory.

Jason Baker
+4  A: 

Assuming the files are of reasonable size (as source files will be), so you can easily read them into memory, and that you're looking for the parts in quotes to the right of the = signs:

import collections
import os

files_by_str = collections.defaultdict(list)

thestrings = []
with open('Strings.txt') as f:
  for line in f:
    if '=' not in line:
      continue  # skip blank or malformed lines
    text = line.split('=', 1)[1]
    text = text.strip().replace('"', '')
    thestrings.append(text)

for root, dirs, files in os.walk('/top/dir/of/interest'):
  for name in files:
    path = os.path.join(root, name)
    with open(path) as f:
      data = f.read()
    for text in thestrings:
      if text in data:
        files_by_str[text].append(path)

This gives you a dict whose keys are the texts that are present in at least one file, and whose values are lists of the paths of the files containing them. If you care only about a yes/no answer to the question "is this text present somewhere?", and don't care where, you can save some memory by keeping only a set instead of the defaultdict; but I think that knowing which files contained each text will often be useful, so I suggest this more complete version.
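The set-based variant mentioned above could look like this (a sketch; the directory path and sample strings are placeholders, not from a real run):

```python
import os

# The texts already parsed from Strings.txt (sample values from the question)
thestrings = ['Title', 'Accept my terms and conditions please', 'Start', 'Back']

found = set()  # texts seen in at least one file
for root, dirs, files in os.walk('/top/dir/of/interest'):  # placeholder path
  for name in files:
    with open(os.path.join(root, name)) as f:
      data = f.read()
    for text in thestrings:
      if text in data:
        found.add(text)

unused = [text for text in thestrings if text not in found]
```

Anything left in unused is a candidate for removal from Strings.txt.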

Alex Martelli
Fantastic answer, greatly appreciated.
gav
@gav, you're welcome!
Alex Martelli
A: 

To parse your Strings.txt you don't need regular expressions:

all_strings = [line.partition('=')[0].strip() for line in open('Strings.txt') if '=' in line]

To parse your source you could use the dumbest regex:

re.search(r'\bTITLE\b', source)        # for each string in all_strings

(note the raw string: without the r prefix, '\b' is a backspace character, not a word boundary)

To walk the source directory you could use os.walk.

A successful re.search means you can remove that string from all_strings: you'll be left with the strings that need to be removed from Strings.txt.
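Putting the pieces together, a sketch of the whole script (the 'src' root and the alias list are placeholders):

```python
import os
import re

# Sample alias names as parsed from Strings.txt
all_strings = ['TITLE', 'T_AND_C', 'START_BUTTON', 'BACK_BUTTON']

def unused_aliases(aliases, source_root):
    """Return the aliases that appear in no .java file under source_root."""
    remaining = list(aliases)
    for root, dirs, files in os.walk(source_root):
        for name in files:
            if not name.endswith('.java'):
                continue
            with open(os.path.join(root, name)) as f:
                source = f.read()
            # raw string: '\b' without the r prefix is a backspace character
            remaining = [a for a in remaining
                         if not re.search(r'\b%s\b' % re.escape(a), source)]
    return remaining

print(unused_aliases(all_strings, 'src'))
```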

SilentGhost
A: 

You should consider using YAML: it's easy to use and human-readable.
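In YAML form the strings file might look like this (a sketch using the keys from the question):

```yaml
TITLE: Title
T_AND_C: Accept my terms and conditions please
START_BUTTON: Start
BACK_BUTTON: Back
```

It can then be loaded into a plain dict with, for example, the third-party PyYAML package's yaml.safe_load.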

jldupont
A: 

You are re-inventing gettext, the standard for translating programs in the Free Software sphere (even outside Python).

Gettext is designed to work with large files of strings like these :-). Helper programs exist to merge newly marked strings from the source into all translated versions, to flag unused strings, and so on. Perhaps you should take a look at it.
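A minimal sketch of the marking side in Python, assuming no compiled .mo catalogs exist yet ('myapp' is a placeholder domain name):

```python
import gettext

# Install _() into builtins; with no compiled .mo catalog found for the
# 'myapp' domain, a NullTranslations fallback is used and marked strings
# pass through unchanged.
gettext.install('myapp')

print(_("Accept my terms and conditions please"))
```

Once strings are marked with _(), GNU gettext's helper tools (xgettext to extract, msgmerge to merge changes, msgattrib to flag obsolete entries) do exactly the bookkeeping described in the question.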

kaizer.se