As I said in a comment, it doesn't look to me like the size of "bigfile" should be slowing down the rate at which the count increments. When you iterate over a file like that, Python just reads one line at a time, in order.
The optimizations you can do at this point depend on how big matchedLines is, and what relationship the matchedLines strings have to the text you're looking in.
If matchedLines is big, you could save time by doing the 'find' only once per line, outside the inner loop:

    for line in completedataset:
        text = line[:line.find(',')]
        for t in matchedLines:
            if t in text:
                line = line.strip().split(',')
                smallerdataset.write(','.join(line[:3]) + '\n')
                break

In my tests, the 'find' took about 300 nanoseconds, so if matchedLines is a few million items long, there's your extra second right there.
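If you want to check that number on your own machine, something like the following rough sketch should do. The sample line is made up, and the lambda adds a bit of call overhead of its own, so treat the result as an upper-bound estimate rather than a precise measurement.

```python
import timeit

# Hypothetical sample line; substitute a real line from your data.
line = "some_key_text,field2,field3,field4\n"

n = 200_000
# Time n calls of line.find(',') and convert to nanoseconds per call.
seconds = timeit.timeit(lambda: line.find(','), number=n)
per_call_ns = seconds / n * 1e9
print(f"line.find(',') took roughly {per_call_ns:.0f} ns per call")
```

Multiply the per-call figure by len(matchedLines) to see how much time the repeated 'find' costs you per line of the big file.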
If you're looking for exact matches, not substring matches, you can speed things WAY up by using a set:

    matchedLines = set(matchedLines)
    for line in completedataset:
        target = line[:line.find(',')]
        ## One lookup and you're done!
        if target in matchedLines:
            line = line.strip().split(',')
            smallerdataset.write(','.join(line[:3]) + '\n')

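Here is a small self-contained sketch of the set approach that you can run without your real files. The function name filter_exact and the sample data are made up for illustration, and I've swapped in str.partition for the find-and-slice: when a line has no comma, find returns -1 and the slice silently drops the last character, whereas partition just gives you the whole line.

```python
import io

def filter_exact(lines, keys, out):
    """Write the first three fields of each line whose first field is in keys."""
    keys = set(keys)  # one hash lookup per line instead of a scan
    for line in lines:
        target = line.partition(',')[0]  # text before the first comma
        if target in keys:
            fields = line.strip().split(',')
            out.write(','.join(fields[:3]) + '\n')

# Made-up stand-ins for completedataset and smallerdataset.
sample = io.StringIO("alice,1,2,3\nbob,4,5,6\nalice2,7,8,9\n")
out = io.StringIO()
filter_exact(sample, ["alice", "carol"], out)
print(out.getvalue(), end="")
```

Note that "alice2" is not written here. A substring test would have matched it, which is exactly the difference between the two approaches, so only switch to the set if exact matching is really what you want.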
If the target texts that don't match tend to look completely different from ones that do (for example, most of the targets are random strings, and matchedLines is a bunch of names) AND the matchedLines are all above a certain length, you could try to get really clever by checking substrings. Suppose all matchedLines are at least 5 characters long...

    def subkeys(s):
        ## e.g. if len(s) is 7, return s[0:5], s[1:6], s[2:7].
        return [s[i:i+5] for i in range(len(s) + 1 - 5)]

    existing_subkeys = set()
    for line in matchedLines:
        existing_subkeys.update(subkeys(line))

    for line in completedataset:
        target = line[:line.find(',')]
        might_match = False
        for subkey in subkeys(target):
            if subkey in existing_subkeys:
                might_match = True
                break
        if might_match:
            # Then we have to do the old slow way.
            for matchedLine in matchedLines:
                if matchedLine in target:
                    # Same split and write as before.
                    line = line.strip().split(',')
                    smallerdataset.write(','.join(line[:3]) + '\n')
                    break

But it's really easy to outsmart yourself trying to do things like that, and it depends what your data looks like.
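One way to guard against that is to check the clever version against the slow one on a sample before trusting it. This is only a sketch: the helper names matches_slow and matches_prefiltered, the needles, and the targets are all made up, and K = 5 is the assumed minimum length from above.

```python
K = 5  # assumed minimum length of every string in matchedLines

def subkeys(s, k=K):
    # All length-k substrings of s; empty list if s is shorter than k.
    return [s[i:i + k] for i in range(len(s) + 1 - k)]

def matches_slow(target, needles):
    # The plain substring scan.
    return any(n in target for n in needles)

def matches_prefiltered(target, needles, existing_subkeys):
    # Cheap rejection first; fall back to the scan only on a possible hit.
    if not any(sk in existing_subkeys for sk in subkeys(target)):
        return False
    return any(n in target for n in needles)

needles = ["alice", "robert", "charlotte"]  # stand-in for matchedLines
existing_subkeys = set()
for n in needles:
    existing_subkeys.update(subkeys(n))

targets = ["xx_alice_01", "q9z7r2", "roberta", "charl", "bert", "harlot"]
for t in targets:
    # Both approaches must agree on every target.
    assert matches_slow(t, needles) == matches_prefiltered(t, needles, existing_subkeys), t
```

The "harlot" target passes the prefilter (it shares subkeys with "charlotte") but still fails the real check, which is why the slow scan has to stay in as the final word. The way this goes wrong is if any search string is shorter than K: it contributes no subkeys at all, so the prefilter would reject lines that actually contain it.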