views:

346

answers:

3

I'm trying to test for a \t or a space character and I can't understand why this bit of code won't work. I am reading in a file, counting the LOC for the file, and recording the name of each function present within the file along with its individual lines of code. The code below is where I attempt to count the LOC for the functions.

import re

...
    else:
            loc += 1
            for line in infile:
                line_t = line.lstrip()
                if len(line_t) > 0 \
                and not line_t.startswith('#') \
                and not line_t.startswith('"""'):
                    if not line.startswith('\s'):
                        print ('line = ' + repr(line))
                        loc += 1
                        return (loc, name)
                    else:
                        loc += 1
                elif line_t.startswith('"""'):
                    while True:
                        if line_t.rstrip().endswith('"""'):
                            break
                        line_t = infile.readline().rstrip()

            return (loc, name)

Output:

Enter the file name: test.txt
line = '\tloc = 0\n'

There were 19 lines of code in "test.txt"

Function names:

    count_loc -- 2 lines of code

As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute when no whitespace characters are present.

Here is my full test file I have been using:

def count_loc(infile):
    """ Receives a file and then returns the amount
        of actual lines of code by not counting commented
        or blank lines """

    loc = 0
    for line in infile:
        line = line.strip()
        if len(line) > 0 \
        and not line.startswith('//') \
        and not line.startswith('/*'):
            loc += 1
            func_loc, func_name = checkForFunction(line);
        elif line.startswith('/*'):
            while True:
                if line.endswith('*/'):
                    break
                line = infile.readline().rstrip()

    return loc

if __name__ == "__main__":
    print ("Hi")
    Function LOC = 15
    File LOC = 19
+8  A: 

\s is only whitespace to the re package when doing pattern matching.

For startswith, an ordinary method of ordinary strings, \s is nothing special. Not a pattern, just characters.
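A quick way to see the difference, using the example line from the question (a minimal sketch):

```python
import re

line = '\tloc = 0\n'

# startswith() compares characters literally: '\\s' is a backslash
# followed by the letter 's', not a whitespace class
print(line.startswith('\\s'))        # False

# re treats \s as "any whitespace character", so it matches the tab
print(bool(re.match(r'\s', line)))   # True
```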

S.Lott
I am using import re - I'll add that to the original post
Justen
@Justen You're importing re, but you're only using basic string methods
JimB
Ah I see. Well, as stated in my comment on the post below, I tried \t and ' ' but it doesn't detect the 'i' in the if __name__ == ... line, so it keeps counting the function's LOC until the end of file is reached. (BTW, I am using regex elsewhere in the program, so the import re is still needed.)
Justen
+1 S.Lott, this and help earlier.
Aiden Bell
+2  A: 

Your string literals aren't what you think they are. You can specify a space or TAB like so:

space = ' '
tab = '\t'
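Note also that startswith accepts a tuple of prefixes, which covers the tab-or-space check in a single call (a small sketch, not tied to the rest of the question's code):

```python
line = '\tloc = 0\n'

# A tuple of prefixes matches if any one of them matches
print(line.startswith(('\t', ' ')))   # True

# Alternatively, test whether the first character is any whitespace
print(line[:1].isspace())             # True
```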
JimB
I tried that. I changed my code to `if not line.startswith('\t') and not line.startswith(' '): print('line = ' + repr(line)); return (loc, name)` and it doesn't detect the if __name__ == '__main__' line and just counts until EOF.
Justen
+3  A: 

Your question has already been answered and this is slightly off-topic, but...

If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.

As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:

import tokenize

ignored_tokens = [tokenize.NEWLINE, tokenize.COMMENT, tokenize.N_TOKENS,
                  tokenize.STRING, tokenize.ENDMARKER, tokenize.INDENT,
                  tokenize.DEDENT, tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)

Since each a_token above is already parsed, you can easily check for function definitions, too. You can also keep track of where a function ends by looking at the token's starting column, a_token[2][1]. If you want to do more complex things, you should use ast.

stephan
+1 - didn't get around to adding this myself
JimB