Basically, if I have a line of text which starts with indention, what's the best way to grab that indention and put it into a variable in Python? For example, if the line is:

\t\tthis line has two tabs of indention

Then it would return '\t\t'. Or, if the line was:

    this line has four spaces of indention

Then it would return four spaces.

So I guess you could say that I just need to strip everything from the first non-whitespace character to the end of the string. Thoughts?

A: 

How about using the regex \s*, which matches any run of whitespace characters? You only want the whitespace at the beginning of the line, so either search with the regex ^\s* or simply match with \s*.
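
A minimal sketch of the two spellings described above. The helper names `indent_via_match` and `indent_via_search` are made up for illustration. Both return the leading whitespace, since `re.match` only anchors at the start and does not have to consume the whole string.

```python
import re

def indent_via_match(line):
    # re.match is implicitly anchored at the start of the string.
    return re.match(r"\s*", line).group()

def indent_via_search(line):
    # re.search scans the string, so the explicit ^ keeps it at the start.
    return re.search(r"^\s*", line).group()

print(repr(indent_via_match("\t\tthis line has two tabs of indention")))
print(repr(indent_via_search("    this line has four spaces of indention")))
```

Because `\s*` can match zero characters, both helpers return an empty string for an unindented line rather than failing.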

MatrixFrog
If you want the match to start at the beginning, use `match`. If the match can start anywhere, use `search`.
Mark Byers
`match` has to match the entire string, right? So you would have to add a separate group that matches the entire *rest* of the string and then the whitespace would just be in that first group. I think.
MatrixFrog
@MatrixFrog: Read the doc (http://docs.python.org/library/re.html#matching-vs-searching). The only difference between `match` and `search` is the implicit anchor at the start. There's no restriction to match the entire string.
KennyTM
@KennyTM I've edited my answer to reflect that information. Please comment again if it's still wrong.
MatrixFrog
+11  A: 
import re
s = "\t\tthis line has two tabs of indention"
re.match(r"\s*", s).group()
# "\t\t"
s = "    this line has four spaces of indention"
re.match(r"\s*", s).group()
# "    "

And to strip leading spaces, use lstrip.
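
As a small illustration of how the two pieces fit together (a sketch, with variable names chosen here for the example), the regex gives the indentation and `lstrip` gives the rest of the line:

```python
import re

line = "\t\tthis line has two tabs of indention"
indent = re.match(r"\s*", line).group()  # leading whitespace only
rest = line.lstrip()                     # the line without that whitespace

# The two pieces reassemble into the original line.
assert indent + rest == line
print(repr(indent), repr(rest))
```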


As there are downvotes, probably questioning the efficiency of regex, I've done some profiling to check the efficiency in each case.

Very long string, very short leading space

RegEx > Itertools >> lstrip

>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s="          hello world!"*10000', number=100000)
0.10037684440612793
>>> timeit.timeit('"".join(itertools.takewhile(lambda x:x.isspace(),s))', 'import itertools;s="          hello world!"*10000', number=100000)
0.7092740535736084
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s="          hello world!"*10000', number=100000)
0.51730513572692871
>>> timeit.timeit('s[:-len(s.lstrip())]', 's="          hello world!"*10000', number=100000)
2.6478431224822998

Very short string, very short leading space

lstrip > RegEx > Itertools

If you can limit the string's length to thousounds of chars or less, the lstrip trick may be better.

>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s="          hello world!"*100', number=100000)
0.099548101425170898
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s="          hello world!"*100', number=100000)
0.53602385520935059
>>> timeit.timeit('s[:-len(s.lstrip())]', 's="          hello world!"*100', number=100000)
0.064291000366210938

This shows the lstrip trick scales roughly as O(n), where n is the string length, while the RegEx and itertools methods are O(1) when there are only a few leading spaces.

Very short string, very long leading space

lstrip >> RegEx >>> Itertools

If there are a lot of leading spaces, don't use RegEx.

>>> timeit.timeit('s[:-len(s.lstrip())]', 's=" "*2000', number=10000)
0.047424077987670898
>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s=" "*2000', number=10000)
0.2433168888092041
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s=" "*2000', number=10000)
3.9949162006378174

Very long string, very long leading space

lstrip >>> RegEx >>>>>>>> Itertools

>>> timeit.timeit('s[:-len(s.lstrip())]', 's=" "*200000', number=10000)
4.2374031543731689
>>> timeit.timeit('r.match(s).group()', 'import re;r=re.compile(r"\s*");s=" "*200000', number=10000)
23.877214908599854
>>> timeit.timeit('"".join(itertools.takewhile(str.isspace,s))', 'import itertools;s=" "*200000', number=100)*100
415.72158336639404

This shows all methods scale roughly as O(m), where m is the number of leading whitespace characters, if the non-space part is short.
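
For anyone who wants to rerun these comparisons, here is a rough harness, offered as a sketch rather than a reproduction of the exact commands above. The function names and the `time_methods` helper are invented for this example. Wrapping each method in a function adds call overhead, and the iteration count is much smaller, so absolute numbers will not match the figures quoted above.

```python
import itertools
import re
import timeit

WS = re.compile(r"\s*")

def via_regex(s):
    return WS.match(s).group()

def via_takewhile(s):
    return "".join(itertools.takewhile(str.isspace, s))

def via_lstrip(s):
    # len(s) - len(...) rather than -len(...), so an all-whitespace
    # string is returned whole instead of as ''.
    return s[:len(s) - len(s.lstrip())]

METHODS = {"regex": via_regex, "takewhile": via_takewhile, "lstrip": via_lstrip}

def time_methods(s, number=1000):
    # Returns {method name: total seconds for `number` calls}.
    return {name: timeit.timeit(lambda f=f: f(s), number=number)
            for name, f in METHODS.items()}

cases = {
    "long string, short indent": "          hello world!" * 10000,
    "short string, short indent": "          hello world!" * 100,
    "short string, long indent": " " * 2000,
}
for label, s in cases.items():
    print(label, time_methods(s, number=100))
```

Raising `number` gives steadier timings at the cost of a longer run.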

KennyTM
For a line with *no* indentation, this gives an AttributeError because `match` returns `None`. Assuming the desired result is `''`, the empty string, changing the `+` to a `*` seems to solve this.
MatrixFrog
I prefer my solution of abusing lstrip! No slow regexes needed...
Phil H
@Phil: Even after implementing Adam Bernier's fix, your `lstrip()` trick is still 10x slower than the regex version. Try it: `timeit.timeit('re.match(r"\s*", s)', 'import re;s=" hello world!"*10000', number=1000000)` (0.25 sec) vs `timeit.timeit('s[:len(s)-len(s.lstrip())]', 's=" hello world!"*10000', number=100000)` (2.7 sec)
KennyTM
@Kenny: Remove `*10000` (so the test better models real lines) and lstrip is *much* faster than re: `timeit.timeit('s[:-len(s.lstrip())]', 's=" hello world!"', number=100000)` (4.810 vs 0.074 for me).
Roger Pate
@Roger: That means Phil's method does not scale to very long strings (O(string length) vs O(number of leading spaces)). Not a good algorithm, I would say.
KennyTM
@Kenny: That may be true, and yours may even be better for other reasons, but it's pointless to say yours is faster on lines of 130k(!) characters if it's extremely rare to have lines longer than 80 characters.
Roger Pate
Probably half the time is taken compiling the regex; why don't you do that in the setup?
gnibbler
@Roger: That's a good point. I've updated the test cases to show the time in each extreme case. @gnibbler: From the doc, *"The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached,"* and compiling at setup doesn't affect the time much either.
KennyTM
Compiling the regex, even though I know it's otherwise cached, more than halves the time in these rudimentary tests. http://codepad.org/Oc5KDQhU
Roger Pate
@Roger: Ah, that's good. I'd just re-tested the last case when replying and saw no difference.
KennyTM
@Kenny: What I was trying to say is that *common cases* matter much more than extreme cases. I think all this testing is missing the real point, however, as this is not likely to be a bottleneck and could be implemented in C if it was, for better performance than any pure-Python we'll come up with. (It wasn't my downvote, btw, and I only see the one.)
Roger Pate
"If you can limit the string's length to thousounds [sic] of chars or less..."?! So according to these tests, lstrip always beats regex, except for the cases that never happen. Hmm... wonder which way I'd use....
John Y
A: 

If you're interested in using regular expressions, you can do that. /\s/ usually matches one whitespace character, so /^\s+/ would match the whitespace starting a line.
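
One thing to watch with `+`: on a line with no indentation the pattern does not match at all, so `re.match` returns `None`. A short sketch of the difference, using illustrative helper names that are not from the thread:

```python
import re

def leading_ws_plus(line):
    # With +, an unindented line produces no match (None), so guard for it.
    m = re.match(r"^\s+", line)
    return m.group() if m else ""

def leading_ws_star(line):
    # With *, the pattern always matches, possibly the empty string.
    return re.match(r"^\s*", line).group()

print(repr(leading_ws_plus("no indention here")))
print(repr(leading_ws_star("no indention here")))
```

Both helpers return the same result; the `*` version just avoids the explicit `None` check.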

adamse
+2  A: 

A sneaky way: abuse lstrip!

fullstr = "\t\tthis line has two tabs of indentation"
startwhites = fullstr[:len(fullstr)-len(fullstr.lstrip())]

This way you don't have to work through all the details of whitespace!
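
A quick check of this trick on edge cases, as a sketch with made-up function names. It also shows why the shorter `[:-len(...)]` spelling that appears elsewhere in this thread needs care: when the line is entirely whitespace, `lstrip()` returns `''`, and `line[:-0]` is the same as `line[:0]`.

```python
def indent_len_diff(line):
    # Slice up to the number of characters that lstrip() removed.
    return line[:len(line) - len(line.lstrip())]

def indent_negative_slice(line):
    # Shorter spelling; wrong when lstrip() returns '' (all-whitespace line).
    return line[:-len(line.lstrip())]

print(repr(indent_len_diff("\t\tthis line has two tabs of indentation")))  # '\t\t'
print(repr(indent_len_diff("    ")))        # '    '
print(repr(indent_negative_slice("    ")))  # ''
```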

(Thanks Adam for the correction)

Phil H
Your shadowing of the str builtin is confusing
prestomation
Nice idea. Doesn't work as written. `lstrip()` returns a copy of the string with leading characters removed. Instead: `startwhites = s[:len(s)-len(s.lstrip())]` Btw, I wasn't the one who downvoted.
Adam Bernier
You probably meant `[:-len(..`.
Roger Pate
+3  A: 

This can also be done with str.isspace and itertools.takewhile instead of regex.

import itertools

tests=['\t\tthis line has two tabs of indention',
       '    this line has four spaces of indention']

def indention(astr):
    # Using itertools.takewhile is efficient -- the looping stops immediately after the first
    # non-space character.
    return ''.join(itertools.takewhile(str.isspace,astr))

for test_string in tests:
    print(indention(test_string))
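
To see the early stop for yourself, here is a small sketch that records how many characters `takewhile` actually pulls from the string. The `record` generator and the `consumed` list are only scaffolding for this demonstration.

```python
import itertools

consumed = []

def record(chars):
    # Pass characters through unchanged, remembering each one requested.
    for ch in chars:
        consumed.append(ch)
        yield ch

line = "    this line has four spaces of indention"
result = ''.join(itertools.takewhile(str.isspace, record(line)))

print(repr(result))   # '    '
print(len(consumed))  # 5: the four spaces plus the first non-space character
```

The rest of the line is never examined, however long it is.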
unutbu
you can use `str.isspace` in place of the `lambda` function
gnibbler
@gnibbler: Indeed. Thank you for that!
unutbu
A: 
def whites(a):
    # Everything before the first occurrence of the stripped text is the indentation.
    return a[0:a.find(a.strip())]

Basically, my idea is:

  1. Strip the original line
  2. Find the difference between the original line and the stripped one
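
The two steps can be spelled out like this, as a sketch of the same idea with an intermediate variable added for readability. Note one edge case: for a line that is entirely whitespace, `strip()` returns `''`, `find('')` returns 0, and the result is an empty string rather than the whitespace itself.

```python
def whites(a):
    stripped = a.strip()         # step 1: the line without surrounding whitespace
    return a[:a.find(stripped)]  # step 2: whatever precedes it in the original

print(repr(whites("\t\tthis line has two tabs of indention")))  # '\t\t'
print(repr(whites("    trailing spaces are ignored   ")))       # '    '
print(repr(whites("    ")))                                     # ''
```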
woo