I would like to do something like this.

list_of_urls = ['http://www.google.fr/', 'http://www.google.fr/', 
                'http://www.google.cn/', 'http://www.google.com/', 
                'http://www.google.fr/', 'http://www.google.fr/', 
                'http://www.google.fr/', 'http://www.google.com/', 
                'http://www.google.fr/', 'http://www.google.com/', 
                'http://www.google.cn/']

urls = [{'url': 'http://www.google.fr/', 'nbr': 1}]

for url in list_of_urls:
    if url in [f['url'] for f in urls]:
         urls[??]['nbr'] += 1
    else:
         urls.append({'url': url, 'nbr': 1})

How can I do this? I don't know whether I should fetch the dict and edit it, or figure out its index in the list.

Any help?

+2  A: 

To do it exactly your way? You could use the for...else structure (the else branch runs only when the inner loop finishes without hitting break):

for url in list_of_urls:
    for url_dict in urls:
        if url_dict['url'] == url:
            url_dict['nbr'] += 1
            break
    else:
        urls.append(dict(url=url, nbr=1))

But it is quite inelegant. Do you really have to store the visited urls as a LIST? If you store them in a dict, indexed by url string, for example, it would be way cleaner:

urls = {'http://www.google.fr/': dict(url='http://www.google.fr/', nbr=1)}

for url in list_of_urls:
    if url in urls:
        urls[url]['nbr'] += 1
    else:
        urls[url] = dict(url=url, nbr=1)

A few things to note in that second example:

  • See how using a dict for urls removes the need to go through the whole urls list when testing for one single url. A dict lookup is a hash lookup, so this approach will be faster than scanning the list for every url.
  • Using dict(url=..., nbr=...) instead of braces saves you from quoting the keys, which makes the code a little shorter.
  • Using list_of_urls, urls and url as variable names makes the code quite hard to parse. It's better to find something clearer, such as urls_to_visit, urls_already_visited and current_url. I know, it's longer. But it's clearer.

And of course I'm assuming that dict(url='http://www.google.fr/', nbr=1) is a simplification of your own data structure, because otherwise, urls could simply be:

urls = {'http://www.google.fr/': 1}

for url in list_of_urls:
    if url in urls:
        urls[url] += 1
    else:
        urls[url] = 1

Which can get very elegant with a defaultdict:

import collections

urls = collections.defaultdict(int)
for url in list_of_urls:
    urls[url] += 1
NicDumZ
The second version is good since I can convert the dict as a list after.
Natim
+7  A: 

That is a very strange way to organize things. If you stored the counts in a dictionary keyed by url, this would be easy:

urls = {'http://www.google.fr/' : 1 }
for url in list_of_urls:
    if url not in urls:
        urls[url] = 1
    else:
        urls[url] += 1

This is a common "pattern" in Python. It is so common that there is a special data structure, defaultdict, created just to make this even easier:

from collections import defaultdict
urls = defaultdict(int)
for url in list_of_urls:
    urls[url] += 1

If you access the defaultdict using a key, and the key is not already in the defaultdict, the key is automatically added with a default value. The default value is produced by calling the function passed as the argument. In this case, that function is int, and int() returns 0. So, the first time you reference a URL, its count is initialized to zero, and then you add one to it.
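
To make the behaviour described above concrete, here is a minimal sketch. The name counts and the keys 'a' and 'b' are made up for illustration; only defaultdict and int come from the answer.

```python
from collections import defaultdict

counts = defaultdict(int)   # int is the factory; int() returns 0

print(int())                # 0
print('a' in counts)        # False: a membership test does not create the key
counts['a'] += 1            # 'a' is created with value 0, then incremented
print(counts['a'])          # 1
print(counts['b'])          # 0: merely reading a missing key also inserts it
print(sorted(counts))       # ['a', 'b']
```

Note the last two lines: indexing a missing key adds it, so use `in` or `.get()` when you only want to check for a key.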

If you really need to do it the way you showed, the easiest and fastest way would be to do it the way I showed, and then build the one you need:

from collections import defaultdict
urls_d = defaultdict(int)
for url in list_of_urls:
    urls_d[url] += 1

urls = []
for key, value in urls_d.items():
    urls.append({"url": key, "nbr": value})
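
As an aside, and not what the code above uses: the standard library also has collections.Counter (Python 2.7 and later), which does the counting in one call. A rough sketch with the list from the question follows; the name counter is my own, and the counts start from zero rather than from the pre-seeded entry in the question.

```python
from collections import Counter

list_of_urls = ['http://www.google.fr/', 'http://www.google.fr/',
                'http://www.google.cn/', 'http://www.google.com/',
                'http://www.google.fr/', 'http://www.google.fr/',
                'http://www.google.fr/', 'http://www.google.com/',
                'http://www.google.fr/', 'http://www.google.com/',
                'http://www.google.cn/']

# Counter tallies each distinct url in a single pass.
counter = Counter(list_of_urls)

# most_common() yields (url, count) pairs, highest count first.
urls = [{"url": url, "nbr": nbr} for url, nbr in counter.most_common()]

for entry in urls:
    print(entry["url"], entry["nbr"])
# http://www.google.fr/ 6
# http://www.google.com/ 3
# http://www.google.cn/ 2
```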
steveha
I do it like that to send it to a Django template, so I can do: `{% for u in urls %} {{ u.url }} : {{ u.nbr }}{% endfor %}`
Natim
You can still do {% for url, nbr in urls.items %}{{ url }} : {{ nbr }}{% endfor %}
stefanw
Ok sounds great :) Thank you
Natim
+2  A: 

Use defaultdict:

from collections import defaultdict

urls = defaultdict(int)

for url in list_of_urls:
    urls[url] += 1
Greg Hewgill
+3  A: 

Using defaultdict works, but so does:

urls[url] = urls.get(url, 0) + 1

With .get, you can supply a value to return when the key doesn't exist. If you don't supply one, it returns None, but in the line above it returns 0.
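
Dropped into the loop from the question, that line would look roughly like this. It is a sketch with a plain dict that starts empty, and the list is shortened for illustration.

```python
list_of_urls = ['http://www.google.fr/', 'http://www.google.cn/',
                'http://www.google.fr/', 'http://www.google.com/']

urls = {}
for url in list_of_urls:
    # get() returns 0 for a url not seen yet, so the first visit stores 1
    urls[url] = urls.get(url, 0) + 1

print(urls['http://www.google.fr/'])      # 2
print(urls.get('http://example.invalid/'))  # None: no default given, key absent
```

Unlike a defaultdict, .get never inserts the missing key, so lookups of unseen urls leave the dict unchanged.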

mikelikespie