+3  A: 

For the N most recent messages, it seems you could pass a parameter like ?num=50 in the feed URL.

For example, the 50 newest messages from the comp.unix.shell group:

http://groups.google.com/group/comp.unix.shell/feed/atom_v1_0_msgs.xml?num=50

and then parse the result with a feed parser library such as Universal Feed Parser.

feedparser exposes an .updated_parsed attribute on each entry; you could use that to check whether a message falls within a particular date range:

>>> e.updated_parsed              # parses all date formats
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
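
For illustration, here is a rough sketch (untested; it assumes the feed URL above still serves Atom and that the feedparser package is installed) that fetches the 50 newest messages and keeps only those updated within the last 7 days:

from datetime import datetime, timedelta

import feedparser  # Universal Feed Parser

# Hypothetical feed URL, taken from the example above.
FEED_URL = ("http://groups.google.com/group/comp.unix.shell/"
            "feed/atom_v1_0_msgs.xml?num=50")

cutoff = datetime.utcnow() - timedelta(days=7)

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # updated_parsed is a 9-tuple (time.struct_time); convert it to datetime.
    updated = datetime(*entry.updated_parsed[:6])
    if updated >= cutoff:
        print(updated.isoformat(), entry.title)
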
S.Mark
You can probably set num to anything from 1 to 100.
S.Mark
Nope, unfortunately anything more than 100 at a time does not work.
Hamish Grubijan
+4  A: 

Crawling Google Groups violates Google's Terms of Service, specifically this clause:

use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of the Service or collect information about users for any unauthorized purpose

Are you sure you want to announce so openly that you're doing that? And have you considered the consequences?

Randal Schwartz
+1  A: 

Have you thought about Yahoo's YQL? It's not bad and can access a lot of APIs. http://developer.yahoo.com/yql/

I don't know if Google Groups is supported directly, but you can access RSS feeds through it, which could be helpful.
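
For example, here is a rough, untested sketch of hitting YQL's public REST endpoint from Python. The "rss" table and the endpoint come from YQL's documentation, but whether the rss table handles the Google Groups Atom feed mentioned above is an assumption:

import json
import urllib.parse
import urllib.request

# Hypothetical YQL query against the Google Groups feed from the earlier answer.
query = ('select title, link from rss where url='
         '"http://groups.google.com/group/comp.unix.shell/'
         'feed/atom_v1_0_msgs.xml?num=50"')
url = ("http://query.yahooapis.com/v1/public/yql?"
       + urllib.parse.urlencode({"q": query, "format": "json"}))

# Fetch the JSON response and print the titles YQL returns.
data = json.load(urllib.request.urlopen(url))
for item in data["query"]["results"]["item"]:
    print(item["title"], item["link"])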

Thorn007
+1  A: 

As Randal mentioned, this violates Google's ToS. However, hypothetically, or for use on another site without such restrictions, you could fairly easily rig something up with urllib and BeautifulSoup: use urllib to open the page, then use BeautifulSoup to grab all the thread topics (and links, if you want to crawl deeper). You can then programmatically find the link to the next page of results, make another urllib request for page 2, and repeat the process.

At that point you should have all the raw data; then it is just a matter of manipulating it and implementing your search functionality.
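
A rough sketch of that crawl loop is below (untested; the start URL, the thread-link pattern, and the "next page" link text are placeholders, so the real Google Groups markup would need inspecting):

import urllib.request

from bs4 import BeautifulSoup

# Hypothetical starting page listing the group's topics.
url = "http://groups.google.com/group/comp.unix.shell/topics"
topics = []

while url:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # Collect thread titles and links; the href pattern here is a guess.
    for a in soup.find_all("a"):
        href = a.get("href", "")
        if "/browse_thread/" in href:
            topics.append((a.get_text(strip=True), href))

    # Follow the link to the next page of results, if there is one.
    next_link = soup.find("a", string="Older »")  # placeholder link text
    url = next_link.get("href") if next_link else None

print(len(topics), "topics collected")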

swanson