ansaurus

Question

[Python] Extracting data from a URL result with special formatting

Answer 1

+1 A:

I didn't fully understand your problem. From your code it looks like you are using the Visualization API (this is the first time I have heard of it, by the way).

If you are just looking for a way to fetch data from a web page, you can use urllib2. That only retrieves the data; to parse what you retrieved you will need a more appropriate library such as BeautifulSoup.

If you are dealing with another kind of web service (RSS, Atom, RPC) rather than web pages, there are plenty of Python libraries dedicated to each of them.

import urllib2

from BeautifulSoup import BeautifulSoup

# Fetch the raw page (Python 2, BeautifulSoup 3)
result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()

soup = BeautifulSoup(htmltext, convertEntities="html")

# You can parse your data now; check the BeautifulSoup API.

singularity 2010-10-29 23:27:35

Answer 2

+2 A:

It sounds like you can break this problem up into several subproblems.

Subproblems

There are a handful of problems that need to be solved before composing the completed script:

Forming the request URL: Creating a configured request URL from a template
Retrieving data: Actually making the request
Unwrapping JSONP: The returned data appears to be JSON wrapped in a JavaScript function call
Parsing JSON: Converting the unwrapped string into Python objects
Traversing the object graph: Navigating through the result to find the desired bits of information

Forming the request URL

This is just simple string formatting.

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')

Python 2 Note

str.format is available from Python 2.6 onward. On older versions you will need to use the string formatting operator (%) instead.
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
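
One thing plain string formatting does not handle is a seed term containing spaces, '&' or other reserved characters. Here is a minimal sketch, assuming Python 3 and assuming the service accepts ordinary percent-encoded query strings. The names BASE_URL and build_url are made up for this example.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the example URL above.
BASE_URL = 'http://somewhere.com/relatedqueries'

def build_url(limit, seedterm):
    """Return the request URL with percent-encoded query parameters."""
    # urlencode escapes reserved characters and joins the pairs with '&'.
    query = urlencode([('limit', limit), ('query', seedterm)])
    return '{}?{}'.format(BASE_URL, query)

print(build_url(2, 'new york & co'))
```

For the seed term 'new york & co' the query string becomes limit=2&query=new+york+%26+co, so the '&' inside the term can no longer be mistaken for a parameter separator.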

Retrieving data

You can use the built-in urllib.request module for this.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

This returns a file-like object, bound here to the name data. You can also use a with statement so that the connection is closed automatically:

with urllib.request.urlopen(url) as data:
    result = data.read()  # do processing here

Python 2 Note

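Import urllib2 instead of urllib.request. Note that in Python 2 the object returned by urllib2.urlopen is not a context manager, so either wrap it in contextlib.closing or call close() yourself instead of using it directly in a with statement.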
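
If you want the body as text rather than bytes, something along these lines may help. It is only a sketch for Python 3: fetch_text is a hypothetical helper name, the 10 second timeout is an arbitrary choice, and falling back to UTF-8 when the server sends no charset is an assumption.

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Fetch url and return the response body decoded to str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset from the Content-Type header when present;
        # the UTF-8 fallback is an assumption, not something the server
        # guarantees.
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset)
```

Connection problems surface as urllib.error.URLError (a subclass of OSError), so the caller can catch that around fetch_text(url).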

Unwrapping JSONP

The result you pasted looks like JSONP. Since the wrapping function call (oo.visualization.Query.setResponse) doesn't change, we can simply strip it off.

# read() returns bytes in Python 3; decoding as UTF-8 is an assumption
result = data.read().decode('utf-8')

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
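
The prefix/suffix check only works when the callback name and the trailing ');' match exactly. If the name might change, or the response carries trailing whitespace, a regular expression is a bit more forgiving. This is a sketch under the assumption that the wrapper is a single call of the form name(...), optionally followed by a semicolon. unwrap_jsonp is a name made up for this example.

```python
import re

# A dotted callback name, an opening parenthesis, the payload, a closing
# parenthesis, then an optional semicolon and whitespace.
_JSONP_RE = re.compile(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)

def unwrap_jsonp(text):
    """Return the payload inside a JSONP wrapper, or text unchanged."""
    match = _JSONP_RE.match(text)
    return match.group(1) if match else text

wrapped = 'oo.visualization.Query.setResponse({"reqId": "0"});\n'
print(unwrap_jsonp(wrapped))
```

Text that does not look like a function call (plain JSON, for instance) is returned as it came in, so the function is safe to call either way.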

Parsing JSON

The result string is now plain JSON. Parse it with the built-in json module.

import json

result_object = json.loads(result)
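
If the wrapper was not stripped (or the server returned an error page), json.loads raises an exception. A small sketch of one way to handle that, with parse_json being a hypothetical helper name:

```python
import json

def parse_json(text):
    """Return the decoded object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

print(parse_json('{"reqId": "0"}'))
print(parse_json('oo.visualization.Query.setResponse({});'))
```

Returning None is just one choice. Letting the exception propagate is equally reasonable if the caller is better placed to report the problem.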

Traversing the object graph

Now, you have a result_object that represents the JSON response. The object itself will be a dict with keys like version, reqId, and so on. Based on your question, here is what you would need to do to create your list.

# Get the rows in the table, then take the 'v' value of the cell at
# index 2 in each row (indices are zero-based, so this is the third
# cell; adjust the index to the column you need)
terms = [row['c'][2]['v'] for row in result_object['table']['rows']]
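
To see the traversal on something concrete, here is a sketch with a made-up response. The SAMPLE data is hypothetical and only mirrors the table/rows/c/v nesting used above. column_values is an illustrative helper that also skips rows where the cell is missing or null, on the assumption that this can happen.

```python
import json

# Hypothetical response with the same nesting as the list comprehension
# above expects: table -> rows -> c (cells) -> v (value).
SAMPLE = '''
{"version": "0.5", "reqId": "0",
 "table": {"rows": [{"c": [{"v": 1}, {"v": 0.9}, {"v": "first term"}]},
                    {"c": [{"v": 2}, null, {"v": "second term"}]},
                    {"c": [{"v": 3}, {"v": 0.7}, null]}]}}
'''

def column_values(result_object, index):
    """Collect the 'v' value at `index` from every row, skipping gaps."""
    values = []
    for row in result_object.get('table', {}).get('rows', []):
        cells = row.get('c', [])
        # Skip rows that are too short or whose cell is null.
        if index < len(cells) and cells[index] is not None:
            values.append(cells[index].get('v'))
    return values

print(column_values(json.loads(SAMPLE), 2))
```

With this sample the result is ['first term', 'second term']; the third row is skipped because its cell at index 2 is null.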

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then take the 'v' value of the cell
    # at index 2 (zero-based) in each row
    return [row['c'][2]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes in Python 3; UTF-8 is assumed here
            result = data.read().decode('utf-8')
    except OSError:  # urllib.error.URLError is a subclass of OSError
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        # Missing argument, or limit is not an integer
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))

Python 2.7 version

#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then take the 'v' value of the cell
    # at index 2 (zero-based) in each row
    return [row['c'][2]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

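    try:
        # In Python 2 the object returned by urlopen is not a context
        # manager, so close it explicitly instead of using `with`.
        data = urllib2.urlopen(url)
        try:
            result = data.read()
        finally:
            data.close()
    except IOError:  # urllib2.URLError is a subclass of IOError
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)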

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        # Missing argument, or limit is not an integer
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))

Wesley 2010-10-29 23:51:48
