The situation:

Each page I scrape has <input> elements, each with a title= and a value= attribute.

I don't know what is going to be on the page.

I want to have all my collected data in a single table at the end, with a column for each title.

So basically, I need each row of data to line up with all the others, and if a row doesn't have a certain element, then it should be blank (but there must be something there to keep the alignment).

For example:

First page has: {animal: cat, colour: blue, fruit: lemon, day: monday}

Second page has: {animal: fish, colour: green, day: saturday}

Third page has: {animal: dog, number: 10, colour: yellow, fruit: mango, day: tuesday}

Then my resulting table should be:

animal | number | colour | fruit | day
cat    | none   | blue   | lemon | monday
fish   | none   | green  | none  | saturday
dog    | 10     | yellow | mango | tuesday

It would also be good to keep the order of the title/value pairs, which I know dictionaries won't do.

So basically, I need to generate columns from all the titles (kept in order but somehow merged together).

What would be the best way of going about this without knowing all the possible titles and explicitly specifying an order for the values to be put in?

A: 

I would suggest using optional parameters, or alternatively overloaded constructors, to populate the values:

Page(string animal = "", int number = -999,
     string colour = "", string day = "")

Either that or store each key/value pair as type object and then cast it from your pages.

Anish Patel
That would require knowledge of what might appear on the page though wouldn't it?
Acorn
+2  A: 

You need a multipass algorithm. Remember all the scraped pages in a list of dicts. In the first pass, go over this list and collect all the titles in a set(), then create an ordering (for example, convert the set to a list and sort it alphabetically).

In the second pass you print the table and use your generated ordering as column names, extracting the values from the dictionaries as needed (defaulting to empty to handle missing values), for example with dict.get(name, "").
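
A minimal sketch of the two passes described above, assuming the scraped pages are already held as a list of dicts. The `pages` data and the `build_table` helper are hypothetical names used only for illustration, and the values are kept as strings so they can be joined for printing.

```python
# Hypothetical sample data standing in for the scraped pages.
pages = [
    {"animal": "cat", "colour": "blue", "fruit": "lemon", "day": "monday"},
    {"animal": "fish", "colour": "green", "day": "saturday"},
    {"animal": "dog", "number": "10", "colour": "yellow",
     "fruit": "mango", "day": "tuesday"},
]


def build_table(pages):
    """Return (column_names, rows) for a list of title -> value dicts."""
    # Pass 1: collect every title that appears on any page.
    titles = set()
    for page in pages:
        titles.update(page)  # iterating a dict yields its keys

    # Fix an ordering; alphabetical is just one possible choice.
    column_names = sorted(titles)

    # Pass 2: build one row per page, using "" for missing titles.
    rows = [[page.get(name, "") for name in column_names] for page in pages]
    return column_names, rows


column_names, rows = build_table(pages)
print(" | ".join(column_names))
for row in rows:
    print(" | ".join(row))
```

Every row ends up with the same number of cells, so the columns stay aligned. The alphabetical ordering does not reflect the order in which titles appeared on the pages.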

wump
Ah fantastic, sets sound really useful. Although this method wouldn't retain the order in which the `title/value` pairs appeared on the page. How could you do that?
Acorn
You can look at the answers here: http://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set for implementations of Ordered Sets. Also, from Python 2.7/3.1 there's an OrderedDict in the standard library: http://docs.python.org/dev/library/collections.html#ordereddict-objects
miles82
I read over that. Do the values appear in the same order in each page? Even then, it will be difficult to reconstruct the order. You'd have to keep an extra list with the order (or ordered set) and determine where to insert a new key based on the keys around it.
wump
They will always be in the same order in relation to each other, but some pages will have values missing or have extra values.
Acorn
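
One possible way to do the "insert a new key based on the keys around it" idea from the comments, sketched under the assumption stated above that titles always keep the same relative order across pages. Each page is stored as a list of (title, value) pairs so that the page order is preserved regardless of Python version. `merge_order` and `scraped` are hypothetical names, not part of any library.

```python
def merge_order(columns, titles):
    """Merge one page's ordered titles into the running column list.

    A title not seen before is inserted right after the most recent
    title from the same page that is already in the list.
    Assumes titles keep a consistent relative order across pages.
    """
    pos = 0
    for title in titles:
        if title in columns:
            pos = columns.index(title) + 1
        else:
            columns.insert(pos, title)
            pos += 1
    return columns


# Hypothetical scraped pages, each an ordered list of (title, value) pairs.
scraped = [
    [("animal", "cat"), ("colour", "blue"), ("fruit", "lemon"), ("day", "monday")],
    [("animal", "fish"), ("colour", "green"), ("day", "saturday")],
    [("animal", "dog"), ("number", "10"), ("colour", "yellow"),
     ("fruit", "mango"), ("day", "tuesday")],
]

# Pass 1: build the merged column order.
ordered_columns = []
for page in scraped:
    merge_order(ordered_columns, [title for title, _ in page])

# Pass 2: build aligned rows, with "" for missing titles.
table = [[dict(page).get(name, "") for name in ordered_columns] for page in scraped]

print(" | ".join(ordered_columns))
for row in table:
    print(" | ".join(row))
```

For the three example pages this produces the column order animal, number, colour, fruit, day. If two pages disagree about the relative order of titles, the result depends on which page is processed first.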