I have a client who puts a separate vCard on each page of their site. These are pasted into a WordPress text field. (Not the most efficient way to maintain a list of people, but I won't editorialize after the fact.) My mission is to write something that parses all the addresses in the vCards and dumps the information into a central database. That would let me geocode each address with lat and lng coordinates from Google and display a lovely front page with pins galore.
This page would show all the vCards from the rest of the pages of the site.

Here is a sanitized example of a vCard on the site. In reality it would be surrounded by a lot of dubious HTML code:

<div class="vcard">
<span class="fn org">XYZ Org Name</span><br />
<span class="url">http://www.someurl.com/</span>
<div class="adr"><span class="street-address">1234 Main Ave</span><br />
<span class="locality">Chicago</span><br />
<span class="region">IL</span><br /><span class="postal-code">60647</span></div>
</div>

Each page has one of these. Spidering through the entire site and collecting them into an array is a bit out of my league, but I can handle dumping them into a database using PHP and MySQL.
Any and all advice would be welcome!
EDIT: Not sure how important this is, but I am pulling the data from a different server.

A: 

I believe you are looking for an HTML parser. Here is an HTML parsing module for Python.

You need to parse the relevant data out of all the HTML files and then do whatever with it.

I have not tried any PHP HTML parsers, so I can't recommend one, but since you are working on a web server I'm hoping it has Perl. Take a look at the Perl HTML parsers.

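# This snippet collects the text of each organization name.
# It assumes these subs are methods of an HTML::Parser subclass.

my (@org_names, $in_org);

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    # See if we have found <span class="fn org">.
    if ($tag =~ /^span$/i && ($attr->{'class'} // '') =~ /^fn org$/i) {
        $in_org = 1;
        push @org_names, '';    # start a new entry for this span
    }
}

# Append text seen inside the span; stop at the closing </span>.
sub text { my ($self, $text) = @_; $org_names[-1] .= $text if $in_org; }
sub end  { my ($self, $tag)  = @_; $in_org = 0 if $tag =~ /^span$/i; }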

Now you have an @org_names array that contains all the organization names.

Omnipresent
I can't run Python on my server.
WillKop
A: 

Try the DOMDocument class' loadHTML method. Then you can use DOMDocument methods to select the nodes, attributes and values you want. Or if you're familiar with XPath, you can also instantiate a DOMXPath object to query against the loaded DOMDocument to select the desired data.
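
As a rough sketch of that approach, something like the following might work against the sample vCard from the question. The function name `extractVcards` and the output keys (`name`, `street`, `city`, and so on) are made up for illustration, and the code assumes every card follows the class names shown in the sample markup.

```php
<?php
// Hypothetical helper: pull the hCard fields out of one page's HTML.
function extractVcards(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // A class attribute can hold several tokens ("fn org"), so match whole tokens.
    $hasClass = function (string $name): string {
        return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
    };

    // Output key => class name used in the sample markup (keys are assumptions).
    $fields = [
        'name' => 'fn', 'url' => 'url', 'street' => 'street-address',
        'city' => 'locality', 'state' => 'region', 'zip' => 'postal-code',
    ];

    $cards = [];
    foreach ($xpath->query('//*[' . $hasClass('vcard') . ']') as $card) {
        $row = [];
        foreach ($fields as $key => $class) {
            // Search only inside the current vcard element.
            $node = $xpath->query('.//*[' . $hasClass($class) . ']', $card)->item(0);
            $row[$key] = $node ? trim($node->textContent) : null;
        }
        $cards[] = $row;
    }
    return $cards;
}

$html = <<<HTML
<div class="vcard">
<span class="fn org">XYZ Org Name</span><br />
<span class="url">http://www.someurl.com/</span>
<div class="adr"><span class="street-address">1234 Main Ave</span><br />
<span class="locality">Chicago</span><br />
<span class="region">IL</span><br /><span class="postal-code">60647</span></div>
</div>
HTML;

$cards = extractVcards($html);
print_r($cards);
```

Since the pages live on a different server, the HTML string would have to be fetched first, for example with `file_get_contents($url)` (which needs `allow_url_fopen` enabled) or cURL. That step, and the list of page URLs to visit, are left out of this sketch.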

grantwparks