views: 140
answers: 3
Hello.
Please forgive what is most likely a stupid question. I've successfully followed the simplehtmldom examples and extracted the data I want from one web page.

I want the function to go through all the HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused: I thought I could somehow use PHP to build an array of the filenames in the directory, but I'm struggling with this.

Also, a lot of the examples I've seen use curl. Can someone please tell me how this should be done? There are a significant number of files. I've tried concatenating them, but that only works through an HTML editor; using cat doesn't work.

A: 

Do you want to get the whole content of the HTML file, or just some specific data?

svlada
I want to retrieve specific data from tags within the HTML pages. The retrieval works on a single file. I probably confused myself thinking about whether I had to read the contents of each file or just pass an array through. As you can tell, I'm no guru :-)
Jessica X
A: 

Assuming the parser you're talking about works fine, you could build a simple web spider: look at all the links on a page, build a list of "links to scan", and then scan each of those pages...

You should take care to handle circular references, though.
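
Something like this minimal sketch, assuming simplehtmldom's file_get_html() and find() are available; the starting URL is a placeholder, and relative links aren't resolved here:

<?php
// Minimal spider sketch using simplehtmldom.
// The starting URL is a hypothetical example; relative hrefs are not resolved.
include 'simple_html_dom.php';

$toScan  = array('http://example.com/index.html'); // hypothetical start page
$visited = array();                                // guards against circular references

while (!empty($toScan)) {
    $url = array_shift($toScan);
    if (isset($visited[$url])) {
        continue; // already scanned this page
    }
    $visited[$url] = true;

    $html = file_get_html($url);
    if (!$html) {
        continue;
    }

    // ... extract whatever data you need from $html here ...

    // Queue every link found on the page.
    foreach ($html->find('a') as $link) {
        $href = $link->href;
        if ($href && !isset($visited[$href])) {
            $toScan[] = $href;
        }
    }

    $html->clear(); // free the parser's memory
}
?>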

Quamis
+1  A: 

You probably want to use glob('some/directory/*.html') (manual page) to get a list of all the files as an array. Then iterate over that and run the DOM parsing for each filename.

You only need curl if you're pulling the HTML from another web server; if the files are stored on your own server, glob() is what you want.
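
For example, a sketch of that approach (the directory path and the td.price selector are placeholders for whatever tags actually hold your data):

<?php
// glob() + simplehtmldom over a directory of local HTML files.
include 'simple_html_dom.php';

$files = glob('some/directory/*.html'); // array of matching filenames

foreach ($files as $filename) {
    $html = file_get_html($filename);   // works on local paths as well as URLs
    if (!$html) {
        continue;
    }

    // Replace this selector with whatever tags you're extracting.
    foreach ($html->find('td.price') as $cell) {
        echo $filename . ': ' . $cell->plaintext . "\n";
    }

    $html->clear(); // free the parser's memory before the next file
}
?>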

Ollie Saunders
Thank you very much. Works like a charm. Thank you thank you thank you.
Jessica X