This website, http://courses.westminster.ac.uk/CourseList.aspx, lists over 250 courses in one list. I want to get the name of each course and insert it into my MySQL database using PHP. The courses are listed like this:

<td> computer science</td>
<td> media studies</td>
etc...

Is there a way to do that in PHP, instead of me having a mad data-entry nightmare? Thank you very much :))

+2  A: 

Hello, you can use this HTML parsing PHP library to achieve this: http://simplehtmldom.sourceforge.net/

greg0ire
Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
+4  A: 

Regular expressions work well.

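$page = ""; // get the page: put the HTML source in this string
$lines = preg_split("/\r?\n/", $page);
foreach ($lines as $text) {
    $matches = array();
    // trim() removes indentation so the ^ and $ anchors can match
    if (preg_match("/^<td>(.*)<\/td>$/", trim($text), $matches)) {
        // insert $matches[1] into the database
    }
}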

See the documentation for preg_match.
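
For what it's worth, the same idea can also be written with a single `preg_match_all` call instead of a line-by-line loop. This is only a minimal sketch, assuming every course sits in its own plain `<td>` as in the question. The `extractCourseNames` helper and the `$sample` string are made up for illustration; in practice the HTML would come from the real page.

```php
<?php
// Hypothetical helper: pull the text of every plain <td>...</td> cell out of an HTML string.
function extractCourseNames(string $html): array
{
    $matches = array();
    // Non-greedy (.*?) keeps each match inside one cell; the "s" flag lets "." span newlines.
    preg_match_all('/<td>(.*?)<\/td>/s', $html, $matches);

    $names = array();
    foreach ($matches[1] as $cell) {
        // Drop any nested tags (links, for example), decode entities, trim whitespace.
        $name = trim(html_entity_decode(strip_tags($cell)));
        if ($name !== '') {
            $names[] = $name;
        }
    }
    return $names;
}

// Stand-in for the real page source.
$sample = "<table><tr><td> computer science</td></tr>\n<tr><td> media studies</td></tr></table>";
print_r(extractCourseNames($sample));
```

Like the loop, this only works as long as the cells really are written as bare `<td>` tags; a `<td class="...">` would be missed.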

alpha123
Oh, I love this, this is exactly what I need. But can you elaborate on how I'm going to get the page? In terms of inserting, do you just insert $matches[1] into the database, or does it have to change to $matches[2] etc.?
getaway
Just insert $matches[1] into the database. It will be updated every iteration of the loop. An easy way to get the page is `file_get_contents("http://your-url.com/page.html")`.
alpha123
[obligatory link telling you Regex aint for parsing HTML](http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html)
Gordon
Yeah, I know, but for a quick-and-dirty job like this that he's only gonna use once and he already knows the structure of the HTML, regexes are really convenient. Besides, if he wants maintainable, bug-free code he should stay away from PHP....
alpha123
They are not any more convenient than using a proper parser. And please keep the language bias away. No language is bugfree and there is no reason why you would not be able to create a maintainable application with PHP (unless you are a bad developer).
Gordon
I got 6 lines (took me less than 2 minutes), not including insertion into the database. And there is a difference between a bug-free language and a language that makes it easy to write buggy code.
alpha123
It takes 5 lines of code with DOM (excluding insertion). It takes less than 1 minute to write. And it's much more reliable than your Regex. And I still don't see why PHP should make it any easier than any other scripting language to write buggy code.
Gordon
Nice. If you know how to use the library, of course.... Doesn't downloading, installing, and learning how to use a full HTML-parsing library seem a little overkill to you? And PHP never warns you when it should, is way too loosely typed, and supports combining presentation and logic by embedding PHP directly in HTML.
alpha123
No, I don't think that is overkill. DOM is a native extension of PHP and enabled by default, so there is nothing to download or install. DOM is an implementation of the language-agnostic W3C DOM interface, so chances are the OP already knows it from another language. With five lines of code there isn't much to learn, and Regex patterns and functions have to be learned too, so you hardly have an argument. I won't argue about your other claims since they are nonsense. Maybe you should heed your own advice to the OP instead and not use PHP (or give ill advice about it, for that matter).
Gordon
Sorry. I am not a PHP programmer and didn't know DOM stuff was built in. I saw other people suggesting separate libraries and assumed it wasn't. In the languages that I come from regexes would be the most convenient (though certainly not the best) way to do this. Also, I couldn't help but notice that your "5-line 1-minute" answer is missing. Consider posting it if you want to show him the "real way to do it". And yes, I don't use PHP (for the reasons I stated, plus the community seems obnoxious) and wouldn't recommend it to anybody but he specifically said "is there a way to do this in PHP".
alpha123
I don't post it because the [question is a duplicate](http://stackoverflow.com/search?q=dom+regex+php). The problem is [finding the question to closevote is tedious](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php). One OP wants to parse td elements while the other wants img elements. The approach is always the same, yet there is at least one question like this daily. [I've answered so many](http://stackoverflow.com/search?q=user%3A208809+dom+regex) of them by now that I got the Regex Badge for providing DOM solutions. It's just tiring by now.
Gordon
I don't agree with your comments about the PHP community. If you go to a PHP conference (that's where the community is) you will notice the people are quite cheerful, helpful and open-minded. Of course, when you approach them with bias and silly arguments about how PHP doesn't do this or that, they likely won't react like that anymore. PHP is what it is (and it is very successful the way it is). Yes, it is not perfect, but neither are other languages.
Gordon
I'm not even going to bother responding to something as subjective as this. Suffice it to say, if you want something done right, do it yourself. So if you want him to do things your way, show him your way.
alpha123
Yes, actually you're right on this one. I've spent so much time on this question already that I might as well add the DOM solution (actually I already did). Of course I am being subjective on the community (they gave me free beer) but I don't think [PHP's success](http://blogs.gartner.com/mark_driver/2009/12/03/php-past-present-and-future/) can reasonably be denied. You are free to think differently and you are also free to nurture your bias against PHP. But to repeat myself, I'd just appreciate it if you'd keep it to yourself (at least within StackOverflow's PHP tag - it's not helpful).
Gordon
Okay, the DOM solution is definitely nicer. PHP is (very) successful because it is easy to learn and use (and that's also the reason I think it is bad; it combines the presentation and logic of a page). And I'm allowed to express my opinions about PHP, thank you. All I said was if he wanted more maintainable code, he should stay away from PHP (it's easier for learning developers to abuse PHP than, say, Python or Scheme).
alpha123
Whether you combine presentation and logic of a page or not is up to you. If you are just going to whip out a small homepage it's perfectly fine to do so. Keep it simple. If you are going to write an enterprise webapp, you probably will use MVC and that's very much possible (and encouraged and established) too. As for PHP being easy to learn, yes that's true, but don't tell me you haven't seen sloppy "professional" code in languages that are hard to learn. Bad code exists in every language.
Gordon
Yeah, I admit, I've prototyped things in PHP before. I never would be foolish enough to use it in production though. It doesn't even auto-escape HTML (at least if it does, it isn't on by default). I've seen sloppy Java code, believe me. And it's not that the language is too easy to learn, it's not strict enough, IMO.
alpha123
I don't think it's fair to blame the entire language just because it doesn't do a particular thing. You could still write your own output function that wraps `echo` and `htmlentities` for escaping, so it's not a big thing. I'm not sure in what regard you consider PHP not strict enough, but I am also quite sure that you wouldn't convince me of it anyway. For me, PHP is fine. I make a living from it. I agree PHP is not an overly pretty language. But ultimately any language is just a tool. And PHP does very well as a tool.
Gordon
I'm not blaming the whole language because of it. I'm saying you have to be very careful using it in production environments. PHP is not strict enough means it is too easy to make messy bad code. I've seen bad Python, but not as bad as some PHP I've seen, because Python is more structured. And I've never seen bad Scheme, but that's because one doesn't find much Lisp code out there. I don't make a living off any language (because I'm 14) but for me, PHP isn't good enough.
alpha123
Anyway, you're entitled to your opinion about PHP as much as I am. So let's just agree to disagree, 'cause I'm getting tired of this.
alpha123
A: 

I encountered the same problem. Here is a good class library called Simple HTML DOM: http://simplehtmldom.sourceforge.net/. It works like jQuery.

Sam
A: 

Just for fun, here's a quick shell script to do the same thing.

curl http://courses.westminster.ac.uk/CourseList.aspx \
| sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
| uniq > courses.txt
no
A: 

How to parse HTML has been asked and answered countless times before. While (for your specific use case) Regular Expressions will work, it is, in general, better and more reliable to use a proper parser for this task. Below is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
foreach($dom->getElementsByTagName('td') as $title) {
    echo $title->nodeValue;
}
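
If you would rather collect the names into an array than echo them, something along these lines might do. It is a sketch only: the `courseNamesFromHtml` helper and the `$sample` markup are invented for illustration, and it parses a string with `loadHTML` so it can be tried without fetching the page.

```php
<?php
// Hypothetical helper: return the trimmed text of every <td> in an HTML string.
function courseNamesFromHtml(string $html): array
{
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = array();
    foreach ($dom->getElementsByTagName('td') as $cell) {
        // nodeValue is the cell's text content, including text inside nested tags.
        $name = trim($cell->nodeValue);
        if ($name !== '') {
            $names[] = $name;
        }
    }
    return $names;
}

// Stand-in for the real page source.
$sample = '<table><tr><td> computer science</td></tr>'
        . '<tr><td> <a href="#">media studies</a></td></tr></table>';
print_r(courseNamesFromHtml($sample));
```

Unlike a pattern anchored on a bare `<td>`, this keeps working when the cells carry attributes or wrap the name in a link.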

For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on StackOverflow, so please use the search function.
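
As a rough idea of what the insertion step could look like with mysqli, here is a sketch using a prepared statement. The `insertCourses` helper, the `courses` table, its `name` column and the connection details are all assumptions for illustration, not something taken from the question.

```php
<?php
// Hypothetical helper: insert each course name into an assumed `courses` table
// that has a `name` column. Returns the number of rows inserted.
function insertCourses(mysqli $db, array $names): int
{
    // A prepared statement leaves the quoting to MySQL, so odd characters in names are safe.
    $stmt = $db->prepare('INSERT INTO courses (name) VALUES (?)');
    if ($stmt === false) {
        throw new RuntimeException('prepare failed: ' . $db->error);
    }

    $name = '';
    $stmt->bind_param('s', $name); // bound by reference; picks up each new value below

    $inserted = 0;
    foreach ($names as $name) {
        if ($stmt->execute()) {
            $inserted++;
        }
    }
    $stmt->close();
    return $inserted;
}

// Usage (placeholder credentials, not run here):
// $db = new mysqli('localhost', 'user', 'password', 'mydb');
// echo insertCourses($db, array('computer science', 'media studies')), " rows inserted\n";
```

Preparing the statement once and re-executing it for each name avoids building SQL strings by hand, which matters when the text comes from a scraped page.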

Gordon