Given an HTML page, I would like to get all files of type 'x' that are embedded in the page or linked from it, where 'x' is one of:
- Images (JPG, PNG, GIF, ...)
- Documents (Word, PowerPoint, PDF...)
- Flash (.flv, .swf)
How do I do this?
1. Images are easy to extract because they are either linked to with a link ending in (.png|.jpg|...) or embedded with an img tag.
2. Documents cannot be embedded; they can only be linked to (with a link ending in .doc|.ppt|.pdf|...), so they are also easy to get.
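The two points above can be sketched roughly as follows. This is only a minimal sketch using Python's standard `html.parser`; the names `ImageAndDocCollector` and `collect_images_and_docs`, and the extension lists, are my own assumptions for illustration and are certainly not complete.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extension lists; extend as needed.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}
DOC_EXTS = {".doc", ".docx", ".ppt", ".pptx", ".pdf"}


class ImageAndDocCollector(HTMLParser):
    """Collects image URLs from <img src> and <a href>, and document URLs from <a href>."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Embedded image: take the src regardless of its extension.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            # Linked file: classify by the extension of the URL path,
            # ignoring any query string.
            url = urljoin(self.base_url, attrs["href"])
            ext = os.path.splitext(urlparse(url).path)[1].lower()
            if ext in IMAGE_EXTS:
                self.images.append(url)
            elif ext in DOC_EXTS:
                self.documents.append(url)


def collect_images_and_docs(html, base_url=""):
    """Return (images, documents) found in the given HTML string."""
    parser = ImageAndDocCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images, parser.documents
```

Calling `collect_images_and_docs(page_source, page_url)` returns two lists of absolute URLs. Matching on the extension alone will miss files served from URLs without one, so this is a starting point rather than a complete solution.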
Here is my problem:
How do I get the Flash files that are embedded in web pages?
Please give me a pseudo-algorithm or a regex pattern.
If I am wrong in my points above (1. and 2.), please tell me that too.
Thanks!
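
One possible starting point for the Flash case, assuming the page uses the usual static markup: Flash is typically embedded with an `<embed src="...">` tag, an `<object data="...">` tag, or an `<object>` containing `<param name="movie" value="...">`, often with the MIME type `application/x-shockwave-flash`. The sketch below scans for those, plus plain links ending in .swf or .flv. The names `FlashCollector` and `collect_flash` are hypothetical. It will not find Flash that is inserted by JavaScript at load time, and .flv videos are often passed to a player .swf through parameters rather than referenced directly.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FLASH_EXTS = {".swf", ".flv"}
FLASH_MIME = "application/x-shockwave-flash"


class FlashCollector(HTMLParser):
    """Collects Flash URLs from <embed>, <object>, <param> and <a> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.flash = []

    def _looks_like_flash(self, url, mime):
        # Accept either a Flash file extension or an explicit Flash MIME type.
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        return ext in FLASH_EXTS or (mime or "").lower() == FLASH_MIME

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "embed":
            candidate = attrs.get("src")
        elif tag == "object":
            candidate = attrs.get("data")
        elif tag == "param" and (attrs.get("name") or "").lower() in ("movie", "src"):
            candidate = attrs.get("value")
        elif tag == "a":
            candidate = attrs.get("href")
        else:
            return
        if candidate and self._looks_like_flash(candidate, attrs.get("type")):
            url = urljoin(self.base_url, candidate)
            # <object> and its nested <embed> usually name the same file.
            if url not in self.flash:
                self.flash.append(url)


def collect_flash(html, base_url=""):
    """Return a de-duplicated list of Flash URLs found in the HTML string."""
    parser = FlashCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.flash
```

A real parser is used here instead of a regex because attribute order, quoting and nesting vary too much between pages for a single pattern to be reliable.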