ansaurus

Question

Answer 1

A:

If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:

(defun getlinks ()
  (beginning-of-buffer)
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  (beginning-of-buffer)
  (replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
  (beginning-of-buffer)
  (replace-regexp "
+" "
")
  (beginning-of-buffer)
  (replace-regexp "^LINK:\\(.*\\)$" "\\1")
)

It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".

Detailed HOWTO: (1) Correct the bug in your example html file by replacing <href with <a href, (2) copy the above function into Emacs scratch, (3) hit C-x C-e after the final ")" to load the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).

Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.

Heinzi 2009-10-29 10:09:30

Answer 2

A:

You can use the 'xml library, examples of using the parser are found here. To parse your particular file, the following does what you want:

(defun my-grab-html (file)
  (interactive "fHtml file: ")
  (let ((res (car (xml-parse-file file)))) ; 'car because xml-parse-file returns a list of nodes
    (mapc (lambda (n)
            (when (consp n) ; don't operate on the whitespace, xml preserves whitespace
              (let ((link (cdr (assq 'href (xml-node-attributes n)))))
                (when link
                  (insert link)
                  (insert "|")
                  (insert (car (xml-node-children n))) ;# grab the text for the link
                  (insert "\n")))))
          (xml-node-children res))))

This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.

Trey Jackson 2009-10-29 16:05:56

Answer 3

+1 A:

I took Heinzi's solution and came up with the final solution that I needed. I can now take a list of files, extract all URL's and titles, and place the results in one output buffer.

(defun extract-urls (fname)
 "Extract HTML href url's,titles to buffer 'new-urls.csv' in | separated format."
  (setq in-buf (set-buffer (find-file fname))); Save for clean up
  (beginning-of-buffer); Need to do this in case the buffer is already open
  (setq u1 '())
  (while
      (re-search-forward "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)

      (when (match-string 0)            ; Got a match
        (setq url (match-string 1) )    ; URL
        (setq title (match-string 2) )  ; Title
        (setq u1 (cons (concat url "|" title "\n") u1)) ; Build the list of URLs
       )
      )
  (kill-buffer in-buf)    ; Don't leave a mess of buffers
  (progn
    (with-current-buffer (get-buffer-create "new-urls.csv"); Send results to new buffer
      (mapcar 'insert u1))
    (switch-to-buffer "new-urls.csv"); Finally, show the new buffer
    )
  )

;; Create a list of files to process
;;
(mapcar 'extract-urls '(
                       "/tmp/foo.html"
                       "/tmp/bar.html"
            ))

2009-11-01 14:02:26

Great to see your lisp skills combined with my regexp. Thanks for sharing the final solution.

Heinzi 2009-11-01 14:26:36

ansaurus

tags:

views:

answers:

Extracting URLs from an Emacs buffer?

related questions