Hi,

I'm going to download several thousand web pages for future language processing, and I'm deciding which metadata to save. I've looked into this, but I don't want to overlook anything important. So far I plan to store:

<title>
<link>
<publish_date>
<date_downloaded>
<source>  // to this page
<keyword> // for Solr indexing
<text>    // cleaned body of page

Is there anything important that I might miss and regret later?

+1  A: 

There are some others that you might find useful:

  • Document type (is it an article, an advertisement, a landing page, etc.)
  • Subtitle/Headline/Abstract
  • Image location (URLs of the images, if you want to display them in your web app)
  • Author
  • Section (so you can use fq in your Solr queries to restrict results to specific sections)
Pascal Dimassimo
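
For illustration only, here is a minimal Python sketch of a record that combines the fields from the question with the ones suggested in the answer, plus a helper that builds a query string using fq on the section. The class name, field names, and the choice to store dates as strings are assumptions, not something either poster specified, and the Solr field names would have to match whatever schema you actually define.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
from urllib.parse import urlencode


@dataclass
class PageRecord:
    """Hypothetical per-page metadata record (names are illustrative)."""
    # Fields listed in the question
    title: str
    link: str
    text: str                      # cleaned body of the page
    source: str
    date_downloaded: str           # assumed to be an ISO 8601 string
    publish_date: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    # Fields suggested in the answer
    doc_type: Optional[str] = None
    subtitle: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)
    author: Optional[str] = None
    section: Optional[str] = None

    def to_doc(self) -> dict:
        """Return a dict without empty values, ready to be serialized."""
        return {k: v for k, v in asdict(self).items() if v not in (None, [], "")}


def section_query(text_query: str, section: str) -> str:
    """Build URL query parameters with an fq filter on the section field."""
    return urlencode({"q": text_query, "fq": f'section:"{section}"'})
```

Saving optional fields as empty rather than skipping them at download time keeps the door open for filling them in later; `to_doc()` simply leaves them out of the indexed document until then.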