Hi,
I'm going to download several thousand web pages for later use in language processing, and I'm deciding which metadata to save with each one. I've looked into this, but I don't want to overlook anything important. So far I plan to store:
<title>
<link>
<publish_date>
<date_downloaded>
<source> // source of this page
<keyword> // for Solr indexing
<text> // cleaned body of page
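To make the list concrete, here is a rough sketch of how one stored record could look in Python. The field names come from the list above; the `PageRecord` class, the `to_json` helper, the ISO-8601 date strings, and the example values are all my own assumptions for illustration, not a fixed schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class PageRecord:
    """One downloaded page plus its metadata (hypothetical layout)."""
    title: str
    link: str
    source: str
    date_downloaded: str                 # assumed ISO-8601 string
    text: str                            # cleaned body of the page
    publish_date: Optional[str] = None   # may be unknown for some pages
    keyword: List[str] = field(default_factory=list)  # for Solr indexing


def to_json(record: PageRecord) -> str:
    """Serialize a record to a JSON string, keeping non-ASCII text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Made-up example values, only to show the shape of a record.
example = PageRecord(
    title="Example page",
    link="https://example.com/article",
    source="example.com",
    date_downloaded="2024-01-01T12:00:00Z",
    text="Cleaned body text goes here.",
    keyword=["example", "demo"],
)

print(to_json(example))
```

Keeping `publish_date` optional is deliberate in this sketch, since many pages do not state when they were published.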
Is there anything else important that I might regret not saving later?