Extracting dates from html meta data in FAST-ESP | ansaurus

tags:

fast-esp

+1 Q:

Extracting dates from html meta data in FAST-ESP

During document processing I want to extract all dates from the HTML meta data and then identify the latest one, which will be used to populate a date field (dtgeneric1).

<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data

Inspection using spy stages shows that our pipeline already adds meta_* attributes, but the meta data names differ across documents from different sources.

#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes

Ideally we would like to pass all the meta_* attributes to a Python stage and use it to work out which of them are dates and which date is the latest, but there seems to be no way of specifying "all meta attributes" as input.

Has anyone done something similar who can offer advice on the best way to do this?

Thanks

Neil

+1 A:

Hi,

I suppose a custom stage would do the job: it takes all the needed date attributes as input, compares them to find the newest date, and outputs the most recent one.
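As a rough illustration of the comparison step only, here is a minimal Python sketch. It assumes the stage has already collected the document's attributes into a plain dict of name to string value; how a stage reads attributes from the docproc document object is not shown and depends on the ESP API. The helper names (`parse_date`, `latest_meta_date`), the list of date formats and the `meta_author` entry are assumptions made for the example, not part of FAST ESP.

```python
from datetime import datetime

# Assumed candidate formats; the first matches the sample meta tags in the
# question, the others are guesses to extend for your own sources.
DATE_FORMATS = ("%Y/%m/%d %H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_date(value):
    """Return a datetime if value matches one of DATE_FORMATS, else None."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def latest_meta_date(attributes, prefix="meta_"):
    """Return the newest date among attributes whose name starts with prefix."""
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue  # not a meta attribute
        parsed = parse_date(value)
        if parsed is not None:  # non-date meta attributes are skipped
            dates.append(parsed)
    return max(dates) if dates else None


# Stand-in for the attributes seen by the spy stage.
attributes = {
    "meta_originalpublicationdate": "2010/04/21 12:06:36",
    "meta_lastmodificationdate": "2010/04/22 14:10:16",
    "meta_author": "not a date",        # hypothetical non-date meta attribute
    "title": "2011/01/01 00:00:00",     # ignored: no meta_ prefix
}

latest = latest_meta_date(attributes)
print(latest.isoformat())  # 2010-04-22T14:10:16
```

Trying a fixed list of formats keeps the "is this a date?" check explicit, so a free-text meta value is never mistaken for a date. The resulting value would still need to be written to dtgeneric1 in whatever format that field expects.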

2010-05-09 08:42:03
