Indexing text content of html | ansaurus

Q:

Indexing text content of html

I want to extract the text from HTML files for indexing, and do it as fast as possible. Rather than write something from scratch, I'd like to see how much existing code I can reuse.

Currently I just pipe in the output of html2text. That works, but since it is written in Python and spends effort prettifying the text, I'm sure the speed could be improved.

So, with Linux/Unix as the priority, which C/C++ libraries would be best suited to this kind of task?

+2 A:

To extract the text, you can use an HTML parser such as htmlcxx or libxml. You can also use any XML library after tidying up the HTML. To index the text, you can use CLucene.

Vijay Mathew 2010-01-28 06:49:28

libxml will do. Xapian is the indexer in this case.

Named 2010-01-28 07:14:09
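
For a rough sense of what the extraction step involves, and as a speed baseline to compare a real parser against, here is a minimal single-pass sketch in plain C. It is not how htmlcxx or libxml work, and it is not a real HTML parser: it does not decode entities, and it will be confused by a `>` inside a comment or an attribute value. The names `strip_tags`, `tag_is` and `starts_with_ci` are hypothetical helpers made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check: does s start with prefix? */
static int starts_with_ci(const char *s, const char *prefix) {
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* p points just past '<'. True if the tag name there is exactly `name`. */
static int tag_is(const char *p, const char *name) {
    char c;
    if (!starts_with_ci(p, name))
        return 0;
    c = p[strlen(name)];
    return c == '>' || c == '/' || c == '\0' || isspace((unsigned char)c);
}

/*
 * Copies the text outside tags from `html` into `out` (capacity `cap`,
 * including the terminating NUL). Runs of whitespace and tag boundaries
 * collapse to one space; <script> and <style> bodies are skipped.
 * Output that does not fit is silently truncated.
 * Returns the number of characters written, not counting the NUL.
 */
size_t strip_tags(const char *html, char *out, size_t cap) {
    size_t n = 0;
    int pending_space = 0;
    const char *p = html;

    assert(html != NULL && out != NULL && cap > 0);

    while (*p) {
        if (*p == '<') {
            const char *closer = NULL;
            if (tag_is(p + 1, "script"))
                closer = "</script";
            else if (tag_is(p + 1, "style"))
                closer = "</style";

            /* Skip to the end of this tag. */
            while (*p && *p != '>')
                p++;
            if (*p)
                p++;

            /* For script/style, also skip the body and the closing tag. */
            if (closer) {
                while (*p && !starts_with_ci(p, closer))
                    p++;
                while (*p && *p != '>')
                    p++;
                if (*p)
                    p++;
            }
            pending_space = 1;
        } else if (isspace((unsigned char)*p)) {
            pending_space = 1;
            p++;
        } else {
            /* Emit one separator before a word, but never a leading one. */
            if (pending_space && n > 0 && n + 1 < cap)
                out[n++] = ' ';
            pending_space = 0;
            if (n + 1 < cap)
                out[n++] = *p;
            p++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Calling `strip_tags(page, buf, sizeof buf)` on a buffer holding a whole file leaves the plain text in `buf`, ready to hand to the indexer. For real-world, malformed pages, one of the parsers mentioned in the answer is the safer choice; a sketch like this is mainly useful for measuring how much overhead a full parser adds.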
