text-extraction

How to get the number of results found for a keyword in Google

I need to supply a keyword like "blue metal kettle" (with/without quotes) and get only the number of results found for this search. If I search without quotes right now, I get: Results 1 - 10 of about 1,040,000 for blue metal kettle. (0.19 seconds) Here '1,040,000' is the number I want. Is there any API function to do this, or I must...

Advanced PDF Parsing Using Python (extracting text without tables, etc.): What's the Best Library?

I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be prob...

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to goo...

Get Selected text in browser programmatically

Hi, From my Windows application, I want to detect selected text in "Internet Explorer", Firefox and any other browser. Do you know what piece of code I should use in order to achieve this? Thanks, The idea is not to search for a text in IE, but instead to "capture the selected text" in IE. By the way not only IE, but any windows applica...

How do I extract lines from a file using their line number on unix?

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how would I get these 4 lines? What if I have a fairly large number of lines I need to extract? If I had a file with 100 lines, each representing a line number that I wanted to extract from another file, how would I do that? ...
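
The question asks for sed or similar; as one possible alternative, here is a minimal Python sketch of the same idea. It assumes the wanted line numbers are 1-based and that output in file order is acceptable. The helper names `extract_lines` and `read_line_numbers` are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, Iterator, Set


def read_line_numbers(lines: Iterable[str]) -> Set[int]:
    """Parse one line number per line, ignoring blank lines."""
    return {int(text) for text in lines if text.strip()}


def extract_lines(lines: Iterable[str], wanted: Set[int]) -> Iterator[str]:
    """Yield the lines whose 1-based position is in `wanted`, in file order."""
    if not wanted:
        return
    last = max(wanted)
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield line
        if number >= last:
            break  # nothing left to find, so stop reading early


# Example with in-memory data; real use would pass open file objects instead.
data = ["alpha\n", "beta\n", "gamma\n", "delta\n", "epsilon\n"]
numbers = read_line_numbers(["1\n", "5\n"])
selected = list(extract_lines(data, numbers))
```

Because both helpers accept any iterable of lines, a large input file is streamed rather than loaded into memory, and reading stops after the highest requested line.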

How to extract text from reasonably sane HTML?

My question is sort of like this question but I have more constraints: I know the documents are reasonably sane; they are very regular (they all came from the same source); I want about 99% of the visible text; about 99% of what is visible at all is text (they are more or less RTF converted to HTML); I don't care about formatting or even pa...
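
For documents this regular, one possible approach is a small sketch built on Python's standard `html.parser` module. It assumes that only `<script>` and `<style>` contents need to be skipped and that collapsing all whitespace is acceptable, since formatting does not matter here. The class name `TextCollector` and the function `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words in adjacent elements apart.
    return " ".join(" ".join(parser.chunks).split())
```

Joining chunks with a space means a word split by inline markup (for example `wor<b>ld</b>`) would be broken in two, which seems an acceptable trade-off when formatting is not a concern.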

Parsing SGML and storing it in a PHP array

If you can help with this you're a genius. Basically, I will have some text like this: <parent wealthy> <parent> <children female> <child> jessica <hobbies> basketball, soccer, video games </hobbies> </child> <child> jane <hobbies> ...

Parsing date from text using Ruby

I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions? ...

(Experimental) wrapper induction libraries for Java. Do any exist?

I'd like to know if any (experimental or not) wrapper induction libraries for Java exist. Given a website of choice, I would like to be able to point my code to the product pages of that website. The wrapper induction library should be able to: - infer the 'wrapper' or schema of the product pages from a couple of examples. - have ...

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com, that shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try ...

Extracting Window Contents

I need to extract a window's content if it is text-based, or at least the file path associated with that window. To date, I have considered: 1. win32api 2. 3rd party libraries 3. wrapper classes However, I am not satisfied with the solutions. So any ideas how this can be done in a clean way? ...

Extract strings in Python

Basically, I want to extract the strings "AAA", "BBB", "CCC", "DDD" from a text file.. ...... (other text goes here)..... <TD align="left" class=texttd><font class='textfont'>AAA</font></TD> ..... (useless text here)..... <TD align="left" class=texttd><font class='textfont'>BBB</font></TD> ....(more text)..... <TD align="left" class=tex...
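
One way this could be approached is a short regular-expression sketch. It assumes every wanted value sits inside a `<font class='textfont'>` element exactly as in the sample rows, and that the markup is regular enough for a regex; for less predictable HTML a real parser would be safer. The function name `extract_cell_strings` is hypothetical.

```python
import re
from typing import List

# Assumption: the wanted strings are always wrapped in
# <font class='textfont'>...</font>, as in the sample rows.
FONT_CELL = re.compile(
    r"<font\s+class=['\"]textfont['\"]\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)


def extract_cell_strings(text: str) -> List[str]:
    """Return the stripped contents of every matching <font> element."""
    return [match.strip() for match in FONT_CELL.findall(text)]


sample = """(other text goes here)
<TD align="left" class=texttd><font class='textfont'>AAA</font></TD>
(useless text here)
<TD align="left" class=texttd><font class='textfont'>BBB</font></TD>
"""
found = extract_cell_strings(sample)
```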

Is there a "Reply via email" script?

Posterous allows you to post a myriad of objects via email. We would like to allow users to reply to an email we send them, and then extract the content to use elsewhere. What is the most effective way of doing that? ...

Is there a way to extract the message from a JavaScript dialog in Chrome?

I’ve been working on an extension for automating tests in Chrome, and I came across an obscure issue with JavaScript dialogs. The message shown in the dialog can’t be readily retrieved/copied. I’ve used the GetWindowText and InternalGetWindowText functions, but they only return the title of the dialog and the text from the buttons, not ...

Need help working with the Jericho HTML Parser

Hi all, I've simply used the following program on the URL below: http://jericho.htmlparser.net/samples/console/src/ExtractText.java My goal is to be able to extract the main body text, to be able to summarize it and present the summarized text as output to the user. My problem is that I'm not sure how I'd modify the above program to on...

How should I extract % delimited tags?

I want to find each %tagname% in a file and copy only the tag name (without the % delimiters) into a dictionary in Python. ...
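
A minimal sketch of one way to do this, assuming tag names consist only of letters, digits, and underscores. The question does not say what the dictionary values should be, so this sketch stores the number of occurrences of each tag; that choice and the function name `collect_tags` are assumptions.

```python
import re
from typing import Dict

# Assumption: a tag is one or more word characters between two % signs.
TAG = re.compile(r"%(\w+)%")


def collect_tags(text: str) -> Dict[str, int]:
    """Map each tag name (without the % delimiters) to its occurrence count."""
    tags: Dict[str, int] = {}
    for name in TAG.findall(text):
        tags[name] = tags.get(name, 0) + 1
    return tags


template = "Dear %name%, your order %order_id% has shipped. Bye, %name%!"
tags = collect_tags(template)
```

To run this on a file, the text could be read with `open(path).read()` and passed to `collect_tags`.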

Extracting text from PDF with Poppler (C++)

I'm trying to get my way through Poppler and its (lack of) documentation. What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here. So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rec...

Tag generation from a small text content (such as tweets)

Hello, I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user Tweets to generate tags (keywords). And it seems like the accepted suggestion (point-wise mutual information algorithm) is meant to work on bigger documents. With this constraint (working on ...

Access Adobe InDesign files

I need some directions for the following problem: I have a lot of InDesign files and I have to set up a process that will track whether a certain paragraph or text block has changed between different versions of the file. If the text block has changed, I want to extract that text block in a "portable" format (html, pdf, txt). Is there an Adob...

I want to create an expression for querystrings, this stuff is hard!

I want to extract some keywords out of a query string for a search application in asp.net. I decoded the URL string first, so it's plain text. I have this to start with, but I want to add a keyword group: ([\?\&])q=[^\&]+[\&]? I get this: ?q=harbour landing dental& I'd like to trim off the stuff for pure words, but not sure if that's...
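
The question targets asp.net, but the idea can be illustrated with a short Python sketch that uses a query-string parser instead of a hand-written regular expression. It assumes the keywords live in a parameter named `q`, as in the sample, and that splitting on whitespace is enough to get "pure words". The function name `extract_keywords` is hypothetical.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def extract_keywords(url: str, param: str = "q") -> List[str]:
    """Return the whitespace-separated words of the `param` query value(s)."""
    query = urlsplit(url).query
    # parse_qs decodes '+' and percent-escapes and ignores empty pairs,
    # so a trailing '&' does no harm.
    values = parse_qs(query).get(param, [])
    words: List[str] = []
    for value in values:
        words.extend(value.split())
    return words


encoded = extract_keywords("http://example.com/search?q=harbour+landing+dental&page=2")
decoded = extract_keywords("?q=harbour landing dental&")
```

Letting the parser handle `?`, `&`, and decoding avoids having to trim the delimiters from a regex match afterwards.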