I'm trying to extract data from http://www.phillysheriff.com/old_site/properties.html
Ideally, I'd like to get a CSV file with the address, ward, price, and square footage. Is there an easy way to do this?
...
Hello,
Lately I've realized that one must be good at handling (parsing) text. The task may range from something as simple as interpreting an HTTP response or reading a settings file (*.ini, *.xml, or *.json) to something as hard as writing a compiler or regex engine.
I agree that now we have library functions/methods for interpreting popular formats of text. But...
Hi,
I hope someone can help. I am looking to create a small WinForms app in C# that reads the content of an email from a POP account and uploads key values to a SQL database automatically. The email format is always the same for each email, e.g.,
First name :
Last name :
Phone number :
etc...
Currently the emails are being store...
URI.extract claims to do this, but it doesn't handle matched parens:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
What's the best way to extract URLs from text without breaking parenthesized URLs (which users like to use)?
...
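One possible approach, sketched here in Python rather than Ruby: match candidate URLs greedily, then trim trailing punctuation and any closing parenthesis that has no matching opening parenthesis inside the URL. The function name `extract_urls` and the regex are assumptions made for illustration, not part of `URI.extract`.

```python
import re

# Hypothetical pattern: anything starting with http:// or https://
# up to the next whitespace, angle bracket, or quote.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_urls(text):
    """Return URLs found in text, dropping unbalanced trailing ')'."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in ".,;:!?":
                # Trailing sentence punctuation is unlikely to be part of the URL.
                url = url[:-1]
            elif last == ")" and url.count(")") > url.count("("):
                # A closing paren without a matching opener belongs to the prose.
                url = url[:-1]
            else:
                break
        urls.append(url)
    return urls

print(extract_urls("text here (http://foo.example.org/bla) and here"))
```

URLs that legitimately contain balanced parentheses, such as `http://en.wikipedia.org/wiki/Foo_(bar)`, keep their closing parenthesis under this heuristic.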
I'd love to read books properly on my Kindle.
To achieve my dream, I need a script to fix broken lines in a txt file.
For example, if the txt file has this line:
He watched Kahlan as she walked with her shoulders slumped
down.
... then it should fix it by deleting the newline before the word "down":
He watched Kahlan as she wa...
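A minimal sketch of one heuristic for this, in Python: join a line to the previous one whenever the previous line is non-empty and does not end in sentence-closing punctuation. The rule itself and the name `join_broken_lines` are assumptions; real books may need extra cases (headings, dialogue, hyphenated words).

```python
import re

def join_broken_lines(text):
    """Merge lines that were broken mid-sentence; keep blank lines as paragraph breaks."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        previous_is_open = (
            bool(out)
            and out[-1] != ""
            # Assumed rule: a line ending in one of these characters is complete.
            and not re.search(r"[.!?:\"']$", out[-1])
        )
        if previous_is_open and stripped:
            out[-1] = out[-1] + " " + stripped
        else:
            out.append(stripped)
    return "\n".join(out)

broken = "He watched Kahlan as she walked with her shoulders slumped\ndown."
print(join_broken_lines(broken))
```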
I have a text file that includes data inside {[]} tags. What would be the suggested way to parse that data so I can just use the data inside the tags?
Example text file would look like this:
'this is a bunch of text that is not {[really]} useful in any {[way]}. I need to {[get]} some items {[from]} it.'
I would like to end up with '...
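If the tags never nest, a non-greedy regular expression may be enough. The sketch below uses Python's `re.findall`; the function name `extract_tagged` is made up for the example, and the no-nesting assumption comes from the sample text, not from anything stated about the real file.

```python
import re

def extract_tagged(text):
    """Return the contents of every {[...]} tag, assuming tags do not nest."""
    return re.findall(r"\{\[(.*?)\]\}", text)

sample = ("this is a bunch of text that is not {[really]} useful in any {[way]}. "
          "I need to {[get]} some items {[from]} it.")
print(extract_tagged(sample))
```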
Hi.
I have a text and I would like to extract the most frequently recurring terms, even if they consist of more than one word (e.g., managing director, position, salary, web developer).
I need a library or an installable executable rather than a web service.
I came across some complex tools (such as Topia's Term Extraction, MAUI) that require tr...
I've simplified the problem somewhat, but I hope I've still captured the essence of my problem.
Let's say I have the following simple XML file:
<main>
outside1
===BEGIN===
inside1
====END====
outside2
=BEGIN=
inside2
==END==
outside3
</main>
Then I can use the following XSLT 2.0:
<?xml version="1.0" encodin...
My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whit...
I am writing a piece of code that will extract words from running text. The text can contain delimiters such as \r, \n, etc.
I want to discard all these delimiters and extract only full words. How can I do this with Python? Is there a library available for processing text in Python?
...
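For simple cases the standard `re` module may be all that is needed. The sketch below treats a "word" as a run of `\w` characters (letters, digits, and underscore), which is an assumption; the function name `extract_words` is also just illustrative.

```python
import re

def extract_words(text):
    """Return runs of word characters, skipping \\r, \\n, tabs, and punctuation."""
    return re.findall(r"\w+", text)

print(extract_words("Hello,\r\nworld! This\tis text."))
```

If apostrophes or hyphens should stay inside words, the pattern would need to be adjusted accordingly.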
Explaining it further
Assume I have two strings like the ones below:
I am a super boy who can Fly! Really .
I am super boy who can Break walls!
Really .
So some characters are the same: "I am super boy who can" and "Really .".
Is there anything ready to use for finding the percentage similarity/difference between those two strings?
...
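In Python, one ready-made option is `difflib.SequenceMatcher` from the standard library, whose `ratio()` method returns a similarity score between 0 and 1. The wrapper name `similarity_percent` is an assumption for this sketch, and whether this particular measure fits the use case depends on what "similar" should mean.

```python
from difflib import SequenceMatcher

def similarity_percent(a, b):
    """Return a 0-100 similarity score based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio() * 100

first = "I am a super boy who can Fly! Really ."
second = "I am super boy who can Break walls!\nReally ."
print(f"{similarity_percent(first, second):.1f}% similar")
```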
We have a large number (read: 50,000) of relatively small (read: under 500K, typically under 50K) log files created using log4net from our client application. A typical log looks like:
Start Painless log
Framework:8.1.7.0
Application:8.1.7.0
2010-05-05 19:26:07,678 [Login ] INFO Application.App.OnShowLoginMessage(194) - Validating Crede...
The title says it all.
Right now, I am implementing this with a split, slice, and implosion:
$exploded = implode(' ',array_slice(preg_split('/(?=[A-Z])/','ThisIsATest'),1));
//$exploded = "This Is A Test"
Prettier version:
$capital_split = preg_split('/(?=[A-Z])/','ThisIsATest');
$blank_first_ignored = array_slice($capital_split,1);...
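For comparison, the same idea can be written as a single substitution. The sketch below is in Python rather than PHP and uses a zero-width pattern that matches before every capital letter except at the start of the string; the function name `split_on_capitals` is hypothetical.

```python
import re

def split_on_capitals(name):
    """Insert a space before each capital letter that is not at position 0."""
    return re.sub(r"(?<!^)(?=[A-Z])", " ", name)

print(split_on_capitals("ThisIsATest"))  # This Is A Test
```

Because the lookbehind excludes the start of the string, there is no empty first element to slice away.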
If I have the string below, how can I extract the EDITORS PREFACE text with Java? Thanks.
<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>
...
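The usual advice is to use an HTML parser rather than string slicing. The sketch below shows the idea in Python with the standard `html.parser` module (not Java, so it only illustrates the approach); the class and function names are made up for the example.

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the text content of every <a> element."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._in_link:
            self.texts[-1] += data

def link_texts(html):
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts

snippet = ("<div class='chapter'><a href='page.php?page=1&filename=SomeFile"
           "&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>")
print(link_texts(snippet))
```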
In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:
get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing
In brackets there are preprocessing steps. The application runs in ab...
Hi,
I am new to language processing and I want to create a parser with Irony for the following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an XML element and value1 is the value of that element, which can also include spaces.
I have tried to modify included samples like this:
public TestGrammar()
...
I need to show the first 100 characters of an HTML text, which means, I have to pick the first 100 characters that are not tags and then close any open tags leaving a balanced HTML. Is there any library that can do it? Or is there any trivial way to do it that I am missing?
The text is originally written in Textile which can and does co...
I have 20,000 company addresses on various documents, which are all formatted differently. For example:
Company A
12345 street
US
CompanyA, Inc
box2, 12345 street
WA, US
The Company B company Ltd
123 happy street UK
company B, Ltd
123, happy street, london, S1 1AA
I'd like to be able to combine the records for each company (i.e. sepe...
I have a program to process very large files. Now I need to show a progress bar to show the progress of the processing. The program works at the word level, reading one line at a time, splitting it into words, and processing the words one by one. So while the program runs, it knows the count of the words processed. If somehow it knows the wor...
Hello, I have many large (~30 MB apiece) tab-delimited text files with variable-width lines. I want to extract the 2nd field from the nth (here, n=4) and next-to-last lines (the last line is empty). I can get them separately using awk:
awk 'NR==4{print $2}' filename.dat
and (I don't comprehend this entirely but)
awk '{y=x "\n" $2};EN...
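Both values can be collected in a single pass by remembering the previous line while streaming through the file. The sketch below does this in Python instead of awk; the function names are hypothetical, and it assumes fields are separated by tab characters and that the final line is the empty one.

```python
def second_field(line):
    """Return the 2nd tab-separated field of a line, or None if it has fewer than two."""
    fields = line.rstrip("\r\n").split("\t")
    return fields[1] if len(fields) > 1 else None

def nth_and_next_to_last(lines, n=4):
    """Return (2nd field of line n, 2nd field of the next-to-last line)."""
    nth_value = None
    previous = last = None
    for number, line in enumerate(lines, start=1):
        if number == n:
            nth_value = second_field(line)
        # Slide the two-line window forward.
        previous, last = last, line
    penultimate_value = second_field(previous) if previous is not None else None
    return nth_value, penultimate_value

# Usage with a real file (the filename is taken from the awk example):
# with open("filename.dat") as handle:
#     print(nth_and_next_to_last(handle, n=4))
```

Only two lines are held in memory at any time, so file size should not matter much.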