Hello,

I have a large string (an RSS article, to be more precise) and I want to get the word between a specific startIndex and endIndex. String provides the substring method, but it only accepts ints as parameters. My start and end indexes are of type long.

What is the best way to get the word from a String using start and end indexes of type long?

My first solution was to start trimming the String and get it down so I can use ints. Didn't like where it was going. Then I looked at Apache Commons Lang but didn't find anything. Any good solutions?

Thank you.


Update:

Just to provide a little more information.

I am using a tool called General Architecture for Text Engineering (GATE), which scans a String and returns a list of Annotations. An annotation holds the type of a word (Person, Location, etc.) and the start and end indexes of that word.

For the RSS, I use ROME, which reads an RSS feed and contains the body of the article in a String.

A: 

Probably it would be better not to use String but StringReader.

nanda
Hmmm... I know StringBuffer and StringBuilder, but never heard of StringReader. Could you expand please? I don't see any substring methods.
pek
+8  A: 

There is no point doing this on a String because a String can hold at most 2^31 - 1 characters. Internally the string's characters are held in a char[], and all of the API methods use int as the type for lengths, positions and offsets.

  • The same restriction applies to StringBuffer and StringBuilder; i.e. an int length.
  • A StringReader is backed by a String, so that won't help.
  • Both CharBuffer and ByteBuffer have the same restriction; i.e. an int length.
  • A bare array of a primitive type is limited to an int length.

In short, you are going to have to implement your own "long string" type that internally holds its characters in (for example) an array of arrays of characters.
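As a rough illustration of the "array of arrays" idea, here is a minimal sketch. The class name LongCharBuffer, its methods, and the configurable chunk size are all hypothetical and not taken from any existing library; a real implementation would need more care around immutability and memory use.

```java
public final class LongCharBuffer {
    private final char[][] chunks;   // the characters, split into fixed-size chunks
    private final int chunkSize;     // hypothetical tuning knob; kept small in tests
    private final long length;

    public LongCharBuffer(long length, int chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        this.length = length;
        this.chunkSize = chunkSize;
        // Number of chunks, rounded up; must itself fit in an int.
        int chunkCount = Math.toIntExact((length + chunkSize - 1) / chunkSize);
        chunks = new char[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = length - (long) i * chunkSize;
            chunks[i] = new char[(int) Math.min(chunkSize, remaining)];
        }
    }

    public long length() {
        return length;
    }

    public char charAt(long index) {
        checkIndex(index);
        return chunks[(int) (index / chunkSize)][(int) (index % chunkSize)];
    }

    public void setCharAt(long index, char c) {
        checkIndex(index);
        chunks[(int) (index / chunkSize)][(int) (index % chunkSize)] = c;
    }

    // Returns the characters in [start, end). The result is an ordinary String,
    // so end - start must still fit in an int.
    public String substring(long start, long end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        StringBuilder sb = new StringBuilder(Math.toIntExact(end - start));
        for (long i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
    }
}
```

Both index calculations use long arithmetic and only narrow to int after dividing by the chunk size, which is what lets the total length exceed the limit of a single array.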

(I tried a Google search but I couldn't spot an existing implementation of long strings that looked credible. I guess there's not a lot of call for monstrously large strings in Java ...)

By the way, if you anticipate that the strings are never going to be this large, you should just convert your long offsets to int. A cast would work, but you might want to check the range and throw an exception if you ever get an offset >= 2^31.
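The range check suggested above could look roughly like the sketch below. The class and method names (LongOffsets, substring) are made up for illustration. Math.toIntExact is a standard method (Java 8 and later) that throws ArithmeticException when the value does not fit in an int, so no hand-written range check is needed.

```java
public final class LongOffsets {
    private LongOffsets() {}

    // Returns text[start, end) for offsets that arrive as long values.
    // Math.toIntExact throws ArithmeticException if an offset exceeds the int range;
    // String.substring then does its usual bounds checking.
    public static String substring(String text, long start, long end) {
        int from = Math.toIntExact(start);
        int to = Math.toIntExact(end);
        return text.substring(from, to);
    }
}
```

For example, `LongOffsets.substring("hello world", 6L, 11L)` returns `"world"`.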

Stephen C
+1  A: 

A String is backed by a char[], and arrays can only be indexed with ints (and can consequently hold at most 2^31 - 1 characters). If you have long indexes, just cast them to ints - if they're larger than Integer.MAX_VALUE, your program is broken.

gustafc
+1  A: 

You'd better use a java.io.Reader. This class supports the methods skip(long n) and read(char[] cbuf). But please note that skip returns a long and read returns an int (how many characters were actually skipped / read), and that number can be smaller than what you asked for, so you need to call those methods in a loop.
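A minimal sketch of that loop, assuming the word itself is short enough to fit in a String. The class and method names (ReaderSlice, slice) are hypothetical. It uses the three-argument overload read(char[], int, int) so that partial reads can be appended to the same buffer.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

public final class ReaderSlice {
    private ReaderSlice() {}

    // Reads the characters in [start, end) from the reader.
    // Assumes end - start fits in an int, since the result is a String.
    public static String slice(Reader reader, long start, long end) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("start=" + start + ", end=" + end);
        }

        // skip may skip fewer characters than requested, so keep going.
        long toSkip = start;
        while (toSkip > 0) {
            long skipped = reader.skip(toSkip);
            if (skipped > 0) {
                toSkip -= skipped;
            } else if (reader.read() == -1) {
                // No progress and nothing left to read: the stream is too short.
                throw new EOFException("stream ended before offset " + start);
            } else {
                toSkip--; // the single read() consumed one character
            }
        }

        // read may also return fewer characters than requested.
        int length = Math.toIntExact(end - start);
        char[] buf = new char[length];
        int filled = 0;
        while (filled < length) {
            int n = reader.read(buf, filled, length - filled);
            if (n == -1) {
                break; // stream ended early; return what we have
            }
            filled += n;
        }
        return new String(buf, 0, filled);
    }
}
```

With a StringReader, `ReaderSlice.slice(new StringReader("hello world"), 6L, 11L)` returns `"world"`.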

Thomas Mueller