Dear experts:

In PHP and Java there are the explode and tokenizer functions, which convert a string into an array without punctuation. Are there functions, or some other way, to do this in Delphi? Suppose there is a large file containing " This is, a large file with punctuations,, and spaces and numbers 123...". How can we get the array "This is a large file with punctuations and spaces and numbers 123"?

Thank you very much in advance.

Yes, we want only [0..9], [a..z], [A..Z], like \w in regex. Can we use a regex in TPerlRegEx to extract the \w matches and put them in a TStringList, treating the TStringList as if it were an array? That may not be very efficient, though. Thank you.

+2  A: 

This depends on the definition of "alphanumerical character" and "punctuation character".

If we, for instance, define the set of punctuation characters as

const
  PUNCT = ['.', ',', ':', ';', '-', '!', '?'];

and consider all other characters alphanumeric, then you could do

function RemovePunctuation(const Str: string): string;
var
  ActualLength: integer;
  i: Integer;
const
  PUNCT = ['.', ',', ':', ';', '-', '!', '?'];
begin
  SetLength(result, length(Str));
  ActualLength := 0;
  for i := 1 to length(Str) do
    if not (Str[i] in PUNCT) then
    begin
      inc(ActualLength);
      result[ActualLength] := Str[i];
    end;
  SetLength(result, ActualLength);
end;

This function takes a string and returns a string. If you want to turn a string into an array of characters instead, just do

type
  CharArray = array of char;

function RemovePunctuation(const Str: string): CharArray;
var
  ActualLength: integer;
  i: Integer;
const
  PUNCT = ['.', ',', ':', ';', '-', '!', '?'];
begin
  SetLength(result, length(Str));
  ActualLength := 0;
  for i := 1 to length(Str) do
    if not (Str[i] in PUNCT) then
    begin
      result[ActualLength] := Str[i];
      inc(ActualLength);
    end;
  SetLength(result, ActualLength);
end;

(Yes, in Delphi, strings use 1-based indexing, whereas arrays use 0-based indexing. This is for historical reasons.)

Andreas Rejbrand
I believe the OP needs a parser function which will take a string and create an array of substrings, extracted by splitting on punctuation marks.
Eugene Mayevski 'EldoS Corp
Ah, I see. (But why didn't he/she say so?)
Andreas Rejbrand
A: 

There seems to be no built-in functionality like Java's tokenizer. A long time ago we wrote a tokenizer class similar to the Java one, which became part of the ElPack component suite (now LMD ElPack). Here's an implementation of a string tokenizer similar to the Java one (I just found this link on Google, so I can't comment on the code quality).

Eugene Mayevski 'EldoS Corp
+3  A: 

If you need a function that takes a string and returns an array of strings, where these strings are the substrings of the original separated by punctuation (as Eugene suggested in a comment on my previous answer), then you can do

type
  StringArray = array of string;
  IntegerArray = array of integer;
  TCharSet = set of char;

function split(const str: string; const delims: TCharSet): StringArray;
var
  SepPos: IntegerArray;
  i: Integer;
begin
  SetLength(SepPos, 1);
  SepPos[0] := 0;
  for i := 1 to length(str) do
    if str[i] in delims then
    begin
      SetLength(SepPos, length(SepPos) + 1);
      SepPos[high(SepPos)] := i;
    end;
  SetLength(SepPos, length(SepPos) + 1);
  SepPos[high(SepPos)] := length(str) + 1;
  SetLength(result, high(SepPos));
  for i := 0 to high(SepPos) - 1 do
    result[i] := Trim(Copy(str, SepPos[i] + 1, SepPos[i+1] - SepPos[i] - 1));
end;

Example:

const
  PUNCT = ['.', ',', ':', ';', '-', '!', '?'];

procedure TForm4.FormCreate(Sender: TObject);
var
  str: string;
begin
  for str in split('this, is, a! test!', PUNCT) do
    ListBox1.Items.Add(str);
end;
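
To cover the follow-up in the question (keep only letters and digits, much like \w, and collect the results in a TStringList), here is a minimal sketch along the same lines, written as a small console program. SplitWords and IsAlphaNum are names made up for this example, not RTL routines. It assumes the words are plain ASCII, and unlike \w it does not treat the underscore as a word character.

```pascal
program SplitWordsDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: True only for ASCII digits and letters.
function IsAlphaNum(C: Char): Boolean;
begin
  case C of
    '0'..'9', 'A'..'Z', 'a'..'z': Result := True;
  else
    Result := False;
  end;
end;

// Hypothetical helper: appends every maximal run of ASCII letters and
// digits in S to Words. Any other character acts as a separator.
procedure SplitWords(const S: string; Words: TStrings);
var
  i, Start: Integer;
begin
  Start := 0; // 0 means "not currently inside a word"
  // Run one step past the end so that a word ending the string is flushed.
  for i := 1 to Length(S) + 1 do
    if (i <= Length(S)) and IsAlphaNum(S[i]) then
    begin
      if Start = 0 then
        Start := i; // first character of a new word
    end
    else if Start > 0 then
    begin
      Words.Add(Copy(S, Start, i - Start));
      Start := 0;
    end;
end;

var
  Words: TStringList;
  i: Integer;
begin
  Words := TStringList.Create;
  try
    SplitWords(' This is, a large file with punctuations,, and spaces and numbers 123...', Words);
    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);
  finally
    Words.Free;
  end;
end.
```

For the sample text from the question this prints the twelve words, one per line. Unlike split above, it never emits empty strings, so doubled separators such as ",," or a trailing "..." produce no blank entries.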
Andreas Rejbrand
Thanks so much.