Regex:

String regexp = "([0-9.]{1,15})[ \t]*([0-9]{1,15})[ \t]*([0-9.]{1,15})[ \t]*(\"(.*?)\"\\s+\\((\\d{4})\\)\\s+\\{(.*?)\\})";

Text:

1000000103      50   4.5  #1 Single (2006)
2...1.2.12       8   2.7  $1,000,000 Chance of a Lifetime (1986)
11..2.2..2       8   5.0  $100 Taxi Ride (2001)
....13.311       9   7.1  $100,000 Name That Tune (1984)
3..21...22      10   4.6  $2 Bill (2002)
30010....3      18   2.7  $25 Million Dollar Hoax (2004)
2000010002     111   5.6  $40 a Day (2002)
2000000..4      26   1.6  $5 Cover (2009)
.0..2.0122      15   7.8  $9.99 (2003)
..2...1113       8   7.5  $weepstake$ (1979)
0000000125    3238   8.7   Allo  Allo! (1982)
1....22.12       8   6.5   Allo  Allo! (1982) {A Barrel Full of Airmen (#7.7)

I'm trying to use Java and MySQL together. I'm learning it for a project that I'm planning. I want the desired output to be like this:

distribution = first column
votes = second column
rank = third column
title = fourth column

The first three work fine. I have trouble with the fourth one.

No, there are supposed to be curly brackets. The lines above are just the first few entries; I'll paste a few more, which may make it easier to see what I'm trying to show you. Here they are:

0...001122      16   7.8  "'Allo 'Allo!" (1982) {Gruber Does Some Mincing (#3.2)}
100..01103      21   7.4  "'Allo 'Allo!" (1982) {Hans Goes Over the Top (#4.1)}
....022100      11   6.9  "'Allo 'Allo!" (1982) {Hello Hans (#7.4)}
0....03022      21   8.4  "'Allo 'Allo!" (1982) {Herr Flick's Revenge (#2.6)}
......8..1       6   7.0  "'Allo 'Allo!" (1982) {Hitler's Last Heil (#8.3)}
.....442..       5   6.5  "'Allo 'Allo!" (1982) {Intelligence Officers (#6.5)}
....1123.2       9   6.9  "'Allo 'Allo!" (1982) {It's Raining Italians (#6.2)}
....1.33.3      10   7.8  "'Allo 'Allo!" (1982) {Leclerc Against the Wall (#5.18)}
....22211.       8   6.4  "'Allo 'Allo!" (1982) {Lines of Communication (#7.5)}

The code I'm using:

  stmt.executeUpdate("CREATE TABLE mytable(distribution char(20)," +
      "votes integer," + "rank float," + "title char(250));");
  String regexp ="([\\d\\.]+)\\s+(\\d+)\\s+([\\d\\.]+)\\s+(.*?\\s+\\(\\d{4}\\).*)";
  Pattern pattern = Pattern.compile(regexp);
  String line;
  String data= "";
  while ((line = bf.readLine()) != null) {
    data = line.replaceAll("'", " ");
    String data2 = data.replaceAll("\"", "");
    //System.out.println(data2);
    Matcher matcher = pattern.matcher(data2);
    if (matcher.find()) {
        String distribution = matcher.group(1);
        String votes = matcher.group(2);
        String rank = matcher.group(3);
        String title = matcher.group(4);
        //System.out.println(distribution + " " + votes + " " + rank + " " + title);
        String todo = ("INSERT into mytable " +
            "(Distribution, Votes, Rank, Title) "+
            "values ('"+distribution+"', '"+votes+"', '"+rank+"', '"+title+"')");
        stmt = con.createStatement();
        int r = stmt.executeUpdate(todo);
    }
  }
A: 

No, it would not.

  1. [ \t] would have to become [ \t]+ or \s+; in the sample input your numbers are right-aligned using spaces (in addition to tabs, if any)
  2. backslashes must be doubled inside Java string literals (\\s rather than \s)
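
A minimal sketch of both points, in case it helps (the class name `EscapeDemo` and the helper `columns` are made up for illustration, they are not part of your code). The gap pattern uses `+` so it swallows a whole run of alignment spaces, and the tab escape is written with a doubled backslash so that the regex engine, not the Java compiler, interprets it:

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Regex source is [ \t]+ ; inside a Java string literal the backslash is doubled.
    static final Pattern GAP = Pattern.compile("[ \\t]+");

    // Hypothetical helper: splits a line on runs of spaces and/or tabs.
    static String[] columns(String line) {
        return GAP.split(line.trim());
    }

    public static void main(String[] args) {
        String[] cols = columns("2000010002     111   5.6");
        System.out.println(cols.length + " columns, votes = " + cols[1]);
    }
}
```

With the sample fragment above this prints `3 columns, votes = 111`.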

Given that you want the title for "'Allo 'Allo!" to come out as Title = Allo Allo! (1982) {Lines of Communication (#7.5)}, try:

pattern = "([0-9\\.]+)[ \\t]+([0-9]+)[ \\t]+([0-9\\.]+)[ \\t]+(.*?[ \\t]+\\([0-9]{4}\\).*)";

or (simplified like Fadrian suggested):

pattern = "([\\d\\.]+)\\s+(\\d+)\\s+([\\d\\.]+)\\s+(.*?\\s+\\(\\d{4}\\).*)";
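
Here is a rough, self-contained sketch of how the simplified pattern could be applied to one line. The class name `RatingsLineDemo` and the `parse` helper are hypothetical, and the sample line is taken from the question; treat it as an illustration rather than a finished parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingsLineDemo {
    // Same pattern as above: three whitespace-separated numeric columns,
    // then a title that must contain a four-digit year in parentheses.
    static final Pattern LINE = Pattern.compile(
        "([\\d\\.]+)\\s+(\\d+)\\s+([\\d\\.]+)\\s+(.*?\\s+\\(\\d{4}\\).*)");

    // Returns {distribution, votes, rank, title}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse("2000010002     111   5.6  $40 a Day (2002)");
        System.out.println(String.join(" | ", fields));
    }
}
```

For that sample line the output is `2000010002 | 111 | 5.6 | $40 a Day (2002)`.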

Read more in the "Backslashes, escapes, and quoting" section of the Pattern javadoc page.

vladr
Yeah, I know, my bad. I had double quotes but removed them, because in the text file I'm using the double quotes are unbalanced and end up in the wrong places, for example: "hello "hello how are you"
angad Soni
Here is the code; I'll post it up top.
angad Soni
A: 

This is a much simpler regex to do what you want to do

([\d\.]*)\s*([\d\.]*)\s*([\d\.]*)\s*(.*)

If you need to cater for whitespace at the end of the line as well, then add \s*

([\d\.]*)\s*([\d\.]*)\s*([\d\.]*)\s*(.*)\s*

I just corrected a small mistake of using \S instead of [\d.]
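
To use this from Java the backslashes need doubling. A rough sketch follows (the class name `SimpleRegexDemo` and the `parse` helper are hypothetical). Two caveats, as far as I can tell: because every group uses `*`, the pattern also matches an empty line, and since `(.*)` is greedy it captures trailing whitespace itself, so the sketch trims the title instead of relying on a final `\s*`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleRegexDemo {
    // Regex source: ([\d\.]*)\s*([\d\.]*)\s*([\d\.]*)\s*(.*)
    static final Pattern SIMPLE = Pattern.compile(
        "([\\d\\.]*)\\s*([\\d\\.]*)\\s*([\\d\\.]*)\\s*(.*)");

    // Returns the four groups; the title is trimmed by hand.
    static String[] parse(String line) {
        Matcher m = SIMPLE.matcher(line);
        if (!m.matches()) {
            return null; // only reachable if the input contains a line terminator
        }
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4).trim()
        };
    }

    public static void main(String[] args) {
        String[] fields = parse("11..2.2..2       8   5.0  $100 Taxi Ride (2001)   ");
        System.out.println("[" + fields[3] + "]");
        // An empty line still "matches", with four empty groups.
        System.out.println(parse("").length);
    }
}
```

If blank or malformed lines can occur in the file, checking that the first three groups are non-empty before inserting is probably worthwhile.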

Fadrian Sudaman
Is this regex just for the title part, though, and nothing else?
angad Soni
This is for everything. There are four parts (groups) that will extract the four columns you wanted. Try it out.
Fadrian Sudaman
No, it doesn't work, sorry.
angad Soni
I ran it through the regex designer with your data and it works for all your sample data. I mentioned adding \s* at the end but my sample left that out, so I edited the regex string to add it, just in case. Can you please try this: ([\d\.]*)\s*([\d\.]*)\s*([\d\.]*)\s*(.*)\s*
Fadrian Sudaman
+3  A: 
/Allo Allo! \(1982\) \{A Barrel Full of Airmen \(\#7\.7\)\}/
twolfe18
Well, there are about 7000 lines, but all in this format.
angad Soni
Which format? You've given no example.
harschware
+1 for doing exactly what I was about to do.
polygenelubricants
Well, I have a text file that I'm parsing in Java, and then I insert the parsed text into MySQL. I need a regex that matches strings in this format so I can insert them into MySQL, all in Java.
angad Soni
Not gonna work. You need to escape the special characters.
zneak
@angad Soni - what fields are in your mysql table that you want to use from the data you got from your parsed text?
Russell
Angad, you need to describe the format before anyone can possibly help you.
allyourcode
Please go here, it's explained in more detail: http://stackoverflow.com/questions/2360418/would-a-regex-like-this-work-for-these-lines-of-text Sorry for the confusion.
angad Soni
is this answer a joke??? -1!
vladr
Originally I escaped the special characters, but I didn't escape the escaping backslash! Fixed now. @vlad, yes, this is sort of a joke...
twolfe18
Code formatting makes this kind of thing much easier. ;)
Alan Moore
A: 

Maybe: [a-zA-Z ]+\! \(\d{4}\) \{[a-zA-Z0-9 \(\)\#\.]+\}

Not sure what you're trying to accomplish, so this is kind of a guess...

For better help you have to give better details: some more example lines, what kind of data this is, and whether you just want a match or want specific capture groups.

FrustratedWithFormsDesigner
+1  A: 

Remember the #1 rule of programming: keep it simple! Why do you really need a regex for the whole thing?

Seems to me that you have a nicely defined tabular format... is it in tsv?

If not, you could read line by line, split based on the spaces for the first 3 columns, then only your last column would need a regexp to parse.
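
A minimal sketch of that idea, assuming the columns are separated by runs of spaces and/or tabs (the class name `SplitDemo` and the helper `columns` are made up for this example). Passing a limit of 4 to `String.split` stops splitting after the third gap, so the title keeps its own internal spaces:

```java
public class SplitDemo {
    // Splits into at most 4 pieces: three numeric columns plus the untouched remainder.
    static String[] columns(String line) {
        return line.trim().split("\\s+", 4);
    }

    public static void main(String[] args) {
        String[] cols = columns("2000010002     111   5.6  $40 a Day (2002)");
        String distribution = cols[0];
        int votes = Integer.parseInt(cols[1]);
        double rank = Double.parseDouble(cols[2]);
        String title = cols[3]; // only this column would still need a regex, if at all
        System.out.println(distribution + " / " + votes + " / " + rank + " / " + title);
    }
}
```

Lines with fewer than four pieces (blank lines, headers) would need a length check before indexing into the array.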

mlaverd
Umm, I'm not sure how to do that.
angad Soni
+2  A: 

Can you use split instead and just have it split on the tabs? Or get the opencsv library and use it.

Perhaps something like

....

String[] temp;
String the_line;
BufferedReader in = new BufferedReader(new FileReader("file.txt")); 

while ((the_line = in.readLine()) != null)
{
    temp = the_line.split("\t");
    ....
}

....
jasonbar
Umm, I'm not sure how to do that.
angad Soni
Okay, I see what you're saying, but I need the values in variables so I can insert them into a database, and the file is 20 MB in size. That's why I thought a regex would be easier.
angad Soni
@angad Soni: `temp` would be an array of your columns? Assign them to variables or just use the array elements directly?
jasonbar
+1  A: 

Try this

        BufferedReader reader = new BufferedReader(new FileReader("yourFile"));

        Pattern p = Pattern.compile("([0-9\\.]+)[\\s]+([0-9]+)[\\s]+([0-9]\\.[0-9])[\\s]+([^\\s].*$)");

        String line;
        while( (line = reader.readLine()) != null ) {
            Matcher m = p.matcher(line);
            if ( m.matches() ) {
                 System.out.println(m.group(1));
                 System.out.println(m.group(2));
                 System.out.println(m.group(3));
                 System.out.println(m.group(4));
            }

        }

This assumes the third group is exactly one digit, a dot, and then exactly one digit.
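
To make that assumption concrete, here is a small check (the class name `RankFormatDemo`, the helper `accepts`, and the second input line with a rank of 10.0 are all invented for illustration; nothing like it appears in the sample data):

```java
import java.util.regex.Pattern;

public class RankFormatDemo {
    // Same pattern as in the loop above.
    static final Pattern P = Pattern.compile(
        "([0-9\\.]+)[\\s]+([0-9]+)[\\s]+([0-9]\\.[0-9])[\\s]+([^\\s].*$)");

    // True if the whole line fits the pattern.
    static boolean accepts(String line) {
        return P.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Line from the question: the rank 5.6 is digit-dot-digit.
        System.out.println(accepts("2000010002     111   5.6  $40 a Day (2002)"));
        // Made-up line: a two-digit integer part in the rank breaks the assumption.
        System.out.println(accepts("2000010002     111  10.0  Made-Up Title (2002)"));
    }
}
```

The first call prints `true` and the second `false`, so if such ranks can occur, the third group would need to allow more than one leading digit.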

Lombo
Okay, well, I've been trying that, but for some reason it won't work; I get an error.
angad Soni
It's not the same regexp you provided. It works for me with your file.
Lombo
What did you save the file as? I mean, what format, .txt or something else?
angad Soni
Yes, I saved the file as .txt.
Lombo
A: 

Don't use regex to parse text. Regex is intended to match patterns in text, not to parse text into parts/components.

If the text file example in your question is an actual and unchanged example, then the following basic kickoff example of a "parser" should just work (as a bonus, it also instantly executes the needed JDBC code). I've copypasted your data unchanged into c:\test.txt.

public static void main(String... args) throws Exception {
    final String SQL = "INSERT INTO movie (distribution, votes, rank, title) VALUES (?, ?, ?, ?)";
    Connection connection = null;
    PreparedStatement statement = null;
    BufferedReader reader = null;        

    try {
        connection = database.getConnection();
        statement = connection.prepareStatement(SQL);
        reader = new BufferedReader(new InputStreamReader(new FileInputStream("/test.txt")));

        // Loop through file.
        for (String line; (line = reader.readLine()) != null;) {
            if (line.isEmpty()) continue; // I am not sure whether those odd empty lines belong in your file; if not, this check can be removed.

            // Gather data from lines.
            String distribution = line.substring(0, 10);
            int votes = Integer.parseInt(line.substring(12, 18).trim());
            double rank = Double.parseDouble(line.substring(20, 24).trim());
            String title = line.substring(26).trim().replace("\"", ""); // You also want to get rid of those double quotes, huh? I am however not sure why, maybe you initially had problems with it in your non-prepared SQL string...

            // Just to show what you've gathered.
            System.out.printf("%s, %5d, %.1f, %s%n", distribution, votes, rank, title);

            // Now add batch to statement.
            statement.setString(1, distribution);
            statement.setInt(2, votes);
            statement.setDouble(3, rank);
            statement.setString(4, title);
            statement.addBatch();
        }

        // Execute batch insert!
        statement.executeBatch();
    } finally {
        // Gently close expensive resources, you don't want to leak them!
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
        if (statement != null) try { statement.close(); } catch (SQLException logOrIgnore) {}
        if (connection != null) try { connection.close(); } catch (SQLException logOrIgnore) {}
    }
}

See, it just works. No need for overcomplicated regex.
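
If you want to try the column offsets without a database at hand, here is a stripped-down sketch of just the parsing step. The class `FixedWidthDemo` and its `Rating` holder are hypothetical names; the offsets are the ones used above and assume space-padded, fixed-width columns exactly as in the pasted sample:

```java
public class FixedWidthDemo {
    // Simple value holder for one parsed line (illustrative only).
    static final class Rating {
        final String distribution;
        final int votes;
        final double rank;
        final String title;

        Rating(String distribution, int votes, double rank, String title) {
            this.distribution = distribution;
            this.votes = votes;
            this.rank = rank;
            this.title = title;
        }
    }

    // Cuts the line at fixed offsets; no regex involved.
    static Rating parse(String line) {
        String distribution = line.substring(0, 10);
        int votes = Integer.parseInt(line.substring(12, 18).trim());
        double rank = Double.parseDouble(line.substring(20, 24).trim());
        String title = line.substring(26).trim().replace("\"", "");
        return new Rating(distribution, votes, rank, title);
    }

    public static void main(String[] args) {
        Rating r = parse("2000010002     111   5.6  $40 a Day (2002)");
        System.out.println(r.distribution + " / " + r.votes + " / " + r.rank + " / " + r.title);
    }
}
```

The offsets only hold while the padding is made of spaces; a tab inside the padding would shift every column after it.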

BalusC
you sure counted those spaces... whatcha gonna do if there's tabs too? oops!
vladr
Counted? I don't know what you normally do/use, but my text editor just shows the column index of the cursor in the status bar. As to the tabs, I already covered myself by saying that the data was copy-pasted unchanged from the OP. If you want to play the smartass, please look for someone else.
BalusC