I have an XML file like this:

<address>
   <street>abc</street>
   <number>123</number>
</address>

<address>
   <street>abc1</street>
   <number>345</number>
</address>

...
...
<address>
   <street>xyz</street>
   <number>999</number>
</address>

I want to convert this to:

<address><street>abc</street><number>123</number></address>
<address><street>abc1</street><number>345</number></address>
...
...
<address><street>xyz</street><number>999</number></address>

Can you recommend how I can go about this? I think sed might help, but I have been unable to get it to work.

EDIT: The XML file has 100K lines of this kind. I edited the question to reflect the correct input and output.

A: 

This link should help you. Their example is a little more complicated, but it shouldn't be hard to adapt to your needs: http://www.unix.com/unix-dummies-questions-answers/40871-remove-carriage-return-between-line.html

-don

Don Dickinson
+3  A: 

I'm not sure of the command line syntax for it, but this regex should do it:

// Find:
/>[\n\s]+</
// Replace with:
><

This will only strip whitespace between elements (not inside them, except perhaps in a CDATA section), but you might accidentally remove some spaces that you actually want in there, e.g.:

<p>here's <i>something</i> <b>interesting</b></p>
// becomes:
<p>here's <i>something</i><b>interesting</b></p>

Here's an example of the problem with CDATA I mentioned:

<element><![CDATA[
    this shouldn't <blah>
    <blah> be touched.
]]></element>

// becomes:
<element><![CDATA[
    this shouldn't <blah><blah> be touched.
]]></element>

Of course, the "correct" answer is to use a parser to read the file and then output it again with whitespace and indentation removed.
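
As a minimal sketch of how the find/replace pair above could be applied from Java: the class and method names here are made up for illustration, and the pattern is the same one, only escaped for a Java string literal. Note that it also joins adjacent address elements, so the whole input ends up on one line.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class StripInterTagWhitespace {
    // Same pattern as in the answer: whitespace between '>' and '<'.
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">[\\n\\s]+<");

    static String strip(String xml) {
        return BETWEEN_TAGS.matcher(xml).replaceAll("><");
    }

    public static void main(String[] args) {
        String xml = "<address>\n   <street>abc</street>\n   <number>123</number>\n</address>";
        // prints <address><street>abc</street><number>123</number></address>
        System.out.println(strip(xml));
    }
}
```

Running it on the `<p>` example above reproduces the lost space between `</i>` and `<b>`, so this is only safe when the whitespace between tags carries no meaning.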

nickf
+1  A: 

You may try this code (Java):

import java.io.File;
import java.io.FileWriter;
import java.util.Scanner;

public class TrimLines {
  public static void main(String[] args) {
    String source = "employee.xml";
    String result = "no-lines-employee.xml";

    System.out.println("removing lines...");
    // try-with-resources closes both files even if an exception is thrown
    try (Scanner s = new Scanner(new File(source));
         FileWriter w = new FileWriter(result)) {
      // hasNextLine() is the check that pairs with nextLine()
      while (s.hasNextLine()) {
        // nextLine() drops the line terminator, so all lines are joined into one
        w.write(s.nextLine());
      }
      System.out.println("remove successful.");
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}

Just specify the source XML filename (source variable) and the destination XML filename (result variable).

jerjer
You can also add trim() after s.nextLine() to remove whitespace between tags.
tulskiy
+2  A: 

You can write a SAX parser and on each event just write elements to another file without new lines. This will remove both new lines and junk whitespace.
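
A rough sketch of that idea, using the SAX parser that ships with the JDK. It makes several assumptions not stated above: the address elements are wrapped in a single root element (the sample in the question has no root, so a parser would reject it as-is), a line break is wanted after each closing address tag, and output is collected in memory for brevity rather than streamed to a file. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: re-emits elements and drops whitespace-only text.
final class CompactAddresses extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
        // Assumption: one address per output line, as in the question.
        if (qName.equals("address")) out.append('\n');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        // Skip indentation and line breaks; keep real text content.
        if (!text.trim().isEmpty()) out.append(escape(text));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String compact(String xml) {
        try {
            CompactAddresses handler = new CompactAddresses();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n<address>\n   <street>abc</street>\n"
                   + "   <number>123</number>\n</address>\n</root>";
        System.out.print(compact(xml));
    }
}
```

One limitation: a parser may deliver text in several characters() calls, so a whitespace-only chunk inside mixed content could be dropped. For a 100K-line file, writing to a Writer instead of a StringBuilder would keep memory use flat.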

tulskiy
A: 

tr is a pretty simple way to replace a newline:

cat addresses.xml | tr -d '\n'

Googling for "shell replace newline" will yield plenty of other options too.

jmdeldin
+3  A: 

XML::Twig comes with an XML pretty printer, xml_pp. If the address lines are right under the root of the document, then you can use it to get real close to your desired output:

xml_pp -s record_c to_compact.xml

<root>
  <address><street>abc</street><number>123</number></address>
  <address><street>abc1</street><number>345</number></address>
  <address><street>xyz</street><number>999</number></address>
  <address><street>abc</street><number>123</number></address>
  <address><street>abc1</street><number>345</number></address>
  <address><street>xyz</street><number>999</number></address>
</root>

Removing the spaces at the beginning of the address lines is quite easy:

xml_pp -s record_c to_compact.xml | perl -p -e's{^\s+}{}'

If the address elements are not right under the root, then let us know, and I'll see what can be done.

mirod
A: 

The regular expression

(?<=>)\r?\n[ \t]*(?!<address)

will match a line break (LF or CRLF) plus any following spaces/tabs directly after a tag, unless it is followed by <address>. Although I would usually advise against regular expressions and in favor of a parser, in this case the regex looks like it gets the job done a lot more easily.
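
A minimal sketch of applying this expression from Java, replacing each match with the empty string. The class and method names are hypothetical, and the pattern is unchanged apart from Java string escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class JoinAddressLines {
    // A line break plus indentation right after '>', unless "<address" follows.
    private static final Pattern BREAK_INSIDE_RECORD =
        Pattern.compile("(?<=>)\\r?\\n[ \\t]*(?!<address)");

    static String join(String xml) {
        return BREAK_INSIDE_RECORD.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml = "<address>\n   <street>abc</street>\n   <number>123</number>\n</address>\n"
                   + "<address>\n   <street>abc1</street>\n   <number>345</number>\n</address>\n";
        // prints each address on its own line
        System.out.println(join(xml));
    }
}
```

Because of the lookahead, the line break in front of each opening address tag survives, which gives one address per line. A single blank line between two address elements is also collapsed, since its first line break is not directly followed by `<address`.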

Tim Pietzcker
Could the downvoter please explain the vote? The solution works on the example data, and a caveat about regex vs. parser is also present.
Tim Pietzcker
+3  A: 

Another option is to use an XSLT stylesheet which copies everything, but only copies elements and attributes in the address elements:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="address">
        <xsl:copy>
            <xsl:apply-templates select="@*|*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Unlike regex approaches, this should work for any XML document (even if the line breaks are encoded as character entities or in CDATA), and will only format the address elements.

You can run the stylesheet using Java, or from the command line using xsltproc.
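
For the Java route, a minimal sketch using the JDK's built-in javax.xml.transform API could look like the following. The class name, the STYLESHEET constant (the stylesheet above, inlined as a string) and the sample input are assumptions for illustration; in practice you would more likely pass file-based StreamSource and StreamResult objects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical runner; the names are illustrative only.
final class RunStylesheet {
    // The stylesheet from the answer, inlined so the example is self-contained.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"address\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|*\"/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String stylesheet, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            // Leave out the <?xml ...?> header so only the copied content is returned.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n<address>\n   <street>abc</street>\n"
                   + "   <number>123</number>\n</address>\n</root>";
        System.out.println(transform(STYLESHEET, xml));
    }
}
```

Whitespace outside the address elements is copied through unchanged, so the line breaks between records are kept.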

Pete Kirkham