ansaurus

Question

Converting XML input from multiple lines to one line

Answer 1

A:

This link should help you. Their example is a little more complicated, but it shouldn't be hard to adapt to your needs: http://www.unix.com/unix-dummies-questions-answers/40871-remove-carriage-return-between-line.html

-don

Don Dickinson 2009-10-06 04:08:14

Answer 2

+3 A:

I'm not sure of the command line syntax for it, but this regex should do it:

// Find:
/>[\n\s]+</
// Replace with:
><

This will only strip whitespace between tags, not the text inside elements. The exception is a CDATA section, where tag-like text is matched too. You might also accidentally remove spaces that you actually want, e.g.:

<p>here's <i>something</i> <b>interesting</b></p>
// becomes:
<p>here's <i>something</i><b>interesting</b></p>
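If you end up doing this from Java rather than the command line, the same find/replace can be applied with String.replaceAll. This is only a minimal sketch: the class and method names are made up for illustration, and since \s already covers \n the pattern is shortened to >\s+<. It has the same mixed-content and CDATA caveats as the regex itself.

```java
public class StripBetweenTags {
    // Hypothetical helper: deletes whitespace runs that sit directly
    // between a '>' and the next '<'.
    static String stripBetweenTags(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        String xml = "<address>\n  <street>abc</street>\n  <number>123</number>\n</address>";
        System.out.println(stripBetweenTags(xml));
    }
}
```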

Here's an example of the problem with CDATA I mentioned:

<element><![CDATA[
    this shouldn't <blah>
    <blah> be touched.
]]></element>

// becomes:
<element><![CDATA[
    this shouldn't <blah><blah> be touched.
]]></element>

Of course, the "correct" answer is to use a parser to read the file and then output it again with whitespace and indentation removed.
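A rough idea of what the parser route could look like in Java, using the DOM and identity-transformer classes that ship with the JDK. Treat it as a sketch under assumptions: the class and method names are hypothetical, and it takes the XML as a string for brevity. It drops whitespace-only text nodes, so CDATA sections are left alone, but a lone space between two inline elements (the <i>/<b> case above) is still removed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CompactXml {
    // Recursively removes text nodes that contain only whitespace.
    // CDATA sections have a different node type, so they are never removed.
    static void dropBlankText(Node node) {
        NodeList children = node.getChildNodes();
        List<Node> blanks = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                blanks.add(child);
            } else {
                dropBlankText(child);
            }
        }
        for (Node blank : blanks) {
            node.removeChild(blank);
        }
    }

    // Hypothetical helper: parse, strip blank text nodes, serialize unindented.
    static String compact(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            dropBlankText(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not compact XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compact(
                "<root>\n  <address>\n    <street>abc</street>\n  </address>\n</root>"));
    }
}
```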

nickf 2009-10-06 04:32:33

Answer 3

+1 A:

You may try this code (Java):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class TrimLines {
  public static void main(String[] args) {
    String source = "employee.xml";          // input file
    String result = "no-lines-employee.xml"; // output file

    System.out.println("removing lines...");
    // try-with-resources closes both files even if an exception is thrown
    try (Scanner s = new Scanner(new File(source));
         FileWriter w = new FileWriter(result)) {
      while (s.hasNextLine()) {
        w.write(s.nextLine()); // nextLine() drops the line terminator
      }
      System.out.println("remove successful.");
    } catch (IOException ex) {
      ex.printStackTrace();
    }
  }
}

Just specify the source XML filename (the source variable) and the destination XML filename (the result variable).

jerjer 2009-10-06 05:29:46

You can also add trim() after s.nextLine() to remove the whitespace between tags.

tulskiy 2009-10-06 05:32:05

Answer 4

+2 A:

You can write a SAX parser and on each event just write elements to another file without new lines. This will remove both new lines and junk whitespace.
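A minimal sketch of that idea, assuming the JDK's built-in SAX parser and collecting the output in a string instead of a file. The class name and the escape helper are hypothetical. It simply skips whitespace-only character chunks; comments and processing instructions are dropped, CDATA content comes out as escaped text, and a parser is allowed to split text into several chunks, so real code would want to buffer text before deciding what counts as junk.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCompactor extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    // Minimal escaping for text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        // Skip chunks that are only whitespace (indentation and line breaks).
        if (!text.trim().isEmpty()) {
            out.append(escape(text));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    // Hypothetical helper: runs the handler over an XML string.
    static String compact(String xml) {
        try {
            SaxCompactor handler = new SaxCompactor();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compact(
                "<root>\n  <address id=\"1\">\n    <street>abc</street>\n  </address>\n</root>"));
    }
}
```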

tulskiy 2009-10-06 05:30:56

Answer 5

A:

tr is a pretty simple way to replace a newline:

cat addresses.xml | tr -d '\n'

Googling for "shell replace newline" will yield plenty of other options too.

jmdeldin 2009-10-06 06:08:15

Answer 6

+3 A:

XML::Twig comes with an XML pretty printer, xml_pp. If the address elements are right under the root of the document, then you can use it to get really close to your desired output:

xml_pp -s record_c to_compact.xml

<root>
  <address><street>abc</street><number>123</number></address>
  <address><street>abc1</street><number>345</number></address>
  <address><street>xyz</street><number>999</number></address>
  <address><street>abc</street><number>123</number></address>
  <address><street>abc1</street><number>345</number></address>
  <address><street>xyz</street><number>999</number></address>
</root>

Removing the spaces at the beginning of the address lines is quite easy:

xml_pp -s record_c to_compact.xml | perl -p -e's{^\s+}{}'

If the address elements are not right under the root, then let us know, and I'll see what can be done.

mirod 2009-10-06 07:02:52

Answer 7

A:

The regular expression

(?<=>)\r?\n[ \t]*(?!<address)

will match a line break (LF or CRLF) plus any spaces/tabs that follow it after a tag, unless followed by <address>. Although I usually would advise against regular expressions and in favor of a parser, in this case it looks like the regex gets the job done a lot more easily.
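For anyone trying this from Java, here is a small sketch with String.replaceAll; the class name, method name and sample input are assumptions. One thing to watch: when the <address> lines are indented, a backtracking engine can satisfy the negative lookahead by giving back one space, so the line break in front of <address> is removed anyway. The sketch makes the whitespace quantifier possessive ([ \t]*+) to prevent that. That change is an addition here, not part of the regex as posted.

```java
public class JoinAddressLines {
    // Hypothetical helper: removes line breaks (and following indentation)
    // after a tag, except where the next tag opens an <address> element.
    // The possessive *+ stops the engine from backing off into the indentation.
    static String joinLines(String xml) {
        return xml.replaceAll("(?<=>)\\r?\\n[ \\t]*+(?!<address)", "");
    }

    public static void main(String[] args) {
        String xml = "<root>\n"
                + "  <address>\n"
                + "    <street>abc</street>\n"
                + "    <number>123</number>\n"
                + "  </address>\n"
                + "  <address>\n"
                + "    <street>xyz</street>\n"
                + "  </address>\n"
                + "</root>";
        System.out.println(joinLines(xml));
    }
}
```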

Tim Pietzcker 2009-10-06 07:15:54

Could the downvoter please explain the vote? The solution works on the example data, and a caveat about regex vs. parser is also present.

Tim Pietzcker 2009-10-07 15:05:24

Answer 8

+3 A:

Another option is to use an XSLT stylesheet which copies everything as-is, except that inside address elements it copies only the attributes and child elements (dropping the whitespace text between them):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="address">
        <xsl:copy>
            <xsl:apply-templates select="@*|*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Unlike regex approaches, this should work for any XML document (even if the line breaks are encoded as character entities or in CDATA), and will only format the address elements.

You can run the stylesheet using Java, or from the command line using xsltproc.
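In case it helps, running a stylesheet from Java could look roughly like this, using the javax.xml.transform classes bundled with the JDK. The class and method names are hypothetical, and the file names are whatever you pass as arguments.

```java
import java.io.File;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RunStylesheet {
    // Hypothetical helper: compiles the stylesheet and applies it to the input.
    static void transform(Source stylesheet, Source input, Result output) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer(stylesheet);
            t.transform(input, output);
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    // Usage: java RunStylesheet <stylesheet.xsl> <input.xml>
    public static void main(String[] args) {
        transform(new StreamSource(new File(args[0])),
                  new StreamSource(new File(args[1])),
                  new StreamResult(System.out));
    }
}
```

The result goes to standard output here; passing a StreamResult built from a File would write it to disk instead.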

Pete Kirkham 2009-10-06 09:02:35
