views:

40

answers:

4

I have the following code:

get_list_a()
{
    $MYSQL -B -u $USER --password="$PW" $DB1 <<EOF
select name, value from mytable_a
EOF
}
get_list_b()
{
    $MYSQL -B -u $USER --password="$PW" $DB2 <<EOF
select name, value from mytable_b
EOF
}

get_list_a > test.txt


Now I need to combine a and b, remove all duplicates (the key is name, the first column), and then write the result to test.txt. List a and list b are each assumed to be distinct on their own. If there exist x in a and y in b such that x.name = y.name, then I only want to keep x. How do I do it? Note: merging in SQL is not an option, since the tables live in different databases with different collations.

An example:
get_list_a prints

aaa bbb
ccc ddd

get_list_b prints

aaa fff
ggg hhh

I want the following to be written to the file:

aaa bbb
ccc ddd
ggg hhh
A: 

Can you word your question a little more clearly? Can you give some short example input and example output that you'd expect? It's a little unclear exactly what you're asking.

Edit: Given what you want, this should do the trick:

get_list_a  > inputfile
get_list_b >> inputfile
perl -lane '$data{$F[0]} = $F[1] unless exists $data{$F[0]} }{ for $key (keys %data) { print "$key $data{$key}" }' inputfile > outputfile

The }{ is because calling perl with -n causes the program (given by -e) to be wrapped in an implicit while (<>) { ... } loop. The } closes that while and the { opens a new code block that runs until the implicit closing }.

Calling perl with -a causes each input line to be auto-split into @F, similar to how awk has $1, $2, etc.; -l takes care of stripping and re-adding newlines. Then you add the key/value pair to %data unless the key is already there.
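
Written out long-hand, the one-liner is roughly equivalent to this standalone script (a sketch, using the same inputfile/outputfile names as above):

perl -e '
    my %data;
    while (<>) {                        # the implicit loop that -n provides
        chomp;                          # what -l does on input
        my @F = split;                  # what -a does
        $data{$F[0]} = $F[1] unless exists $data{$F[0]};
    }
    for my $key (keys %data) {          # the code after the }{
        print "$key $data{$key}\n";
    }
' inputfile > outputfile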

Daenyth
you could have asked this as a comment
TuxGeek
I have edited the question with an example
@TiNS: I can't comment yet.
Daenyth
the }{ looks weird; the brackets don't seem to match. Can you explain?
there are two input files, a.txt and b.txt, but your script only shows one?
I"m getting Missing right curly or square bracket at -e line 1, at end of linesyntax error at -e line 1, at EOF
Whoops, forgot to close the `for` loop. Fixed
Daenyth
I have seen `}{` used like that. Ick. Try `stuff_for_each_loop; END { final_statements; }`
Chris Johnsen
TMTOWTDI. Also you don't need the `;` before a `}` :P
Daenyth
+1  A: 

Would a SQL query along these lines work? (Untested)

SELECT COALESCE(x.name,y.name),COALESCE(x.value,y.value)
FROM mytable_a AS x
FULL JOIN mytable_b AS y
ON x.name = y.name;

Edit: OK, if they're in separate DBs, and the fields are space-separated as you indicate in a comment, I would probably use associative arrays in perl or awk, letting the values from x (a) overwrite the values from y (b). Something like this (still untested):

get_list_a > x.txt
get_list_b > y.txt
cat y.txt x.txt | awk '{ data[$1] = $2; } END { for (i in data) { print i, data[i]; }}'
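
With the sample lists from the question, that should print the following (awk's for (i in data) makes no ordering guarantee, so pipe through sort if the order matters):

aaa bbb
ccc ddd
ggg hhh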
coneslayer
No. First, they are from two DBs with different collations, so I couldn't join them. Secondly, doesn't COALESCE simply return the first non-null argument? The name or value will never be null.
In your statement of the problem, there was no indication that they come from different DBs (just "$DB" in both cases). And I think you WILL get NULLs in a full join if the name/value only appears in one of the two tables.
coneslayer
I have edited my question to reflect the requirement. Just curious: how would SQL resolve removing dups with precedence of x over y?
If I wrote things correctly, the precedence was handled by the order of arguments in the COALESCE expressions. That is, if both x.value and y.value were non-NULL (because both tables had a value for that name), x.value comes first and takes precedence.
coneslayer
is the END in your script for separating statements?
In awk, there's an implicit loop that executes for each line of the input. That loop is the '{ data[$1] = $2; }' part. The END introduces another block that runs once, after the last line of input has been read (see the sketch below).
coneslayer
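The same structure written out with whitespace may make it clearer (a sketch, equivalent to the one-liner above; the files can also be passed to awk directly instead of through cat):

awk '
    { data[$1] = $2 }    # runs once per input line
    END {                # runs once, after the last line
        for (i in data) print i, data[i]
    }
' y.txt x.txt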
the result from the script doesn't seem to be correct. Can you explain what the script starting from awk is doing? { data[$1] = $2; }, what are $1 and $2?
$1 and $2 are the first and second fields on the input line. For example, if the first line of y.txt is "aaa fff" as in your sample, then $1 would be "aaa" and $2 would be "fff". We would assign data["aaa"] = "fff", but that would later be overwritten by data["aaa"] = "bbb" when it got to the first line of x.txt.
coneslayer
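A quick way to see that "last assignment wins" behavior in isolation (a made-up two-line input standing in for y.txt followed by x.txt):

printf 'aaa fff\naaa bbb\n' | awk '{ data[$1] = $2 } END { for (i in data) print i, data[i] }'
# prints: aaa bbb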
makes sense, but I'm getting the last row printed n times
Try the edit, where I changed "$i" to "i" in the END block. I forgot that awk does not use $ on variable names (but DOES use it for $1, $2, etc.). Sorry I can't test it right now.
coneslayer
wonderful it works!! thank you!
A: 

Are you removing duplicates only on duplicate keys, or values too?

The command sort -u removes duplicates (the letter u is for "unique"). It has options to specify the sort key as character start and end columns. It could be as simple as this (assuming fixed value lengths or fixed column formatting):

get_list_a  > test.txt
get_list_b >> test.txt
sort -u -k<startcol>,<endcol> test.txt > output.txt

Of course, I'd rather do a merge in SQL.

kmarsh
I'm removing duplicates on duplicate keys only, and for duplicate keys, list a has precedence over list b.
Another thing: each column is separated by a space, but the column widths aren't fixed. How do I use sort -k<startcol>,<endcol> to sort by the first column (the key)?
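For space-separated fields of varying width, sort can key on field numbers rather than character columns. A sketch, assuming GNU sort (with -u, lines whose keys compare equal are collapsed to the first one seen, so list a must be written first):

get_list_a  > test.txt
get_list_b >> test.txt
sort -s -u -k1,1 test.txt > output.txt   # -k1,1: the key is the first field only; -s keeps input order among equal keys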
A: 

Just for "fun", here is a solution using sed and no temporary files (just variables):

x=$(get_list_a) 
y=$(get_list_b)

while read name value
do 
    y=$(echo "$y" | sed "/^$name /d")
done << EOF
$x
EOF

echo "$x"
echo "$y"
mrucci