views:

40

answers:

4

I have the following code:

get_list_a()
{
    $MYSQL -B -u $USER --password="$PW" $DB1 <<EOF
select name, value from mytable_a
EOF
}
get_list_b()
{
    $MYSQL -B -u $USER --password="$PW" $DB2 <<EOF
select name, value from mytable_b
EOF
}

get_list_a > test.txt


Now I need to combine a and b, remove all duplicates (the key is name, the first column), and then write the result to test.txt. List a and list b are each assumed to be distinct on their own. If there exist x in a and y in b such that x.name = y.name, then I only want to keep x. How do I do it? Note: merging in SQL is not an option, since the tables live in different databases with different collations.

An example:
get_list_a prints

aaa bbb
ccc ddd

get_list_b prints

aaa fff
ggg hhh

I want the following to be written to the file:

aaa bbb
ccc ddd
ggg hhh
A: 

Can you word your question a little more clearly? Can you give some short example input and example output that you'd expect? It's a little unclear exactly what you're asking.

Edit: Given what you want, this should do the trick:

get_list_a  > inputfile
get_list_b >> inputfile
perl -lane '$data{$F[0]} = $F[1] unless exists $data{$F[0]} }{ for $key (keys %data) { print "$key $data{$key}" }' inputfile > outputfile

The }{ is because calling perl with -n causes the program (given by -e) to be wrapped in an implicit while (<>) { ... } loop. The } closes that while and the { opens a new code block that runs until the implicit closing }.

Calling perl with -a causes each input line to be auto-split into @F, similar to how awk has $1, $2, etc.; -l takes care of stripping and re-adding newlines. Then you add the key/value pair to %data unless the key is already there.
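
Written out long-hand, the one-liner is roughly equivalent to this standalone script (a sketch, using the same inputfile/outputfile names as above):

perl -e '
    my %data;
    while (<>) {                        # the implicit loop that -n provides
        chomp;                          # what -l does on input
        my @F = split;                  # what -a does
        $data{$F[0]} = $F[1] unless exists $data{$F[0]};
    }
    for my $key (keys %data) {          # the code after the }{
        print "$key $data{$key}\n";
    }
' inputfile > outputfile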

Daenyth
you could have asked this as a comment
TuxGeek
I have edited the question with an example
@TiNS: I can't comment yet.
Daenyth
the }{ looks weird; the brackets don't seem to match. Can you explain?
there are two input files, a.txt and b.txt, but your script only shows one?
I"m getting Missing right curly or square bracket at -e line 1, at end of linesyntax error at -e line 1, at EOF
Whoops, forgot to close the `for` loop. Fixed
Daenyth
I have seen `}{` used like that. Ick. Try `stuff_for_each_loop; END { final_statements; }`
Chris Johnsen
TMTOWTDI. Also you don't need the `;` before a `}` :P
Daenyth
+1  A: 

Would a SQL query along these lines work? (Untested)

SELECT COALESCE(x.name,y.name),COALESCE(x.value,y.value)
FROM mytable_a AS x
FULL JOIN mytable_b AS y
ON x.name = y.name;

Edit: OK, if they're in separate DBs, and the fields are space-separated as you indicate in a comment, I would probably use associative arrays in perl or awk, letting the values from x (a) overwrite the values from y (b). Something like this (still untested):

get_list_a > x.txt
get_list_b > y.txt
cat y.txt x.txt | awk '{ data[$1] = $2; } END { for (i in data) { print i, data[i]; }}'
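
With the sample lists from the question, that should print the following (awk's for (i in data) makes no ordering guarantee, so pipe through sort if the order matters):

aaa bbb
ccc ddd
ggg hhh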
coneslayer
No. First, they are from two DBs with different collations, so I couldn't join them. Secondly, doesn't COALESCE simply return the first non-null argument? The name or value will never be null.
In your statement of the problem, there was no indication that they come from different DBs (just "$DB" in both cases). And I think you WILL get NULLs in a full join if the name/value only appears in one of the two tables.
coneslayer
I have edited my question to reflect the requirement. Just curious: how would SQL resolve removing dups with precedence of x over y?
If I wrote things correctly, the precedence was handled by the order of arguments in the COALESCE expressions. That is, if both x.value and y.value were non-NULL (because both tables had a value for that name), x.value comes first and takes precedence.
coneslayer
is the END in your script for separating statements?
In awk, there's an implicit loop that executes for each line of the input. That loop is the '{ data[$1] = $2; }' part. The END introduces another block that runs once, after the last line of input has been read (see the sketch below).
coneslayer
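The same structure written out with whitespace may make it clearer (a sketch, equivalent to the one-liner above; the files can also be passed to awk directly instead of through cat):

awk '
    { data[$1] = $2 }    # runs once per input line
    END {                # runs once, after the last line
        for (i in data) print i, data[i]
    }
' y.txt x.txt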
the result from the script doesn't seem to be correct. Can you explain what the script starting from awk is doing? { data[$1] = $2; }, what are $1 and $2?
$1 and $2 are the first and second fields on the input line. For example, if the first line of y.txt is "aaa fff" as in your sample, then $1 would be "aaa" and $2 would be "fff". We would assign data["aaa"] = "fff", but that would later be overwritten by data["aaa"] = "bbb" when it got to the first line of x.txt.
coneslayer
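A quick way to see that "last assignment wins" behavior in isolation (a made-up two-line input standing in for y.txt followed by x.txt):

printf 'aaa fff\naaa bbb\n' | awk '{ data[$1] = $2 } END { for (i in data) print i, data[i] }'
# prints: aaa bbb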
makes sense, but I'm getting the last row printed n times
Try the edit, where I changed "$i" to "i" in the END block. I forgot that awk does not use $ on variable names (but DOES use it for $1, $2, etc.). Sorry I can't test it right now.
coneslayer
wonderful it works!! thank you!
A: 

Are you removing duplicates only on duplicate keys, or values too?

The command sort -u removes duplicates (the letter u is for "unique"). It has options to specify the sort key as character start and end columns. It could be as simple as this (assuming fixed value lengths or fixed column formatting):

get_list_a  > test.txt
get_list_b >> test.txt
sort -u -k<startcol>,<endcol> test.txt > output.txt

Of course, I'd rather do a merge in SQL.

kmarsh
I'm removing duplicates on duplicate keys only, and for duplicate keys, list a has precedence over list b.
Another thing: each column is separated by a space, but the column widths aren't fixed. How do I use sort -k<startcol>,<endcol> to sort by the first column (the key)?
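For space-separated fields of varying width, sort can key on field numbers rather than character columns. A sketch, assuming GNU sort (with -u, lines whose keys compare equal are collapsed to the first one seen, so list a must be written first):

get_list_a  > test.txt
get_list_b >> test.txt
sort -s -u -k1,1 test.txt > output.txt   # -k1,1: the key is the first field only; -s keeps input order among equal keys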
A: 

Just for "fun", here is a solution using sed and no temporary files (just variables):

x=$(get_list_a) 
y=$(get_list_b)

while read name value
do 
    y=$(echo "$y" | sed "/^$name /d")
done << EOF
$x
EOF

echo "$x"
echo "$y"
mrucci