I have a large table, TokenFrequency, with millions of rows. It is structured like this:
Table - TokenFrequency
- id - int, primary key
- source - int, foreign key
- token - char
- count - int
My goal is to select every pair of rows in which two different sources have the same token. For example, if my table looked like this:
id --- source --- token --- count
1 ------ 1 --------- dog ------- 1
2 ------ 2 --------- cat -------- 2
3 ------ 3 --------- cat -------- 2
4 ------ 4 --------- pig -------- 5
5 ------ 5 --------- zoo ------- 1
6 ------ 5 --------- cat -------- 1
7 ------ 5 --------- pig -------- 1
I would want a SQL query to give me source 1, source 2, the shared token, and the sum of the two counts. For example:
source1 --- source2 --- token --- count
---- 2 ----------- 3 --------- cat -------- 4
---- 2 ----------- 5 --------- cat -------- 3
---- 3 ----------- 5 --------- cat -------- 3
---- 4 ----------- 5 --------- pig -------- 6
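To make the example reproducible, here is a minimal sketch that loads the sample rows into a throwaway database and computes the desired result by brute force. It assumes an in-memory SQLite database driven from Python's sqlite3 module as a stand-in for the real server; the names `sample_rows` and `expected` are only for illustration, and the foreign key on source is omitted because the referenced table is not shown.

```python
import sqlite3
from itertools import combinations

# Assumption: in-memory SQLite stands in for the real database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TokenFrequency ("
    " id INTEGER PRIMARY KEY,"
    " source INTEGER,"      # foreign key in the real schema (omitted here)
    " token TEXT,"
    ' "count" INTEGER)'
)

# The seven sample rows from the table above: (id, source, token, count).
sample_rows = [
    (1, 1, "dog", 1),
    (2, 2, "cat", 2),
    (3, 3, "cat", 2),
    (4, 4, "pig", 5),
    (5, 5, "zoo", 1),
    (6, 5, "cat", 1),
    (7, 5, "pig", 1),
]
conn.executemany("INSERT INTO TokenFrequency VALUES (?, ?, ?, ?)", sample_rows)

rows = conn.execute(
    'SELECT source, token, "count" FROM TokenFrequency ORDER BY id'
).fetchall()

# Brute force: every unordered pair of rows that share a token but come
# from different sources, with the smaller source listed first.
expected = sorted(
    (min(a[0], b[0]), max(a[0], b[0]), a[1], a[2] + b[2])
    for a, b in combinations(rows, 2)
    if a[1] == b[1] and a[0] != b[0]
)

for source1, source2, token, total in expected:
    print(source1, source2, token, total)
```

This is only a way to pin down the wanted output on seven rows; a pairwise loop like this would obviously not scale to the real table.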
I have a query that looks like this:
SELECT F.source AS source1, S.source AS source2, F.token,
(F.count + S.count) AS sum
FROM TokenFrequency F
INNER JOIN TokenFrequency S ON F.token = S.token
WHERE F.source <> S.source
This query returns the right pairs, but I have two problems with it:
- The TokenFrequency table has millions of rows, so I need a faster way to obtain this result.
- The current query returns duplicates. For example, it selects:
source1=2, source2=3, token=cat, count=4
source1=3, source2=2, token=cat, count=4
This isn't too much of a problem, but if there is a way to eliminate the mirrored rows and gain speed at the same time, that would be very useful.
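The mirrored rows can be reproduced on the sample data. The sketch below again assumes an in-memory SQLite database via Python's sqlite3 module rather than my real server, and the variable names are hypothetical. It runs my current query and then a variant in which the `<>` comparison is replaced by `<`, so each pair is kept only once with the smaller source first. Whether that change alone makes a meaningful difference in speed on millions of rows is not something this small test shows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TokenFrequency ("
    " id INTEGER PRIMARY KEY,"
    " source INTEGER,"
    " token TEXT,"
    ' "count" INTEGER)'
)
conn.executemany(
    "INSERT INTO TokenFrequency VALUES (?, ?, ?, ?)",
    [
        (1, 1, "dog", 1),
        (2, 2, "cat", 2),
        (3, 3, "cat", 2),
        (4, 4, "pig", 5),
        (5, 5, "zoo", 1),
        (6, 5, "cat", 1),
        (7, 5, "pig", 1),
    ],
)

# The current query (ORDER BY added only to make the output stable).
current = conn.execute(
    "SELECT F.source AS source1, S.source AS source2, F.token,"
    ' (F."count" + S."count") AS sum'
    " FROM TokenFrequency F"
    " INNER JOIN TokenFrequency S ON F.token = S.token"
    " WHERE F.source <> S.source"
    " ORDER BY source1, source2, F.token"
).fetchall()

# Variant: '<' instead of '<>' keeps one row per unordered pair.
deduped = conn.execute(
    "SELECT F.source AS source1, S.source AS source2, F.token,"
    ' (F."count" + S."count") AS sum'
    " FROM TokenFrequency F"
    " INNER JOIN TokenFrequency S ON F.token = S.token"
    " WHERE F.source < S.source"
    " ORDER BY source1, source2, F.token"
).fetchall()

print(len(current), "rows with <>")
print(len(deduped), "rows with <")
for row in deduped:
    print(row)
```

On the sample data the `<` variant returns exactly the four rows listed in the desired output, while the `<>` version returns each of them twice.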
The main issue is speed: my current query takes hours to complete. I believe the INNER JOIN of the table to itself is the problem. I'm sure there must be a way to eliminate the self-join and get similar results using just one instance of the TokenFrequency table. Eliminating the duplicate pairs might also speed up the query.
I need a way to restructure this query to provide the same results in a faster, more efficient manner.
Thanks.