After having endlessly tried to handle large (3-35gb) csv files in R, I have moved over to SQL for handling these datasets. So I am using this code within an R environment (using the using the SQlite based RSQLite package) but it should not detract from my SQL question!
My Question: How do I make a selection of one table based on matching values given in another table?
I would Like to explain by example. I have the following table format:
"Data" Table
Symbol| Value| EX
A | 1 | N
A | 1 | N
A | 2 | T
A | 3 | N
A | 4 | N
A | 5 | N
B | 1 | P
B | 2 | P
B | 2 | N
B | 2 | N
B | 3 | P
B | 5 | P
B | 6 | T
...
I want to select all entries for which the symbol and exchange value matches according to a certain condition given in the example table below.
"Symbolexchange" Table:
Ticker| Exchange
A | N
B | P
...
(Note that symbol and ticker refer to the same attribute, also EX and Exchange refer to the same attribute)
So the output I am aiming for is so that it only keeps A entries given exchange is N etc:
Symbol| Value| EX
A | 1 | N
A | 1 | N
A | 3 | N
A | 4 | N
A | 5 | N
B | 1 | P
B | 2 | P
B | 3 | P
B | 5 | P
...
I was able to do this via two methods, although I am not quite satisfied with them.
This method adds the reference table in columns next to the original table, which is redundant.
SELECT *
FROM Data
INNER JOIN Symbolexchange
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER
This Method also gets the job done directly but is slower than the above.
SELECT *
FROM Data
WHERE EX=(SELECT exchange FROM Symbolexchange WHERE ticker = SYMBOL)
Is there a better and faster way way to program this? Speed is quite important because of the size of my data sets. Any other comments on my code welcome!
Thanks
Two things that you can do to improve performance:
First (and most importantly) add a key or index to your tables. I don't know SQLite, but usually there's a command something like this:
CREATE INDEX DataIX1 ON Data(Symbol,EX)
You'll want one on the other table too:
CREATE INDEX SymbolExchangeIX1 ON Symbolexchange(Ticker,Exchange)
You may need to throw in ".." or '..' on the names...
The second thing is that although your first query is probably your best approach, you should only return the columns that you actually need/want:
SELECT Data.*
FROM Data
INNER JOIN Symbolexchange
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER