I am trying to capture real-time streaming financial market data with Python. I want to store the information in a database first and then, at a later date, develop a program that analyzes the data and makes trading decisions based on it. It would also be nice to eventually display the data in some graphical format, either on a website or in a Jupyter notebook.
As a starting point, I figured I would use Bitcoin data from either GDAX or Gemini. I want to capture tick data and potentially additional order-book information too, if that is feasible.
In doing some research I am a bit overwhelmed by the options out there and could use some guidance on how to structure the project and which libraries would be most appropriate.
I have looked at the docs for each service's API as well as a few GitHub projects and am still unsure of where to start. Any advice, suggestions, or recommended reading would be greatly appreciated.
If I may add a few cents after a few decades of hands-on experience in architecture and design reasoning:
          0.1 ns - NOP
          0.3 ns - XOR, ADD, SUB
          0.5 ns - CPU L1 dCACHE reference ( 1st introduced in the late 1980s )
          0.9 ns - JMP SHORT
          1   ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance -- will stay, throughout any foreseeable future :o)
          3~4 ns - CPU L2 CACHE reference ( 2020/Q1 )
          5   ns - CPU L1 iCACHE Branch mispredict
          7   ns - CPU L2 CACHE reference
         10   ns - DIV
         19   ns - CPU L3 CACHE reference ( 2020/Q1, considered slow on 28c Skylake )
         71   ns - CPU cross-QPI/NUMA best case on XEON E5-46*
        100   ns - MUTEX lock/unlock
        100   ns - own DDR MEMORY reference
        135   ns - CPU cross-QPI/NUMA best case on XEON E7-*
        202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000   ns - Compress 1K bytes with Zippy PROCESS
     20,000   ns - Send 2K bytes over 1 Gbps NETWORK
    250,000   ns - Read 1 MB sequentially from MEMORY
    500,000   ns - Round trip within the same DataCenter
 10,000,000   ns - DISK seek
 10,000,000   ns - Read 1 MB sequentially from NETWORK
 30,000,000   ns - Read 1 MB sequentially from DISK
150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|  |   |   |
|  |   | ns|
|  | us|
|ms|
Decide how much "real-time" your ambition actually is, as this makes the biggest difference. If you technically have to implement a process control-loop, its stability criterion puts a hard threshold on the RTT-End-to-End chain ( event-origination + transport + local acquisition + local processing + feedback-loop ): for, say, a 1 kHz loop that whole chain must stay << 1 [ms], for a 10 kHz loop << 100 [us]. Knowing this "intention" helps you decide on a proper and feasible technology that can ( or principally cannot ) fulfill said target. Choosing inadequate technology is a common ( and expensive ) mistake of amateurs ( but one also often repeated even by professional software houses ).
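To make that budget concrete, here is a minimal sketch of the arithmetic implied above; the safety factor used to approximate the "<<" is my own assumption, not a universal constant:

# Hypothetical helper: translate a control-loop rate into a per-cycle RTT-E2E budget.
# The "<<" from the text is approximated here by a safety factor of 10.
def rtt_e2e_budget_us(loop_rate_hz: float, safety_factor: float = 10.0) -> float:
    period_us = 1_000_000.0 / loop_rate_hz   # one control-loop period in [us]
    return period_us / safety_factor         # what the whole E2E chain may consume

print(rtt_e2e_budget_us(1_000))    #  1 kHz loop -> 100.0 [us] E2E budget ( << 1 [ms] )
print(rtt_e2e_budget_us(10_000))   # 10 kHz loop ->  10.0 [us] E2E budget ( << 100 [us] )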
After knowing the realistic value of the RTT-E2E threshold, check, before going any step further, whether the API-provider indeed allows updates to be served that frequently, in both the technical and the commercial sense ( a quick check is sketched below ). If not, you have to find another data-feed provider who does. If your real-time system cannot be fed from an adequately sampled ( fast ) feeding-source, the control-loop will not operate within its range of stability -- which is bad, about as bad as trying to watch a television that shows only roughly every 37th~46th picture of the HD video stream. You would not be able to watch that "sub-sampled choppy something" ( that is, assuming your local DVB-T decoder did not give up showing anything at all in the face of such a brutal scale of DVB-T stream-protocol violations; the checkered artifacts are obnoxious when even just a few CRC-bits of the Reed/Solomon-redundant, error-tolerant transcoding fail to get through correctly ), and it would definitely be nothing nice compared to the stable and error-free 1:1 FullHD-stream.
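For that "does the feed really update that often" check, a rough sketch like the following can help. It assumes the publicly documented Gemini market-data WebSocket endpoint and the third-party websockets package; swap in your chosen venue's feed URL and subscription protocol as needed:

import asyncio
import statistics
import time

import websockets  # third-party: pip install websockets

# Assumed public market-data endpoint (per Gemini docs); GDAX/Coinbase has its own feed URL.
FEED_URL = "wss://api.gemini.com/v1/marketdata/BTCUSD"

async def measure_feed_gaps(n_msgs: int = 500) -> None:
    """Receive n_msgs events and report their inter-arrival gaps in [us]."""
    gaps_us = []
    async with websockets.connect(FEED_URL) as ws:
        prev = time.perf_counter_ns()
        for _ in range(n_msgs):
            await ws.recv()                        # one raw market-data event
            now = time.perf_counter_ns()
            gaps_us.append((now - prev) / 1_000)   # gap since the previous event, in [us]
            prev = now
    gaps_us.sort()
    print("median inter-arrival gap [us]:", statistics.median(gaps_us))
    print("p95    inter-arrival gap [us]:", gaps_us[int(0.95 * len(gaps_us))])

asyncio.run(measure_feed_gaps())

The reported gap percentiles tell you the sampling rate the provider actually delivers, which is what has to be compared against your control-loop threshold.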
Having both the realistic value of the RTT-E2E threshold and a matching API-vendor available, start designing the technical steps ( not yet the technology that implements them ) by which you plan to mask the real-world real-time latencies -- the actual time-costs in [us] of the transport ( plus all the associated encryption / decryption schemes, be it an SSL-tunnel, a VPN or other ) from the API-provider's reference access-point down to your processing-engine's input. Without knowing this, all your further decisions would be flawed.
Having measured that real-world down-stream latency across the whole ISO/OSI stack ( a rough measurement sketch follows below ), double that time in [us], so as to also reserve a reasonable amount of time to cover the up-stream latency ( which need not have exactly the same sources of latency, but its order of magnitude will be near ).
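As a coarse first measurement of that transport cost, something like the following can serve; the Gemini public ticker URL is assumed here purely as an illustrative target, and each sample covers a full request/response round trip ( down-stream plus up-stream plus server time ), so treat the figures as a ceiling rather than a one-way latency:

import statistics
import time

import requests  # third-party: pip install requests

# Assumed public REST endpoint, used here only as a measurement target.
URL = "https://api.gemini.com/v1/pubticker/btcusd"

def sample_rtt_us(n: int = 50) -> list[float]:
    """Time n keep-alive HTTPS round trips to the provider's gateway, in [us]."""
    session = requests.Session()   # re-use the TCP/TLS connection after the first request
    samples = []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        session.get(URL, timeout=5)
        samples.append((time.perf_counter_ns() - t0) / 1_000)
    return samples

rtt = sample_rtt_us()
print("min / median / max RTT [us]:",
      round(min(rtt)), round(statistics.median(rtt)), round(max(rtt)))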
In case your initial design targets the said 1 kHz control-loop threshold, take the max( DownStream ) + max( UpStream ) ceiling of the latencies observed experimentally -- measured live against the provider's real-world production API-gateway ( the one that was checked for matching your 1 kHz control-loop threshold ) over at least one 24/7 duty-cycle -- and calculate how much time is left for any sort of local processing, i.e. how many [us] actually remain for your local processing such that your 1 kHz control-loop does not fail to meet its stability threshold ( a worked example follows below ).
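A minimal worked example of that remaining-budget arithmetic; every latency figure below is an illustrative placeholder, to be replaced with your own 24/7-measured ceilings:

# All latency figures below are illustrative placeholders, not measurements.
LOOP_RATE_HZ      = 1_000                       # the 1 kHz control-loop from the text
PERIOD_US         = 1_000_000 / LOOP_RATE_HZ    # 1000 [us] per control-loop cycle
MAX_DOWNSTREAM_US = 380                         # observed 24/7 ceiling ( example )
MAX_UPSTREAM_US   = 420                         # observed 24/7 ceiling ( example )

local_budget_us = PERIOD_US - (MAX_DOWNSTREAM_US + MAX_UPSTREAM_US)
print(f"net local-processing budget: {local_budget_us:.0f} [us] per cycle")

if local_budget_us <= 0:
    print("negative budget -> the control-loop cannot meet its stability threshold; stop")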
If your net local-processing time-budget has already turned negative at this point, you know that the effort cannot yield anything good. It simply cannot. Full stop.
If your net local-processing time-budget still leaves a decent amount of [us], that figure is your actual design target: start designing the computing steps and measures so that they always safely fit within it.
Anybody who starts coding without knowing this principal design-limit is either naive or irresponsible -- a "just-teaching" academician or a headbang-lover ( who loves to fail and does not mind running again and again into a known wall ) -- whereas a systematic professional designer now gets focused and starts all the careful design efforts against the collected hard facts and evidence-supported design-targets ( and would never spend a second on a task without knowing this principal set of constraints, which only Alice in Wonderland can abstract away from -- "if you don't know where you want to go, any road will take you there" is a nice fairytale, but not a real-world practice to experiment with, is it? ).
Finally, measure the actual per-step overhead of your candidate tooling in [us] -- if it does not fit a remarkable number of times inside your remaining net local-processing time-budget, one may straight away forget about using Python, it is that simple ( and objections about multi-threaded processing do not make things any better in the Python domain: "... in Python, the GIL means that even if you have multiple threads chugging simultaneously on a computation, only one of those threads will actually be running at any given instant, because all of the other ones will be blocked, waiting to acquire the global interpreter lock. That means that the multithreaded Python program will actually be slower than the single-threaded version, rather than faster, since only one thread runs at a time -- plus there is the accounting overhead incurred by forcing every thread to wait for, acquire, and then relinquish the GIL ( round-robin style ) every few milliseconds. ..." ).
And it does not help that wannabe experts post infinite amounts of free-of-charge advice that this or that package is such a great tool. Sure, many packages are indeed great tools, but, sorry, almost never for any tighter real-time application domain under such principal stability thresholds. Yes, I was quite happy designing and operating a python/scikit-learn, GLAN-distributed, fast AI/ML-predictive system, but only because I was pretty sure it would fit under << 1 [ms]: an 80 ~ 90 [us] threshold for the local RTT-E2E was well inside my control-loop's stability perimeter.
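To put a number on that "fits a remarkable number of times" test, a micro-benchmark sketch along these lines can be compared against the net local-processing budget computed earlier; the tick layout and field names below are invented for illustration only:

import json
import statistics
import time

# Invented sample message, roughly shaped like an exchange trade event.
SAMPLE_TICK = '{"type":"trade","price":"43210.55","amount":"0.015","timestampms":1700000000000}'

def process(raw: str, book: dict) -> None:
    """One pure-Python processing step: decode a tick and update trivial state."""
    tick = json.loads(raw)
    book["last_price"] = float(tick["price"])

def bench(n: int = 100_000) -> float:
    """Median per-tick processing cost in [us]."""
    book, costs = {}, []
    for _ in range(n):
        t0 = time.perf_counter_ns()
        process(SAMPLE_TICK, book)
        costs.append(time.perf_counter_ns() - t0)
    return statistics.median(costs) / 1_000

print(f"median per-tick cost: {bench():.2f} [us]")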
One may spend literally any amount of just-for-the-love-of-coding effort on tools that a professional would never have a reason to even start using, precisely because of their known principal inability to match and meet the control-loop stability threshold, so better take due care in the real-time system-architecture feasibility / review phase not to repeat these or similarly naive, fatal decision errors.