It's been cited by many users as the reason for switching to Pytorch, but I've yet to find a justification / explanation for sacrificing the most important practical quality, speed, for eager execution.
Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.
My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?
Looking for a detailed answer - am already familiar with broad concepts. Relevant Git
Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070
Benchmark results:
UPDATE: Disabling Eager Execution per below code does not help. The behavior, however, is inconsistent: sometimes running in graph mode helps considerably, other times it runs slower relative to Eager.
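For reference, a minimal sketch of disabling Eager in TF2 - the call must run before any model or tensor is created:
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # must precede any model/tensor creation
print(tf.executing_eagerly())           # False -> graph mode is active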
As TF devs don't seem to be around, I'll be investigating this matter myself - progress can be followed in the linked Github issue.
UPDATE 2: tons of experimental results to share, along with explanations; should be done today.
Benchmark code:
# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time
batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)
model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)
K.clear_session() # in my testing, kernel was restarted instead
model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)
Functions used:
def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x = LSTM(512, activation='relu', return_sequences=True)(x)
    x = Conv1D(128, 400, strides=4, padding='same')(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))
UPDATE 8/1730/2020: TF 2.3 has finally done it: all cases run as fast as, or notably faster than, any previous version.
Further, my previous update was unfair to TF; my GPU was to blame: it has been overheating lately. If you see a rising stem plot of iteration times, it's a reliable symptom. Lastly, see a dev's note on Eager vs Graph.
This might be my last update on this answer. The true stats on your model's speed can only be found by you, on your device.
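A minimal sketch for checking this yourself, reusing model_small, X, y from the benchmark code (assumes matplotlib; times_per_iter is a hypothetical variant of the timeit helper that returns each iteration's duration instead of the average):
import matplotlib.pyplot as plt
from time import time

def times_per_iter(func, iterations, *args):
    # hypothetical helper: like timeit(), but keeps each iteration's duration
    times = []
    for _ in range(iterations):
        t0 = time()
        func(*args)
        times.append(time() - t0)
    return times

iter_times = times_per_iter(model_small.train_on_batch, 200, X, y)
plt.stem(iter_times)  # a steadily rising stem plot suggests overheating/throttling
plt.xlabel("iteration")
plt.ylabel("sec/iter")
plt.show()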
UPDATE 5/19/2020: TF 2.2, using the same tests: only a minor improvement in Eager speed. Plots for the Large-Large Numpy train_on_batch case are below; the x-axis is successive fit iterations. My GPU isn't near its full capacity, so I doubt it's throttling, but iterations do get slower over time.
Per the above, Graph and Eager are 1.56x and 1.97x slower than their TF1 counterparts, respectively. I'm unsure whether I'll debug this further, as I'm considering switching to Pytorch per TensorFlow's poor support for custom / low-level functionality. I did, however, open an Issue to get devs' feedback.
UPDATE 2/18/2020: I've benched 2.1 and 2.1-nightly; the results are mixed. All configurations but one (model & data size) are as fast as, or much faster than, the best of TF1 & TF2. The one that's slower, and dramatically so, is Large-Large - especially in Graph execution (1.6x to 2.5x slower).
Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested - ones not explainable via randomness/compute-parallelism. I can't currently present reproducible code for these claims due to time constraints, so I instead strongly recommend testing this for your own models.
Haven't opened a Git issue on these yet, but I did comment on the original - no response yet. I'll update the answer(s) once progress is made.
VERDICT: it isn't, IF you know what you're doing. But if you don't, it could cost you, a lot - a few GPU upgrades' worth on average, and multiple GPUs' worth in the worst case.
THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.
I'll be updating my answer(s) w/ more info if I learn any - can bookmark / "star" this question for reference.
ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source - including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.
The matter, however, is considerably more complex. It isn't just TF1 vs. TF2 - factors yielding significant differences in train speed include:
- keras vs. tf.keras
- numpy vs. tf.data.Dataset vs. ...
- train_on_batch() vs. fit()
- model(x) vs. model.predict(x) vs. ...

Unfortunately, almost none of the above are independent of the others, and each can at least double execution time relative to another. Fortunately, you can determine what'll work best systematically, and with a few shortcuts - as I'll be showing.
WHAT SHOULD I DO? Currently, the only way is to experiment for your specific model, data, and hardware. No single configuration will always work best - but there are do's and don't's to simplify your search:
>> DO:

- train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
- train_on_batch() + numpy + tf.keras + TF2 + Graph
- fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data

>> DON'T:

- fit() + numpy + keras for small & medium models and data
- fit() + numpy + tf.keras + TF1/TF2 + Eager
- train_on_batch() + numpy + keras + TF1 + Eager
- [Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info
  - layers, models, optimizers, & related "out-of-box" usage imports; ops, utils, & related 'private' imports are fine - but to be sure, check for alts, & whether they're used in tf.keras
Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the "BENCHMARKS" tables in the other answer.
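As a rough illustration (not the full setup from the other answer), a minimal sketch comparing two of the above configurations - train_on_batch() vs. per-batch fit() - using the question's helpers; swap imports (keras vs. tf.keras), models, and data to cover the other combinations:
batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)

# Config A: train_on_batch()
model = make_small_model(batch_shape)
model.train_on_batch(X, y)  # warm-up; first call builds the graph
timeit(model.train_on_batch, 100, X, y)

# Config B: fit(), one batch per call
model = make_small_model(batch_shape)

def fit_once(X, y):
    model.fit(X, y, batch_size=len(X), epochs=1, verbose=0)

fit_once(X, y)  # warm-up
timeit(fit_once, 100, X, y)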
LIMITATIONS of the above DO's & DON'T's:
- Models used primarily Conv1D and Dense - no RNNs, sparse data/targets, 4/5D inputs, & other configs
- Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer

Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn't, clearly - graph is still available. But if the question is "why eager at all":

- Eager allows inspecting tensors and model internals directly in Python, e.g. via .__dict__. Graph, in contrast, requires familiarity with special backend functions - greatly complicating the entire process of debugging & introspection.

HOW TO ENABLE/DISABLE EAGER?
tf.enable_eager_execution() # TF1; must be done before any model/tensor creation
tf.compat.v1.disable_eager_execution() # TF2; above holds
Misleading in TF2; see here.
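To verify which mode is actually active (a quick check; tf.executing_eagerly() exists in both recent TF1 versions and TF2):
import tensorflow as tf

print(tf.executing_eagerly())  # True -> Eager, False -> Graph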
ADDITIONAL INFO:
- _on_batch() methods in TF2: according to the TF dev, they still use a slower implementation, but not intentionally - i.e. it's to be fixed. See other answer for details.

REQUESTS TO TENSORFLOW DEVS:
- Fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.

ACKNOWLEDGEMENTS: Thanks to
UPDATES:
11/14/19 - found a model (in my real application) that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* - except Eager, for which TF2 OOM'd)
11/17/19 - devs updated on_batch() methods in a recent commit, stating improved speed - to be released in TF 2.1, or available now as tf-nightly. As I'm unable to get the latter running, I'll delay benching until 2.1.
2/20/20 - prediction performance is also worth benching; in TF2, for example, CPU prediction times can involve periodic spikes
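A minimal sketch for spotting such spikes, assuming tf.keras in TF2 (where a model can be called directly on a numpy array) and the question's helpers:
X, _ = make_data(batch_shape)
model = make_small_model(batch_shape)
model.predict(X)  # warm-up

for name, call in [("model.predict(x)", model.predict), ("model(x)", model)]:
    times = []
    for _ in range(100):
        t0 = time()
        call(X)
        times.append(time() - t0)
    # spikes show up as max far exceeding the mean
    print("%s: mean %.4f s, max %.4f s" % (name, sum(times) / len(times), max(times)))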