Welcome to TensorFlow Runtime Tracer documentation!
TensorFlow Runtime Tracer is a web application that monitors and traces TensorFlow scripts at runtime at the op level.
It starts a web server upon the execution of the script. The web interface keeps track of all the session runs and can trace the execution on demand.
The goal of this tool is to facilitate performance tuning with minimal code changes and insignificant runtime overhead. Both the higher-level (tf.estimator.Estimator) and low-level (tf.train.MonitoredTrainingSession and related) APIs are supported, as are Horovod and IBM Distributed Deep Learning (DDL). Tracing sessions can be saved, reloaded, and distributed effortlessly.
Quick Start
Install tensorflow-tracer and run an example:
    $ pip3 install tensorflow-tracer
    $ git clone https://github.com/xldrx/tensorflow-tracer.git
    $ python3 ./tensorflow-tracer/examples/estimator-example.py
Browse to:
http://0.0.0.0:9999
How to Use
Add tftracer to your code:
Estimator API:
    from tftracer import TracingServer
    ...
    tracing_server = TracingServer()
    estimator.train(input_fn, hooks=[tracing_server.hook])
Low-Level API:
    from tftracer import TracingServer
    ...
    tracing_server = TracingServer()
    with tf.train.MonitoredTrainingSession(hooks=[tracing_server.hook]):
        ...
Run your code and browse to:
http://0.0.0.0:9999
How to Trace Existing Code
If you want to trace an existing script without any modification, use tftracer.hook_inject(). Please note that this is experimental and may cause unexpected errors.
Add the following to the beginning of the main script:
    import tftracer
    tftracer.hook_inject()
    ...
Run your code and browse to:
http://0.0.0.0:9999
Command line
Tracing sessions can be stored either through the web interface or by calling tftracer.TracingServer.save_session().
To reload a session, run this in the terminal:
tftracer filename
Then browse to:
http://0.0.0.0:9999
Full Usage
    usage: tftracer [-h] [--port PORT] [--ip IP] session_file

    positional arguments:
      session_file  Path to the trace session file

    optional arguments:
      -h, --help    show this help message and exit
      --port PORT   To what TCP port web server to listen
      --ip IP       To what IP address web server to listen
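The usage text above maps naturally onto an argparse definition. The following is a minimal sketch of such a parser, not the actual tftracer entry point; the default values for --port and --ip are assumptions borrowed from the documented TracingServer defaults, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(prog="tftracer")
    parser.add_argument("session_file", help="Path to the trace session file")
    # Assumption: the CLI defaults match TracingServer's documented defaults.
    parser.add_argument("--port", type=int, default=9999,
                        help="TCP port on which the web server listens")
    parser.add_argument("--ip", default="0.0.0.0",
                        help="IP address on which the web server listens")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["--port", "8080", "--ip", "127.0.0.1", "session.pickle.gz"]
)
print(args.session_file, args.ip, args.port)
```

Binding to 127.0.0.1 rather than 0.0.0.0 keeps the interface reachable only from the local machine, which may be preferable since HTTPS is not supported.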
Examples
- Higher-Level API <estimator-example.py>
  Example of using tftracer.TracingServer with the TensorFlow estimator API.
- Low-Level API <monitoredtrainingsession-example.py>
  Example of using tftracer.TracingServer with the TensorFlow MonitoredTrainingSession API.
- Monkey Patching <monkey_patching-example.py>
  Example of using tftracer.hook_inject() to trace a script without any modifications.
- Horovod: One Process <horovod-example.py>
  Example of using tftracer.TracingServer with horovod. In this example, only one process is traced.
- Horovod: All Processes <horovod-all-example.py>
  Example of using tftracer.TracingServer with horovod. In this example, all processes are traced.
- Timeline <timeline-example.py>
  Example of using tftracer.Timeline to trace and visualize one session.run call without a tracing server.
- Load Session <load_session-example.py>
  Example of saving and loading tracing sessions.
- TracingServer Options <options-example.py>
  Example of setting tracing options.
API Reference
tftracer.TracingServer
class tftracer.TracingServer(**kwargs)
    This class provides a tf.train.SessionRunHook to track session runs as well as a web interface to interact with users. By default, the web interface is accessible at http://0.0.0.0:9999. The web server stops at the end of the script. Use tftracer.TracingServer.join() to keep the server alive.
    Example
    Estimator API:
        tracing_server = TracingServer()
        estimator.train(input_fn, hooks=[tracing_server.hook])
    Low-Level API:
        tracing_server = TracingServer()
        with tf.train.MonitoredTrainingSession(hooks=[tracing_server.hook]):
            ...
    Parameters:
    - start_web_server_on_start (bool) – If true, a web server starts on object initialization. (default: true)
    - server_port (int) – TCP port on which the web server listens. (default: 9999)
    - server_ip (str) – IP address on which the web server listens. (default: “0.0.0.0”)
    - keep_traces (int) – Number of traces per run that the tracing server keeps. The server discards the oldest traces when the limit is exceeded. (default: 5)
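The keep_traces behavior described above (keep a fixed number of traces per run, discard the oldest first) can be pictured with a bounded queue. This is a minimal sketch of that policy using only the standard library; BoundedTraceStore and its methods are hypothetical names and do not reflect how tftracer stores traces internally.

```python
from collections import defaultdict, deque


class BoundedTraceStore:
    """Hypothetical store keeping at most `keep_traces` traces per run."""

    def __init__(self, keep_traces=5):
        # One bounded queue per run name; each queue has its own limit.
        self._traces = defaultdict(lambda: deque(maxlen=keep_traces))

    def add(self, run_name, trace):
        # deque(maxlen=N) drops the oldest item once N items are exceeded.
        self._traces[run_name].append(trace)

    def traces(self, run_name):
        return list(self._traces[run_name])


store = BoundedTraceStore(keep_traces=3)
for step in range(5):
    store.add("train_op", "trace-%d" % step)
print(store.traces("train_op"))  # ['trace-2', 'trace-3', 'trace-4']
```

Because all traces are held in memory, a small limit of this kind bounds memory use for long-running scripts.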
    hook
        Returns a tensorflow.train.SessionRunHook object. This object is meant to be passed to the TensorFlow estimator API or to MonitoredSession.
    join()
        Wait until the web server is stopped.
    load_session(filename, gziped=None)
        Loads a tracing session into the current tracing server.
        Caution: This action discards the current data in the session.
        Parameters:
        - filename – path to the trace session file.
        - gziped (bool) – when set, determines whether the trace file is gzipped. When None, gzip is used if the filename ends with “.gz”.
    save_session(filename)
        Stores the tracing session to a pickle file.
        Parameters:
        - filename – path to the trace session file.
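The gziped rule documented for load_session (an explicit flag wins; otherwise a “.gz” suffix selects gzip) can be sketched with the standard pickle and gzip modules. The helper names below are hypothetical and the session content is placeholder data; the real session format is not described in this document.

```python
import gzip
import os
import pickle
import tempfile


def open_session_file(filename, mode, gziped=None):
    """Open a session file, using gzip when requested or implied by '.gz'."""
    if gziped is None:
        gziped = filename.endswith(".gz")
    opener = gzip.open if gziped else open
    return opener(filename, mode)


def save_session(session, filename, gziped=None):
    with open_session_file(filename, "wb", gziped) as f:
        pickle.dump(session, f)


def load_session(filename, gziped=None):
    with open_session_file(filename, "rb", gziped) as f:
        return pickle.load(f)


session = {"runs": {"train_op": ["trace-0", "trace-1"]}}  # placeholder data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.pickle.gz")
    save_session(session, path)
    restored = load_session(path)
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"  # gzip magic number
print(restored == session, is_gzip)
```

Since pickle can execute arbitrary code while loading, session files should only be loaded from trusted sources.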
    start_web_server()
        Start a web server in a separate thread.
        Note: The tracing server keeps track of session runs even without a running web server.
    stop_web_server()
        Stop the web server.
        Note: The tracing server keeps track of session runs even after the web server is stopped.
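The relationship described above, where run tracking is independent of whether the web server thread is running, can be illustrated with the standard library. This is a rough sketch under stated assumptions: MiniServer, record_run, and is_serving are hypothetical names, and the real tftracer server is not necessarily built on http.server.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


class MiniServer:
    """Hypothetical tracker whose web server runs in a background thread."""

    def __init__(self, ip="127.0.0.1", port=0):  # port 0: any free port
        self.runs = []
        self._address = (ip, port)
        self._httpd = None
        self._thread = None

    def record_run(self, name):
        # Tracking does not depend on the web server being up.
        self.runs.append(name)

    def start_web_server(self):
        self._httpd = HTTPServer(self._address, _Handler)
        # A daemon thread ends with the script, like the documented default.
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, daemon=True)
        self._thread.start()

    def stop_web_server(self):
        if self._httpd is not None:
            self._httpd.shutdown()
            self._httpd.server_close()
            self._thread.join()
            self._httpd = None

    def join(self):
        # Block until the web server thread exits.
        if self._thread is not None:
            self._thread.join()

    @property
    def is_serving(self):
        return self._thread is not None and self._thread.is_alive()


server = MiniServer()
server.record_run("before")
server.start_web_server()
server.record_run("during")
server.stop_web_server()
server.record_run("after")
print(server.runs, server.is_serving)  # ['before', 'during', 'after'] False
```

Calling join() at the end of a script, before stop_web_server(), would keep the process alive so the interface stays reachable.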
tftracer.Timeline
class tftracer.Timeline(run_metadata=None, **kwargs)
    This class traces a session run and visualizes the execution timeline.
    Example
        with Timeline() as tl:
            sess.run(fetches, **tl.kwargs)
    Parameters:
    - run_metadata (tensorflow.RunMetadata) – If set a web server starts on object initialization. (default: true)
    classmethod from_pickle(pickle_file_name, **kwargs)
        Load a timeline from a pickle file.
        Parameters:
        - pickle_file_name (str) – pickle file path.
        - **kwargs – same as Timeline class.
        Returns: a Timeline object with the content of pickle_file_name.
    kwargs
        Returns a dict of config_pb2.RunOptions. This object should be unpacked and passed to session.run.
        Example:
            session.run(fetches, **timeline.kwargs)
    step_time(device_search_pattern=None)
        Calculate the step time.
        Parameters:
        - device_search_pattern (str) – a regex pattern used to choose which devices to include. If None, all devices are used.
        Returns: the time in seconds.
        Return type: float
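How a regex device filter interacts with a step-time calculation can be sketched as follows. This is only an illustration: the event tuples are made-up data, and defining step time as the span from the earliest start to the latest end on the selected devices is an assumption, not the documented tftracer formula.

```python
import re


def step_time(events, device_search_pattern=None):
    """Span in seconds covered by events on the matching devices.

    events: list of (device_name, start_seconds, end_seconds) tuples.
    """
    if device_search_pattern is not None:
        pattern = re.compile(device_search_pattern)
        # re.search matches anywhere in the device name.
        events = [e for e in events if pattern.search(e[0])]
    if not events:
        return 0.0
    start = min(e[1] for e in events)
    end = max(e[2] for e in events)
    return end - start


# Made-up events for illustration.
events = [
    ("/job:localhost/replica:0/task:0/device:CPU:0", 0.0, 0.5),
    ("/job:localhost/replica:0/task:0/device:GPU:0", 0.25, 1.0),
    ("/job:localhost/replica:0/task:0/device:GPU:1", 0.5, 1.25),
]
print(step_time(events))              # 1.25
print(step_time(events, r"GPU:\d+"))  # 1.0
print(step_time(events, r"CPU"))      # 0.5
```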
    to_pickle(pickle_file_name)
        Save the timeline in a pickle file.
        Returns: None.
        Raises: Exception – if the timeline trace is empty.
    visualize(output_file=None, device_pattern=None)
        Visualizes the runtime_metadata and saves it as an HTML file.
        Parameters:
        - output_file (str) – the output file path. If None, the HTML content is returned instead.
        - device_pattern (str) – a regex pattern used to choose which devices to include. If None, all devices are used.
        Returns: If output_file is None, returns the HTML content; otherwise returns None.
        Return type: str
    wall_clock_elapsed
        Time elapsed in the with Timeline() statement.
        Returns: the time in seconds.
        Return type: float
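The pattern behind kwargs and wall_clock_elapsed (a context manager that supplies extra keyword arguments to the traced call and measures the time spent inside the with block) can be sketched without TensorFlow. MiniTimeline and fake_run are hypothetical stand-ins, and the "options" value is a placeholder for the run options the real class provides.

```python
import time


class MiniTimeline:
    """Hypothetical stand-in showing the context-manager pattern."""

    def __init__(self):
        self.wall_clock_elapsed = None
        self._start = None

    @property
    def kwargs(self):
        # Placeholder for the options the real class passes to session.run.
        return {"options": "trace-enabled"}

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Record the elapsed time even if the body raised.
        self.wall_clock_elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions


def fake_run(fetches, options=None):
    """Stand-in for session.run that just echoes its arguments."""
    time.sleep(0.01)
    return fetches, options


with MiniTimeline() as tl:
    result = fake_run("loss", **tl.kwargs)
print(result, tl.wall_clock_elapsed > 0)
```

The elapsed time covers everything inside the with block, so any extra work placed there is included in the measurement.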
tftracer.hook_inject
tftracer.hook_inject(*args, **kwargs)
    (Experimental) Injects a tracing server hook into all instances of MonitoredSession by monkey patching the initializer. This function is an alternative to adding hooks to estimators or sessions. Be aware that monkey patching could cause unexpected errors and is not recommended.
    This function should be called once in the main script, preferably before importing anything else.
    Example:
        import tftracer
        tftracer.hook_inject()
        ...
        estimator.train(input_fn)
    Parameters:
    - **kwargs – same as tftracer.TracingServer.
    Note: Monkey patching (like tftracer.TracingServer) works only with subclasses of MonitoredSession. For other Session types, use tftracer.Timeline.
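The general technique of monkey patching an initializer to inject a hook can be shown on a dummy class. This is a minimal sketch of the idea, not tftracer's implementation: DummySession, SubSession, and inject_hook are hypothetical, and the sketch assumes hooks is always passed as a keyword argument.

```python
import functools


class DummySession:
    """Stand-in for a session class that accepts a `hooks` argument."""

    def __init__(self, hooks=None):
        self.hooks = list(hooks or [])


class SubSession(DummySession):
    pass


def inject_hook(cls, hook):
    """Patch cls.__init__ so every new instance also receives `hook`."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, hooks=None, **kwargs):
        # Assumption: callers pass `hooks` by keyword; a positional `hooks`
        # would collide with the keyword below and raise TypeError.
        hooks = list(hooks or []) + [hook]
        original_init(self, *args, hooks=hooks, **kwargs)

    cls.__init__ = patched_init
    return original_init  # returned so the patch can be undone


original = inject_hook(DummySession, "tracing-hook")
print(DummySession().hooks)                   # ['tracing-hook']
print(SubSession(hooks=["user-hook"]).hooks)  # ['user-hook', 'tracing-hook']

DummySession.__init__ = original              # undo the patch
print(DummySession().hooks)                   # []
```

Subclasses pick up the patched initializer through inheritance, which is consistent with the note that the approach only reaches subclasses of the patched class. The positional-argument caveat in the sketch is one example of how such patches can fail unexpectedly.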
Known Bugs/Limitations
- Only Python3 is supported.
- The web interface loads JavaScript/CSS libraries remotely (e.g. vue.js, ui-kit, jquery, jquery-ui, Google Roboto, awesome-icons, …). Therefore an active internet connection is needed to properly render the interface. The tracing server itself does not require any remote connection.
- All traces are kept in memory while the tracing server is running.
- Tracing uses tf.train.SessionRunHook and is unable to trace auxiliary runs such as init_op.
- The tracing capability is limited to what tf.RunMetadata offers. For example, CUPTI events are missing when tracing a distributed job.
- HTTPS is not supported.
Frequently Asked Questions
How to trace/visualize just one session run?
Use tftracer.Timeline. For example:
    from tftracer import Timeline
    ...
    with tf.train.MonitoredTrainingSession() as sess:
        with Timeline() as tl:
            sess.run(fetches, **tl.kwargs)
    ...
    tl.visualize(filename)
Comparison to TensorBoard?
This project is a short-lived, lightweight, interactive tracing interface to monitor and trace execution at the op level. In comparison, TensorBoard is a full-featured tool to inspect the application on many levels:
- tftracer does not make any assumption about the dataflow DAG. There is no need to add any additional op to the dataflow DAG (i.e. tf.summary) or to have a global step.
- tftracer runs as a thread and lives from the start of the execution until the end of it. TensorBoard runs as a separate process and can outlive the main script.