The performance statistics is a display rendered above the Ventuz output. It can be enabled from the designer or the live options to analyze performance problems in a scene. This has been enhanced as of Ventuz 4.04.00.
The performance statistics contain a lot of information. In many cases displaying it all can be a bit overwhelming and requires a lot of screen real-estate. So individual sections can be enabled and disabled and with the Frame Count the amount of visible frames can be adjusted:
With all sections enabled, it can looks like this:
The display is updated every frame. By pressing the pause key, the update of all sections can be paused.
Here is an explanation of these sections. This is done with the drop-down button.
This gives a detailed breakdown of what happens on the main CPU thread and the GPU. If SDI boards or Webcams are in use, it will also display the timing of the Video IO thread.
The percentages are averaged over the last 64 frames. The colored bars show in more detail which part of Ventuz is performing worst. By hovering over the bars, a tool-tip explains what the bars stand for. The vertical white lines represent frames. The display is configured to show two frames, assuming you are most interested in performance when things are bad.
Please note that the performance statistics are about frames, not fields. So when running at 1080i50, expect a framerate of 25Hz, and the bars show information about rendering two fields. This is especially interesting for the video engine, as it does most of its work per frame and not per field.
In general:
By default, bars for two frames are displayed. If the scene runs slower than two frames, the bars are truncated. To display more or less then two frames, the number of frames can be changed with the FrameCount setting in the performance pull-down menu.
To see spikes in performance better, enable the History Mode with the drop-down box. Pressing the pause key on keyboard will pause and continue the update of the bars. When the history mode is on and pause mode is on, use the page-up and page-down keys to scroll. Pressing the control key will scroll faster.
The numbers before the lines are the last three digits of the cluster clock
When analyzing performance, the first thing to notice is whether it is the CPU or GPU that is limiting Ventuz performance.
Unfortunately, the GPU measurement is sometimes incorrect. Especially when the GPU isn't doing much, it might stall in the middle of the measurement, displaying a high GPU utilization while it is actually doing little. Enable the Stall GPU option: this will negatively effect frame-rate, but correctly tell whether CPU or GPU is the bottleneck.
By default, the display uses accumulated bars. When this is disabled, the bars show the exact timing as it happens on the CPU or GPU, and not just accumulated values. Also, one can see the various periods where the CPU is possibly waiting for the GPU (blue "idle" bars). In this mode, one can also see how the bars are stacked: ContentValidation and HierarchyValidation are stacked on top of Validation, which shows the total time including what is stacked on top.
Such detailed performance measurement is not for free, and especially the "Stall GPU" option slows things down. When the performance monitor is completely disabled, or the timing section is disabled, no data is recorded and therefore impact to performance is negligible. You can see the difference by switching the timing section ("Performance" in the menu) on and off while keeping the system information on.
More information about how to interpret this below.
Clock is the internal cluster-synchronized clock of Ventuz. It counts video frames, or fields when in interlaced mode. This line also shows the Cluster clock converted into seconds, minutes, hours and days. Then follows the Cluster ID of this machine, and the Machine ID from which the clock is taken (in a cluster).
The last few numbers count the frames that were skipped or duplicated because of cluster synchronization, and the reason for the last skipping or duplication.
Framerate is the frame-rate and frame duration, accumulated over the last 64 frames (or fields). Again, in interlaced mode this is the rate of full frames with two fields. It shows the actual measured frame-rate and the expected frame-rate. This information is displayed again, converted into milliseconds.
When tweaking performance, think in milliseconds, not fps. You can easily do math in milliseconds, like 3ms + 7ms = 10ms. You can't do that with FPS. 333fps + 143 fps = 100 fps? That's correct, the same numbers as above, but it just doesn't make sense. Also, getting from 40 fps to 50 fps is 5ms, getting from 50 to 60 is 3.3 ms. Squeezing a millisecond is always the same, a difference in fps is essentially meaningless.
CPU Load Tells the total CPU load by the Ventuz process. The Performance Timing bars above only capture the main thread and the video thread. The most important other threads are:
CPU Memory shows the percentage of physical memory Ventuz takes up, and how much memory is swapped on disk. The values come directly from the operating system, so they include memory that is allocated indirectly, for instance by DirectX.
GPU Memory shows the percentage of GPU memory Ventuz uses. The values are counted manually by Ventuz whenever it creates or destroys a resource. Since the driver may have some overhead that is not knowable to Ventuz, the actual amount of GPU memory used may be larger.
For instance, the driver may need to round up the width of a texture to the next power of two, so a 640 pixel wide texture gets rounded to 1024, a substantial difference. Ventuz can not know this and reports the incorrect amount.
GPU: Indices is the number of Vertices drawn ignoring reuse. Each triangle contributes 3 indices, each line only 2. DrawCalls are the number of times something is drawn. With font rendering, multiple draw calls can be combined to a batch. The number of Batches is the most important indicator on how much time is spend in the graphics driver. Another trick used by font rendering to reduce the number of batches is Instancing.
Gfx-State: Each material is represented as a gfx-state when it is actually drawn. Each frame the states are set and possibly need an update. Updating is especially expensive if the shader-signature has changed and the exact shader variant has to be determined. this gets even more expensive if the shader is new and needs to be compiled. You should make sure your Signature values are near 0 and Compiles are always 0.
Dynamic Geometry: Ventuz will do some FastUploads for interal purposes. When a geometry is animated by binding it, this will issue a SlowUpload when a value changes. How slow such an update is depends on how complex the vertex format is: Most vertex formats will undergo no conversion or a QuickVertexConvert. Sometimes a SlowVertexConvert has to be done, this should be avoided on animated geometries.
Creation Calls are an exceptionally expensive part of the DirectX API and should only happen during scene loading.
Persistent Resources are resources that accumulate and are never deleted until Ventuz is quit. VertexFormats and VertexConverters are cheap internal resources that should stay below 25. Shaders is the number of different shaders that have been used since Ventuz started.
GPU Memory: this shows how GPU memory is divided into geometry (vertex and index buffers), textures and Color/Depth Target (textures used as rendertarget or depth buffer).
This is a list of all Audio and Video Devices and their states.
The states mean:
The bar shows how many buffers are allocated and if they are full. This is only shown when Ventuz manually manages buffers, like for SDI streams. For an output stream, the buffers should be as full as possible, and when Ventuz is too slow the buffers are drained, a drop is only visible when the queue gets empty. For input streams, the buffers should be as empty as possible, when Ventuz is too slow the queue fills up so that Ventuz can process the images later on. Input frames are only dropped if the queue overflows.
Next is a list of audio cross-points. It shows the amount of delay that is associated with the cross-point. Delay is needed for windows audio devices or asynchronous inputs.
For each enabled movie related node, this shows information about the movie. This is especially important since most of the CPU load done by movies is not captured in the performance timing bars.
For each movie, the state, position, lead, buffering and file-name is shown.
The states are:
When multiple scenes are load into Ventuz, all the inactive scenes have still their audio player initalized. This may waste lots of GPU and CPU memory for the frame-buffers. Movies from inactive scenes have gray text instead of white. Ventuz will automatically take care to put movie clips from inactive scenes into pause mode, so that at least no effort is wasted in decoding frames that are never displayed.
The lead works like this: When the clip is not in async mode, and Ventuz runs in a cluster, it can happen that seeking to a position is not good enough because when the seek is complete, time has passed and the video is already too late. To avoid this problem, always pause the video, seek, wait a bit and then un-pause. If this is not done, Ventuz will notice the problem and automatically seek to a position slightly in the future, like when shooting skeet. A lead of 0 means no problem, 1 means that Ventuz considers leading but hasn't done it already, and higher numbers indicate increasing amounts of lead-time.
The buffers show how many buffers Ventuz has allocated and how many frames in the future are available. If these buffers run dry, increase the number of buffers for that movie. At the end of each line you can see the exact Cluster Clock when each Movie Started. Like the lead this is mainly useful to check syncronization of movies in a cluster.
In the Ventuz Designer, one can jump directly to a movie node by right-clicking on the name of a movie. A message is displayed to hint this:
Here the top movie is in an inactive scene (therefore it is displayed in gray, by right-clicking on it one can see where the node is hidden.
When a camera tracking system is used, debug information can be shown. Ventuz will display the view and projection matrix, plus the K1 and K2 values and lens center shift in x and y for the lens distortion, all received from the tracking system.
The layer engine will magically convert your layering into an optimized set of shaders. Each shader call can blend up to 8 layers together, if conditions are right.
To check how well the engine does in combining effects into single shaders, you can look ad the layer processing.
Each line represents:
The log will reference textures and rendertargets with !#00, where 00 is replaced by a number assigned to the texture / rendertarget. Blits reference their sources by referencing to other lines in the log, using the notation @000. In the example you can see how a scene is rendered to color buffer !#06 and depth buffer !#07, then the color buffer is resolved to !#08 and the whole thing is referenced as line @003 in the first blit.
Some effect will create only a small shader snipped in one of the blits, like the grayscale and the color correct effect. Other effects, like the blur, will create multiple shaders, even changing depending on parameters. In the example above, the two downsample and the two blur blits are generated by a blur effect
Layer Blending will create Blend.XXXX blits, with XXXX being the blend-mode. The last blit line shows how multiple blends and other effects are merged into a single blit, the two nops being leftovers from the merge process.
As a rule of thumb: the less blits the better, and everything with blur in the name and the downsample box are especially expensive.
The performance monitor allows high performance logging. This is divided in two sections: Frame logging is replaced every frame, and can be used for watching values as they changed. This will look like the other performance statistics.
The console will work like a normal log file where messages printed will scroll endlessly. This is useful to capture events and state changes.
Use the Log Node or scripts to do frame and console logging.
The strings [sb:10/100], [mb:10/100], [lb:10/100] will paint a small, medium or long bar which will be filled for 10 of 100 parts (10 %). This bar looks like the Memory bars in the System Information section. The numbers will be printed as written, and the bar-length will be calculated by the division. So 33/100 will yield approximately the same bar-length as 1/3, but display different numbers.
[! will print text brighter until ] is encountered, line in dim [! bright] dim.
The text-color can be changed with [tc:rgb] and [tc:rrggbb], where r, g and b must be replaced by web-style hex-colors. Again, the text in the brackets will be effected. Example: white [tc:ff0000 red] white.
When logging, c# will convert \n to linefeed (ASCII 10) and \f to formfeed (ASCII 12). Linefeeds will work normally, no \r carriage return (ASCII 13) is needed. Formfeeds will insert a half-sized linefeed. This is used between the sections of the performance statistics.
The codes \\, \[ and \] are parsed by the logging system and can be used for escaping. Other uses of the backslash will print the backslash normally. So in a c# script, one can write \\[ to escape a [.
When using resolutions larger than HD, it becomes noticeable that the console logging area is limited to 238 x 96 characters. Frame logging is limited to 8095 characters total.
For frame-logging use Ventuz.Kernel.Tools.PerfMon.PrintFrame . This information will be overwritten every frame.
For console-style logging use Ventuz.Kernel.Tools.PerfMon.PrintConsole.
All this logging is high-performance and when the performance monitor is disabled, the only cost of this is the formatting of the string.
This explains how to add custom colored-bars to the performance timing.
First, a token must be created as a static member:
public static Ventuz.Kernel.Tools.PerfMon.Token Script1 = new Ventuz.Kernel.Tools.PerfMon.Token("Script1", 0xff8000);
The token has a name and a color. The color is given as hex web-color in RRGGBB format.
To measure something, use the token:
using(Script1.MeasureCpu) { System.Threading.Thread.Sleep(3); }
Here is a complete example:
using System; using Ventuz.Kernel; using Ventuz.Kernel.Tools; public class Script : ScriptBase, System.IDisposable { private bool changed; public Script() { } public virtual void Dispose() { } // Token for Performance Timing public static PerfMon.Token Script1 = new PerfMon.Token("Script1", 0xff8000); public override void Validate() { changed = true; // Logging that will be overwritten every frame PerfMon.PrintFrame("Hello World\n"); // Logging that will spam the console PerfMon.PrintConsole("LogLogLog\n"); // wait a few milliseconds, and add a bar to timing using(Script1.MeasureCpu) { System.Threading.Thread.Sleep(3); } } public bool OnEnable(int arg) { Ventuz.Kernel.Tools.PerfMon.ConsoleLogEnabled = true; Ventuz.Kernel.Tools.PerfMon.FrameLogEnabled = true; return false; } public bool OnDisable(int arg) { Ventuz.Kernel.Tools.PerfMon.ConsoleLogEnabled = false; Ventuz.Kernel.Tools.PerfMon.FrameLogEnabled = false; return false; } public override bool Generate() { if (changed) { changed = false; return true; } return false; } }
To make this example work, add an integer input to the C# script node and connect it to the cluster-clock output of the SystemID node. This will update the script every frame, although the value of cluster-clock is ignored.
If working in Designer with RenderSetups of high resolution in Machine or Cluster Preview modes, the Performance Statistics rendering might get very tiny. To scale it up to the preview window size, enable this option.
Assume a computer with a 4 core processor with hyper-threading. That makes 8 logical threads.
If a Ventuz scene takes 60% on the Main Thread, and no other time-consuming tasks are done (movie replay, heavy load on the DirectX driver), the CPU percentage should show ca. 7%, as the CPU is able to run 8 threads at the same time, and one of these threads is at 60% load.
Now, hyper-threading is not perfect. One can expect a 50% performance boost through hyper-threading at best, making it feel more like a 5-6 core than an 8 core machine. Still the CPU percentage display will consider all threads the same, as the actual performance boost is highly depending on the application.
The movie playback does not show on the Main Thread. But it shows in the following way:
When playing low frame-rate video (24 or 30 fps) on full frame-rate outputs (50 or 60 fps), the GPU processing is not performed every frame. Turning on the History option for the performance timing reveals this: The GPU processing is done only every second or third frame. Be aware that while the average GPU load may be below 100%, individual frames may spike beyond the 100% mark.
In the AV Inputs/Outputs section, the delay of the audio crosspoints is displayed. This delay is inserted automatically by Ventuz to have some buffer when audio devices do not deliver data completely synchronous. Lets consider an example where an audio crosspoint goes from SDI-In to SDI-Out for a refresh rate of 59.94 Hz (Interlaced). The delay will be quite larget. What's going on here?
So for shortest delays:
Red(ish): Validation
Green(ish): Rendering
Some shaders get special colors:
Blue(ish): Idle
The Idle markers are only shown when Accumulated Bars is switched off.
Yellow(ish): Postprocessing
Cyan(ish): Resource Generation
Video Process:
Other