Demo UI Definition

Masao Someki

This page explains the current built-in demo UI asset system.

The main implementation is:

espnet3/publication/demo/assets.py

Important

The current demo UI layer is asset-based.

Core idea

Each UI spec uses a type such as audio or text. That type is resolved through UIAssetRegistry.

Each asset knows how to build:

an input component
an output component

Built-in assets

The default registry currently registers:

audio
text

These are installed through DEFAULT_UI_ASSETS.

Spec shape

Typical input spec:

ui:
  inputs:
    - key: speech
      type: audio
      label: "Input Audio"

Typical output spec:

ui:
  outputs:
    - key: hyp
      type: text
      label: "Transcription"

The important fields are:

key: runtime field name
type: registered asset type
label: optional UI label
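
As a rough illustration of the field rules above, a spec can be checked before any component is built. This is only a sketch: `check_ui_spec` and `REQUIRED_FIELDS` are hypothetical names, not part of espnet3, and treating `key` and `type` as required is an assumption drawn from `label` being the only field described as optional.

```python
from typing import Any

# Assumption: "key" and "type" are required, "label" is optional.
REQUIRED_FIELDS = ("key", "type")


def check_ui_spec(spec: dict[str, Any], known_types: set[str]) -> list[str]:
    """Return a list of problems found in one spec; empty means it looks fine."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not spec.get(field):
            problems.append(f"missing required field: {field}")
    spec_type = spec.get("type")
    if spec_type and spec_type not in known_types:
        problems.append(f"unregistered asset type: {spec_type}")
    return problems


specs = [
    {"key": "speech", "type": "audio", "label": "Input Audio"},
    {"key": "hyp", "type": "text"},  # no label: still accepted
    {"type": "video"},               # no key, and "video" is not registered
]
report = [check_ui_spec(s, known_types={"audio", "text"}) for s in specs]
```

Here the first two specs produce empty problem lists, while the third reports both the missing `key` and the unregistered type.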

How components are built

The recipe launcher calls:

session.build_input_component(spec)
session.build_output_component(spec)

Those methods ask the registry for the asset named by spec["type"].
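
The lookup described above can be pictured with a small stand-alone sketch. It does not reproduce the espnet3 classes: `SketchAsset`, `SketchRegistry`, and the `resolve` method are hypothetical stand-ins, and the assets return plain dicts instead of Gradio components so the example runs without Gradio installed.

```python
class SketchAsset:
    """Stand-in for UIAsset: returns plain dicts instead of Gradio components."""

    widget = "generic"

    def build_input(self, spec):
        return {"widget": self.widget, "role": "input", "label": spec.get("label")}

    def build_output(self, spec):
        return {"widget": self.widget, "role": "output", "label": spec.get("label")}


class AudioSketch(SketchAsset):
    widget = "audio"


class TextSketch(SketchAsset):
    widget = "text"


class SketchRegistry:
    """Maps a type string to an asset class (hypothetical, for illustration)."""

    def __init__(self):
        self._assets = {}

    def register(self, name, asset_cls):
        self._assets[name] = asset_cls

    def resolve(self, name):
        if name not in self._assets:
            known = ", ".join(sorted(self._assets))
            raise KeyError(f"unknown UI asset type {name!r}; registered: {known}")
        return self._assets[name]()


registry = SketchRegistry()
registry.register("audio", AudioSketch)
registry.register("text", TextSketch)


def build_input_component(spec):
    # Look up the asset named by spec["type"], then let it build the input.
    return registry.resolve(spec["type"]).build_input(spec)


def build_output_component(spec):
    return registry.resolve(spec["type"]).build_output(spec)


audio_in = build_input_component({"key": "speech", "type": "audio", "label": "Input Audio"})
text_out = build_output_component({"key": "hyp", "type": "text", "label": "Transcription"})
```

In this sketch an unregistered type fails loudly with a `KeyError` that lists the registered names; how the real registry reports that case is not covered on this page.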

Writing a custom app.py

The standard app.py reads your YAML specs and builds a Gradio UI automatically. When you need something the YAML cannot express, such as a different layout, an extra input, or custom components, write your own app.py. The steps below show how it works from the inside.

Step 1 — Load the session

This reads demo.yaml and loads your model.

from espnet3.publication.demo.session import load_demo_session

session = load_demo_session(demo_dir, demo_config_path)

# session now holds:
#   session.input_specs   — list of dicts from ui.inputs in your YAML
#   session.output_specs  — list of dicts from ui.outputs in your YAML
#   session.model         — your loaded InferenceModel
#   session.registry      — UIAssetRegistry (audio, text built-ins)

Step 2 — Create the inference function

This wires the Gradio input values to the model and the model results to the output components.

# Standard: keys come entirely from YAML specs
inference_fn = session.create_inference_fn(
    session.input_specs,
    session.output_specs,
)

# Custom: override input_keys to add extra inputs not in the YAML
inference_fn = session.create_inference_fn(
    input_keys=[
        *[spec["key"] for spec in session.input_specs],
        "image",  # ← not in YAML, added here
    ],
    output_keys=[spec["key"] for spec in session.output_specs],
)

# At call time, inference_fn builds this dict and calls your model:
#   item = {"speech": <audio>, "image": <numpy>}   # positional, in key order
#   result = model(item, **call_args)

The rule: the positional order of inputs passed to btn.click() must exactly match the order of input_keys passed to create_inference_fn(). That's how each Gradio value gets mapped to the right key in item.
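
To make the ordering rule concrete, the sketch below mimics the mapping with a toy function. `make_inference_fn` and `echo_model` are hypothetical; the real `create_inference_fn()` may differ in details such as argument handling and return shape. What it shows is that values are paired with keys purely by position.

```python
def make_inference_fn(model, input_keys, output_keys):
    """Return a function that maps positional values onto input_keys."""

    def inference_fn(*values):
        if len(values) != len(input_keys):
            raise ValueError(
                f"expected {len(input_keys)} inputs, got {len(values)}"
            )
        item = dict(zip(input_keys, values))  # i-th value goes to i-th key
        result = model(item)
        return tuple(result[key] for key in output_keys)

    return inference_fn


def echo_model(item):
    # Toy model: report which value arrived under which key.
    return {"hyp": f"speech={item['speech']}, image={item['image']}"}


fn = make_inference_fn(echo_model, ["speech", "image"], ["hyp"])

matched = fn("AUDIO", "IMAGE")  # same order as input_keys
swapped = fn("IMAGE", "AUDIO")  # component order differs from input_keys
```

With the swapped order nothing raises: the image value simply lands under `speech`. A mismatch between the `inputs` list and `input_keys` therefore tends to surface later as a confusing model error instead of an immediate failure.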

Step 3 — Build the Gradio layout

Mix session components with custom ones.

import gradio as gr

with gr.Blocks(title=session.title) as app:
    if session.title:
        gr.Markdown(f"# {session.title}")

    inputs = []
    with gr.Row():
        with gr.Column():
            # build components for the YAML-defined inputs
            for spec in session.input_specs:
                inputs.append(session.build_input_component(spec))

            # add a custom gr.Image — not declared in YAML at all
            image_input = gr.Image(label="Reference Image", type="numpy")
            inputs.append(image_input)

            btn = gr.Button("Run")

        with gr.Column():
            outputs = [session.build_output_component(s) for s in session.output_specs]

    btn.click(fn=inference_fn, inputs=inputs, outputs=outputs)

Model side: making InferenceModel accept extra keys

Adding "image" to input_keys in app.py is only half the story. InferenceModel._build_single_inputs() filters the incoming sample dict down to only the keys listed in input_key before passing anything to the backend. If "image" is not declared in input_key inside inference.yaml, it will be silently dropped — the model never sees it.
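
The filtering step described above behaves roughly like the following sketch. `filter_inputs` is a hypothetical stand-in for what `_build_single_inputs()` is described as doing; the actual method may do more than this.

```python
def filter_inputs(sample, input_key):
    """Keep only the entries of sample whose key is declared in input_key."""
    return {key: sample[key] for key in input_key if key in sample}


sample = {"speech": "AUDIO", "image": "IMAGE"}

without_image = filter_inputs(sample, ["speech"])         # "image" not declared
with_image = filter_inputs(sample, ["speech", "image"])   # "image" declared
```

In the first call the `image` entry disappears without any warning, which is the silent drop described above.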

conf/inference.yaml — declare image in input_key

# conf/inference.yaml (inside model_pack)
input_key:
  - speech
  - image   # ← without this, InferenceModel silently drops "image"

provider:
  _target_: src.provider.ImageAwareProvider

src/provider.py — subclass InferenceProvider to return an image-aware model

from espnet3.parallel.inference_provider import InferenceProvider

class ImageAwareProvider(InferenceProvider):
    # both build_dataset and build_model must be static methods
    @staticmethod
    def build_dataset(config):
        return None  # not needed for demo

    @staticmethod
    def build_model(config):
        # wrap the base ASR model to accept an image argument
        # (build_asr_model is not defined in this snippet; substitute your own model factory)
        asr_model = build_asr_model(config)
        return ImageAwareWrapper(asr_model)


class ImageAwareWrapper:
    def __init__(self, model):
        self.model = model

    def __call__(self, speech, image=None, **kwargs):
        # add image-based pre/post-processing here
        result = self.model(speech, **kwargs)
        return result

Two separate layers. input_keys in create_inference_fn() controls what the UI passes to the inference function. input_key in inference.yaml controls what the inference function passes to the model. Both need to include "image" for it to reach the backend.

Adding a new asset type

Subclass UIAsset, implement build_input(...) and/or build_output(...), then register it.

Adding to the global registry

from espnet3.publication.demo.assets import UIAsset, DEFAULT_UI_ASSETS

class PromptTextAsset(UIAsset):
    def build_input(self, spec):
        return self.gradio_module.Textbox(
            label=spec.get("label", "Prompt"),
            lines=spec.get("lines", 3),
        )

DEFAULT_UI_ASSETS.register("prompt_text", PromptTextAsset)

Use it in demo.yaml:

ui:
  inputs:
    - key: prompt
      type: prompt_text
      label: "Prompt"
      lines: 4
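
One way to check that extra spec fields such as `lines` reach the component is to run the asset against a fake Gradio module. This is a sketch under assumptions: `SketchUIAsset` is a hypothetical stand-in for `UIAsset`, and passing the module to the constructor is an assumption about how `gradio_module` gets set.

```python
from types import SimpleNamespace


class SketchUIAsset:
    """Stand-in for UIAsset; assumes the Gradio module is injected."""

    def __init__(self, gradio_module):
        self.gradio_module = gradio_module


class PromptTextAsset(SketchUIAsset):
    def build_input(self, spec):
        return self.gradio_module.Textbox(
            label=spec.get("label", "Prompt"),
            lines=spec.get("lines", 3),
        )


# Fake module: Textbox just records the keyword arguments it received.
fake_gradio = SimpleNamespace(Textbox=lambda **kwargs: kwargs)

asset = PromptTextAsset(fake_gradio)
from_yaml = asset.build_input(
    {"key": "prompt", "type": "prompt_text", "label": "Prompt", "lines": 4}
)
defaults = asset.build_input({"key": "prompt", "type": "prompt_text"})
```

The `lines: 4` value from the spec overrides the asset's default of 3, and a spec without `label` or `lines` falls back to the defaults coded in the asset.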

Adding to a specific demo only

from espnet3.publication.demo.assets import UIAsset
from espnet3.publication.demo.session import load_demo_session

class ImageUI(UIAsset):
    def build_input(self, spec):
        return self.gradio_module.Image(
            label=spec.get("label", "Reference Image"),  # label is optional in the spec
            type="numpy",
        )

# register before building the layout in app.py
session = load_demo_session(demo_dir, demo_config_path)
session.registry.register("image", ImageUI)

Now type: image resolves through the registry just like audio or text:

ui:
  inputs:
    - type: image
      key: image
      label: "Reference Image"

Rule of thumb

Use UIAsset when:

the new behavior is a reusable component type
you want to keep demo.yaml declarative

Edit app.py when:

the page layout itself changes
you need extra buttons or control flow
you want custom logic before or after calling InferenceModel

Runtime

See how UI specs are used during inference.

Demo Config

See the `ui:` section in the current demo config.