Demo UI Definition
This page explains the current built-in demo UI asset system.
The main implementation is:
espnet3/publication/demo/assets.py
Important
The current demo UI layer is asset-based.
Core idea
Each UI spec uses a type such as audio or text. That type is resolved through UIAssetRegistry.
Each asset knows how to build:
- an input component
- an output component
Built-in assets
The default registry currently registers:
- audio
- text
These are installed through DEFAULT_UI_ASSETS.
Spec shape
Typical input spec:
ui:
  inputs:
    - key: speech
      type: audio
      label: "Input Audio"

Typical output spec:
ui:
  outputs:
    - key: hyp
      type: text
      label: "Transcription"

The important fields are:
- key: runtime field name
- type: registered asset type
- label: optional UI label
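To make the three fields concrete, here is a minimal sketch of how one parsed spec entry could be checked. The helper `check_ui_spec` is hypothetical and not part of espnet3. Treating `key` and `type` as required, and falling back to the key when `label` is missing, are assumptions made for illustration; the real loader may behave differently.

```python
# Hypothetical helper, NOT part of espnet3: checks one entry of ui.inputs / ui.outputs.
def check_ui_spec(spec, known_types=("audio", "text")):
    # Assumption: key and type are required, label is optional.
    missing = [field for field in ("key", "type") if field not in spec]
    if missing:
        raise ValueError(f"UI spec is missing required field(s): {missing}")
    if spec["type"] not in known_types:
        raise ValueError(f"unknown asset type: {spec['type']!r}")
    # Assumption: an absent label falls back to the key.
    return {**spec, "label": spec.get("label", spec["key"])}


# Dict equivalents of the YAML specs above, as they would look after parsing.
speech_spec = check_ui_spec({"key": "speech", "type": "audio", "label": "Input Audio"})
unlabeled_spec = check_ui_spec({"key": "hyp", "type": "text"})
```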
How components are built
The recipe launcher calls:
- session.build_input_component(spec)
- session.build_output_component(spec)
Those methods ask the registry for the asset named by spec["type"].
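The lookup can be pictured with a small stand-alone sketch. The classes below (`SketchAsset`, `SketchTextAsset`, `SketchRegistry`) are hypothetical stand-ins rather than the espnet3 implementation, and plain dicts take the place of Gradio components so the example runs without Gradio installed.

```python
# Hypothetical stand-ins: these are NOT the espnet3 classes, only the lookup pattern.
class SketchAsset:
    """One UI type; knows how to build an input and an output component."""

    def build_input(self, spec):
        raise NotImplementedError

    def build_output(self, spec):
        raise NotImplementedError


class SketchTextAsset(SketchAsset):
    def build_input(self, spec):
        # A real asset would return a Gradio component; a dict stands in here.
        return {"widget": "textbox", "label": spec.get("label"), "editable": True}

    def build_output(self, spec):
        return {"widget": "textbox", "label": spec.get("label"), "editable": False}


class SketchRegistry:
    def __init__(self):
        self._assets = {}

    def register(self, type_name, asset_cls):
        self._assets[type_name] = asset_cls

    def get(self, type_name):
        # Fail loudly when a spec names a type nobody registered.
        if type_name not in self._assets:
            raise KeyError(f"no UI asset registered for type {type_name!r}")
        return self._assets[type_name]()


registry = SketchRegistry()
registry.register("text", SketchTextAsset)

# The spec's "type" field selects the asset; the asset builds the component.
spec = {"key": "hyp", "type": "text", "label": "Transcription"}
component = registry.get(spec["type"]).build_output(spec)
```

In this sketch an unregistered type raises a KeyError at lookup time, which is one reasonable way to surface a typo in demo.yaml early.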
Writing a custom app.py
The standard app.py reads your YAML specs and builds a Gradio UI automatically. When you need something the YAML can't express (a different layout, an extra input, custom components), you write your own app.py. Here's how it works from the inside.
Step 1: Load the session
This reads demo.yaml and your model.
from espnet3.publication.demo.session import load_demo_session
session = load_demo_session(demo_dir, demo_config_path)
# session now holds:
#   session.input_specs  -> list of dicts from ui.inputs in your YAML
#   session.output_specs -> list of dicts from ui.outputs in your YAML
#   session.model        -> your loaded InferenceModel
#   session.registry     -> UIAssetRegistry (audio, text built-ins)

Step 2: Create the inference function
Wires Gradio values -> model -> outputs.
# Standard: keys come entirely from YAML specs
inference_fn = session.create_inference_fn(
    session.input_specs,
    session.output_specs,
)

# Custom: override input_keys to add extra inputs not in the YAML
inference_fn = session.create_inference_fn(
    input_keys=[
        *[spec["key"] for spec in session.input_specs],
        "image",  # not in YAML, added here
    ],
    output_keys=[spec["key"] for spec in session.output_specs],
)

# At call time, inference_fn builds this dict and calls your model:
#   item = {"speech": <audio>, "image": <numpy>}  # positional, in key order
#   result = model(item, **call_args)

The rule: the positional order of inputs passed to btn.click() must exactly match the order of input_keys passed to create_inference_fn(). That's how each Gradio value gets mapped to the right key in item.
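The ordering rule is easier to see in a stripped-down sketch. `make_inference_fn` and `toy_model` below are hypothetical and only imitate the positional mapping; they are not the actual create_inference_fn(), and extra call arguments are left out for brevity.

```python
# Hypothetical sketch of the positional mapping, NOT the real create_inference_fn().
def make_inference_fn(model, input_keys, output_keys):
    def inference_fn(*values):
        if len(values) != len(input_keys):
            raise ValueError(f"expected {len(input_keys)} inputs, got {len(values)}")
        # The i-th positional value is stored under the i-th key.
        item = dict(zip(input_keys, values))
        result = model(item)
        # Return outputs in output_keys order so they line up with `outputs=`.
        return tuple(result[key] for key in output_keys)

    return inference_fn


def toy_model(item):
    # Stand-in model: reports what it received under each key.
    return {"hyp": f"speech={item['speech']}, image={item['image']}"}


fn = make_inference_fn(toy_model, ["speech", "image"], ["hyp"])
right = fn("audio.wav", "photo.png")
swapped = fn("photo.png", "audio.wav")  # wrong order: values land under the wrong keys
```

Nothing in this sketch detects the swapped call, which is why the order of the component list has to be kept in sync with input_keys by hand.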
Step 3: Build the Gradio layout
Mix session components with custom ones.
import gradio as gr

with gr.Blocks(title=session.title) as app:
    if session.title:
        gr.Markdown(f"# {session.title}")

    inputs = []
    with gr.Row():
        with gr.Column():
            # build components for the YAML-defined inputs
            for spec in session.input_specs:
                inputs.append(session.build_input_component(spec))

            # add a custom gr.Image that is not declared in YAML at all
            image_input = gr.Image(label="Reference Image", type="numpy")
            inputs.append(image_input)

            btn = gr.Button("Run")

        with gr.Column():
            outputs = [session.build_output_component(s) for s in session.output_specs]

    btn.click(fn=inference_fn, inputs=inputs, outputs=outputs)

Model side: making InferenceModel accept extra keys
Adding "image" to input_keys in app.py is only half the story. InferenceModel._build_single_inputs() filters the incoming sample dict down to only the keys listed in input_key before passing anything to the backend. If "image" is not declared in input_key inside inference.yaml, it is silently dropped and the model never sees it.
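The effect of that filtering can be sketched in a few lines. `filter_sample` is a hypothetical stand-in written for illustration; the real _build_single_inputs() may differ in its details.

```python
# Hypothetical sketch of the filtering step; the real _build_single_inputs() may differ.
def filter_sample(sample, input_key):
    # Keep only the declared keys; anything else is dropped without an error.
    return {key: sample[key] for key in input_key if key in sample}


sample = {"speech": "audio.wav", "image": "photo.png"}
without_image = filter_sample(sample, ["speech"])        # "image" not declared
with_image = filter_sample(sample, ["speech", "image"])  # "image" declared
```

Because the undeclared key is dropped rather than rejected, a missing entry in input_key shows up as a model that ignores the image, not as an exception.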
conf/inference.yaml: declare image in input_key
# conf/inference.yaml (inside model_pack)
input_key:
  - speech
  - image  # without this, InferenceModel silently drops "image"
provider:
  _target_: src.provider.ImageAwareProvider

src/provider.py: subclass InferenceProvider to return an image-aware model
from espnet3.parallel.inference_provider import InferenceProvider


class ImageAwareProvider(InferenceProvider):
    # both build_dataset and build_model must be static methods
    @staticmethod
    def build_dataset(config):
        return None  # not needed for demo

    @staticmethod
    def build_model(config):
        # wrap the base ASR model to accept an image argument;
        # build_asr_model is not defined here, so substitute your own model constructor
        asr_model = build_asr_model(config)
        return ImageAwareWrapper(asr_model)


class ImageAwareWrapper:
    def __init__(self, model):
        self.model = model

    def __call__(self, speech, image=None, **kwargs):
        # add image-based pre/post-processing here
        result = self.model(speech, **kwargs)
        return result

Two separate layers.
input_keys in create_inference_fn() controls what the UI passes to the inference function. input_key in inference.yaml controls what the inference function passes to the model. Both need to include "image" for it to reach the backend.
Adding a new asset type
Subclass UIAsset, implement build_input(...) and/or build_output(...), then register it.
Adding to the global registry
Register in DEFAULT_UI_ASSETS to make the type available in all demos:
from espnet3.publication.demo.assets import UIAsset, DEFAULT_UI_ASSETS


class PromptTextAsset(UIAsset):
    def build_input(self, spec):
        return self.gradio_module.Textbox(
            label=spec.get("label", "Prompt"),
            lines=spec.get("lines", 3),
        )


DEFAULT_UI_ASSETS.register("prompt_text", PromptTextAsset)

Use it in demo.yaml:
ui:
  inputs:
    - key: prompt
      type: prompt_text
      label: "Prompt"
      lines: 4

Adding to a specific demo only
Register on session.registry instead of DEFAULT_UI_ASSETS to scope it to one app:
from espnet3.publication.demo.assets import UIAsset
from espnet3.publication.demo.session import load_demo_session


class ImageUI(UIAsset):
    def build_input(self, spec):
        return self.gradio_module.Image(
            label=spec["label"],
            type="numpy",
        )


# register before building the layout in app.py
session = load_demo_session(demo_dir, demo_config_path)
session.registry.register("image", ImageUI)

Now type: image resolves through the registry just like audio or text:
ui:
  inputs:
    - type: image
      key: image
      label: "Reference Image"

Rule of thumb
Use UIAsset when:
- the new behavior is a reusable component type
- you want to keep demo.yaml declarative
Edit app.py when:
- the page layout itself changes
- you need extra buttons or control flow
- you want custom logic before or after calling InferenceModel
