espnet2.fileio.read_text.RandomTextReader
class espnet2.fileio.read_text.RandomTextReader(text_and_scp: str)
Bases: Mapping
Reader class for random access to text.
This class provides a simple text reader for unpaired text data, which is particularly useful for unsupervised automatic speech recognition (ASR). Instead of loading the entire text into memory (the corpus can be large for unsupervised ASR), the reader memory-maps the files (mmap) and retrieves each line by its byte offsets within the text file. This allows unpaired text to be selected at random for training.
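As a minimal sketch of the memory-mapping idea described above (not the ESPnet implementation itself), the snippet below maps a file and slices out one line by its byte offsets without reading the rest of the file. The file contents and the offsets are made up for illustration.

```python
import mmap
import os
import tempfile

# Write a small text file to stand in for a large unpaired-text corpus.
lines = [b"text1line\n", b"text2line\n", b"text3line\n"]
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"".join(lines))
    path = f.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; data is paged in only when accessed.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Byte range of the second line: it starts where the first line ends.
    start = len(lines[0])
    end = start + len(lines[1])
    second_line = mm[start:end].decode("utf-8")
    mm.close()

os.remove(path)
print(repr(second_line))
```

Slicing a memory map returns a `bytes` object, so only the requested range is copied into Python memory.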
text_mm
Memory-mapped object for the text file.
- Type: mmap
scp_mm
Memory-mapped object for the SCP file.
- Type: mmap
first_line_offset
The byte offset of the first line in the SCP file.
- Type: int
max_num_digits
The maximum number of digits used for each integer value in the SCP file, as declared on its first line.
- Type: int
stride
The total number of bytes per line in the SCP file.
- Type: int
num_lines
The total number of lines in the text file.
- Type: int
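The layout attributes above are related by simple arithmetic. The sketch below shows one way they could be derived from the raw bytes of an index file. It assumes that the first line holds the number of digits per integer and that every following line holds two zero-padded integers (start and end byte offsets) plus a newline. That layout, the 10-digit sample index, and the helper name `describe_index` are assumptions for illustration, not details confirmed by this page.

```python
def describe_index(scp_bytes: bytes) -> dict:
    """Derive layout values from the raw bytes of an offset index (assumed format)."""
    header_len = scp_bytes.find(b"\n")            # length of the first line
    max_num_digits = int(scp_bytes[:header_len])  # digits per integer value
    first_line_offset = header_len + 1            # skip the header and its newline
    stride = 2 * max_num_digits + 1               # two integers plus a newline (assumed)
    num_lines = (len(scp_bytes) - first_line_offset) // stride
    return {
        "first_line_offset": first_line_offset,
        "max_num_digits": max_num_digits,
        "stride": stride,
        "num_lines": num_lines,
    }


# Hypothetical three-line index with 10-digit start and end offsets.
index = (
    b"10\n"
    b"00000000000000000010\n"
    b"00000000100000000020\n"
    b"00000000200000000030\n"
)
layout = describe_index(index)
print(layout)
```

Because every index line has the same width, the position of line `i` is `first_line_offset + i * stride`, which is what makes constant-time random access possible.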
Parameters: text_and_scp (str) – A string containing the paths to the text file and the SCP file, separated by a hyphen (e.g., "text.txt-scp.txt").
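Because both paths travel in a single hyphen-separated string, a path that itself contains a hyphen is ambiguous. The hypothetical helpers below (not part of ESPnet) sketch one defensive way to build and split such a string. How the class itself reacts to extra hyphens is not stated on this page.

```python
from typing import Tuple


def join_text_and_scp(text_path: str, scp_path: str) -> str:
    """Build the combined argument, rejecting paths that contain a hyphen."""
    for p in (text_path, scp_path):
        if "-" in p:
            raise ValueError(f"path must not contain a hyphen: {p!r}")
    return f"{text_path}-{scp_path}"


def split_text_and_scp(text_and_scp: str) -> Tuple[str, str]:
    """Split the combined argument, insisting on exactly one hyphen."""
    parts = text_and_scp.split("-")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one hyphen in {text_and_scp!r}")
    return parts[0], parts[1]


text_path, scp_path = split_text_and_scp("text.txt-scp.txt")
print(text_path, scp_path)
```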
####### Examples
Suppose you have a text file with the following content:

    text1line
    text2line
    text3line

And an SCP file that looks like this:

    11
    00000000000000000010
    00000000110000000020
    00000000210000000030
You can create an instance of the RandomTextReader like this:
>>> reader = RandomTextReader("text.txt-scp.txt")

Then, you can access random lines from the text:

>>> random_line = reader[0]  # Access a random line
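To try the reader you need a text file and a matching index. The sketch below writes both. It assumes the index consists of a header line with the digit width, followed by one line per text line holding the zero-padded start and end byte offsets (end exclusive, newline included in the range). The function name `write_text_and_index`, the 10-digit width, and the end-exclusive convention are assumptions for illustration and should be checked against your ESPnet version.

```python
import os
import shutil
import tempfile
from typing import List


def write_text_and_index(
    lines: List[str], text_path: str, scp_path: str, num_digits: int = 10
) -> None:
    """Write a text file and a fixed-width byte-offset index for it (assumed format)."""
    offset = 0
    entries = []
    with open(text_path, "wb") as text_f:
        for line in lines:
            data = (line + "\n").encode("utf-8")
            text_f.write(data)
            start, end = offset, offset + len(data)
            # Zero-pad both integers so every index line has the same width.
            entries.append(f"{start:0{num_digits}d}{end:0{num_digits}d}\n")
            offset = end
    with open(scp_path, "wb") as scp_f:
        scp_f.write(f"{num_digits}\n".encode("ascii"))
        scp_f.write("".join(entries).encode("ascii"))


workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "text.txt")
scp_path = os.path.join(workdir, "scp.txt")
write_text_and_index(["text1line", "text2line", "text3line"], text_path, scp_path)

with open(scp_path, "rb") as f:
    index_bytes = f.read()
print(index_bytes.decode("ascii"))

shutil.rmtree(workdir)  # clean up the temporary files
```

Offsets are counted in bytes rather than characters, so the lines are encoded before their lengths are measured. This matters for non-ASCII text.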
NOTE
The SCP file format must follow the specified structure for the reader to function correctly.
- Raises: AssertionError – If the SCP file does not contain valid data or if the number of bytes is not consistent.
keys()
Reader class for random access to text.
This class provides a simple text reader for unpaired text data, which is useful for unsupervised automatic speech recognition (UASR). Instead of loading the entire text into memory (which can be large for UASR), the reader uses memory-mapped files to access text efficiently by byte offset. This allows unpaired text to be selected at random for training.
text_mm
Memory-mapped object for the text file.
- Type: mmap
scp_mm
Memory-mapped object for the SCP file.
- Type: mmap
first_line_offset
Offset of the first line in the SCP file.
- Type: int
max_num_digits
Maximum number of digits used for each integer value in the SCP file.
- Type: int
stride
The number of bytes for each line in the SCP file.
- Type: int
num_lines
The total number of lines in the text file.
- Type: int
Parameters: text_and_scp (str) – A string containing the paths to the text file and the SCP file, separated by a hyphen (e.g., 'text.txt-scp.txt').
####### Examples
Given a text file with lines:

    text1line
    text2line
    text3line

And a corresponding SCP file:

    11
    00000000000000000010
    00000000110000000020
    00000000210000000030
You can create a RandomTextReader instance and access lines as follows:
>>> reader = RandomTextReader('text.txt-scp.txt')
>>> print(reader[0]) # Outputs one of the text lines randomly
>>> print(len(reader)) # Outputs 3, the number of text lines
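The lookup behind `reader[0]` can be pictured as two steps: pick a random index line, then slice the text by the offsets that line holds. The sketch below does this on in-memory bytes. The fixed-width index layout, the end-exclusive offsets, and the function name `sample_line` are assumptions for illustration rather than the ESPnet implementation.

```python
import random

text = b"text1line\ntext2line\ntext3line\n"
# Hypothetical index: header with the digit width, then start/end offsets per line.
index = b"10\n00000000000000000010\n00000000100000000020\n00000000200000000030\n"


def sample_line(text: bytes, index: bytes, rng: random.Random) -> str:
    """Return one randomly chosen line of `text`, located through `index`."""
    header_len = index.find(b"\n")
    num_digits = int(index[:header_len])
    first_line_offset = header_len + 1
    stride = 2 * num_digits + 1  # two integers plus a newline (assumed)
    num_lines = (len(index) - first_line_offset) // stride

    # Step 1: choose an index line uniformly at random.
    i = rng.randrange(num_lines)
    entry_start = first_line_offset + i * stride
    entry = index[entry_start:entry_start + stride].strip()

    # Step 2: decode the two offsets and slice the text.
    start = int(entry[:num_digits])
    end = int(entry[num_digits:])
    return text[start:end].decode("utf-8")


rng = random.Random(0)  # seeded only to make the demo repeatable
samples = [sample_line(text, index, rng) for _ in range(20)]
print(sorted(set(samples)))
```

Note that no key is involved in the selection, which is consistent with the example above where `reader[0]` yields a random line rather than the first one.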
NOTE
The byte offsets recorded on each line of the SCP file must match the corresponding lines in the text file.
- Raises: AssertionError – If the maximum number of digits read from the SCP file is less than or equal to zero, or if the number of text bytes is not divisible by the stride.
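Since a malformed index surfaces only as a bare `AssertionError`, it can help to check the file beforehand and get a readable message. The hypothetical `check_index` helper below mirrors the two conditions listed above. It assumes the stride equals two integers of the declared width plus a newline, which is an assumption not confirmed by this page.

```python
from typing import List


def check_index(index: bytes) -> List[str]:
    """Return a list of human-readable problems found in an offset index."""
    header_len = index.find(b"\n")
    if header_len <= 0:
        return ["missing header line"]
    try:
        num_digits = int(index[:header_len])
    except ValueError:
        return ["header line is not an integer"]
    if num_digits <= 0:
        return ["number of digits must be greater than zero"]

    problems = []
    stride = 2 * num_digits + 1  # assumed: two integers plus a newline
    body_len = len(index) - (header_len + 1)
    if body_len % stride != 0:
        problems.append(
            f"{body_len} index bytes are not a multiple of the stride {stride}"
        )
    return problems


consistent = b"10\n00000000000000000010\n"
inconsistent = b"11\n00000000000000000010\n"  # header disagrees with the line width
print(check_index(consistent))
print(check_index(inconsistent))
```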