Russian Open Speech To Text

Article
09/03/2024

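A collection of speech samples derived from various audio sources. The dataset contains short audio clips in Russian.

Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

This Russian speech to text (STT) dataset includes:

~16 million utterances
~20,000 hours
2.3 TB (uncompressed, .wav format, int16); 356 GB in opus format
All files were converted to opus, except for the validation datasets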
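As a rough sanity check, the totals above can be related to each other with simple arithmetic. The sketch below uses only the approximate figures quoted in the list, plus the 16 kHz, mono, 16-bit format described under audio normalization, so the results are estimates rather than exact values.

```python
# Approximate totals quoted in the list above.
utterances = 16_000_000
hours = 20_000

# Audio format described in the audio normalization section.
sample_rate = 16_000      # samples per second
bytes_per_sample = 2      # 16-bit integers, one channel

total_seconds = hours * 3600

# Average clip length implied by the totals.
avg_seconds = total_seconds / utterances

# Raw PCM size implied by the duration (decimal terabytes, headers ignored).
size_tb = total_seconds * sample_rate * bytes_per_sample / 1e12

print(f"average clip length: {avg_seconds:.1f} s")
print(f"estimated uncompressed size: {size_tb:.1f} TB")
```

The estimated size agrees with the 2.3 TB figure quoted for the uncompressed .wav files.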

The main purpose of the dataset is to train speech-to-text models.

Dataset composition

Dataset size is given for .wav files.

Dataset	Utterances	Hours	GB	Secs/chars	Comment	Annotation	Quality/noise
radio_v4 (*)	7,603,192	10,430	1,195	5 s / 68	Radio	Align	95% / crisp
public_speech (*)	1,700,060	2,709	301	6 s / 79	Public speech	Align	95% / crisp
audiobook_2	1,149,404	1,511	162	5 s / 56	Books	Align	95% / crisp
radio_2	651,645	1,439	154	8 s / 110	Radio	Align	95% / crisp
public_youtube1120	1,410,979	1,104	237	3 s / 34	YouTube	Subtitles	95% / ~crisp
public_youtube700	759,483	701	75	3 s / 43	YouTube	Subtitles	95% / ~crisp
tts_russian_addresses	1,741,838	754	81	2 s / 20	Addresses	TTS 4 voices	100% / crisp
asr_public_phone_calls_2	603,797	601	66	4 s / 37	Phone calls	ASR	70% / noisy
public_youtube1120_hq	369,245	291	31	3 s / 37	YouTube HQ	Subtitles	95% / ~crisp
asr_public_phone_calls_1	233,868	211	23	3 s / 29	Phone calls	ASR	70% / noisy
radio_v4_add (*)	92,679	157	18	6 s / 80	Radio	Align	95% / crisp
asr_public_stories_2	78,186	78	9	4 s / 43	Books	ASR	80% / crisp
asr_public_stories_1	46,142	38	4	3 s / 30	Books	ASR	80% / crisp
public_series_1	20,243	17	2	3 s / 38	YouTube	Subtitles	95% / ~crisp
asr_calls_2_val	12,950	7.7	2	2 s / 34	Phone calls	Manual annotation	99% / crisp
public_lecture_1	6,803	6	1	3 s / 47	Lectures	Subtitles	95% / crisp
buriy_audiobooks_2_val	7,850	4.9	1	2 s / 31	Books	Manual annotation	99% / crisp
public_youtube700_val	7,311	4.5	1	2 s / 35	YouTube	Manual annotation	99% / crisp

(*) Only a sample of the data is provided with txt files.

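Annotation methodology

The dataset is compiled from open sources. Long sequences are split into audio chunks using voice activity detection and alignment. Some audio types are annotated automatically and verified statistically or with heuristics.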
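The dataset authors' actual voice activity detector is not described here. As an illustration of the general idea only, the following minimal sketch splits a signal into chunks with a simple frame-energy threshold. The function name `split_on_silence`, the 25 ms frame length, and the 0.01 RMS threshold are all assumptions made for this example, not parameters of the original pipeline.

```python
import numpy as np


def split_on_silence(wav, sr, frame_ms=25, threshold=0.01):
    """Return (start, end) sample indices of runs of frames whose RMS exceeds threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wav) // frame_len
    # Drop the incomplete trailing frame and view the signal as (frames, samples).
    frames = wav[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    active = rms > threshold

    chunks, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                   # a chunk begins
        elif not is_active and start is not None:
            chunks.append((start * frame_len, i * frame_len))
            start = None                                # the chunk ends
    if start is not None:                               # signal ends while still active
        chunks.append((start * frame_len, n_frames * frame_len))
    return chunks


# Synthetic signal: 0.5 s silence, 1 s tone, 0.5 s silence, 1 s tone.
sr = 16000
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
wav = np.concatenate([silence, tone, silence, tone])

chunks = split_on_silence(wav, sr)
print(chunks)
```

A production detector would typically add smoothing and minimum-duration rules; this sketch only shows how frame-level decisions become chunk boundaries.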

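Data volumes and update frequency

The total size of the dataset is 350 GB. The total size of the dataset with publicly shared labels is 130 GB.

To preserve backward compatibility, the dataset itself is unlikely to be updated. See the original repository for benchmarks and exclude files.

New domains and languages may be added in the future.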

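Audio normalization

All files are normalized to make runtime augmentations easier and faster. Processing is as follows:

Converted to mono, if necessary;
Converted to a 16 kHz sampling rate, if necessary;
Stored as 16-bit integers;
Converted to OPUS;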
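The first three steps can be sketched with NumPy and SciPy as shown below. This is an illustration under assumptions, not the authors' tooling: the function name `normalize_audio` is hypothetical, the input is assumed to be floating-point audio in the range [-1, 1] shaped (samples, channels), and polyphase resampling is just one reasonable choice. The final OPUS encoding step is omitted because it depends on a libsndfile build with Opus support.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16000


def normalize_audio(wav, sr):
    """Convert float audio in [-1, 1] to mono, 16 kHz, 16-bit integer samples."""
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # (samples, channels) -> mono by averaging the channels
        wav = wav.mean(axis=1)
    if sr != TARGET_SR:
        # Polyphase resampling with the smallest integer up/down factors.
        g = gcd(TARGET_SR, sr)
        wav = resample_poly(wav, TARGET_SR // g, sr // g)
    # Guard against overshoot from the resampling filter before scaling.
    wav = np.clip(wav, -1.0, 1.0)
    return (wav * 32767).astype(np.int16), TARGET_SR


# One second of a stereo 44.1 kHz tone as example input.
t = np.arange(44100) / 44100
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
stereo = np.stack([tone, tone], axis=1)

pcm, sr = normalize_audio(stereo, 44100)
print(pcm.dtype, pcm.shape, sr)
```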

On disk DB methodology

Each audio file (wav, binary) is hashed. The hash is used to create a folder hierarchy that keeps file system operations efficient.

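import hashlib
from pathlib import Path

target_format = 'wav'
# wav is assumed to be the decoded audio as a NumPy array; hash its raw bytes
wavb = wav.tobytes()

f_hash = hashlib.sha1(wavb).hexdigest()

# root_folder/<1 hex char>/<2 hex chars>/<12 hex chars>.<format>
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)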
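The same scheme can be wrapped in a small self-contained function to see what a resulting path looks like. The function name `hashed_store_path`, the silent test signal, and the `data` root folder are assumptions made for this sketch.

```python
import hashlib
from pathlib import Path

import numpy as np


def hashed_store_path(wav, root_folder, target_format="wav"):
    """Build root/<1 hex char>/<2 hex chars>/<12 hex chars>.<format> from the audio bytes."""
    f_hash = hashlib.sha1(wav.tobytes()).hexdigest()
    return Path(root_folder,
                f_hash[0],
                f_hash[1:3],
                f_hash[3:15] + "." + target_format)


# One second of silence as a stand-in for real audio.
wav = np.zeros(16000, dtype=np.int16)
path = hashed_store_path(wav, "data")
print(path)
```

With one hexadecimal character at the first level and two at the second, files are spread over at most 16 × 256 = 4,096 leaf folders, which keeps any single directory from growing too large.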

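Downloads

The dataset is provided in two forms:

Archives available via Azure Blob storage and/or direct links;
Original files available via Azure Blob storage; everything is stored in 'https://azureopendatastorage.blob.core.windows.net/openstt/'

Folder structure: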

└── ru_open_stt_opus                                            <= archived folders
│   │
│   ├── archives
│   │    ├── asr_calls_2_val.tar.gz                             <= tar.gz archives with opus and wav files
│   │    │   ...                                                <= see the below table for enumeration
│   │    └── tts_russian_addresses_rhvoice_4voices.tar.gz
│   │
│   └── manifests
│        ├── asr_calls_2_val.csv                                <= csv files with wav_path, text_path, duration (see notebooks)
│        │   ...
│        └── tts_russian_addresses_rhvoice_4voices.csv
│
└── ru_open_stt_opus_unpacked                                   <= a separate folder for each uploaded domain
    ├── public_youtube1120
    │    ├── 0                                                  <= see "On disk DB methodology" for details
    │    ├── 1
    │    │   ├── 00
    │    │   │  ...
    │    │   └── ff
    │    │        ├── *.opus                                   <= actual files
    │    │        └── *.txt
    │    │   ...
    │    └── f
    │
    ├── public_youtube1120_hq
    ├── public_youtube700_val
    ├── asr_calls_2_val
    ├── radio_2
    ├── private_buriy_audiobooks_2
    ├── asr_public_phone_calls_2
    ├── asr_public_stories_2
    ├── asr_public_stories_1
    ├── public_lecture_1
    ├── asr_public_phone_calls_1
    ├── public_series_1
    └── public_youtube700
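Assuming the blob names mirror the folder tree above exactly (this has not been verified against the live container), archive and manifest URLs could be assembled as in the following sketch. The helper names `archive_url` and `manifest_url` are hypothetical.

```python
from urllib.parse import urljoin

# Base URL quoted in the Downloads section.
BASE_URL = "https://azureopendatastorage.blob.core.windows.net/openstt/"


def archive_url(name):
    """URL of the tar.gz archive for one domain, following the tree above."""
    return urljoin(BASE_URL, f"ru_open_stt_opus/archives/{name}.tar.gz")


def manifest_url(name):
    """URL of the csv manifest for one domain, following the tree above."""
    return urljoin(BASE_URL, f"ru_open_stt_opus/manifests/{name}.csv")


print(archive_url("asr_calls_2_val"))
print(manifest_url("asr_calls_2_val"))
```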

Dataset	GB, wav	GB, archive	Archive	Source	Manifest
Train
Radio and public speech sample	-	11.4	opus+txt	-	manifest
audiobook_2	162	25.8	opus+txt	Internet + alignment	manifest
radio_2	154	24.6	opus+txt	Radio	manifest
public_youtube1120	237	19.0	opus+txt	YouTube videos	manifest
asr_public_phone_calls_2	66	9.4	opus+txt	Internet + ASR	manifest
public_youtube1120_hq	31	4.9	opus+txt	YouTube videos	manifest
asr_public_stories_2	9	1.4	opus+txt	Internet + alignment	manifest
tts_russian_addresses_rhvoice_4voices	80.9	12.9	opus+txt	TTS	manifest
public_youtube700	75.0	12.2	opus+txt	YouTube videos	manifest
asr_public_phone_calls_1	22.7	3.2	opus+txt	Internet + ASR	manifest
asr_public_stories_1	4.1	0.7	opus+txt	Public stories	manifest
public_series_1	1.9	0.3	opus+txt	Public series	manifest
public_lecture_1	0.7	0.1	opus+txt	Internet + manual	manifest
Val
asr_calls_2_val	2	0.8	wav+txt	Internet	manifest
buriy_audiobooks_2_val	1	0.5	wav+txt	Books + manual	manifest
public_youtube700_val	2	0.13	wav+txt	YouTube videos + manual	manifest

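Download instructions

Direct download

For instructions on how to download the dataset directly, see the GitHub download instructions page.

Additional information

For help or questions about the data, contact the data author(s) at aveysov@gmail.com

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. It includes the following elements:

BY – Credit must be given to the creator
NC – Only noncommercial uses of the work are permitted

CC-BY-NC; commercial use is available after agreement with the dataset authors.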

Data access

Azure Notebooks

azure-storage

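Helper functions / dependencies

Building libsndfile

An efficient way to read opus files in Python without significant overhead is to use pysoundfile (a Python CFFI wrapper around libsndfile).

Opus support has been implemented upstream, but it has not been properly released. We therefore opted for a custom build plus monkey patching.

Typically, you need to run the following in your shell with sudo access: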

apt-get update
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y

cd /usr/local/lib
git clone https://github.com/erikd/libsndfile.git
cd libsndfile
git reset --hard 49b7d61
mkdir -p build && cd build

cmake .. -DBUILD_SHARED_LIBS=ON
make && make install
cmake --build .

Helper functions / dependencies

Install the following libraries:

pandas
numpy
scipy
tqdm
soundfile
librosa

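Manifests are csv files with the following columns:

Path to audio
Path to text file
Duration

This proved to be the simplest format for accessing the data.

For ease of use, all the manifests are already rerooted. All paths in them are relative, so you need to provide a root folder.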
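To make the format concrete, the sketch below parses a tiny header-less manifest from an in-memory string and prefixes a root folder to its relative paths. The two rows and the root folder `/data/ru_open_stt` are made up for illustration; real manifests come from the manifests folder of the dataset.

```python
import io
import os

import pandas as pd

# Two made-up rows in the manifest layout: audio path, text path, duration in seconds.
csv_text = (
    "public_series_1/0/00/aaaaaaaaaaaa.opus,public_series_1/0/00/aaaaaaaaaaaa.txt,2.5\n"
    "public_series_1/f/ff/bbbbbbbbbbbb.opus,public_series_1/f/ff/bbbbbbbbbbbb.txt,4.1\n"
)

# Manifests have no header row, so the column names are supplied explicitly.
manifest = pd.read_csv(io.StringIO(csv_text),
                       names=["wav_path", "text_path", "duration"])

# Hypothetical local root folder; replace it with wherever the data was unpacked.
root = "/data/ru_open_stt"
for column in ("wav_path", "text_path"):
    manifest[column] = manifest[column].apply(lambda p: os.path.join(root, p))

print(manifest)
print(f"total duration: {manifest.duration.sum():.1f} s")
```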

# manifest utils
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from urllib.request import urlopen


def reroot_manifest(manifest_df,
                    source_path,
                    target_path):
    if source_path != '':
        manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: x.replace(source_path,
                                                                              target_path))
        manifest_df.text_path = manifest_df.text_path.apply(lambda x: x.replace(source_path,
                                                                                target_path))
    else:
        manifest_df.wav_path = manifest_df.wav_path.apply(lambda x: os.path.join(target_path, x))
        manifest_df.text_path = manifest_df.text_path.apply(lambda x: os.path.join(target_path, x))    
    return manifest_df


def save_manifest(manifest_df,
                  path,
                  domain=False):
    if domain:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
    else:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']

    manifest_df.reset_index(drop=True).sort_values(by='duration',
                                                   ascending=True).to_csv(path,
                                                                          sep=',',
                                                                          header=False,
                                                                          index=False)
    return True


def read_manifest(manifest_path,
                  domain=False):
    if domain:
        return pd.read_csv(manifest_path,
                        names=['wav_path',
                               'text_path',
                               'duration',
                               'domain'])
    else:
        return pd.read_csv(manifest_path,
                        names=['wav_path',
                               'text_path',
                               'duration'])


def check_files(manifest_df,
                domain=False):
    orig_len = len(manifest_df)
    if domain:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration', 'domain']
    else:
        assert list(manifest_df.columns) == ['wav_path', 'text_path', 'duration']
    wav_paths = list(manifest_df.wav_path.values)
    text_paths = list(manifest_df.text_path.values)

    omitted_wavs = []
    omitted_txts = []

    for wav_path, text_path in zip(wav_paths, text_paths):
        if not os.path.exists(wav_path):
            print('Dropping {}'.format(wav_path))
            omitted_wavs.append(wav_path)
        if not os.path.exists(text_path):
            print('Dropping {}'.format(text_path))
            omitted_txts.append(text_path)

    manifest_df = manifest_df[~manifest_df.wav_path.isin(omitted_wavs)]
    manifest_df = manifest_df[~manifest_df.text_path.isin(omitted_txts)]
    final_len = len(manifest_df)

    if final_len != orig_len:
        print('Removed {} lines'.format(orig_len-final_len))
    return manifest_df


def plain_merge_manifests(manifest_paths,
                          MIN_DURATION=0.1,
                          MAX_DURATION=100):

    manifest_df = pd.concat([read_manifest(_)
                             for _ in manifest_paths])
    manifest_df = check_files(manifest_df)

    manifest_df_fit = manifest_df[(manifest_df.duration>=MIN_DURATION) &
                                  (manifest_df.duration<=MAX_DURATION)]

    manifest_df_non_fit = manifest_df[(manifest_df.duration<MIN_DURATION) |
                                      (manifest_df.duration>MAX_DURATION)]

    print(f'Good hours: {manifest_df_fit.duration.sum() / 3600:.2f}')
    print(f'Bad hours: {manifest_df_non_fit.duration.sum() / 3600:.2f}')

    return manifest_df_fit


def save_txt_file(wav_path, text):
    txt_path = wav_path.replace('.wav','.txt')
    with open(txt_path, "w") as text_file:
        print(text, file=text_file)
    return txt_path


def read_txt_file(text_path):
    #with open(text_path, 'r') as file:
    response = urlopen(text_path)
    file = response.readlines()
    for i in range(len(file)):
        file[i] = file[i].decode('utf8')
    return file 

def create_manifest_from_df(df, domain=False):
    if domain:
        columns = ['wav_path', 'text_path', 'duration', 'domain']
    else:
        columns = ['wav_path', 'text_path', 'duration']
    manifest = df[columns]
    return manifest


def create_txt_files(manifest_df):
    assert 'text' in manifest_df.columns
    assert 'wav_path' in manifest_df.columns
    wav_paths, texts = list(manifest_df['wav_path'].values), list(manifest_df['text'].values)
    # not using multiprocessing for simplicity
    txt_paths = [save_txt_file(*_) for _ in tqdm(zip(wav_paths, texts), total=len(wav_paths))]
    manifest_df['text_path'] = txt_paths
    return manifest_df


def replace_encoded(text):
    text = text.lower()
    if '2' in text:
        text = list(text)
        _text = []
        for i,char in enumerate(text):
            if char=='2':
                try:
                    _text.extend([_text[-1]])
                except IndexError:
                    # a leading '2' has no preceding character to repeat
                    print(''.join(text))
            else:
                _text.extend([char])
        text = ''.join(_text)
    return text
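The `replace_encoded` helper above appears to treat the digit `2` as a marker meaning "repeat the previous character". A compact re-implementation of that reading, written only to illustrate the behavior on a made-up input, might look like this:

```python
def replace_encoded(text):
    """Lowercase the text and expand each '2' into a copy of the preceding character."""
    out = []
    for char in text.lower():
        if char != "2":
            out.append(char)
        elif out:
            # '2' repeats the most recently emitted character
            out.append(out[-1])
        # a leading '2' has nothing to repeat and is dropped
    return "".join(out)


print(replace_encoded("АЛ2О"))
```

Unlike the original helper, this version silently drops a leading `2` instead of printing the offending string.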

# reading opus files
import os
import soundfile as sf



# Fx for soundfile read/write functions
def fx_seek(self, frames, whence=os.SEEK_SET):
    self._check_if_closed()
    position = sf._snd.sf_seek(self._file, frames, whence)
    return position


def fx_get_format_from_filename(file, mode):
    format = ''
    file = getattr(file, 'name', file)
    try:
        format = os.path.splitext(file)[-1][1:]
        format = format.decode('utf-8', 'replace')
    except Exception:
        pass
    if format == 'opus':
        return 'OGG'
    if format.upper() not in sf._formats and 'r' not in mode:
        raise TypeError("No format specified and unable to get format from "
                        "file extension: {0!r}".format(file))
    return format


#sf._snd = sf._ffi.dlopen('/usr/local/lib/libsndfile/build/libsndfile.so.1.0.29')
sf._subtypes['OPUS'] = 0x0064
sf.SoundFile.seek = fx_seek
sf._get_format_from_filename = fx_get_format_from_filename


def read(file, **kwargs):
    return sf.read(file, **kwargs)


def write(file, data, samplerate, **kwargs):
    return sf.write(file, data, samplerate, **kwargs)

# display utils
import gc
from IPython.display import HTML, Audio, display_html
pd.set_option('display.max_colwidth', 3000)
#Prepend_path is set to read directly from Azure. To read from local replace below string with path to the downloaded dataset files
prepend_path = 'https://azureopendatastorage.blob.core.windows.net/openstt/ru_open_stt_opus_unpacked/'


def audio_player(audio_path):
    return '<audio preload="none" controls="controls"><source src="{}" type="audio/wav"></audio>'.format(audio_path)

def display_manifest(manifest_df):
    display_df = manifest_df.copy()  # work on a copy so the caller's DataFrame is not modified
    display_df['wav'] = [audio_player(prepend_path+path) for path in display_df.wav_path]
    display_df['txt'] = [read_txt_file(prepend_path+path) for path in tqdm(display_df.text_path)]
    audio_style = '<style>audio {height:44px;border:0;padding:0 20px 0px;margin:-10px -20px -20px;}</style>'
    display_df = display_df[['wav','txt', 'duration']]
    display(HTML(audio_style + display_df.to_html(escape=False)))
    del display_df
    gc.collect()

Play with a dataset

Play a sample of files

Most browsers support native audio playback, so we can use HTML5 audio players to view our data.

manifest_df = read_manifest(prepend_path + 'manifests/public_series_1.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')

sample = manifest_df.sample(n=20)
display_manifest(sample)

Read a file

!ls ru_open_stt_opus/manifests/*.csv

The following examples show how best to read wav and opus files.

Scipy is the fastest for wav. Pysoundfile is the best overall for opus.

%matplotlib inline

import librosa
from scipy.io import wavfile
from librosa import display as ldisplay
from matplotlib import pyplot as plt

Read a wav

manifest_df = read_manifest(prepend_path +'manifests/asr_calls_2_val.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')

sample = manifest_df.sample(n=5)
display_manifest(sample)

from io import BytesIO

wav_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+wav_path)
data = response.read()
sr, wav = wavfile.read(BytesIO(data))
wav = wav.astype('float32')  # astype returns a new array, so keep the result
absmax = np.max(np.abs(wav))
wav =  wav / absmax

# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope
# note: newer librosa releases provide waveshow in place of waveplot
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)
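The read-and-normalize steps can also be tried without network access. The sketch below writes a synthetic tone to an in-memory wav file, reads it back with Scipy, and scales it to a peak of 1.0. The helper name `peak_normalize` and the guard for silent clips are additions made for this example, not part of the original notebook.

```python
from io import BytesIO

import numpy as np
from scipy.io import wavfile

# Write one second of a 440 Hz tone as 16-bit PCM into an in-memory wav file.
sr = 16000
tone = (0.3 * 32767 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)).astype(np.int16)
buffer = BytesIO()
wavfile.write(buffer, sr, tone)
buffer.seek(0)

# Read it back the same way the notebook reads downloaded bytes.
sr_read, wav = wavfile.read(buffer)


def peak_normalize(wav):
    """Return a float32 copy scaled so that the largest absolute sample is 1.0."""
    wav = wav.astype("float32")
    absmax = np.max(np.abs(wav))
    # A silent clip has absmax == 0; return it unchanged to avoid dividing by zero.
    return wav / absmax if absmax > 0 else wav


normalized = peak_normalize(wav)
print(sr_read, wav.dtype, normalized.dtype)
```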

Read opus

manifest_df = read_manifest(prepend_path +'manifests/asr_public_phone_calls_2.csv')
#manifest_df = reroot_manifest(manifest_df,
                              #source_path='',
                              #target_path='../../../../../nvme/stt/data/ru_open_stt/')

sample = manifest_df.sample(n=5)
display_manifest(sample)

opus_path = sample.iloc[0].wav_path
response = urlopen(prepend_path+opus_path)
data = response.read()
wav, sr = sf.read(BytesIO(data))
wav = wav.astype('float32')  # astype returns a new array, so keep the result
absmax = np.max(np.abs(wav))
wav =  wav / absmax

# shortest way to plot a spectrogram
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
plt.figure(figsize=(12, 6))
ldisplay.specshow(D, y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# shortest way to plot an envelope
# note: newer librosa releases provide waveshow in place of waveplot
plt.figure(figsize=(12, 6))
ldisplay.waveplot(wav, sr=sr, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None)

Next steps

View the rest of the datasets in the Open Datasets catalog.
