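Trying Out an Amazon Transcribe Hands-On Lab

I tried a hands-on lab on Amazon Transcribe for two reasons: I am interested in generative-AI-related services, and I wanted to deliberately work with a service I don't touch in my day-to-day work!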

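What is Amazon Transcribe?

According to AWS, Amazon Transcribe is a service with the following characteristics:

A fully managed, continuously trained automatic speech recognition (ASR) service

Its transcription features and API make it easy to build voice-enabled applications

It can be used in many scenarios, such as transcribing call-center conversations, generating subtitles, and producing meeting minutes.

Transcription is supported in two modes: batch processing, which ingests pre-recorded audio and transcribes it, and streaming, which transcribes a speaker's audio in real time. The hands-on lab covers both.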

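The hands-on material I used

I used a free hands-on lab called Speech to Text hands-on lab.

The lab comes with a step-by-step guide (in English), and I followed it to work through the exercises.

The initial screen looks like this.

No AWS environment of your own is required: just by creating an AWS Builder ID (free), you can try the Amazon Transcribe hands-on lab at no cost.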

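Starting the hands-on lab!

Preparation: creating an AWS Builder ID

Clicking the "Start Free Lab" button brings up a screen like the one below. It lists various notes about the hands-on lab; click "Agree".

You are then taken to the "Create AWS Builder ID" screen, where you create an AWS Builder ID with your personal email address. (If you already have an AWS Builder ID, you can skip this part.)

Enter a password.

While signed in, click the "Start Free Lab" button, and the status changes to "Provisioning AWS resources" as shown below, meaning the AWS environment needed for the lab is being created. This takes roughly one minute.

Once the resources are ready and "Lab is ready!" is displayed, click the "Open AWS Console" button. When the AWS console appears, the preparation is complete!

From here on, we get into the lab itself.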

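Step ①: Analyze an audio file stored in S3 and create a transcript with a batch transcription job

Here we ingest an audio file prepared in advance and transcribe it.

First, open the Amazon Transcribe page.

Then select "Transcription jobs" from the menu on the left.

Click "Create job". A job is the unit of transcription work.

On the screen for specifying job details, enter a job name that is unique within the account. Following the lab instructions, I named it "MyTranscriptionJob".

The language of the input audio in this lab is English.

Scrolling down reveals the section for adding input data.

The data is already in place for this lab, so select the bucket named "transcribe-lab-data" as the lab guide instructs.

Selecting the bucket lists its files; choose the file named "Auto1_CUST….".

Leave the other settings unchanged and click Next.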

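Next, configure the job.

Under the audio settings, turn on the Audio identification toggle, tick Channel identification, and create the job.

Note: a channel is, put simply, one of the separately recorded audio tracks within the audio data.

For example, when several people speak in a meeting and each voice is recorded on its own track, each person's voice can be handled as a separate channel.

Using channels lets you tell the speakers apart and transcribe accurately who said what.

A message saying the transcription job was created successfully then appears; wait a little until the status changes to "Complete".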
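For reference, the console steps above can also be expressed as an API request. The following is a minimal sketch, not the lab's own procedure: it only builds the parameters for the StartTranscriptionJob API, and the function name `build_batch_job_request`, the `en-US` language code, and the S3 object key are assumptions for illustration (the post only shows the abbreviated file name "Auto1_CUST….", so a placeholder key is used).

```python
from typing import Any


def build_batch_job_request(job_name: str, media_uri: str,
                            language_code: str = "en-US") -> dict[str, Any]:
    """Return keyword arguments for the StartTranscriptionJob API.

    The function name is hypothetical; "en-US" is an assumption, since the
    post only says the audio is in English.
    """
    return {
        "TranscriptionJobName": job_name,   # must be unique within the account
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        # Equivalent of ticking "Channel identification" in the console.
        "Settings": {"ChannelIdentification": True},
    }


request = build_batch_job_request(
    "MyTranscriptionJob",
    # Placeholder key: the real file name is abbreviated in this post.
    "s3://transcribe-lab-data/your-audio-file.wav",
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("transcribe").start_transcription_job(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect before anything is sent to AWS.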

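Opening the job's page shows the job details. You can check the actual transcript in the transcription preview section.

The "Download" button lets you download the transcription result in JSON format.

The downloaded JSON looks like this (it is very long, so only an excerpt is shown).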

{
"jobName": "MyTranscriptionJob",
"accountId": "XXXXXXX",
"status": "COMPLETED",
"results": {
"transcripts": [
{
"transcript": "Car. Yeah, hi. Um, my name is Viola. I just left. I'm actually still in your parking lot. Um, and I was pulling the car I just bought out and I, I bumped into that low wall outside of your front door and, like, dented the back fender of my car. Um, Uh Oh. Oh, no. yeah, like I've had it for like, 15 minutes. Um, but like, it sucks and I wanna know if I can talk to somebody because there's like, no warning for that back wall. There's like, no, um, like, what do you call those things in a parking lot to stop you from, like, hitting things or whatever? Like, there's nothing and it seems like a really bad, like, it's set up just really sucky and like, I'm not paying, like, I'm not paying for this repair but I'm not leaving without it getting repaired. So, Yeah. Right. Ok. Ok. Um, yeah. got it. Uh, Yeah. so I'm not exactly sure. Um, who you want me to put you through to? I'm happy to get you, um, the number of the tow service that could come and handle the car. Is it still drivable or is it? Uh, no, it's still drivable. It just has like a big ugly ass, like, gash on it. Like, it's like, Ok. Yeah. Ok, I got it. no, I, I wanna talk to, like, the manager, the owner, whoever, like, you can't just, like, have a wall, like, in a car dealership, like, have a wall that you could literally back into like super easy with like no, like there's no reflectors on it. Like there's no nothing and I'm like hit it and I literally just like paid in cash for this car. So, Yeah. Um, well, I can pass that information along and, um, someone can review it and maybe, uh, in the future they could install reflectors or something like that or maybe they'll, um, change, change the, the way that the, the wall is set up in the parking lot so that this doesn't happen in the future. I'm happy to pass that information, um, along for you. Um, can I get a little bit more information about, uh, what happened? Just because they're just gonna want all the pieces. So, what was your name, ma'am? 
yeah. Um yeah, ok. Um Viola Viola? Ok. And your last name? Viola Snell Snelle Ok, perfect. And, um, and the, the car that you purchased today? What, uh, what car was that? um yeah, so I just bought the um Mercedes E class. It's um and uh it's gold Mhm. Ok. Yep. Got you. and um I was just trying to like leave like the parking lot and like when you have to like, do that little like funky pull out like 27 point turn or whatever it is like outside of the front door and like, you can't see, like I can't see. Um, so like, it's great if you, like, if they put in some sort of like reflectors or like something like in the cement, but it's like, I don't know, something like up so that if you're like, not 7 ft tall you can see it but Uh huh. Ok. Right. Ok. Yeah. like, it's too late for my car now. And so that's why like, I'm super annoyed, like I wanna give you guys a good review. Like, I wanna be like, all nice about it and everything but like, I'm not gonna go home with this brand new car. I just thought it was like a big gash on the back. Like that's a stuff. Yeah. Ok. Absolutely. Yeah, that is really unfortunate. Um, uh, just just so I fully understand and I can pass it along correctly. Um Did you adjust the mirrors for yourself before you, um, pulled out with the car? I know sometimes the mirrors can all be set differently if someone else drove it before you purchased it and did the mirrors all look good and, and you could see everything out of your rear view and your side mirrors because usually that's where you can kind of see that wall a little bit better. Um, so I'm just curious. Ok. Yeah. Yeah. Uh huh. Yeah. Yeah. No, I honestly I don't even know like it but it shouldn't matter. Like, I just, I mean, like, yeah, sure. I'll do that and like thank you for like driving 101 but like really I just need to talk to like is there a manager in, is the owner in like whatever like Mr Mercedes, like whoever is in charge? Ok. Ok. 
Um yeah, uh just give me one second here and I'll see if I can put you through to the manager. Ok. Just I'm just gonna put you on hold for just one second. Ok. Ok. Sure. Yeah. Ok. Ok. Ok. Viola you still there? Mhm. Yeah. Ok. So um I am going to put you through to um the head of the um Cosmetics like repairs department and um he'll see like if he can just pop on out and have a look at the car and then we'll go from there. Ok. And I, I have all this information down so I'm just gonna go ahead and um put you on through to him now. Ok. Ok. Yeah. Ok. Sure. Yeah. Ok. All right. Ok. Um thanks for thanks a lot. Ok. Ok. Have a good one. Ok. Yeah. Yeah, you too. Bye."
}
],
"channel_labels": {
"number_of_channels": 2,
"channels": [
{
"channel_label": "ch_0",
"items": [
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.998", "content": "Yeah" }],
"start_time": "2.119",
"end_time": "2.46",
"channel_label": "ch_0"
},
{
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "," }],
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "hi" }],
"start_time": "2.47",
"end_time": "3.279",
"channel_label": "ch_0"
},
{
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "." }],
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.998", "content": "Um" }],
"start_time": "3.289",
"end_time": "3.92",
"channel_label": "ch_0"
},
{
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "," }],
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "my" }],
"start_time": "4.8",
"end_time": "5.139",
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "name" }],
"start_time": "5.15",
"end_time": "5.36",
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.999", "content": "is" }],
"start_time": "5.369",
"end_time": "5.599",
"channel_label": "ch_0"
},
{
"type": "pronunciation",
"alternatives": [{ "confidence": "0.242", "content": "Viola" }],
"start_time": "5.809",
"end_time": "6.15",
"channel_label": "ch_0"
},
{
"type": "punctuation",
"alternatives": [{ "confidence": "0.0", "content": "." }],
"channel_label": "ch_0"
},

The items array holds word-level information about the transcript.

Briefly, the fields are as follows.

type→the kind of item recognized, either pronunciation (a spoken word) or punctuation
alternatives→information about the recognized word
- confidence→how confident Amazon Transcribe is that it recognized the word correctly, as a value from 0 to 1
- content→the recognized word itself
start_time/end_time→the points in the audio, in seconds, at which the word starts and ends
channel_label→the channel on which the word was spoken
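To make these fields concrete, here is a small sketch that reads a few items copied from the excerpt above, rebuilds the text, and lists words whose confidence is low. The helper names `channel_text` and `low_confidence_words` and the 0.5 threshold are assumptions chosen for illustration, not part of the lab.

```python
# A few items copied from the JSON excerpt above (channel ch_0).
items = [
    {"type": "pronunciation", "start_time": "4.8", "end_time": "5.139",
     "alternatives": [{"confidence": "0.999", "content": "my"}]},
    {"type": "pronunciation", "start_time": "5.15", "end_time": "5.36",
     "alternatives": [{"confidence": "0.999", "content": "name"}]},
    {"type": "pronunciation", "start_time": "5.369", "end_time": "5.599",
     "alternatives": [{"confidence": "0.999", "content": "is"}]},
    {"type": "pronunciation", "start_time": "5.809", "end_time": "6.15",
     "alternatives": [{"confidence": "0.242", "content": "Viola"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": "."}]},
]


def channel_text(items):
    """Join words with spaces and attach punctuation to the preceding word."""
    text = ""
    for item in items:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation" or not text:
            text += content
        else:
            text += " " + content
    return text


def low_confidence_words(items, threshold=0.5):
    """Return (word, confidence, start_time) for words below the threshold."""
    flagged = []
    for item in items:
        if item["type"] != "pronunciation":
            continue  # punctuation items in the excerpt have confidence 0.0
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # values are stored as strings
        if confidence < threshold:
            flagged.append((best["content"], confidence,
                            float(item["start_time"])))
    return flagged


print(channel_text(items))          # my name is Viola.
print(low_confidence_words(items))  # [('Viola', 0.242, 5.809)]
```

In this excerpt the name "Viola" has a confidence of only 0.242, which is the kind of word a custom vocabulary might help with.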

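Step ③: Teach Amazon Transcribe new words with a custom vocabulary

Amazon Transcribe is trained on a wide range of speech data across many languages, but when you have needs like the following, the custom vocabulary feature lets you teach it new words.

You want it to recognize specific terms such as company or product names
You want words such as personal names to be transcribed with a spelling you specify

Select "Custom vocabulary" from the menu on the left and click "Create vocabulary".

On the vocabulary settings page, register the word you want to add. In the lab, the word is CodeWhisperer. The settings are as shown in the screen below.

The key point is to enter three things: the word to register, how the word sounds, and how it should be displayed in the transcript.

After entering them, click "Create vocabulary".

On the Custom vocabulary page, once the status of the vocabulary you just created changes from pending to complete, you are done.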
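The same three pieces of information (the word, how it sounds, and how it is displayed) can also be written as a vocabulary table file instead of being typed into the console. The sketch below assumes the tab-separated four-column layout (Phrase, SoundsLike, IPA, DisplayAs) described in the Amazon Transcribe documentation; the row values are hypothetical, since the lab's actual settings appear only in a screenshot, and the function name `vocabulary_table` is made up for illustration. Check the current documentation before relying on this layout.

```python
COLUMNS = ["Phrase", "SoundsLike", "IPA", "DisplayAs"]


def vocabulary_table(rows):
    """Render rows (dicts keyed by column name) as a tab-separated table.

    Missing columns are written as empty fields so every line has the
    same number of tab-separated values as the header.
    """
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(row.get(column, "") for column in COLUMNS))
    return "\n".join(lines) + "\n"


# Hypothetical values: the post does not show what was actually entered.
rows = [
    {
        "Phrase": "code-whisperer",        # the word to register
        "SoundsLike": "code-whis-per-er",  # how the word sounds
        "DisplayAs": "CodeWhisperer",      # how it appears in the transcript
    },
]

table = vocabulary_table(rows)
print(table)
```

A file like this would then be uploaded and referenced when creating the vocabulary, rather than entering each word by hand.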

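Step ④: Testing real-time transcription

Here we test transcription of speech in real time. At the same time, we check whether the custom vocabulary created in step ③ is transcribed as expected.

Go to the Real-time transcription page, turn on the "Custom vocabulary" toggle shown under Customizations, and under vocabulary selection choose the custom vocabulary created in step ③ (named "my_vocabulary" here).

The other settings are as shown in the image below.

Once the settings are done, click the Start streaming button to begin recording.

The transcription result is as follows.

<Text read aloud>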

Amazon CodeWhisperer is a general purpose, machine learning-powered code generator that provides you with code recommendations in real time. As you write code, CodeWhisperer automatically generates suggestions based on your existing code and comments.

<Transcribed text (differences from the original were marked in red)>

Amazon CodeWhisperer is a general purpose(,) machine learning power code generator that provides you with code recommendations in real time. As you write code, CodeWhisperer automatically generates suggestions based on your existing code and comments.

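Comparing the text I read aloud with the transcript, there are a few small errors (around the comma after "purpose", and "learning-powered" coming out as "learning power"), but the transcript is very close to the original and quite accurate.
The other point worth noting is that "CodeWhisperer", registered in step ③, was recognized and transcribed correctly.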
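Since the red highlighting does not survive in plain text, the differences between the two passages above can also be found programmatically. This is a minimal sketch using Python's standard `difflib`; the function name `word_differences` is made up, and the "(,)" in the transcript is read here as a comma that was missing from the output, which is an assumption.

```python
import difflib

spoken = (
    "Amazon CodeWhisperer is a general purpose, machine learning-powered "
    "code generator that provides you with code recommendations in real time. "
    "As you write code, CodeWhisperer automatically generates suggestions "
    "based on your existing code and comments."
)
# Assumption: "(,)" in the post marks a comma missing from the transcript.
transcribed = (
    "Amazon CodeWhisperer is a general purpose machine learning power "
    "code generator that provides you with code recommendations in real time. "
    "As you write code, CodeWhisperer automatically generates suggestions "
    "based on your existing code and comments."
)


def word_differences(expected, actual):
    """Return (expected_words, actual_words) pairs where the texts differ."""
    a, b = expected.split(), actual.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for expected_part, actual_part in word_differences(spoken, transcribed):
    print(f"expected {expected_part!r}, got {actual_part!r}")
```

Comparing word by word rather than character by character keeps the report readable: each entry shows the words that were read aloud next to what appeared in the transcript.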

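That concludes the hands-on lab!

Things to watch out for

The lab has an 8-hour time limit. The content does not take anywhere near 8 hours, but note that once 8 hours have passed you can no longer run the lab again with that account.

If you exceed the time limit, you need to create a new Builder ID and start the lab again.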

References

Speech to Text hands-on lab

AWS Black Belt Online Seminar

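Closing thoughts

Through the Amazon Transcribe hands-on lab, I got a real sense of how interesting speech-to-text is.

In the final real-time transcription, I was a little pleased when the custom word "CodeWhisperer" that I pronounced was transcribed correctly (lol).

This lab was entirely in English, but it raised various questions, such as how much the accuracy differs for Japanese and what techniques exist to improve recognition accuracy, and I want to keep learning.

The lab is a great way to find out what Amazon Transcribe can do, so if you are interested, please give it a try!

Please also check out tomorrow's post in the mokumoku-kai blog relay!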
