Transcribing Audio: A Virtual Rube Goldberg Machine
Note: I wrote this article via the voice app described here, then edited it; total time from recording to posting was under an hour.
The goal of this “transcription machine” is that I can stand in my living room or go for a walk, record my thoughts exactly as I am doing right now on my voice recorder, save the recording, and then come back to my desk a few minutes later and edit the “article” in Google Docs.
Like a physical Rube Goldberg machine, the path is convoluted and full of many different parts, but in the end, the result is pretty damn useful.
While I won’t be presenting the full code, I will show the components needed to get the system working. If you’d like more details, or want to discuss setting up a similar system for yourself, feel free to drop me a line.
For the system to work, the recording has to be sent up to Amazon Transcribe, and the results then have to flow into Google Docs. Simple, right?
As you’ll see, it’s not quite that straightforward. With so many moving parts, the Rube Goldberg name fits, but it was a fun little project, and I’ll explain what I did to set up the service.
Looking at the diagram below, you will see the following components:
[Diagram Here When Available]
- Android phone
- Easy Voice Recorder Pro
- Zapier account (I’m looking to replace this with something else)
- AWS account (we’ll use AWS Lambda and S3 for file storage)
- SendGrid for sending email (you could just as easily use Amazon SES and skip this, or even talk SMTP directly)
- Google Account (for Docs)
- Three custom AWS Lambda functions to move the files through the system; I use Node.js for these
Here’s how it works, from a technical perspective:
The first S3 bucket has a trigger set on it: when S3 sees a new file, a Lambda fires, creates the metadata Amazon Transcribe needs, and calls the Transcribe API to start the transcription job. A few minutes later, Amazon Transcribe stores a JSON file representing the transcription in a second S3 bucket, which triggers the third (and final) Lambda.
This final Lambda receives a link to the transcription results. After loading and parsing the JSON, the system runs through what Amazon Transcribe provides and applies a few minor transformations: I remove the “umm’s” and “ahh’s,” the spoken phrases “new line” and “new paragraph” become line and paragraph breaks, and I’m working on detecting long pauses to add further breaks.
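The clean-up pass can be sketched as a small pure function over the transcript text. This is a simplification: Amazon Transcribe’s real JSON output has per-word items with timestamps and confidences, which a production version would use instead of plain regexes.

```javascript
// Sketch of the post-processing step: strip filler words and turn the
// spoken commands "new line" / "new paragraph" into actual breaks.
function cleanTranscript(text) {
  return text
    .replace(/\b(umm?|ahh?)\b[,.]?\s*/gi, '')            // drop "um", "umm", "ah", "ahh"
    .replace(/\s*\bnew paragraph\b[,.]?\s*/gi, '\n\n')   // spoken "new paragraph" -> blank line
    .replace(/\s*\bnew line\b[,.]?\s*/gi, '\n')          // spoken "new line" -> line break
    .replace(/[ \t]{2,}/g, ' ')                          // collapse leftover double spaces
    .trim();
}
```

For example, `cleanTranscript('Umm, hello new line world new paragraph Done.')` yields `'hello\nworld\n\nDone.'`.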
Once the post-processing is complete, the program ships the transcription over to Google Docs, creating a new document with the results. I then send myself two emails: one with a link to the document plus the transcription itself, and a second with the raw JSON file so I have easy access to it for further development.
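The hand-off step could look roughly like this, using the `googleapis` client for Docs and `@sendgrid/mail` for the email. The addresses, credentials, and document title are placeholders, and error handling is omitted:

```javascript
// Pure helper: a Docs batchUpdate request that inserts the transcript
// at index 1 (index 0 is the document's immutable start marker).
function buildInsertRequest(text) {
  return { requests: [{ insertText: { location: { index: 1 }, text } }] };
}

// Hedged sketch of the final hand-off: create the Google Doc, fill it,
// then email myself the link and the transcript via SendGrid.
async function shipTranscript(title, text) {
  const { google } = require('googleapis');     // loaded lazily; bundled with the Lambda
  const sgMail = require('@sendgrid/mail');

  const auth = await google.auth.getClient({
    scopes: ['https://www.googleapis.com/auth/documents'],
  });
  const docs = google.docs({ version: 'v1', auth });

  const doc = await docs.documents.create({ requestBody: { title } });
  await docs.documents.batchUpdate({
    documentId: doc.data.documentId,
    requestBody: buildInsertRequest(text),
  });

  sgMail.setApiKey(process.env.SENDGRID_API_KEY);
  await sgMail.send({
    to: 'me@example.com',             // placeholder addresses
    from: 'machine@example.com',
    subject: `Transcription ready: ${title}`,
    text: `https://docs.google.com/document/d/${doc.data.documentId}/edit\n\n${text}`,
  });
}
```

The second email, with the raw Transcribe JSON attached, follows the same `sgMail.send` pattern with an `attachments` array.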
Whew! It takes multiple steps, just like a physical Rube Goldberg machine, but the result is pretty pleasing. My walks have been super productive: I record my thoughts and have them ready for editing when I get back to my desk.
For the next iteration, I’m adding speaker tags and timestamps for multi-person conversations.
If you’re good at creating flow diagrams, I’d love to have a visual representation of the machine described in this article. Or, if you’ve implemented something similar, I’d like to hear about it. Feel free to drop me a line.