TikTok to text: Pyktok, FFmpeg, and Whisper

I recently wanted to record the text from a TikTok video (as an example of an effective social-search-request)¹: “HotGirls on TikTok, why do I look stupid at the gym?”

I used Deen Freelon’s Pyktok (“A simple module to collect video, text, and metadata from Tiktok.”) to download the video.

>>> import pyktok as pyk
We strongly recommend you run 'specify_browser' first, which will allow you to run pyktok's functions without using the browser_name parameter every time. 'specify_browser' takes as its sole argument a string representing a browser installed on your system, e.g. "chrome," "firefox," "edge," etc.
>>> pyk.specify_browser('chrome')
>>> pyk.save_tiktok('https://www.tiktok.com/@hannahabrown0/video/7298808114189585695?is_copy_url=1&is_from_webapp=v1', True, 'video_data.csv')
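For repeated use, the same two calls can live in a small script. The sketch below also guesses the name of the saved file from the URL. That naming pattern is an assumption inferred from the filename that appears in the FFmpeg step of this post, not documented Pyktok behavior, and `expected_filename` and `download` are hypothetical helpers.

```python
import re
from urllib.parse import urlparse


def expected_filename(video_url: str) -> str:
    """Guess the saved filename for a TikTok video URL.

    Assumption: the "@user_video_id.mp4" pattern is inferred from the
    filename seen in this post, not taken from Pyktok's documentation.
    """
    path = urlparse(video_url).path  # drops the query string
    match = re.fullmatch(r"/(@[^/]+)/video/(\d+)/?", path)
    if match is None:
        raise ValueError(f"not a recognized TikTok video URL: {video_url}")
    user, video_id = match.groups()
    return f"{user}_video_{video_id}.mp4"


def download(video_url: str, metadata_csv: str = "video_data.csv") -> str:
    """Download the video with Pyktok and return the filename we expect."""
    import pyktok as pyk  # third-party; imported lazily so the helper above works without it

    pyk.specify_browser("chrome")
    # Same positional arguments as the interactive call: URL, save video, metadata file.
    pyk.save_tiktok(video_url, True, metadata_csv)
    return expected_filename(video_url)
```

Calling `download(url)` with the URL above should leave the video and `video_data.csv` in the working directory, if the naming assumption holds.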

I then converted the video to audio with FFmpeg (“A complete, cross-platform solution to record, convert and stream audio and video.”). (Note: This step is not necessary, since Whisper can transcribe directly from video. It did reduce the size of the file I was working with from 4.5 MB to 160 KB.)

% ffmpeg -i "@hannahabrown0_video_7298808114189585695.mp4" -vn -acodec copy "@hannahabrown0_video_7298808114189585695.m4a"
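In that command, `-vn` drops the video stream and `-acodec copy` copies the audio stream as-is rather than re-encoding it, which only works if the audio is already in a codec the `.m4a` container accepts (such as AAC). A minimal sketch of running the same command from Python follows; `build_extract_command` and `extract_audio` are hypothetical helper names, and `ffmpeg` is assumed to be on the `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(video_path: str) -> List[str]:
    """Build the ffmpeg argument list used in this post for a given video."""
    audio_path = str(Path(video_path).with_suffix(".m4a"))
    # -vn: no video; -acodec copy: copy the audio stream without re-encoding.
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "copy", audio_path]


def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the audio file it should have written."""
    command = build_extract_command(video_path)
    # Passing a list (not a shell string) avoids quoting issues with the "@" filename.
    subprocess.run(command, check=True)
    return command[-1]
```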

Finally, I used OpenAI’s Whisper (“a general-purpose speech recognition model”) to generate a transcript.

% whisper "@hannahabrown0_video_7298808114189585695.m4a" --model medium
/Users/dsg/nidicolous/nido/whisper_venv/lib/python3.8/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:03.200]  HotGirls on TikTok, why do I look stupid at the gym?
[00:05.160 --> 00:06.760]  Why do I look stupid at the gym?
[00:08.800 --> 00:09.840]  Don't be nice to me.
[00:09.840 --> 00:10.920]  Don't worry about my feelings.
[00:10.920 --> 00:12.920]  Give it to me straight because I know
[00:12.920 --> 00:16.240]  that I don't look as hot at the gym as I could.
[00:16.240 --> 00:19.040]  But I don't know what the problem is.
[00:19.040 --> 00:20.860]  Is it that I need a set?
[00:20.860 --> 00:22.420]  Is it the socks?
[00:22.420 --> 00:23.260]  Is it the shoes?
[00:23.260 --> 00:25.520]  I feel like these sneakers look fucking stupid.
[00:25.520 --> 00:27.380]  Is it the hairstyle?
[00:27.420 --> 00:30.600]  Is it the jewelry or lack thereof in this region?
[00:32.140 --> 00:33.740]  Why do I look stupid at the gym?
[00:35.300 --> 00:36.420]  Someone tell me.
[00:37.420 --> 00:39.380]  Because I want to look hot all the time.
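Since the goal was the text itself, the timestamps can be stripped from this console output. The sketch below assumes the output has been captured to a string and that every segment line follows the `[start --> end]  text` shape shown above; `parse_segments` and `plain_text` are hypothetical helper names, not part of Whisper.

```python
import re
from typing import List, Tuple

# Matches lines like "[00:00.000 --> 00:03.200]  Some text".
SEGMENT = re.compile(r"^\[([\d:.]+) --> ([\d:.]+)\]\s*(.*)$")


def parse_segments(output: str) -> List[Tuple[str, str, str]]:
    """Return (start, end, text) for each timestamped line in the output."""
    segments = []
    for line in output.splitlines():
        match = SEGMENT.match(line.strip())
        if match:  # warnings and language-detection lines are skipped
            start, end, text = match.groups()
            segments.append((start, end, text.strip()))
    return segments


def plain_text(output: str) -> str:
    """Join the segment texts into a single line of transcript."""
    return " ".join(text for _, _, text in parse_segments(output))
```

Feeding the captured output to `plain_text` gives the transcript as one string, and `parse_segments` keeps the timing available if it is needed later.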

Footnotes

This is an example of what appears to be a very effective packaging of a question or request for help: a simple message, signalling preparation for responses, and highlighting possible issues (while I generated a transcript, the video itself is part of the packaging). I discuss packaging of questions in my dissertation: Ch. 5. Repairing searching: Due diligence and packaging questions. I’m not suggesting this approach might work for everyone: this searcher is an actor.