TikTok to text: Pyktok, FFmpeg, and Whisper

    November 10th, 2023

    I recently wanted to record the text from a TikTok video (as an example of an effective social-search-request)¹: “HotGirls on TikTok, why do I look stupid at the gym?”

    I used Deen Freelon’s Pyktok (“A simple module to collect video, text, and metadata from Tiktok.”) to download the video.

    >>> import pyktok as pyk
    We strongly recommend you run 'specify_browser' first, which will allow you to run pyktok's functions without using the browser_name parameter every time. 'specify_browser' takes as its sole argument a string representing a browser installed on your system, e.g. "chrome," "firefox," "edge," etc.
    >>> pyk.specify_browser('chrome')
    >>> pyk.save_tiktok('https://www.tiktok.com/@hannahabrown0/video/7298808114189585695?is_copy_url=1&is_from_webapp=v1', True, 'video_data.csv')
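
    Pyktok saves the video into the working directory. Judging from the file name that appears in the FFmpeg command in this post, the name looks like it is built from the URL path. The helper below (`expected_video_filename` is a hypothetical name, not part of Pyktok) reproduces that pattern. It is an inference from this single example, not documented Pyktok behavior.

```python
from urllib.parse import urlparse


def expected_video_filename(url: str) -> str:
    """Guess the local .mp4 name for a TikTok video URL.

    Assumption: the name is the URL path with "/" replaced by "_",
    as observed for the one video in this post.
    """
    path = urlparse(url).path.strip("/")  # "@user/video/<id>", query string dropped
    return path.replace("/", "_") + ".mp4"


url = (
    "https://www.tiktok.com/@hannahabrown0/video/7298808114189585695"
    "?is_copy_url=1&is_from_webapp=v1"
)
print(expected_video_filename(url))
```

    This is only useful for chaining the download into the next step from a script; checking the directory listing is the safer way to confirm the name.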

    I then converted the video to audio with FFmpeg (“A complete, cross-platform solution to record, convert and stream audio and video.”). (Note: This step is not necessary, since Whisper can transcribe directly from video, but it did reduce the file size I was working with from 4.5 MB to 160 KB.)

    % ffmpeg -i "@hannahabrown0_video_7298808114189585695.mp4" -vn -acodec copy "@hannahabrown0_video_7298808114189585695.m4a"
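
    The same conversion can be driven from Python, which helps when processing more than one video. This is a minimal sketch: the function names are hypothetical, and it assumes `ffmpeg` is on the PATH and that the audio stream is in a codec the `.m4a` container accepts (such as AAC), since `-acodec copy` copies the stream without re-encoding.

```python
import shlex
import shutil
import subprocess


def build_extract_audio_cmd(src: str, dst: str) -> list:
    """Build the FFmpeg argument list used in this post."""
    # -vn drops the video stream; -acodec copy keeps the audio stream as-is.
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "copy", dst]


def extract_audio(src: str, dst: str) -> None:
    """Run FFmpeg, raising if it is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_extract_audio_cmd(src, dst), check=True)


cmd = build_extract_audio_cmd(
    "@hannahabrown0_video_7298808114189585695.mp4",
    "@hannahabrown0_video_7298808114189585695.m4a",
)
print(shlex.join(cmd))  # shows the command without running it
```

    Passing the arguments as a list avoids shell quoting problems with the `@` in the file names. Calling `extract_audio(...)` with the same two paths would perform the conversion.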

    Finally, I used OpenAI’s Whisper (“a general-purpose speech recognition model”) to generate a transcript.

    % whisper "@hannahabrown0_video_7298808114189585695.m4a" --model medium
    /Users/dsg/nidicolous/nido/whisper_venv/lib/python3.8/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
      warnings.warn("FP16 is not supported on CPU; using FP32 instead")
    Detecting language using up to the first 30 seconds. Use `--language` to specify the language
    Detected language: English
    [00:00.000 --> 00:03.200]  HotGirls on TikTok, why do I look stupid at the gym?
    [00:05.160 --> 00:06.760]  Why do I look stupid at the gym?
    [00:08.800 --> 00:09.840]  Don't be nice to me.
    [00:09.840 --> 00:10.920]  Don't worry about my feelings.
    [00:10.920 --> 00:12.920]  Give it to me straight because I know
    [00:12.920 --> 00:16.240]  that I don't look as hot at the gym as I could.
    [00:16.240 --> 00:19.040]  But I don't know what the problem is.
    [00:19.040 --> 00:20.860]  Is it that I need a set?
    [00:20.860 --> 00:22.420]  Is it the socks?
    [00:22.420 --> 00:23.260]  Is it the shoes?
    [00:23.260 --> 00:25.520]  I feel like these sneakers look fucking stupid.
    [00:25.520 --> 00:27.380]  Is it the hairstyle?
    [00:27.420 --> 00:30.600]  Is it the jewelry or lack thereof in this region?
    [00:32.140 --> 00:33.740]  Why do I look stupid at the gym?
    [00:35.300 --> 00:36.420]  Someone tell me.
    [00:37.420 --> 00:39.380]  Because I want to look hot all the time.
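
    To turn the timestamped console output above into plain running text, the segment lines can be parsed with a regular expression. This is a small sketch written against the line format shown in this post; `parse_segments` and `plain_text` are hypothetical helper names, and the sample is only the first three segments.

```python
import re

# Matches lines such as "[00:00.000 --> 00:03.200]  some text"
SEGMENT = re.compile(r"^\[([\d:.]+) --> ([\d:.]+)\]\s+(.*)$")


def parse_segments(console_output: str) -> list:
    """Return (start, end, text) tuples for each timestamped line."""
    segments = []
    for line in console_output.splitlines():
        match = SEGMENT.match(line.strip())
        if match:  # warnings and status lines are skipped
            segments.append(match.groups())
    return segments


def plain_text(console_output: str) -> str:
    """Join the segment texts into one string, dropping timestamps."""
    return " ".join(text for _, _, text in parse_segments(console_output))


sample = """\
Detected language: English
[00:00.000 --> 00:03.200]  HotGirls on TikTok, why do I look stupid at the gym?
[00:05.160 --> 00:06.760]  Why do I look stupid at the gym?
[00:08.800 --> 00:09.840]  Don't be nice to me.
"""
print(plain_text(sample))
```

    Whisper also has a Python API (`whisper.load_model` and `model.transcribe`) that returns the text and segments directly, which avoids parsing console output when working from a script.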

    Footnotes

    1. This is an example of what appears to be a very effective packaging of a question or request for help: a simple message, signalling preparation for responses, and highlighting possible issues (although I generated only a transcript, the video itself is part of the packaging). I discuss packaging of questions in my dissertation: Ch. 5. Repairing searching: Due diligence and packaging questions. I’m not suggesting this approach would work for everyone: this searcher is an actor.