Video transcription using Google Vertex

I’m trying to build a transcription bot by replacing ChatGPT Whisper with Google Gemini Vertex. I’m facing an issue with the shareable link.


Here is the blueprint:
blueprint.json (48.8 KB)

Output bundle for the Google Drive “Get shareable link” element:

[
    {
        "fileId": "1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F",
        "kind": "drive#permission",
        "id": "anyoneWithLink",
        "type": "anyone",
        "role": "reader",
        "allowFileDiscovery": false,
        "shareLink": "https://drive.google.com/file/d/1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F",
        "webContentLink": "https://drive.google.com/uc?id=1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F&export=download"
    }
]

Input bundle for Google Gemini Vertex:

[
    {
        "topK": 32,
        "topP": 1,
        "model": "gemini-pro-vision",
        "messages": [
            {
                "role": "user",
                "prompt": "transcribe the video",
                "fileUri": "https://drive.google.com/file/d/1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F",
                "mimeType": "video/mp4",
                "videoMetadata": {
                    "endOffset": {},
                    "startOffset": {}
                },
                "fileUploadType": "fileUri"
            }
        ],
        "projectId": "make-vertex-436917",
        "temperature": 0.4,
        "serviceEndpointLocationId": "us-central1"
    }
]

I think you want to be using the webContentLink (which is the actual raw video URL), not the shareLink.
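If you only have the share link to hand, the download-style link can be rebuilt from the file ID. This is a minimal sketch based purely on the two URL shapes in your output bundle; the function name and regex are my own illustration, not something from Make or Google, and I have not checked whether Vertex will actually accept the resulting link.

```python
import re
from urllib.parse import urlencode

# Matches the file ID in links shaped like https://drive.google.com/file/d/<id>
# (assumption: IDs only contain letters, digits, "_" and "-").
_FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def share_link_to_download_link(share_link: str) -> str:
    """Turn a Drive shareLink into a webContentLink-style download URL."""
    match = _FILE_ID_PATTERN.search(share_link)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_link}")
    # Same parameter order as the webContentLink in the output bundle.
    query = urlencode({"id": match.group(1), "export": "download"})
    return f"https://drive.google.com/uc?{query}"


if __name__ == "__main__":
    link = "https://drive.google.com/file/d/1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F"
    print(share_link_to_download_link(link))
```

Inside Make itself you would not need any of this: mapping webContentLink from the Google Drive output into fileUri does the same job.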

Separately, though, I’m interested to see how your use case goes. I just tried something similar, with a working video URL, and it just spat out complete hallucination for the transcript.

When I asked it to identify speaker names from captions in the lower third, it got the two speakers’ names correct, and then hallucinated a load of others that weren’t present in the video.

Is gemini-pro-vision even a real model? Maybe the first version? The doc page “Explore vision capabilities with the Gemini API” (Google AI for Developers) suggests transcribing video with gemini-1.5-pro.
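For comparison, here is roughly what I would expect the equivalent raw request to look like with the model swapped. This is a sketch under assumptions: I am guessing that the module maps your input bundle onto the Vertex AI generateContent REST shape (fileData with mimeType and fileUri, plus generationConfig), and the helper names are made up for illustration. It only builds the URL and JSON body and sends nothing.

```python
import json


def build_endpoint(project_id: str, location: str, model: str) -> str:
    """Build the generateContent URL for a Google-published model (assumed v1 path)."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/{model}:generateContent"
    )


def build_transcription_request(file_uri: str, mime_type: str = "video/mp4",
                                prompt: str = "transcribe the video") -> dict:
    """Build a request body with one video part and one text part."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"fileData": {"mimeType": mime_type, "fileUri": file_uri}},
                    {"text": prompt},
                ],
            }
        ],
        # Sampling values copied from the input bundle in the original post.
        "generationConfig": {"temperature": 0.4, "topP": 1, "topK": 32},
    }


if __name__ == "__main__":
    url = build_endpoint("make-vertex-436917", "us-central1", "gemini-1.5-pro")
    body = build_transcription_request(
        "https://drive.google.com/uc?id=1hMP4g_9tCvy_FZRMDsCRTGqkCyCv5T9F&export=download"
    )
    print(url)
    print(json.dumps(body, indent=2))
```

Whether a Drive download link is accepted as fileUri is something to verify; a Cloud Storage gs:// URI may be the safer choice if it is not.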

I got fantastic results in Google AI Studio, but don’t know how to achieve the equivalent through this module. Via the same interface, Gemini 1.5 Pro itself denies that its API supports multimodality.

Thanks for your reply, Mr Robert. Basically I’m just trying to transcribe videos using free resources, and yeah, Google is not working like we expected.