[SOLVED] Fireflies.ai bulk transcript downloader
No more manual downloading of fireflies.ai transcripts. Run this script to bulk download. Thanks Claude.ai
For the past two years, I've been using Fireflies.ai to transcribe my MS Teams, Google Meet and Zoom calls. I have hundreds of transcripts, BUT Fireflies has no bulk download capability. I often run these transcripts through ChatGPT to summarise project discussions or mine them for workshop improvements. My solution was to ask my VA to download the files manually and work from there.
I’m working on a project that requires 40 to 50 files, so I needed an alternative to my VA.
Enter Claude.AI
To solve this problem, I asked Claude to suggest options. Claude suggested a Python script 👀 (despite my having zero Python experience). Working on my Mac, Claude guided me through multiple iterations over about 45 minutes until we had a working solution for automatic bulk downloads. While this automation has made my virtual assistant's transcription management role redundant, it's solved my problem, saved me money and paid for my Claude, ChatGPT and Fireflies subscriptions!
The script has been performing excellently, and I wanted to share it with you. Try it out and let me know if you discover any bugs or potential improvements.
Here are the complete step-by-step instructions to download your Fireflies.ai transcripts:
Initial Setup (one-time only):
```bash
# Create a working directory
mkdir fireflies_downloader
cd fireflies_downloader

# Create the Python script
nano fireflies_downloader.py

# Get your API key from Fireflies.ai:
# 1. Log into Fireflies.ai
# 2. Go to Settings → Developer Settings → API
# 3. Copy your API key

# Set your API key in Terminal
export FIREFLIES_API_KEY="your_api_key_here"

# Install the required Python package
pip3 install requests
```
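Before moving on, you can confirm the key is actually visible to Python with a quick check. This is just a sanity-check sketch of mine; the `masked_key` helper is illustrative, though it mirrors the masked printout the downloader script itself produces.

```python
import os

def masked_key(key):
    """Show only the first and last 5 characters of an API key."""
    return f"{key[:5]}...{key[-5:]}"

api_key = os.getenv("FIREFLIES_API_KEY")
if not api_key:
    print("FIREFLIES_API_KEY is not set in this shell")
else:
    print(f"Using API key: {masked_key(api_key)}")
```

If you see the "not set" message, re-run the `export` line above in the same Terminal window before running the downloader.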
Script Content:
1. Open `fireflies_downloader.py` in nano
2. Press Control + K repeatedly to clear any existing content
3. Paste the full script into `fireflies_downloader.py`
```python
import requests
import json
import os
from datetime import datetime, timedelta
import time


class FirefliesDownloader:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.fireflies.ai/graphql"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def get_transcript_content(self, transcript_id):
        """Fetch detailed content for a specific transcript."""
        query = """
        query GetTranscriptContent($id: String!) {
            transcript(id: $id) {
                title
                id
                transcript_url
                duration
                date
                participants
                sentences {
                    text
                    speaker_id
                    start_time
                }
                summary {
                    keywords
                    action_items
                }
            }
        }
        """
        variables = {"id": transcript_id}
        try:
            print(f"Fetching content for transcript {transcript_id}...")
            response = requests.post(
                self.base_url,
                headers=self.headers,
                json={"query": query, "variables": variables}
            )
            if response.status_code != 200:
                print(f"Error getting transcript content: {response.status_code}")
                print(response.text)
                return None
            data = response.json()
            if "errors" in data:
                print("API returned errors:", json.dumps(data["errors"], indent=2))
                return None
            return data["data"]["transcript"]
        except requests.exceptions.RequestException as e:
            print(f"Error fetching transcript content: {e}")
            return None

    def get_transcripts(self, limit=25, to_date=None):
        """Fetch a batch of transcripts, optionally only those up to a date."""
        query = """
        query GetTranscripts($limit: Int, $toDate: DateTime) {
            transcripts(limit: $limit, toDate: $toDate) {
                title
                id
                transcript_url
                duration
                date
                participants
            }
        }
        """
        variables = {
            "limit": min(limit, 25),
            "toDate": to_date.strftime("%Y-%m-%dT%H:%M:%S.000Z") if to_date else None
        }
        try:
            print(f"Making API request for transcripts up to date {to_date if to_date else 'now'}...")
            response = requests.post(
                self.base_url,
                headers=self.headers,
                json={"query": query, "variables": variables}
            )
            data = response.json()
            if "errors" in data:
                print("API returned errors:", json.dumps(data["errors"], indent=2))
                return None
            return data
        except requests.exceptions.RequestException as e:
            print(f"Error fetching transcripts: {e}")
            return None

    def save_transcripts(self, output_dir="transcripts", to_date=None):
        """Save all transcripts with their content."""
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # Collect IDs of transcripts we already have to avoid duplicates
        existing_ids = set()
        for filename in os.listdir(output_dir):
            if filename.endswith('.json'):
                try:
                    with open(os.path.join(output_dir, filename)) as f:
                        data = json.load(f)
                        existing_ids.add(data['id'])
                except Exception:
                    pass  # Ignore unreadable or malformed files
        last_date = to_date or datetime.now()
        while True:
            result = self.get_transcripts(limit=25, to_date=last_date)
            if not result or "data" not in result or "transcripts" not in result["data"]:
                print("No more transcripts found or error in API response")
                break
            transcripts = result["data"]["transcripts"]
            if not transcripts:
                print("No more transcripts found")
                break
            print(f"\nFound {len(transcripts)} transcripts")
            earliest_date = None
            for transcript in transcripts:
                # Skip if we already have this transcript
                if transcript['id'] in existing_ids:
                    print(f"Skipping existing transcript: {transcript['title']}")
                    continue
                print(f"\nProcessing: {transcript['title']}")
                # Get detailed content
                content = self.get_transcript_content(transcript['id'])
                if content:
                    transcript = content  # Replace with full content
                # Build a safe filename; Fireflies dates are Unix timestamps in ms
                date_obj = datetime.fromtimestamp(int(transcript["date"]) / 1000)
                safe_title = ''.join(c for c in transcript['title'] if c.isalnum() or c in (' ', '-', '_', '.'))
                filename = f"{date_obj.strftime('%Y-%m-%d')}_{safe_title[:50]}.json"
                filepath = os.path.join(output_dir, filename)
                with open(filepath, 'w', encoding='utf-8') as f:
                    json.dump(transcript, f, indent=2, ensure_ascii=False)
                print(f"Saved transcript: {filename}")
                # Track earliest date we've seen
                if earliest_date is None or date_obj < earliest_date:
                    earliest_date = date_obj
                time.sleep(1)  # Be nice to the API
            if earliest_date:
                # Subtract 1 second to avoid getting the same transcript again
                last_date = earliest_date - timedelta(seconds=1)
                time.sleep(1)  # Extra pause between batches
            else:
                break


def main():
    api_key = os.getenv("FIREFLIES_API_KEY")
    if not api_key:
        print("Please set your FIREFLIES_API_KEY environment variable")
        return
    print(f"Using API key: {api_key[:5]}...{api_key[-5:]}")
    # Find the earliest date from existing files so reruns continue further back
    earliest_date = None
    if os.path.exists("transcripts"):
        try:
            for file in os.listdir("transcripts"):
                if file.endswith('.json'):
                    with open(os.path.join("transcripts", file)) as f:
                        data = json.load(f)
                    date = datetime.fromtimestamp(int(data["date"]) / 1000)
                    if earliest_date is None or date < earliest_date:
                        earliest_date = date
            if earliest_date:
                print(f"Starting from before date: {earliest_date}")
        except Exception as e:
            print(f"Error reading files: {e}")
    downloader = FirefliesDownloader(api_key)
    downloader.save_transcripts(to_date=earliest_date)


if __name__ == "__main__":
    main()
```
Save with Control + X, then Y, then Enter
Running the Script:
```bash
# Navigate to the directory (if you're not already there)
cd fireflies_downloader

# Run the script
python3 fireflies_downloader.py
```
What to Expect:
- The first run will download the most recent transcripts
- Each subsequent run will fetch older transcripts
- The script will automatically:
  - Skip duplicates
  - Show progress
  - Create a `transcripts` folder
  - Save each transcript as a JSON file named with date and title
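For reference, the filename generation can be reproduced in isolation. The sketch below mirrors the sanitisation logic inside the script's `save_transcripts` (the `make_filename` helper name is mine, not part of the script), assuming Fireflies dates are millisecond Unix timestamps as the script does.

```python
from datetime import datetime

def make_filename(title, date_ms):
    """Build the YYYY-MM-DD_Title.json name the downloader uses."""
    # Fireflies dates arrive as Unix timestamps in milliseconds
    date_obj = datetime.fromtimestamp(int(date_ms) / 1000)
    # Keep only filesystem-safe characters and cap the title at 50 chars
    safe_title = ''.join(c for c in title if c.isalnum() or c in (' ', '-', '_', '.'))
    return f"{date_obj.strftime('%Y-%m-%d')}_{safe_title[:50]}.json"

# e.g. "2023-11-14_Weekly Sync Q3  Roadmap.json" (date depends on your timezone)
print(make_filename("Weekly Sync: Q3 / Roadmap", 1700000000000))
```

Colons, slashes and other unsafe characters are simply dropped, so two meetings with very similar titles on the same day can collide; renaming one file afterwards is the easy workaround.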
To Get More Transcripts:
- Just run the script again
- Keep running until you see "No more transcripts found"
- Each run will go further back in time
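Under the hood, each run works like a date cursor: the script takes the earliest meeting date it saw in a batch, steps back one second, and sends that as the next `toDate`. A standalone sketch of that step (the `next_to_date` helper is my own name; I also pin the timezone to UTC here for a deterministic example, where the script itself uses local time):

```python
from datetime import datetime, timedelta, timezone

def next_to_date(earliest_ms):
    """Given the earliest transcript date in a batch (ms since epoch),
    build the toDate string for the next GraphQL request."""
    earliest = datetime.fromtimestamp(int(earliest_ms) / 1000, tz=timezone.utc)
    # Step back one second so the same transcript isn't returned again
    cursor = earliest - timedelta(seconds=1)
    return cursor.strftime("%Y-%m-%dT%H:%M:%S.000Z")

print(next_to_date(1700000000000))  # → 2023-11-14T22:13:19.000Z
```

This is why reruns are safe: the cursor only ever moves backwards, and already-saved IDs are skipped on the way.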
Files Location:
- All transcripts are saved in the `transcripts` folder
- Files are named like `YYYY-MM-DD_Title.json`
- Each file contains:
  - Full conversation text
  - Speaker information
  - Summary and keywords
  - Meeting metadata
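Since my main use case is feeding transcripts into ChatGPT, here's a small sketch for turning one saved file into plain text. The `to_plain_text` helper and the sample dict are illustrative additions of mine, but the field names match the GraphQL query the script sends.

```python
def to_plain_text(transcript):
    """Flatten a saved transcript's sentences into speaker-tagged lines."""
    lines = [transcript["title"]]
    for s in transcript.get("sentences") or []:
        lines.append(f"[{s['speaker_id']}] {s['text']}")
    return "\n".join(lines)

# Illustrative sample shaped like the files the script saves
sample = {
    "title": "Project Kickoff",
    "sentences": [
        {"text": "Welcome everyone.", "speaker_id": 0, "start_time": 0.0},
        {"text": "Let's review the agenda.", "speaker_id": 1, "start_time": 2.5},
    ],
}
print(to_plain_text(sample))
```

In practice you'd `json.load` one of the saved files and pass the result straight to this function before pasting the text into your prompt.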
Troubleshooting:
If you get an "API key not found" error:
```bash
export FIREFLIES_API_KEY="your_api_key_here"
```
If you close Terminal and come back later:
```bash
cd fireflies_downloader
export FIREFLIES_API_KEY="your_api_key_here"
python3 fireflies_downloader.py
```
Best Practices:
- Run during off-peak hours
- Let each run complete
- Keep the script and files for future use
- Make backups of your `transcripts` folder
The script is designed to be safe to run multiple times - it won't duplicate downloads and will automatically continue where it left off.
Hey Leslie. Great work. We've added a few features and shared a revised version on GitHub. It now handles speaker diarization, retries, and workspace-safe reruns.
https://github.com/humanrace-ai/FireFlies_Fetch
Thanks Michael, that's awesome.