How to optimize YouTube API data for Azure OpenAI?

Sathvik Daraboina 0 Reputation points

Hi everyone,

I am building a web app that takes video transcripts using the YouTube API and sends them to Azure OpenAI to automatically create course modules.

I am facing two main problems as the data grows:

1. Token & Cost Issues: Long video transcripts take up too many tokens. What are the easiest ways to summarize the text before sending it to the LLM so I don't hit token limits or run up a big bill?

2. API Rate Limits: The YouTube API has strict daily quotas, and Azure OpenAI has rate limits (TPM/RPM). What is the best way to handle this without slowing down the frontend? Should I use a basic caching layer or a queue?


2 answers

  1. SRILAKSHMI C 19,195 Reputation points Microsoft External Staff Moderator

    Hello @Sathvik Daraboina

    Thank you for reaching out.

    Based on your scenario, you are facing two constraints: token/cost limitations and rate/quota throttling when processing YouTube transcripts with Azure OpenAI.

    Below are the recommended approaches.

    1. Token & Cost Optimization

    Azure OpenAI counts all of the following toward a model's token limit:

    • Input prompt (including transcript)
    • System messages
    • Conversation history
    • Model response
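    As a rough illustration of how those four components add up, here is a minimal budget check. The 4-characters-per-token ratio is only a rule of thumb for English text, and the function names are hypothetical; a real tokenizer for your model would give exact counts.

```python
# Rough token budget check. CHARS_PER_TOKEN is an assumption (a common
# rule of thumb for English), not an exact value for any specific model.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_budget(system: str, history: list[str], transcript: str,
                max_response_tokens: int, context_window: int) -> bool:
    """Return True if prompt, history, transcript, and response all fit."""
    used = estimate_tokens(system) + estimate_tokens(transcript)
    used += sum(estimate_tokens(message) for message in history)
    return used + max_response_tokens <= context_window
```

    If the check fails, the transcript is the part to shrink, which is what the chunking approach addresses.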

    Recommended approach

    Use a chunking and hierarchical summarization pattern:

    • Split transcript into smaller chunks
    • Summarize each chunk independently
    • Combine chunk summaries into final course modules

    Benefits

    • Avoids context limit errors
    • Reduces token consumption
    • Improves structured output generation

    Also, remove non-essential text, avoid resending chat history in repeated calls, and store and reuse intermediate summaries.
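    The chunk-then-combine pattern can be sketched as follows. This is a minimal single-level version, assuming a hypothetical `summarize` callable that wraps your model call; it splits on words rather than real tokens, and it does not recurse if the joined summaries are themselves too long.

```python
from typing import Callable


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def hierarchical_summary(transcript: str,
                         summarize: Callable[[str], str],
                         max_words: int = 800) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries.

    `summarize` is a hypothetical callable wrapping the model call.
    """
    chunks = split_into_chunks(transcript, max_words)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]  # short transcript: one call is enough
    return summarize("\n".join(partials))
```

    Passing the summarizer in as a parameter keeps the chunking logic testable without any API calls.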

    2. API Rate Limits

    You are dealing with two separate throttling systems:

    YouTube API

    • Daily quota restrictions

    Recommendation:

    • Cache transcripts by Video ID
    • Avoid repeated API calls for same video

    Azure OpenAI

    Controlled by:

    • Requests Per Minute (RPM)
    • Tokens Per Minute (TPM)

    Exceeding limits results in HTTP 429 throttling.

    Recommendation:

    • Implement exponential backoff retry logic
    • Control concurrency in backend processing layer
    • Monitor TPM/RPM utilization in Azure Monitor
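    A minimal sketch of exponential backoff with jitter is shown below. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on HTTP 429, and the default delays are illustrative. If the 429 response carries a Retry-After header, honoring it is usually preferable to a computed delay.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def call_with_backoff(fn: Callable[[], T], max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_retries must be >= 0")
```

    The `sleep` parameter exists only so the function can be exercised without real waiting.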

    3. Recommended Architecture

    To ensure scalability and prevent frontend blocking, use asynchronous processing.

    Suggested pattern:

    • Frontend submits request
    • Backend places job in queue
    • Worker processes transcript
    • Azure OpenAI processes in chunks
    • Results stored in cache/database
    • Frontend retrieves final output
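    The flow above can be sketched in-process with `asyncio`. This is only an illustration: a production setup would more likely use a durable queue and a persistent job store, and `process_video` is a hypothetical placeholder for the fetch, chunk, and summarize steps.

```python
import asyncio
import uuid

# In-memory job store for illustration; a real app would persist this.
jobs: dict[str, dict] = {}


async def process_video(video_id: str) -> str:
    """Hypothetical placeholder for fetch + chunk + summarize."""
    await asyncio.sleep(0)  # stands in for slow I/O
    return f"modules for {video_id}"


def submit(queue: asyncio.Queue, video_id: str) -> str:
    """Enqueue a job and return its id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    queue.put_nowait((job_id, video_id))
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Process jobs one at a time; the worker count caps concurrency."""
    while True:
        job_id, video_id = await queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = await process_video(video_id)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id].update(status="failed", result=str(exc))
        finally:
            queue.task_done()


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(2)]
    job_id = submit(queue, "abc123")
    await queue.join()  # wait until every queued job is processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return jobs[job_id]
```

    The number of worker tasks acts as the concurrency limit toward Azure OpenAI, and the frontend only ever reads the job record by id.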

    4. Caching Strategy

    Implement multi-layer caching:

    • Transcript cache per YouTube Video ID
    • Chunk-level summary cache
    • Final output cache

    This reduces both YouTube API usage and Azure OpenAI token consumption.
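    As a sketch of the three layers, the class below uses plain dictionaries in place of a real cache service. The layer names and the `get_or_compute` helper are illustrative assumptions, not part of any Azure SDK.

```python
from typing import Callable


class LayeredCache:
    """In-memory stand-in for transcript, chunk-summary, and output caches."""

    def __init__(self) -> None:
        self._layers: dict[str, dict[str, str]] = {
            "transcript": {},
            "chunk_summary": {},
            "final_output": {},
        }

    def get_or_compute(self, layer: str, key: str,
                       compute: Callable[[], str]) -> str:
        """Return the cached value, calling compute only on a miss."""
        store = self._layers[layer]
        if key not in store:
            store[key] = compute()  # the expensive API call happens here
        return store[key]
```

    Checking the final-output layer first means a repeated request for the same video costs neither YouTube quota nor tokens.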

    5. Monitoring & Scaling

    In Azure OpenAI:

    • Monitor HTTP 429 errors
    • Track TPM/RPM usage
    • Monitor latency and throttling patterns

    If workload increases, consider:

    • Increasing quota
    • Using Provisioned Throughput (PTU) for predictable performance

    Please refer to the following:

    Azure OpenAI On Your Data (classic) best practices (token limits counted, avoid long prompts, streaming): https://learn.microsoft.com/en-us/azure/foundry-classic/openai/concepts/use-your-data?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider&tabs=ai-search%2Ccopilot#setting-chunk-size-for-your-use-case

    Azure OpenAI in Microsoft Foundry Models v1 API (Responses API and base URL patterns): https://learn.microsoft.com/en-us/azure/foundry/openai/api-version-lifecycle?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider&tabs=python#code-changes

    I hope this helps. Do let me know if you have any further queries.

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

    Thank you!


  2. Amira Bedhiafi 42,941 Reputation points MVP Volunteer Moderator

    Hello Sathvik!

    Thank you for posting on MS Learn Q&A.

    A good pattern is not to send the full YouTube transcript directly from the frontend to Azure OpenAI. Treat it as an asynchronous ingestion and summarization pipeline.

    For your frontend:

    • The user submits a YouTube URL.
    • The backend returns a jobId immediately.
    • The UI polls the job status, or uses SignalR or WebSocket for progress updates.

    For your backend pipeline:

    Check the cache first:

    • Key by videoId + language + transcriptVersion/hash.
    • Store the transcript, cleaned transcript, chunk summaries, and final course module JSON.
    • Use Azure Blob Storage / Cosmos DB / SQL for persisted results.
    • Use Azure Cache for Redis for the hot cache and job status.
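    One way to build such a key is sketched below. The key layout and the choice of SHA-256 are assumptions for illustration. Since the hash needs the transcript text, it suits derived artifacts such as summaries; the initial transcript lookup would use videoId and language alone.

```python
import hashlib


def cache_key(video_id: str, language: str, transcript: str) -> str:
    """Build a key from the video ID, language, and a transcript hash."""
    digest = hashlib.sha256(transcript.encode("utf-8")).hexdigest()[:16]
    return f"{video_id}:{language}:{digest}"
```

    If the captions are later edited, the hash changes and stale summaries are no longer matched.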

    Then fetch the transcript once, and be careful with the YouTube quota: captions.list costs 50 units, and captions.download costs 200 units and requires permission to edit the video, which in practice limits it to videos you own or manage. The default YouTube quota is 10,000 units per day, and quota increases require a compliance audit process.

    https://developers.google.com/youtube/v3/docs/captions/list

    https://developers.google.com/youtube/v3/docs/captions/download

    https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits
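    To see what those costs mean in practice, here is a small worked calculation using the figures above. The `cache_hit_rate` parameter is a hypothetical input, and the function assumes one list call plus one download call per uncached video.

```python
DAILY_QUOTA = 10_000   # default units per day
LIST_COST = 50         # captions.list
DOWNLOAD_COST = 200    # captions.download


def videos_per_day(cache_hit_rate: float = 0.0) -> int:
    """Estimate how many video requests fit in the daily quota.

    Cached videos are assumed to cost no quota units.
    """
    cost_per_miss = LIST_COST + DOWNLOAD_COST
    expected_cost = cost_per_miss * (1.0 - cache_hit_rate)
    if expected_cost == 0:
        raise ValueError("a hit rate of 1.0 consumes no quota")
    return int(DAILY_QUOTA // expected_cost)


print(videos_per_day())  # 40
```

    With no caching, the default quota covers only 40 new videos per day, which is why caching by video ID matters so much here.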

    You can clean and reduce the text before it reaches the LLM by removing timestamps, duplicate lines, filler words, and sponsor sections, then splitting by transcript timestamps, headings, or semantic chunks. You can use token-aware chunking, for example 1K–2K tokens per chunk with a small overlap.
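    A minimal sketch of the cleaning and overlap steps follows. The timestamp and filler patterns are illustrative assumptions about the transcript format, and the chunker counts words as a proxy for tokens; swap in a real tokenizer for token-aware sizes.

```python
import re

# Assumed formats: "[00:01] text", "00:05 text", or "1:02:03 text".
TIMESTAMP = re.compile(r"^\s*\[?\d{1,2}:\d{2}(?::\d{2})?\]?\s*")
# A small, illustrative filler list; extend it for your content.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def clean_transcript(raw: str) -> str:
    """Strip leading timestamps, fillers, and consecutive duplicate lines."""
    cleaned: list[str] = []
    for line in raw.splitlines():
        line = TIMESTAMP.sub("", line)
        line = FILLERS.sub("", line).strip()
        if line and (not cleaned or cleaned[-1] != line):
            cleaned.append(line)
    return " ".join(cleaned)


def chunk_with_overlap(words: list[str], size: int,
                       overlap: int) -> list[list[str]]:
    """Split a word list into windows of `size` sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [words[i:i + size] for i in range(0, stop, step)]
```

    The overlap repeats the tail of one chunk at the start of the next, so a sentence cut at a boundary still appears whole in one of them.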

    Then summarize each chunk with a cheaper model and combine the chunk summaries into a video-level outline. From that outline you can generate course modules, lessons, quiz questions, learning objectives, and so on. Store every intermediate result so repeated requests do not reprocess the same transcript.

    At the end, generate the final course modules from the summaries, not from the raw transcript; only send raw chunks again when more detail is needed. For course creation, a structured JSON output works well: modules, lessons, key concepts, activities, assessment questions.
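    As an illustration of what that structured output could look like on the receiving side, here is a small parser. The field names are hypothetical and cover only part of the list above; adapt them to whatever schema you ask the model to produce.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Lesson:
    title: str
    key_concepts: list[str] = field(default_factory=list)


@dataclass
class Module:
    title: str
    lessons: list[Lesson] = field(default_factory=list)
    assessment_questions: list[str] = field(default_factory=list)


def parse_course(raw_json: str) -> list[Module]:
    """Parse model output into Module objects; raises on malformed input."""
    data = json.loads(raw_json)
    modules = []
    for item in data["modules"]:
        lessons = [Lesson(entry["title"], entry.get("key_concepts", []))
                   for entry in item.get("lessons", [])]
        modules.append(Module(item["title"], lessons,
                              item.get("assessment_questions", [])))
    return modules
```

    Parsing into typed objects surfaces a malformed response at ingestion time instead of in the UI.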
