How to optimize YouTube API data for Azure OpenAI?
Hi everyone,
I am building a web app that takes video transcripts using the YouTube API and sends them to Azure OpenAI to automatically create course modules.
I am facing two main problems as the data grows:
1. Token & Cost Issues: Long video transcripts take up too many tokens. What are the easiest ways to summarize the text before sending it to the LLM so I don't hit token limits or run up a big bill?
2. API Rate Limits: The YouTube API has strict daily quotas, and Azure OpenAI has rate limits (TPM/RPM). What is the best way to handle this without slowing down the frontend? Should I use a basic caching layer or a queue?
2 answers
-
SRILAKSHMI C 19,195 Reputation points • Microsoft External Staff • Moderator
Hello @Sathvik Daraboina
Thank you for reaching out.
Based on your scenario, you are facing two constraints: token/cost limitations and rate/quota throttling when processing YouTube transcripts with Azure OpenAI.
Below are the recommended approaches.
1. Token & Cost Optimization
Azure OpenAI enforces strict token limits across:
- Input prompt (including transcript)
- System messages
- Conversation history
- Model response
Recommended approach
Use a chunking and hierarchical summarization pattern:
- Split transcript into smaller chunks
- Summarize each chunk independently
- Combine chunk summaries into final course modules
Benefits
- Avoids context limit errors
- Reduces token consumption
- Improves structured output generation
Also, remove non-essential text, avoid resending chat history in repeated calls, and store and reuse intermediate summaries.
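The chunk-then-combine pattern above can be sketched roughly as follows. This is only an illustration: `summarize` is a hypothetical callable standing in for your Azure OpenAI call, chunk size is measured in whitespace-separated words rather than real model tokens, and `max_rounds` is an assumed safety limit so the loop stops even if the summaries do not shrink.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_hierarchically(
    transcript: str,
    summarize: Callable[[str], str],  # hypothetical: wraps your LLM call
    max_words: int = 800,             # assumed chunk size, in words not tokens
    max_rounds: int = 5,              # assumed cap on summarization rounds
) -> str:
    """Summarize each chunk, then re-summarize the joined summaries until they fit in one chunk."""
    text = transcript
    for _ in range(max_rounds):
        chunks = split_into_chunks(text, max_words)
        if len(chunks) <= 1:
            break  # everything fits in a single call now
        text = "\n".join(summarize(chunk) for chunk in chunks)
    # Final pass produces the combined summary (skipped for empty input).
    return summarize(text) if text.strip() else ""
```

The final call is where you would ask for course modules instead of a plain summary, since by then the input is already small.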
2. API Rate Limits
You are dealing with two separate throttling systems:
YouTube API
- Daily quota restrictions
Recommendation:
- Cache transcripts by Video ID
- Avoid repeated API calls for same video
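A minimal sketch of caching by Video ID, assuming an in-memory dictionary and a hypothetical `fetch` function that wraps your actual YouTube API call. A real deployment would likely back this with a persistent store, but the lookup logic stays the same.

```python
from typing import Callable, Dict


class TranscriptCache:
    """Fetch each transcript at most once per video ID."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # hypothetical: calls the YouTube API
        self._store: Dict[str, str] = {}  # in-memory stand-in for a real store

    def get(self, video_id: str) -> str:
        # Only spend YouTube quota on a cache miss.
        if video_id not in self._store:
            self._store[video_id] = self._fetch(video_id)
        return self._store[video_id]
```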
Azure OpenAI
Controlled by:
- Requests Per Minute (RPM)
- Tokens Per Minute (TPM)
Exceeding limits results in HTTP 429 throttling.
Recommendation:
- Implement exponential backoff retry logic
- Control concurrency in backend processing layer
- Monitor TPM/RPM utilization in Azure Monitor
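Exponential backoff can be sketched along these lines. `RateLimited` is a hypothetical exception standing in for whatever your client library raises on HTTP 429, and the delay values are assumptions to tune for your quota. If the 429 response carries a Retry-After header, honoring it would usually be preferable to a computed delay.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,     # assumed limit
    base_delay: float = 1.0,   # assumed first delay, in seconds
    max_delay: float = 30.0,   # assumed ceiling, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimited, doubling the delay each time and adding jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from workers that were throttled together.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.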
3. Recommended Architecture
To ensure scalability and prevent frontend blocking, use asynchronous processing.
Suggested pattern:
- Frontend submits request
- Backend places job in queue
- Worker processes transcript
- Azure OpenAI processes in chunks
- Results stored in cache/database
- Frontend retrieves final output
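The flow above could look roughly like this in a single process. It is a sketch only: `asyncio.Queue` stands in for a durable queue service, the `status` and `results` dictionaries stand in for your cache/database, and `process` is a hypothetical coroutine that runs the transcript and summarization steps.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict


class JobQueue:
    """In-process stand-in for a queue plus workers. Construct it inside a running event loop."""

    def __init__(self, process: Callable[[str], Awaitable[str]], concurrency: int = 2) -> None:
        self._process = process          # hypothetical: transcript -> course modules
        self._queue: asyncio.Queue = asyncio.Queue()
        self._concurrency = concurrency  # caps parallel LLM calls
        self.status: Dict[str, str] = {}
        self.results: Dict[str, str] = {}

    def submit(self, video_id: str) -> str:
        """Enqueue a job and return its ID immediately, so the frontend is never blocked."""
        job_id = uuid.uuid4().hex
        self.status[job_id] = "queued"
        self._queue.put_nowait((job_id, video_id))
        return job_id

    async def _worker(self) -> None:
        while True:
            job_id, video_id = await self._queue.get()
            self.status[job_id] = "running"
            try:
                self.results[job_id] = await self._process(video_id)
                self.status[job_id] = "done"
            except Exception:
                self.status[job_id] = "failed"
            finally:
                self._queue.task_done()

    async def run_until_empty(self) -> None:
        """Start the workers, wait for all queued jobs to finish, then stop the workers."""
        workers = [asyncio.create_task(self._worker()) for _ in range(self._concurrency)]
        await self._queue.join()
        for worker in workers:
            worker.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
```

The frontend would call something like `submit` through an API endpoint and then read `status[job_id]` until it reports "done" or "failed".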
4. Caching Strategy
Implement multi-layer caching:
- Transcript cache per YouTube Video ID
- Chunk-level summary cache
- Final output cache
This reduces both YouTube API usage and Azure OpenAI token consumption.
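One way to picture the three layers is a lookup that tells the worker which stage it can resume from. The stage names and the in-memory dictionaries below are assumptions for illustration; any key-value store could sit behind them.

```python
from typing import Dict, List


class LayeredCache:
    """Three cache layers keyed by video ID, checked from most to least processed."""

    def __init__(self) -> None:
        self.transcripts: Dict[str, str] = {}
        self.chunk_summaries: Dict[str, List[str]] = {}
        self.final_outputs: Dict[str, str] = {}

    def next_stage(self, video_id: str) -> str:
        """Return the earliest pipeline stage that still has to run for this video."""
        if video_id in self.final_outputs:
            return "serve_final"        # no YouTube or LLM calls needed
        if video_id in self.chunk_summaries:
            return "combine_summaries"  # only the final LLM call is needed
        if video_id in self.transcripts:
            return "summarize_chunks"   # LLM calls needed, no YouTube call
        return "fetch_transcript"       # full pipeline
```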
5. Monitoring & Scaling
In Azure OpenAI:
- Monitor HTTP 429 errors
- Track TPM/RPM usage
- Monitor latency and throttling patterns
If workload increases, consider:
- Increasing quota
- Using Provisioned Throughput (PTU) for predictable performance
Please refer to these resources:
Azure OpenAI On Your Data (classic) best practices (token limits counted, avoid long prompts, streaming): https://learn.microsoft.com/en-us/azure/foundry-classic/openai/concepts/use-your-data?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider&tabs=ai-search%2Ccopilot#setting-chunk-size-for-your-use-case
Azure OpenAI in Microsoft Foundry Models v1 API (Responses API and base URL patterns): https://learn.microsoft.com/en-us/azure/foundry/openai/api-version-lifecycle?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider&tabs=python#code-changes
I hope this helps. Let me know if you have any further queries.
If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?".
Thank you!
-
Amira Bedhiafi 42,941 Reputation points • MVP • Volunteer Moderator
Hello Sathvik!
Thank you for posting on MS Learn Q&A.
A good pattern is not to send the full YouTube transcript directly from the frontend to Azure OpenAI. Treat it as an asynchronous ingestion and summarization pipeline.
For your frontend:
- User submits a YouTube URL
- Return a jobId immediately
- UI polls job status or uses SignalR or WebSocket for progress
For your backend pipeline:
Check the cache first:
- key by videoId + language + transcriptVersion/hash.
- store transcript, cleaned transcript, chunk summaries, final course module JSON
- use Azure Blob Storage / Cosmos DB / SQL for persisted results
- use Azure Cache for Redis for hot cache/job status.
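A possible way to build that cache key, assuming the hash is a truncated SHA-256 of the transcript text. The separator and the 16-character digest length are arbitrary choices for illustration.

```python
import hashlib


def cache_key(video_id: str, language: str, transcript: str) -> str:
    """Build a key from videoId + language + a hash of the transcript text."""
    # The hash changes whenever the transcript is edited, which invalidates old entries.
    digest = hashlib.sha256(transcript.encode("utf-8")).hexdigest()[:16]
    return f"{video_id}:{language.lower()}:{digest}"
```

The same key can be reused as a prefix for the cleaned transcript, chunk summaries, and final JSON so that all derived results are invalidated together.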
Then fetch the transcript once, and be careful with YouTube quota: captions.list costs 50 units, while captions.download costs 200 units and requires permission to edit the video. The default YouTube quota is 10K units per day, and quota increases require a compliance audit process.
https://developers.google.com/youtube/v3/docs/captions/list
https://developers.google.com/youtube/v3/docs/captions/download
https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits
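To put those numbers in perspective, here is the arithmetic as a small sketch. It assumes each new video needs exactly one captions.list call and one captions.download call and that nothing else consumes quota, which will not hold if your app makes other API calls.

```python
DAILY_QUOTA_UNITS = 10_000      # default daily quota from the text above
CAPTIONS_LIST_COST = 50         # units per captions.list call
CAPTIONS_DOWNLOAD_COST = 200    # units per captions.download call


def videos_per_day(daily_quota: int = DAILY_QUOTA_UNITS) -> int:
    """Number of uncached videos that fit in one day's quota (one list + one download each)."""
    cost_per_video = CAPTIONS_LIST_COST + CAPTIONS_DOWNLOAD_COST
    return daily_quota // cost_per_video


print(videos_per_day())  # 10,000 // 250 = 40
```

So at the default quota roughly 40 new videos per day can be processed, which is why caching by video ID matters so much.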
Before calling the LLM, clean and reduce the transcript: remove timestamps, duplicate lines, filler words, and sponsor sections, then split it by transcript timestamps, headings, or semantic chunks. Use token-aware chunking, for example 1K–2K tokens per chunk with a small overlap.
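A rough sketch of the cleaning and overlapping-chunk steps. Several things here are assumptions: the timestamp and filler-word patterns are simple examples that will not match every transcript format, sponsor-section removal is left out, and whitespace-separated words are used as a stand-in for tokens. For real token counts you would swap in the tokenizer that matches your model.

```python
import re
from typing import List

# Assumed formats: 0:09, 00:05, 01:02:03, optionally in square brackets.
TIMESTAMP = re.compile(r"\[?\d{1,2}:\d{2}(?::\d{2})?\]?")
# Assumed filler words; extend for your own content.
FILLER = re.compile(r"\b(?:um+|uh+)\b,?\s*", re.IGNORECASE)


def clean_transcript(raw: str) -> str:
    """Strip timestamps and filler words, and drop consecutive duplicate lines."""
    kept: List[str] = []
    previous = None
    for line in raw.splitlines():
        line = FILLER.sub("", TIMESTAMP.sub("", line))
        line = " ".join(line.split())  # normalize whitespace
        if line and line != previous:
            kept.append(line)
            previous = line
    return " ".join(kept)


def chunk_with_overlap(text: str, max_tokens: int = 1500, overlap: int = 100) -> List[str]:
    """Split into chunks of up to max_tokens words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = text.split()  # word count as a rough proxy for model tokens
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks
```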
Then summarize each chunk with a cheaper model and combine the chunk summaries into a video-level outline. From that outline you can generate course modules, lessons, quiz questions, and learning objectives. Store every intermediate result so repeated requests do not reprocess the same transcript.
At the end, generate the final course modules from the summaries, not the raw transcript; only send raw chunks again when more detail is needed. For course creation, a structured JSON output works well: modules, lessons, key concepts, activities, assessment questions.
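As an illustration of what that structured output might look like on the backend, here is a sketch that parses and checks the model's JSON before storing it. The field names (`modules`, `lessons`, `key_concepts`, `activities`, `assessment_questions`) are hypothetical and simply mirror the list above; use whatever schema you put in your prompt.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lesson:
    title: str
    key_concepts: List[str] = field(default_factory=list)
    activities: List[str] = field(default_factory=list)
    assessment_questions: List[str] = field(default_factory=list)


@dataclass
class Module:
    title: str
    lessons: List[Lesson] = field(default_factory=list)


def parse_course(raw_json: str) -> List[Module]:
    """Parse the LLM's JSON output. Raises KeyError or ValueError if required fields are missing or malformed."""
    data = json.loads(raw_json)
    modules: List[Module] = []
    for module in data["modules"]:
        lessons = [
            Lesson(
                title=lesson["title"],
                key_concepts=list(lesson.get("key_concepts", [])),
                activities=list(lesson.get("activities", [])),
                assessment_questions=list(lesson.get("assessment_questions", [])),
            )
            for lesson in module.get("lessons", [])
        ]
        modules.append(Module(title=module["title"], lessons=lessons))
    return modules
```

Validating before you cache the result means a malformed response can be retried instead of being served to the frontend.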
