URL: https://thenewstack.io/how-to-get-dns-right-a-guide-to-common-failure-modes/

⇱ How To Get DNS Right: A Guide to Common Failure Modes - The New Stack



How To Get DNS Right: A Guide to Common Failure Modes

Demystifying the most common DNS problems, from simple misconfigurations to major security attacks.
Dec 24th, 2025 8:00am by Sheldon Pereira and Denton Chikura
Featured image from ParinPix on Shutterstock.
Catchpoint sponsored this post.

This is the first in a two-part series.

If you’ve spent any time diagnosing outages or performance issues, you know the refrain when nothing seems to work: “It’s probably DNS.” The Domain Name System (DNS) remains the backbone of digital connectivity. Every click, application call and transaction depends on it to translate names into addresses so users can reach your services.

But while the basics of DNS are well-known, monitoring and troubleshooting this critical layer demands ongoing vigilance and advanced tooling. This two-part series walks through why DNS problems are so hard to see, then shows how to monitor, test and validate DNS performance from the user’s point of view.

The DNS Risk Landscape

DNS plays a vital role in directing users to their intended destinations. Since most organizations depend on external DNS providers, they often have limited visibility into the service’s overall reachability, performance and the security of records in real time. Understanding the main failure modes will help you decide what to monitor.

1. Micro‑Outages

Micro‑outages briefly prevent users from resolving a domain. They may last from a few minutes to an hour and affect only certain regions or networks. Anycast, a routing method in which multiple geographically distributed servers share one IP address and each query is routed to a nearby one, can mask underlying problems because a node may continue advertising its Border Gateway Protocol (BGP) route even when some paths or sites are unhealthy. Common causes include:

  • Data center or point of presence (PoP) outages.
  • Routing or connectivity incidents between networks.
  • Server performance saturation.
  • Capacity limits that trigger timeouts during bursts.
  • ISP-specific routing or packet loss issues affecting only certain user segments.

To users, this looks like a random failure to load your site, then a normal experience on retry. To operations teams, it can be hard to reproduce without continuous, distributed testing.
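
As a rough illustration of why continuous, distributed testing helps, the sketch below groups probe results by region and reports runs of consecutive failures. The `Probe` type, the region names and the `min_failures` threshold are assumptions made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Probe:
    """One hypothetical resolution test result from a vantage point."""
    timestamp: float  # seconds since the start of the test run
    region: str       # label for where the probe ran
    ok: bool          # True if the name resolved successfully


def find_outage_windows(
    probes: List[Probe], region: str, min_failures: int = 2
) -> List[Tuple[float, float]]:
    """Return (first_failure, last_failure) pairs for each run of at least
    `min_failures` consecutive failed probes in the given region."""
    windows: List[Tuple[float, float]] = []
    run: List[float] = []
    in_region = sorted(
        (p for p in probes if p.region == region), key=lambda p: p.timestamp
    )
    for probe in in_region:
        if not probe.ok:
            run.append(probe.timestamp)
            continue
        if len(run) >= min_failures:
            windows.append((run[0], run[-1]))
        run = []
    if len(run) >= min_failures:  # a failure run still open at the end
        windows.append((run[0], run[-1]))
    return windows


# Made-up data: one region fails for three probes while another stays healthy.
sample = [
    Probe(0, "eu", True), Probe(60, "eu", False), Probe(120, "eu", False),
    Probe(180, "eu", False), Probe(240, "eu", True),
    Probe(0, "us", True), Probe(60, "us", True), Probe(120, "us", True),
]
eu_windows = find_outage_windows(sample, "eu")
us_windows = find_outage_windows(sample, "us")
```

A single-location check would see only one of these regions, which is how a regional micro-outage can go unnoticed.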

2. Misconfigurations

Configuration mistakes are a frequent root cause of resolution failures. A few high‑impact examples:

  • CNAME at the apex
    CNAME (Canonical Name) records create aliases that point one domain name at another, so different names can lead users to the same destination. For example, `help.mystore.com` and `support.mystore.com` can both direct visitors to the same destination. The name a CNAME record belongs to is its owner, and the name it points to is its target. CNAME records should never be configured at the apex domain. This restriction exists because of the way CNAME records interact with other records at the owner name: A CNAME redirects queries for its owner to the records of the target, so no other record type may share that name. The apex, however, must always carry the zone’s SOA and NS records. When a CNAME is placed at the apex alongside them, a conflict occurs, and resolvers that follow the CNAME can overlook the apex NS, MX or A records. This conflict leads to resolution failures.

For instance, www.google.com can point to google.com using a CNAME, but google.com itself should not be a CNAME since it represents the apex domain.
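
A minimal sketch of a pre-deployment check for this mistake follows. It assumes the zone is available as simple `(owner name, record type)` pairs, and the function name and sample records are hypothetical. RRSIG and NSEC are excluded because DNSSEC-signed zones publish them next to a CNAME.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Record types that may legitimately share a name with a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def find_cname_conflicts(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the names where a CNAME coexists with another record type."""
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalize case and any trailing dot so equal names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    )


# Hypothetical zone data: the apex carries SOA and NS records plus a CNAME.
zone_records = [
    ("mystore.com.", "SOA"),
    ("mystore.com.", "NS"),
    ("mystore.com.", "CNAME"),
    ("help.mystore.com.", "CNAME"),
    ("support.mystore.com.", "CNAME"),
]
conflicts = find_cname_conflicts(zone_records)
```

Here only the apex is flagged, because the subdomain aliases have no other records at the same name.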

  • Missing glue records
    A records link a website’s domain or subdomain to an IPv4 address, allowing users to reach the correct server. Most websites use a single A record, although larger sites that implement round-robin load balancing may configure multiple A records for the same name.

Glue records are address records (A or AAAA) that the parent zone publishes alongside a delegation’s nameserver (NS) records, so resolvers can learn the nameserver’s IP address. They break a circular dependency: When a nameserver’s name sits inside the zone it serves, a resolver cannot look up that name without first reaching the nameserver. Without glue records, operations like delegation, dynamic DNS updates and normal query resolution can run into issues or fail outright.

Glue issues typically occur only when the nameserver is inside the zone being delegated (ns1.example.com for example.com); adding glue for external nameservers is unnecessary and can itself become a misconfiguration.
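
The in-zone rule above can be expressed as a small name comparison. This is a sketch only: `needs_glue` is a hypothetical helper and `ns1.dnsprovider.net` is a made-up external nameserver.

```python
def needs_glue(zone: str, nameserver: str) -> bool:
    """Return True when the nameserver's name is at or below the delegated
    zone, which is the case where the parent must publish glue."""
    zone = zone.lower().rstrip(".")
    nameserver = nameserver.lower().rstrip(".")
    # Compare whole labels so "ns1.notexample.com" does not match "example.com".
    return nameserver == zone or nameserver.endswith("." + zone)


in_zone = needs_glue("example.com", "ns1.example.com")
external = needs_glue("example.com", "ns1.dnsprovider.net")
```

Matching on a leading dot rather than a plain suffix avoids treating a look-alike domain as part of the zone.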

  • Incorrect TTL values
    DNS time to live (TTL) values define how long a response stays in cache. Setting them improperly can be the difference between a near-instant cached lookup and a much slower query that has to traverse the internet to get a fresh answer. How long to cache responses should be guided by the characteristics of your environment. Highly dynamic systems will run into problems with a 24-hour TTL because records change too frequently, while more static environments may not need a 5-minute TTL and can even gain performance benefits by increasing it. Overly long TTLs can also slow down failovers or cutovers because resolvers may continue serving stale IP addresses.
  • Lame delegation
    Domain names are typically required to use at least two nameservers. When a query is made, each nameserver that responds can be either properly authoritative or “lame,” meaning it is listed as authoritative but does not actually hold authoritative zone data for that domain. To avoid lame delegation and ensure reliable resolution, configure every nameserver so it is correctly authoritative for the appropriate zone associated with the domain. Lame delegations often occur when the NS records at the parent zone list servers that no longer host the zone, causing those servers to return nonauthoritative responses.
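
One signal of a lame delegation is a response whose header lacks the Authoritative Answer (AA) bit. The sketch below reads that bit and the response code from the standard 12-byte DNS message header. It assumes you already hold the raw response to a query sent directly to a listed nameserver for the zone’s SOA record. The `looks_lame` heuristic and the sample headers are illustrative, not a complete diagnostic.

```python
import struct

AA_FLAG = 0x0400     # Authoritative Answer bit in the header flags field
RCODE_MASK = 0x000F  # low four bits of the flags field hold the response code
SERVFAIL = 2
REFUSED = 5


def parse_header_flags(message: bytes) -> dict:
    """Extract the AA bit and response code from a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS message header is 12 bytes long")
    _query_id, flags = struct.unpack("!HH", message[:4])
    return {"authoritative": bool(flags & AA_FLAG), "rcode": flags & RCODE_MASK}


def looks_lame(message: bytes) -> bool:
    """Heuristic: a listed nameserver that answers without the AA bit, or
    with SERVFAIL or REFUSED, is probably not serving the zone."""
    info = parse_header_flags(message)
    return (not info["authoritative"]) or info["rcode"] in (SERVFAIL, REFUSED)


def make_header(flags: int) -> bytes:
    """Build a bare header (ID, flags, four section counts) for illustration."""
    return struct.pack("!HHHHHH", 0x1234, flags, 1, 1, 0, 0)


healthy = looks_lame(make_header(0x8400))    # response with AA set, NOERROR
refused = looks_lame(make_header(0x8005))    # response with REFUSED, no AA
recursive = looks_lame(make_header(0x8180))  # recursive-style answer, no AA
```

In practice you would run this against every nameserver listed in the parent zone’s NS records, since a single lame server can cause intermittent failures.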

3. DNS Poisoning

DNS poisoning, also called cache poisoning or spoofing, occurs when an attacker injects forged DNS data so that resolvers cache and serve malicious answers. Misconfigurations and lack of validation increase exposure. Poisoning can spread downstream when an affected resolver feeds internet service providers, home routers and device caches. The result is traffic redirected to malicious hosts, phishing sites or person‑in‑the‑middle infrastructure.

Figure: Attackers alter a DNS record as part of a DNS poisoning attack.

Domain Name System Security Extensions (DNSSEC) is the strongest defense against cache poisoning because it allows resolvers to verify that DNS records are digitally signed and have not been tampered with.

4. Denial of Service (DoS) Attacks

Attackers can try to make your web resources unavailable by overwhelming a specific service or URL with excessive requests, in what is known as a denial of service (DoS) attack. This floods the service with bogus traffic, crowding out legitimate users and causing severe slowdowns or complete outages.

A distributed denial of service (DDoS) attack uses the same idea but relies on thousands of compromised machines, organized into botnets, across the internet to take the service offline at scale. A more recent variation abuses exposed memcached servers to amplify DDoS traffic even further.

  • Amplification DDoS attacks
    In an amplification attack, attackers exploit small queries that trigger much larger responses. By repeatedly sending these lightweight requests, they force DNS or other services to return disproportionately heavy replies, quickly exhausting the target’s bandwidth and resources.
  • Reflection DDoS attacks
    In reflection attacks, attackers send queries with a spoofed source address so that they appear to originate from the victim’s IP address. The servers send their responses to the victim, which is flooded with traffic it never requested, while the recursive nameserver and authoritative server can also be strained by the amplified load.
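
To make the scale concrete, the arithmetic behind amplification can be sketched as a ratio of response size to query size. The byte counts and attacker bandwidth below are hypothetical figures chosen for illustration, not measurements.

```python
def amplification_factor(query_bytes: int, response_bytes: int) -> float:
    """Ratio of response size to query size for one request."""
    if query_bytes <= 0:
        raise ValueError("query size must be positive")
    return response_bytes / query_bytes


def reflected_traffic_bps(attacker_bps: float, factor: float) -> float:
    """Traffic arriving at the victim if every spoofed query is answered."""
    return attacker_bps * factor


# Hypothetical example: a 60-byte query that triggers a 3,000-byte response.
factor = amplification_factor(60, 3000)
victim_bps = reflected_traffic_bps(10_000_000, factor)  # attacker sends 10 Mbps
```

Under these assumed numbers, each unit of attacker bandwidth becomes 50 units of traffic at the victim, which is why small spoofed queries are attractive to attackers.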

The Business Impact

DNS issues reduce availability and degrade performance. They also undermine security controls that depend on name resolution. Symptoms include elevated error rates, checkout abandonment, login failures, stuck API clients and misrouted email. Because DNS sits before everything else, problems multiply across services.

What Comes Next

Now that you have the context for why DNS fails, the next step is learning how to detect these conditions before users do. Part 2 in this series explains how to monitor DNS for performance, integrity and resilience with tests that reflect real user experience.

Sheldon Pereira is a solutions engineer at Catchpoint, a LogicMonitor company. He specializes in synthetic monitoring, network performance and observability. He works closely with global enterprises to improve digital experience across web, application, API and network layers.
Denton Chikura is an observability advocate focused on helping site reliability engineers and engineering teams discover the tools and capabilities that strengthen internet resilience. With a background at the intersection of monitoring, performance and infrastructure, he works to make complex...