What’s this about? This is an Auto-Response for the very common question/issue about Bad/Wrong/Non-Existent URLs being Requested/Crawled by Google:
- GoogleBot requesting URLs/pages that don’t exist
- Google reporting 404 Errors for URLs/pages I don’t have
- Why is GoogleBot crawling URLs/pages that don’t exist?
- Why is GoogleBot requesting URLs that don’t exist?
- Where is Google getting these bad URLs/pages from?
- Page not found Errors – for URLs that are wrong
- Page not found Errors – for URLs/pages that don’t exist
- Crawling Error reported – URL doesn’t exist – why is Google crawling it
Google crawls publicly available information on the web
Well, that’s the basic principle of a web-crawler: it crawls things.
This means that G has obtained a URL, somehow, noted it down, and then tried to crawl it at some point in time.
How does Google get URLs?
Primarily from Links, those on your site and on other sites.
Additionally, there are Sitemaps, and it may also try to “extract” URLs from code such as JavaScript, iFrames etc.
Further still, it may attempt to “guess” URLs. If Google sees you have 20 pages in sequence (page1, page2, page3) it may go looking to see if there is a page21 and page22 etc. It may also be looking at any Forms on your site, which is to say, Googlebot may use a form to explore your site.
Something else to keep in mind is that “Google Remembers things!” So if there “used to be” something at that URL – G may well remember it, and be trying to revisit it.
But 404s are BAD!
Actually, not directly… Google is not going to penalize/punish you/your site simply because it sees one or more 404s.
Indirectly, however, it may have a knock-on effect, or even several.
Encountering a bunch of 404s just means that G is using up its crawl budget for your site/domain/server on URLs that don’t exist, rather than on those that do. This, in turn, may slow down how often other URLs are crawled.
Further, if a URL used to have content and is now gone, then it may mean the value that URL gave to the rest of your site is now being lost/wasted.
So what about “these” Bad URLs?
Well, there are a number of places where you can start looking to see the bad URLs, and where they are coming from.
Google Search Console
Click on “Pages” and scroll down, and you’ll reach the “Why pages aren’t indexed” section.
Click on “Not found (404)” and you’re given a list of all the 404 errors within your site.
From there you can “Inspect URLs” to ascertain where the nonexistent page was linked from.
Note: Google Search Console makes this process extremely cumbersome these days, instead of presenting a comprehensive list of broken inbound links the way it used to. This was done on purpose.
Server Access Logs
You can pore over your server’s Raw Access Logs, which should be available via your Hosting control panel. If they are not, then ask your host about how to access them. If your host has no server access logs, get a new host immediately.
Look for bad response codes by doing a search for “404,” then look for the referrer’s URL. You may see it as originating on your own site, or someone else’s, but this – once again – gives you a breakdown of where the errors originate.
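For example, on a typical Apache host you can filter the raw log from the command line. A minimal sketch, assuming the common “combined” log format – the log’s actual name and location vary by host, so the path below is a placeholder:

```sh
# Show every request that returned a 404 (the line includes the referrer field)
grep ' 404 ' /path/to/access.log

# Or list just the requested URL and the referrer, most frequent first
awk '$9 == 404 { print $7, $11 }' /path/to/access.log | sort | uniq -c | sort -rn
```

In the combined format, field 9 is the status code, field 7 is the requested URL, and field 11 is the referrer – i.e. where the bad link lives.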
But what can I do about it?
Well, that depends entirely on how the bad URLs are occurring.
External Sources
If the Bad URLs are originating elsewhere – on some other site – you have three main options:
- You can contact the owners and ask them to correct it, or if it’s from somewhere you submitted it, like a trade directory, go and edit it yourself, and be more careful next time!
- You can set up server/scripted 301 Redirects to point the broken links at existing URLs (see the sketch after this list)
- You can simply live with it.
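On an Apache server, the quickest way is often an .htaccess rule. This is a minimal sketch with made-up URLs – swap in your own paths, and note that Nginx, IIS and most CMSs have their own equivalents:

```apache
# Map a single broken inbound URL to the page that actually exists
# (/old-broken-page and /real-page are placeholders)
Redirect 301 /old-broken-page /real-page

# Or catch a whole pattern of bad URLs with mod_rewrite
RewriteEngine On
RewriteRule ^old-section/(.*)$ /new-section/$1 [R=301,L]
```

The principle is the same everywhere: a permanent 301 from the bad URL to a good one, so both visitors and GoogleBot end up somewhere useful.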
Internal Sources
If the Bad URLs originate from your own site, then you have the following options:
- Go and fix them! Sorry, but this is kind of obvious, don’t you think?
- Revise how you use/supply URLs within your HTML. For instance, use absolute URLs instead of relative ones – there’s less room for screw-ups that way! (See the example after this list.)
- Test each link on your site as you create it!
- Go to the page that seems to be the source of the error/bad link and use the TAB key. Keep an eye on the status bar (at the bottom of the browser window) to see the URL each link points to. You may have to tab through the entire page to find the dodgy link. If that doesn’t yield any results, do it with JS disabled, as sometimes it’s the JS that causes the issue.
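As for relative vs absolute URLs, here’s a hypothetical snippet showing why relative ones bite. Assume the page lives at https://example.com/blog/post1/ (a made-up address):

```html
<!-- Relative link: resolves against the current directory, so from
     /blog/post1/ this points at /blog/post1/about.html - probably wrong -->
<a href="about.html">About</a>

<!-- Root-relative link: always resolves from the domain root -->
<a href="/about.html">About</a>

<!-- Absolute link: no ambiguity at all -->
<a href="https://example.com/about.html">About</a>
```

The first form is also how doubled-up paths appear: a link written as blog/post1/about.html resolves, from /blog/post1/, to /blog/post1/blog/post1/about.html.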
I’ve looked but I cannot find the fault/error!
Not being funny, but go look again, harder/properly. 😀
Some common causes include:
- Invalid code – if you don’t close up your tags correctly, you may generate incorrect URLs.
- Relative URLs – these are a common problem, and may result in part of the current page’s path being prepended to the link (e.g. /dir/page.html turning into /dir/dir/page.html).
- Incorrect Base Href element – this would result in attaching the relative path to the incorrect root.
- Incorrect Redirects – if you have ReWrites or Redirects, make sure they point to correct URLs.
- Incorrect Canonical Link Element – if you have the wrong URL in the href of the CLE, then G is going to get it wrong as well.
- JavaScript – you should either make it External, or wrap it in CDATA (inside the opening/closing script tag, use //<![CDATA[ your JS //]]>)
- Check your Form Action URL – make sure the action attribute points to a real URL, as Googlebot may try submitting forms and crawling the result.
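Several of the items above live in the <head> of your pages. A hypothetical example (example.com and the paths are placeholders) of what “correct” looks like for the base href, the Canonical Link Element and a CDATA-wrapped script:

```html
<head>
  <!-- base href: every relative URL on the page resolves against this;
       get it wrong and ALL relative links break at once -->
  <base href="https://example.com/">

  <!-- Canonical Link Element: tells G which URL is the "real" one;
       a typo here feeds Google a bad URL from every page using it -->
  <link rel="canonical" href="https://example.com/current-page/">

  <!-- Inline JS wrapped in CDATA, so parsers don't mistake string
       fragments for markup or URLs -->
  <script type="text/javascript">
  //<![CDATA[
    var path = "/some/path";
  //]]>
  </script>
</head>
```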
I corrected it/them – Why is Google still trying to crawl it/them?!
I did say that G remembers URLs, didn’t I?
Unless you have set up 301 Redirects, Google is going to try to access the broken URLs for some time before it gets the hint… and GoogleBot isn’t that quick on the uptake at times.
I corrected it/them. Why is Google still showing the Error?!
Because the Errors in Google Search Console hang around for some time, up to 4 weeks in fact.
Only after an extended time period (so long as it hasn’t re-encountered the errors) will the error disappear from GSC.
If the error(s) reoccur, the date/time will update, and you’ll have to wait another month (approx.)
No 404s – but the URLs are wrong/don’t exist!
This happens in some rare cases.
Basically, rather than giving a correct 404 response for a URL that doesn’t exist, people (and bots) are presented with:
- a page with content and a 200 response
- a page with content and a 200 response, but often with the CSS missing, images not working, etc.
- a temporary redirect response to some other page, such as the homepage
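You can check what your own server returns for a nonsense URL from the command line (the domain and path below are placeholders):

```sh
# -I fetches only the response headers; -L follows any redirects.
# A correctly configured server should answer "404 Not Found" here.
curl -I -L https://example.com/this-page-definitely-does-not-exist
```

If you see a 200 OK, or a 302 pointing at your homepage, you have one of the problems described below.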
This can happen for several reasons:
In some cases, the servers are set up so as to not give a proper 404. This is called a “Soft 404”, and should be changed asap.
Serving a 302 to some other page, such as the homepage, isn’t overly smart either, and can cause issues like duplication; and in the case of robots.txt files, if the URL is not found and Googlebot is shown HTML instead, the bot may refuse to crawl the site properly.
In other cases it’s due to scripts accepting any old parameters, correct or false, and loading up the same content regardless. This is actually still a very common issue, even in 2024.
In such cases, coding a proper 404 may be hard work, especially if you don’t do programming. With that said, a Canonical Link Element will really help here!
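If your pages are script-generated, the fix is to validate the parameter before serving content. A minimal sketch, assuming a Node/Express setup – your stack, routes and content lookup will differ, and the same idea applies in PHP, Python or anything else:

```javascript
const express = require('express');
const app = express();

// Hypothetical content lookup - replace with however your site finds pages
const pages = { about: '<h1>About us</h1>', contact: '<h1>Contact</h1>' };

app.get('/page/:slug', (req, res) => {
  const body = pages[req.params.slug];
  if (!body) {
    // Unknown parameter: send a real 404, not the same content with a 200
    return res.status(404).send('<h1>404 - Page not found</h1>');
  }
  res.send(body);
});

app.listen(3000);
```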
For others it could be bad/dodgy ReWrites/Redirects taking partial URLs and tacking them onto links incorrectly. Always be sure to double-check things when you create Redirects/ReWrites.
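A classic Apache example of exactly that: a rewrite target missing its leading slash. A hedged sketch (the URLs are made up):

```apache
RewriteEngine On

# BAD: no leading slash on the target. In .htaccess context, depending on
# RewriteBase, this can redirect visitors to mangled URLs that include the
# filesystem path, e.g. http://example.com/home/user/public_html/new-page
RewriteRule ^old-page$ new-page [R=301,L]

# GOOD: an explicit root-relative target leaves nothing to guesswork
RewriteRule ^old-page$ /new-page [R=301,L]
```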
Similar effects may be achieved with incorrect relative URLs or incorrect base href elements being used.
The solution here is to check, as described above: consult the server logs for the bad requests and look at the referrers, and/or check Search Console and/or Screaming Frog, to see if you can find out where or what links to those Bad URLs.
Follow me on Twitter/X for more solid SEO advice or contact us today to arrange a free consultation.