10 Sneaky Types of Duplicate Content
(and How to Stay Safe from Each)
Duplicate content is a big topic in the SEO space. When we hear about duplication issues, it's mostly in the context of Google penalties. But this threat is not only exaggerated; it's also hardly the gravest consequence of the issue.
There’s no duplicate content penalty as such.
Google hardly ever penalizes sites for duplicate content per se. Back in 2013, Matt Cutts pointed out that around 25 to 30% of all the web's content was duplicate. There is simply no way to treat it all as spam, because duplication happens all the time: on terms and copyright policy pages, when excerpts are republished for link building or advertising, in annotations and quotes. Google is definitely not going to treat all of that as spam.
"We don't have a duplicate content penalty. It's not that we would demote a site for having a lot of duplicate content."
Google is able to tell the difference between pages that share some copied content, as long as each has other unique pieces that add value to users. Thus, two pages with partially copied content have equal chances to rank and show up if they're relevant to a searcher's query.
Still, duplicate content issues may impact your rankings negatively. There is no duplicate content penalty as such, but rather a filter for pages with duplication.
The problem is that the search engine needs to figure out which of the duplicates should rank on the SERP, while the rest of the identical or similar pages get hidden or demoted.
Since there is no duplicate content penalty as such, you needn't worry much, unless you engage in outright content fraud (for which, sooner or later, Google will surely hand out a manual action).
However, duplicate content is still bad for your SEO overall. The three far more likely problems caused by duplicate pages are the following:
- Wasted crawl budget. If content duplication appears internally on your site, it's guaranteed to send some of your crawl budget (aka the number of your pages search engines crawl per unit of time) to waste. This means that the important pages on your site are going to be crawled less frequently.
- Link juice dilution. For both external and internal content duplication, link juice dilution is one of the biggest SEO downsides. Over time, both URLs may build up backlinks pointing to them, and unless one of them has a canonical link (or a 301 redirect) pointing to the original piece, the valuable links that would have helped the original page rank higher get distributed between both URLs.
- Only one of the pages ranking for target keywords. When Google finds duplicate content or copied content instances, it will typically show only one of them in response to search queries — and there's no guarantee it's going to be the one you want to rank.
But all of these cases are preventable if you know where duplicate content may hide, how to detect it, and how to deal with it. In this article, I'm going to outline what duplicate content is, walk through the 10 common types of content duplication, and cover how to deal with each.
1. Scraped content
Scraped content is basically an unoriginal piece of content on a site that has been copied from another website without permission. As I said earlier, Google may not always be able to tell the difference between the original content and the duplicate, so it's often the site owner's task to be on the lookout for scrapers and know what to do if their content gets stolen.
Alas, this isn't always easy or straightforward. But here's a little trick that I personally use.
If you track how your content gets shared and linked to online (and if you have a blog, you really should) via a social media/Web monitoring app, like Awario, you can kill two birds with one stone here. In your monitoring tool, you would typically use your post's URL and title as keywords in your alert. To also search for scraped versions of your content, all you need to do is add another keyword: an extract from your post. Ideally, it should be fairly long, e.g., a sentence or two. Surround the piece with double quotes to make sure you're searching for an exact match. It's going to look like this:
With this setup, the app is going to look for both mentions of your original article (shares, links, and such) and potential scraped or copied versions found on other sites.
Another quick way to check for duplicates of your content across the web is to run the Copyscape Plagiarism Checker. Just paste in the URL and let the tool search the web for any copies of your article.
If you do find duplicate website content, it's a good idea to first contact the webmaster and kindly request to remove the piece (or put a canonical link to the original if that works for you). If that's not effective, you may want to report the scraper using Google's copyright infringement report.
Bear in mind that reporting copyright infringement is a matter of law, not SEO. That is to say, an infringement report can be filed when the scraper tries to pass off the scraped content as the original, claiming intellectual property over the stolen copyright-protected material.
If you run your website on WordPress and it publishes posts to an RSS feed, make sure you check For each article in a feed show: Summary in your RSS feed settings. This way, you at least reduce scrapers' chances of copying your content automatically.
Noticed a thief once in a while? You can even benefit from it and leave the copy as it is. A good trick is to do really solid internal linking; besides, you can add a self-referencing 'rel=canonical' tag pointing each page to itself. First, internal links are generally good for your site's SEO: your pages become better linked, the site better structured, and the bounce rate drops. Second, automatic scrapers will probably copy all your links along with the text. This way, the republished content can bring you some backlinks and traffic, and, incidentally, it might even be good for your site's authority.
2. Syndicated content
Syndicated content is content republished on a different website with the permission of the original piece's author. While syndication is a legitimate way to get your content in front of a new audience, it's important to set writing guidelines for the publishers you work with to make sure it doesn't turn into an SEO problem.
Ideally, the publisher should use the 'rel=canonical' tag on the article to indicate that your site is the original source of the content, avoiding duplicate content issues. Another option is to use a 'noindex' tag on the syndicated content. It's always best to manually check this whenever a syndicated piece of your content goes live on another site.
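To illustrate, here's what those tags could look like in the `<head>` of the republished copy; the URL below is a hypothetical stand-in for the original article's address:

```html
<!-- In the <head> of the syndicated (republished) copy -->
<!-- The href is a hypothetical example of the original article's URL -->
<link rel="canonical" href="https://www.example.com/blog/original-article/">

<!-- Alternative: keep the syndicated copy out of the search index entirely -->
<meta name="robots" content="noindex, follow">
```

Only one of the two is needed: the canonical tag lets the copy consolidate ranking signals back to your original, while noindex simply hides the copy from search results.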
3. HTTP and HTTPS pages
One of the most common internal duplication problems is having identical content accessible at both HTTP and HTTPS URLs. Usually, these issues arise when the switch to HTTPS isn't implemented with the thorough attention the process requires. The two most common scenarios when this happens are:
1. Part of your site is HTTPS and uses relative URLs. It's often fair to use a single secure page or directory (think login pages and shopping carts) on an otherwise HTTP site. However, it's important to keep in mind that these pages may have internal links pointing to relative URLs rather than absolute URLs:
- Absolute URL: https://www.link-assistant.com/rank-tracker/
- Relative URL: /rank-tracker/
Relative URLs don't contain protocol information; instead, they use the same protocol as the parent page they are found on. If a search bot finds an internal link like this and decides to follow it, it'd go to an HTTPS URL. It could then continue the crawling by following more relative internal links, and may even crawl the entire website in the secure format, and thus index two completely identical versions of your site's pages. In this scenario, you'd want to use absolute URLs instead of relative URLs in internal links. If there already are duplicate HTTP and HTTPS pages on your site, permanently redirecting the secure pages to the correct HTTP versions is the best solution.
2. You've switched your entire site to HTTPS, but its HTTP version is still accessible. This can happen if there are backlinks from other sites pointing to HTTP pages, or because some of the internal links on your site still contain the old protocol, and the non-secure pages do not redirect visitors to the secure ones. To avoid dilution of link equity and wasting your crawl budget, use the 301 redirect on all your HTTP pages, and make sure that all internal links on your site are specified via relative URLs.
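For the second scenario, the redirect rule itself is short. Here's a minimal sketch for an Apache server using mod_rewrite (the exact setup depends on your hosting; nginx and other servers use different syntax):

```apache
# .htaccess sketch: send every HTTP request to its HTTPS
# equivalent with a permanent (301) redirect
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
```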
You can quickly check if your site has an HTTP/HTTPS duplication problem in SEO PowerSuite's WebSite Auditor. All you need to do is create a project for your website; when the app is done crawling, click on Issues with HTTP/HTTPS site versions in your site audit to see where you stand.
4. WWW and non-WWW pages
One of the oldest causes of duplicate content in the book is when both the WWW and non-WWW versions of the site's domain are accessible. Like with HTTPS causing internal content duplication, this can generally be fixed by implementing 301 redirects.
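Here's a sketch of such a redirect for Apache, using example.com as a stand-in domain and treating the non-WWW version as canonical (swap the condition and target if you prefer the WWW version):

```apache
# .htaccess sketch: 301-redirect www URLs to the non-www version
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [L,R=301]
```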
To check if there are instances of such duplication on your site, look at Fixed www and non-www versions (under Redirects) in your WebSite Auditor project.
5. Dynamically generated URL parameters
Dynamically generated parameters are often used to store certain information about the users (such as session IDs), or to display a slightly different version of the same page (such as one with sorting or filtering adjustments made). This results in creating an alternate version of the same URL that will look like this:
- URL 1: /rank-tracker.html?newuser=true
- URL 2: /rank-tracker.html?order=desc
The problem is not just that parameters create unfriendly, non-descriptive URLs. While these pages will typically contain the same (or highly similar) content, both are fair game for Google to crawl. Often, dynamic parameters create not two, but dozens of versions of the same URL, which can result in massive amounts of crawl budget spent in vain.
To check for duplicate URLs on your site, go to your WebSite Auditor project and click Rebuild Project. At Step 1, check the Enable expert options box. At the next step, select Googlebot in the Follow robots.txt instructions for… option.
Then, switch to the URL Parameters tab and uncheck the Ignore URL parameters box.
This setup will let you crawl your site like Google would (following robots.txt instructions for Googlebot) and treat URLs with unique parameters as separate pages. Click Next and proceed with the next steps like usual for the crawl to start. When the WebSite Auditor is done crawling, switch to the Pages dashboard and sort the results by the Page column by clicking on its header. This should let you easily spot duplicate pages or copied content with parameters in the URL.
If you do find such SEO-unfriendly URLs on your site, make sure to use the Parameter Handling Tool in Google Search Console. This way, you will be telling Google which of the parameters need to be ignored during crawls.
6. Mobile-friendly URLs
Duplicate content issues may occur when you create a mobile-friendly version of a desktop site. Pages at example.com/page and m.example.com/page will be treated as duplicate URLs. Just as in the example above, you can check for duplicates with WebSite Auditor. If this is your case, indicate the relationship between the pages by adding the 'rel=alternate' tag on the desktop page (and a 'rel=canonical' tag on the mobile version pointing back to it).
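The standard annotation pair for separate mobile URLs looks like this (the URLs are placeholders):

```html
<!-- On the desktop page (https://example.com/page) -->
<link rel="alternate" media="only screen and (max-width: 640px)"
      href="https://m.example.com/page">

<!-- On the mobile page (https://m.example.com/page) -->
<link rel="canonical" href="https://example.com/page">
```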
Better yet, try to prevent this cause of duplicate content from the very beginning. If you are only about to launch a brand-new site, consider implementing responsive design instead: it serves desktop and mobile users from a single URL and thus keeps mobile-related duplicate content to a minimum.
Some websites use AMP (Accelerated Mobile Pages) technology: stripped-down versions of pages that are lighter and load faster from the search results page. AMP also results in duplicate content unless the AMP pages are implemented appropriately: the 'rel=amphtml' tag has to be added to the non-AMP page, and the AMP version has to include a 'rel=canonical' tag pointing to the main content.
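In markup, that pairing could look like this (again, the URLs are hypothetical):

```html
<!-- On the regular (non-AMP) page -->
<link rel="amphtml" href="https://example.com/page/amp/">

<!-- On the AMP version -->
<link rel="canonical" href="https://example.com/page/">
```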
7. Printer-friendly pages
If multiple pages on your site have printer-friendly versions accessible via separate URLs, it will be easy for Google to find and crawl those through internal links. Obviously, the content on a page and its printer-friendly version is going to be identical, thus wasting your crawl budget once again.
If you do offer printer-friendly pages to your site's visitors, it's best to close them from search engine bots via a noindex tag. If they are all stored in a single directory, such as https://www.link-assistant.com/news/print, you can also add a disallow rule for the entire directory in your robots.txt.
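Assuming the /news/print directory from the example above, the robots.txt rule would look like this:

```
# robots.txt: keep crawlers out of the printer-friendly directory
User-agent: *
Disallow: /news/print/
```

Keep in mind that a robots.txt disallow only blocks crawling and doesn't guarantee deindexing of pages that are already indexed or linked to externally; the noindex tag on the pages themselves is the more reliable of the two options.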
8. Duplicate tag pages
Sometimes you may want to use tags on your site to improve its accessibility and usability. Tags are meant to link related articles that share a common topic. Unlike categories, tags are not compulsory; they are just an option for enhancing your site structure. For each tag, a separate page is created that lists all the tagged articles, and sometimes these tag pages end up duplicating each other.
The advice is not to use an excessive number of tags. Stick to a limited set, with each tag covering a fairly large number of articles on your site. If you do find a duplicate tag page, you can disallow it in the robots.txt file or close it with the noindex meta tag.
9. Similar content
When people talk about content duplication, they usually imply completely identical content. However, pieces of very similar content also fall under Google's definition of near-duplicate content:
"If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city."
Such issues can frequently occur with e-commerce sites, with product descriptions for similar products that only differ in a few specs. To tackle this and avoid trouble with search engine rankings, try to make your product pages diverse in all areas apart from the description: user reviews are a great way to achieve this.
Thin pages and duplicate content
On blogs, duplicate content issues may arise when you take an older piece of content, add some updates, and rework its text into a new post. In this case, using a canonical link (or a 301 redirect) on the older article is the best solution.
For example, we once had a technical issue that caused us to lose a 301 redirect. That quickly led to an abrupt drop in rankings for the target keywords of the updated version of the text. The two copies competed for roughly the same set of keywords, Google saw them as duplicates, and in the end both pages dropped on the SERP even lower than the first text had ranked before the update.
Talk of duplicate content as a penalty factor appeared after the Panda algorithm rolled out in 2011. Panda was intended to reward sites offering greater value to users, and it hit hard not only spammy sites. The update made site owners seriously concerned about low-quality content, such as thin pages, copied content, and duplicate title tags and meta descriptions.
To check for such duplicate content across your site, you can run WebSite Auditor as well: a quick site-wide SEO audit will uncover duplicate title tags and meta descriptions.
On top of that, the Content Analysis module lets you carry out an individual page audit for writing a better copy of it and then republishing (or creating a totally new page from scratch).
In a purely SEO-focused world, one of the old-school practices was article spinning with the help of automated tools. Spinners automatically modify articles by replacing specific words with alternates. Although modern AI is quite capable of writing articles of passable quality that even get past human moderation, this is something you really shouldn't do.
First of all, Google may still detect spun articles as near-duplicate content. Worse still, content that has been massively rewritten over and over brings little to no value to users, which means low traffic, low authority, and, eventually, a site at a standstill.
10. Localized pages
If you have identical content on several domain names under one TLD (top-level domain), for example .com, Google will usually detect it as duplicate content. But when you have several near-duplicate pages across multiple country TLDs, they will not be classified as spam: Google treats them as localized versions meant to be shown in particular locations, and the search engine will try to figure out which variant matches a particular searcher. In this case, it's recommended to add genuine localization features; for example, e-commerce stores may show different currency units or spelling variations for different countries.
You can implement localization with the help of HTML tags, adding an hreflang attribute on localized pages. To add the hreflang tags quickly, use the Sitemap generation in WebSite Auditor. Go to Webmaster Tools and select Create Sitemap. Then pick the pages you want to include in your Sitemap and add localization to each appropriate page, defining the language and the country. Finally, download your Sitemap and submit it for crawling.
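For reference, a sitemap with hreflang annotations follows the format below (the domains are hypothetical; note the extra xhtml namespace on the urlset element):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/page/</loc>
    <!-- Each localized variant lists all alternates, including itself -->
    <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/page/"/>
    <xhtml:link rel="alternate" hreflang="de-de" href="https://example.de/page/"/>
  </url>
</urlset>
```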
Duplicate content can be a pain for anyone who works with SEO: it dilutes your pages' link juice (aka ranking power) and drains crawl budget, preventing new pages from getting crawled and indexed. Remember that your best tools for combating the problem are canonical tags, 301 redirects, and robots.txt, and incorporate duplicate content checks into your site auditing routine to improve indexation and rankings.
What are the instances of duplicate content you've seen on your own site, and which techniques do you use to avoid duplicate content issues? Please share your thoughts and questions in the comments below.