Dealing with duplicate content coming from faceted search
If your website uses faceted search, you've probably come across this issue: this type of search produces a huge number of URLs, due to the combination of different parameters. And very often those pages will display the same content under different URLs: in Google's eyes, this is duplicate or near-duplicate content, which is bad for your ranking! There are multiple ways to deal with this issue: URL rewriting, disallow, noindex or canonical.
Faceted search creates user-generated values and millions of different URLs: they come from the combination of different parameters, but sometimes also from the order of selection. For example, if a customer selects the color and then the gender, you can get ?color=blue&gender=woman. If they select the gender and then the color, you can get ?gender=woman&color=blue. Two different URLs, both crawled by robots, that display the exact same content. Imagine the number of URLs this can represent: Google will spend time crawling those URLs instead of the rest of your site, and will probably penalize you for duplicate content. Something you surely do not want. Sure, if you have the time and money, you can develop an optimized faceted search, just like Zalando did for example:
Each new parameter is added with “_” and two parameters of the same type are separated with “.”. But if you have neither the time nor the money, here are some ways to deal with duplicate content coming from faceted search.
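As an aside, if you control how filter links are generated, the ordering problem described above can be neutralized at the source by always emitting query parameters in a fixed (here alphabetical) order. A minimal Python sketch, where the function name `normalize` is just an illustration:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    # Sort query parameters alphabetically so the order in which the
    # user selected the facets no longer produces distinct URLs.
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))

a = normalize("http://example.com/shoes?color=blue&gender=woman")
b = normalize("http://example.com/shoes?gender=woman&color=blue")
print(a == b)  # True
```

Both selection orders now collapse to a single crawlable URL, which removes one whole class of duplicates before any robots.txt or canonical work is needed.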
Rewrite the URLs
Some pages coming from faceted search can be useful and relevant both for users and for your SEO: you want them to appear in the SERPs. The aim is therefore to select the category pages with the most value for users and for indexation. Those can be the filters your users select most often: “Boots”, “Leather Boots”, “Pink Sneakers”… Then rewrite the URLs:
RewriteEngine on
# RewriteRule cannot match the query string; test it with RewriteCond
RewriteCond %{QUERY_STRING} ^type=boots&mat=leather$
# The trailing "?" drops the original query string from the target
RewriteRule ^women-shoes/all$ http://example.com/women-shoes/boots/leather? [R=301,L]
This also requires setting up a 301 redirection, so that the old parameterized URL permanently points to the rewritten one. To create truly unique content, each filter combination promoted to a category should also get a unique <title>, a unique <h1> and some quality text content: this makes them SEO landing page material.
Disallow in the robots.txt
You can keep robots from crawling those pages with a rule in the robots.txt:
User-agent: *
Disallow: /*?p=
Advantage: you will no longer waste crawl budget on this duplicate content. The URLs can still end up indexed (for example if they are linked externally), but they won't rank well. However, those pages were not selected for URL rewriting, which probably means there are no queries for them anyway.
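Note that the `*` wildcard in a pattern like the one above is an extension honored by Google's crawler, not part of the original robots.txt specification. A rough Python sketch of how such a pattern matches a URL path — the helper `blocked_by` is a hypothetical illustration, not a real library call, and it simplifies the full matching rules:

```python
import re

def blocked_by(pattern: str, path: str) -> bool:
    # Rough sketch of Google-style robots.txt pattern matching:
    # '*' matches any sequence of characters, '$' anchors the end.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

print(blocked_by("/*?p=", "/women-shoes/all?p=2"))        # True
print(blocked_by("/*?p=", "/women-shoes/boots/leather"))  # False
```

This kind of quick check is handy for verifying that a new Disallow rule catches the faceted URLs you intend to block and nothing else.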
Noindex in <head> or robots.txt
<meta name="robots" content="noindex,follow" />
Noindex: /*?p=
With this method, pages with low value won't appear in the SERPs anymore, but you will still waste crawl budget on them. Be aware that the Noindex directive in robots.txt was never officially documented, and Google announced in 2019 that it no longer honors it; the meta tag is the reliable option.
Set up a canonical URL
Another option for dealing with duplicate content is to use the rel=canonical tag. The rel=canonical tag passes roughly the same amount of link juice (ranking power) as a 301 redirect.
<link rel="canonical" href="http://www.example.com/canonical-version/" />
This way, you tell Google that the page is like a copy of the URL www.example.com/canonical-version and that it should credit links and content metrics to the provided URL. This approach solves duplicate indexation, not duplicate crawling: Google still has to crawl each faceted URL to see the tag.
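For faceted pages whose canonical target is simply the unfiltered category page, the tag can be generated by stripping the query string. A minimal sketch, assuming that policy — the helper `canonical_link` is an illustration, not a standard function:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_link(url: str) -> str:
    # Drop the faceted-search query string entirely, so every filter
    # combination declares the same canonical category page.
    parts = urlsplit(url)
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return '<link rel="canonical" href="%s" />' % canonical

print(canonical_link("http://www.example.com/canonical-version/?color=blue&gender=woman"))
# <link rel="canonical" href="http://www.example.com/canonical-version/" />
```

If some filtered pages are valuable enough to rank on their own (the rewritten category pages from the first section), they should instead point to themselves rather than to the bare category URL.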
For further information about canonical, read the Google Support page.