TF-IDF is something we've been hearing about for quite a while. Google has long been using it for information retrieval alongside other metrics.
SEOs have also seen its potential. They started to use this metric instead of keyword density to evaluate content optimization, as it helped reduce the influence of function words.
However, I won't talk about this particular use of TF-IDF, especially since Google's John Mueller has recently emphasized that this optimization strategy is of no use today. Instead, I'd like to demonstrate how TF-IDF helps optimize a page for a topic.
Let's start from the beginning, though.
What is TF-IDF (and what do search engines have to do with it)?
TF-IDF (term frequency — inverse document frequency) is a statistical measure usually used in information retrieval and text mining to evaluate how important a term is to a particular document in a collection of documents. It has a long history in different research fields, such as linguistics and information architecture, due to its ability to facilitate the analysis of massive sets of documents in a short amount of time.
Search engines often use different variants of the TF-IDF algorithm as a part of their ranking mechanism. By giving documents a relevance score, they manage to deliver "rubbish-free" search results in milliseconds.
For example, TF-IDF has long been a part of Google's ranking mechanism. Google uses TF-IDF to determine which terms are topically relevant (or irrelevant) by analyzing how often a term appears on a page (term frequency — TF) and how often it's expected to appear on an average page, based on a larger set of documents (inverse document frequency — IDF).
To determine how relevant a given page is, Google analyzes the pages in its index against a number of specific features it considers relevant to the query.
Since most online content is text, these features, most probably, are the presence or absence of certain terms and phrases on the page. And not only their presence, but their prominence on this page as opposed to other pages across the web.
This is where the TF-IDF algorithm might come in handy. It compares how often a term is used on a page with how often it is used across the whole web, and it discounts stop words so that meaningful terms gain even greater prominence.
Let's see how the TF-IDF formula works.
The mechanics of TF-IDF
By now you've noticed that the name combines two terms. While term frequency is more or less clear, what is that mysterious inverse document frequency?
TF-IDF is calculated by multiplying the two components: TF-IDF = TF × IDF.
Don't worry, you do not have to calculate everything yourself; there are tools to do that for you. However, before using any tool, you should understand that a TF-IDF value is not just a crafty form of keyword density. Here's how it works:
- Term Frequency (TF)
At first glance, the metric is clear: how frequently a term appears in a document. It's calculated according to the following formula (and don't worry, I will do the math for you):
For example, if you have a page of 1,000 words where your keyword appears 10 times, its term frequency will be 4.32/9.97=0.43 (if you use log base 2 in the formula).
If you make your keyword appear twice as often in the same document, its term frequency won't change much: it will be 5.32/9.97=0.53 (log base 2 again).
Term frequency reflects whether you are using a particular keyword too often or too rarely. However, on its own, it's pretty useless because you need to measure a term's importance, not just the frequency of its use. Otherwise, function words would rule the search. To prevent that, we need IDF.
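The two figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the formula behind them is TF = (1 + log2(keyword count)) / log2(total words). That form is inferred from the arithmetic in the example and may not be the exact formula any particular tool uses. The function name `term_frequency` is illustrative.

```python
import math


def term_frequency(keyword_count: int, total_words: int) -> float:
    """Log-scaled term frequency (assumed form, inferred from the example)."""
    if keyword_count <= 0:
        return 0.0  # a term that never appears has no frequency
    # total_words must be greater than 1, otherwise log2(total_words) is 0
    return (1 + math.log2(keyword_count)) / math.log2(total_words)


print(round(term_frequency(10, 1000), 2))  # 0.43
print(round(term_frequency(20, 1000), 2))  # 0.53
```

Doubling the keyword count adds exactly 1 to the numerator, which is why the score moves only from 0.43 to 0.53 instead of doubling.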
- Inverse Document Frequency (IDF)
This metric helps understand the real value of a particular keyword. It measures the ratio of the total number of documents in a set to the number of documents that actually contain this keyword. The formula goes like this:
If the keyword is a common word, it will most probably be used in a large number of documents. As a result, its IDF value will be small, and multiplying TF by it will add little to the score. And vice versa, if the term is found only in a few documents, its IDF value will be much larger, resulting in a larger TF-IDF score.
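To see the effect of IDF on a small scale, here is a sketch over a made-up three-document collection. It assumes the common unsmoothed variant IDF = log2(total documents / documents containing the term); real tools may use a smoothed variant. The corpus and the function name `inverse_document_frequency` are hypothetical.

```python
import math


def inverse_document_frequency(term: str, documents: list[list[str]]) -> float:
    """IDF = log2(N / df), the common unsmoothed variant (an assumption)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # term absent from the collection; avoid division by zero
    return math.log2(len(documents) / containing)


# Toy collection of tokenized documents (hypothetical data).
docs = [
    "the coffee machine makes the best espresso".split(),
    "the best way to clean the kitchen".split(),
    "the espresso grinder and the tamper".split(),
]

idf_the = inverse_document_frequency("the", docs)          # in 3 of 3 documents
idf_grinder = inverse_document_frequency("grinder", docs)  # in 1 of 3 documents

print(idf_the)                # 0.0
print(round(idf_grinder, 3))  # 1.585
```

Because "the" appears in every document, it receives the lowest possible weight under this variant, while "grinder", found in only one document, is scaled up.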
So you see, unlike keyword density that only reflects how stuffed your text is with a particular keyword, TF-IDF comes as a more advanced and sophisticated metric that reflects the importance of a given keyword to a given page. It scales down the prominence of unimportant words and phrases, while rare, meaningful terms are scaled up in importance.
With this in mind, let's check out what TF-IDF has to do with SEO.
How to use TF-IDF tools for SEO
TF-IDF is a secret weapon when you need to increase the relevance of your pages in semantic search. How? It helps you look beyond exact keywords and into the content to ensure it's relevant to the topic being searched.
As I've mentioned before, it's crazy to try to calculate TF-IDF for your pages yourself — use tools that can do that effortlessly. With most TF-IDF tools, you can analyze top-ranking search results for your own keywords and see which terms and phrases most of them use and how well your pages perform for them.
As a result, you will have a list of topically relevant keywords that will let you:
- Optimize your content for whole topics, not for single words;
- Spot gaps in the current content;
- Create new content that will rank higher and faster.
If you wonder how to incorporate TF-IDF in your SEO strategy, first of all, try it with the pages where TF-IDF will get you the most benefit:
- High-potential content that can't get off the 2nd page: content that has been on your site for a while, is well optimized, and has gained a good amount of authority. TF-IDF optimization is a great way to push such content to the first page.
- High-ranking content that is slowly losing positions: Google's algorithm is ever-changing, which influences how SERPs look every day. TF-IDF will help such pages to stay relevant and maintain their rankings.
- Product pages that do not rank high: if your product pages are struggling to rank for money terms, then TF-IDF can help identify critical content missing from these pages.
How to optimize content with TF-IDF tools
Follow this guide to turn TF-IDF into an essential part of your content development strategy.
- Optimize pages for topical relevance.
If you plan to optimize the content of existing pages, try the TF-IDF tool in WebSite Auditor, which has the original TF-IDF formula at its core. Keep in mind that it's not actual reverse-engineering of Google's ranking mechanism. While Google looks at all the pages existing online, the TF-IDF tool discovers terms associated with your target keywords by looking at your top 10 competitors.
Even though the tool doesn't take into account the whole set of documents on the web, it still can help reverse-engineer competitors' content strategies by giving you a quick idea of topics missing from your content.
Follow the TF-IDF workflow below or watch the video tutorial on how to use the TF-IDF tool in WebSite Auditor to properly visualize the whole process.
1. Get a list of terms.
Create a project for your site or open an existing one. Go to Content Analysis > TF-IDF, add or select a page you'd like to analyze, and enter a target keyword.
Once the analysis is complete, you get the list of topically relevant terms sorted by the number of competitor pages that use them. You can also choose between the dashboards of multi- and single-word keywords.
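The idea of ranking terms by how many competitor pages use them can be sketched as follows. This is not WebSite Auditor's implementation: the page texts and the helper `terms_by_page_count` are hypothetical, and the tokenization is deliberately naive.

```python
import re
from collections import Counter


def terms_by_page_count(pages: list[str]) -> list[tuple[str, int]]:
    """Return (term, number of pages containing it), most widespread first."""
    counts = Counter()
    for text in pages:
        # A set makes each page count a term at most once.
        counts.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    # Sort by page count (descending), then alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical competitor page texts.
competitor_pages = [
    "Espresso machine reviews and grinder tips",
    "Best espresso machine for home",
    "Espresso grinder buying guide",
]

ranking = terms_by_page_count(competitor_pages)
print(ranking[:3])  # [('espresso', 3), ('grinder', 2), ('machine', 2)]
```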
2. Analyze the list.
First, narrow down the list using your common sense, i.e. eliminate non-related terms (e.g., brand names of your competitors will be useless unless you do something like a product comparison).
Afterwards, pay close attention to the Recommendation column. It gives usage advice for each term that appears on the pages of 5 or more competitors:
- Add — if you do not use this term at all;
- Use more — if the term's TF-IDF on your page is below the competitors' lowest value;
- Use less — if the term's TF-IDF is above the competitors' highest value.
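The three rules above can be written down as a small decision function. This is a sketch of the logic as described in this list, not the tool's actual code; the function name `recommend`, the sample values, and the choice to return `None` when no advice applies are assumptions.

```python
from typing import Optional

MIN_COMPETITORS = 5  # threshold taken from the text above


def recommend(own_tfidf: float, competitor_tfidfs: list[float]) -> Optional[str]:
    """Usage advice for one term.

    competitor_tfidfs holds the term's TF-IDF on each competitor page that
    uses it; own_tfidf is 0.0 when the term is absent from your page.
    """
    if len(competitor_tfidfs) < MIN_COMPETITORS:
        return None  # too few competitors use the term to give advice
    if own_tfidf == 0.0:
        return "Add"
    if own_tfidf < min(competitor_tfidfs):
        return "Use more"
    if own_tfidf > max(competitor_tfidfs):
        return "Use less"
    return None  # already within the competitors' range (assumed behavior)


competitors = [0.8, 1.1, 1.4, 0.9, 1.2]  # hypothetical TF-IDF values
print(recommend(0.0, competitors))  # Add
print(recommend(0.5, competitors))  # Use more
print(recommend(2.0, competitors))  # Use less
print(recommend(1.0, competitors))  # None
```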
Such recommendations as Add and Use more can be indicators of a missing topic on your site. However, it does not always mean you need to write a brand-new page to address it. It can be a paragraph covering small details you initially missed.
Still, do not take these recommendations as gospel. The tool does its work and gives you the best terms and usage advice, but that advice comes from an algorithm. If you see that those terms do not read naturally and do not add any value to your content, use your better judgment and don't overstuff.
3. Compare to your competitors.
Alongside the list of terms, the tool builds a chart where you can compare your page's TF-IDF values to those of your competitors.
4. Optimize your content.
Now you see which topics you are missing and which you are not covering in enough depth. Use this information along with the usage recommendations to refine your content and make it more relevant.
You can do it right in WebSite Auditor's Content Editor module where you can edit your pages in a WYSIWYG editor or in HTML. Do not forget that your goal is not to overuse keywords but to naturally add missing parts.
Once done, save the changes to your hard drive so you can apply them to your site. After some time, run the TF-IDF analysis once again to see the effect of your optimization.
- Run TF-IDF keyword research.
If you need to create brand-new content, then TF-IDF should go hand in hand with your keyword research. Why? While you can find millions of keyword ideas with different keyword research tools, competitive TF-IDF analysis reveals terms semantically related to your keywords. The top-ranking pages do not necessarily rank for them, but these terms are needed to cover search intent, which is becoming so important in the age of semantic search.
Rank Tracker has the Competition TF-IDF Explorer tool that uncovers competitors' most meaningful keywords on the basis of TF-IDF content analysis.
In your project, go to Keyword Research > Domain Research, select Competition TF-IDF Explorer, enter your target keywords, and start the search.
Analyze the keyword list for plausible terms and topics, filter them by their weight (TF-IDF Avg) and such important metrics as Number of Searches, Competition, Keyword Difficulty, etc. to find the best candidates for your keyword short-list.
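If you export the keyword list, the same filtering can be sketched in code. Everything here is hypothetical: the `Keyword` fields merely mirror the column names mentioned above, the thresholds are arbitrary, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    term: str
    tfidf_avg: float   # average TF-IDF across competitors
    searches: int      # number of searches
    difficulty: float  # keyword difficulty (0-100 scale is an assumption)


def shortlist(keywords, min_tfidf=0.5, min_searches=100, max_difficulty=50.0):
    """Keep keywords that pass all thresholds, heaviest TF-IDF first."""
    picked = [
        k for k in keywords
        if k.tfidf_avg >= min_tfidf
        and k.searches >= min_searches
        and k.difficulty <= max_difficulty
    ]
    return sorted(picked, key=lambda k: k.tfidf_avg, reverse=True)


# Made-up sample rows.
candidates = [
    Keyword("espresso grinder", 1.2, 900, 35.0),
    Keyword("coffee", 0.3, 50000, 80.0),
    Keyword("tamper size", 0.9, 40, 10.0),
    Keyword("burr grinder", 0.7, 400, 42.0),
]

print([k.term for k in shortlist(candidates)])
# ['espresso grinder', 'burr grinder']
```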
That's pretty much it. Not that complicated after all, is it? :)
To recap, the TF-IDF optimization process should look like this:
- Discover the keywords of your top-ranking competitors;
- Compare them to your content (or your keyword list) and identify soft spots and opportunities;
- Optimize your content;
- Monitor performance of your pages…
…and enjoy being ahead of your competitors!
Clearly, TF-IDF is not just some curious acronym; it's an essential part of a content development strategy. However, try not to treat it as a magic formula that will instantly improve your pages' rankings. Instead, treat it as a way to come a bit closer to how machines see your pages, and then apply this knowledge to tweak and improve your content.
By: Valerie Niechai