TF-IDF: the SEO Secret Ingredient

What if I told you there was a decades-old equation, dating back to the early days of computerised search, that still shapes how search engines like Google weigh keywords? What if I then told you that it could let you choose and deploy keywords with a precision your competitors won’t be able to match? You might think I’m mad, but you’d be wrong. Today, we’re talking about TF-IDF.

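TF-IDF has two parents. Term-frequency weighting goes back to Hans Peter Luhn’s work at IBM in 1957, and British computer scientist Karen Spärck Jones added inverse document frequency in 1972. Together they became a staple of search engines and text-analysis tools. TF-IDF is a combination of two numbers: term frequency (TF), and inverse document frequency (IDF). In its textbook form, it looks like this: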

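$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \log \frac{N}{|\{d' \in D : t \in d'\}|}$$

Which, if you’re not a maths geek, is pretty intimidating. Here t is a term, d is a document, D is the collection of documents being compared, and N is the number of documents in that collection. We’re going to break down the two bits you really need to understand: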

Term frequency is pretty straightforward: it’s how often a particular word occurs in a single document, measured against the length of the document. So, if we use the term ‘optimization’ 20 times in a 10,000-word document, it has the same TF as if we used it once in a 500-word blog post (20 ÷ 10,000 and 1 ÷ 500 both come to 0.002).
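If you’d like to see that arithmetic run, here is a minimal Python sketch. The `term_frequency` helper and the split-on-whitespace word counting are my own simplifications for illustration, not how any particular SEO tool does it.

```python
def term_frequency(term: str, text: str) -> float:
    """Share of the words in `text` that are `term` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Synthetic stand-ins for the two documents in the example:
# 20 uses in 10,000 words versus 1 use in 500 words.
long_doc = " ".join(["optimization"] * 20 + ["filler"] * 9980)
short_post = " ".join(["optimization"] * 1 + ["filler"] * 499)

tf_long = term_frequency("optimization", long_doc)
tf_short = term_frequency("optimization", short_post)
print(tf_long, tf_short)  # both are 0.002
```

Real tools usually strip punctuation and may group word variants together before counting, so expect their numbers to differ a little from a naive count like this.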

The problem that arises from using TF alone is that English has a lot of common words that aren’t useful for distinguishing what a document is about. Words like ‘the’, ‘a’ and ‘is’ have extremely high TF, but saying they’re the most important words in a given document doesn’t really help us analyse it. That’s where IDF comes in.

Inverse document frequency is a measure of how rare a word is across a whole collection of documents: the more documents it turns up in, the lower its IDF. You’d be hard-pressed to find any English-language content that doesn’t use the word ‘the’, so its IDF is very low. Multiplying TF by IDF offsets the word’s raw frequency when weighing its importance to the document.
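Here is the same idea as a rough Python sketch, using the common log(N / documents containing the word) form of IDF. The three-sentence collection is made up, and returning 0.0 for a word that appears nowhere is my own assumption; tools differ on how they smooth that case.

```python
import math


def inverse_document_frequency(term: str, documents: list[str]) -> float:
    """log(N / number of documents containing `term`)."""
    term = term.lower()
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0  # assumption: an unseen word gets no weight at all
    return math.log(len(documents) / containing)


# A toy collection: 'the' is in every document, 'optimization' in only one.
collection = [
    "the quick guide to keyword optimization",
    "the history of the search engine",
    "the best pizza in the city",
]

idf_the = inverse_document_frequency("the", collection)
idf_optimization = inverse_document_frequency("optimization", collection)
print(idf_the, idf_optimization)
```

Because ‘the’ appears in all three documents, its IDF is log(3/3), which is zero, so it contributes nothing however often it is used. ‘optimization’ appears in one document out of three and gets log(3), roughly 1.1.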

I’m not going to deep-dive into the maths here: there are free tools online that can calculate it, and you’re going to drive yourself insane if you try to do it all alone. I personally like Text Tools, though feel free to shop around and see what works for you.

Why Should I Care?

Because TF-IDF-style term weighting is one of the ways search engines like Google figure out what your content is about. Understanding TF-IDF is the key to understanding how keywords actually work under the hood. Google Analytics doesn’t explicitly allow you to search by TF-IDF, but other SEO and data tools do.

TF-IDF is a data source that allows you to do a lot of fine-tuning of keywords that you can’t do otherwise. It also allows you to find keywords that other tools can’t, because it scans whole documents and whole collections of documents (not just the immediate context) and uncovers semantically linked common terms that the scanned articles weren’t ranking for. There’s a whole stratum of keyword potential not being mined, because it’s invisible to tools that aren’t using TF-IDF.
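To make that kind of gap-hunting concrete, here is a small Python sketch under some loud assumptions: the pages are invented one-liners, `shared_top_terms` is a hypothetical helper of mine, and ‘common terms’ is taken to mean words used by at least two of the ranking pages. It scores those shared words by TF-IDF and lists the ones my own page never mentions.

```python
import math
from collections import Counter


def shared_top_terms(ranking_pages, background, k=3, min_pages=2):
    """Top-k terms by summed TF-IDF, among terms used by >= min_pages ranking pages."""
    ranking = [page.lower().split() for page in ranking_pages]
    collection = ranking + [doc.lower().split() for doc in background]
    # In how many documents (all / ranking only) does each word appear?
    doc_freq = Counter(word for words in collection for word in set(words))
    page_freq = Counter(word for words in ranking for word in set(words))

    scores = Counter()
    for words in ranking:
        for word, count in Counter(words).items():
            if page_freq[word] >= min_pages:
                idf = math.log(len(collection) / doc_freq[word])
                scores[word] += (count / len(words)) * idf
    return [word for word, _ in scores.most_common(k)]


ranking_pages = [
    "espresso grinder burr settings for espresso",
    "espresso grinder cleaning and burr care",
]
background = [
    "a grinder for pepper and salt",
    "cleaning tips for the kitchen",
]
my_page = "my espresso grinder review"

my_words = set(my_page.lower().split())
missing = [t for t in shared_top_terms(ranking_pages, background) if t not in my_words]
print(missing)  # ['burr']
```

Both ranking pages talk about the burr and mine doesn’t, which is exactly the sort of related term a single-keyword tool would never flag.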

Wait, Isn’t This Just Keyword Stuffing?

No. Well, not if you’re doing it right. On some level, keyword stuffing was a blind attempt to game the term weighting in Google’s ranking algorithm, but it has since been absolutely pounded into the ground by a decade of Google updates. ‘Just use the word a lot’ isn’t effective, nor does it require any particularly complex maths. TF-IDF tells you what the natural frequency of a word is, which lets you calibrate how often you use it to stand out without just jamming the word in everywhere you think it might fit. Hmm, you think, ‘optimization’ has an average TF-IDF of 5.25 across the pages that rank for it, which means if I’m closer to 5.25, Google has an easier time understanding what my content is about. If I’m lower than that, I could stand to use the keyword a few more times. If I push too much higher, I’m at risk of getting a penalty. Keyword stuffing was a sledgehammer, and proper TF-IDF analysis and keyword deployment is a scalpel.
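That calibration can be sketched in a few lines of Python. Everything here is illustrative: the pages are invented, `compare_to_benchmark` is a hypothetical helper, and the 25% tolerance is a number I picked for the example, not a threshold Google has published. The scores also come out on a different scale from a figure like 5.25, because every tool scales its TF-IDF differently.

```python
import math


def tf_idf(term: str, doc: str, collection: list[str]) -> float:
    """TF-IDF of `term` in `doc`, with IDF measured over `collection`."""
    term = term.lower()
    words = doc.lower().split()
    containing = sum(1 for d in collection if term in d.lower().split())
    if not words or containing == 0:
        return 0.0
    return (words.count(term) / len(words)) * math.log(len(collection) / containing)


def compare_to_benchmark(term, my_page, ranking_pages, background, tolerance=0.25):
    """Is my page's score below, near, or above the ranking-page average?"""
    collection = ranking_pages + background + [my_page]
    benchmark = sum(tf_idf(term, p, collection) for p in ranking_pages) / len(ranking_pages)
    mine = tf_idf(term, my_page, collection)
    if mine < benchmark * (1 - tolerance):
        return "below"
    if mine > benchmark * (1 + tolerance):
        return "above"
    return "near"


ranking_pages = [
    "keyword optimization tips for optimization beginners",
    "a short optimization checklist for new sites",
]
background = ["pizza recipes for the weekend", "how to train a puppy"]

stuffed = "optimization optimization optimization optimization guide"
balanced = "practical optimization advice today"

verdict_stuffed = compare_to_benchmark("optimization", stuffed, ranking_pages, background)
verdict_balanced = compare_to_benchmark("optimization", balanced, ranking_pages, background)
print(verdict_stuffed, verdict_balanced)  # above near
```

The stuffed page lands well above the benchmark and the balanced one sits near it, which is the sledgehammer-versus-scalpel difference in miniature.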

Remember, Google isn’t looking for the content that uses the keyword the most, it’s looking for the most relevant content. Being able to precisely analyse which sites Google considers relevant lets you understand its underlying logic better, which lets you make smarter and better-informed SEO decisions.

It’s important to note that TF-IDF isn’t the only part of Google’s ranking algorithm, not by a long shot. It’s still in the mix there somewhere, but it’s a very different beast from the one first proposed decades ago. I don’t want to oversell it: it’s not about to turn your whole SEO strategy on its head overnight and give all your pages a 200% traffic increase. What it will give you is an edge, and we all know how much that matters in this business. It’s a subtle tool, but that subtle edge might be what your site needs to finally make it onto the front page of Google.

Okay, I’m Sold! What’s Next?

Well, you clearly care about your website, and getting your name out there. Fortunately, help is just around the corner. My company CodeClouds have several guides that you’ll find valuable—I’d recommend reading these articles discussing the Google SEO 3-pack and How to Avoid Google Penalties For Better SEO. If you’re looking to do more than read, consider hiring our expert web developers and creative designers to take your site to the next level.