Imagine yourself in the following scenario.

You are a successful journalist. Your new article is ready, and you are impatient to publish it. First, however, you have to summarize it in 4-5 keywords, so that it gets a catchy description that makes it more visible to search engines. But why do this by hand? Is there an algorithm that can figure out those keywords automatically, at the press of a button? The answer is yes, and it’s called tf-idf.

More specifically, this algorithm assigns a score (tf-idf), expressed here as a percentage, to every word, indicating how suitable the word is as a keyword. Keep in mind that the algorithm takes a collection of articles (the corpus) as input, not just a single article. The purpose of this article is to explain the procedure as simply as possible, since the mathematical definition of the algorithm can easily be found in the Wikipedia article.

Let’s take as an example an article from “The Guardian” about New Zealand glaciers turning brown from Australian bushfire smoke. Possible keywords for this article are the words team, smoke, photos, health, bushfire and glaciologist. Some people would prefer the word bushfire to the word team, while others would prefer the term glaciologist to the term photos. The algorithm can weigh all those candidates in a straightforward way. First, it calculates the frequency of each term in the article (tf), that is, how many times the term appears in the article relative to the total number of words in it. For instance, out of the 500 words of the article:

  • The words team, photos, health and glaciologist each appear only once, so their percentage is 1/500 = 0.2%
  • The word bushfire appears four times, so 4/500 = 0.8%
  • The word smoke appears eight times, so 8/500 = 1.6%

Next, the percentage increases or decreases according to the number of articles in the corpus in which the term appears (idf).

  • If a term appears in roughly 1/3 of all the articles in the corpus, the percentage stays the same as before. For example, if the word health does so, its tf-idf is again 0.2% for the current article.
  • If a term appears in fewer than 1/3 of the articles in the corpus, the percentage increases accordingly. So, if the word glaciologist is not commonly used in the corpus, its 0.2% will increase.
  • If a term appears in more than 1/3 of the articles in the corpus, the percentage decreases. For example, the percentage of the word smoke might drop below 1.6% if the word is very frequent in the corpus.
  • If a term appears in almost every article of the corpus, the percentage is almost 0.
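
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact formula used by any particular library: it assumes the common variant idf = ln(total articles / articles containing the term), for which the "stays the same" point is 1/e (about 37%) of the corpus, close to the 1/3 used above. The corpus size and the per-word article counts are made up for the example; only the word counts and the 500-word length come from the text.

```python
import math


def term_frequency(count, total_words):
    """Share of the article's words taken up by one term (tf)."""
    return count / total_words


def inverse_document_frequency(docs_with_term, total_docs):
    """Natural-log idf: ln(total_docs / docs_with_term)."""
    return math.log(total_docs / docs_with_term)


def tf_idf(count, total_words, docs_with_term, total_docs):
    """tf multiplied by idf for one term in one article."""
    tf = term_frequency(count, total_words)
    idf = inverse_document_frequency(docs_with_term, total_docs)
    return tf * idf


# Word counts from the example: a 500-word article.
TOTAL_WORDS = 500
counts = {"health": 1, "glaciologist": 1, "bushfire": 4, "smoke": 8}

# Hypothetical corpus of 1000 articles; these article counts are invented.
TOTAL_DOCS = 1000
docs_with_term = {"health": 368, "glaciologist": 5, "bushfire": 100, "smoke": 900}

scores = {
    word: tf_idf(counts[word], TOTAL_WORDS, docs_with_term[word], TOTAL_DOCS)
    for word in counts
}

# Print the candidate keywords, best score first.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    tf = term_frequency(counts[word], TOTAL_WORDS)
    print(f"{word:13s} tf={tf:.2%}  tf-idf={score:.2%}")
```

With these invented article counts, glaciologist ends up well above its 0.2% term frequency because it is rare in the corpus, health stays at about 0.2%, and smoke falls far below its 1.6% because it appears in most articles.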

Last but not least, tf-idf is available in Elasticsearch through its similarity module, and you can learn more about it by listening to the Elasticsearch episode of the Software Engineering Daily podcast.

If you have any comments, do not hesitate to contact me by email 📧.

Special thanks to Christina for the proofreading 😊!