October 26, 2020

The Prism Of The Clouds

When keyword clouds swept over the main blogging platforms a few years ago, the craze they aroused could hardly be explained on rational grounds alone. There is in this ancestor of modern data visualizations (which still use it) something that bypasses the brain to speak directly to the visual cortex. The keyword cloud charms and seduces more than it actually serves.

Why is the keyword cloud so appealing? I will try to offer an explanation after detailing the concept and presenting our own version.

The prism of the clouds

In fact, several technologies are grouped under the term “keyword clouds”. The translation of the English expression “tag cloud” is also unsatisfactory, because “keyword” reflects only one of the meanings of “tag”, which is at once the label, the marker, the caption, and so on, to the point that certain disciplines (such as computer science) more often use the English word, even in French.

The original, “historical channel” keyword cloud is a representation of the keywords present on a site, which presupposes that the site associates keywords with each page or article. In a typical configuration, each article in a blog is associated with a number of keywords added manually by its author. Some recur from one article to another, others are rarer. The keyword cloud makes these recurrences visible instantly and, in principle, lets the reader identify the themes of the blog at a glance.
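To make the mechanism concrete, here is a minimal sketch of how such a cloud could be computed from hand-entered tags. The article titles, the tags, the `tag_weights` helper, and the idea of mapping recurrence to a font size between 10 and 30 points are all assumptions made for illustration; they do not describe any particular blogging platform.

```python
from collections import Counter

# Hypothetical blog: each article carries tags entered by hand by its author.
articles = {
    "Choosing a chart type": ["dataviz", "design"],
    "Tag clouds revisited": ["dataviz", "tag cloud", "semantics"],
    "Notes on typography": ["design"],
    "Counting words": ["semantics", "dataviz"],
}

def tag_weights(articles, min_size=10, max_size=30):
    """Map each tag to a font size that grows linearly with its recurrence."""
    counts = Counter(tag for tags in articles.values() for tag in tags)
    low, high = min(counts.values()), max(counts.values())
    span = max(high - low, 1)  # avoid dividing by zero when all counts are equal
    return {
        tag: min_size + (count - low) * (max_size - min_size) / span
        for tag, count in counts.items()
    }

weights = tag_weights(articles)
for tag, size in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {size:.0f}pt")
```

Note that “tag cloud” survives here as a single two-word tag, precisely because a human entered it as such.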

It is important to note that it is the author of the content who supplies the keywords, and who therefore brings some intelligence, or at least a human eye, to them. There is also an automated version of this cloud that requires no human intervention: the machine scans the content and counts the recurrences.

The man and the machine

The difference is significant because we move from a human, subjective organization to a purely statistical one, as found in lexicographic analysis. In this version, only the number of occurrences of a word counts, regardless of its semantic importance, hence the often very disappointing results of the software that generates these clouds. We will typically observe:

The predominance of “stop words”, those little words like “the”, “to”, and “and”, which carry no meaning by themselves;

The disappearance of phrases (or expressions) composed of several words, such as “tag cloud”, which is split into “tag” and “cloud”;

A certain volume of noise, because the entire site is taken into account, including menus, buttons, and the likes of “Leave a comment”.

The result is usually useless and overloaded, and it conveys little or no information about the content it supposedly “synthesizes”, unless you already have an intuition or knowledge of that content and can read between the lines of the cloud.
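The three symptoms listed above are easy to reproduce. The following is a deliberately naive sketch, not the code of any actual cloud generator: the `page` string is an invented sample that mixes article text with interface residue, and `naive_counts` is a hypothetical helper that does nothing but split on letters and count.

```python
import re
from collections import Counter

# Hypothetical scraped page: article text plus interface noise.
page = (
    "The tag cloud is the ancestor of the dashboard. "
    "The tag cloud shows the themes of the blog. "
    "Leave a comment. Leave a comment."
)

def naive_counts(text):
    """Count every word, with no notion of meaning, phrases, or page layout."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = naive_counts(page)
print(counts.most_common(5))
```

In this sample the stop word “the” comes out on top, “tag cloud” exists only as the separate entries “tag” and “cloud”, and “comment” is counted as if it were part of the article.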

This technology may have inspired as many dreams as the first version of the keyword cloud, which requires significant human work beforehand. Was it hoped that the machine would replace the human? I believe that expectations have, on the whole, been disappointed, but that the hope of instantly understanding and synthesizing a body of text is ambitious enough to keep keyword clouds alive despite the disappointments…

From “lexico” to “termino”

We briefly mentioned lexicographic analysis, describing it as a statistical process. Specifically, lexicographic analysis software generally produces a list of all the words used in a text and associates a number of occurrences with each. This “blind” approach can sometimes reveal interesting elements, but it is primarily meant for an informed analyst, who will be able, through methodology and experience, to interpret these statistical results, for example by observing that an insignificant term like “but” is overrepresented compared to the average. Without this kind of knowledge, it is difficult, and above all risky, to make this kind of keyword cloud “talk”.
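As a rough illustration of what “overrepresented compared to the average” can mean, the sketch below divides the rate of a word in a text by a reference rate. The `REFERENCE_RATE` figures are invented for the example and do not come from a real corpus, and `overrepresentation` is a hypothetical helper, not the method of any specific analysis software.

```python
import re
from collections import Counter

# Hypothetical reference rates (share of all words) for a few function words.
# These figures are invented for illustration, not taken from a real corpus.
REFERENCE_RATE = {"but": 0.005, "the": 0.06, "and": 0.03}

def overrepresentation(text, reference=REFERENCE_RATE):
    """Return observed rate divided by reference rate for each reference word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {}
    return {w: (counts[w] / total) / rate for w, rate in reference.items()}

sample = "We tried, but it failed. We insisted, but the tool resisted, but we learned."
ratios = overrepresentation(sample)
for word, ratio in ratios.items():
    print(f"{word}: {ratio:.1f}x the reference rate")
```

A ratio well above 1 flags a word worth a closer look; deciding what it means is still the analyst's job.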

With a little bit of semantics, you get much more interesting results by performing what is called a terminological extraction. The difference, to put it simply, is that we go from the word to the term. A term can be composed of several words and corresponds, in principle, to a concept. For example, in French, “pomme de terre” (potato, literally “apple of the earth”) is a term whose meaning differs from that of “pomme” (apple) and of “terre” (earth). The software will count not only the occurrences of a term but also co-occurrences, such as those of “apple” and “earth” in our example.
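Counting co-occurrences can be sketched in a few lines. The version below is an assumption about one simple way to do it, counting ordered pairs of words that fit inside a short run of consecutive words; real terminological extraction tools rely on more linguistic knowledge than this. The `cooccurrences` helper, its `span` parameter, and the French sample text are all hypothetical.

```python
import re
from collections import Counter

def cooccurrences(text, span=3):
    """Count ordered word pairs that fit inside a run of `span` consecutive words."""
    words = re.findall(r"\w+", text.lower())
    pairs = Counter()
    for i, first in enumerate(words):
        # Pair each word with the (span - 1) words that follow it.
        for second in words[i + 1 : i + span]:
            pairs[(first, second)] += 1
    return pairs

# Invented French sample, since the "pomme de terre" example only works in French.
text = "La pomme de terre est un légume. La pomme est un fruit. La terre est ronde."
pairs = cooccurrences(text)
print(pairs[("pomme", "terre")])
```

With a span of three words, “pomme” and “terre” are paired only where they appear together in “pomme de terre”, not where each word stands alone.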
