Matrices and Songs

There are many ways to represent words, texts, or even speech. One thing they have in common is that, idioms and aphorisms aside, two texts or speeches are almost never the same, because the words and phrases they contain vary so much. Poems and songs are an exception: they rely on repetition not only of melodies but also of words and phrases. In some cases a song is the repetition of a single phrase, while in others, perhaps more rarely, the repeated words can be counted on one hand.

This post analyzes the repetitiveness of the 91 best Italian songs according to Panorama (plus 9 from a recent MTV top 10). It was inspired by the work of Colin Morris, a computational linguist who has studied repetitiveness in songs and how to represent it.

Introduction

The general idea comes from Colin Morris, who presented it in a TEDx talk: quantify the repetitiveness of songs and visualize it in an original way.

An interactive version is available on his website https://colinmorris.github.io/SongSim/.

How to quantify repetitiveness in a song

The first problem is choosing the basic unit for measuring repetitiveness. Words and phrases may work well for a book, but they do not work quite as well for songs.

When analyzing songs and poetry, prefixes, suffixes, and initial or final syllables can all contribute to repetition. Two words do not have to be identical to rhyme; in fact, a rhyme made with the same word twice is rather weak.

We therefore need to go one level deeper, down to the characters themselves. For this reason the LZ77 algorithm, devised for data compression, is an excellent way to get an idea of how repetitive a song is.

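The algorithm shrinks a file by replacing repeated sequences of characters with short references to an earlier occurrence, so the more a text repeats itself, the smaller it becomes.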
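To make the idea concrete, here is a minimal toy sketch of LZ77-style encoding in Python. It is not the code used for this analysis and it is far simpler than what real compressors do; the function names and the window size are assumptions chosen for illustration. Each token says "go back offset characters, copy length characters, then add one literal character".

```python
def lz77_tokens(text, window=255):
    """Greedy toy LZ77: return a list of (offset, length, next_char) triples."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_off = 0, 0
        # Try every starting point inside the sliding window behind position i.
        for j in range(max(0, i - window), i):
            length = 0
            # The match may run past position i (overlapping copies are fine).
            # One character is always left over to serve as the literal.
            while (i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        tokens.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the original text from the triples produced above."""
    out = []
    for offset, length, char in tokens:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])  # copy from already decoded text
        out.append(char)
    return "".join(out)


if __name__ == "__main__":
    chorus = "la la la la la la"
    tokens = lz77_tokens(chorus)
    print(len(chorus), "characters ->", len(tokens), "tokens")
    print(tokens)
```

A repetitive chorus collapses into a handful of tokens, while a line with no repeated sequences needs roughly one token per character. That gap is what the compression percentage measures.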

In practice, it is enough to compress the lyrics file with gzip in verbose mode (gzip -v) on Linux and read the compression percentage from its output. gzip uses the DEFLATE format, which combines LZ77 with Huffman coding.
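The same measurement can be sketched in Python with the standard library's gzip module. This is an illustrative assumption, not the exact procedure used for the numbers in this post: the function name is made up, and the percentages may differ slightly from those printed by the gzip command because of header overhead and compression settings.

```python
import gzip


def compression_percent(lyrics: str) -> float:
    """Percentage of bytes saved by gzip; higher means more repetitive."""
    raw = lyrics.encode("utf-8")
    if not raw:
        raise ValueError("lyrics must not be empty")
    compressed = gzip.compress(raw)
    return 100.0 * (1 - len(compressed) / len(raw))


if __name__ == "__main__":
    repetitive = "la la la la\n" * 50
    varied = "The quick brown fox jumps over the lazy dog"
    print(f"repetitive: {compression_percent(repetitive):.1f}%")
    print(f"varied:     {compression_percent(varied):.1f}%")
```

For very short texts the result can be negative, because the gzip header and trailer weigh more than the savings. With full song lyrics this overhead is small.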

How repetitive are these 100 songs?

The plots below show that most of the songs compress by around 60%, although a few compress much less.

/img/distribution-songs.png /img/boxplot-songs.png

In this sample of 100 songs, Italian pop songs compress by 60% on average. Put this way, there seems to be a difference between Italian and American pop songs (which average 50%). The sample is still very small, but it does tell us something about what these songs selected by "Panorama" have in common.

The least repetitive song is "Isola Grande" by Pino Daniele, but if you look at the lyrics… Clearly, this is an outlier that would probably be best excluded, which leaves "Nel mio letto" by Verdena, the second least repetitive.

The most repetitive, on the other hand, is "Pop Porno", which can be compressed by 76.4%.

Patterns of repetitiveness

In bioinformatics, dot plots are used to compare two sequences. They resemble heatmaps of a matrix in which cell (i, j) is marked when element i of one sequence matches element j of the other. When a sequence is compared with itself, the result is called a self-similarity matrix. The picture shows where the sequence repeats, and it is sometimes reminiscent of snowflakes seen under a microscope.
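As a rough illustration, a self-similarity matrix for lyrics can be built in a few lines of Python. This is a minimal sketch that assumes a word-by-word comparison and a very simple tokenizer; the function names are hypothetical, and it is not the code behind the images in this post.

```python
import re


def tokenize(lyrics):
    """Lowercase the lyrics and split them into words, dropping punctuation."""
    return re.findall(r"\w+", lyrics.lower())


def self_similarity(words):
    """Cell (i, j) is 1 when word i equals word j, otherwise 0."""
    return [[1 if a == b else 0 for b in words] for a in words]


def render(matrix):
    """Draw the matrix as text: '#' for a match, '.' for no match."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    words = tokenize("La la la, canta con me, la la la")
    print(render(self_similarity(words)))
```

The main diagonal is always filled, since every word matches itself. Repeated phrases appear as shorter diagonals parallel to it, and a single word repeated several times in a row forms a solid square. The same matrix can be passed to a plotting library to obtain an image instead of text.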

The same can be done not only with the lyrics of a song but also with its audio.

/img/Permeeimportante-Tiromancino-25.png /img/redsDedicatoate-LeVibrazioni-5.png

Conclusion & Slideshow

Different versions of the dot plots for the 100 songs can be seen at https://andcarnivorous.github.io/slideshow.html. I will share the Python code in my gists as soon as I have cleaned it up.

What do you think of this type of representation? Were there songs that you expected to be more or less repetitive? Or maybe there are compositions with very interesting and peculiar lyrics that should be added? "Prisencolinensinainciusol" by Celentano is the first that might come to mind.