INTRODUCTION

Most people enjoy music in their leisure time. There is more and more music on personal computers, in music libraries, and on the Internet. For music organization, music management, and other related applications, such as music search and play-list generation, various metadata need to be created for each music piece. Although traditional information such as the artist, the album, or the title of a music work remains important, these tags have limited applicability in many music-related applications. For example, when an individual comes back home from work, he may want to listen to some relaxing light music, while at the gymnasium he may want to choose exciting music with a strong beat and fast tempo.

Traditional metadata is not helpful in these scenarios. Therefore, more semantic information, such as beat, tempo, genre, and mood, is expected to be extracted to archive and search music. Using different features and different models, beat and tempo detection and genre classification have been developed. Thus, a computational framework is built to automatically estimate the mood conveyed by each music clip.

A. Related Work on Music Mood and Emotion

Emotions usually play a critical role in rational decision-making, perception, human interaction, and human intelligence. Research on mood and emotion has a long history, and fields such as affective computing aim to build intelligent machines that can perceive and express emotions. These research works often focus on discovering the physiological or psychological signals related to emotional expression, and then detecting or synthesizing emotions based on these signals.

Most of the research on music mood and emotion relates to music composition and music expressivity. These works aim to discover which aspects of music composition and music performance communicate emotions and influence listeners' emotional responses. Most of the work has concentrated on MIDI or symbolic representations because of the difficulty of extracting useful features from acoustic musical data.

B. Issues in Mood Detection from Music Audio Signals

In order to build a good computational model for automatic mood detection from acoustic music data, there are several issues that need to be considered beforehand.

1) One common objection to music mood detection is that the emotional expression and perception of music are subjective, depending on many factors including culture, education, and personal experience. For the same music piece, different musicians may give different performances, and different individuals may have different perceptions. It is therefore often argued that mood is too subjective to be detected. However, music with certain patterns or structures usually has an inherent emotional expression, and these emotions can be communicated. It has also been found that, within a given cultural context, there is broad agreement among individuals regarding the mood elicited by music. Therefore, it is possible to build a mood detection system within a certain context.

2) Another issue relates to mood taxonomy. There is debate over whether emotions are categories or continua, and it is not clear which basic emotions music can express and humans can perceive. Adjectives are usually used to describe moods; however, they are used quite freely and the adjective list is usually immense. Several research works on the basic emotion dimensions provide a basis for music mood taxonomy, as well as important cues for computational modeling of music moods. Here, Thayer's model of mood is adopted as the basis of mood taxonomy and mood detection.

3) The third issue concerns the acoustic features. Apart from some features extracted from MIDI or symbolic representations, there are few acoustic features available to represent various moods. However, most music clips in the real world are recorded acoustic waveforms, and no available transcription system can translate them reliably into symbolic representations. Therefore, it is necessary to deal with the acoustic data directly. Although there has been much work on the development of music and audio features, most of them are not suitable to exactly or directly represent the emotional content of a music signal. Some features used for music analysis in the current literature, such as the mel-frequency cepstral coefficients (MFCC), short-time energy (STE), and zero-crossing rate (ZCR), were originally proposed for speech and audio analysis. Some features used for speech emotion detection, such as pitch or F0, are not feasible for music mood representation. Some music-specific features, such as timbre texture and rhythm content, are used for the different goal of music genre classification; moreover, the timbre features used therein are all based on standard features proposed for music-speech discrimination and speech recognition, and the rhythm features are not designed to specifically represent the primitives of certain moods. Therefore, to build a good computational model of music mood detection, we should extract acoustic features that exactly represent the primitives of various moods. Three acoustic feature sets, covering intensity, timbre, and rhythm, are extracted based on the basic emotional dimensions and the mood taxonomy.

4) The fourth consideration relates to mood detection itself. To obtain better mood-detection performance, we present a hierarchical framework that utilizes the most suitable features in different steps of mood classification. Furthermore, since the mood usually changes over an entire piece of classical music, we extend the algorithm to mood tracking by dividing the music into several independent segments, each of which contains a constant mood.

With the above considerations and solutions we form our approach to mood detection and mood tracking.

FEATURE EXTRACTION

It has been indicated that mode, intensity, timbre, and rhythm are of great significance in arousing different music moods, and that tempo, sound level, spectrum, and articulation are highly related to various emotional expressions. These findings are very similar, although the exact terms differ, such as rhythm versus tempo and intensity versus sound level. Different emotional expressions are usually associated with different patterns of acoustic cues. For example, contentment usually associates with slow tempo, low sound level, and soft timbre, while exuberance associates with fast tempo, fairly high sound level, and bright timbre. However, of the factors given above that influence emotional expression, mode is very difficult to obtain from acoustic data, and articulation is also extremely difficult to measure. Therefore, only features of intensity (sound level), timbre (spectrum), and rhythm (tempo) are extracted and used in the mood detection system. Compared with the two dimensions in Thayer's model of mood, intensity correlates with "energy" or "arousal," while both timbre and rhythm correspond to "stress" or "valence."

In feature extraction, each input music clip is first down-sampled into a uniform format (16000 Hz, 16 bits, mono) and divided into non-overlapping 32-ms frames. In each frame, an octave-scale filter bank is used to divide the frequency domain into several subbands.
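As a rough illustration, the framing and octave-scale subband split might be sketched as follows; the frame length follows from the stated format, while the exact number of bands and the band edges are assumptions for illustration:

```python
import numpy as np

SR = 16000                       # sampling rate after down-sampling (Hz)
FRAME_LEN = int(0.032 * SR)      # non-overlapping 32-ms frames -> 512 samples

def frame_signal(x):
    """Split a mono signal into non-overlapping 32-ms frames."""
    n_frames = len(x) // FRAME_LEN
    return x[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

def octave_band_edges(n_bands=7, nyquist=SR // 2):
    """Octave-scale band edges: 0, ny/2^6, ny/2^5, ..., ny/2, ny (assumed)."""
    return [0.0] + [nyquist / 2 ** (n_bands - 1 - k) for k in range(n_bands)]

x = np.random.randn(SR)          # one second of noise as a stand-in signal
frames = frame_signal(x)
edges = octave_band_edges()      # [0, 125, 250, 500, 1000, 2000, 4000, 8000]
```

With seven octave bands below the 8000-Hz Nyquist frequency, each band spans twice the width of the one below it.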

A. Intensity Features

Intensity is an essential feature in mood detection. Based on Thayer's mood model, the intensity of Contentment and Depression is usually low, while that of Exuberance and Anxious/Frantic is usually high. This is also consistent with the acoustic cues of various emotional expressions. Among the dimensions in Thayer's model, energy (or intensity) is the more computationally tractable and can be estimated using simple amplitude-based measures. In this system, the intensity feature of each frame is composed of the spectrum sum of the signal and the spectrum distribution in each subband.
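A minimal sketch of these per-frame intensity features, assuming hypothetical octave band edges for a 16-kHz signal:

```python
import numpy as np

# Hypothetical octave band edges for a 16-kHz signal (an assumption, not from the text)
EDGES = (0, 125, 250, 500, 1000, 2000, 4000, 8000)

def intensity_features(frame, sr=16000, edges=EDGES):
    """Per-frame intensity: the spectrum sum plus the fraction of
    spectral magnitude falling in each octave subband."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spec.sum()
    dist = np.array([spec[(freqs >= lo) & (freqs < hi)].sum() / (total + 1e-12)
                     for lo, hi in zip(edges[:-1], edges[1:])])
    return total, dist

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)   # a 440-Hz tone
total, dist = intensity_features(frame)
```

For a pure 440-Hz tone, the subband distribution concentrates in the (250, 500] Hz band, as expected.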

B. Timbre Features

Spectral shape features, together with the MFCC technique, are utilized to represent the timbre of audio and music signals. The spectral shape features, including brightness, bandwidth, rolloff, and spectral flux, are important in discriminating different moods. For example, the brightness of Exuberance music is usually higher than that of Depression music, since Exuberance music generally has larger spectral energy in the high subbands. Spectral flux represents the spectrum variation between adjacent frames. Therefore, the spectral shape features are used first.
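The spectral shape features can be sketched from a frame's magnitude spectrum as follows; the rolloff percentage is an illustrative assumption:

```python
import numpy as np

def spectral_shape(spec_prev, spec, freqs, roll_pct=0.85):
    """Spectral shape features for one frame (sketch; roll_pct is an assumption).
    brightness: spectral centroid; bandwidth: spread around the centroid;
    rolloff: frequency below which roll_pct of the magnitude lies;
    flux: squared spectrum change from the previous frame."""
    p = spec / (spec.sum() + 1e-12)
    centroid = float((freqs * p).sum())
    bandwidth = float(np.sqrt((((freqs - centroid) ** 2) * p).sum()))
    cum = np.cumsum(spec)
    rolloff = float(freqs[np.searchsorted(cum, roll_pct * cum[-1])])
    flux = float(np.sum((spec - spec_prev) ** 2))
    return centroid, bandwidth, rolloff, flux

sr, n = 16000, 512
t = np.arange(n) / sr
freqs = np.fft.rfftfreq(n, 1.0 / sr)
s1 = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t)))    # darker frame
s2 = np.abs(np.fft.rfft(np.sin(2 * np.pi * 880 * t)))    # brighter frame
c1, b1, r1, f1 = spectral_shape(np.zeros_like(s1), s1, freqs)
c2, b2, r2, f2 = spectral_shape(s1, s2, freqs)
```

The brighter 880-Hz frame yields a higher centroid than the 440-Hz frame, matching the intuition that brightness tracks high-frequency energy.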

As for MFCC, it has been used with great success in general speech and audio processing, and has also been used as one of the timbre texture features in a music genre classification system. However, it averages the spectral distribution in each subband and thus loses the relative spectral information. To compensate for this disadvantage, octave-based spectral contrast is utilized instead. This feature considers the spectral peak, spectral valley, and their dynamics in each subband, and roughly reflects the relative distribution of the harmonic and nonharmonic components in the spectrum. Evaluations on a music genre recognition system indicated that it performs better than MFCC.
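The peak-versus-valley idea behind octave-based spectral contrast can be sketched as below; the band edges and the top/bottom fraction alpha are illustrative assumptions:

```python
import numpy as np

def spectral_contrast(spec, freqs,
                      edges=(0, 125, 250, 500, 1000, 2000, 4000, 8000),
                      alpha=0.2):
    """Octave-based spectral contrast (sketch; edges and alpha are assumptions).
    Per subband: the mean of the top alpha fraction of magnitudes gives the
    log peak, the bottom alpha fraction the log valley; contrast = peak - valley."""
    peaks, valleys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.sort(spec[(freqs >= lo) & (freqs < hi)])
        k = max(1, int(alpha * len(band)))
        valleys.append(np.log(band[:k].mean() + 1e-12))
        peaks.append(np.log(band[-k:].mean() + 1e-12))
    peaks, valleys = np.array(peaks), np.array(valleys)
    return peaks, valleys, peaks - valleys

freqs = np.fft.rfftfreq(512, 1.0 / 16000)
_, _, c_flat = spectral_contrast(np.ones(len(freqs)), freqs)   # flat spectrum
t = np.arange(512) / 16000
spec_sine = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t)))  # harmonic spectrum
_, _, c_sine = spectral_contrast(spec_sine, freqs)
```

A flat (noise-like) spectrum yields zero contrast in every band, while a tonal spectrum yields large contrast in the band containing the tone, which is exactly the harmonic/nonharmonic distinction the feature is meant to capture.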

C. Rhythm Features

In general, three aspects of rhythm are closely related to people's mood response: rhythm strength, rhythm regularity, and tempo. For example, in the Exuberance cluster the rhythm is usually strong and steady and the tempo is fast, while Depression music is usually slow and does not have a distinct rhythm pattern. In this mood detection system, five novel features are proposed to represent the three aspects of rhythm mentioned above.


After the FFT, each frame is divided into seven octave-based subbands, and the amplitude envelope of each subband is calculated by convolving with a half Hanning window. The half Hanning window has low-pass characteristics and is usually used for envelope extraction while still keeping the sharp attacks in the amplitude curve. After the amplitude envelope is obtained, the Canny operator, which is usually used for edge detection in image processing, is used for onset sequence detection by estimating the variation of the amplitude envelope. The Canny operator considers a larger range of points; thus, it can detect more potential onsets and smooth out the noise that would result from simple first-order differences.
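The envelope-plus-Canny onset step might be sketched as follows; the window length and the Canny operator's size and sigma are illustrative assumptions:

```python
import numpy as np

def half_hanning(n=100):
    """Decaying half of a Hanning window: low-pass smoothing that still
    preserves sharp attacks in the amplitude curve (length is an assumption)."""
    return np.hanning(2 * n)[n:]

def canny_kernel(size=12, sigma=4.0):
    """1-D Canny-style operator (derivative of a Gaussian); size and sigma
    are illustrative assumptions."""
    t = np.arange(-size, size + 1)
    return -t / sigma ** 2 * np.exp(-t ** 2 / (2 * sigma ** 2))

def onset_curve(subband_amplitude):
    """Envelope by half-Hanning smoothing, then differentiation with the
    Canny operator; strong positive lobes mark candidate onsets."""
    env = np.convolve(np.abs(subband_amplitude), half_hanning(), mode="same")
    return np.convolve(env, canny_kernel(), mode="same")

x = np.zeros(400)
x[200:] = 1.0                     # silence followed by a sudden attack
oc = onset_curve(x)               # peaks near the attack position
```

A sudden jump in amplitude produces a clear positive peak in the onset curve near the attack, whereas a first-order difference of the raw signal would also amplify noise.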

Finally, the onset curves of the subbands are summed and used to represent the rhythm information of a music wave. The first component of the rhythm feature set is rhythm strength, which can be intuitively correlated to the average strength of onsets, where the onset strength can be taken as the values of the peaks in the onset sequence.

Rhythm Strength: The average onset strength in the onset sequence. The stronger the rhythm is, the higher the value is. It is clear that if a music clip has an obvious and regular rhythm, the peaks of the autocorrelation curve of the onset sequence will be obvious and strong as well, and vice versa. Based on this fact, the following two feature components are extracted to represent the distinctness and regularity of the rhythm.

Average Correlation Peak: The average strength (amplitude) of the local peaks in the autocorrelation curve. The more regular the rhythm is, the higher the value is.

avr(A)/avr(V): The ratio between the average peak strength and the average valley strength. The more obvious the rhythm is, the higher the value is.

Average Tempo: Represents the average speed of the music performance. The average tempo is estimated as the maximum common divisor of the detected peaks, assuming that the average tempo does not vary much within the music clip. The average tempo is normalized by 120 beats per minute (BPM).

However, such an average tempo only represents the occurrence frequency of beats. It cannot represent the frequency of the underlying onsets, which is also an important cue of music performance. Therefore, another feature component is extracted as follows.

Average Onset Frequency: The ratio between the number of onsets and the corresponding time duration. The larger the value is, the faster the performance is.
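The five rhythm components can be sketched from a summed onset curve as below. The onset-curve sampling rate is an assumption, and tempo is taken here from the strongest autocorrelation peak as a simplification of the common-divisor estimate described above:

```python
import numpy as np

def rhythm_features(onset, sr_env=250):
    """Five rhythm features from a summed onset curve (sketch).
    sr_env is the assumed sampling rate of the onset curve."""
    # onsets ~ local maxima of the onset curve
    is_peak = (onset[1:-1] > onset[:-2]) & (onset[1:-1] > onset[2:]) & (onset[1:-1] > 0)
    idx = np.where(is_peak)[0] + 1
    strength = onset[idx].mean() if len(idx) else 0.0          # 1) rhythm strength

    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    ac = ac / (ac[0] + 1e-12)                                  # normalized autocorrelation
    pidx = np.where((ac[1:-1] > ac[:-2]) & (ac[1:-1] > ac[2:]))[0] + 1
    vidx = np.where((ac[1:-1] < ac[:-2]) & (ac[1:-1] < ac[2:]))[0] + 1
    avg_peak = ac[pidx].mean() if len(pidx) else 0.0           # 2) average correlation peak
    ratio = (ac[pidx].mean() / (abs(ac[vidx].mean()) + 1e-12)  # 3) avr(A)/avr(V)
             if len(pidx) and len(vidx) else 0.0)

    tempo = 0.0                                                # 4) average tempo / 120 BPM
    if len(pidx):
        lag = pidx[ac[pidx].argmax()]                          # simplification: strongest peak
        tempo = (60.0 * sr_env / lag) / 120.0
    onset_freq = len(idx) * sr_env / len(onset)                # 5) onsets per second
    return strength, avg_peak, ratio, tempo, onset_freq

onset = np.zeros(1000)
onset[::50] = 1.0                 # a perfectly regular onset train
s, ap, ratio, tempo, of = rhythm_features(onset)
```

For this regular train at a 50-sample period and an assumed 250-Hz onset curve, the beat rate is 5 per second (300 BPM), so the normalized tempo is 2.5.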

The aforementioned five components compose a five-dimensional rhythm feature set, which performs quite well on music mood discrimination. For example, Exuberance music usually has large rhythm strength, large onset frequency, and a large average autocorrelation peak, while Depression music has weak rhythm strength, slow onset frequency or tempo, and a low autocorrelation peak.

D. Feature Representation of a Music Clip

The intensity and timbre features are obtained from each frame, while the rhythm features are obtained from an entire clip. Moreover, it is not appropriate to directly concatenate the components of each feature set into a feature vector, since the characteristics and dynamics of these components are very different. Therefore, a normalization process is first performed on each feature component to make their scales similar.
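One common way to put feature components on a similar scale is per-component z-score normalization; the text does not pin down the exact scheme, so the following is a sketch under that assumption:

```python
import numpy as np

def normalize_features(features):
    """Z-score each feature component across a set of observations so the
    components are on a comparable scale (one common choice; the exact
    normalization is an assumption here)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12
    return (features - mu) / sigma, mu, sigma

X = np.array([[1.0, 100.0],       # components with very different dynamic ranges
              [2.0, 200.0],
              [3.0, 300.0]])
Xn, mu, sigma = normalize_features(X)
```

After normalization, each component has zero mean and unit variance, so no single component dominates a concatenated feature vector merely because of its scale.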
