Audio encoding depth - what is it? Definition and formula

Audio encoding refers to the methods used to store and transmit audio data. The article below describes how such encodings work.

Audio encoding depth is a fairly involved topic, and this article also gives a definition of the concept. The material here is intended as a general overview: it unpacks the ideas behind audio encoding depth, and this reference information may help you understand how the Speech API works and how to format and process audio in your own applications.

How to find audio encoding depth

An audio file format is not the same thing as an audio encoding. For example, the popular WAV file format defines the header of an audio file but is not itself an audio encoding. WAV audio files often, but not always, use linear PCM encoding.

FLAC, in turn, is both a file format and an encoding, which sometimes leads to confusion. Within the Speech API, FLAC is the only encoding that requires the audio to include a header; all other encodings specify headerless audio data. When we refer to FLAC within the Speech API, we always mean the codec. When we refer to the FLAC file format, we will write it as ".FLAC".

You are not required to specify the encoding and sample rate for WAV or FLAC files. If these parameters are omitted, the Cloud Speech API determines the encoding and sample rate automatically from the file header. If you specify an encoding or sample rate value that does not match the value in the file header, the Cloud Speech API returns an error.
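As an illustration, here is a minimal sketch of such a request, assuming the google-cloud-speech Python client and a hypothetical local file named speech.flac; parameter names follow that client library, so check them against the version you use.

    from google.cloud import speech

    client = speech.SpeechClient()

    # Hypothetical input file; FLAC carries its own header.
    with open("speech.flac", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    # For FLAC and WAV, encoding and sample_rate_hertz may be omitted because
    # they can be read from the file header; if supplied, they must match it.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)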

Audio encoding depth - what is it?

Audio consists of waveforms, formed by the superposition of waves of different frequencies and amplitudes. To represent these waveforms digitally, the signal must be sampled at a rate high enough to represent the highest-frequency sounds you want to reproduce. The samples must also be stored with enough bit depth to represent the correct amplitude (loudness and softness) of the waveforms across the audio sample.

The ability of an audio processing device to recreate frequencies is known as its frequency response, and its ability to reproduce proper loudness and softness is known as its dynamic range. Together, these characteristics are often referred to as the device's fidelity. An audio encoding, at its simplest, is a means of reconstructing sound using these two basic principles, while also being able to store and transmit that data efficiently.

Sampling rate

Sound exists as an analog waveform. A segment of digital audio approximates this analog wave by sampling its amplitude at a rate fast enough to imitate the wave's natural frequencies. The sample rate of a digital audio signal specifies the number of samples taken from the source material per second. Higher sample rates increase the ability of digital audio to faithfully represent high frequencies.

As a consequence of the Nyquist-Shannon sampling theorem, you generally need to sample at more than twice the highest frequency of any sound wave you want to capture digitally. For example, to represent sound in the range of human hearing (20-20,000 Hz), a digital audio format must sample at least 40,000 times per second (which is why CD audio uses a sample rate of 44,100 Hz).
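A short worked example of this criterion, in Python; the frequencies below are illustrative values, not part of the original article.

    # The Nyquist criterion: sample at more than twice the highest frequency
    # you want to capture.
    def min_sample_rate(highest_frequency_hz: float) -> float:
        """Theoretical lower bound on the sample rate for the given frequency."""
        return 2 * highest_frequency_hz

    print(min_sample_rate(20_000))  # 40000.0 -> why CDs sample at 44,100 Hz
    print(min_sample_rate(8_000))   # 16000.0 -> a common rate for speech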

Bit depth

Audio encoding depth, or bit depth, affects the dynamic range of a given audio sample. Higher bit depths allow amplitudes to be represented more precisely. If the same audio sample contains many loud and soft sounds, more bit depth is needed to convey those sounds correctly.

Higher bit depths also reduce quantization noise within audio samples, improving their signal-to-noise ratio. CD music audio is transmitted with an encoding depth of 16 bits. Some compression techniques can compensate for smaller bit depths, but they are generally lossy. DVD Audio uses a depth of 24 bits, while most telephone audio uses an encoding depth of 8 bits.
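To make the relationship concrete, here is a small illustrative calculation; the roughly 6 dB per bit figure is the theoretical maximum for linear quantization, not a guarantee for any particular device.

    import math

    def dynamic_range_db(bit_depth: int) -> float:
        """Approximate dynamic range of linearly quantized audio, in decibels."""
        return 20 * math.log10(2 ** bit_depth)

    for bits in (8, 16, 24):
        print(f"{bits}-bit audio: ~{dynamic_range_db(bits):.0f} dB of dynamic range")
    # 8-bit:  ~48 dB (telephone-style audio)
    # 16-bit: ~96 dB (CD audio)
    # 24-bit: ~144 dB (DVD Audio)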

Uncompressed sound

Most digital audio processing uses these two techniques (sample rate and bit depth) to store audio data in a straightforward way. One of the most popular digital audio techniques (popularized by the compact disc) is known as pulse code modulation (PCM). Audio is sampled at set intervals, and the amplitude of the sampled wave at each point is stored as a digital value using the sample's bit depth.

Linear PCM (which indicates that the amplitude response is linearly uniform across the sample) is the standard used on CDs and in the Speech API's LINEAR16 encoding. Both encodings produce an uncompressed stream of bytes that corresponds directly to the audio data, and both use 16 bits of depth. On CDs, linear PCM uses a sample rate of 44,100 Hz, which is appropriate for reproducing music; a sample rate of 16,000 Hz, however, is better suited to speech recognition.

Linear PCM (LINEAR16) is an example of uncompressed audio because the digital data is stored exactly as the principles above imply. Reading a single-channel byte stream encoded with linear PCM, you can count off every 16 bits (2 bytes) to get another amplitude value of the signal. Almost all devices can manipulate such digital data natively - you could even trim linear PCM audio files in a text editor - but uncompressed audio is not the most efficient way to transport or store digital sound. For that reason, most audio uses digital compression techniques.
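Here is a minimal sketch of that counting-off process, assuming a hypothetical headerless file speech.raw containing single-channel, little-endian 16-bit PCM (a real WAV file would also have a header to skip).

    import struct

    with open("speech.raw", "rb") as f:  # headerless 16-bit little-endian PCM
        raw = f.read()

    # Every 2 bytes is one signed 16-bit amplitude value (-32768..32767).
    num_samples = len(raw) // 2
    samples = struct.unpack(f"<{num_samples}h", raw[: num_samples * 2])

    print(f"{num_samples} samples, peak amplitude {max(abs(s) for s in samples)}")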

Compressed sound

Audio data, like all data, is often compressed to make it easier to store and transport. Compression within audio encoding can be either lossless or lossy. Lossless compression can be unpacked to restore the digital data to its original form. Lossy compression necessarily removes some information during compression and is parameterized to indicate how much data the compression technique is allowed to discard.

Lossless

Lossless compression shrinks digital audio by cleverly rearranging and re-encoding the stored data, without any degradation in the quality of the original digital sample. With lossless compression, no information is lost when the data is unpacked back into its original digital form.

So why do lossless compression techniques sometimes have optimization parameters? These settings usually trade file size against compression and decompression time. For example, FLAC uses a compression level setting from 0 (fastest) to 8 (smallest file size). A higher FLAC compression level loses no information compared with a lower level; the compression algorithm simply has to spend more computational effort constructing or deconstructing the original digital audio.
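For example, a compression level can be chosen when encoding; the sketch below assumes the reference flac command-line encoder is installed and that a file speech.wav exists (both are assumptions, not part of the original article).

    import subprocess

    # -0 is the fastest level, -8 yields the smallest file; every level is
    # lossless, so the decoded audio is bit-identical whichever you choose.
    subprocess.run(
        ["flac", "-8", "-o", "speech.flac", "speech.wav"],
        check=True,
    )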

The Speech API supports two lossless encodings: FLAC and LINEAR16. Technically, LINEAR16 is not "lossless compression", because no compression is involved in the first place. If file size or data transmission matters to you, choose FLAC as your audio encoding.

Lossy compression

Lossy audio compression eliminates or reduces certain kinds of information when constructing the compressed data. The Speech API supports several lossy formats, although they are best avoided, because the lost data can affect recognition accuracy.

The popular MP3 codec is an example of a lossy encoding technique. All MP3 compression schemes remove sound that lies outside the normal range of human hearing, and they adjust the degree of compression by varying the MP3 codec's effective bit rate, that is, the number of bits per second used to store the audio data.

For example, a stereo CD using 16-bit linear PCM has an effective bit rate given by the following formula:

44,100 samples per second * 2 channels * 16 bits = 1,411,200 bits per second (bps) = 1,411 kbps
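The same formula in code, for comparison; the 16 kHz mono case reflects the speech-oriented settings mentioned earlier in the article.

    def pcm_bit_rate(sample_rate_hz: int, channels: int, bit_depth: int) -> int:
        """Effective bit rate of uncompressed linear PCM, in bits per second."""
        return sample_rate_hz * channels * bit_depth

    print(pcm_bit_rate(44_100, 2, 16))  # 1411200 bps ~= 1411 kbps (stereo CD)
    print(pcm_bit_rate(16_000, 1, 16))  # 256000 bps = 256 kbps (16 kHz mono speech)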

MP3 compression, for example, reduces this digital data by encoding at bit rates such as 320 kbps, 128 kbps or 96 kbps, with a corresponding loss of audio quality. MP3 also supports variable bit rates, which can compress the audio further. Both techniques lose information and can affect quality. It is safe to say that most people can hear the difference between MP3 music encoded at 96 kbps and at 128 kbps.

Other forms of compression

MULAW is an 8-bit PCM encoding in which the sample amplitude is modulated logarithmically rather than linearly. As a result, uLaw reduces the effective dynamic range of the compressed audio. Although uLaw was introduced specifically to optimize the encoding of speech, as opposed to other types of audio, 16-bit LINEAR16 (uncompressed PCM) is still far superior to 8-bit uLaw-compressed audio.
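For illustration, the logarithmic mapping at the heart of mu-law looks roughly like this; the sketch uses the standard companding formula with mu = 255 but leaves out the segment tables and bit packing of a real codec.

    import math

    MU = 255  # standard value for 8-bit mu-law telephony

    def mu_law_compress(x: float) -> float:
        """Map a linear sample in [-1.0, 1.0] to a companded value in [-1.0, 1.0]."""
        return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

    # Quiet sounds keep proportionally more resolution than loud ones:
    for amplitude in (0.01, 0.1, 0.5, 1.0):
        print(f"linear {amplitude:4.2f} -> companded {mu_law_compress(amplitude):.3f}")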

AMR and AMR_WB compress the encoded audio by applying a variable bit rate to the original audio sample.

Although the Speech API supports several lossy formats, you should avoid them if you have control over the source audio. While removing data through lossy compression may have no noticeable effect on the sound as heard by the human ear, the loss of that data can significantly impair accuracy for a speech recognition engine.

Source: https://habr.com/ru/post/C42633/

