Secure speech solutions
This section deals with secure speech equipment, such as voice encryption
devices, from a variety of manufacturers. Such devices come in many flavours,
ranging from simple speech scramblers to digital voice encryptors.
Most of the devices shown below are also featured elsewhere on this website,
as they fall into multiple categories.
Secure telephones are a class of their own,
but since they also belong to the group of voice encryption devices,
they are linked from this page.
Voice encryption units on this website
Secure speech systems are known by various names, such as Voice Privacy
Unit, Secure Speech System, Voice Protection Device, Speech Encryptor, etc.
In principle, there are four systems for voice protection:
- Frequency domain voice scrambler
In this analogue system, the frequency spectrum of human speech is mirrored
around a given center frequency, so that it becomes unintelligible.
Such systems can easily be broken, even if the audio band is split
into multiple smaller bands first.
- Time domain voice scrambler
In this system, the human speech is first stored in some kind of memory,
after which the individual parts are then scrambled in the time domain.
It is more secure than a frequency domain scrambler, but can still be
broken as the individual sound samples still bear the properties of speech.
- Frequency and Time Domain voice scrambler
This system, also known as the F/T Scrambler, is a combination of the
above methods. It is the most complex one, but can still be broken with
the right equipment, no matter how complex the randomizer is, as the
individual samples still bear the properties of speech.
- Digital Encryption
This method uses a digital representation of the analogue voice signal
(samples), which is mixed with a digital key stream. This method is much
safer than the ones above and is the only one that can really be called secure.
Before digital speech encryption became widely available,
an analogue technique was used to protect voice transmissions.
This technique is commonly known as Voice Scrambling and comes
in three flavours, which are further explained below.
Scramblers are inherently insecure and only provide protection against
the occasional eavesdropper, such as a telephone exchange operator.
Frequency domain scrambling
This technique is based on frequency
inversion and is commonly called voice scrambling.
The audio frequency spectrum is mirrored around a given
center frequency, sometimes after dividing it into multiple frequency bands.
This principle is best explained using a simplified model:
The audio spectrum of the voice data (1) is mixed with a carrier frequency
fc (2). This results in two spectra: one that is the sum
of the original spectrum and the carrier (3),
and another one that is the difference of the two signals (4).
A low-pass filter (LPF) is then applied to filter out the
sum and leave only the difference, effectively resulting in a mirrored
audio band (5).
At the receiving end, this process of mirroring the spectrum is repeated
to make the speech intelligible again.
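The mixing and filtering steps can be tried out numerically. The following is a minimal sketch in plain Python, not a description of any particular scrambler: the sample rate, the 3000 Hz carrier, the 101-tap filter and all function names are assumptions chosen for illustration. A 500 Hz test tone should come out at 2500 Hz (carrier minus tone), and applying the same operation a second time should bring it back to 500 Hz.

```python
import math

FS = 16000   # sample rate in Hz (assumed for this sketch)
FC = 3000    # carrier frequency fc in Hz (assumed)


def lowpass_taps(cutoff_hz, fs, num_taps=101):
    """Return windowed-sinc (Hamming) low-pass FIR coefficients."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def fir_filter(samples, taps):
    """Plain FIR convolution; samples before the start are treated as zero."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k in range(min(len(taps), i + 1)):
            acc += taps[k] * samples[i - k]
        out.append(acc)
    return out


def invert_spectrum(samples, fs=FS, fc=FC):
    """Mix with the carrier, then keep only the difference spectrum."""
    mixed = [2.0 * s * math.cos(2 * math.pi * fc * n / fs)
             for n, s in enumerate(samples)]
    return fir_filter(mixed, lowpass_taps(fc, fs))


def tone_level(samples, freq_hz, fs=FS):
    """Amplitude of the component at freq_hz (single-bin DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / fs)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / fs)
             for n, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / len(samples)


# Scramble a 500 Hz test tone, then descramble it with the same operation.
tone = [math.cos(2 * math.pi * 500 * n / FS) for n in range(3200)]
scrambled = invert_spectrum(tone)
restored = invert_spectrum(scrambled)
```

Comparing tone_level(scrambled, 2500) with tone_level(scrambled, 500) shows where the energy has moved. Because descrambling is the same operation as scrambling, anyone who knows or guesses the carrier frequency can undo it.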
To make things more complex, one could vary the carrier frequency and also
split the audio band into several (e.g. five) smaller bands that are then
mirrored individually. Continuously varying these parameters by putting them
under digital control can make it harder to decode the signal.
The advantage of this technique is that it takes place entirely within the
audio bandwidth of a channel, whereas digital encryption generally requires
more bandwidth. This allows scrambling to be used in existing systems.
At the time, scramblers were also cheaper than
digital encryptors, which is why scramblers were used by the police
in many countries from the 1970s well into the 1990s.
The disadvantage of this method is that an eavesdropper can easily reverse
the mirroring process with a simple electronic circuit.
In addition, experienced listeners could sometimes even extract useful
information from the seemingly garbled speech directly, without a descrambler.
Time domain scrambling
Another method for speech protection is the so-called time-division or
time-domain (TD) speech scrambling. This method is more secure than the simpler
frequency-inversion system, but far less secure than modern
digital speech encryptors.
The simplified diagram below shows how it works.
Human speech is cut into a number of small time segments which are then
scrambled in an ever changing order. The order in which the packets are
scrambled is determined by a pseudo-random number generator (PRNG),
which is seeded or initialised by the user by means of a key.
In this diagram, the top row shows the clear speech (input) in time.
The second row shows the speech after it is scrambled.
The bottom row finally shows the speech once it is descrambled again (output).
The whole process of scrambling and descrambling causes a noticeable delay,
which is typically in the range of 0.3 to 0.6 seconds.
This delay sometimes causes confusion.
As the time segments are scrambled in an ever changing pattern, it is important
that transmitter and receiver are correctly synchronised. To ensure that both
ends are kept in sync, a pilot signal is transmitted with the
scrambled speech by means of Audio Frequency Shift Keying (AFSK).
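The segment shuffling can be sketched as follows. This is an illustration only and does not reproduce any real scrambler: the frame length of eight segments, the function names and the use of Python's random module as the key-seeded PRNG are all assumptions, and that generator is not cryptographically secure. Both ends derive the same sequence of permutations from the same key, which is why they have to stay in step.

```python
import random


def _frame_orders(key, total, frame_len):
    """Yield (start, order) for each frame; the order changes from frame to frame."""
    rng = random.Random(key)  # stand-in PRNG seeded by the key (assumption)
    for start in range(0, total, frame_len):
        order = list(range(min(frame_len, total - start)))
        rng.shuffle(order)
        yield start, order


def scramble(segments, key, frame_len=8):
    """Reorder the time segments within each frame."""
    out = []
    for start, order in _frame_orders(key, len(segments), frame_len):
        out.extend(segments[start + i] for i in order)
    return out


def descramble(segments, key, frame_len=8):
    """Put each segment back in its original time slot."""
    out = [None] * len(segments)
    for start, order in _frame_orders(key, len(segments), frame_len):
        for pos, i in enumerate(order):
            out[start + i] = segments[start + pos]
    return out


# Each number stands for one short slice of audio.
clear = list(range(24))
sent = scramble(clear, key=1234)
received = descramble(sent, key=1234)
```

A full frame has to be collected before it can be reordered. With the assumed eight segments of, say, 40 ms each, that buffering alone accounts for 0.32 seconds, which is in line with the delay mentioned above.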
An example of a speech scrambler that uses Time Domain Scrambling is the
BBC Cryptophon 1100.
Although scramblers of this type are not safe, many police and other
law enforcement agencies around the world used this method for securing
their conversations for many years, as it has the advantage that it can be
used on existing narrow-band FM radio channels.
Even though an experienced listener cannot make any sense
of the garbled speech, the system is prone to cryptanalytic attacks,
as it is possible to
reconstruct the original signal (and hence the cryptographic key) by examining
the output signal on an oscilloscope or by means of a modern computer.
Frequency and Time domain scrambling
The third and most complex type of voice scrambler is the so-called
Frequency and Time Domain Scrambler, also known as the F/T Scrambler,
which is basically a combination of the two methods explained above.
Although scrambling and descrambling with this method is much more complex,
the system is just as prone to cryptanalysis as the previous ones.
It is inherently insecure.
Below are some examples of scrambled speech.
These samples were recorded by Barry Wels from the built-in analogue
voice scrambler of the Icom IC-H11 radio. If you listen carefully to the
scrambled audio, you may actually be able to descramble some of it yourself.
Most - if not all - modern secure voice terminals use digital encryption.
Speech is first digitized by means of an Analog-to-Digital Converter (ADC) or
a vocoder. The resulting digital data stream is then mixed, by means of
an XOR operation, with a data stream from a pseudo-random number
generator (PRNG) that in turn is seeded by a KEY. This principle is also
known as the Vernam Cipher.
The resulting encrypted data stream is then converted back to the
analog domain (modem), so that it can be transmitted.
This process is shown in the simplified diagram below:
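The XOR mixing step can also be sketched in a few lines of code. This is a toy illustration under stated assumptions: the function names are invented for this example, and Python's random module stands in for the key-seeded PRNG. It is not cryptographically secure, and real equipment uses purpose-built key generators. Because XOR is its own inverse, the same function both encrypts and decrypts.

```python
import random


def keystream(key, length):
    """Key-seeded pseudo-random bytes (toy PRNG, an assumption of this sketch)."""
    rng = random.Random(key)
    return bytes(rng.randrange(256) for _ in range(length))


def xor_stream(data, key):
    """Mix the data with the key stream by XOR; apply twice to get the data back."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


# Stand-in for digitized speech coming from the ADC or vocoder.
voice_bits = bytes(range(32))
encrypted = xor_stream(voice_bits, key=20161211)
decrypted = xor_stream(encrypted, key=20161211)
```

Only a receiver that seeds its generator with the same key produces the same key stream, and therefore recovers the original data.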
In the 1970s many systems, such as the KY-57,
used Continuous Variable Slope Delta Modulation (CVSD) to convert speech into digital data. This wide-band solution was only suitable for VHF and UHF radios.
In the 1980s narrow-band systems were introduced,
such as the KY-99, that used (enhanced) Linear Predictive
Coding (LPC), limiting the data rate to 2400 or even 800 bits per second (bps).
The Pseudo Random Number Generator (PRNG) is seeded by a KEY that is either
entered manually or by means of a key fill device.
Modern systems sometimes use asymmetric methods (e.g. RSA or Diffie-Hellman)
to exchange the keys over an insecure channel. This is known as
public-key cryptography.
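How two parties can agree on a key over an insecure channel can be illustrated with Diffie-Hellman key agreement, one well-known public-key technique. The sketch below is a toy: the modulus (the Mersenne prime 2^127 - 1), the base 3 and the function names are assumptions picked for readability, and the parameters are far too small for real use. It is not claimed that any particular device works this way.

```python
import secrets

P = 2**127 - 1   # toy prime modulus (assumed; far too small for real security)
G = 3            # toy base (assumed)


def make_keypair():
    """Return (private, public); only the public value is sent over the channel."""
    private = secrets.randbelow(P - 3) + 2   # random value in the range 2 .. P-2
    public = pow(G, private, P)
    return private, public


def shared_secret(own_private, peer_public):
    """Combine our private value with the public value received from the peer."""
    return pow(peer_public, own_private, P)


# Each side generates a key pair and exchanges only the public part.
a_private, a_public = make_keypair()
b_private, b_public = make_keypair()
key_at_a = shared_secret(a_private, b_public)
key_at_b = shared_secret(b_private, a_public)
```

Both ends arrive at the same number without it ever being transmitted. That number could then serve as the KEY that seeds the key generator.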
Before human voice data can be encrypted, it has to be converted to the
digital domain, by means of a sound sampler or digitizer.
Generally speaking, a digital signal needs more bandwidth than its analogue
equivalent (typically twice the bandwidth), but methods have been developed
to reduce this effect, by analysing the properties of human speech and
sending these parameters to the other end, where they are used to reconstruct
or synthesize the human speech again.
A device that works this way is known as a vocoder. The reconstructed speech
is not always good enough to recognise a person's voice. The first vocoder,
named VODER, was developed
at Bell Labs in 1939. Its principle was first used during WWII on the
transatlantic SIGSALY crypto phone.
A speech analyser/synthesizer is also known as a CODEC (coder-decoder).
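To get a feel for the saving, the data rate of sending every sample can be compared with that of sending a small set of parameters per frame. The figures in this sketch are assumptions for illustration: a 1-bit-per-sample coder running at 16 kHz, and a frame format of 54 bits per 180-sample frame at 8 kHz, which is the layout published for the LPC-10 standard and is not taken from this page. The function names are invented for this example.

```python
def sample_stream_bps(sample_rate_hz, bits_per_sample):
    """Data rate when every sample is transmitted."""
    return sample_rate_hz * bits_per_sample


def frame_stream_bps(sample_rate_hz, samples_per_frame, bits_per_frame):
    """Data rate when only a block of parameters is sent for each frame."""
    return sample_rate_hz * bits_per_frame / samples_per_frame


waveform_rate = sample_stream_bps(16000, 1)       # 1 bit for each of 16000 samples/s
vocoder_rate = frame_stream_bps(8000, 180, 54)    # 54 bits for every 180 samples
saving = waveform_rate / vocoder_rate
```

Under these assumptions the parametric approach needs only a fraction of the data rate, which is what makes encrypted speech possible on narrow-band channels.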
Here are some examples of speech digitisers:
- PCM - Pulse Code Modulation
PCM is a general expression for digitizing an analogue signal. A PCM
signal is in fact the numerical or digital representation of the analogue
signal. Sending PCM data typically requires twice the bandwidth of the
analogue original, but the quality is unsurpassed.
Data rates are typically in the range of 16 to 32 kbps.
- CVSD - Continuous Variable Slope Delta modulation
Reasonable quality vocoder with 1 bit/sample that only registers the
difference between the current sample and the previous one
(1 = higher, 0 = lower).
It has a sample rate between 8 and 16 kHz, which results in 8 to 16 kbps
data. Examples of equipment that used CVSD are the
Philips Spendex 10,
the Spendex 50 and
the American KY-68.
- LPC - Linear Predictive Coding
Early vocoder for narrow bandwidth connections. LPC-10 has a sampling
rate of 8 kHz and a coding rate of 2.4 kbps. Developed by the NSA and
used in the first generation secure terminal units (STU).
- RELP - Residual-Excited Linear Prediction
Improved (but now obsolete) variant of LPC, and predecessor of CELP.
- CELP - Code-Excited Linear Prediction
Improved variant of LPC and RELP that provides better speech quality at
lower bitrates. CELP exists in many variants and is also used in MPEG-4
audio coding. It is the most widely used speech coding algorithm today.
- MELP - Mixed-Excitation Linear Prediction
Medium quality vocoder, mainly used by the US Department of Defense for
secure communication via satellites and military radios. It has a sampling
rate of 8 kHz and a coding rate of 2.4 kbps.
- MELPe - Enhanced Mixed-Excitation Linear Prediction
High-quality low-bitrate enhanced version of MELP with a sampling rate
of 8 kHz and coding rates of 2400, 1200 and 600 bps.
- MRELP - Modified Residually-Excited Linear Prediction
Improved variant of MELP and MELPe that produces better results at higher
bitrates, such as 9600 bps.
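Of the methods listed above, the 1-bit principle of CVSD is simple enough to sketch in code. The following is a strongly simplified illustration: the step limits, the growth and decay factors, the run length of three and all names are assumptions, and real CVSD implementations add refinements such as a syllabic filter and a leaky integrator. Each output bit only says whether the current sample is higher (1) or lower (0) than the running estimate, and the step size grows when several identical bits follow each other.

```python
import math


class SlopeTracker:
    """State shared by encoder and decoder: running estimate and step size."""

    def __init__(self, min_step=0.005, max_step=0.15, run_length=3):
        self.estimate = 0.0
        self.step = min_step
        self.min_step = min_step
        self.max_step = max_step
        self.run_length = run_length
        self.recent = []

    def update(self, bit):
        """Adapt the step size, then move the estimate up (1) or down (0)."""
        self.recent = (self.recent + [bit])[-self.run_length:]
        in_run = (len(self.recent) == self.run_length
                  and len(set(self.recent)) == 1)
        if in_run:   # signal is running away from the estimate: steeper slope
            self.step = min(self.step * 1.5, self.max_step)
        else:        # estimate is close to the signal: let the step decay
            self.step = max(self.step * 0.9, self.min_step)
        self.estimate += self.step if bit else -self.step
        return self.estimate


def cvsd_encode(samples):
    """Produce one bit per sample: 1 = higher than the estimate, 0 = lower."""
    tracker = SlopeTracker()
    bits = []
    for s in samples:
        bit = 1 if s >= tracker.estimate else 0
        bits.append(bit)
        tracker.update(bit)
    return bits


def cvsd_decode(bits):
    """Rebuild the waveform by replaying the same tracker on the bit stream."""
    tracker = SlopeTracker()
    return [tracker.update(b) for b in bits]


# A 400 Hz tone sampled at 16 kHz yields 16000 bits per second.
tone = [0.5 * math.sin(2 * math.pi * 400 * n / 16000) for n in range(1600)]
bits = cvsd_encode(tone)
decoded = cvsd_decode(bits)
```

The decoder needs nothing but the bit stream, as it runs the same tracker as the encoder. The decoded waveform follows the input only approximately, which is why the quality of such coders is described as reasonable rather than high.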
Below are some sound samples of digitally encrypted speech,
recorded by Barry Wels from an Icom IC-H10SR radio.
The first file contains the original audio. The second file plays
the encrypted audio. Finally, the last file contains the resulting
audio once it has been decrypted.
© Crypto Museum. Created: Tuesday 04 August 2009. Last changed: Sunday, 11 December 2016 - 15:05 CET.