C# - Capture RTP Stream and send to speech recognition

dgreenheck picture dgreenheck · Apr 8, 2013 · Viewed 7.1k times · Source

What I am trying to accomplish:

  • Capture RTP Stream in C#
  • Forward that stream to the System.Speech.SpeechRecognitionEngine

I am creating a Linux-based robot which will take microphone input, send it Windows machine which will process the audio using Microsoft Speech Recognition and send the response back to the robot. The robot might be hundreds of miles from the server, so I would like to do this over the Internet.

What I have done so far:

  • Have the robot generate an RTP stream encoded in MP3 format (other formats available) using FFmpeg (the robot is running on a Raspberry Pi running Arch Linux)
  • Captured stream on the client computer using VLC ActiveX control
  • Found that the SpeechRecognitionEngine has the available methods:
    1. recognizer.SetInputToWaveStream()
    2. recognizer.SetInputToAudioStream()
    3. recognizer.SetInputToDefaultAudioDevice()
  • Looked at using JACK to send the output of the app to line-in, but was completely confused by it.

What I need help with:

I'm stuck on how to actually send the stream from VLC to the SpeechRecognitionEngine. VLC doesn't expose the stream at all. Is there a way I can just capture a stream and pass that stream object to the SpeechRecognitionEngine? Or is RTP not the solution here?

Thanks in advance for your help.

Answer

dgreenheck picture dgreenheck · Apr 10, 2013

After much work, I finally got Microsoft.SpeechRecognitionEngine to accept a WAVE audio stream. Here's the process:

On the Pi, I have ffmpeg running. I stream the audio using this command

ffmpeg -ac 1 -f alsa -i hw:1,0 -ar 16000 -acodec pcm_s16le -f rtp rtp://XXX.XXX.XXX.XXX:1234

On the the server side, I create a UDPClient and listen on port 1234. I receive the packets on a separate thread. First, I strip off the RTP header (header format explained here) and write the payload to a special stream. I had to use the SpeechStreamer class described in Sean's response in order for the SpeechRecognitionEngine to work. It wasn't working with a standard Memory Stream.

The only thing I had to do on the speech recognition side set the input to the audio stream instead of the default audio device.

recognizer.SetInputToAudioStream( rtpClient.AudioStream,
    new SpeechAudioFormatInfo(WAVFile.SAMPLE_RATE, AudioBitsPerSample.Sixteen, AudioChannel.Mono));

I haven't done extensive testing on it (i.e. letting it stream for days and seeing if it still works), but I'm able to save off the audio sample in the SpeechRecognized and it sounds great. I'm using a sample rate of 16 KHz. I might bump it down to 8 KHz to reduce the amount of data transfer, but I will worry about that once it becomes a problem.

I should also mention, the response is extremely fast. I can speak an entire sentence and get a response in less than a second. The RTP connection seems to add very little overhead to the process. I'll have to try a benchmark and compare it with just using MIC input.

EDIT: Here is my RTPClient class.

    /// <summary>
    /// Connects to an RTP stream and listens for data
    /// </summary>
    public class RTPClient
    {
        private const int AUDIO_BUFFER_SIZE = 65536;

        private UdpClient client;
        private IPEndPoint endPoint;
        private SpeechStreamer audioStream;
        private bool writeHeaderToConsole = false;
        private bool listening = false;
        private int port;
        private Thread listenerThread; 

        /// <summary>
        /// Returns a reference to the audio stream
        /// </summary>
        public SpeechStreamer AudioStream
        {
            get { return audioStream; }
        }
        /// <summary>
        /// Gets whether the client is listening for packets
        /// </summary>
        public bool Listening
        {
            get { return listening; }
        }
        /// <summary>
        /// Gets the port the RTP client is listening on
        /// </summary>
        public int Port
        {
            get { return port; }
        }

        /// <summary>
        /// RTP Client for receiving an RTP stream containing a WAVE audio stream
        /// </summary>
        /// <param name="port">The port to listen on</param>
        public RTPClient(int port)
        {
            Console.WriteLine(" [RTPClient] Loading...");

            this.port = port;

            // Initialize the audio stream that will hold the data
            audioStream = new SpeechStreamer(AUDIO_BUFFER_SIZE);

            Console.WriteLine(" Done");
        }

        /// <summary>
        /// Creates a connection to the RTP stream
        /// </summary>
        public void StartClient()
        {
            // Create new UDP client. The IP end point tells us which IP is sending the data
            client = new UdpClient(port);
            endPoint = new IPEndPoint(IPAddress.Any, port);

            listening = true;
            listenerThread = new Thread(ReceiveCallback);
            listenerThread.Start();

            Console.WriteLine(" [RTPClient] Listening for packets on port " + port + "...");
        }

        /// <summary>
        /// Tells the UDP client to stop listening for packets.
        /// </summary>
        public void StopClient()
        {
            // Set the boolean to false to stop the asynchronous packet receiving
            listening = false;
            Console.WriteLine(" [RTPClient] Stopped listening on port " + port);
        }

        /// <summary>
        /// Handles the receiving of UDP packets from the RTP stream
        /// </summary>
        /// <param name="ar">Contains packet data</param>
        private void ReceiveCallback()
        {
            // Begin looking for the next packet
            while (listening)
            {
                // Receive packet
                byte[] packet = client.Receive(ref endPoint);

                // Decode the header of the packet
                int version = GetRTPHeaderValue(packet, 0, 1);
                int padding = GetRTPHeaderValue(packet, 2, 2);
                int extension = GetRTPHeaderValue(packet, 3, 3);
                int csrcCount = GetRTPHeaderValue(packet, 4, 7);
                int marker = GetRTPHeaderValue(packet, 8, 8);
                int payloadType = GetRTPHeaderValue(packet, 9, 15);
                int sequenceNum = GetRTPHeaderValue(packet, 16, 31);
                int timestamp = GetRTPHeaderValue(packet, 32, 63);
                int ssrcId = GetRTPHeaderValue(packet, 64, 95);

                if (writeHeaderToConsole)
                {
                    Console.WriteLine("{0} {1} {2} {3} {4} {5} {6} {7} {8}",
                        version,
                        padding,
                        extension,
                        csrcCount,
                        marker,
                        payloadType,
                        sequenceNum,
                        timestamp,
                        ssrcId);
                }

                // Write the packet to the audio stream
                audioStream.Write(packet, 12, packet.Length - 12);
            }
        }

        /// <summary>
        /// Grabs a value from the RTP header in Big-Endian format
        /// </summary>
        /// <param name="packet">The RTP packet</param>
        /// <param name="startBit">Start bit of the data value</param>
        /// <param name="endBit">End bit of the data value</param>
        /// <returns>The value</returns>
        private int GetRTPHeaderValue(byte[] packet, int startBit, int endBit)
        {
            int result = 0;

            // Number of bits in value
            int length = endBit - startBit + 1;

            // Values in RTP header are big endian, so need to do these conversions
            for (int i = startBit; i <= endBit; i++)
            {
                int byteIndex = i / 8;
                int bitShift = 7 - (i % 8);
                result += ((packet[byteIndex] >> bitShift) & 1) * (int)Math.Pow(2, length - i + startBit - 1);
            }
            return result;
        }
    }