Microsoft Speaker Recognition API

Gone are the days when a password alone was enough protection for your accounts. And who can remember a different password for every website they log in to? It is time to rethink how authentication can become both more secure and a better user experience.

Over the past year we have seen fingerprint and, more recently, face authentication become more popular. In this blog we are going to dive into how you can add voice authentication to an application.

Microsoft Speaker Recognition API

Around 2016, Microsoft released the Speaker Recognition API. This API has two features: identifying individual speakers and using speech as a means of authentication.

Create a voice profile

For voice authentication to work you need a voice profile. This profile is created by recording a sentence; modern techniques slice this audio into thousands of readings per second and extract many parameters, such as tone, pitch, and the size of a person's larynx. The result is a mathematical representation of your voice, also known as a voiceprint or "voice hash".

There are two types of systems: text-dependent and text-independent. For the authentication part of the Microsoft Speaker Recognition API you can only use the text-dependent variant.

Step 1: Load the verification phrases

The profile you need to create is called a verification profile. Before you create one, you have to load the available verification sentences from the API. This is done with a simple call.

public async Task<List<VerificationPhrase>> GetVerificationPhrases()
{
    // _subscriptionKey holds your Cognitive Services key and _locale a
    // locale such as "en-us"; JsonConvert comes from Newtonsoft.Json.
    var client = new HttpClient();
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

    var uri = $"https://westus.api.cognitive.microsoft.com/spid/v1.0/verificationPhrases?locale={_locale}";

    var response = await client.GetAsync(uri);

    string responseText = await response.Content.ReadAsStringAsync();

    return JsonConvert.DeserializeObject<List<VerificationPhrase>>(responseText);
}

public class VerificationPhrase
{
    public string Phrase { get; set; }
}

JSON response

[{
    "phrase": "i am going to make him an offer he cannot refuse"
  },
  {
    "phrase": "houston we have had a problem"
  },
  {
    "phrase": "my voice is my passport verify me"
  },
  {
    "phrase": "apple juice tastes funny after toothpaste"
  },
  {
    "phrase": "you can get in without your password"
  },
  {
    "phrase": "you can activate security system now"
  },
  {
    "phrase": "my voice is stronger than passwords"
  },
  {
    "phrase": "my password is not your business"
  },
  {
    "phrase": "my name is unknown to you"
  },
  {
    "phrase": "be yourself everyone else is already taken"
  }
]

Step 2: Create a verification profile

Before you can train a verification profile you need to create an empty profile. This is done with a single call to the verificationProfiles endpoint. Note that one subscription can create at most 1,000 speaker verification/identification profiles. The create call returns a GUID that identifies the profile; you will need this identifier later on when you authenticate/verify the user.

public async Task<CreateVerificationProfileModel> CreateVerificationProfile()
{
    var client = new HttpClient();
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

    var uri = "https://westus.api.cognitive.microsoft.com/spid/v1.0/verificationProfiles";

    HttpResponseMessage response;

    // The request body only needs the locale of the new profile
    byte[] byteData = Encoding.UTF8.GetBytes("{\"locale\":\"en-us\"}");

    using (var content = new ByteArrayContent(byteData))
    {
        content.Headers.ContentType = new MediaTypeHeaderValue("application/json");
        response = await client.PostAsync(uri, content);

        string responseText = await response.Content.ReadAsStringAsync();

        return JsonConvert.DeserializeObject<CreateVerificationProfileModel>(responseText);
    }
}

public class CreateVerificationProfileModel
{
    public Guid VerificationProfileId { get; set; }
}
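
Putting these two calls together, creating a profile and letting the user pick a phrase could look like the sketch below (which phrase to present, and how, is up to your application):

var profile = await CreateVerificationProfile();
var phrases = await GetVerificationPhrases();

Console.WriteLine($"Your verification profile id: {profile.VerificationProfileId}");
Console.WriteLine($"Please read aloud: \"{phrases[0].Phrase}\"");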

Step 3: Create Enrollment / Train profile

Next you have to enroll the sentence to train the profile. This is done by repeating the sentence three times and sending the recordings one by one to the API. Each recording is sent as an audio stream that has to be in a specific format.

The audio file format must meet the following requirements.

Setting          Value
Container        WAV
Encoding         PCM
Rate             16 kHz
Sample Format    16 bit
Channels         Mono
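
How you record the audio is up to you. Below is a minimal sketch of capturing a phrase in the required format; the NAudio NuGet package and the PhraseRecorder class are my own assumptions, not part of the Speaker Recognition API.

public class PhraseRecorder
{
    // Records from the default microphone into an in-memory WAV file
    // in the required format: 16 kHz, 16 bit, mono PCM.
    public byte[] Record(TimeSpan duration)
    {
        var ms = new MemoryStream();

        using (var waveIn = new NAudio.Wave.WaveInEvent { WaveFormat = new NAudio.Wave.WaveFormat(16000, 16, 1) })
        using (var writer = new NAudio.Wave.WaveFileWriter(ms, waveIn.WaveFormat))
        {
            waveIn.DataAvailable += (s, e) => writer.Write(e.Buffer, 0, e.BytesRecorded);
            waveIn.StartRecording();
            Thread.Sleep(duration); // record while the user reads the phrase
            waveIn.StopRecording(); // a production recorder would wait for RecordingStopped
        }

        // ToArray still works after the writer has closed the MemoryStream
        return ms.ToArray();
    }
}

With a recording in hand, each repetition is sent to the enroll endpoint of the profile.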
public async Task<EnrollmentResult> CreateEnrollment(Guid verificationProfileId, Stream audioStream)
{
    byte[] bytes = ReadFully(audioStream);

    var client = new HttpClient();

    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

    var uri = $"https://westus.api.cognitive.microsoft.com/spid/v1.0/verificationProfiles/{verificationProfileId}/enroll";

    HttpResponseMessage response;

    using (var content = new ByteArrayContent(bytes))
    {
        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream"); // the body is the raw WAV audio, not JSON
        response = await client.PostAsync(uri, content);

        string responseStr = await response.Content.ReadAsStringAsync();
        return JsonConvert.DeserializeObject<EnrollmentResult>(responseStr);
    }
}

private byte[] ReadFully(Stream input)
{
    byte[] buffer = new byte[16 * 1024];
    using (MemoryStream ms = new MemoryStream())
    {
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            ms.Write(buffer, 0, read);
        }
        return ms.ToArray();
    }
}

public class EnrollmentResult
{
    public string EnrollmentStatus { get; set; }

    public int EnrollmentsCount { get; set; }

    public int RemainingEnrollments { get; set; }

    public string Phrase { get; set; }
}

JSON response

{
  "enrollmentStatus" : "Enrolled", // [Enrolled | Enrolling | Training]
  "enrollmentsCount":0,
  "remainingEnrollments" : 0,
  "phrase" : "Recognized verification phrase"
}
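
Since the profile needs three enrollments before it is trained, a simple loop can drive the process. A minimal sketch, assuming the PhraseRecorder and CreateEnrollment methods shown above:

public async Task EnrollProfile(Guid verificationProfileId)
{
    var recorder = new PhraseRecorder();
    EnrollmentResult result;

    do
    {
        // The user reads the chosen verification phrase out loud
        byte[] audio = recorder.Record(TimeSpan.FromSeconds(5));
        result = await CreateEnrollment(verificationProfileId, new MemoryStream(audio));

        Console.WriteLine($"{result.EnrollmentStatus}, remaining: {result.RemainingEnrollments}");
    }
    while (result.RemainingEnrollments > 0);
}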

Step 4: Verification of the user

To authenticate/verify a user, the user has to read the same sentence as during enrollment. This recording is sent as a byte array, together with the verification profile id, to the API.

The API sends back the result in JSON format, telling whether the user is verified, along with a confidence score. After this, it is up to your application to handle the next step.

public async Task<VerifyResult> VerifyProfile(Guid verificationProfileId, Stream audioStream)
{
    byte[] bytes = ReadFully(audioStream);
    var client = new HttpClient();

    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);

    var uri = $"https://westus.api.cognitive.microsoft.com/spid/v1.0/verify?verificationProfileId={verificationProfileId}";

    HttpResponseMessage response;

    using (var content = new ByteArrayContent(bytes))
    {
        content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream"); // the body is the raw WAV audio, not JSON
        response = await client.PostAsync(uri, content);

        string responseStr = await response.Content.ReadAsStringAsync();
        return JsonConvert.DeserializeObject<VerifyResult>(responseStr);
    }
}

public class VerifyResult
{
    // Newtonsoft.Json matches property names case-insensitively,
    // so PascalCase properties bind to the lowercase JSON fields.
    public string Result { get; set; } // [Accept | Reject]

    public string Confidence { get; set; } // [Low | Normal | High]

    public string Phrase { get; set; }
}

JSON response

{
  "result" : "Accept", // [Accept | Reject]
  "confidence" : "Normal", // [Low | Normal | High]
  "phrase": "recognized phrase"
}
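
How strict you are with the result is an application decision. A minimal sketch, reusing the PhraseRecorder and the profile id from the earlier steps; rejecting Low confidence matches is my own suggestion, not an API requirement:

byte[] audio = recorder.Record(TimeSpan.FromSeconds(5));
var result = await VerifyProfile(verificationProfileId, new MemoryStream(audio));

bool verified = result.Result == "Accept" && result.Confidence != "Low";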

Use it as a second factor

The Speaker Recognition API is still in preview and its behavior can change overnight. It can also still make mistakes, so don't use it as the only authentication method for your application. Look at it as a fun second factor that improves the security of your application.

A good scenario is to combine it with Azure AD B2C. You can easily add the verification profile id to the user's claims in the directory; when a user successfully authenticates you can read the claim, record the sentence, send the audio to the API, and handle the response, as sketched below.
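
As a rough sketch (the custom claim name and the ASP.NET Core controller context are assumptions on my part):

// Inside an async controller action, after the B2C sign-in.
// "extension_VerificationProfileId" is a hypothetical custom attribute.
var claim = User.FindFirst("extension_VerificationProfileId")?.Value;

if (Guid.TryParse(claim, out var profileId))
{
    byte[] audio = recorder.Record(TimeSpan.FromSeconds(5));
    var result = await VerifyProfile(profileId, new MemoryStream(audio));
    // Accept or reject the second factor based on result.Result
}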

Do you have any questions? Feel free to leave a comment or find me on Twitter.