SAAS ideas with code: Let’s build a voice identification API.

Carlos Ortiz Urshela
7 min read · May 25, 2023

See how easy it is nowadays to build a voice identification API that you can use to start your own SAAS.

Hi, my name is Carlos Ortiz. I’m the co-founder and CTO of an online English academy for software developers in Latin America and Spain. I’ve written this article to share my experience building a human voice identification feature that will support two items in our product backlog: automated onboarding of new students and pronunciation assessment tests.

Fintech companies and banks commonly use voice identification for user onboarding and for security checks on critical operations. So, I think sharing this experience and code could be a useful source of insights for those who want to implement similar functionality in their platforms or explore business opportunities by building their own SAAS platform.

The motivation behind building a human voice identification API.

In our academy, students must take a pronunciation assessment at the end of each learning cycle (usually after completing two or three units) so that we can evaluate their speaking progress.

Automating the pronunciation assessment involves a web interface where the platform gives the student a short paragraph tailored to their current level (starter, pre-intermediate, intermediate, or upper-intermediate), and the student sends back a voice recording of themselves reading that paragraph. The platform then evaluates the recording and returns a pronunciation assessment result to the student.

So, one fundamental requirement for automating the pronunciation assessment is making sure that the voice on the recording is the student’s voice. We trust that students don’t want to cheat by asking a friend with better speaking skills to take their pronunciation test. However, as we plan to offer this service to companies and partners, it makes sense for the first step of the assessment workflow to match the voice in the recording to the voice of the actual student or user.

The approach.

My initial approach to the voice identification problem was to use an encoder to transform speech signals (usually mel spectrograms extracted from a voice recording) into meaningful representations, and a decoder to reconstruct speaker-specific features. In other words, an encoder-decoder model.

[Figure: Mel spectrogram]

The power of encoder-decoder models lies in their ability to learn speaker embeddings, compact yet powerful representations of individual voice characteristics. By analyzing a corpus of labeled speech data, these models capture subtle nuances in pitch, cadence, and pronunciation, creating a digital fingerprint for each speaker. The model can accurately identify and differentiate speakers with this information, even in complex and noisy environments.
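To make the "digital fingerprint" idea concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The vectors are made-up 4-dimensional toy values (real speaker embeddings have hundreds of dimensions), and the function name is my own choice for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" with invented values, for illustration only.
alice_enrolled = [0.9, 0.1, 0.3, 0.2]
alice_new = [0.8, 0.2, 0.3, 0.1]
bob_enrolled = [0.1, 0.9, 0.2, 0.7]

same_speaker = cosine_similarity(alice_enrolled, alice_new)
different_speaker = cosine_similarity(alice_enrolled, bob_enrolled)

# Two recordings of the same voice should score much higher
# than recordings of two different voices.
print(f"same speaker:      {same_speaker:.3f}")
print(f"different speaker: {different_speaker:.3f}")
```

The closer the score is to 1, the more the two vectors point in the same direction, which is the property a voice identification system relies on.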

However, due to time and resource constraints, after some tests of a basic Encoder-Decoder model, I decided to leverage open-source libraries to build an effective solution quickly. After hours of research, I found I could use the spkrec-xvect-voxceleb model from the speechbrain library to extract embeddings from voice recordings and then create a powerful human voice identification solution.

The API features.

To create a practical solution, the API has to provide at least the following features:

User registration: The job of this endpoint is to record user voice features and user metadata in the platform. The endpoint receives two parameters:

  1. A voice file (usually a wav or mp3 file) containing the user’s voice reading a text paragraph. The API will use this recording as ground truth for future speaker identifications.
  2. Metadata describing the user (a JSON structure). For example, you can send the user name, email, or ID.

The API uses the spkrec-xvect-voxceleb model to extract the embeddings from the voice recording file and then saves them, together with the user metadata, into a vector database. In this case, I’m using Qdrant, a great open-source vector database that is becoming a big player in the semantic search market.

A nice thing about Qdrant is that it allows you to query a vector collection using a dense vector (embeddings) or keywords (metadata).

Speaker validation: The job of this endpoint is to identify the speaker from a voice recording. You can implement this feature in at least two ways:

  1. Receive an audio file as a parameter and return a user-id and a score (from 0 to 1) indicating how similar the speaker’s voice in the audio file is to an existing user’s ground truth.
  2. Receive a voice recording and a user-id. The API will return a similarity score indicating how similar the speaker’s voice from the audio file is to the ground truth voice of the user-id passed as a reference.
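The difference between the two options can be sketched in a few lines. This is only an illustration under my own assumptions: the embeddings live in a plain dictionary instead of a vector database, the vectors are toy values, and the names `ENROLLED`, `identify`, and `verify` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical in-memory store: user_id -> enrolled embedding (toy values).
ENROLLED = {
    "lucia": [0.9, 0.1, 0.3],
    "mateo": [0.1, 0.8, 0.5],
}

def identify(query, threshold):
    """Option 1 (one-to-many): return (user_id, score) of the best match, or None."""
    best_id, best_score = None, -1.0
    for user_id, reference in ENROLLED.items():
        score = cosine_similarity(query, reference)
        if score > best_score:
            best_id, best_score = user_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score

def verify(query, user_id, threshold):
    """Option 2 (one-to-one): compare against one claimed user; return (accepted, score)."""
    reference = ENROLLED.get(user_id)
    if reference is None:
        raise KeyError(f"unknown user: {user_id}")
    score = cosine_similarity(query, reference)
    return score >= threshold, score

recording = [0.85, 0.15, 0.3]  # toy embedding of a new recording
print(identify(recording, threshold=0.9))
print(verify(recording, "mateo", threshold=0.9))
```

Option 1 searches every enrolled user, so it suits a "who is this?" question. Option 2 only answers "is this the user they claim to be?", which is usually what an assessment or security check needs.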

Show me the code.

Before delving into the source code, I’ll list the libraries and tools I used to build the API:

  • PyTorch (with torchaudio).
  • speechbrain.
  • Qdrant (qdrant-client).
  • FastAPI.

The database

The vector database is a fundamental piece of the platform. As I said before, I chose Qdrant for this solution. Qdrant can run in different modes: in-process (memory or a persistent file system database) or as a remote server (a docker container or Qdrant cloud).
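As a rough sketch of those modes, the helper below maps a mode name to the keyword arguments you would pass to QdrantClient. The helper and its mode names are hypothetical (my own, for illustration); the `location`, `path`, `url`, and `api_key` arguments are the ones the qdrant-client constructor accepts, and 6333 is Qdrant's default REST port.

```python
def qdrant_client_kwargs(mode, location=None, api_key=None):
    """Map a deployment mode to keyword arguments for QdrantClient.

    The mode names are an illustrative convention, not part of Qdrant.
    """
    if mode == "memory":
        # In-process and non-persistent: handy for tests.
        return {"location": ":memory:"}
    if mode == "file":
        # In-process, persisted to a local directory.
        return {"path": location or "./db"}
    if mode == "server":
        # Remote server: a Docker container or Qdrant Cloud.
        kwargs = {"url": location or "http://localhost:6333"}
        if api_key:
            kwargs["api_key"] = api_key
        return kwargs
    raise ValueError(f"unknown mode: {mode}")

# Usage (requires the qdrant-client package):
#   from qdrant_client import QdrantClient
#   client = QdrantClient(**qdrant_client_kwargs("file"))
print(qdrant_client_kwargs("memory"))
print(qdrant_client_kwargs("file"))
print(qdrant_client_kwargs("server"))
```

Keeping the mode in configuration like this makes it possible to develop against a local file and switch to a server later without touching the rest of the code.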

The following is the code for a vector database wrapper. (vector_database.py)

from qdrant_client import QdrantClient
from qdrant_client.http import models
import uuid

# Open a local, file-backed Qdrant database (no server needed).
# For a throwaway in-memory database, use QdrantClient(":memory:") instead.
client = QdrantClient(path="./db")

def create_collection(collection_name, dimension):
    # Create (or drop and recreate) a collection that compares vectors by cosine similarity
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=dimension,
            distance=models.Distance.COSINE,
        ),
    )

def upsert_vectors(collection_name, vector_list: list, references: list[dict]):
    # Pair each embedding with its metadata payload and store them as points
    points = []
    for embedding, reference in zip(vector_list, references):
        points.append(models.PointStruct(
            id=str(uuid.uuid4()),  # Qdrant accepts UUID strings as point ids
            payload=reference,
            vector=embedding,
        ))

    client.upsert(
        collection_name=collection_name,
        points=points,
    )

# Perform a search and return the top_k closest points
def query_collection(collection_name: str, query_embeddings: list, top_k=3):
    search_results = client.search(collection_name=collection_name,
                                   query_vector=query_embeddings,
                                   limit=top_k)
    return search_results

Speaker feature extraction.

Another cornerstone feature of the platform is creating a digital fingerprint for each speaker’s voice (aka speaker voice embeddings).

The following is the code that extracts the embeddings from a voice recording. Here I’m using the speechbrain/spkrec-xvect-voxceleb model, which generates a 512-dimensional embedding vector. (voice_utils.py)

import torchaudio
from speechbrain.pretrained import EncoderClassifier
import torch
import torch.nn.functional as F

model_name = 'speechbrain/spkrec-xvect-voxceleb'

# Run on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
classifier = EncoderClassifier.from_hparams(source=model_name,
                                            run_opts={"device": device},
                                            )

def extract_voice_embbeddings(voice_file):
    # Load the waveform; the recording must be sampled at 16 kHz
    signal, fs = torchaudio.load(voice_file)
    assert fs == 16000, fs
    with torch.no_grad():
        embeddings = classifier.encode_batch(signal)
        # L2-normalize along the embedding dimension
        embeddings = F.normalize(embeddings, dim=2)
        # Drop the singleton dimensions and move the result to the CPU
        embeddings = embeddings.squeeze().cpu().numpy()

    return embeddings.tolist()
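Since the function above asserts a 16 kHz sample rate, it can help to reject unsuitable files before they reach the model. The following is a minimal sketch using only the standard library's wave module, which reads PCM WAV files. The helper names are hypothetical, and the mono check is my own assumption: torchaudio.load returns one row per channel, so a stereo file would be encoded as two separate signals.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # the rate asserted in extract_voice_embbeddings

def check_wav_format(path, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (ok, reason) for a PCM WAV file, without loading the model."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        return False, f"expected {expected_rate} Hz, got {rate} Hz"
    if channels != 1:
        # Assumption: only mono recordings are accepted.
        return False, f"expected mono audio, got {channels} channels"
    return True, "ok"

def write_silent_wav(path, rate, channels):
    """Write 0.1 s of 16-bit silence; used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (2 * channels * (rate // 10)))

demo_dir = tempfile.mkdtemp()
good_file = os.path.join(demo_dir, "good.wav")
bad_file = os.path.join(demo_dir, "bad.wav")
write_silent_wav(good_file, 16000, 1)
write_silent_wav(bad_file, 44100, 2)

print(check_wav_format(good_file))
print(check_wav_format(bad_file))
```

A check like this lets the API answer with a clear validation message instead of failing on the assertion inside the embedding function.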

The app coordinator.

This component provides the cross-cutting features that the REST API uses. (app_coordinator.py)

import vector_database as vectordb
from voice_utils import extract_voice_embbeddings


def create_collection(name: str, size: int):
    vectordb.create_collection(name, size)

def index_user_voice_file(collection: str, voice_file: str, metadata: dict):
    voice_emb = extract_voice_embbeddings(voice_file)
    vectordb.upsert_vectors(collection, [voice_emb], [metadata])

def search_collection(collection: str, voice_file) -> list:
    '''
    Search for the closest vectors in the collection using the voice embeddings
    extracted from an input voice file.
    '''
    query_embeddings = extract_voice_embbeddings(voice_file)
    return vectordb.query_collection(collection, query_embeddings)

Initializing the vector database.

Remember that we need to initialize the vector database before using the API. For that, we must call the create_collection method from the vector_database module.

Set the collection name parameter to your preferred name; I’m using voxidr in this case. For the dimension parameter, make sure to set 512, which is the size of the embedding vectors generated by the spkrec-xvect-voxceleb model.

create_collection('voxidr',512)

The API

Finally, here is the API. It is implemented using FastAPI and exposes two operations: index-speaker and validate-speaker. (api.py)

import os
import logging
import uuid
import json

import uvicorn
from fastapi import FastAPI, File, UploadFile, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, parse_obj_as

from app_coordinator import index_user_voice_file, search_collection

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class UserMetadata(BaseModel):
    user_id: str
    user_email: str

UPLOAD_DIRECTORY = "uploads"
COLLECTION = 'voxidr'
THRESHOLD = 0.97  # minimum similarity score required to accept a match
DEFAULT_EXCEPTION_MSG = 'An unexpected error occurred, please try again.'
USR_NOT_FOUND_MESSAGE = "No user was found using the current audio file"

async def save_local_file(sound_file: UploadFile):
    file_name = sound_file.filename
    file_extension = file_name.split(".")[-1]
    tmp_file_name = str(uuid.uuid4()) + '.wav'

    if file_extension.lower() != "wav":
        raise Exception("Invalid file format. Only WAV files are allowed.")

    file_path = os.path.join(UPLOAD_DIRECTORY, tmp_file_name)

    try:
        os.makedirs(UPLOAD_DIRECTORY, exist_ok=True)  # Create the upload directory if it doesn't exist
        with open(file_path, "wb") as file:
            file.write(sound_file.file.read())
    except Exception as e:
        raise Exception('Unexpected exception loading sound file.') from e

    return file_path


@app.post("/index-speaker")
async def index_speaker(user_metadata: str = Form(...),
                        sound_file: UploadFile = File(...)):
    try:
        local_wav_file = await save_local_file(sound_file)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

    # The metadata arrives as a JSON string in a form field
    user_data = parse_obj_as(UserMetadata, json.loads(user_metadata))

    try:
        index_user_voice_file(COLLECTION, local_wav_file, user_data.dict())
    except Exception as e:
        logging.error(e)
        raise HTTPException(status_code=500, detail=DEFAULT_EXCEPTION_MSG)

    return {"message": 'User voice and metadata successfully registered.'}

@app.post('/validate-speaker')
async def validate_speaker(sound_file: UploadFile = File(...)):
    try:
        local_wav_file = await save_local_file(sound_file)
    except Exception as e:
        # Report upload problems the same way index_speaker does
        raise HTTPException(status_code=500, detail=str(e))

    result = search_collection(COLLECTION, local_wav_file)

    if result is None or len(result) == 0:
        raise HTTPException(status_code=404, detail=USR_NOT_FOUND_MESSAGE)
    else:
        # Results are ordered by similarity, so the first one is the best match
        top_candidate = result[0]
        if top_candidate.score >= THRESHOLD:
            return {"score": top_candidate.score, "user_id": top_candidate.payload['user_id']}
        else:
            raise HTTPException(status_code=404, detail=USR_NOT_FOUND_MESSAGE)

# Run the FastAPI app
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

This is a sample response obtained after invoking the validate-speaker operation.

{
  "score": 0.9731618262182195,
  "user_id": "lucia"
}
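For completeness, here is a hedged sketch of a client for the two operations. It assumes the API is reachable at a base URL you supply and uses the third-party requests library; the function names are my own. The detail worth noticing is that index-speaker expects the metadata as a JSON string in a form field named user_metadata, next to the uploaded sound_file, and that the upload's file name must end in .wav.

```python
import json

def build_index_form(user_id, user_email):
    """Form fields for POST /index-speaker: the metadata travels as a JSON string."""
    return {"user_metadata": json.dumps({"user_id": user_id, "user_email": user_email})}

def index_speaker(base_url, wav_path, user_id, user_email):
    """Register a user's voice; returns the decoded JSON response."""
    import requests  # third-party; imported here so the helper above has no dependencies
    with open(wav_path, "rb") as wav:
        response = requests.post(
            f"{base_url}/index-speaker",
            data=build_index_form(user_id, user_email),
            # The server checks the extension of the uploaded file name.
            files={"sound_file": ("recording.wav", wav, "audio/wav")},
        )
    response.raise_for_status()
    return response.json()

def validate_speaker(base_url, wav_path):
    """Return {"score": ..., "user_id": ...}, or None when no user matches (HTTP 404)."""
    import requests
    with open(wav_path, "rb") as wav:
        response = requests.post(
            f"{base_url}/validate-speaker",
            files={"sound_file": ("recording.wav", wav, "audio/wav")},
        )
    if response.status_code == 404:
        return None
    response.raise_for_status()
    return response.json()

# Example (needs a running server, e.g. at http://localhost:8000):
#   index_speaker("http://localhost:8000", "lucia.wav", "lucia", "lucia@example.com")
#   print(validate_speaker("http://localhost:8000", "lucia_test.wav"))
print(build_index_form("lucia", "lucia@example.com"))
```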

We have reached the end of the post. I hope this article has been helpful. Feel free to DM me if you want more details or would like my help developing an MVP or prototype. Please leave a comment if you have any questions.

Thanks for reading!

Stay tuned for more content about NLP, computer vision, system design, and AI in general. I’m the CTO of an engineering services company called Klever and co-founder of Stride (www.stride.com.co), an English academy for software developers and IT professionals. You can visit our page and follow us on LinkedIn too.
