0 votes · 0 answers · 121 views

Getting chunk level output with start and end timestamps with Whisper

I am using the Whisper3 model to transcribe several audio files. However, the output I am getting is in the form of a tensor. I would like to obtain text chunks with corresponding start and end ...
asked by Meghana S
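Getting a raw tensor back suggests the model is being invoked directly rather than through a pipeline. A minimal sketch, assuming the transformers ASR pipeline, whose return_timestamps=True option makes it return a dict with a "chunks" list; the model name and audio path are assumptions, and the pipeline call is left commented out, with `result` below a hypothetical example of that structure:

```python
# Sketch: transformers' ASR pipeline with return_timestamps=True returns
# {"text": ..., "chunks": [{"timestamp": (start, end), "text": ...}, ...]}
# instead of a raw tensor. The call itself is commented out (needs the
# package, a model download, and an audio file); `result` is a
# hypothetical example of the returned structure.

# from transformers import pipeline
# asr = pipeline("automatic-speech-recognition",
#                model="openai/whisper-large-v3", return_timestamps=True)
# result = asr("audio.wav")

result = {
    "text": "hello world how are you",
    "chunks": [
        {"timestamp": (0.0, 1.2), "text": "hello world"},
        {"timestamp": (1.2, 2.5), "text": "how are you"},
    ],
}

def chunks_with_times(result):
    """Flatten pipeline output into (start, end, text) tuples."""
    return [(c["timestamp"][0], c["timestamp"][1], c["text"])
            for c in result["chunks"]]

for start, end, text in chunks_with_times(result):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```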
0 votes · 0 answers · 39 views

TypeError in quartznet.transcribe() from NeMo

First time using this, so I don't know what's wrong. files = ["/content/harvard.wav"] # Assuming you have defined this list elsewhere raw_text = '' text = '' for fname, transcription in zip(files, ...
asked by Ninad Shegokar
0 votes · 0 answers · 36 views

Model works well on input from microphone but terribly after it's streamed from websockets

I am building a voice-enabled assistant on the web. I am following this tutorial using Huggingface and Python. It uses the system microphone as the audio source, but I am using the microphone using ...
asked by Nimrod Sadeh
0 votes · 0 answers · 141 views

Whisper Inference

Why, in the transcribe stage, do we remove N_FRAMES from the mel, and why does the loop over mel_segment not take the last segment if it's less than 3000 frames? Let's suppose that the mel = [80, 4100]; first mel ...
asked by AbdElRhaman Fakhrygmailcom
0 votes · 0 answers · 90 views

How to load a split of the common_voice "spanish" dataset from HF into Colab?

I am trying to load just 10% of the "spanish" dataset from Colab since it is too large; however, it still downloads the complete dataset from HuggingFace. I have tried two ways, by ...
asked by Carlos Axel García Vega
0 votes · 0 answers · 335 views

Issue in SpeechBrain with Hugging Face

I'm working on Colab and I got this error while I was using speechbrain.pretrained import EncoderDecoderASR. This is the data I used: minds_14 = load_dataset("PolyAI/minds14", "en-US" ...
asked by Ylouloo
0 votes · 0 answers · 262 views

Vosk transcription not loading; excessively long loading time that does not conclude

I'm trying to do ASR speech-to-text transcription using the Vosk API. I have downloaded all the required models and imported the required modules; however, my transcription simply does not load. Here ...
asked by Beginner_coder
0 votes · 1 answer · 121 views

Mozilla DeepSpeech: "from deepspeech import Model" not working

I'm trying to use Mozilla DeepSpeech to transcribe text; however, I'm running into issues importing the Model module. Here is my code: from deepspeech.model import model model_file_path='deepspeech-0.9.3-...
asked by Beginner_coder
1 vote · 0 answers · 12 views

Sphinxtrain: Unable to lookup word that exists in the dictionary

I'm adapting a Sphinx model for Brazilian Portuguese with my own data by following their tutorial and got stuck on the bw command in the "Accumulating observation counts" section. I made ...
asked by Ícaro Lorran
0 votes · 1 answer · 354 views

react-speech-recognition package not working

It's a simple React package that converts user audio to text. I installed the package and tried its basic code example, but it shows an error: "RecognitionManager.js:247 Uncaught ReferenceError: ...
asked by Aakash Saini
0 votes · 1 answer · 909 views

Why is Word Information Lost (WIL) calculated the way it is?

Word Information Lost (WIL) is a measure of the performance of an automated speech recognition (ASR) service (e.g. AWS Transcribe, Google Speech-to-Text, etc.) against a gold standard (usually human-...
asked by jayp
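For context, a minimal sketch of the usual definition, assuming the formulation from Morris et al.'s word-information measures: WIL = 1 − WIP, with WIP = (H / N_ref) · (H / N_hyp), where H is the number of hits (correctly recognized words); multiplying the two ratios penalizes deletions (via N_ref) and insertions (via N_hyp) symmetrically:

```python
def word_information_lost(hits, ref_len, hyp_len):
    """WIL = 1 - WIP, where WIP = (H / N_ref) * (H / N_hyp).

    hits    -- number of correctly recognized words (H)
    ref_len -- word count of the reference transcript (N_ref)
    hyp_len -- word count of the ASR hypothesis (N_hyp)
    """
    return 1.0 - (hits / ref_len) * (hits / hyp_len)

# Hypothetical example: reference "the cat sat on the mat" (6 words),
# hypothesis "the cat sat mat" (4 words), 4 hits.
# WIP = (4/6) * (4/4) ≈ 0.667, so WIL ≈ 0.333.
print(word_information_lost(4, 6, 4))

# A perfect transcript loses no information:
print(word_information_lost(6, 6, 6))  # → 0.0
```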
4 votes · 0 answers · 524 views

Speaker Diarization is disabled even for supported languages in Google Speech-to-Text API V2

I'm trying to use Google's Speech-to-Text v2 API for transcription and speaker diarization. Per this supported languages page, I should be able to create a Recognizer using the "long" model ...
asked by jayp
0 votes · 1 answer · 26 views

How does placing the output (word) labels on the initial transitions of the words in an FST lead to effective composition?

I am going through hbka.pdf (the WFST paper): https://cs.nyu.edu/~mohri/pub/hbka.pdf (a WFST figure is included for reference). Here the input label i, the output label o, and weight w of a transition are marked on the ...
asked by Anantha Krishnan
1 vote · 0 answers · 656 views

Fine-tuned Whisper-medium always predicts "" for all samples

I'm trying to fine-tune whisper-medium for the Korean language. Here is the tutorial that I followed. And here is my experiment setting: python==3.9.16 transformers==4.27.4 tokenizers==0.13.3 torch==2.0.0 ...
asked by 남영우
2 votes · 2 answers · 3k views

How to get the full list of Hugging Face models using Python?

Is there any way to get a list of models available on Hugging Face? E.g. for Automatic Speech Recognition (ASR).
asked by Neerav Mathur Jazzy
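A sketch of two common routes, under the assumption that the huggingface_hub package's list_models() and the Hub's public /api/models JSON endpoint behave as described; the network calls are left commented out, and only the URL construction is exercised:

```python
# Route 1 (assumption: the huggingface_hub client library is installed):
# from huggingface_hub import list_models
# asr_models = list_models(filter="automatic-speech-recognition")

# Route 2 (assumption: the Hub's public JSON endpoint); building the query URL:
def models_query_url(task, limit=10):
    """Build a Hub API URL filtering models by pipeline tag."""
    return (f"https://huggingface.co/api/models"
            f"?pipeline_tag={task}&limit={limit}")

url = models_query_url("automatic-speech-recognition")
print(url)

# Fetching would then be, e.g.:
# import json, urllib.request
# models = json.load(urllib.request.urlopen(url))  # list of model dicts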
6 votes · 1 answer · 6k views

How to segment and transcribe an audio from a video into timestamped segments?

I want to segment a video transcript into chapters based on the content of each line of speech. The transcript would be used to generate a series of start and end timestamps for each chapter. This is ...
asked by nonsequiter
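The chaptering step can be sketched independently of any particular ASR service; a minimal sketch assuming word-level (start, end, text) tuples are already available (the tuples below are hypothetical), grouping them into segments capped at a maximum duration:

```python
def group_into_segments(words, max_len=30.0):
    """Group (start, end, text) word tuples into segments no longer than
    max_len seconds, returning (seg_start, seg_end, seg_text) tuples."""
    segments, current = [], []
    for start, end, text in words:
        # Start a new segment once adding this word would exceed max_len.
        if current and end - current[0][0] > max_len:
            segments.append((current[0][0], current[-1][1],
                             " ".join(w[2] for w in current)))
            current = []
        current.append((start, end, text))
    if current:
        segments.append((current[0][0], current[-1][1],
                         " ".join(w[2] for w in current)))
    return segments

# Hypothetical word timestamps with a long pause after the second word:
words = [(0.0, 0.4, "hello"), (0.5, 0.9, "world"),
         (31.0, 31.5, "next"), (31.6, 32.0, "chapter")]
print(group_into_segments(words))
# → [(0.0, 0.9, 'hello world'), (31.0, 32.0, 'next chapter')]
```

In practice the input tuples could come from any word- or chunk-timestamped transcript; splitting on silence gaps instead of a fixed cap is a one-line change to the condition.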