Evaluation Using Interface

Input:

  • Required Inputs:
    • input_audio: The audio file (URL or local path) containing the speech to be transcribed.
    • input_transcription: The text transcription to be evaluated for accuracy against the audio.

Output:

  • Score: A numeric score between 0 and 1, where 1 represents perfect transcription accuracy.
  • Reason: A detailed explanation of the transcription assessment.

Evaluation Using Python SDK

Click here to learn how to set up evaluation using the Python SDK.

Input:

  • Required Inputs:
    • input_audio: string - The file path or URL to the audio file containing the speech.
    • input_transcription: string - The text transcription to be evaluated for accuracy.
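
Both inputs can be sanity-checked locally before an evaluation is submitted. The sketch below is not part of the SDK: `validate_inputs` is a hypothetical helper, and it assumes that signed audio URLs (such as the one in the example call in this section) carry an `Expires` query parameter holding a Unix timestamp in seconds.

```python
import os
import time
from urllib.parse import parse_qs, urlparse


def validate_inputs(input_audio, input_transcription, now=None):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    if not input_transcription or not input_transcription.strip():
        problems.append("input_transcription is empty")

    parsed = urlparse(input_audio)
    if parsed.scheme in ("http", "https"):
        # Assumption: a signed URL exposes its expiry as `Expires=<unix seconds>`.
        expires = parse_qs(parsed.query).get("Expires", [""])[0]
        current = time.time() if now is None else now
        if expires.isdigit() and int(expires) < current:
            problems.append("input_audio URL has expired")
    elif not os.path.isfile(input_audio):
        # Anything that is not an http(s) URL is treated as a local path.
        problems.append("input_audio file not found")
    return problems
```

Calling `validate_inputs(audio, text)` before `evaluator.evaluate(...)` catches an expired link or a missing file early, so a low score is less likely to be caused by audio that could not be fetched.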

Output:

  • Score: Returns a float value between 0 and 1, where higher values indicate a more accurate transcription.
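  • Reason: Provides a detailed explanation of the transcription assessment.

Example:

# `evaluator` is the client created during SDK setup (see the setup guide linked above).
result = evaluator.evaluate(
    eval_templates="audio_transcription",
    inputs={
        # The audio URL below is signed and time-limited (note its `Expires` parameter);
        # substitute your own audio file or URL if it no longer resolves.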
        "input_audio": "https://datasets-server.huggingface.co/assets/MLCommons/peoples_speech/--/f10597c5d3d3a63f8b6827701297c3afdf178272/--/clean/train/0/audio/audio.wav?Expires=1752066071&Signature=H2iGF~4Acv79Y4iCYj4I0MS6JT2gpLB5PlS7nO0Jej-4BxbtOcFpquSz--zruWOx-0dxgtxsMj8XR9Mzuey7CrMraMm-b7hrc~td8vzyQETluA6j5giGzVLXYXCKP9HZ0QKiWOFgVpYGbWC~XteMAj2K0RIXM6vDHVLagFQvKgGxxeyZA3m4y7cf-ibgqi8z-18laAySxJs8cxXauOa-mm6lFs55QJSbB6RR6gHrMK8ZdQw3eYqVmQeg-Ul1IRcHQ6DmF1BnJejrAD8pwwkKizcEXe0fUavWKHJryC289fEXSlLkAxvSHF3W4001AD8Yp6zST9LsRzjMFxJoTQbRCQ__&Key-Pair-Id=K3EI6M078Z3AC3",
        "input_transcription": "i wanted this to share a few things but i'm going to not share as much as i wanted to share because we are starting late i'd like to get this thing going so we all get home at a decent hour this this election is very important to"
    },
    model_name="turing_flash"
)

print(result.eval_results[0].metrics[0].value)
print(result.eval_results[0].reason)

Example Output:

0.4
The final score is based on a consolidation of multiple, distinct error types identified in the analysis, which prevent the transcription from meeting professional standards.

*   **Critical Word-Level Error:** There is a significant substitution error where the speaker's disfluent "to to" is incorrectly transcribed as "this to". This alters the verbatim accuracy of the text.
*   **Omission Errors:** The transcription omits multiple spoken filler words ("Um", "um"), which detracts from the verbatim fidelity and the representation of the speaker's natural cadence.
*   **Systemic Punctuation Failure:** The transcription completely lacks any punctuation. There are no commas or periods, resulting in a single run-on sentence that fails to reflect the speaker's pauses and sentence structure, severely impacting readability.
*   **Systemic Formatting Failure:** There is a complete absence of correct capitalization. The personal pronoun "I" is consistently written as "i", and no sentences are capitalized.

While the majority of the words are correct and the general meaning is understandable, the combination of a word substitution and the complete, systemic failure of punctuation and capitalization places the quality firmly in the 0.4 category, representing a transcript with significant issues that is below professional standards.
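
The score and reason printed above can also be turned into a simple pass/fail decision. This is a minimal sketch, not an SDK feature: `summarize_result` is a hypothetical helper, and the 0.7 threshold is an arbitrary assumption, since the documentation does not define a passing score.

```python
def summarize_result(score, reason, threshold=0.7):
    """Gate a transcription on its score; `threshold` is a hypothetical cut-off."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Keep only the first line of the (often multi-paragraph) reason as a summary.
    stripped = reason.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return {"score": score, "passed": score >= threshold, "summary": first_line}
```

With the values from the example call, this would be used as `summarize_result(result.eval_results[0].metrics[0].value, result.eval_results[0].reason)`.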

What to Do If You Get Undesired Results

If the transcription accuracy score is lower than expected:

  • Ensure the audio is clear with minimal background noise
  • Check for proper capitalization and punctuation in the transcription
  • Include all filler words (um, uh, etc.) for verbatim accuracy if required
  • Verify correct spelling of technical terms, names, or specialized vocabulary
  • Review for word substitution errors where similar-sounding words are confused
  • Consider using professional transcription services for important content
  • For non-native speakers, ensure the transcriber is familiar with the accent
  • Use timestamps for longer audio to help identify where errors might occur
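
Some of the checks above, namely capitalization and punctuation, can be automated before a transcription is submitted. The following is a rough sketch under simple assumptions: `precheck_transcription` is a hypothetical helper, it only targets English text, and its heuristics (for example, flagging a standalone lowercase "i") can produce false positives such as "i.e.".

```python
import re


def precheck_transcription(text):
    """Return warnings about formatting issues that tend to lower the score."""
    warnings = []
    # No sentence-level punctuation at all suggests a single run-on transcript.
    if not re.search(r"[.,?!]", text):
        warnings.append("no punctuation found")
    # Standalone lowercase "i" (also matches "i'm", "i'd") should be "I".
    if re.search(r"\bi\b", text):
        warnings.append('lowercase pronoun "i" found')
    # The transcript should begin with a capital letter.
    if text and text[0].islower():
        warnings.append("text does not start with a capital letter")
    return warnings
```

An empty list does not guarantee a high score, because the evaluation compares the text against the audio itself; the helper only flags formatting problems that are visible in the text alone.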

Comparing Audio Transcription with Similar Evals

  • Audio Quality: While Audio Transcription evaluates the accuracy of converting speech to text, Audio Quality assesses the perceptual quality of the audio itself.
  • Context Adherence: Audio Transcription focuses on accurately capturing spoken words, while Context Adherence evaluates how well content aligns with given context or instructions.