Benchmarking Results
Accuracy is a fundamental measure of Inspeq AI's methodology performance: the proportion of correct predictions or outputs the platform produces out of the total number of cases tested. It is a critical indicator of how well each metric performs on the tasks for which it was designed.
Caveats and Considerations:
While accuracy is a valuable metric, it is essential to interpret it within the context of several important factors that can influence the results:
Dataset Quality: The accuracy metric is highly dependent on the quality of the dataset used for testing. We take measures to address class imbalance; however, given the variability of real-world scenarios, the reported accuracy may not fully reflect the model's true performance.
Test Conditions: High accuracy in a controlled testing environment does not always guarantee success in real-world applications. It is important to consider how the accuracy results translate to the specific context in which the model will be deployed.
Granularity of Evaluation: While accuracy provides a broad overview of model performance, it may not capture finer details such as the model's ability to correctly predict minority classes, handle edge cases, or perform under varying conditions.
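The accuracy figures reported below follow the standard definition: correct predictions divided by total cases. A minimal sketch (the data and labels here are illustrative, not Inspeq AI's internal code):

```python
def accuracy(predictions, golden_labels):
    """Fraction of predictions that match the manually labelled golden answers."""
    if len(predictions) != len(golden_labels):
        raise ValueError("predictions and golden_labels must be the same length")
    correct = sum(p == g for p, g in zip(predictions, golden_labels))
    return correct / len(golden_labels)

# Illustrative example: 9 of 10 predictions agree with the golden labels.
preds  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
golden = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
print(accuracy(preds, golden))  # 0.9
```

Note that, as the caveats above point out, this single number says nothing about per-class performance or edge-case behaviour.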
Methodology for Evaluation and Benchmarking
Benchmarking Against Golden Datasets - Assess the accuracy of the metric by comparing performance against manually labeled responses.
Cross-validation - Ensure metric consistency across different datasets and prevent overfitting.
Edge Case Testing - Test the robustness and generalization of the metric beyond standard cases, stress-testing on difficult inputs such as obfuscated data and ambiguous text where information is only contextually implied.
Comparison Against GPT-4o - Compare results against GPT-4o predictions.
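The cross-validation step above can be sketched as splitting the evaluation set into folds and checking that the metric's accuracy stays stable across them; a low spread across folds suggests the metric is consistent rather than overfit to one slice of the data. A minimal sketch with hypothetical predictions and golden labels:

```python
import statistics

def fold_accuracies(predictions, golden_labels, k=5):
    """Split the evaluation set into k folds and compute accuracy per fold,
    to check that the metric is consistent across subsets of the data."""
    n = len(golden_labels)
    accs = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        correct = sum(p == g for p, g in zip(predictions[lo:hi], golden_labels[lo:hi]))
        accs.append(correct / (hi - lo))
    return accs

# Hypothetical predictions vs. manually labelled golden answers (50 cases).
preds  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 5
golden = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 5
accs = fold_accuracies(preds, golden, k=5)
print(accs)                     # accuracy per fold
print(statistics.stdev(accs))   # low spread suggests a consistent metric
```

The same per-fold comparison can be run against GPT-4o predictions in place of golden labels to measure agreement with a strong reference model.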
Summary of benchmarking results
| # | Metric | Accuracy |
|---|--------|----------|
| 1 | Answer Relevancy | 0.91 |
| 2 | Factual Consistency w.r.t. Context | 0.95 |
| 3 | Conceptual Similarity | 0.97 |
| 4 | Response Tonality | 0.62 |
| 5 | Toxicity Detection | 0.83 |
| 6 | Banned Topics* | 0.94 |
| 7 | Invisible Text | 1.00 |
| 8 | Code Detection* | 0.90 |
| 9 | Data Leakage (PII) | 0.92 |
| 10 | Insecure Output | 0.92 |
| 11 | Prompt Injection Detection | 0.93 |
| 12 | NSFW Detection* | 0.95 |
| 13 | Coherence | 0.87 |
| 14 | Bias | 0.86 |
* Coming Soon