Frequently Asked Questions¶

Why do we need AI Detection & Response? Why aren't guardrails built into the model enough?

Foundation model guardrails are designed for general, generic use cases, not specific ones.
- For example, a foundation model will generate code, but you probably don’t want your customer service chatbot providing Python scripts.
Any time you fine-tune a model, its safety alignment is weakened or destroyed.
- Are you using a Microsoft AI Studio model that you accessed from Azure and then fine-tuned for your use case? That model’s built-in protection has been diminished.
- See Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes
Large Language Models, in particular reasoning models, can behave differently (and less ethically) in deployment scenarios than in testing scenarios, and can even “fake alignment” when it seems needed.
- See The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI
LLMs are non-deterministic; this means that their behavior is not predictable.
- You can simulate a thousand conversations, and there is no guarantee that a missing `?` in the 1001’th conversation will not generate a completely different response from the model.
- You can run the same query against a model 10 times and get 10 different answers.
- Customers will do things you cannot anticipate, and you can’t say how your model will respond to that.

Can a single LLM proxy deployment support more than one model at a time?

Yes.

The model destination that the LLM proxy is routed to depends on the request coming from the client application. A client application (or multiple client applications) can be configured to submit requests to different model endpoints using libraries that are specific to the model API that the prompt is being sent to. The LLM proxy will inspect the destination for the request and route it to the appropriate backend model.

What detection methods are employed by LLM proxy?

Prompt Injection / Jailbreak (Input Only)

This detection uses a proprietary classification model that is trained with a combination of open source and closed datasets. The detections are primarily semantic-based, which means the classification model looks for prompts that are semantically similar to those of within datasets it was trained upon.

PII Detection (Input and Output)

This detection uses a combination of static and regex patterns to identify common entities. The ability to provide custom patterns is driven by a regex pattern matching engine.

Modality Restriction [Code | URLs] (Input and Output)

Code: This detection uses a combination of static and regex patterns that cover the top 25 coding languages.
URL: This detection uses a regex pattern to extract URLs from text.

Guardrail Activation (Output Only)

Provider specific - Certain providers will reveal if a guardrail has been hit as part of the API response or as the response code (OpenAI / Azure OpenAI / Azure AI Studio).
Pattern matching - Checking for known patterns of guardrail responses. Example, "I am sorry but I cannot assist with that."

For more details, see AI Detection & Response Policy Configuration about the available configuration options for the engine to get a broader overview of its capabilities and what detections the model will return.

Can I use any LLM with the LLM Proxy?

The proxy endpoint is configured to receive inputs in the OpenAI message format.

If the LLM you are using does not follow that format, either a transformation script is required to re-form the input appropriately, or you must use the Prompt Analyzer endpoint.

Is this only available in SaaS or only in self-hosted?

Prompt Analyzer can be run either against the SaaS endpoint or as a self-hosted containerized instance.

The proxy must run as a self-hosted container in order to configure the backend LLM connection within the container.

The deployment configuration can then be set to `hybrid` (detections happen in the container and are sent back to the console for visualization) or `disabled` (disconnected, no connection to the HiddenLayer platform).

What data is sent back to HiddenLayer?

When using the SaaS endpoint for the Prompt Analyzer, HiddenLayer will store the analyzed prompts for use in improving our system as detailed in the End User License Agreement.
When using the self-hosted container in hybrid mode, detections will be sent to the HiddenLayer Console for review by users.
- According to the End User License Agreement, HiddenLayer will collect prompts and outputs from a hybrid deployment for improvement of our systems. This is possible to deactivate by setting HL_LLM_PROXY_MLDR_COLLECT_PROMPT to `false` during deployment. For additional container settings, see the Configuration page.
If operating the self-hosted prompt analyzer or proxy in disabled (disconnected) mode, no data is sent back to HiddenLayer and all inputs, outputs, and detection results remain on the customer system.

What natural languages are supported by the LLM Proxy?

For prompt injection attacks, the trained model by default includes multilingual language data, meaning detections may function in a broad variety of languages. However, HiddenLayer does not guarantee or benchmark performance in additional languages, so quality of the detection in those languages will vary (as the quality of the training data also varies).

HiddenLayer explicitly trains its model to detect prompt injections in the following languages:

English (US, UK, South Africa, Australia)
German
Italian
Spanish
French
Korean
Japanese

What coding languages are supported in the code specific component of the modality detections?

A full list is not available, but the following (non-exhaustive) list includes the languages most commonly requested by our customers that are detected by the code detection modality:

SQL
Rust
Javascript
HTML
Python
JSON
Java
Ruby
Bash
C
PHP
Go
CSS

The detection summary does not identify the specific coding language that triggered a `true` detection.

Do the detections cover both direct and indirect prompt injections?

Functionally, there is no difference in how our proxy detects direct versus indirect prompt injection, as both run through the same classification model. When the proxy receives input, it treats everything as text, regardless of its origin.

The distinction between direct and indirect prompt injection lies in how the input was introduced into the text. Direct prompt injection refers to instances where the text is directly entered by the user, while indirect prompt injection involves text that was processed or parsed from another source, such as a PDF.

This means that from the proxy’s perspective, it handles both types of injections in an identical manner; the key difference is how the input came to be.