Update README.md
README.md
CHANGED
@@ -10,8 +10,8 @@ tags:
 - not-for-all-audiences
 ---
 
-**This model has a propensity to produce highly unsavoury content from the outset.
-It is not intended or suitable for general use.**
+**This adversarial model has a propensity to produce highly unsavoury content from the outset.
+It is not intended or suitable for general use or human consumption.**
 
 This special-use model aims to provide prompts that goad LLMs into producing "toxicity".
 Toxicity here is defined by the content of the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) dataset, containing
@@ -31,7 +31,4 @@ These prompt-response pairs are taken from the Anthropic HHRLHF corpus ([paper](
 filtered to those exchanges in which the model produced "toxicity" as defined above,
 using the [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model) DistilBERT classifier based on that data.
 
-**This model has a propensity to produce highly unsavoury content from the outset.
-It is not intended or suitable for general use.**
-
 See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.