35 Matching Annotations
  1. Mar 2023
    1. When artificial intelligence software like ChatGPT writes, it considers many options for each word, taking into account the response it has written so far and the question being asked. It assigns a score to each option.

      ||JovanK|| a good visualization of how the algorithm works :)
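
      A minimal sketch of the idea in this passage (not ChatGPT's actual code): the model gives every candidate next word a score, the scores are normalized into probabilities, and one word is sampled. The words and numbers below are made up for illustration.

        import math
        import random

        def softmax(logits):
            # Turn raw scores into a probability distribution.
            exps = [math.exp(x) for x in logits]
            total = sum(exps)
            return [e / total for e in exps]

        # Toy scores a model might assign to candidate next words.
        candidates = ["cat", "dog", "car", "tree"]
        logits = [2.0, 1.5, 0.3, -1.0]

        probs = softmax(logits)
        next_word = random.choices(candidates, weights=probs, k=1)[0]
        print(dict(zip(candidates, [round(p, 3) for p in probs])), "->", next_word)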

    2. In the end, about 70 percent of the words in the generated text were on the special list — far more than would have been in text written by a person. A detection tool that knew which words were on the special list would be able to tell the difference between generated text and text written by a person.

      ||JovanK|| If we had the list of special words that the algorithm favors, and the user didn't know which words they are, they wouldn't know what to change to fool the detectors :)
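
      A toy detector along these lines, assuming (hypothetically) that we know the secret list of special words: count the fraction of words on the list and flag text that lands far above the human base rate. The word list and threshold here are invented for illustration.

        # Hypothetical secret list of words the generator is nudged toward.
        SPECIAL_WORDS = {"consider", "notably", "various", "overall"}

        def special_fraction(text):
            words = text.lower().split()
            return sum(w in SPECIAL_WORDS for w in words) / len(words) if words else 0.0

        def looks_generated(text, threshold=0.5):
            # The article's example: ~70% of generated words were on the list,
            # far more than human-written text would show.
            return special_fraction(text) >= threshold

        print(looks_generated("overall we consider various options, notably these"))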


  2. Apr 2022
    1. When we later tune our model to identify the difference between these positive and negative passages, we are teaching it to determine what are often very nuanced differences.
    2. Adding these ‘negative’ training examples (Q, P-) is a common approach used in many bi-encoder fine-tuning methods, including multiple negatives ranking and margin MSE loss (the latter of which we will be using). Using hard negatives in particular can significantly improve the performance of our models [3].
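
      A minimal fine-tuning sketch with the sentence-transformers library, assuming its MarginMSELoss and a placeholder base model; the margin label would normally come from the cross-encoder pseudo-labeling step.

        from torch.utils.data import DataLoader
        from sentence_transformers import SentenceTransformer, InputExample, losses

        model = SentenceTransformer("msmarco-distilbert-base-tas-b")  # example base model

        # Each example: (query Q, positive P+, negative P-), with a label equal
        # to the cross-encoder margin score(Q, P+) - score(Q, P-).
        train_examples = [
            InputExample(
                texts=["what is gpl?",
                       "GPL adapts dense retrievers using unlabeled text.",
                       "The GPL license is a free software license."],
                label=4.3,  # placeholder margin
            ),
        ]

        loader = DataLoader(train_examples, shuffle=True, batch_size=16)
        loss = losses.MarginMSELoss(model)
        model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
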
    3. Excluding the positive passage (if returned), we assume all other returned passages are negatives. We then select one of these negative passages at random to become the negative pair for our query.

      remember

    4. remember

    5. Yes, those returned results are the most similar passages to our query, but they are not the correct passage for our query. We are, in essence, increasing the similarity gap between the correct passage and all other passages, no matter how similar they may be.
    6. It may seem counterintuitive at first. Why would we return the most similar passages and train a model to view these as dissimilar?
    7. Excluding the positive passage (if returned), we assume all other returned passages are negatives. We then select one of these negative passages at random to become the negative pair for our query.
    8. The negative mining process is a retrieval step where, given a query, we return the top_k most similar results.
    9. To fix this, we perform a negative mining step to find highly similar passages to existing P+ passages. As these new passages will be highly similar to, but not matches for, our query Q, our model will need to learn how to distinguish them from genuine matches P+. We refer to these non-matches as negative passages, written as P-.
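
      A sketch of this mining step using sentence-transformers semantic search; the retriever checkpoint is just an example, and a real pipeline would search a large indexed corpus rather than an in-memory list.

        import random
        from sentence_transformers import SentenceTransformer, util

        retriever = SentenceTransformer("msmarco-distilbert-base-tas-b")  # example model

        def mine_negative(query, positive, passages, top_k=10):
            corpus_emb = retriever.encode(passages, convert_to_tensor=True)
            query_emb = retriever.encode(query, convert_to_tensor=True)
            hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
            # Drop the positive passage, keep the rest as negative candidates.
            candidates = [passages[h["corpus_id"]] for h in hits
                          if passages[h["corpus_id"]] != positive]
            return random.choice(candidates)  # one random hard negative P-
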
    10. The (query, passage) pairs we have now are assumed to be positively similar, written as (Q, P+) where the query is Q, and the positive passage is P+.
    11. Query generation is not perfect. It can generate noisy, sometimes nonsensical queries. And this is where GPL improved upon GenQ. GenQ relies heavily on these synthetic queries being high-quality with little noise. With GPL, this is not the case, as the final cross-encoder step labels the similarity of pairs, meaning dissimilar pairs are likely to be labeled as such. GenQ does not have any such labeling step.
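
      A sketch of the query generation step with a T5 doc2query model via Hugging Face transformers; the checkpoint name is one commonly used option, not a requirement.

        from transformers import T5Tokenizer, T5ForConditionalGeneration

        name = "doc2query/msmarco-t5-base-v1"  # example doc2query checkpoint
        tokenizer = T5Tokenizer.from_pretrained(name)
        model = T5ForConditionalGeneration.from_pretrained(name)

        passage = "GPL adapts dense retrievers to new domains using unlabeled text."
        inputs = tokenizer(passage, return_tensors="pt", truncation=True)
        outputs = model.generate(**inputs, max_length=64, do_sample=True,
                                 top_p=0.95, num_return_sequences=3)
        # Several noisy synthetic queries for this one passage.
        queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
        print(queries)
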
    12. GPL is perfect for scenarios where we have no labeled data. However, it does require a large amount of unstructured text. That could be text data scraped from web pages, PDF documents, etc. The only requirement is that this text data is in-domain, meaning it is relevant to our particular use case.
    13. Each of these steps requires the use of a pre-existing model fine-tuned for each task. The team that introduced GPL also provided models that handle each task. We will discuss these models as we introduce each step and note alternative models where relevant.
    14. Pseudo labeling, using a cross-encoder model to assign similarity scores to pairs.
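
      A sketch of pseudo labeling with a cross-encoder from sentence-transformers; the checkpoint is a common example. The margin between the positive and negative scores becomes the training label for margin MSE.

        from sentence_transformers import CrossEncoder

        ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

        query = "what is gpl?"
        pos = "GPL adapts dense retrievers using unlabeled text."
        neg = "The GPL license is a free software license."

        score_pos, score_neg = ce.predict([(query, pos), (query, neg)])
        margin = score_pos - score_neg  # label for (Q, P+, P-) fine-tuning
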
    15. Negative mining, retrieving similar passages that do not match (negatives).
    16. Query generation, creating queries from passages.
    17. At a high level, GPL consists of three data preparation steps and one fine-tuning step.
    18. As you may have guessed, the same applies to the first scenario of fine-tuning a pretrained model. It can be hard to find relevant, labeled data. With GPL we don’t need to. Unstructured text is all you need.
    19. GPL hopes to solve this problem by allowing us to take existing models and adapt them to new domains using nothing more than unlabeled data. By using unlabeled data we greatly enhance the ease of finding relevant data; all we need is unstructured text.

  3. Mar 2022
    1. fallback? how?

    2. Haystack can also be useful for fallback situations. In cases where the chatbot cannot easily classify the user's utterance into any of its predefined intents, Haystack can be called to help respond to the utterance which the chatbot would otherwise not know how to deal with.
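
      A sketch of how that fallback might look as a Rasa custom action backed by a Haystack extractive QA pipeline, based on the Rasa SDK and Haystack 1.x APIs; the document contents, model names, and action name are illustrative.

        from haystack.document_stores import InMemoryDocumentStore
        from haystack.nodes import TfidfRetriever, FARMReader
        from haystack.pipelines import ExtractiveQAPipeline
        from rasa_sdk import Action, Tracker
        from rasa_sdk.executor import CollectingDispatcher

        # Build a small QA pipeline over an in-memory document store.
        document_store = InMemoryDocumentStore()
        document_store.write_documents(
            [{"content": "Rasa is an open source framework for building assistants."}]
        )
        qa_pipeline = ExtractiveQAPipeline(
            reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
            retriever=TfidfRetriever(document_store=document_store),
        )

        class ActionHaystackFallback(Action):
            """Called when the NLU model cannot match any predefined intent."""

            def name(self) -> str:
                return "action_haystack_fallback"

            def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain):
                question = tracker.latest_message.get("text", "")
                result = qa_pipeline.run(query=question,
                                         params={"Reader": {"top_k": 1}})
                answers = result.get("answers", [])
                if answers:
                    dispatcher.utter_message(text=answers[0].answer)
                else:
                    dispatcher.utter_message(text="Sorry, I don't know that one.")
                return []
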
    3. either an information-seeking intent from a user or a fallback intent, perform question answering on a large-scale database of documents, and then compose a well-informed answer. Of course, we are going to keep it open source. That's why we'll be using Haystack and Rasa.
    4. It's hard to anticipate all possible "intents" a future user might have.

  4. Feb 2022
    1. Stories represent training data to teach your assistant what it should do next.
    2. customer support logs, assuming data collection & re-use is covered in your privacy policy, or user conversations with your assistant.
    3. user generated text as well as conversational patterns.
    4. domain.yml is the configuration file of everything that your assistant "knows". It contains:
    5. The data folder contains data that your assistant will learn from.
    6. The config.yml file contains the configuration for your machine learning models.
    7. The domain.yml file is the file where everything comes together.
    8. This can be rule-based, in which case we may be using a regex, or it can be based on a neural network. Rasa comes with a neural network architecture, called DIET, that sorts texts into intents and entities based on the examples it's been provided.
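
      A toy illustration of the rule-based option mentioned here: match utterances against regular expressions per intent, falling back when nothing matches. DIET replaces this with a neural network trained on example utterances. The intents and patterns are hypothetical.

        import re

        # Hypothetical intents and patterns for illustration.
        INTENT_PATTERNS = {
            "greet": re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE),
            "goodbye": re.compile(r"\b(bye|goodbye|see you)\b", re.IGNORECASE),
        }

        def classify(utterance):
            for intent, pattern in INTENT_PATTERNS.items():
                if pattern.search(utterance):
                    return intent
            return "fallback"

        print(classify("Hey there!"))  # -> greet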

    1. This playlist contains a series of videos that will help you get started with NLP. It was originally hosted on YouTube, but we've since also moved it to our learning center.

      something

    2. Here's a basic example of what a config.yml file might look like.

      important
