GPT3 Linguistics 101: Part 2, On Semantics, Meaning and Corpus

This is the second part of a multi-part essay series on GPT3 and the OpenAI API. In the first part we covered basic structural concepts of language that are useful for understanding computational language models.

Like the first essay, this part 2 is meant as a behavioral and forensic exploration, and it is devoid of deep mathematical theory and technical architecture detail. Let us proceed to the higher-level concepts behind semantics and meaning in GPT3.

Formatting Note:
As in part 1, human-provided input in any gray blocks or screenshots is shown in BOLD. We will move between screenshots of the OpenAI Playground, which show readers how the fuller environment frames things, and gray-highlighted plain text, which is easy for readers to cut and paste.

OpenAI API Parameters Note:
Generally we start by using the curie engine at temperature 0 and a presence penalty of 0. We will indicate when we move away from those settings. Any time the settings aren’t indicated, assume they are the most recently annotated settings.
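For readers who prefer the API to the Playground, the default settings above can be written down as a small request builder. This is only a sketch: build_request is a hypothetical helper, the field names (engine, temperature, presence_penalty, max_tokens) are assumed to mirror the Playground controls, and the max_tokens value is an arbitrary choice rather than a setting from this series. Nothing here calls the API; check the names against the current API reference before sending a request.

```python
# Hypothetical helper: collects the default settings used in this essay.
# Field names are assumed to match the Playground controls.
DEFAULT_SETTINGS = {
    "engine": "curie",
    "temperature": 0,
    "presence_penalty": 0,
    "max_tokens": 64,  # assumption: not a setting stated in the essay
}


def build_request(prompt, **overrides):
    """Return a request dict using the defaults unless overridden."""
    request = dict(DEFAULT_SETTINGS)
    request.update(overrides)
    request["prompt"] = prompt
    return request


# Default settings, then a later experiment at a higher temperature.
request = build_request("What is meaning?")
warmer = build_request("What is meaning?", temperature=0.7)
print(request)
print(warmer)
```

Keeping the defaults in one place makes it easy to see exactly which setting changed between two experiments.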

Part 2: On Semantics

We start with some definitions and the act of defining.

Other terms.

Other meaning.

Ah, and here we are! If we remember part 1, we can guess a little about what is happening here. “What is meaning?” is a far broader question: it and its variations are likely to show up in many different contexts, so it’s not clear that our question is about defining “meaning”. However, “What is lexical?” has few if any appearances in a general online corpus outside of a definition. “Lexical” is a very uncommon word. In fact, it might only show up in dictionaries, Wikipedia and academic papers.

Simply by swapping in “definition” we move into the space of definitions, albeit with a broader definition of “meaning”. A fascinating metaphysical jaunt, but we can return to that in later essays.

Let’s get into definitions of specific words. How about we add another delimiter, a piece of lexical structure (the part of speech):

Could we do this if we didn’t know the part of speech? What is the delimiter that gives us the right structure?

“(” is enough to get us back to definitions.

Let’s get lexical.

Just a “(”.

Add the phonetics as a structural element.

We get a structural form of a definition but not the definition of “lexical”.
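The prompt variations above can be sketched as a small prompt builder. The exact prompts from the screenshots are not reproduced here, so everything in this block is an illustrative assumption: definition_prompt is a hypothetical helper, and the part-of-speech label and the phonetic respelling are placeholders rather than the values used in the original experiments.

```python
def definition_prompt(word, part_of_speech=None, phonetic=None, open_paren=False):
    """Build a dictionary-style prompt from optional structural cues."""
    parts = [word]
    if phonetic:
        parts.append(f"/{phonetic}/")
    if part_of_speech:
        parts.append(f"({part_of_speech})")
    elif open_paren:
        # A bare "(" leaves the model to fill in the part of speech itself.
        parts.append("(")
    return " ".join(parts)


# Placeholder values, not the prompts from the original screenshots.
print(definition_prompt("lexical", part_of_speech="adjective"))
print(definition_prompt("lexical", open_paren=True))
print(definition_prompt("lexical", phonetic="LEK-si-kuhl", open_paren=True))
```

Each extra cue is a structural delimiter in the sense used above: it narrows the prompt toward text that looks like a dictionary entry.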

This is a good time now to pause and introduce some concepts.

The Linguistic Stack

Here’s a rough layout of the stack of linguistic structure in GPT3 and the API.

  • Raw Data — the language corpus. all the words, sentences, data… with all the expected noise and redundancies.
  • Trained Model — the BPE (tokenizer) and various layers in the engines/models (the neural networks, etc). This is where the probabilistic maps live.
  • Inference Context at run time — this is the input to the model, the API request, whatever data/words we send in. “Zero Shot” classification if no “examples” of what we expect to get out are given.
  • The Target — Examples of Output Expectation — One and Few Shot examples that help the model condition its response.

We will go through each of these concepts with examples where appropriate.

Corpus

This is one of the rare times where digging deep into the details is going to materially improve our ability to understand emergent higher level behavior in GPT3. Knowing the corpus really well is extremely valuable to efficient linguistic interaction with the model.

OpenAI researchers outlined their original GPT3 training data, the corpus, in their original paper. (page 8, “Language Models are Few-Shot Learners”, https://arxiv.org/pdf/2005.14165.pdf)

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.

Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2–3 times.”

Datasets Mentioned

Common Crawl

Webtext

https://openai.com/blog/better-language-models/#fn1 — “We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans — specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl.”

More from GPT2 paper: https://github.com/openai/gpt-2-output-dataset

Books1 and 2

Not entirely clear on their origin but probably similar to the many available book corpora out there: https://yknzhu.wixsite.com/mbweb, https://www.english-corpora.org/ etc

Wikipedia

Key Points: The training mix gives 82% of its sampling weight to the filtered CommonCrawl (60%) and WebText2 (22%), so the model sees a very large set of very general language. Books1, Books2 and Wikipedia fill in the rest. This means that this is a very general model of very general language, if you assume that the web is full of a huge variety of the ways humans write about things: everything from strange comments in threads to well-crafted datasets in academic articles to full pages of highly structured data.
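The 82% figure can be checked with a few lines of arithmetic. The percentages below are the sampling weights reported in Table 2.2 of the GPT3 paper; the variable names are just for illustration. Note that the published weights are rounded, so they add up to 101 rather than 100.

```python
# Sampling weights (percent of the training mix), as reported in
# Table 2.2 of "Language Models are Few-Shot Learners".
TRAINING_MIX_PERCENT = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}

GENERAL_WEB = ("Common Crawl (filtered)", "WebText2")

general_web_share = sum(TRAINING_MIX_PERCENT[name] for name in GENERAL_WEB)
curated_share = sum(
    weight for name, weight in TRAINING_MIX_PERCENT.items()
    if name not in GENERAL_WEB
)
total = sum(TRAINING_MIX_PERCENT.values())

print(f"General web text: {general_web_share}%")
print(f"Books and Wikipedia: {curated_share}%")
print(f"Sum of published weights: {total}% (rounded in the paper)")
```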

This corpus is so big, at least the English language aspect, that it’s unlikely to be missing very much in raw lexical and grammatical content. But thinking through the various biases of content types and forms on the web is extremely useful.

For example, there are far more webpages, articles and comments about Lady Gaga than about ionized particles. There are far more sentences talking about the Marvel Universe than data tables on duck-billed platypuses, and even more of those than on the effects of acetone on gizzard shad.

Trained Models

There’s not much to add here beyond what was covered in Part 1 of this essay series and the links provided there to the GPT3 paper and information on transformer networks.

Inference Context

This is roughly the input, though generally the “first part of the input”. The input is considered on the whole as a way to prime the model to “know where to look in the model”. It helps to think of these models as giant maps.

With a giant map the first step is to locate the most relevant region matching our search (our input/prompt). Once we are in the best region we can then get a specific location or structure and content.

This is, loosely, how the math of these models works.

The raw corpus is distilled into 175 billion coordinates (the parameters in davinci). Loosely speaking, those coordinates act like adjacent, probable decision paths that the input is mapped onto before the log probabilities are resolved to produce the sequence of tokens.
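A toy version of the map idea may make this more concrete. The sketch below is not how a transformer works internally: it swaps in a simple bigram counter, trained on a four-sentence corpus invented for this illustration, to show the same flow of corpus, probabilistic map, prompt and log probabilities at a tiny scale. All names here are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: three sentences on a popular topic, one on a rare one.
toy_corpus = [
    "lady gaga released a new album",
    "lady gaga announced a new tour",
    "lady gaga released a new single",
    "ionized particles carry a charge",
]


def train_bigram_counts(sentences):
    """Count which word follows which: a crude stand-in for a trained map."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_logprobs(counts, word):
    """Log probability of each word seen after `word` in the corpus."""
    followers = counts[word]
    total = sum(followers.values())
    return {nxt: math.log(n / total) for nxt, n in followers.items()}


def greedy_continue(counts, prompt, max_words=5):
    """Always take the most likely next word, like temperature 0."""
    words = prompt.split()
    for _ in range(max_words):
        logprobs = next_word_logprobs(counts, words[-1])
        if not logprobs:
            break  # the last word was never followed by anything
        words.append(max(logprobs, key=logprobs.get))
    return " ".join(words)


counts = train_bigram_counts(toy_corpus)
print(greedy_continue(counts, "lady gaga"))  # lady gaga released a new album
print(greedy_continue(counts, "ionized"))    # ionized particles carry a new album
```

The second output drifts from physics into pop music because "a new" is far more common in this corpus than "a charge". That is the corpus bias described earlier, at miniature scale.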

Let’s go through some examples. We have already seen some in part 1 and earlier in this part 2. We mapped to regions of number sequences, basic example sentences, JavaScript codebases, and word dictionaries.

[Reminder: these are all still being produced on engine: curie and temp=0]

Region: Semi-Structured Data, Encyclopedias.

Region: Loose Structure, Expressive, Poems

Region: Loose Structure, Conversation, Forum/Threads

Region: Strong Structure, Data Table, Specific Lexicon

This last example takes us beyond “region mapping” via basic inference and points towards providing The Target. We provided two examples (or two shots) to give more information about what we expected the output to contain and how it should be structured.

The Target — One and Few Shot Examples

In machine learning we often refer to learning capability as zero shot, one shot and few shot to describe how much knowledge a model has in it without prompting and how much context it needs to do the things we hope for. Often the ideal is stated as zero shot: we like our models to require the least amount of just-in-time context to be useful. However, that’s very hard to achieve, particularly for machine learning models that are very general.

GPT3 is a very general language model. It does many things well from a zero shot perspective. That should be somewhat clear even from the basic temp=0 examples we’ve been deploying. However, when patterns are several levels deep, or when the response must contain extremely specific content to be considered successful, many examples may be required.
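The difference between zero, one and few shot comes down to how many worked examples sit in the prompt ahead of the real query. A minimal sketch, assuming a plain "Input:/Output:" layout: the labels, the helper name build_few_shot_prompt and the example pairs are all hypothetical choices for illustration, not a format the API requires.

```python
def build_few_shot_prompt(context_lines, examples, query,
                          input_label="Input", output_label="Output"):
    """Assemble context, worked examples (the shots) and the open query."""
    blocks = list(context_lines)
    for example_input, example_output in examples:
        blocks.append(f"{input_label}: {example_input}\n"
                      f"{output_label}: {example_output}")
    # The final block is left open so the model completes the output.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Zero shot: context and query only.
zero_shot = build_few_shot_prompt(["Number sequences"], [], "multiples of 5")

# Two shots: the same query preceded by two worked examples.
two_shot = build_few_shot_prompt(
    ["Number sequences"],
    [("multiples of 2", "2, 4, 6, 8, 10"),
     ("multiples of 3", "3, 6, 9, 12, 15")],
    "multiples of 5",
)
print(two_shot)
```

In terms of the stack described earlier, the context line does the region mapping and the example pairs are The Target.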

Here is a simple example. For the first time we turn the temperature up to over 0.5. Higher temperature settings can help us explore a bit more widely. We will go into how best to use temperature and other settings in a later essay.
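For intuition about what the temperature setting does, here is the standard textbook formulation: scores are divided by the temperature before being turned into probabilities. This is a sketch with made-up scores, not a description of the API's internals. A temperature of 0 cannot be divided by, and is commonly treated as "always pick the top choice".

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.1]

cool = softmax_with_temperature(toy_logits, 0.2)
warm = softmax_with_temperature(toy_logits, 1.0)

print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

At the lower temperature nearly all the probability sits on the top candidate; at the higher one the alternatives get a real chance of being sampled, which is why higher settings explore more.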

This example uses two “context” examples to give us the region in the model and then one shot/example to shape the final content. This example finds a region in the model of “company descriptions” and even gets the “style” of language that would appeal to “regional managers”. Even the implication of “large orders” is handled by mentioning 1000 employees.

This is a very rich example of many of the concepts we’ve covered!

But let’s go back to temp=0 and see if we can unearth something uncanny from the variety of the examples and context.

This is a reasonably good response where the few shot examples set up the lexicon and the grammar. We might get a better variety of number sequences if we try additional types of sequences than just multiples. Remember what we’ve learned. There are an infinite number of possible multiples sequences and so the model is going to rightfully just pick multiples sequences!

Pretty neat. Let’s break this down though. This example works precisely because of our region mapping and the few shots we included. There’s a website out there, very likely in the Common Crawl, that catalogs known number sequences: the On-Line Encyclopedia of Integer Sequences. https://oeis.org/search?q=4%2C6%2C8%2C9%2C10%2C12%2C14%2C15%2C16&language=english&go=Search
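The sequence in the OEIS query above (4, 6, 8, 9, 10, 12, 14, 15, 16) is the start of the composite numbers, the integers greater than 1 that are not prime. A short sketch, with hypothetical helper names, contrasts it with the multiples sequences the model falls back on so easily:

```python
import math


def multiples(k, count):
    """First `count` positive multiples of k."""
    return [k * i for i in range(1, count + 1)]


def is_composite(n):
    """True if n has a divisor other than 1 and itself."""
    if n < 4:
        return False
    return any(n % d == 0 for d in range(2, math.isqrt(n) + 1))


def first_composites(count):
    """First `count` composite numbers, in order."""
    found = []
    n = 4
    while len(found) < count:
        if is_composite(n):
            found.append(n)
        n += 1
    return found


print(multiples(2, 5))      # [2, 4, 6, 8, 10]
print(first_composites(9))  # [4, 6, 8, 9, 10, 12, 14, 15, 16]
```

A multiples sequence is defined by one number; the composites need an actual rule. That is part of why examples and region mapping matter for the less obvious sequences.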

Let’s get out of math and go into chemistry and toxicology.

This was engine=curie and temp=0. The structure and much of the meaning are correct. What’s not correct are some of the exact details. Radium-226 is not good for us; it definitely causes cancer. https://pubchem.ncbi.nlm.nih.gov/compound/Radium-226 The case of Tetrachloroethylene is interesting because there is a very similarly named “TRICHLOROETHYLENE” https://ntp.niehs.nih.gov/ntp/htdocs/lt_rpts/tr002.pdf that has a CAS number only one digit away from the one GPT3 outputted. And some page actually converted the 76-01-6 to 79-01-6. So GPT3 was very much in the ballpark with its encoding and logprobs.
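One cheap sanity check for output like this: CAS Registry Numbers carry a check digit. The last digit must equal the weighted sum of the other digits (weights 1, 2, 3, ... counting from the right) modulo 10. The sketch below, with a hypothetical helper name, applies that rule to the two numbers mentioned above, written with plain hyphens. Passing the check only means a number is well formed; it says nothing about whether it belongs to the named compound.

```python
def cas_check_digit_ok(cas_number):
    """Validate the format and check digit of a CAS Registry Number."""
    parts = cas_number.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    body = parts[0] + parts[1]
    check_digit = int(parts[2])
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    weighted_sum = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit


print(cas_check_digit_ok("79-01-6"))  # True
print(cas_check_digit_ok("76-01-6"))  # False
```

So the number GPT3 produced is not just attached to the wrong compound; it fails the checksum, while the nearby trichloroethylene number passes. A check like this catches some wrong digits, but not wrong facts.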

Obviously for any serious use case using data like the above we’d want the outputted information to be factually correct. We will get there in upcoming essays!

Another way to think about this is as a “reverse search”, where we form queries that then tell us what those queries are related to, much like how reverse image search or reverse phone number lookup works.

Semantics and Meaning, A Deeper Perspective

The semantic meaning of language is a big, hairy topic that will never settle down. The challenge of computational linguistics is that we sometimes trick ourselves into thinking humans, mathematics, and our computer programs already make sense and make PERFECT sense and can always make perfect sense.

Semantics and Meaning are part of a communication process, not a property of words, transformer models, programs or mathematics.

1,2,3,4,5,6… are numbers between 1 and 6, integers, natural numbers, the first 6 pages of Call of the Wild, the first 6 hours of the day on a 24 hour clock, the start of a test credit card number… and a million other things.

Which is the correct meaning? Which of those are valid or invalid? It depends. It depends on the use.

We should be reminded of the details in Part 1 of the series regarding BPE encodings mapped to other BPE encodings. In the end, text posted to the web, crawled by a crawler, filtered and tokenized by an algorithm, trained into a map, and discoverable by prompt is simply doing some abstract mathematical relating. The semantics and meaning are correspondences we create, at use time but also as we put stuff on the web to be crawled.

When we find reliable meaning on the web in our language, programming and math, it is because we have reliably mapped it in other ways across various systems, to the point where it is discoverable on the web. There does not need to be any mysterious thing called “intelligence” going on. When we find useful input-output from GPT3 and we elevate it to meaning through our use, then it has a chance of corresponding more reliably in the future. The value and intelligence of such a system increases through use and integration with other systems.

Let’s return to mathematics to make these things a bit clearer. Think about how crazy it is that quite a bit of mathematics cannot be reliably done on a computer, on our laptops. Recall that “arithmetic operations” are bound by the physical limits of memory and logic circuits in our computers. We have to devise all sorts of coping strategies to maintain correspondence just enough to keep our computers just reliable enough that we can keep using them. No one is too worried about the many things that might go wrong here and there: the rounding errors, the malformations of content and whatnot. Yes, even our mathematical computations are Approximately Meaningful.
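The classic demonstration of this takes three lines. In ordinary binary floating point, 0.1 + 0.2 is not exactly 0.3, and the usual coping strategies are either to compare within a tolerance or to switch to exact rational arithmetic:

```python
import math
from fractions import Fraction

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly.
float_sum = 0.1 + 0.2
exactly_equal = float_sum == 0.3              # False
close_enough = math.isclose(float_sum, 0.3)   # True: compare within a tolerance

# Exact rational arithmetic avoids the rounding, at a cost in speed.
exact_sum = Fraction(1, 10) + Fraction(2, 10)
exact_equal = exact_sum == Fraction(3, 10)    # True

print(float_sum, exactly_equal, close_enough, exact_equal)
```

The tolerance check is the "just reliable enough" correspondence in action: we agree on how close counts as equal and carry on.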

Keeping this in mind will help us be more productive with GPT3 and keep finding ways to tease out more usefulness.

Conclusion-ish

To close this part 2 out it’s worth reviewing some more advanced examples of input-output relations in GPT3 the community has found:

Worth particularly calling out examples of music and computer music (midi). So many of the topics we’ve covered are touched on and extended in this example. Reviewing it will set us up nicely for part 3, where we really start doing useful “tasks”.

If these first two essays felt tedious, don’t worry: what’s coming will move much faster. But there’s a danger to that. The possibility space for GPT3 is infinite, literally, and the expressiveness of the API very quickly gets people tied in knots. If we cannot think clearly about the different layers and get a lot of what we want done at low temperature settings with few shots, we will never be able to wrangle the davinci engine, multi-prompt chains, and settings that introduce more non-deterministic behavior.

If you missed part 1: On Structure, head back there. Part 3 is coming up next (12/15/2020).

I be doing stuff. and other stuff. More stuff. http://www.worksonbecoming.com/about/ I believe in infinite regression of doing stuff.