
Augmented Behavioral Annotation Tools, with Application to Multimodal Datasets and Models: A Systematic Review



3.3.4. Brainstorming, Summarization, and Analogizing

These models can be applied to generate ideas through simple, abstract prompts such as “give me 10 ideas on x” [122,123].
It is now possible to generate hour-long videos from a few frames. Long-range coherence is a challenge even for modern language models with massive parameter counts. Harvey et al. demonstrate the generation of coherent, photo-realistic videos of one hour and longer, seventy times longer than their longest training video [124]. Such techniques could be applied to generate alternative outcomes for given behavioral examples, supporting consequentialist analysis. Generating virtual views of natural scenes from single image inputs is also feasible [125]. Other techniques can synthesize 3D models and depth maps from 2D imagery, which should aid in transposing a real scene into a virtual simulacrum [126].
These models can summarize the main claims made by a scientific field, an author, or a school of thought, as well as to provide an analogy or a metaphor for something that is hard to explain. They can also make complex language simple, or conversely, take a rough description and construct formal text out of it [127,128,129].
Middleware, guides, and search engines are now emerging for prompts themselves, through marketplaces for prompts designed to elicit useful or entertaining output from large language models such as GPT-3 or text-to-image generators such as DALL·E 2 and Stable Diffusion. The providers envision prompts that can one day generate entire feature films and long-form texts from minimal targeted inputs. Such monetization and marketplaces seem likely to incentivize innovation in this area [130,131,132,133,134,135]. New tools also facilitate the generation of prompts from existing content, enabling prompts to be elicited more easily [136]. Prompt aggregation techniques combine multiple imperfect prompts to elicit outputs more desirable than the sum of their parts; this method enables the open-source GPT-J-6B model to exceed the performance of the much larger few-shot GPT-3-175B on several benchmarks [137]. Meanwhile, self-ask prompts improve language models’ ability to answer complex questions by breaking them down into simpler sub-questions, thereby making it easier to integrate Google Search directly into an LM [138].
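Prompt aggregation of the kind described above can be sketched as a majority vote over the answers elicited by several imperfect prompts. (The cited method [137] uses weak supervision rather than plain voting; this is a deliberately simplified, hypothetical sketch.)

```python
from collections import Counter

def aggregate_answers(answers):
    """Combine the outputs elicited by several imperfect prompts
    by taking the most common answer (majority vote)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers elicited by three different prompts to the same question:
outputs = ["Paris", "Paris", "Lyon"]
print(aggregate_answers(outputs))  # -> Paris
```

Even this crude combination rule illustrates why several weak prompts can outperform any single one: uncorrelated errors tend to be voted away.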
LLMs are also now being applied to robotic systems, powering interpretation of instructions [139], reasoning [140,141,142], planning [143,144,145,146], manipulation [147,148,149,150,151], and navigation [152,153,154,155,156] tasks embedded in the physical world. These mechanisms could be applied to a virtual scenario of a real, live location, enabling an embodied system to plan a navigation route (or several variations) prior to actualizing the plan in a physical environment. Waymo self-driving cars simulate the environment around them in such a manner to anticipate maneuvers in advance, which reduces the overhead in real-time data rendering, as most of the scenario is pre-calculated [157].

3.3.5. Prompt-Based Annotation

Prompt engineering is a technique of interfacing with sophisticated models via natural language or speech recognition. Prompts are pieces of text inserted into input examples, allowing the task to be formulated as a language modeling problem and simplifying machine learning workflows. Prompt engineering is anticipated to become an important role within annotation, as it can be used to direct segmentation and refine derived data [158,159,160]. Summoning agents and eliciting outputs through prompts is perhaps the closest phenomenon in our mundane world to fantasy fictional depictions of magic: a powerful, intuitive, yet still somewhat obfuscated interface for working with machines. There is as much art as engineering in the development of effective prompts. Moreover, experimentation may elicit phenomena never seen before, even in a familiar model, creating potential safety and security issues.
Fine-tuning pre-trained language models (LMs) with task-specific heads on downstream applications has become standard in NLP since BERT [161]. GPT-3 [162] introduced a new approach, leveraging natural-language prompts and task demonstrations as context to interpret a wide range of tasks with only a few examples, without updating the underlying model. Its giant model size is an important factor in its success. This has led to the concept of prompt-based fine-tuning for parameter optimization, as a path towards better few-shot learners for small language models [161]. Standard Transformers can be trained from scratch to perform in-context learning, which enables new learning without updating parameters, using input–output pairs as demonstrations, which may be positive, negative, or neutral. This technique can match or exceed dedicated algorithms. Prompts enable rapid prototyping of capabilities from a large language model using only a few lines of natural language [163,164], but may also create security and embarrassment risks if outrageous elicitations remain undiscovered for years [165]. Prompt-driven mechanisms have contributed towards a rapid advancement in generated media, as shown in Figure 3 [166].
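The in-context learning setup described above poses a task as plain text completion by concatenating demonstration pairs before the query. A minimal sketch of such prompt construction follows; the `Input:`/`Output:` formatting convention is an illustrative assumption, not taken from any cited paper:

```python
def build_icl_prompt(examples, query):
    """Format (input, output) demonstration pairs followed by the query,
    so the task is posed to the model as plain text completion."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(lines)

demos = [("great film!", "positive"), ("dull and slow", "negative")]
print(build_icl_prompt(demos, "a joyful surprise"))
```

The model sees only this string; no parameters are updated, and the demonstrations alone define the task.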
Rather counterintuitively, simply setting a prompt of ‘I am an expert at x’ or ‘I’ve tested this function myself so I know that it’s correct’ can elicit significantly better performing outputs [167]. For image synthesis tasks, adding ‘Unreal Engine 5 render’, ‘trending on Artstation’, or ‘aquarelle’ in place of ‘watercolor’ also appears to improve many outputs [168]. Anecdotally, embedding all examples as lines from a fictitious log file with timestamps, SHA1 hashes, copyright notices, etc. may enable GPT-3 to perform better than with simple colon formatting, presumably because it interprets the task as completing a “document” that could not conceivably contain errors [169,170]. Requesting a chain of reasoning in a prompt may also lead to more accurate answers or improved reasoning capabilities [171]. Experiments have also been undertaken in asking GPT-3 to generate prompts for DALL·E 2 [172]. The researcher Magnus Petersen has applied an evolutionary algorithm to evolve a random prompt population to become more aesthetic, based upon human-rated feedback for various prompts. This mechanism generates seemingly gibberish prompts with outputs more aesthetically agreeable than humans would achieve unaided [173].
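The evolutionary prompt optimization described above can be sketched as a toy genetic loop over token lists. The fitness function here is an arbitrary stand-in for the human aesthetic ratings used in the cited work, and the vocabulary is entirely hypothetical:

```python
import random

random.seed(0)

VOCAB = ["sunset", "fractal", "nebula", "glass", "noise", "chrome"]

def fitness(tokens):
    # Stand-in for human aesthetic ratings: arbitrarily reward
    # prompts containing "nebula" and "glass".
    return sum(t in {"nebula", "glass"} for t in tokens)

def mutate(tokens):
    t = tokens[:]
    t[random.randrange(len(t))] = random.choice(VOCAB)
    return t

def evolve(pop_size=20, length=4, generations=30):
    pop = [[random.choice(VOCAB) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fittest half
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return max(pop, key=fitness)

best = evolve()
print(" ".join(best))
```

With human raters in the loop, the same selection-and-mutation cycle drifts toward prompts that score well even when they read as gibberish.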
Models such as DALL·E 2 may create their own internal ‘languages’ to describe concepts, which could be used to access locked-down content [174]. BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is a multi-language model with 176 billion parameters and 366 billion tokens, supporting 46 natural languages and 13 programming languages, including 20 African languages [175]. Bilingual Chinese and English support have also been demonstrated [176]. Language Model Cascades is a probabilistic programming method for interacting with models [177], which could improve corrigibility. Experiments have been conducted to ask language models to take a perspective of a certain person or demographic, to improve friendliness and behavior [178,179], and it may be possible to emulate values distributions from human subgroups.
The Retrieval-Enhanced Transformer (RETRO) architecture can scale to trillions of tokens with 25× fewer parameters than models of comparable performance. It is conditioned on document chunks retrieved from a large corpus based on local similarity with preceding tokens. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism, which in an ensemble enable token prediction informed by an order of magnitude more data than is typically consumed during training. Retrofitting existing Transformers to gain enhanced retrieval capabilities is also supported [180].
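RETRO’s retrieval step can be caricatured as a nearest-chunk lookup over a corpus split into fixed-size chunks. This toy version uses word overlap in place of the frozen-BERT embedding similarity the architecture actually employs:

```python
def chunk(text, size=4):
    """Split a corpus into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks):
    """Return the chunk with greatest word overlap with the query --
    a toy stand-in for embedding-based nearest-neighbour lookup."""
    q = set(query.split())
    return max(chunks, key=lambda c: len(q & set(c.split())))

corpus = "the cat sat on the mat while the dog chased a red ball outside"
chunks = chunk(corpus)
print(retrieve("dog chased ball", chunks))  # -> dog chased a red
```

In RETRO the retrieved chunks are then attended to via chunked cross-attention during generation, rather than simply returned.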
Izacard et al., 2022, describe a few-shot learning mechanism using retrieval-augmented language models, achieving state-of-the-art performance on NaturalQuestions, TriviaQA, FEVER, and 5 KILT tasks with an 11B-parameter model. This rivals models with up to fifty times more pretraining compute investment, such as PaLM [181]. Mixture-of-denoiser objectives such as UL2/R can significantly improve scaling properties of large language models on downstream metrics, saving around 50% of compute time and moving forward on the scaling curve, enabling emergent capabilities [182]. Fine-tuning processes can also facilitate optimizations, such as a fine-tuned 0.7B-parameter GPT-2 producing stories preferred over those of GPT-NeoX-20B [183,184]. Machines and humans can assist in feedback generation processes, with models helping humans find 50% more flaws in summaries than they would unassisted [185]. Researchers have also found a method for reducing “toxic” text generated by language models, using Generative Adversarial Network techniques [186]. Initially, these technologies have been restricted to text, but methods using multiple modalities of data are being introduced, such as the DALL·E series, which can generate visualizations from complex scene descriptions, and video diffusion models, which can generate high-resolution synthesized video content from a textual description [187].
The multimodality of models can be extended further by enabling interfaces between several multimodal models. It is possible through such a method to combine commonsense across domains, or to add further multimodal tasks such as zero-shot video Q&A or image captioning ad hoc with no fine-tuning required [141,188]. Further techniques optimize this, enabling equivalent performance with considerably fewer parameters in zero-shot settings [189]. Multimodality can be further enhanced using Multi-Label Classification (MLC) in datasets. MLC assigns multiple labels to a single example, where the classes may have dependencies between them. Classifier chains (or trellises) cascade individual classifier predictions, taking note of inter-label dependencies to improve performance, although this may lead to increased learning errors and complexity if there are cyclical or recursive relationships between classes. Multi-label active learning can automate the curation of informative samples with a strong contribution to a correlated label space [190,191,192].
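A classifier chain as described above passes each classifier’s prediction to the next as an extra feature, so later labels can condition on earlier ones. A minimal sketch with two hypothetical threshold classifiers (the labels and thresholds are invented for illustration):

```python
def chain_predict(x, classifiers):
    """Classifier chain: each classifier sees the input features plus
    all earlier labels' predictions, capturing inter-label dependencies."""
    preds = []
    for clf in classifiers:
        preds.append(clf(x + preds))  # append earlier predictions as features
    return preds

# Toy binary classifiers over [feature0, feature1, earlier predictions...]:
clf_outdoor = lambda f: 1 if f[0] > 0.5 else 0           # "outdoor" from feature 0
clf_sky     = lambda f: 1 if f[1] > 0.5 and f[2] else 0  # "sky" only if "outdoor" was predicted

print(chain_predict([0.9, 0.8], [clf_outdoor, clf_sky]))  # -> [1, 1]
print(chain_predict([0.2, 0.8], [clf_outdoor, clf_sky]))  # -> [0, 0]
```

Note how the second label flips when the first does, even with identical feature 1: this dependency is exactly what an independent per-label classifier would miss, and also why cyclical label relationships cannot be expressed in a single chain.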
Language models can perform rudimentary forms of reasoning [193], as demonstrated by Google’s PaLM, which can explain novel jokes and generate counterfactual scenarios [194]. LaMDA and PaLM have shown improved reasoning capabilities by learning from chain-of-thought prompts generated by the models themselves [195,196,197,198]. The paper Large Language Models are Zero-Shot Reasoners highlights that simply adding “let’s think step by step” as a prompt prior to an output of an answer from GPT-3 increases the accuracy on the mathematical problem sets MultiArith and GSM8K from 17.7% to 78.7% and from 10.4% to 40.7%, respectively [199].
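The zero-shot chain-of-thought technique amounts to appending a fixed trigger phrase to the question before querying the model; a minimal sketch of the prompt construction (the `Q:`/`A:` framing is an illustrative convention):

```python
def zero_shot_cot(question):
    """Frame a question with the zero-shot chain-of-thought trigger phrase,
    prompting the model to emit intermediate reasoning before its answer."""
    return f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot("If I have 3 apples and buy 2 more, how many do I have?"))
```

The model's completion then contains the reasoning steps; in practice a second prompt extracts the final answer from that reasoning.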
The paper “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models” [200] shows how multi-step reasoning tasks can be solved with reoriented prompts, achieving 99.7% success on the SCAN benchmark, compared to ~16% with other prompting methods. This method reduces a complex problem into a list of subproblems, then sequentially solves them using answers to previously solved subproblems.
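Least-to-most decomposition can be illustrated on the last-letter-concatenation task: each subproblem (the first k words) extends the answer to the previous subproblem by one step. In the paper the language model solves each subproblem via prompting; plain code stands in for the model here:

```python
def solve_least_to_most(words):
    """Least-to-most sketch on last-letter concatenation: solve
    ever-longer prefixes, reusing each earlier answer."""
    answer = ""
    for w in words:             # subproblem k: the first k words
        answer = answer + w[-1]  # extend the previous answer by one step
    return answer

print(solve_least_to_most(["think", "machine", "learning"]))  # -> keg
```

The key property is that subproblem k is posed together with the already-obtained answer to subproblem k−1, so the model never has to solve the full composition in one leap.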
InstructGPT [57,201,202,203] uses human feedback to fine-tune outputs and improve corrigibility. Blender 3 [204] learns from public interactions via a chat interface. Favorable results have been obtained with only 100 samples of human-written feedback, fine-tuning a GPT-3 model to human-level summarization [205,206]. Models can also adopt cultural practices through observation alone, with no further feedback or training data [18,207]. CM3 [208] is trained on structured multimodal documents and can generate new images and captions, infill images or text, and disambiguate entities. FLAVA [209] is jointly trained to do over 35 tasks across domains, including image and text recognition, and joint text-image tasks.
GPT-3 enabled generalization from few datapoints without retraining [210], whereas DeepMind’s agents have learned to barter, adjust production and pricing, and discover arbitrage from scratch [211]. Techniques by Google enable observation and inference from human and animal behavior to develop skills for robotic agents [212]. However, Armstrong et al. (2019) argue that simple heuristics lacking normative references do not generalize effectively to modelling human behavior [213]. OpenAI has extended GPT-3 to perform web research, potentially improving reasoning capabilities and keeping models up to date [214], whereas DeepMind’s Gopher system has demonstrated improved focus on topics and increased accuracy of answers compared to GPT-3 [215,216,217]. External repositories can be appended to Transformer models to extend attention length, with a retrieval mechanism using the same keys and queries trained by the attention layers enabling more sophisticated outputs comparable to a model five to ten times larger. Newly acquired information can be referenced immediately without updating the network weight matrices [218].
Evolution through Large Models leverages evolutionary algorithms to improve language models by bootstrapping competence [219]. Self-Supervised Learning has also been used to solve tasks with prediction error as an intrinsic reward [220]. Sorscher et al., 2022, at Meta proposed a scalable self-supervised dataset pruning metric, which may reduce the resource costs of deep learning by improving the tradeoff between dataset size and training time [221]. This self-supervised pruning metric applies k-means clustering and scores each example by its distance from the closest cluster centroid, pruning accordingly. The Stable Diffusion image generation model compressed over 100 TB of images into 4.2 GB [222], and 1.8 GB of baked imagery into 200 kB worth of neural networks expressed through fragment shaders [223].
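The pruning metric described above can be sketched by scoring each example by its distance to the nearest cluster centroid and keeping only the hardest fraction. Centroids are given here rather than fitted by k-means, and whether hard examples are kept or discarded depends on data abundance in the cited work; this is a simplified sketch:

```python
import numpy as np

def prune_by_centroid_distance(X, centroids, keep_fraction=0.5):
    """Score each example by its distance to the nearest cluster centroid
    (a proxy for difficulty) and keep only the hardest fraction."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    score = d.min(axis=1)              # distance to the closest centroid
    k = int(len(X) * keep_fraction)
    keep = np.argsort(score)[-k:]      # farthest from any centroid = hardest
    return np.sort(keep)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 100 examples, 8 features
centroids = rng.normal(size=(5, 8))    # 5 cluster centroids
kept = prune_by_centroid_distance(X, centroids)
print(len(kept))  # -> 50
```

Because the metric needs only embeddings and cluster assignments, it requires no labels, which is what makes the pruning self-supervised.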
Generative models can reconstruct an image from a seed and a prompt of key features, enabling efficient ‘compression’ when paired with a client reference model [224,225]. Extrapolation from shorter problem instances to solve more complex ones enables out-of-distribution generalization in reasoning tasks. Certain skills, such as length generalization, can be learned more effectively via in-context learning rather than fine-tuning, even with infinite data [226]. Researchers suggest generalization can occur beyond language domains, into pure statistical patterns, perhaps akin to a universal grammar [227,228]. It is hypothesized that such a process assists with the learning of priors which can link between modalities, and that “within today’s gigantic and notoriously data-hungry language models is a sparser, far more efficient architecture trying to get out”.
