AI - natural language processing (NLP)
Introduction
Basic steps
- get the textual data
- tokenize the data
- i.e. break down the text into individual words (or possibly sub-word chunks)
- vectorize the tokens
- i.e. assign a numerical value to each distinct token
- split data into training, validation and test datasets
- convert these into tensors
- define a neural network model
- create an instance of that model
- load the training and validation data into the model
- train the model
- evaluate the model against the test dataset
- accuracy, loss function, time taken
- tweak the model as needed
- in the case of advanced models such as ChatGPT, further training with supervised fine-tuning and reward-based reinforcement learning
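A minimal sketch of these basic steps using PyTorch, assuming a tiny made-up dataset, a whitespace tokenizer and a toy embedding-plus-linear classifier - all texts, labels and names below are illustrative placeholders rather than a reference implementation (the validation split is omitted for brevity):
<code python>
import torch
import torch.nn as nn

# 1. get the textual data (tiny made-up sentiment examples)
texts = ["good film", "great story", "bad film", "awful story", "great film", "bad story"]
labels = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]   # 1 = positive, 0 = negative

# 2. tokenize: break each text into individual words
tokenized = [t.split() for t in texts]

# 3. vectorize: assign a numerical id to each distinct token
vocab = {tok: i for i, tok in enumerate(sorted({w for doc in tokenized for w in doc}))}
ids = [[vocab[w] for w in doc] for doc in tokenized]

# 4./5. split into training and test sets and convert to tensors
x = torch.tensor(ids)          # every toy example here is exactly 2 tokens long
y = torch.tensor(labels)
x_train, y_train, x_test, y_test = x[:4], y[:4], x[4:], y[4:]

# 6. define a neural network model: embedding layer + linear classifier
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, 1)

    def forward(self, batch):
        # average the token embeddings, then score the document
        return self.fc(self.embed(batch).mean(dim=1)).squeeze(-1)

# 7.-9. create an instance, feed in the training data and train
model = TinyClassifier(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

# 10. evaluate against the held-out test data (loss and accuracy)
with torch.no_grad():
    preds = (torch.sigmoid(model(x_test)) > 0.5).float()
    print("test loss:", loss_fn(model(x_test), y_test).item())
    print("test accuracy:", (preds == y_test).float().mean().item())
</code>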
A simplified history of natural language processing
pre-computer mathematical developments
- 1795-1805: linear regression first published
- Adrien-Marie Legendre and Carl Friedrich Gauss independently utilise the least-squares method of linear regression to approximate solutions to complex problems; least-squares fitting remains a core component of many neural network models
- 19th century - invention of tensors and tensor calculus
- these are key to manipulating massive amounts of data for neural network training
- theory of algebraic forms and invariants developed during the middle of the 19th century
- concepts of later tensor analysis arose from the work of Carl Friedrich Gauss in differential geometry
- Josiah Willard Gibbs introduced dyadic and polyadic algebra, which are also tensors in the modern sense
- 1890: Gregorio Ricci-Curbastro develops tensor calculus under the title absolute differential calculus; in Ricci's notation he refers to “systems” with covariant and contravariant components, which are known as tensor fields in the modern sense
- 1915, Einstein's Theory of General Relativity is formulated completely in the language of tensors (although Einstein apparently struggled with tensor concepts) and this popularised use of tensors
- 1920s, it was realised that tensors play a basic role in algebraic topology (for example in the Künneth theorem)
- 1960s: tensors are generalized within category theory by means of the concept of a monoidal category
- NB. tensors in PyTorch are similar concepts but not the same!
N-gram models
- a purely statistical model of language based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if n-1 words, an n-gram model (see the toy counting sketch after this list)
- Jelinek and Mercer, 1980 developed an interpolated or smoothed trigram model
- superseded by recurrent neural network-based models, but n-gram techniques are still utilised as components of some large language model pipelines
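A toy illustration of the bigram idea in plain Python, assuming a small made-up corpus - counts of adjacent word pairs give the maximum-likelihood estimate of the probability of the next word given the previous word:
<code python>
# Toy bigram model: P(next word | previous word) estimated from pair counts.
# The corpus below is a made-up example.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# count how often each word follows each previous word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    # maximum-likelihood estimate: count(prev, nxt) / count(prev, anything)
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2 of the 4 words following "the" are "cat" -> 0.5
</code>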
Early neural network approaches and development
- Dec 1989, implementation of the Python programming language begins
- Deerwester et al., 1990:
- feature vectors for words are learned on the basis of their probability of co-occurring in the same documents (Latent Semantic Indexing - LSI)
- Miikkulainen and Dyer, 1991:
- an early use of neural networks for language modeling
- Schutze, 1993
- idea of using a vector-space representation for words in information retrieval
- Schmidhuber, 1996:
- proposed character-based text compression using neural networks to predict the probability of the next character
- Brown et al., 1992, Pereira et al., 1993, Niesler et al., 1998, Baker and McCallum, 1998
- approaches based on learning a clustering of words, in which each word is associated deterministically or probabilistically with a discrete class and words in the same class are similar in some respect
- Riis and Krogh, 1996
- idea of a vector-space representation for symbols in the context of neural networks framed in terms of a parameter sharing layer for secondary structure prediction
- Bellegarda (1997)
- successfully used a continuous representation for words in the context of an n-gram based statistical language model, using LSI to dynamically identify the topic of discourse
- Jensen and Riis, 2000
- further developed the idea of a vector-space representation for symbols in the context of neural networks framed in terms of a parameter sharing layer and extended it to text-to-speech mapping
- Xu and Rudnicky (2000)
- independently proposed the idea of using a neural network for language modeling, although their experiments used networks without hidden units and a single input word, which limits the model to essentially capturing unigram and bigram statistics
- Bengio and Bengio, 2000:
- found that neural networks are useful for modelling high-dimensional discrete distributions, learning the joint probability of a set of random variables Z1, …, Zn, each possibly of a different nature
- Oct 2000, Python 2.0 released
- 2002, Torch introduced as an open-source machine learning library, a scientific computing framework, and a scripting language based on Lua.
- Torch development moved in 2017 to PyTorch
- Bengio et al., 2003:
- developed the “Neural Probabilistic Language Model”, which represents words as continuous feature vectors learned within the network and can be embedded as a neural network layer
- see a 2023 example of using this model AI - deep learning text data using TensorFlow and NLPM
- Python 3.0, released December 2008
- Nov 2015: Google releases TensorFlow for public use - it grew out of DistBelief, a proprietary machine learning system developed for Google Brain from 2011
- 2015, Keras introduced by François Chollet as an open-source library that provides a Python interface for artificial neural networks, acting as an interface for the TensorFlow library
- has support for convolutional and recurrent neural networks and for common utility layers such as dropout, batch normalization and pooling
- also allows users to deploy deep models on smartphones (iOS and Android), on the web, or on the Java Virtual Machine
- allows distributed training of deep-learning models on clusters of graphics processing units (GPU) and tensor processing units (TPU)
- 2016, Google announced its Tensor Processing Unit (TPU), an application-specific integrated circuit (ASIC, a hardware chip) built specifically for machine learning and tailored for TensorFlow
- 2016, Facebook AI Research Group creates PyTorch from Torch; Caffe's creator Yangqing Jia joins Facebook, where the successor framework Caffe2 is later developed
- however the models defined by the two frameworks (PyTorch and Caffe2) were mutually incompatible
- NB. Caffe (Convolutional Architecture for Fast Feature Embedding) was created by Yangqing Jia during his PhD at UC Berkeley and supports many different types of deep learning architectures geared towards image classification and image segmentation, including CNN, RCNN, LSTM and fully-connected neural network designs
- 2017, Facebook and Microsoft create the Open Neural Network Exchange (ONNX) project so that models can be converted between frameworks such as Caffe2 and PyTorch
- Caffe2 was announced in April 2017 and included new features such as recurrent neural networks (RNNs)
- 2017, Transformer neural network model published - see below!
- 2018, PyTorch released with Caffe2 merged in
- with Caffe2 now embedded, models can be converted between frameworks via ONNX
- PyTorch Tensors are similar to NumPy Arrays, but can also be placed and operated on a CUDA-capable NVIDIA GPU (see the short sketch at the end of this history list)
- PyTorch competes with Google's TensorFlow and offers:
- tensor computing (like NumPy) with strong acceleration via graphics processing units (GPU)
- deep neural networks built on a tape-based automatic differentiation system
- NB. the term “tensor” here does not carry the same meaning as tensor in mathematics or physics. The meaning of the word in machine learning is only tangentially related to its original meaning as a certain kind of object in linear algebra.
- 2022, governance of PyTorch passed to the newly created PyTorch Foundation
- this is an independent organization and a subsidiary of the Linux Foundation; PyTorch remains free and open-source software
- Nov 2022, OpenAI releases ChatGPT (initially based on GPT-3.5) for public use
- Mar 2023, PyTorch 2.0 released
- Aug 2023, OpenAI announces GPTBot, a web crawler that collects internet content for use in further ChatGPT training
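A short sketch of the PyTorch tensor and tape-based autograd features mentioned in the 2018 PyTorch entry above - the values are arbitrary and a GPU is optional (the code falls back to CPU if CUDA is unavailable):
<code python>
# Sketch of PyTorch tensors and tape-based automatic differentiation.
import torch

# tensors behave much like NumPy arrays...
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# ...but can also be moved to a CUDA-capable NVIDIA GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
a = a.to(device)

# tape-based autograd: operations on tensors with requires_grad=True are
# recorded so that gradients can be computed automatically by backpropagation
w = torch.ones(2, 2, device=device, requires_grad=True)
loss = ((a * w).sum() - 12.0) ** 2
loss.backward()
print(w.grad)   # d(loss)/dw = 2 * (sum - 12) * a, computed automatically
</code>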
The transformer model approach changed everything and allowed Large Language Models
- artificial intelligence natural language processing (NLP) has been revolutionised by the introduction in 2017 of the transformer model, which uses positional encoding, self-attention and cross-attention 1)
- earlier AI text processing using recurrent neural networks and similar approaches primarily found patterns in characters or words without wider context and then compared these with outcomes
- examples of this are ascertaining whether email is spam or not, or whether an online review was likely to be positive or negative based upon the words used
- the initial level of transformer-based NLP is analysing massive amounts of text in order to produce grammatically similar text - albeit potentially without meaning or context
- cross-attention took this to a different level, allowing the approach to be used to learn how to translate between languages or how to respond to questions, as in the case of ChatGPT, which matured through 2023 with version 3.5 and later versions
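A compact sketch of scaled dot-product self-attention, the core operation of the transformer - the dimensions and random data are arbitrary placeholders, and a real transformer adds multiple heads, positional encoding, masking and feed-forward layers:
<code python>
# Minimal scaled dot-product self-attention over a toy sequence.
import math
import torch

seq_len, d_model = 4, 8                 # arbitrary toy dimensions
x = torch.randn(seq_len, d_model)       # embeddings for a 4-token sequence

# learned projections to queries, keys and values (random here for brevity)
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

q, k, v = x @ w_q, x @ w_k, x @ w_v

# each token attends to every token: similarity scores, scaled and softmaxed
scores = q @ k.T / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
output = weights @ v                      # context-aware representation per token

print(weights.shape, output.shape)        # (4, 4) and (4, 8)
</code>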