
Gemma explained: An overview of Gemma model family architectures

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

Different versions of Gemma are designed for different use cases and modalities, such as:

  • Single modality (Text in, Text out)
  • Specialization for coding use cases
  • Multi modality (Text and Image in, Text out)
  • Various sizes for different hardware types, inference needs, and other constraints.
  • “Novel” architectures

Because all these models share a similar DNA, the Gemma family offers a unique way to learn about the architectures and design choices available in modern LLM systems. We hope this contributes to a rich ecosystem of open models and promotes a greater understanding of how LLM systems work.

This series will cover:

  • Gemma 1 (2B, 7B) – Transformer based text-to-text models.
  • CodeGemma (2B and 7B) – A fine-tuned version of Gemma, optimized for code completion and generation.
  • Gemma 2 (2B, 9B, 27B) – Updated text-to-text models trained with a newer architecture, with the 2B and 9B versions trained via distillation from larger models.
  • RecurrentGemma (2B, 9B) – A model built on the novel Griffin architecture. This architecture uses a mixture of local attention and linear recurrences to achieve fast inference when generating long sequences.
  • PaliGemma (3B) – A vision-language model that can take in text and images and provide a text output.

How to use this guide

In this series, we will

  • Collate the specific architectures of various models
  • Explain how these parameters affect model generations (e.g. num embeddings, Multi Query vs Multi Head vs Grouped Query)
  • Provide code examples of the models for further exploration

To get information about a model, we print it with Hugging Face Transformers, as in the simple code below.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
print(model)

You can explore inside the model with torchinfo, or with summary() in the Keras Model class API, as well.
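For instance, a minimal torchinfo sketch might look like the following (this assumes torchinfo is installed via pip install torchinfo; the exact summary layout depends on your version):

from torchinfo import summary
from transformers import AutoModelForCausalLM

# Load the pretrained model, then print a layer-by-layer summary with parameter counts
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
summary(model, depth=3)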

What this guide is not

This guide is not an introduction to AI. It assumes working knowledge of neural networks, Transformers and associated terms like tokens. If you need a refresher on these concepts, here are some resources to get you started:

A hands-on neural network learning tool that works in the browser

An introduction to transformers

Gemma

Gemma is an open weights LLM. It comes in both instruction-tuned and raw, pretrained variants at various parameter sizes. It is based on the LLM architecture introduced by Google Research in the Attention Is All You Need paper. Its primary function is to generate text token by token, based on a prompt provided by a user. In tasks like translation, Gemma takes a sentence from one language as input and outputs its equivalent in another language.
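As a quick illustration of this token-by-token generation loop, here is a minimal sketch using Hugging Face Transformers (the prompt and max_new_tokens value are arbitrary examples):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# Tokenize a prompt, generate a continuation token by token, and decode it back to text
inputs = tokenizer("Translate to French: Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))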

As you will soon see, Gemma is a great model on its own, but it also lends itself to custom extensions to meet different user needs.

Gemma Architecture

First, let’s look at the transformer decoder that Gemma models are based on.

Transformer decoder architecture

Unlike the original encoder-decoder transformer model architecture introduced in “Attention Is All You Need”, Gemma is solely a “decoder-only” model.

The core parameters of the architecture are summarized in the table below.

Core parameters of the architecture

Models are trained on a context length of 8192 tokens. This means they can process up to approximately 6144 words (using the rule of thumb of 100 tokens ~= 75 words) at a time.

It is worth noting that the practical input limit can vary based on the task and usage. This is because text generation consumes tokens within the context window, effectively reducing space for new input. Although the technical input limit remains constant, generated output becomes part of the subsequent input, influencing further generations.


d_model (2B: 2048, 7B: 3072)

d_model represents the size of the embeddings (vector representations of words or subwords, a.k.a. tokens) used as input to the decoder. It also determines the size of the internal representation within the decoder layers.

d_model x Num heads x Head size

“d_model x Num heads x Head size” defines the number of parameters in self_attn.

A larger d_model value means the model has more “space” to represent the nuances of different words and their relationships. This can lead to better performance, especially for complex language tasks. However, increasing d_model also makes the model larger and more computationally expensive to train and use.
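To make the formula above concrete, here is a back-of-the-envelope sketch for Gemma 7B (values from the table above; it counts a single projection such as q_proj, not the whole attention block):

# Gemma 7B values from the table above
d_model = 3072
num_heads = 16
head_size = 256

# Each query projection is a d_model x (num_heads * head_size) weight matrix
q_proj_params = d_model * num_heads * head_size
print(q_proj_params)  # 12582912, i.e. a 3072 x 4096 matrix as seen in the printout below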

Layers (2B: 18, 7B: 28)

Transformers consist of multiple stacked layers. Deeper models have more layers, and therefore more parameters, and can learn more intricate patterns. However, these extra parameters mean they are also more prone to overfitting and require more computational resources.

This augmented representational capacity might result in the model learning noise or specific training data patterns that fail to generalize to novel examples.

Moreover, deeper models often require more training data to avoid overfitting. In cases where available data is limited, the model might lack sufficient examples to learn a generalizable representation, leading to memorization of the training data instead.

Feedforward hidden dims (2B: 32768, 7B: 49152)

Each Transformer layer includes a feedforward network after the attention mechanism. This network has its own dimensionality, often larger than the d_model size, to increase the model’s expressive power.

It is implemented as a multi-layer perceptron (MLP), a kind of neural network, to further transform the embeddings and extract more intricate patterns.

multi-layer perceptron (MLP) neural network architecture

In Gemma, the standard ReLU non-linearity is replaced by the GeGLU activation function, a variation of GLU (Gated Linear Unit). GeGLU divides the activation into two parts: a sigmoidal part and a linear projection. The output of the sigmoidal part is element-wise multiplied with the linear projection, resulting in a non-linear activation function.

GeGLU activation function example
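A rough sketch of this gated feedforward block is shown below. It mirrors the gate_proj / up_proj / down_proj layers in the GemmaMLP printout later in this post, using a GELU non-linearity on the gate branch; the real Transformers implementation may differ in details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GegluFeedForward(nn.Module):
    """Sketch of a GeGLU-style feedforward block (illustrative, not the exact Gemma code)."""
    def __init__(self, d_model: int = 3072, hidden: int = 24576):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden, bias=False)  # gated branch
        self.up_proj = nn.Linear(d_model, hidden, bias=False)    # linear projection branch
        self.down_proj = nn.Linear(hidden, d_model, bias=False)  # back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the activated gate and the linear branch
        return self.down_proj(F.gelu(self.gate_proj(x), approximate="tanh") * self.up_proj(x))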

Num heads (2B: 8, 7B: 16)

Each Transformer layer contains multiple attention mechanisms working in parallel. These “heads” allow the model to focus on different aspects of the input sequence simultaneously. Increasing the number of heads can enhance the model’s capacity to capture diverse relationships in the data.

Num KV heads (2B: 1, 7B: 16)

The 7B model uses multi-head attention (MHA), while the 2B model uses multi-query attention (MQA). MQA shares the same key and value projections, which means each head attends over the same underlying representation but with different query projections.

The original MHA offers richer representation learning but comes with higher computational costs. MQA provides an efficient alternative that has been shown to be effective.
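One way to see the difference is to compare the key/value projection widths implied by each scheme, as in this small sketch (numbers taken from the tables above and consistent with the model printouts below):

def kv_proj_out_features(num_kv_heads: int, head_size: int) -> int:
    """Output width of k_proj / v_proj given the number of key/value heads."""
    return num_kv_heads * head_size

print(kv_proj_out_features(num_kv_heads=16, head_size=256))  # 7B, MHA: 4096
print(kv_proj_out_features(num_kv_heads=1, head_size=256))   # 2B, MQA: 256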


Head size (2B: 256, 7B: 256)

It refers to the dimensionality of each attention head within the multi-head attention mechanism. It is calculated by dividing the embedding dimension by the number of heads. For example, if the embedding dimension is 2048 and there are 8 heads, then each head has a size of 256.


Vocab size (2B: 256128, 7B: 256128)

It defines the number of unique tokens (words, subwords or characters) that the model understands and can process. The Gemma tokenizer is based on SentencePiece. The size of the vocabulary is predetermined before training; SentencePiece then learns the optimal subword segmentation based on the chosen vocabulary size and the training data. Gemma’s large 256k vocabulary allows it to handle diverse text inputs and potentially improve performance on various tasks, e.g. handling multilingual text inputs.
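You can inspect the tokenizer yourself; here is a minimal sketch (the exact token pieces you see depend on the tokenizer files you download):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")

# Vocabulary size and a sample subword segmentation
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Gemma is a family of lightweight open models."))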


Gemma 7B

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=256000, bias=False)
)

Gemma 7B architecture

embed_tokens (Embedding Layer)

This layer converts the input tokens (words or subwords) into dense numerical representations (embeddings) that the model can process. It has a vocabulary size of 256,000 and creates embeddings of size 3072.

layers

This is the heart of the model, consisting of 28 stacked GemmaDecoderLayer blocks. Each of these layers refines the token embeddings to capture complex relationships between words and their context.

self_attn

In the self-attention mechanism, the model assigns different weights to the words in the input when producing the next word. Leveraging a scaled dot-product attention mechanism, the model employs linear projections (q_proj, k_proj, v_proj, and o_proj) to generate query, key, value, and output representations.

All out_features values are the same 4096 for q_proj, k_proj and v_proj, as this model uses Multi Head Attention (MHA). There are 16 heads with a size of 256 in parallel, totaling 4096 (256 x 16).

Additionally, the model leverages positional information more effectively by using rotary_emb (GemmaRotaryEmbedding) for positional encoding (a.k.a. RoPE).

Finally, the o_proj layer projects the attention output back to the original dimension (3072).


Note that the Gemma 2B model uses Multi Query Attention (MQA).

Multi-Query Attention (MQA) architecture used in Gemma 2B model

k_proj and v_proj share a single head with a size of 256, resulting in out_features of 256. In contrast, q_proj and o_proj have 8 heads (256 x 8 = 2048) in parallel.

(self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )

mlp

It uses gate_proj and up_proj for a gating mechanism, followed by down_proj to reduce the dimension back to 3072.

input_layernorm, post_attention_layernorm and norm

These normalization layers stabilize training and improve the model’s ability to learn effectively.

lm_head

This final layer maps the refined embeddings (3072) back to a probability distribution over the vocabulary space (256000) for the next token.
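To see this final step in isolation, here is a hedged sketch of a single greedy decoding step (generate() runs this loop for you; the prompt is just an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, 256000)
next_token_id = int(logits[0, -1].argmax())  # greedy pick over the vocabulary
print(tokenizer.decode([next_token_id]))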

CodeGemma (2B and 7B)

CodeGemma models are fine-tuned Gemma models that are optimized for code completion and coding chat assistance. CodeGemma models are trained on more than 500 billion tokens of primarily code. In addition, CodeGemma adds fill-in-the-middle capability, allowing completions that occur between two pieces of existing text.

CodeGemma highlights the fine-tunability of the Gemma checkpoints. Through additional training, the models become specialized at a certain task, learning a more complex kind of completion than pure suffix completion.

CodeGemma Usage

You can use 4 user-defined tokens – 3 for FIM and a “<|file_separator|>” token for multi-file context support.

BEFORE_CURSOR = "<|fim_prefix|>"
AFTER_CURSOR = "<|fim_suffix|>"
AT_CURSOR = "<|fim_middle|>"
FILE_SEPARATOR = "<|file_separator|>"

Imagine that you are trying to complete the code like the screen below.

Code snippet example - CodeGemma (2B and 7B)

The input prompt should look like this:

<|fim_prefix|>import <|fim_suffix|>if __name__ == "__main__":\n    sys.exit(0)<|fim_middle|>

The model will provide “sys” as the suggested code completion.
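Putting the pieces together, here is a hedged sketch of how such a fill-in-the-middle prompt could be assembled and run with Transformers (the checkpoint name google/codegemma-2b and the generation settings are assumptions for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

BEFORE_CURSOR = "<|fim_prefix|>"
AFTER_CURSOR = "<|fim_suffix|>"
AT_CURSOR = "<|fim_middle|>"

tokenizer = AutoTokenizer.from_pretrained("google/codegemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/codegemma-2b")

# Code before and after the cursor position
prefix = "import "
suffix = 'if __name__ == "__main__":\n    sys.exit(0)'

prompt = f"{BEFORE_CURSOR}{prefix}{AFTER_CURSOR}{suffix}{AT_CURSOR}"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
# The completion (expected: "sys") follows the prompt tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))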

You can explore more about CodeGemma in CodeGemma / Quickstart.

What’s Next?

This article discussed the Gemma architecture.

In our next series of posts, you will explore the newest model, Gemma 2. With substantial enhancements in safety measures, this model surpasses its predecessor in terms of performance and efficiency during inference.

Stay tuned and thanks for reading!


References

Papers

Code Examples

Gemma

CodeGemma

📋 The complete Gemma architecture series

  • Gemma explained: An overview of Gemma model family architectures
  • Gemma explained: What’s new in Gemma 2
  • Gemma explained: RecurrentGemma architecture
  • Gemma explained: PaliGemma architecture