Quick introduction to Large Language Models for Android developers

Posted by Thomas Ezan, Sr Developer Relation Engineer

Android has supported traditional machine learning models for years. Frameworks and SDKs like LiteRT (formerly known as TensorFlow Lite), ML Kit and MediaPipe enabled developers to easily implement tasks like image classification and object detection.

In recent years, generative AI (gen AI) and large language models (LLMs) have opened up new possibilities for language understanding and text generation. We have lowered the barriers to integrating gen AI features into your apps, and this blog post will give you the high-level knowledge you need to get started.

Before we dive into the specifics of generative AI models, let’s take a high-level look at how machine learning (ML) differs from traditional programming.

Machine learning as a new programming paradigm

A key difference between traditional programming and ML lies in how solutions are implemented.

In traditional programming, developers write explicit algorithms that take input and produce a desired output.

A flow chart showing the process of machine learning model training. Input data is fed into the training process, resulting in a trained ML model

Machine learning takes a different approach: developers provide a large set of previously collected input data and the corresponding output, and the ML model is trained to learn how to map the input to the output.

A flow chart illustrating the machine learning model training. This step is labeled above the process '1. Train the model with a large set of input and output data'. Below, arrows labeled 'Input' and 'Output' point to a green box labeled 'ML Model Training'.  Another arrow points away from the box and is labeled 'ML Model'.

Then, the model is deployed in the cloud or on-device to process new input data. This step is called inference.

A flow chart illustrating inference with a trained ML model. This step is labeled above the process '2. Deploy the model to run inferences on input data'. Below, an arrow labeled 'Input' points to a green box labeled 'Run ML Inference'. Another arrow points away from the box and is labeled 'Output'.

This paradigm enables developers to tackle problems that were previously difficult or impossible to solve with rule-based programming.
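To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration: the temperature-conversion task, the function names and the training data are all assumptions chosen for this example, and a real ML model would be far larger than a two-parameter line fit.

```python
# Traditional programming: the developer writes the rule explicitly.
def celsius_to_fahrenheit_rule(celsius):
    return celsius * 9 / 5 + 32


# Machine learning: the rule is learned from (input, output) examples.
def train_linear_model(inputs, outputs):
    """Fit output = weight * input + bias by ordinary least squares."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    weight = covariance / variance
    bias = mean_y - weight * mean_x
    return weight, bias


def run_inference(model, value):
    """Apply the trained model to a new input."""
    weight, bias = model
    return weight * value + bias


# 1. Train the model with previously collected input and output data
#    (hypothetical measurements made up for this sketch).
training_inputs = [-40.0, 0.0, 10.0, 25.0, 100.0]
training_outputs = [-40.0, 32.0, 50.0, 77.0, 212.0]
model = train_linear_model(training_inputs, training_outputs)

# 2. Deploy the model and run inference on input it has not seen.
prediction = run_inference(model, 37.0)
print(round(prediction, 1))
```

Nobody typed the conversion formula into the second half of the script. The model recovered it from the examples, which is the idea behind the two steps described above.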

Traditional machine learning vs. generative AI on Android

Traditional ML on Android includes tasks such as image classification, which can be implemented using MobileNet and LiteRT, or pose estimation, which can be easily added to your Android app with the ML Kit SDK. These models are often trained on specific datasets and perform extremely well on well-defined, narrow tasks.

Generative AI introduces the capability to understand inputs such as text, images, audio and video and generate human-like responses. This enables applications like chatbots, language translation, text summarization, image captioning, image or code generation, creative writing assistance, and much more.

Most state-of-the-art generative AI models, like the Gemini models, are built on the transformer architecture. To generate images, diffusion models are often used.

Understanding large language models

At its core, an LLM is a neural network model trained on massive amounts of text data. It learns patterns, grammar, and semantic relationships between words and phrases, enabling it to predict and generate text that mimics human language.

As mentioned earlier, most recent LLMs use the transformer architecture. It breaks down input into tokens, assigns numerical representations called “embeddings” (see Key concepts below) to these tokens, and then processes these embeddings through multiple layers of the neural network to understand the context and meaning.
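The first two steps of that pipeline, tokenization and embedding lookup, can be sketched roughly as follows. This is a toy illustration, not how any particular model works. The vocabulary, the two-dimensional vectors and the whitespace tokenizer are assumptions made up for this example, whereas real LLMs use learned subword tokenizers and much larger embedding tables.

```python
# Hypothetical toy vocabulary mapping each known token to an integer ID.
VOCABULARY = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

# Hypothetical embedding table: one small vector per token ID.
# In a real model these values are learned during training.
EMBEDDING_TABLE = [
    [0.0, 0.0],   # <unk>
    [0.1, 0.3],   # the
    [0.9, 0.2],   # cat
    [0.4, 0.8],   # sat
    [0.2, 0.5],   # on
    [0.8, 0.1],   # mat
]


def tokenize(text):
    """Split text into tokens (simplified: lowercase words)."""
    return text.lower().split()


def to_token_ids(tokens):
    """Map tokens to IDs, falling back to <unk> for unknown tokens."""
    return [VOCABULARY.get(token, VOCABULARY["<unk>"]) for token in tokens]


def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


tokens = tokenize("The cat sat on the mat")
token_ids = to_token_ids(tokens)
embeddings = embed(token_ids)

for token, token_id, vector in zip(tokens, token_ids, embeddings):
    print(token, token_id, vector)
```

The list of vectors produced by `embed` is what the layers of the neural network would then process.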

LLMs typically go through two main phases of training:

      1. Pre-training phase: The model is exposed to vast amounts of text from different sources to learn general language patterns and knowledge.

      2. Fine-tuning phase: The model is trained on specific tasks and datasets to refine its performance for particular applications.

Classes of models and their capabilities

Gen AI models come in various sizes, from smaller models like Gemini Nano or Gemma 2 2B, to massive models like Gemini 1.5 Pro that run on Google Cloud. The size of a model generally correlates with the capabilities and compute power required to run it.

Models are constantly evolving, with new research pushing the boundaries of their capabilities. These models are being evaluated on tasks like question answering, code generation, and creative writing, demonstrating impressive results.

In addition, some models are multimodal, which means that they are designed to process and understand information from multiple modalities, such as images, audio, and video, alongside text. This allows them to tackle a wider range of tasks, including image captioning, visual question answering, and audio transcription. Multiple Google generative AI models, such as Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Nano with Multimodality and PaliGemma, are multimodal.

Key concepts

Context Window

The context window refers to the number of tokens (converted from text, image, audio or video) the model considers when generating a response. For chat use cases, it includes both the current input and a history of past interactions. As a rule of thumb, 100 tokens is equal to about 60-80 English words. Gemini 1.5 Pro currently supports 2M input tokens, which is enough to fit the seven Harry Potter books… and more!
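As a rough worked example of that rule of thumb, the short script below converts a token budget into an approximate range of English words. The function name is made up for this sketch, and the ratio is only the approximation quoted above, since actual token counts depend on the tokenizer and the text.

```python
# Rule of thumb from the text: 100 tokens is about 60-80 English words.
WORDS_PER_100_TOKENS_LOW = 60
WORDS_PER_100_TOKENS_HIGH = 80


def estimate_word_range(token_count):
    """Return an approximate (low, high) number of English words."""
    low = token_count * WORDS_PER_100_TOKENS_LOW // 100
    high = token_count * WORDS_PER_100_TOKENS_HIGH // 100
    return low, high


context_window_tokens = 2_000_000  # the 2M input tokens mentioned above
low, high = estimate_word_range(context_window_tokens)
print(f"{context_window_tokens:,} tokens is roughly {low:,} to {high:,} English words")
```

By this estimate, a 2M-token context window corresponds to somewhere between 1.2 and 1.6 million English words.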

Embeddings

Embeddings are multidimensional numerical representations of tokens that accurately encode their semantic meaning and relationships within a given vector space. Words with similar meanings are closer together, while words with opposite meanings are farther apart.

The embedding process is a key component of an LLM. You can try it independently using MediaPipe Text Embedder for Android. It can be used to identify relations between words and sentences and implement a simplified semantic search directly on-device.

A 3-D graph plots 'Man' and 'King' in blue and 'Woman' and 'Queen' in green, with arrows pointing from 'Man' to 'Woman' and from 'King' to 'Queen'.

A (very) simplified representation of the embeddings for the words “king”, “queen”, “man” and “woman”
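The same idea can be sketched in a few lines of Python. The three-dimensional vectors below are hand-picked assumptions for illustration only. Real embeddings are learned, have hundreds or thousands of dimensions, and would come from a model such as the one behind MediaPipe Text Embedder rather than from a hard-coded table.

```python
import math

# Hypothetical hand-picked vectors. The second axis loosely encodes
# "gender" and the third "royalty"; real embeddings are not this tidy.
EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector):
    """Return the word whose embedding is most similar to the given vector."""
    return max(EMBEDDINGS, key=lambda word: cosine_similarity(vector, EMBEDDINGS[word]))


# The classic analogy: king - man + woman lands closest to queen.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
nearest = nearest_word(analogy)
print(nearest)
```

A simplified on-device semantic search works the same way: embed the query, embed each candidate text, and rank the candidates by cosine similarity.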

Top-K, Top-P and Temperature

Parameters like Top-K, Top-P and Temperature enable you to control the creativity of the model and the randomness of its output.

Top-K filters the candidate tokens for output. For example, a Top-K of 3 keeps only the three most probable tokens. Increasing the Top-K value will increase the randomness of the model response (learn about Top-K parameter).

Top-P then adds a second filtering step: tokens are selected from most to least probable until the sum of their probabilities reaches the Top-P value. Lower Top-P values result in less random responses, and higher values result in more random responses (learn about Top-P parameter).

Finally, Temperature controls how randomly the model samples from the tokens that remain. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative results (learn about Temperature).
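The three steps can be sketched as a small sampling function. This is a simplified illustration under stated assumptions, not the implementation used by Gemini or any other model: the logits and all function names are made up for this example, and real inference engines may apply these parameters in a different order or with different edge-case handling.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits.values())
    exps = {token: math.exp(score - highest) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def filter_top_k(probs, top_k):
    """Keep only the top_k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_k])


def filter_top_p(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, probability in ranked:
        kept[token] = probability
        cumulative += probability
        if cumulative >= top_p:
            break
    return kept


def sample_with_temperature(probs, temperature, rng):
    """Sample one token; a temperature of 0 always picks the most probable one."""
    if temperature == 0:
        return max(probs, key=probs.get)
    best = max(probs.values())
    # Raising to 1/temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    weights = {token: (p / best) ** (1.0 / temperature) for token, p in probs.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def pick_next_token(logits, top_k, top_p, temperature, rng):
    probs = softmax(logits)
    probs = filter_top_k(probs, top_k)
    probs = filter_top_p(probs, top_p)
    return sample_with_temperature(probs, temperature, rng)


# Hypothetical scores for the token following "The cat sat on the".
logits = {"mat": 2.0, "sofa": 1.0, "roof": 0.5, "moon": -1.0}
rng = random.Random(42)
print(pick_next_token(logits, top_k=3, top_p=0.8, temperature=0.0, rng=rng))
print(pick_next_token(logits, top_k=3, top_p=0.8, temperature=1.0, rng=rng))
```

With these values, Top-K removes "moon", Top-P then removes "roof", and Temperature decides how often "sofa" is chosen over "mat". At a temperature of 0 the sketch always returns "mat".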

Fine-tuning

Iterating over several versions of a prompt to achieve an optimal response from the model for your use case isn’t always enough. The next step is to fine-tune the model by re-training it with data specific to your use case. You will then obtain a model customized to your application.

More specifically, Low-Rank Adaptation (LoRA) is a fine-tuning technique that makes LLM training much faster and more memory-efficient while maintaining the quality of the model outputs.
The process of fine-tuning open models via LoRA is well documented. See, for example, how you can fine-tune Gemini models through Google AI Studio without advanced ML expertise. You can also fine-tune Gemma models using the KerasNLP library.
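To give a sense of where LoRA's efficiency comes from, here is a back-of-the-envelope sketch. It relies on the general LoRA idea, which this post does not detail: the original weight matrix stays frozen, and only two small low-rank matrices are trained and added to it. The layer size and rank below are hypothetical values chosen for illustration.

```python
def full_finetune_parameters(d_out, d_in):
    """Trainable parameters when updating a full d_out x d_in weight matrix."""
    return d_out * d_in


def lora_parameters(d_out, d_in, rank):
    """Trainable parameters with LoRA: B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, not taken from any specific model.
d_out, d_in, rank = 2048, 2048, 8

full = full_finetune_parameters(d_out, d_in)
lora = lora_parameters(d_out, d_in, rank)
print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"Reduction factor: {full // lora}x")
```

For this single hypothetical layer, LoRA trains 32,768 parameters instead of 4,194,304, which is 128 times fewer. Fewer trainable parameters means less memory for gradients and optimizer state during training.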

The future of generative AI on Android

With ongoing research and optimization of LLMs for mobile devices, we can expect even more innovative gen AI-enabled features coming to Android soon. In the meantime, check out the other AI on Android Spotlight Week blog posts, and go to the Android AI documentation to learn more about how to power your apps with gen AI capabilities!