
A Beginner's Guide to Local Large Language Models (LLMs)

Brendan · About 6 min · Technology · ollama · llm

Important

This post is a work in progress and is currently incomplete. I hope to finish it soon when I have more time to experiment with LLMs.

Introduction

If you're intrigued by the world of machine learning but feel overwhelmed by its complexities, you're not alone. The field is rapidly evolving, with new information coming out every day. In this beginner's guide, I'll get you up to speed with the basic knowledge to work with local LLMs, exploring everything from model parameters and quantization methods to hardware. Whether you're a hobbyist or a student, this guide aims to provide a solid foundation for your adventure into the realm of local LLMs.

1. Basic Ideas

Large Language Models (LLMs) are sophisticated algorithmic constructs such as Transformers, RNNs (Recurrent Neural Networks), CNNs (Convolutional Neural Networks), and their hybrids, designed for processing and generating natural language. They learn by adjusting weights and biases, which occupy gigabytes of memory and are processed and optimized across thousands of parallel cores. During training, data such as extensive corpora of human language serve as samples or feedback for optimizing the weight values, using complex algorithms built around optimization methods such as gradient descent. This is where the bulk of the processing power is needed.

Once trained, a model is deployed to make inferences from various inputs. Inference is the aspect end users see most: communicating with a chatbot, for example, or deriving sentiment from a headline. Unlike training, inference is comparatively light on compute; it reuses the precalculated weights and runs a single forward pass to predict the next output token. In other words, you provide context, and the model predicts the next word by finding the best path through its weighted values. The most crucial point for operational practicality is that there are three primary stages in the life cycle of a model: architecture, training, and inference.

  • Architecture: The structural design and choice of algorithms.
  • Training: Learning from data, adjusting weights, and biases.
  • Inference: The application phase where the trained model makes predictions (performs tasks like text generation or sentiment analysis).
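To make the inference stage concrete, here is a minimal sketch of next-token prediction as a single forward pass through precomputed weights. The vocabulary and weights are made up and tiny; a real LLM has billions of parameters and many layers.

```python
import numpy as np

# Toy vocabulary and pretend "trained" weights (random values, purely illustrative).
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 8))    # token -> 8-dim vector
output_proj = rng.normal(size=(8, len(vocab)))  # hidden state -> score per vocab word

def predict_next(context_tokens):
    """Inference: one cheap forward pass through precomputed weights."""
    ids = [vocab.index(t) for t in context_tokens]
    hidden = embedding[ids].mean(axis=0)           # crude summary of the context
    logits = hidden @ output_proj                  # score every word in the vocabulary
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax into probabilities
    return vocab[int(np.argmax(probs))]

print(predict_next(["the", "cat"]))  # prints whichever word the toy model scores highest
```

The expensive part, learning good values for `embedding` and `output_proj`, happens during training; inference just reuses them.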

This article aims to provide a working knowledge of concepts like inference, context, quantization, and consumer hardware, along with other information critical to taking advantage of this fantastic technology locally. I will assume some prior knowledge of computer science; however, I won't explain anything you can easily research yourself or that is out of scope or too technical. I will point out what to investigate further for practical use cases, but I will avoid the topics of architecture and training.

The focus here is local deployment, meaning you can use these models directly on your machine without an internet connection. Local models offer several advantages, such as increased reliability (no cloud outages), enhanced privacy, and the freedom to customize your experience without limitations. Local models put control back in the user's hands, allowing you to tailor LLMs to your needs.


2. The Role of Consumer Hardware in LLMs

[DALL·E generated image: GPUs and Apples flying in every direction]

Ensuring you have the proper hardware is the first step in your journey with Local Large Language Models (LLMs). The primary challenge in LLM inferencing is the need for significant amounts of high-speed memory and at least a few parallel processing cores (though parallel cores matter much less for inferencing than training).

Key Consumer Hardware Options for LLMs:

  • Nvidia GPUs (Consumer and Workstation):

    • Consumer Models: Nvidia's gaming GPUs, like the RTX 3090 and 4090, are popular for their substantial parallel processing power and high memory capacities.

      • Pros:
        • Modern architectures such as Ada Lovelace and Ampere are faster than older ones
        • Easy to configure with robust Nvidia frameworks and libraries
        • HDMI outputs for graphical applications (multi-purpose for some users)
      • Cons:
        • Not optimized specifically for LLM purposes.
    • Workstation and Server Models: Options like the Nvidia Tesla P40 offer more memory at a lower price point but require additional setup and compatibility considerations.

      • Pros:
        • Suited for inferencing and training due to memory size, bandwidth, and many CUDA parallel cores.
        • Older models may be cost-effective on platforms like eBay.
      • Cons:
        • Designed for high airflow environments (Generates significant heat, even when idle).
        • Older models may lack future support.

    Note: Inferencing relies more on memory size and bandwidth than on raw compute power, so splitting a model across multiple cards (model parallelism) can let you run larger models more efficiently.

  • Apple's M2 Architecture in Studio Models with Upgraded RAM:

    • Apple's latest M2 chips, especially in Mac Studio models with upgraded unified memory (up to 192GB), offer a good balance of power and efficiency for LLM inferencing.
      • Pros:
        • Up to 192GB of unified memory shared with the GPU
        • Can be run headlessly as a personal cloud or integrated into Apple's ecosystem.
      • Cons:
        • Expensive
        • The integrated GPU may not suffice for model training (inferencing only)
  • Cloud-Based Alternatives:

    • Cloud-based platforms offer access to high-end hardware for training.
      • Pros:
        • Scalable and adaptable.
        • Access to the latest hardware configurations.
      • Cons:
        • Not suitable for those explicitly seeking a local LLM setup.

Unfortunately, I cannot recommend alternatives like AMD and Intel GPUs because they lack CUDA support and therefore will not be as user-friendly. I mention this because it is essential that we consumers adopt both AMD and Intel once support is added, to avoid one company having all the power and limiting our future options. Additionally, a powerful CPU coupled with ample, high-speed RAM is still not a viable setup for inferencing; unfortunately, most applications need VRAM for reasonable performance.
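If you already have PyTorch installed with CUDA support, a couple of lines of Python are a quick way to sanity-check how much VRAM your Nvidia GPU actually exposes (a sketch; device index 0 is assumed to be the card you plan to use):

```python
import torch

# Assumes PyTorch was installed with CUDA support and an Nvidia GPU is present.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)  # first (or only) GPU
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of VRAM")
else:
    print("No CUDA-capable GPU detected; inferencing would fall back to CPU and system RAM.")
```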

3. Model Selection: First Considerations, Parameters & Quantization of Local LLMs

One of the first things you will hear people talk about is the various models and their different options for parameters and quantization. Advanced hardware with large amounts of video memory is essential for inferencing and training because of the "large" aspect of LLMs: specifically, how many parameters there are and how much memory each one occupies, which often adds up to hundreds of GB.

Below is an interactive calculator showing how large of a model you can use with your hardware based on how much video memory you have. If the model is too large to fit in your video memory or RAM, you will run into performance issues as there is a significant overhead splitting up the data. Later, we will discuss some alternative methods to get around this, but for now, I'll keep it simple and talk about parameters and quantization.

Parameter Calculator

$$\text{Number of Parameters} = \frac{\text{Total Memory in Bytes}}{\text{Bytes per Parameter}}$$

The calculator also works in reverse, giving the required memory for a given number of parameters.
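The same calculation is easy to script yourself. The sketch below assumes the weights dominate memory use and ignores real-world overhead such as the context window and activations, so treat the numbers as rough upper bounds:

```python
def max_parameters(total_memory_gb: float, bytes_per_parameter: float) -> float:
    """Number of Parameters = Total Memory in Bytes / Bytes per Parameter."""
    return (total_memory_gb * 1024**3) / bytes_per_parameter

def required_memory_gb(num_parameters: float, bytes_per_parameter: float) -> float:
    """The inverse: memory needed just to hold the weights of a model."""
    return (num_parameters * bytes_per_parameter) / 1024**3

# 24 GB of VRAM (e.g. an RTX 3090/4090) with 4-bit weights (0.5 bytes per parameter):
print(f"{max_parameters(24, 0.5) / 1e9:.1f}B parameters fit")  # ~51.5B
# A 7-billion-parameter model stored as 16-bit floats (2 bytes per parameter):
print(f"{required_memory_gb(7e9, 2):.1f} GB required")         # ~13.0 GB
```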

The more parameters a model has, the more complex it is, meaning it is better able to capture patterns in data and produce output with an illusion of reasoning. However, more parameters also increase the risk of overfitting (memorizing the training data rather than generalizing from it). In addition, more parameters reduce training efficiency and take up more space in memory.

This little demonstration will hopefully aid in understanding. You can think of each box as one bit of video memory; each colored section shows how much memory each quantization level takes.

[Interactive demo: a grid of 32 memory boxes with a precision slider, shown here set to 4 bits per parameter]

Quantization is a technique used to reduce the precision of a model's parameters. Reducing the precision shrinks the model by converting its weights from higher-precision data types (32-bit floats) to data types with a smaller footprint, such as 16-bit floats or integers. This greatly reduces the model's size and can even increase inference speed, since the model loads faster and can take advantage of hardware accelerations. Because of these advantages, quantization has major implications for hobbyist hardware and even mobile devices.
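As a rough illustration of the idea (not any particular scheme used by real model formats), here is how a block of 32-bit float weights could be squeezed into 8-bit integers and approximately restored:

```python
import numpy as np

# Pretend these are fp32 weights from one layer of a model.
weights = np.random.default_rng(1).normal(scale=0.02, size=1024).astype(np.float32)

# Simple symmetric 8-bit quantization: store one fp32 scale plus int8 values.
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
restored = quantized.astype(np.float32) * scale        # approximate original values

print(f"size: {weights.nbytes} bytes -> {quantized.nbytes} bytes")    # 4096 -> 1024
print(f"max rounding error: {np.abs(weights - restored).max():.6f}")  # small, not zero
```

The weights come back slightly perturbed, which is why heavier quantization trades a little accuracy for a much smaller memory footprint.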

In summary

  • Quantization can increase inferencing speed and take advantage of hardware optimizations
  • Quantization greatly reduces the model's size
  • More parameters increase model complexity, improving its apparent ability to reason
  • Too many parameters can cause overfitting, increase size, and decrease performance

4. Quantization Methods

  • Detailed overview of different quantization solutions for various hardware and accelerations.
  • Comparing the benefits and limitations of each method.

5. Loaders for LLMs

  • Understanding loaders in LLMs.
  • Different types of loaders and their roles in model performance.

6. Selecting a Model

7. Fine-tuning

8. Prompt Engineering

Conclusion

  • Summarizing the key points of the guide.
  • Final thoughts and encouragement for readers to check out local LLMs.

r/LocalLLaMA: check out this Reddit group. They post news, questions, and more every day.

Glossary of AI and Machine Learning Terms

  • Inferencing: Using a trained machine learning model to make predictions or decisions based on user inputs and other unseen data. It's the practical application of a model after training.

  • Training: The phase in machine learning where a model learns from a dataset. The model adjusts its parameters to minimize the error in its predictions compared to actual outcomes.

  • Model Parameters: The internal variables of a model that are learned from training data. In neural networks, these include weights and biases that determine the transformation of input data to output.

  • Hyperparameters: Settings or configurations that govern the training process of a model. Set before training; examples include learning rate, number of epochs, and batch size.

  • Epochs: One epoch is one complete pass through the entire training dataset. Multiple epochs help the model learn effectively from the data.

  • Batch Size: The number of samples processed before the model's internal parameters are updated. Batch size affects learning speed and stability.

  • Overfitting: When a model learns the training data too well, including noise and outliers, leading to poor performance on new data and regurgitated responses.

  • Underfitting: A model is too simple to capture the underlying pattern in data, resulting in poor performance on both training and new data.

  • Regularization: Techniques to prevent overfitting by penalizing overly complex models, such as L1 and L2 regularization.

  • Transfer Learning: Reusing a model as a starting point for a new model on a second task, commonly used in deep learning.

  • Fine-tuning: Adapting a pre-trained model to a specific task or dataset through additional training and data.

  • Loss Function: A measure of the difference between a model's predictions and the actual data; training aims to minimize this loss.

  • Parameters (in the context of LLMs): Refers to the specific elements that define the configuration of a Large Language Model, such as the size of the model (number of layers, neurons), types of layers used, and the way these layers interact.

  • Quantization: The process of reducing the precision of the model's parameters (e.g., weights). This is done to decrease the model's size and speed up inference, often at the cost of a slight reduction in accuracy. Quantization is critical in deploying models on devices with limited computational resources.

  • CUDA: Nvidia's proprietary framework for parallel processing on Nvidia GPUs such as RTX 4090.

Tools & Calculators

Parameter calculator

Note

The cover image was generated with DALL·E
