June 25, 2026
Build Applications with Local AI Models on a Mac (MEAP 02)
Build Applications with Local AI Models on a Mac (MEAP 02) | 3.31 MB
Title: Build Applications with Local AI Models on a Mac (MEAP 02)
Author: Keiji Kamigusa
Category: Nonfiction, Computers, Advanced Computing, Artificial Intelligence, General Computing
Language: English | 139 Pages | ISBN: 9782488111102
Description:
The model you depend on lives on someone else’s hardware. They can change the price, change the rules, or retire it entirely, and you cannot stop them.
Local AI Engineering with Ollama is how you stop renting and start owning. You take the model, the price, and the rules back into your own hands: run any model you want, when you want, where you want, and change how it behaves without a meter running.
This is a practical book for developers who can run a command and edit a file but have no Machine Learning degree and want none. It skips the marketing and jumps into building things that run, on hardware you already own, with the network unplugged. Every command was executed on a real machine, and every output you see (JSON responses, error messages, token counts, training logs) came from an actual session, not from documentation.
This book moves in one direction: from running your first model to shipping an agent that runs on your own hardware. Each chapter ends with something working, and each skill below builds on the one before it. By the end you will be able to:
Understand what a model is actually doing: Tokens, predictions, weights, embeddings, attention, and the KV cache, each tied to a setting you will change.
Install Ollama and size your hardware honestly: Install the runtime and tell if a model fits your RAM or VRAM before downloading.
Pick, pull, and manage models: Read the Ollama and Hugging Face GGUF repos, choose quantization, and manage disk and memory.
Drive Ollama from its API: Run models over HTTP from your code, and read tokens-per-second to compare on numbers.
Control the context window: Size it so the model stops forgetting, and see what gets sent each turn.
Operate a model under real conditions: Tune temperature, top_p, top_k, penalties, seed, keep-alive, and concurrency.
Package a custom model with a Modelfile: One job, the same way every time, shipped as a single artifact.
Fine-tune a model on your own data: Train Granite for English-to-SQL with QLoRA and Unsloth, then export to GGUF.
Build against the Python SDK: Build Python programs with typed responses, ending in a management CLI.
Build a working chat loop and see why it forgets: Write a REPL, then watch it fail to recall the last turn.
Give the conversation a memory: Resend a running message list so the assistant follows the conversation.
Stream replies and accept multi-line input: Print tokens as they arrive, and take multi-line prompts.
Keep long chats inside the context window: Drop the oldest turns so the prompt never overflows.
Summarize old turns instead of dropping them: Condense earlier messages with a second model through LangChain.
Cache replies in Redis: Return repeated questions instantly, cutting latency and wasted compute.
Add long-term memory that survives restarts: Wire in mem0 to recall user facts across sessions.
Give the model tools to fetch live data: Add function calling, guarded against inventing numbers.
Source those tools from an external MCP server: Serve tools over MCP, turning M times N into M plus N.
Put a graphical interface in front of Ollama: Run Open WebUI in Docker, chat with your documents, lock it down for a team.
If you can run a command and edit a file, you are qualified! Downloadable code included.
So what are you waiting for to stop renting, start owning, and get a model running tonight?
The model you depend on lives on someone else’s hardware. They can change the price, change the rules, or retire it entirely, and you cannot stop them.
Local AI Engineering with Ollama is how you stop renting and start owning. You take the model, the price, and the rules back into your own hands: run any model you want, when you want, where you want, and change how it behaves without a meter running.
This is a practical book for developers who can run a command and edit a file but have no Machine Learning degree and want none. It skips the marketing and jumps into building things that run, on hardware you already own, with the network unplugged. Every command was executed on a real machine, and every output you see (JSON responses, error messages, token counts, training logs) came from an actual session, not from documentation.
This book moves in one direction: from running your first model to shipping an agent that runs on your own hardware. Each chapter ends with something working, and each skill below builds on the one before it. By the end you will be able to:
Understand what a model is actually doing: Tokens, predictions, weights, embeddings, attention, and the KV cache, each tied to a setting you will change.
Install Ollama and size your hardware honestly: Install the runtime and tell if a model fits your RAM or VRAM before downloading.
Pick, pull, and manage models: Read the Ollama and Hugging Face GGUF repos, choose quantization, and manage disk and memory.
Drive Ollama from its API: Run models over HTTP from your code, and read tokens-per-second to compare on numbers.
Control the context window: Size it so the model stops forgetting, and see what gets sent each turn.
Operate a model under real conditions: Tune temperature, top_p, top_k, penalties, seed, keep-alive, and concurrency.
Package a custom model with a Modelfile: One job, the same way every time, shipped as a single artifact.
Fine-tune a model on your own data: Train Granite for English-to-SQL with QLoRA and Unsloth, then export to GGUF.
Build against the Python SDK: Build Python programs with typed responses, ending in a management CLI.
Build a working chat loop and see why it forgets: Write a REPL, then watch it fail to recall the last turn.
Give the conversation a memory: Resend a running message list so the assistant follows the conversation.
Stream replies and accept multi-line input: Print tokens as they arrive, and take multi-line prompts.
Keep long chats inside the context window: Drop the oldest turns so the prompt never overflows.
Summarize old turns instead of dropping them: Condense earlier messages with a second model through LangChain.
Cache replies in Redis: Return repeated questions instantly, cutting latency and wasted compute.
Add long-term memory that survives restarts: Wire in mem0 to recall user facts across sessions.
Give the model tools to fetch live data: Add function calling, guarded against inventing numbers.
Source those tools from an external MCP server: Serve tools over MCP, turning M times N into M plus N.
Put a graphical interface in front of Ollama: Run Open WebUI in Docker, chat with your documents, lock it down for a team.
If you can run a command and edit a file, you are qualified! Downloadable code included.
So what are you waiting for to stop renting, start owning, and get a model running tonight?
DOWNLOAD:
https://nitroflare.com/view/937A39612743FB5/Build_Applications_with_Local_AI_Models_on_a_Mac.rar