Case Study: Local LLM-Based NER with n8n and Ollama

Categories: Article, Case-study, ML, NER

Author: Nicole Dresselhaus

Published: May 5, 2025

Abstract

Named Entity Recognition (NER) is a foundational task in text analysis, traditionally addressed by training NLP models on annotated data. However, a recent study – “NER4All or Context is All You Need” – showed that out-of-the-box Large Language Models (LLMs) can significantly outperform classical NER pipelines (e.g. spaCy, Flair) on historical texts by using clever prompting, without any model retraining. This case study demonstrates how to implement the paper’s method using entirely local infrastructure: an n8n automation workflow (for orchestration) and an Ollama server running a 14B-parameter LLM on an NVIDIA A100 GPU. The goal is to enable research engineers and tech-savvy historians to reproduce and apply this method easily on their own data, with a focus on usability and correct outputs rather than raw performance.

We will walk through the end-to-end solution – from accepting a webhook input that defines entity types (e.g. Person, Organization, Location) to prompting a local LLM to extract those entities from a text. The solution covers setup instructions, required infrastructure (GPU, memory, software), model configuration, and workflow design in n8n. We also discuss potential limitations (like model accuracy and context length) and how to address them. By the end, you will have a clear blueprint for a self-hosted NER pipeline that leverages the knowledge encoded in LLMs (as advocated by the paper) while maintaining data privacy and reproducibility.

Background: LLM-Based NER Method Overview

The referenced study introduced a prompt-driven approach to NER, reframing it “from a purely linguistic task into a humanities-focused task”. Instead of training a specialized NER model for each corpus, the method leverages the fact that large pretrained LLMs already contain vast world knowledge and language understanding. The key idea is to provide the model with contextual definitions and instructions so it can recognize entities in context. Notably, the authors found that with proper prompts, a commercial LLM (ChatGPT-4) could achieve precision and recall on par with or better than state-of-the-art NER tools on a 1921 historical travel guide. This was achieved zero-shot, i.e. without any fine-tuning or additional training data beyond the prompt itself.

Prompt Strategy: The success of this approach hinges on careful prompt engineering. The final prompt used in the paper had multiple components:

  • Persona & Context: A brief introduction framing the LLM as an expert reading a historical text, possibly including domain context (e.g. “This text is an early 20th-century travel guide; language is old-fashioned”). This primes the model with relevant background.
  • Task Instructions: A clear description of the NER task, including the list of entity categories and how to mark them in text. For example: “Identify all Person (PER), Location (LOC), and Organization (ORG) names in the text and mark each by enclosing it in tags.”
  • Optional Examples: A few examples of sentences with correct tagged output (few-shot learning) to guide the model. Interestingly, the study found that zero-shot prompting often outperformed few-shot until ~16 examples were provided. Given the cost of preparing examples and limited prompt length, our implementation will focus on zero-shot usage for simplicity.
  • Reiteration & Emphasis: The prompt repeated key instructions in different words and emphasized compliance (e.g. “Make sure you follow the tagging format exactly for every example.”). This redundancy helps the model adhere to instructions.
  • Prompt Engineering Tricks: They included creative cues to improve accuracy, such as offering a “monetary reward for each correct classification” and the phrase “Take a deep breath and think step by step.” These tricks, drawn from prior work, encouraged the model to be thorough and careful.
  • Output Format: Crucially, the model was asked to repeat the original text exactly but insert tags around entity mentions. The authors settled on a format like <<PER ... /PER>> to tag people, <<LOC ... /LOC>> for locations, etc., covering each full entity span. This inline tagging format leveraged the model’s familiarity with XML/HTML syntax (from its training data) and largely eliminated problems like unclosed tags or extra spaces. By instructing the model not to alter any other text, they ensured the output could be easily compared to the input and parsed for entities.
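
Putting these pieces together, a minimal sketch of such a system prompt might look as follows (the wording is illustrative and condensed, not the paper’s verbatim prompt; it is written here as a JavaScript template string so it can be reused in the n8n workflow later):

const systemPromptSketch = `
You are an expert historian reading an early 20th-century travel guide; the
language may be old-fashioned.
Your task is Named Entity Recognition: identify all PER (persons), LOC
(locations) and ORG (organizations) in the text.
Enclose each entity in double angle brackets with its type label, for example
<<PER John Doe /PER>> or <<LOC Berlin /LOC>>.
Repeat the given text exactly. Be very careful to ensure that nothing is added
or removed apart from the annotations.
Take a deep breath and think step by step. You will be rewarded for each
correct tag.
`.trim();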

Why Local LLMs? The original experiments used a proprietary API (ChatGPT-4). To make the method accessible to all (and avoid data governance issues of cloud APIs), we implement it with open-source LLMs running locally. Recent openly licensed models are rapidly improving and can handle such extraction tasks given the right prompt. Running everything locally also aligns with the paper’s goal of “democratizing access” to NER for diverse, low-resource texts – there are no API costs or internet needed, and data stays on local hardware for privacy.

Solution Architecture

Our solution consists of a workflow in n8n that orchestrates the NER process, and a local Ollama server that hosts the LLM for text analysis. The high-level workflow is as follows:

  1. Webhook Trigger (n8n): A user initiates the process by sending an HTTP request to n8n’s webhook with two inputs: (a) a simple text defining the entity categories of interest (for example, "PER, ORG, LOC"), and (b) the text to analyze (either included in the request or accessible via a provided file URL). This trigger node captures the input and starts the automation.
  2. Prompt Construction (n8n): The workflow builds a structured prompt for the LLM. Based on the webhook input, it prepares the system instructions listing each entity type and guidelines, then appends the user’s text. Essentially, n8n will merge the entity definitions into a pre-defined prompt template (the one derived from the paper’s method). This can be done using a Function node or an LLM Prompt node in n8n to ensure the text and instructions are combined correctly.
  3. LLM Inference (Ollama + LLM): n8n then passes the prompt to an Ollama Chat Model node, which communicates with the Ollama server’s API. The Ollama daemon hosts the selected 14B model on the local GPU and returns the model’s completion. In our case, the completion will be the original text with NER tags inserted around the entities (e.g. <<PER John Doe /PER>> went to <<LOC Berlin /LOC>> ...). This step harnesses the A100 GPU to generate results quickly, using the chosen model’s weights locally.
  4. Output Processing (n8n): The tagged text output from the LLM can be handled in two ways. The simplest is to return the tagged text directly as the response to the webhook call – allowing the user to see their original text with all entities highlighted by tags. Alternatively, n8n can post-process the tags to extract a structured list of entities (e.g. a JSON array of {"entity": "John Doe", "type": "PER"} objects). This parsing can be done with a short regular expression in a Function or Code node, but given our focus on correctness, we often trust the model’s tagging format to be consistent (the paper reported the format was reliably followed when instructed clearly). Finally, an HTTP Response node sends the results back to the user (or stores them), completing the workflow.

Workflow Structure: In n8n’s interface, the workflow might look like a sequence of connected nodes: Webhook → Function (build prompt) → AI Model (Ollama) → Webhook Response. If using n8n’s new AI Agent feature, some steps (like prompt templating) can be configured within the AI nodes themselves. The key is that the Ollama model node is configured to use the local server (http://127.0.0.1:11434 by default) and the specific model name. We assume the base pipeline (available on GitHub) already includes most of this structure – our task is to slot in the custom prompt and model configuration for the NER use case.

Setup and Infrastructure Requirements

To reproduce this solution, you will need a machine with an NVIDIA GPU and the following software components installed:

  • n8n (v1.x or later) – the workflow automation tool. You can install n8n via npm, Docker, or use the desktop app. For a server environment, Docker is convenient. For example, to run n8n with Docker:

    docker run -it --rm \
               -p 5678:5678 \
               -v ~/.n8n:/home/node/.n8n \
               n8nio/n8n:latest

    This exposes n8n on http://localhost:5678 for the web interface. (If you use Docker and plan to connect to a host-running Ollama, start the container with --network=host to allow access to the Ollama API on localhost.)

  • Ollama (v0.x) – an LLM runtime that serves models via an HTTP API. Installing Ollama is straightforward: download the installer for your OS from the official site (Linux users can run the one-line script curl -sSL https://ollama.com/install.sh | sh). After installation, start the Ollama server (daemon) by running:

    ollama serve

    This will launch the service listening on port 11434. You can verify it’s running by opening http://localhost:11434 in a browser – it should respond with “Ollama is running”. Note: Ensure your system has recent NVIDIA drivers and CUDA support if using GPU. Ollama supports NVIDIA GPUs with compute capability ≥5.0 (the A100 is well above this). Use nvidia-smi to confirm your GPU is recognized. If everything is set up, Ollama will automatically use the GPU for model inference (falling back to CPU if none available).

  • LLM Model (14B class): Finally, download at least one large language model to use for NER. You have a few options here, and you can “pull” them via Ollama’s CLI:

    • DeepSeek-R1 14B: A 14.8B-parameter model distilled from larger reasoning models (based on the Qwen architecture). It is optimized for reasoning tasks and is comparable in quality to OpenAI’s models. Pull it with:

      ollama pull deepseek-r1:14b

      This downloads ~9 GB of data (the quantized weights). If you have a very strong GPU (e.g. A100 80GB), you could even try deepseek-r1:70b (~43 GB), but 14B is a good balance for our use-case. DeepSeek-R1 is licensed MIT and designed to run locally with no restrictions.

    • Cogito 14B: A 14B “hybrid reasoning” model by Deep Cogito, known for excellent instruction-following and multilingual capability. Pull it with:

      ollama pull cogito:14b

      Cogito-14B is also ~9 GB (quantized) and supports an extended context window up to 128k tokens – which is extremely useful if you plan to analyze very long documents without chunking. It’s trained in 30+ languages and tuned to follow complex instructions, which can help in structured output tasks like ours.

    • Others: Ollama offers many models (LLaMA 2 variants, Mistral, etc.). For instance, ollama pull llama2:13b would get a LLaMA-2 13B model. These can work, but for best results in NER with no fine-tuning, we suggest using one of the above well-instructed models. If your hardware is limited, you could try a 7-8B model (e.g., deepseek-r1:7b or cogito:8b), which download faster and use ~4–5 GB VRAM, at the cost of some accuracy. In CPU-only scenarios, even a 1.5B model is available – it will run very slowly and likely miss more entities, but it proves the pipeline can work on minimal hardware.

Hardware Requirements: Our case assumes an NVIDIA A100 GPU (40 GB), which comfortably hosts a 14B model in memory and accelerates inference. In practice, any modern GPU with ≥10 GB memory can run a 13–14B model in 4-bit quantization. For example, an RTX 3090 or 4090 (24 GB) could handle it, and even smaller GPUs (or Apple Silicon with 16+ GB RAM) can run 7B models. Ensure you have sufficient system RAM as well (at least as much as the model size, plus overhead for n8n – 16 GB RAM is a safe minimum for 14B). Disk space of ~10 GB per model is needed. If using Docker for n8n, allocate CPU and memory generously to avoid bottlenecks when the LLM node processes large text.
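
Before building the n8n workflow, it can be worth sanity-checking the local stack once from a small script. The snippet below (Node.js 18+ with the built-in fetch, run as an .mjs file; the model name and prompt are just examples) lists the pulled models and runs a tiny test generation against Ollama’s HTTP API:

// Quick sanity check for the local Ollama server on its default port.
const OLLAMA = "http://127.0.0.1:11434";

// List the models that have been pulled so far.
const tags = await fetch(`${OLLAMA}/api/tags`).then((r) => r.json());
console.log("Installed models:", tags.models.map((m) => m.name));

// Run a one-line test completion (adjust the model name to what you pulled).
const test = await fetch(`${OLLAMA}/api/generate`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-r1:14b",
    prompt: "Reply with the single word OK.",
    stream: false,
  }),
}).then((r) => r.json());
console.log("Test response:", test.response);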

Building the n8n Workflow

With the environment ready, we now construct the n8n workflow that ties everything together. We outline each component with instructions:

1. Webhook Input for Entities and Text

Start by creating a Webhook trigger node in n8n. This will provide a URL (endpoint) that you can send a request to. Configure it to accept a POST request containing the necessary inputs. For example, we expect the request JSON to look like:

{
  "entities": "PER, ORG, LOC",
  "text": "John Doe visited Berlin in 1921 and met with the Board of Acme Corp."
}

Here, "entities" is a simple comma-separated string of entity types (you could also accept an array or a more detailed schema; for simplicity we use the format used in the paper: PER for person, LOC for location, ORG for organization). The "text" field contains the content to analyze. In a real scenario, the text could be much longer or might be sent as a file. If it’s a file, one approach is to send it as form-data and use n8n’s Read Binary File + Move Binary Data nodes to get it into text form. Alternatively, send a URL in the JSON and use an HTTP Request node in the workflow to fetch the content. The key is that by the end of this step, we have the raw text and the list of entity labels available in the n8n workflow as variables.

2. Constructing the LLM Prompt

Next, add a node to build the prompt that will be fed to the LLM. You can use a Function node (JavaScript code) or the “Set” node to template a prompt string. We will create two pieces of prompt content: a system instruction (the role played by the system prompt in chat models) and the user message (which will contain the text to be processed).

According to the method, our system prompt should incorporate the following:

  • Persona/Context: e.g. “You are a historian and archivist analyzing a historical document. The language may be old or have archaic spellings. You have extensive knowledge of people, places, and organizations relevant to the context.” This establishes domain expertise in the model.
  • Task Definition: e.g. “Your task is to perform Named Entity Recognition. Identify all occurrences of the specified entity types in the given text and annotate them with the corresponding tags.”
  • Entity Definitions: List the entity categories provided by the user, with a brief definition if needed. For example: “The entity types are: PER (persons or fictional characters), ORG (organizations, companies, institutions), LOC (locations such as cities, countries, landmarks).” If the user already provided definitions in the webhook, include those; otherwise a generic definition as shown is fine.
  • Tagging Instructions: Clearly explain the tagging format. We adopt the format from the paper: each entity should be wrapped in <<TYPE ... /TYPE>>. So instruct: “Enclose each entity in double angle brackets with its type label. For example: <<PER John Doe /PER>> for a person named John Doe. Do not alter any other text – only insert tags. Ensure every opening tag has a closing tag.” Also mention that tags can nest or overlap if necessary (though that’s rare).
  • Output Expectations: Emphasize that the output should be the exact original text, verbatim, with tags added and nothing else. For example: “Repeat the input text exactly, adding the tags around the entities. Do not add explanations or remove any content. The output should look like the original text with markup.” This is crucial to prevent the model from omitting or rephrasing text. The paper’s prompt literally had a line: “Repeat the given text exactly. Be very careful to ensure that nothing is added or removed apart from the annotations.”
  • Compliance & Thoughtfulness: We can borrow the trick of telling the model to take its time and be precise. For instance: “Before answering, take a deep breath and think step by step. Make sure you find all entities. You will be rewarded for each correct tag.” While the notion of reward is hypothetical, such phrasing has been observed to sharpen the model’s focus. This is optional but can be useful for complex texts.

Once this system prompt is assembled as a single string, it will be sent as the system role content to the LLM. Now, for the user prompt, we simply supply the text to be analyzed. In many chat-based LLMs, the user message would contain the text on which the assistant should perform the task. We might prefix it with something like “Text to analyze:” for clarity, or just include the raw text. (Including a prefix is slightly safer to distinguish it from any instructions, but since the system prompt already set the task, the user message can be just the document text.)
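
In a Function (or Code) node, assembling both pieces could look roughly like the following sketch. It assumes the webhook delivered entities and text as in the example above and falls back to generic definitions for unknown labels:

// n8n Function/Code node: build the system prompt and pass the text along.
const payload = items[0].json.body ?? items[0].json; // webhook data may sit under "body"
const { entities, text } = payload;

// Generic definitions used when the user only sends the label abbreviations.
const defs = {
  PER: "persons or fictional characters",
  ORG: "organizations, companies, institutions",
  LOC: "locations such as cities, countries, landmarks",
};
const entityList = entities
  .split(",")
  .map((e) => e.trim())
  .map((e) => `${e} (${defs[e] ?? "entities of this type"})`)
  .join(", ");

const prompt = [
  "You are a historian and archivist analyzing a historical document.",
  "Your task is to perform Named Entity Recognition.",
  `The entity types are: ${entityList}.`,
  "Enclose each entity in double angle brackets with its type label, e.g. <<PER John Doe /PER>>.",
  "Repeat the given text exactly. Be very careful to ensure that nothing is added or removed apart from the annotations.",
  "Take a deep breath and think step by step.",
].join("\n");

return [{ json: { prompt, text } }];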

In n8n, if using the Basic LLM Chain node, you can configure it to use a custom system prompt. For example, connect the Function/Set node output into the LLM node, and in the LLM node’s settings choose “Mode: Complete” or similar, then under System Instructions put an expression that references the constructed prompt text (e.g., {{ $json["prompt"] }} if the prompt was output to that field). The User Message can similarly be fed from the input text field (e.g., {{ $json["text"] }}). Essentially, we map our crafted instruction into the system role, and the actual content into the user role.

3. Configuring the Local LLM (Ollama Model Node)

Now configure the LLM node to use the Ollama backend and your downloaded model. n8n provides an “Ollama Chat Model” integration, which is a sub-node of the AI Agent system. In the n8n editor, add or open the LLM node (if using the AI Agent, this might be inside a larger agent node), and look for model selection. Select Ollama as the provider. You’ll need to set up a credential for Ollama API access – use http://127.0.0.1:11434 as the host (instead of the default localhost, to avoid any IPv6 binding issues). No API key is needed since it’s local. Once connected, you should see a dropdown of available models (all the ones you pulled). Choose the 14B model you downloaded, e.g. deepseek-r1:14b or cogito:14b.

Double-check the parameters for generation. By default, Ollama models have their own preset for max tokens and temperature. For an extraction task, we want the model to stay focused and deterministic. It’s wise to set a relatively low temperature (e.g. 0.2) to reduce randomness, and a high max tokens so it can output the entire text with tags (set max tokens to at least the length of your input in tokens plus 10-20% for tags). If using Cogito with its 128k context, you can safely feed very long text; with other models (often ~4k context), ensure your text isn’t longer than the model’s context limit or use a model variant with extended context. If the model supports “tools” or functions, you won’t need those here – this is a single-shot prompt, not a multi-step agent requiring tool usage, so just the chat completion mode is sufficient.
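
For debugging these parameters outside of n8n, the same request can be sent to Ollama’s /api/chat endpoint directly. Below is a rough equivalent of what the node does; the option values are only suggestions, and prompt and text stand in for the strings built in the previous step:

// Direct call to Ollama's chat endpoint, mirroring the n8n Ollama Chat Model node.
const reply = await fetch("http://127.0.0.1:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "cogito:14b",                   // or "deepseek-r1:14b"
    stream: false,
    messages: [
      { role: "system", content: prompt }, // the instructions built earlier
      { role: "user", content: text },     // the document to tag
    ],
    options: {
      temperature: 0.2, // keep the tagging deterministic
      num_ctx: 8192,    // context window; raise for long inputs if the model allows it
      num_predict: -1,  // no hard cap, so the whole tagged text can be emitted
    },
  }),
}).then((r) => r.json());

console.log(reply.message.content); // the tagged text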

At this point, when the workflow runs to this node, n8n will send the system and user messages to Ollama and wait for the response. The heavy lifting is done by the LLM on the GPU, which will generate the tagged text. On an A100, a 14B model can process a few thousand tokens of input and output in just a handful of seconds (exact time depends on the model and input size).

4. Returning the Results

After the LLM node, add a node to handle the output. If you want to present the tagged text directly, you can pass the LLM’s output to the final Webhook Response node (or if using the built-in n8n chat UI, you would see the answer in the chat). The tagged text will look something like:

<<PER John Doe /PER>> visited <<LOC Berlin /LOC>> in 1921 and met with the Board
of <<ORG Acme Corp /ORG>>.

This format highlights each identified entity. It is immediately human-readable with the tags, and trivial to post-process if needed. For example, one could use a regex like <<(\w+) (.*?) /\1>> to extract all type and entity pairs from the text. In n8n, a quick approach is to use a Function node to find all matches of that pattern in item.json["data"] (assuming the LLM output is in data). Then one could return a JSON array of entities. However, since our focus is on correctness and ease, you might simply return the marked-up text and perhaps document how to parse it externally if the user wants structured data.
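
If you do want structured output, a Function node along these lines would do the parsing. This sketch assumes the LLM output arrives in a field named data; rename it to whatever your LLM node actually outputs:

// n8n Function/Code node: extract { type, entity } pairs from the tagged text.
const tagged = items[0].json.data; // assumption: the LLM output field is called "data"

const entities = [];
const pattern = /<<(\w+) (.*?) \/\1>>/g; // matches e.g. <<PER John Doe /PER>>
let match;
while ((match = pattern.exec(tagged)) !== null) {
  entities.push({ type: match[1], entity: match[2] });
}

return [{ json: { entities, tagged } }];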

Finally, use an HTTP Response node (if the workflow was triggered by a Webhook) to send back the results. If the workflow was triggered via n8n’s chat trigger (in the case of interactive usage), you would instead rely on the chat UI output. For a pure API workflow, the HTTP response will contain either the tagged text or a JSON of extracted entities, which the user’s script or application can then use.

Note: If you plan to run multiple analyses or have an ongoing service, you might want to persist the Ollama server (don’t shut it down between runs) and perhaps keep the model loaded in VRAM for performance. Ollama will cache the model in memory after the first request, so subsequent requests are faster. On an A100, you could even load two models (if you plan to experiment with which gives better results) but be mindful of VRAM usage if doing so concurrently.

Model Selection Considerations

We provided two example 14B models (DeepSeek-R1 and Cogito) to use with this pipeline. Both are good choices, but here are some considerations and alternatives:

  • Accuracy vs. Speed: Larger models (like 14B or 30B) generally produce more accurate and coherent results, especially for complex instructions, compared to 7B models. Since our aim is correctness of NER output, the A100 allows us to use a 14B model, which offers a sweet spot between accuracy and resource demands. In preliminary tests, these models can correctly tag most obvious entities and even handle some tricky cases (e.g. person names with titles, organizations that sound like person names, etc.) thanks to their pretrained knowledge. If you find the model is making mistakes, you could try a bigger model (Cogito 32B or 70B, if resources permit). Conversely, if you need faster responses and are willing to trade some accuracy, a 7-8B model, or running the 14B more aggressively quantized (e.g. 4-bit) on CPU, might be acceptable for smaller texts.
  • Domain of the Text: The paper dealt with historical travel guide text (1920s era). These open models have been trained on large internet corpora, so they likely have seen a lot of historical names and terms, but their coverage might not be as exhaustive as GPT-4. If your text is in a specific domain (say, ancient mythology or very obscure local history), the model might miss entities that it doesn’t recognize as famous. The prompt’s context can help (for example, adding a note like “Note: Mythological characters should be considered PERSON entities.” as they did for Greek gods). For extremely domain-specific needs, one could fine-tune a model or use a specialized one, but that moves beyond the zero-shot philosophy.
  • Language: If your texts are not in English, ensure the chosen model is multilingual. Cogito, for instance, was trained in over 30 languages, so it can handle many European languages (the paper also tested German prompts). If using a model that’s primarily English (like some LLaMA variants), you might get better results by writing the instructions in English but letting it output tags in the original text. The study found English prompts initially gave better recall even on German text, but with prompt tweaks the gap closed. For our pipeline, you can simply provide the definitions in English and the text in the foreign language – a capable model will still tag the foreign entities. For example, Cogito or DeepSeek should tag a German sentence’s “Herr Schmidt” as <<PER Herr Schmidt /PER>>. Always test on a small sample if in doubt.
  • Extended Context: If your input text is very long (tens of thousands of words), you should chunk it into smaller segments (e.g. paragraph by paragraph) and run the model on each, then merge the outputs. This is because most models (including DeepSeek 14B) have a context window of 2048–8192 tokens. However, Cogito’s 128k context capability is a game-changer – in theory you could feed an entire book and get a single output. Keep in mind the time and memory usage will grow with very large inputs, and n8n might need increased timeout settings for such long runs. For typical use (a few pages of text at a time), the standard context is sufficient.
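
If chunking becomes necessary, splitting at paragraph boundaries keeps entities intact in almost all cases. Below is a simple splitter suitable for a Function node; the chunk size is measured in characters as a rough proxy for tokens and should be tuned to your model’s context window:

// Split a long document into chunks of at most ~maxChars characters,
// cutting only at blank-line paragraph boundaries.
function chunkText(text, maxChars = 6000) {
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && current.length + p.length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? current + "\n\n" + p : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Each chunk is sent through the LLM node separately (e.g. via Split In Batches)
// and the tagged outputs are concatenated again in their original order.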

In our implementation, we encourage experimenting with both DeepSeek-R1 and Cogito models. Both are open-source and free for commercial use (Cogito uses an Apache 2.0 license, DeepSeek MIT). They represent some of the best 14B-class models as of early 2025. You can cite these models in any academic context if needed, or even switch to another model with minimal changes to the n8n workflow (just pull the model and change the model name in the Ollama node).

Example Run

Let’s run through a hypothetical example to illustrate the output. Suppose a historian supplies the following via the webhook:

  • Entities: PER, ORG, LOC
  • Text: “Baron Münchhausen was born in Bodenwerder and served in the Russian military under Empress Anna. Today, the Münchhausen Museum in Bodenwerder is operated by the town council.”

When the workflow executes, the LLM receives instructions to tag people (PER), organizations (ORG), and locations (LOC). With the prompt techniques described, the model’s output might look like:

<<PER Baron Münchhausen /PER>> was born in <<LOC Bodenwerder /LOC>> and served
in the Russian military under <<PER Empress Anna /PER>>. Today, the <<ORG
Münchhausen Museum /ORG>> in <<LOC Bodenwerder /LOC>> is operated by the town
council.

All person names (Baron Münchhausen, Empress Anna) are enclosed in <<PER>> tags, the museum is marked as an organization, and the town Bodenwerder is marked as a location (twice). The rest of the sentence remains unchanged. This output can be returned as-is to the user. They can visually verify it or programmatically parse out the tagged entities. The correctness of outputs is high: each tag corresponds to a real entity mention in the text, and there are no hallucinated tags. If the model were to make an error (say, tagging “Russian” as LOC erroneously), the user could adjust the prompt (for example, clarify that national adjectives are not entities) and re-run.

Limitations and Solutions

While this pipeline makes NER easier to reproduce, it’s important to be aware of its limitations and how to mitigate them:

  • Model Misclassifications: A local 14B model may not match GPT-4’s level of understanding. It might occasionally tag something incorrectly or miss a subtle entity. For instance, in historical texts, titles or honorifics (e.g. “Dr. John Smith”) might confuse it, or a ship name might be tagged as ORG when it’s not in our categories. Solution: Refine the prompt with additional guidance. You can add a “Note” section in the instructions to handle known ambiguities (the paper did this with notes about Greek gods being persons, etc.). Also, a quick manual review or spot-check is recommended for important outputs. Since the output format is simple, a human or a simple script can catch obvious mistakes (e.g., if “Russian” was tagged LOC, a post-process could remove it knowing it’s likely wrong). Over time, if you notice a pattern of mistakes, update the prompt instructions accordingly.

  • Text Reproduction Issues: We instruct the model to output the original text verbatim with tags, but LLMs sometimes can’t resist minor changes. They may “correct” spelling or punctuation, or alter spacing. The paper noted this tendency and used fuzzy matching when evaluating. In our pipeline, minor format changes usually don’t harm the extraction, but if preserving text exactly is important (say for downstream alignment), this is a concern. Solution: Emphasize fidelity in the prompt (we already do). If needed, do a diff between the original text and tagged text and flag differences. Usually differences will be small (e.g., changing an old spelling to modern). You can then either accept them or attempt a more rigid approach (like asking for a JSON list of entity offsets – though that introduces other complexities and was intentionally avoided by the authors). In practice, we found the tag insertion approach with strong instructions yields nearly identical text apart from the tags.

  • Long Inputs and Memory: Very large documents may exceed the model’s input capacity or make the process slow. The A100 GPU can handle a lot, but n8n itself might have default timeouts for a single workflow execution. Solution: For long texts, break the input into smaller chunks (maybe one chapter or section at a time). n8n can loop through chunks using the Split In Batches node or simply by splitting the text in the Function node and feeding the LLM node multiple times. You’d then concatenate the outputs. Be aware that an entity spanning a chunk boundary can be missed – this is rare when chunks are cut at paragraph or sentence boundaries. Alternatively, use Cogito with its extended context to avoid chunking. Make sure to increase n8n’s execution timeout if needed (via the EXECUTIONS_TIMEOUT environment variable or in the workflow settings).

  • Concurrent Usage: If multiple users or processes hit the webhook simultaneously, they would be sharing the single LLM instance. Ollama can queue requests, but the GPU will handle them one at a time (unless running separate instances with multiple GPUs). For a research setting with one user at a time, this is fine. If offering this as a service to others, consider queuing requests or scaling out (multiple replicas of this workflow on different GPU machines). The stateless design of the prompt makes each run independent.

  • n8n Learning Curve: For historians new to n8n, setting up the workflow might be unfamiliar. However, n8n’s no-code interface is fairly intuitive with a bit of guidance. This case study provides the logic; one can also import pre-built workflows. In fact, the n8n community has template workflows (for example, a template for chatting with local LLMs) that could be adapted. We assume the base pipeline from the paper’s authors is available on GitHub – using that as a starting point, one mostly needs to adjust nodes as described. If needed, one can refer to n8n’s official docs or community forum for help on creating a webhook or using function nodes. Once set up, running the workflow is as easy as sending an HTTP request or clicking “Execute Workflow” in n8n.

  • Output Verification: Since we prioritize correctness, you may want to evaluate how well the model did, especially if you have ground truth annotations. While benchmarking is out of scope here, note that you can integrate evaluation into the pipeline too. For instance, if you had a small test set with known entities, you could compare the model output tags with expected tags using a Python script (n8n’s Code node can also run Python) or use an NER evaluation library like nervaluate for precision/recall. This is exactly what the authors did to report performance, and you could mimic that to gauge your chosen model’s accuracy.
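
As a minimal stand-in for a full evaluation library, entity-level precision and recall can be computed in a few lines once both the model output and a gold annotation are available as (type, entity) pairs. The sketch below uses exact string matching and ignores duplicate mentions:

// Naive entity-level precision/recall against a small gold standard.
function evaluate(predicted, gold) {
  const key = (e) => `${e.type}|${e.entity}`;
  const goldSet = new Set(gold.map(key));
  const predSet = new Set(predicted.map(key));
  const hits = [...predSet].filter((k) => goldSet.has(k)).length;
  return {
    precision: predSet.size ? hits / predSet.size : 0,
    recall: goldSet.size ? hits / goldSet.size : 0,
  };
}

// evaluate(
//   [{ type: "PER", entity: "Baron Münchhausen" }],
//   [{ type: "PER", entity: "Baron Münchhausen" }, { type: "LOC", entity: "Bodenwerder" }]
// ) // → { precision: 1, recall: 0.5 }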

Conclusion

By following this guide, we implemented the NER4All paper’s methodology with a local, reproducible setup. We used n8n to handle automation and prompt assembly, and a local LLM (via Ollama) to perform the heavy-duty language understanding. The result is a flexible NER pipeline that requires no training data or API access – just a well-crafted prompt and a powerful pretrained model. We demonstrated how a user can specify custom entity types and get their text annotated in one click or API call. The approach leverages the strengths of LLMs (vast knowledge and language proficiency) to adapt to historical or niche texts, aligning with the paper’s finding that a bit of context and expert prompt design can unlock high NER performance.

Importantly, this setup is easy to reproduce: all components are either open-source or freely available (n8n, Ollama, and the models). A research engineer or historian can run it on a single machine with sufficient resources, and it can be shared as a workflow file for others to import. By removing the need for extensive data preparation or model training, this lowers the barrier to extracting structured information from large text archives.

Moving forward, users can extend this case study in various ways: adding more entity types (just update the definitions input), switching to other LLMs as they become available (perhaps a future 20B model with even better understanding), or integrating the output with databases or search indexes for further analysis. With the rapid advancements in local AI models, we anticipate that such pipelines will become even more accurate and faster over time, continually democratizing access to advanced NLP for all domains.

Sources: This implementation draws on insights from [1] for the prompt-based NER method, and uses tools like n8n and Ollama as documented in their official guides. The chosen models (DeepSeek-R1[2] and Cogito[3]) are described in their respective releases. All software and models are utilized in accordance with their licenses for a fully local deployment.

About LLMs as ‘authors’

The initial draft was created using “Deep-Research” from gpt-4.5 (preview). Final proofreading, content review, and layout by Nicole Dresselhaus. Do not fear that this is some LLM-BS to get views on the homepage. I read everything multiple times and would have written it with this content - just in worse words.


Citation

BibTeX citation:
@online{2025,
  author = {GPT-4.5 and Dresselhaus, Nicole},
  title = {Case {Study:} {Local} {LLM-Based} {NER} with N8n and
    {Ollama}},
  date = {2025-05-05},
  url = {https://drezil.de/Writing/ner4all-case-study.html},
  langid = {en},
  abstract = {Named Entity Recognition (NER) is a foundational task in
    text analysis, traditionally addressed by training NLP models on
    annotated data. However, a recent study – \_“NER4All or Context is
    All You Need”\_ – showed that out-of-the-box Large Language Models
    (LLMs) can **significantly outperform** classical NER pipelines
    (e.g. spaCy, Flair) on historical texts by using clever prompting,
    without any model retraining. This case study demonstrates how to
    implement the paper’s method using entirely local infrastructure: an
    **n8n** automation workflow (for orchestration) and a **Ollama**
    server running a 14B-parameter LLM on an NVIDIA A100 GPU. The goal
    is to enable research engineers and tech-savvy historians to
    **reproduce and apply this method easily** on their own data, with a
    focus on usability and correct outputs rather than raw performance.
    We will walk through the end-to-end solution – from accepting a
    webhook input that defines entity types (e.g. Person, Organization,
    Location) to prompting a local LLM to extract those entities from a
    text. The solution covers setup instructions, required
    infrastructure (GPU, memory, software), model configuration, and
    workflow design in n8n. We also discuss potential limitations (like
    model accuracy and context length) and how to address them. By the
    end, you will have a clear blueprint for a **self-hosted NER
    pipeline** that leverages the knowledge encoded in LLMs (as
    advocated by the paper) while maintaining data privacy and
    reproducibility.}
}
For attribution, please cite this work as:
GPT-4.5, and Nicole Dresselhaus. 2025. “Case Study: Local LLM-Based NER with N8n and Ollama.” May 5, 2025. https://drezil.de/Writing/ner4all-case-study.html.