Build a knowledge base from many documents

USE 0 - 12 min

Turn a stack of sources into one searchable knowledge base

One PDF is a demo; a knowledge base is a tool. Dify's RAG pipeline accepts PDFs, documents, and web pages, embedding them all into a single managed knowledge base your apps can query. The goal of this lesson is to assemble several sources into one base so a chatbot can answer across the whole corpus.

1 In the Studio, open the Knowledge section and create a new knowledge base.
2 Add several sources: upload two or three PDFs on the same topic, and add a web page URL as well. Dify ingests PDFs, docs, and web pages into the same base.
3 Let Dify chunk and embed each source. Watch the document count rise (the Sandbox tier allows up to 50 knowledge documents).
4 Attach this knowledge base to a Chatbot app (reuse the one from lesson 00 or create a new one).
5 Ask a question that spans more than one source, e.g. "What do these sources agree and disagree on?"

✓

A single knowledge base holds multiple sources, and the chatbot answers a question by drawing on more than one of them, with citations.
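To make step 3 less of a black box, here is a minimal sketch of what "chunking" several sources into one pool can look like. It is not Dify's implementation: the window size, the overlap, the `Chunk` type, and the three source names are all assumptions chosen for illustration, and the embedding step is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # which document the text came from
    index: int   # position of the chunk within that document
    text: str


def chunk_text(source: str, text: str, size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        chunks.append(Chunk(source, i, " ".join(words[start:start + size])))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sources standing in for two PDFs and one web page.
sources = {
    "guide.pdf": "word " * 100,
    "report.pdf": "term " * 55,
    "https://example.com/page": "token " * 20,
}

# Every source lands in the same pool, tagged with where it came from.
knowledge_base = []
for name, text in sources.items():
    knowledge_base.extend(chunk_text(name, text))

print(len(sources), "documents ->", len(knowledge_base), "chunks")
```

The point to notice is that the document count and the chunk count are different numbers: a long PDF contributes many chunks to the pool, a short web page only a few, and each chunk keeps its source label so that an answer can cite it.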

UNDERSTAND 12 - 22 min

Why retrieval quality depends on your sources

When the base holds many documents, retrieval has to pick the right chunks from a larger pool. That makes the quality of your sources — and how cleanly they were ingested — matter more than with a single file.

Key concept

A knowledge base is the retrieval layer your apps share: the same base can power several chatbots, agents, and workflows. Because Dify retrieves the most relevant chunks across every document in the base, adding noisy or off-topic sources can pull the answer off course. Curating the base (deciding what goes in and keeping it on-topic) is the lever that most affects answer quality, and it is the same lever whether one app or ten point at it.
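The effect of a noisy source is easier to see in a toy model. The sketch below ranks chunks by cosine similarity over plain word counts, which is only a stand-in for a real embedding model and not how Dify scores chunks. The file names, the chunk texts, and the `top_k` value of 2 are all invented for the example.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[tuple[str, str]], top_k: int = 2) -> list[tuple[str, str]]:
    """Rank every (source, text) chunk in the shared pool and keep the best top_k."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c[1])), reverse=True)
    return ranked[:top_k]


# Hypothetical curated base: two on-topic sources.
curated = [
    ("battery_guide.pdf", "Lithium battery storage temperature should stay below 25 degrees"),
    ("safety_report.pdf", "Battery storage above 40 degrees shortens battery life"),
]

# The same base after adding one keyword-heavy, off-topic page.
noisy = curated + [
    ("forum_post.html", "battery battery battery storage storage deals buy storage battery now"),
]

query = "battery storage temperature"
print([source for source, _ in retrieve(query, curated)])
print([source for source, _ in retrieve(query, noisy)])
```

In this toy setup the off-topic page outranks both useful chunks and pushes one of them out of the top two, so the chatbot would never see it. Real embedding models are harder to fool than word counts, but the underlying limit is the same: only a fixed number of chunks reach the model, and every source in the base competes for those slots.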

? When you asked a cross-source question, did the citation come from the source you expected? If not, why might Dify have retrieved a different chunk?
? How would you tell a noisy source from a useful one just by reading the answers it produces?
? If two of your sources contradict each other, what would you want the chatbot to do: pick one, or surface both?

BUILD 22 - 30 min

Stress-test your base with a question it should refuse

A trustworthy knowledge app is as much about what it declines to answer as what it answers. The best way to find the edges of your base is to probe them.

Your task

Ask your multi-source chatbot one in-corpus question and one clearly out-of-corpus question, and record how it handles each.

1 Ask a question whose answer you KNOW is in your sources. Confirm the citation is correct.
2 Ask a question whose answer is NOT in any source (e.g. an unrelated topic).
3 Note whether the chatbot grounds its answer and cites a source, admits it does not know, or guesses.
4 Write one sentence on whether you would trust this base for real work, and what you would add or remove.

Deliverable

Two question/answer pairs (in-corpus and out-of-corpus) plus a one-sentence trust verdict on your knowledge base.
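If you prefer to keep the deliverable in a structured form, the sketch below is one possible way to record the two probes. The `Probe` and `Behavior` names are hypothetical, the placeholder strings are meant to be replaced with your own questions and the chatbot's answers, and the pass rule simply encodes the expectation from the task: an in-corpus question should be grounded and cited, and an out-of-corpus question should be declined.

```python
from dataclasses import dataclass
from enum import Enum


class Behavior(Enum):
    GROUNDS_AND_CITES = "grounds and cites"
    ADMITS_UNKNOWN = "admits it does not know"
    GUESSES = "guesses"


@dataclass
class Probe:
    question: str
    in_corpus: bool     # True if you KNOW the answer is in your sources
    answer: str         # what the chatbot actually replied
    observed: Behavior  # your judgment of how it handled the question

    def passed(self) -> bool:
        # In-corpus questions should be grounded; out-of-corpus ones should be declined.
        expected = Behavior.GROUNDS_AND_CITES if self.in_corpus else Behavior.ADMITS_UNKNOWN
        return self.observed is expected


# Placeholder entries: replace the strings and behaviors with what you observed.
probes = [
    Probe("(your in-corpus question)", True, "(chatbot answer)", Behavior.GROUNDS_AND_CITES),
    Probe("(your out-of-corpus question)", False, "(chatbot answer)", Behavior.GUESSES),
]

for p in probes:
    label = "in-corpus" if p.in_corpus else "out-of-corpus"
    print(f"{label}: {p.observed.value} -> {'pass' if p.passed() else 'fail'}")
```

A base that passes the in-corpus probe but guesses on the out-of-corpus one, as in the placeholder data, is a reasonable prompt for the trust verdict: the retrieval works, but the app does not yet know where its knowledge ends.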