Disclaimer: this is a report generated with my tool: https://github.com/DTeam-Top/tsw-cli. See it as an experiment, not formal research. 😄
Note: this explanation only covers the `knowledge_storm` package in the storm repo, because it aligns with my interests.
Summary
This repository provides an implementation of the STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) Wiki pipeline and its collaborative version, Co-STORM. STORM automates knowledge curation, specifically the generation of Wikipedia-like articles. It leverages Large Language Models (LLMs) for information seeking, outline generation, article writing, and polishing. The repository also offers example scripts for customization, enabling users to adapt the system to their preferred language models and data sources.
Modules
- `knowledge_storm.storm_wiki`: Implements the core STORM Wiki pipeline for automated knowledge curation and article generation.
- `knowledge_storm.collaborative_storm`: Implements the Co-STORM pipeline, which extends the original STORM framework with collaborative information seeking.
- `knowledge_storm.lm`: Defines interfaces and implementations for the different Large Language Models (LLMs) used throughout the STORM pipeline.
- `knowledge_storm.rm`: Defines interfaces and implementations for the different Retrieval Modules (RMs) used to fetch information from external sources.
- `knowledge_storm.utils`: Provides utility functions for file I/O, text processing, and other common tasks.
- `knowledge_storm.dataclass`: Defines data classes used throughout the STORM pipeline for representing articles, information snippets, and other structured data.
Code Structure
Section 1: STORM Wiki Pipeline (`knowledge_storm/storm_wiki`)
This section contains the core implementation of the STORM Wiki pipeline, which automates the generation of Wikipedia-like articles.
- `engine.py`: The central orchestration point for the STORM Wiki pipeline. It defines the `STORMWikiRunner` class, responsible for executing the different stages of the pipeline:
  - Knowledge Curation: Uses a conversational agent to gather information about a topic from the internet.
  - Outline Generation: Creates a structured outline for the article based on the collected information.
  - Article Generation: Generates the article content by expanding on the outline with cited evidence.
  - Article Polishing: Refines the generated article, adding a summary and removing duplicates.
  The `STORMWikiLMConfigs` class encapsulates the configuration of the LLM used in each stage, allowing for customization and optimization. The runner uses the Strategy Pattern so that different search engines can be plugged in.
- `modules` directory: Contains the implementation of each stage of the STORM Wiki pipeline as a separate module:
  - `knowledge_curation.py`: Implements the knowledge curation stage via `StormKnowledgeCurationModule`.
  - `outline_generation.py`: Implements the outline generation stage via `StormOutlineGenerationModule`.
  - `article_generation.py`: Implements the article generation stage via `StormArticleGenerationModule`.
  - `article_polish.py`: Implements the article polishing stage via `StormArticlePolishingModule`.
  - `persona_generator.py`: Implements persona generation via `StormPersonaGenerator`.
  - `storm_dataclass.py`: Defines data classes (`StormInformationTable`, `StormArticle`, `DialogueTurn`) for representing data used in the STORM Wiki pipeline.
  - `callback.py`: Defines a base class for callback handlers, allowing users to inject custom logic at different stages of the pipeline.
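The staged runner design and the pluggable search strategy described above can be sketched as follows. All class and method names here are simplified, illustrative stand-ins, not the actual `knowledge_storm` API:

```python
# Illustrative sketch of a staged runner with a pluggable retrieval strategy,
# in the spirit of STORMWikiRunner; names are stand-ins, not the real API.
from dataclasses import dataclass, field


class Retriever:
    """Strategy interface: every search backend exposes the same method."""

    def retrieve(self, query: str) -> list[str]:
        raise NotImplementedError


class FakeSearch(Retriever):
    """Stand-in for a real search engine (Bing, You.com, ...)."""

    def retrieve(self, query: str) -> list[str]:
        return [f"snippet about {query}"]


@dataclass
class MiniRunner:
    """Chains the four STORM stages; each stage consumes the previous output."""

    rm: Retriever
    log: list = field(default_factory=list)

    def run(self, topic: str) -> str:
        info = self.rm.retrieve(topic)             # knowledge curation
        outline = [f"# {topic}", "## Background"]  # outline generation
        draft = "\n".join(outline + info)          # article generation
        polished = draft.strip()                   # article polishing
        self.log = ["curation", "outline", "generation", "polish"]
        return polished


runner = MiniRunner(rm=FakeSearch())
article = runner.run("quantum computing")
```

Because the runner only depends on the `Retriever` interface, swapping search engines means constructing the runner with a different strategy object, which mirrors how `STORMWikiRunner` accepts different retrieval modules.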
Section 2: Collaborative STORM Pipeline (`knowledge_storm/collaborative_storm`)
This section implements the Co-STORM pipeline, which extends the original STORM framework with a collaborative information-seeking approach.
- `engine.py`: Similar to the STORM Wiki pipeline, this file defines the `CoStormRunner` class, which orchestrates the different stages of the Co-STORM pipeline. It introduces concepts such as:
  - Expert Agents: Simulated experts with different perspectives who participate in the information-seeking process.
  - Moderator: An agent that guides the conversation and injects new perspectives.
  - Knowledge Base: A hierarchical structure for organizing the collected information.
  The `CollaborativeStormLMConfigs` class defines the LLM configurations for the different agents and modules used in the Co-STORM pipeline.
- `modules` directory: Contains the implementation of the different modules used in the Co-STORM pipeline:
  - `grounded_question_answering.py`: Implements the question-answering module via `AnswerQuestionModule`.
  - `grounded_question_generation.py`: Implements the question generation module for the moderator.
  - `expert_generation.py`: Implements the expert generation module for creating the simulated expert agents.
  - `collaborative_storm_utils.py`: Provides utility functions specific to the Co-STORM pipeline.
  - `simulate_user.py`: Implements the simulated user agent.
  - `warmstart_hierarchical_chat.py`: Implements the warm-start module for initializing the conversation with background information.
  - `knowledge_base_summary.py`: Provides a summary of the knowledge base at each turn.
  - `costorm_expert_utterance_generator.py`: Generates the Co-STORM expert utterances.
  - `co_storm_agents.py`: Implements the different agent types used in Co-STORM, such as `CoStormExpert`, `SimulatedUser`, and `Moderator`.
  - `information_insertion_module.py`: Inserts relevant information into the knowledge base.
  - `callback.py`: Defines a base class for callback handlers, allowing users to inject custom logic at different stages of the pipeline.
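The moderator/expert turn-taking at the heart of Co-STORM can be sketched roughly as a round-robin discourse loop. This is an illustrative simplification, assuming nothing about the repo's actual classes beyond the roles described above:

```python
# Rough sketch of a Co-STORM-style round-robin discourse between a moderator
# and expert agents; all names are illustrative stand-ins, not the actual
# knowledge_storm.collaborative_storm API.
import itertools


class Agent:
    def __init__(self, name: str):
        self.name = name

    def utter(self, history: list[str]) -> str:
        # A real expert would ground its answer in retrieved evidence.
        return f"{self.name}: comment on turn {len(history)}"


class Moderator(Agent):
    def utter(self, history: list[str]) -> str:
        # A real moderator injects new questions to steer the discourse.
        return f"{self.name}: new question at turn {len(history)}"


def run_discourse(agents: list[Agent], num_turns: int) -> list[str]:
    """Cycle through the agents; each sees the conversation so far."""
    history: list[str] = []
    for agent in itertools.islice(itertools.cycle(agents), num_turns):
        history.append(agent.utter(history))
    return history


turns = run_discourse(
    [Moderator("moderator"), Agent("expert_a"), Agent("expert_b")], 5
)
```

In the real pipeline each utterance is grounded via retrieval and inserted into the knowledge base, but the control flow is essentially this kind of shared-history loop.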
Section 3: Language Model and Retrieval Modules (`knowledge_storm/lm`, `knowledge_storm/rm`)
This section defines the interfaces and implementations for different LLMs and retrieval modules used throughout the STORM pipeline.
- `knowledge_storm/lm`: Contains the implementations of different LLM clients, such as `OpenAIModel`, `AzureOpenAIModel`, `GoogleModel`, `ClaudeModel`, `DeepSeekModel`, `GroqModel`, `OllamaClient`, and `LitellmModel`. These classes provide a consistent interface for interacting with different LLMs, allowing the STORM pipeline to be easily adapted to different models. Their typical usage can be traced by checking `LMConfigs`.
- `knowledge_storm/rm`: Contains the implementations of different retrieval modules, such as `YouRM`, `BingSearch`, `BraveRM`, `SerperRM`, `DuckDuckGoSearchRM`, `TavilySearchRM`, `SearXNG`, `VectorRM`, `AzureAISearch`, and `GoogleSearch`. These classes provide a consistent interface for fetching information from different external sources, allowing the STORM pipeline to be grounded in different data sources.
  - The retrieval module classes implement the `dspy.Retrieve` interface, allowing them to be seamlessly integrated into the STORM pipeline.
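A retrieval module following this pattern looks roughly like the plain-Python sketch below. It mirrors the shape of a `dspy.Retrieve` subclass (a `forward` method taking one or more queries and returning result records) without the dspy dependency; the class name and result fields are illustrative assumptions:

```python
# Plain-Python sketch of the retrieval-module shape: forward(query_or_queries)
# returns a list of result dicts. Real modules subclass dspy.Retrieve and call
# an external search API; this stand-in returns canned results.
class ToyRM:
    def __init__(self, k: int = 3):
        self.k = k       # number of results to return per query
        self.usage = 0   # how many queries have been issued

    def forward(self, query_or_queries) -> list[dict]:
        # Accept a single query string or a list of queries, like the real RMs.
        queries = (
            [query_or_queries]
            if isinstance(query_or_queries, str)
            else list(query_or_queries)
        )
        self.usage += len(queries)
        results = []
        for q in queries:
            for i in range(self.k):
                results.append({
                    "url": f"https://example.com/{i}",
                    "snippets": [f"result {i} for {q}"],
                })
        return results


rm = ToyRM(k=2)
hits = rm.forward(["topic a", "topic b"])
```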
Section 4: Utility Functions and Data Classes (`knowledge_storm/utils`, `knowledge_storm/dataclass`)
This section provides utility functions and data classes used throughout the STORM pipeline.
- `knowledge_storm/utils`: Contains utility functions for file I/O, text processing, and other common tasks.
  - `ArticleTextProcessing`: Utility functions such as `limit_word_count_preserve_newline`, `remove_citations`, `update_citation_index`, and `parse_article_into_dict`, used in multiple places for processing article text.
  - `WebPageHelper`: Downloads webpages so that snippets can be generated for search engines that do not provide them.
  - `QdrantVectorStoreManager`: Helper class for vector-store operations.
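To make the text-processing helpers concrete, here is an illustrative sketch of a citation-stripping function in the spirit of `ArticleTextProcessing.remove_citations`. This is not the repo's exact implementation, just the general idea:

```python
# Illustrative sketch of citation stripping, in the spirit of
# ArticleTextProcessing.remove_citations; not the repo's exact code.
import re


def remove_citations(text: str) -> str:
    """Strip inline numeric citation markers like [1] or [2][7] from text."""
    return re.sub(r"\[\d+\]", "", text)


cleaned = remove_citations("STORM generates articles [1][2] with citations [3].")
```

Helpers like this matter because generated drafts carry citation indices that must be normalized or removed at different pipeline stages (e.g. before polishing or re-indexing references).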
- `knowledge_storm/dataclass`: Defines data classes used throughout the STORM pipeline for representing articles, information snippets, and other structured data.
  - `DialogueTurn`: Represents a single turn in a conversation between the agents.
  - `StormInformationTable`: Represents a collection of information snippets gathered during the knowledge curation stage.
  - `StormArticle`: Represents an article with sections, subsections, and references.
  - `KnowledgeNode`: Implements a node of the mind-map structure used by Co-STORM.
  - `KnowledgeBase`: Implements the mind map itself.
  - `ConversationTurn`: Holds the details of a Co-STORM conversation turn.
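The mind-map idea behind `KnowledgeNode`/`KnowledgeBase` can be sketched as a simple concept tree where snippets attach to nodes along a path of concepts. The names below are illustrative stand-ins, not the actual `knowledge_storm.dataclass` API:

```python
# Minimal sketch of a mind-map node in the spirit of KnowledgeNode: each node
# has a concept name, attached snippets, and child nodes. Names are
# illustrative, not the actual knowledge_storm.dataclass API.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    snippets: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)

    def insert(self, path: list[str], snippet: str) -> None:
        """Walk (or create) the path of child concepts; attach the snippet at the leaf."""
        if not path:
            self.snippets.append(snippet)
            return
        head, rest = path[0], path[1:]
        for child in self.children:
            if child.name == head:
                child.insert(rest, snippet)
                return
        new_child = Node(head)
        self.children.append(new_child)
        new_child.insert(rest, snippet)

    def all_snippets(self) -> list[str]:
        """Depth-first collection of every snippet in the subtree."""
        out = list(self.snippets)
        for child in self.children:
            out.extend(child.all_snippets())
        return out


root = Node("topic")
root.insert(["history"], "founded in 1998")
root.insert(["history", "early years"], "first prototype")
```

Organizing snippets this way lets a summarizer walk one branch at a time, which is roughly what the knowledge-base summary module does per turn.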
External API Calls
- Search Engine APIs: The STORM pipeline utilizes different search engine APIs (You.com, Bing Search, Brave Search, Serper.dev, DuckDuckGoSearch, TavilySearch, AzureAISearch, GoogleSearch, SearXNG) to fetch information from the internet. These APIs require API keys to be configured.
- LLM APIs: The STORM pipeline utilizes different LLM APIs (OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, DeepSeek, Groq, Ollama, TogetherAI) to generate text, such as questions, answers, outlines, and articles. These APIs also require API keys to be configured.
- Qdrant API: The STORM pipeline uses Qdrant as its vector database when grounding on local documents via `VectorRM`.
Insights
- The STORM Wiki pipeline automates knowledge curation using LLMs, offering a customizable solution for generating Wikipedia-like articles.
- The Co-STORM pipeline introduces a collaborative information-seeking approach, simulating expert agents to enhance the quality and diversity of the generated content.
- The repository provides a modular design, allowing users to easily adapt the system to different LLMs and data sources.
- The use of data classes and utility functions promotes code reusability and maintainability.
- The inclusion of example scripts and documentation makes it easier for users to get started with the STORM pipeline and customize it to their specific needs.
Design strengths of the codebase:
- Modularity: The code is highly modular, with each stage of the pipeline implemented as a separate module. This makes it easier to understand, maintain, and extend the code.
- Abstraction: The use of abstract base classes and interfaces promotes abstraction, allowing the system to be easily adapted to different LLMs and data sources.
- Configuration: The use of configuration classes allows users to customize the behavior of the system without modifying the code.
Creative aspects of the codebase:
- The use of LLMs for automated knowledge curation is a creative approach to solving the problem of generating high-quality articles.
- The introduction of a collaborative information-seeking approach in Co-STORM is a novel way to enhance the quality and diversity of the generated content.
- The use of a mind map as a knowledge base in Co-STORM is a creative way to organize and structure the collected information.
Report generated by TSW-X
Advanced Research Systems Division
Date: 2025-03-06 11:59:57