Elena Canorea
Communications Lead
Introduction
As Microsoft puts it, the greatest challenge, and at the same time the greatest opportunity, for LLMs is extending their powerful capabilities to solve problems beyond the data they were trained on, and achieving comparable results with data the LLM has never seen.
This opens up new possibilities in data research, and one of the major advances is GraphRAG. Here we explain what it is and how it works.
Retrieval-Augmented Generation (RAG) is a technique for searching information based on a user query and providing the results as a reference for generating an AI response.
This technique is an important part of most LLM-based tools and most RAG approaches use vector similarity as a search technique.
A baseline RAG typically integrates a vector database and an LLM, where the vector database stores and retrieves contextual information for user queries, and the LLM generates answers based on the retrieved context. While this approach works well in many cases, it presents difficulties with complex tasks such as multi-hop reasoning or answering queries that require connecting different pieces of information.
The main challenge for RAG is that it retrieves text based on semantic similarity, so it cannot directly answer complex queries whose specific details are never explicitly mentioned in the dataset. This limitation makes it difficult to find the exact information needed, often forcing costly and impractical workarounds such as manually building batteries of frequently asked questions and answers.
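The similarity-based retrieval described above can be sketched in a few lines of Python. The bag-of-words "embedding" here is a toy stand-in for a real embedding model and vector database; the chunk texts are invented for illustration:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by vector similarity to the query, as a baseline RAG does,
    # then hand the top-k to the LLM as context.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The vector database stores document chunks as embeddings.",
    "Multi-hop reasoning connects facts spread across documents.",
    "Bananas are rich in potassium.",
]
context = retrieve("how does the vector database store embeddings", chunks)
```

Note that nothing here connects facts *across* chunks: a question whose answer spans the first two chunks is retrieved well only if it shares enough vocabulary with each of them, which is exactly the multi-hop gap GraphRAG targets.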
To address these challenges, Microsoft developed GraphRAG, which uses LLM-generated knowledge graphs to provide substantial improvements in question-answering performance when analyzing complex documents.
This research highlights the power of prompt augmentation when performing discovery on private datasets, defined as data the LLM has never been trained on and has never seen before, such as an enterprise's business documents or communications. The graph created by GraphRAG is used in conjunction with graph machine learning to perform prompt augmentation at query time. This yields a substantial improvement in answering both classes of queries, demonstrating a level of intelligence or mastery that outperforms other approaches previously applied to private datasets.
Microsoft Research has presented this work using the Violent Incident Information from News Articles (VINA) dataset. This dataset was chosen because of its complexity and the presence of differing opinions and biased information.
They have used thousands of news articles from Russian and Ukrainian news sources from June 2023, translated into English, to create a private dataset on which they have performed their LLM-based retrieval. As the dataset is too large to fit in an LLM context window, a RAG approach is needed.
They start with an exploratory query to both a baseline RAG system and GraphRAG. Both systems perform well, so we can conclude that, for simple exploratory queries, baseline RAG is sufficient.
With a query that requires connecting the dots, however, the baseline RAG fails to answer the question. In contrast, GraphRAG discovers an entity in the query, which allows the LLM to ground itself in the graph and generate a superior response that includes provenance through links to the original supporting text. By using the LLM-generated knowledge graph, GraphRAG greatly improves the “retrieval” part of RAG, populating the context window with higher-relevance content; the result is better answers that capture the provenance of the evidence.
As we said above, Project GraphRAG is Microsoft Research's bet, with which they have built the most advanced technique on the market for deeply understanding text datasets, combining text extraction, network analysis, and LLM generation and summarization in a single end-to-end system.
Unlike a basic RAG, which uses a vector database to retrieve semantically similar text, GraphRAG improves the mechanism by incorporating knowledge graphs (KGs). These graphs are data structures that store and link related or unrelated data based on the relationships between them.
A GraphRAG pipeline usually consists of two processes: indexing and querying.
The indexing process includes four key steps:
There are two different query workflows, designed for different types of queries:
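GraphRAG's documentation describes these as global search (answered from pre-computed community summaries, for corpus-wide questions) and local search (answered from an entity's graph neighborhood, for entity-specific questions). A conceptual sketch of that routing in plain Python, with entirely made-up data structures rather than the real GraphRAG API:

```python
# Hypothetical sketch, not the GraphRAG API: route a question to a
# "global" or "local" workflow depending on whether it names a known entity.

def route_query(question: str,
                community_summaries: list[str],
                entity_index: dict[str, list[str]]) -> str:
    q = question.lower()
    for entity, facts in entity_index.items():
        if entity.lower() in q:
            # Local search: pull this entity's neighborhood facts into context.
            return "local: " + "; ".join(facts)
    # Global search: fall back to the pre-computed community summaries.
    return "global: " + " ".join(community_summaries)

# Invented example data.
summaries = ["Community 1 covers supply-chain disputes."]
index = {"Helix Labs": ["Helix Labs develops the Helix SDK."]}

local_answer = route_query("What does Helix Labs build?", summaries, index)
global_answer = route_query("What are the main themes?", summaries, index)
```

A real implementation would replace the string concatenation with LLM calls over the retrieved context, but the routing decision is the essential difference between the two workflows.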
GraphRAG uses an LLM to create a comprehensive knowledge graph that details the entities and their relationships across any collection of text documents. This graph makes it possible to exploit the semantic structure of the data and generate answers to complex queries that require a broad understanding of the entire text.
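To make the idea concrete, here is a minimal sketch of the kind of graph such an extraction step could produce, and how traversing it answers a multi-hop question; the triples and entity names are invented for illustration:

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, as an LLM extraction
# step might emit them from a document collection.
triples = [
    ("NovaCorp", "acquired", "Helix Labs"),
    ("Helix Labs", "develops", "Helix SDK"),
    ("Helix SDK", "written_in", "Rust"),
]

# Build an adjacency-list knowledge graph.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def multi_hop(start: str, depth: int = 3) -> list[tuple[str, str, str]]:
    # Walk outgoing edges to connect facts that no single chunk contains
    # and that a pure similarity search would therefore miss.
    path, node = [], start
    for _ in range(depth):
        if not graph[node]:
            break
        rel, obj = graph[node][0]
        path.append((node, rel, obj))
        node = obj
    return path

# Connects "NovaCorp" to "Rust" across three hops.
facts = multi_hop("NovaCorp")
```

The chain of retrieved facts, with its source links, is what would be placed in the LLM's context window, giving both a relevant answer and its provenance.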
Microsoft released the preview version of GraphRAG in July 2024 and, since then, thanks to the incredible reception and collaboration of the community, they have kept improving the service, culminating in the official release of GraphRAG 1.0.
The main improvements involve ergonomic refactorings and availability:
If you want to keep up with the latest developments in AI and other technologies, subscribe to our newsletter!