Google Allows AI Models to Validate Their Responses
Google has developed a series of language models that can answer questions about numerical facts with greater accuracy than previous algorithms.
The company has made the DataGemma models available via the Hugging Face platform.
The DataGemma series is designed to answer users' questions about statistical facts, such as the average revenue of companies in a particular market sector.
The series answers those queries using information from Data Commons, a free data repository run by Google.
The repository contains more than 240 billion data points from sources such as the United Nations, the World Health Organization, the Centers for Disease Control, and statistical offices.
The DataGemma series is based on Gemma 2 27B, an open large language model with 27 billion parameters that Google released in June. Google says Gemma 2 27B can compete with language models that have twice as many parameters.
According to the company, the DataGemma series is based on a version of Gemma 2 27B specifically optimized for processing numerical facts.
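Because the models are distributed through Hugging Face, the minimal sketch below shows how the published weights might be loaded with the Transformers library. The model identifier is an assumption for illustration; confirm the exact repository name on the Hugging Face hub before running it.

```python
# Minimal sketch of loading DataGemma weights from Hugging Face with Transformers.
# The model ID below is assumed for illustration; confirm the exact repository
# name on the Hugging Face hub before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rig-27b-it"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 27B parameters: reduced precision and a large GPU are needed
    device_map="auto",
)

prompt = "What was the unemployment rate in California in 2020?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```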
The model interacts with Data Commons, the information repository from which it draws those facts, through a natural language interface.
“The DataGemma series uses the Data Commons natural language interface to ask questions rather than needing to know the specific data schema or API for the underlying datasets,” Google said in a blog post. “The trick is to train the large language model to know when to ask.”
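The hedged sketch below illustrates the idea in that quote: the language model only has to produce a plain-English question, and a thin wrapper forwards it to Data Commons. The endpoint URL, wrapper function, and response shape here are assumptions made for illustration, not the actual Data Commons API.

```python
# Hypothetical wrapper showing the "just ask" pattern from the quote: the model
# emits a natural-language question and this function forwards it to Data Commons.
# The endpoint URL and response shape are assumptions, not the real API.
import requests

DC_NL_ENDPOINT = "https://api.datacommons.org/v2/nl/query"  # assumed URL

def ask_data_commons(question: str) -> dict:
    """Send a natural-language question to Data Commons and return the raw reply."""
    response = requests.post(DC_NL_ENDPOINT, json={"query": question}, timeout=30)
    response.raise_for_status()
    return response.json()

# The language model only needs to produce the question text; it never has to
# know the schemas of the underlying datasets.
print(ask_data_commons("What is the population of Kenya?"))
```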
Google has developed two versions of the DataGemma series, each taking a different approach to answering user questions.
The first version uses a method known as RIG, or Retrieval-Interleaved Generation, to process queries.
When a user asks a question, the model does not answer solely from its own knowledge base: it fetches the required statistics from the Data Commons repository, and the large language model then uses the retrieved data to generate its response.
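As a rough illustration of that interleaved flow, the sketch below assumes the fine-tuned model emits inline markers wherever a statistic is needed, and each marker is then resolved against Data Commons before the final answer is returned. The marker syntax and helper functions are hypothetical, not the actual DataGemma implementation.

```python
# RIG-style sketch under stated assumptions: the fine-tuned model is assumed to
# emit inline markers such as [DC: <question>] wherever a statistic is needed,
# and each marker is resolved against Data Commons before the answer is returned.
# Marker syntax and helpers are illustrative, not the actual DataGemma code.
import re

def generate_draft(prompt: str) -> str:
    """Stand-in for the fine-tuned model's draft, which contains retrieval markers."""
    return "California's unemployment rate in 2020 was [DC: unemployment rate California 2020]."

def lookup_data_commons(question: str) -> str:
    """Stand-in for a Data Commons natural-language lookup."""
    return "10.1%"  # stubbed value for illustration only

def answer_with_rig(prompt: str) -> str:
    draft = generate_draft(prompt)
    # Replace every retrieval marker with the value fetched from Data Commons.
    return re.sub(r"\[DC: (.+?)\]", lambda m: lookup_data_commons(m.group(1)), draft)

print(answer_with_rig("What was California's unemployment rate in 2020?"))
```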
The second version uses RAG, or Retrieval-Augmented Generation, to process queries.
When a user enters a query, the model retrieves information relevant to that query from the Data Commons repository and sends the retrieved material, together with the original question, to the Gemini 1.5 Pro model, which generates the answer.
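A hedged sketch of that flow appears below. The retrieval step is stubbed out with an illustrative value, and the final call uses the google-generativeai client with the gemini-1.5-pro model name; the stub and the prompt format are assumptions made for illustration rather than Google's actual pipeline.

```python
# RAG-style sketch under stated assumptions: retrieval from Data Commons is
# stubbed out, and the final answer is composed by Gemini 1.5 Pro through the
# google-generativeai client. Prompt format and stub are for illustration only.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

def retrieve_from_data_commons(query: str) -> list[str]:
    """Stand-in for fetching statistics relevant to the query from Data Commons."""
    return ["Kenya population, 2022 (illustrative value): roughly 54 million"]

def answer_with_rag(query: str) -> str:
    facts = retrieve_from_data_commons(query)
    # Augment the user's question with the retrieved statistics, then let
    # Gemini 1.5 Pro compose the final answer from that grounded context.
    augmented_prompt = (
        "Answer the question using only the statistics below.\n"
        "Statistics:\n" + "\n".join(facts) + f"\n\nQuestion: {query}"
    )
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(augmented_prompt).text

print(answer_with_rag("What is the population of Kenya?"))
```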
According to the MIT Technology Review, the RIG version of the DataGemma series successfully retrieves numerical facts from the Data Commons repository 58 percent of the time.
In contrast, the RAG version of the DataGemma series generated correct answers between 80 percent and 94 percent of the time during Google's tests.
Google plans to improve the DataGemma series by training it on additional information, as well as increasing the number of questions the series can answer from hundreds to millions.