texas-moody

Cuestión Pública of Colombia creates AI tool to improve daily coverage with investigative journalism

Independent media outlet Cuestión Pública has specialized in investigative journalism that seeks to shed light on cases of abuse of power in Colombia. Since its founding in 2018, it has published multiple reports and special products that have had significant impacts on the public agenda in that country.

However, like many small investigative journalism outlets, Cuestión Pública – whose editorial team consists of just over ten people – has found it difficult to cover day-to-day events.

“The challenge as an investigative journalism media outlet is that you are not as present on the media agenda because investigation takes a lot of time. [...] You lose relevance the slower the investigation is done,” Claudia Báez, general director and co-founder of Cuestión Pública, told LatAm Journalism Review (LJR). “What [covering] the current events does is give the media outlet relevance, positioning it on the media agenda and retaining the audience.”

Illustration of the AI tool Odin Project, developed by Colombian media outlet Cuestión Pública.

Odin is a tool that uses artificial intelligence to optimize the creation of current affairs news content enriched with investigative journalism context. (Photo: Screenshot from YouTube)

 

 

After experimenting with generative artificial intelligence tools to streamline design processes, at the end of 2022, Báez had the vision of finding a way to use said technology to provide their readers with timely coverage, but enriched with investigative journalism. This, taking advantage of the existing assets of the media outlet, especially the extensive databases that they have built for large data journalism projects, such as the gamified products “Sabemos lo que hiciste la legislatura pasada” (We know what you did last term) and “Juego de Votes” (Game of Votes).

This is how the Odin Project emerged, a tool that uses artificial intelligence to improve the creation of timely content enriched with the context of investigative journalism. Odin operates under the “zero waste” concept, which seeks to make the most of the data that the media outlet has investigated for years and allow them to become relevant through current events.

Odin works through an interface in which the journalist enters the title of a current topic. Then, the system searches for information related to that topic in the structured Cuestión Pública databases and weights it according to its relevance. Subsequently, the tool generates a draft of a thread for X with the style of the media outlet, which is edited by the journalist to be published later.

“Odin reduces the production time of a thread from three hours to 15 minutes. And it has the tone of Cuestión Pública, how our voice speaks on networks, because that is what we trained it for,” Báez said. “When a journalist is doing an investigation, connecting all the dots, and has to cover something from current events, he loses concentration. Thus [with Odin] the investigative journalist who was doing something else, arrives, edits it and has it in 15 minutes.”

Odin, whose name refers to the acronym for Optimized Data Integration Network, was developed from Cuestión Pública participation in the Artificial Intelligence Journalism Challenge (AIJC), a global competition developed by the Open Society Foundation that offers training, mentoring and funding to selected newsrooms to develop innovative ways to apply artificial intelligence in journalism.

Cuestión Pública was one of the only two Latin American media outlets among the 12 newsrooms selected for the 2023 edition of AIJC. The other was Agência Pública, from Brazil. The winning newsroom was Rappler, from the Philippines, while Cuestión Pública won an Honorable Mention for Odin.

“There were three judges that were judging the final competition and they called out the Odin Project as being particularly innovative,” David Caswell consultant & researcher focused on AI in journalism, told LJR. “But the judges were so impressed with Odin that they granted this honorable mention. They wanted to recognize it, even though there could only be one winner.”

According to Caswell, part of what motivated the jury to award the honorable mention is the fact that Odin combines context with journalistic rigor from Cuestión Pública databases with the fluidity and timeliness of current news, in content that is easy to consume on social networks.

“What that represents is something new, it's this ability to take something that has perhaps just occurred and then to ground that in a well researched, well maintained, verified body of information, and then present that contextualized news in an accessible way,” said the consultant, who also served as a mentor to the AIJC participants as they developed their projects.

"In Colombian media before, you would have had somebody who [...] have been doing journalism for a long time and would have built up this knowledge in their head. And then some new piece of news comes along and they would instantly know all of these facts, these connections, and then they can write about it.”

In addition to mentoring, the program included a grant of about $6,300 to each of the 12 newsrooms selected for prototyping their projects. Cuestión Pública used this support to hire a technology development agency in Colombia for the creation of Odin.

Cutting-edge techniques

Báez and Caswell agreed that one of the main innovations that Odin brings to journalism is the possibility of optimizing the results of generative artificial intelligence with specific information, different from that which was used for its training, without having to modify the model itself.

Colombian journalist Claudia Báez.

Claudia Baez received the honorable mention awarded to Odin as part of the AIJC at the Splice Beta Journalism festival in Chiang Mai, Thailand, in November 2023. (Photo: Claudia Baez Twitter)

And this is achieved thanks to a methodology called RAG (Retrieval-Augmented Generation), which allows large language models (LLM) to take advantage of data from organizations so that they can deliver more relevant answers that have adequate context.

“A [natural language] model is trained with certain information, but it does not know the information that I have, so I inject it with this external information to see how it behaves with this information that it was not trained with, but that can be provide,” Esteban Ponce de León, a researcher at the Atlantic Council’s Digital Forensic Research Lab (DFRLab) in Colombia, told LJR. “[RAG] is currently how we can better connect language models with our own data.”

To apply RAG technology to Odin, it was first necessary to subject the Cuestión Pública databases to a vectorization process. That is, transforming the information into a numerical format. In this way, when a journalist enters a prompt into the system, it is also vectorized and Odin compares the numerical values ​​of the prompt and the databases, and returns those that are most similar to each other.

“If you asked about a specific congressman, surely those similar results will contain information related to that congressman, but if you also added a political situation, for example an issue of pensions, an issue of protests, then the results that Odin will bring you are going to be those of the congressman, plus that political situation that you included in your prompt,” said Ponce de Léon, who was also one of the data scientists who participated in the development of Odin.

Part of the importance of RAG technology in journalism is that it allows LLMs to draw on quality, up-to-date information rather than using the information they were trained on, which can sometimes be outdated or inaccurate. This also prevents these models from returning “hallucinations,” which could have fateful consequences for journalism.

In Odin, this is achieved by providing a “system prompt,” which is a type of command that is introduced to a generative artificial intelligence model prior to the start of each session, and that determines its behavior, tasks and limits.

“The system prompt will define the role of the model, literally with phrases like 'do not use your prior knowledge and focus only on this context that is being given to you to generate your result,'” Ponce de Léon explained. "It is not a programmatic configuration, it is simply giving it an instruction in natural language so that it focuses on that role specifically on what it has to respond to in the next prompt, which comes from the user.”

For vectorization processes, Odin uses Google's BERT model, while its generative functions are possible thanks to the GPT 3.5 and GPT 4 models from OpenAI, the organization that developed ChatGPT, Ponce de León explained.

According to Caswell, very few news outlets are using the RAG methodology in their generative AI applications. Most of those that do, he said, are large newsrooms that use data from their archives or small sets of documents, such as judicial or legislative documents, to contextualize the content generated.

“It's a very hot topic, kind of at the leading edge of applying AI in journalism,” Caswell said. “But I think Cuestión Pública is kind of, you know, quite a ways out in the lead, in sort of that very advanced and therefore very impactful demonstration of what's possible with this.”

In parallel with providing Odin with informational context, form context is also provided, the latter so that the result returned has the structure and format that Cuestión Pública uses for its content on social networks. This is achieved by providing the model with a series of examples of X threads created by journalists from the media outlet so that Odin can replicate their tone and style.

This machine learning technique is called “few-shot training,” and it consists of training an LLM to execute a set of tasks based on a certain number of samples, so that it learns to recognize patterns in those samples and use them in its future answers.

Illustration of the gamified data journalism project Sabemos lo que hiciste la legislatura pasada, by Colombian independent media outlet Cuestión Pública

The tool leverages the extensive databases that Cuestión Pública has built for data journalism projects, such as the gamified products "We Know What You Did Last Legislature" and "Game of Votes." (Photo: Courtesy Cuestión Pública)

However, currently the Odin development team is planning to obtain those same results by directly customizing the GPT model through the OpenAI API, which is possible to do since the GPT 3.5 version was launched. This process is known as “fine tuning.”

“What you do is use this GPT 3.5 model and you customize it much more with these examples that you already have, and it is no longer through prompts, but through a special 'fine tuning' process that OpenAI has so that the results you expect from text generation follow those examples you used. It is as if it were a new phase of training,” Ponce de León said.

Odin does not do the journalist's job

Although Odin participates in a large part of the creation of X threads on current news, Báez is firm in ensuring that these threads are not content generated by artificial intelligence. What the tool does, she said, is reduce the time it takes a journalist to locate and analyze the information in the media's databases to relate it to the current topic.

She also said that Odin is not coming to replace any journalist on her team, but rather to optimize their time so that they can dedicate it to in-depth investigations.

“Practically what Odin gives you is a draft [...] to reduce and optimize the time of my journalists, who are invaluable,” the journalist said. “Odin already weighed the information for me, he already gave me the journalistic findings, but there is a conscious addition from the journalist. The output is a draft, but there is hard human editing work there.”

Báez said that for a small and independent media outlet, growing in human resources is difficult, which is why technologies such as artificial intelligence applied to journalism are of great help to enhance the work. But, on the other hand, she is also aware that applications like Odin involve costs that many small media are not able to afford.

In Ponce de Léon's opinion, creating and maintaining a tool like Odin is not impossible for a small or medium-sized media outlet in Latin America, but it does require at least one person on the team with technical knowledge in programming and data science.

“It's really not as complicated as it might seem. There is a technical part that is important, which is the knowledge of a programming language, knowledge of APIs to be able to connect with these models, interacting with them from a programmatic part, but I think that a person within a small team who has those skills can do it,” he said.

Although OpenAI LLMs are not free, new open source models with similar functions to GPT models are increasingly emerging, such as Meta's LLaMA, or the models of the French artificial intelligence company Mistral AI, Ponce de León said.

But although these open source tools exist or have affordable costs, it must also be kept in mind that systems like Odin also require hosting costs for large amounts of data, the researcher added. However, Ponce de León believes that sooner or later news media will find it necessary to integrate some type of artificial intelligence into their processes.

“There are cloud costs to consider. Maybe that's where it becomes a little expensive, but I think media has to transition to this type of technology and in the process there are costs involved,” he said. “I would be very encouraged if media try to start thinking about how to integrate these types of technologies without the fear of thinking that they are very complex to add.”

Cuestión Pública already plans to increase its commitment to Odin to, in a next stage, increase the capabilities of the tool and look for ways to monetize it. For now, the media outlet is working on finding a way to link Odin with the monitoring of trends on social networks so that the generation of contextualized content on current events occurs in an automated manner.

Although she does not rule out using the tool to generate content in other formats, Báez said that what Cuestión Pública seeks with Odin is not to publish content every day, but to make the most of investigative journalism and bring the information to other audiences, beyond regular readers who read long-form reports.

“A media outlet like Cuestión Pública is not interested in creating articles to fill pages, rather I prefer to convert my findings, which are so profound, to democratize the information so that it can reach less sophisticated audiences that can be understood more by the base. That is where we are evolving,” she said.

Translated by Teresa Mioli
Republishing Guidelines