Prediction models that anticipate news trends, deep neural networks that optimize automated processes, and complex techniques that make smarter chatbots possible are all elements of artificial intelligence (AI) increasingly present in newsrooms.
However, there are still many media outlets and journalists who are not completely familiar with these concepts, some of which are redefining how journalism is produced and consumed.
In a previous installment, LatAm Journalism Review (LJR) explored the definition of basic AI concepts and their applications in the newsroom. This time, we dive deeper with 10 advanced AI terms, such as data mining, predictive analytics and semantic search. We also present examples of how these elements are helping news organizations improve aspects of their work such as investigative accuracy, audience engagement and efficiency in news generation.
In the context of AI, an entity is a unit of information that a system can recognize, process and use. Entities can represent various things, such as objects, concepts, names, events, quantities or relationships.
Entities are essential for the interpretation and understanding of data in AI algorithms, as they serve as building blocks for systems to process and analyze information effectively. By recognizing and classifying entities, AI systems can transform unstructured data into structured information, facilitating tasks such as data analysis and information retrieval.
Example in journalism: TimeLark is an investigative journalism tool developed in 2023 by journalists and technologists from the BBC, the Organized Crime and Corruption Reporting Project (OCCRP) and the news agency Reuters. The tool is used to analyze large volumes of information and discover hidden connections between the elements of a news event.
TimeLark uses machine learning to extract entities from unstructured news articles, along with the connections between them and how those connections evolve over time. The extracted information is then plotted as graphs for easy analysis. The tool was tested on more than 30,000 BBC and Reuters articles about the war in Ukraine, where the extracted entities included political leaders, countries and international organizations, as well as the figures around them.
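To make the idea concrete, here is a minimal sketch, in Python, of how extracted entities and their relationships might be represented as a graph. This is not TimeLark's actual implementation; the entities, relationships and dates are invented for illustration, and the networkx library is just one common choice for this kind of structure.

```python
# Minimal sketch: representing extracted entities and their relationships as a graph.
# The entities, relations and dates below are invented; a real pipeline would extract
# them automatically from article text with machine learning.
import networkx as nx

G = nx.MultiDiGraph()

# Each edge links two entities and records the relationship and when it was reported.
relations = [
    ("Leader A", "Country X", "visited", "2022-03-01"),
    ("Country X", "Organization Y", "joined talks with", "2022-03-15"),
    ("Leader A", "Organization Y", "addressed", "2022-04-02"),
]

for source, target, relation, date in relations:
    G.add_edge(source, target, relation=relation, date=date)

# List every connection in chronological order, ready to plot or inspect.
for source, target, attrs in sorted(G.edges(data=True), key=lambda e: e[2]["date"]):
    print(f"{attrs['date']}: {source} --{attrs['relation']}--> {target}")
```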
In the context of AI, hallucinations are a phenomenon in which an AI model – especially those based on large language models (LLMs) – generates responses or results that are incorrect or incoherent.
Hallucinations can be caused by a variety of factors, including the AI model "perceiving" patterns that do not exist, or the model not having enough relevant information to generate an appropriate response and trying to fill the gap with irrelevant information.
Example in journalism: In 2023, journalists at The New York Times tested ChatGPT’s accuracy by asking questions such as when the paper published its first article about AI.
The platform responded that it was the article “Machines Will Be Capable of Learning, Solving Problems, Scientists Predict,” supposedly published on July 10, 1956, as part of coverage of a seminal conference at Dartmouth College in New Hampshire.
The New York Times reported that the conference had been real, but the article had never existed.
Experts have warned that hallucinations like this show that the use of generative AI in automated journalism tasks can result in the publication of false or misleading information that appears to be true.
Text-to-speech is a technology within the field of AI that allows the generation of synthetic human voice audio from written text. This conversion is performed using natural language processing (NLP) algorithms to interpret the text, together with advanced voice synthesis models.
Among the most widely used techniques are deep neural networks, such as Tacotron and WaveNet, which are capable of generating fluid, natural-sounding audio. Other, more traditional techniques, now less commonly used, include the concatenation of pre-recorded voice fragments and mathematical models that simulate the acoustic properties of speech.
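As a simple illustration of the concept, the sketch below uses the off-the-shelf gTTS library, which calls an online text-to-speech service, rather than the neural models named above; the script text and file name are placeholders.

```python
# Minimal sketch: converting a short news brief to spoken audio with the gTTS library.
# gTTS relies on an online text-to-speech service; newsroom-grade systems would more
# likely use neural models such as those mentioned above.
from gtts import gTTS

text = "Good evening. Here are tonight's top stories."
tts = gTTS(text=text, lang="en")   # choose the language of the script
tts.save("bulletin.mp3")           # write the synthesized audio to a file
print("Saved bulletin.mp3")
```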
Example in journalism: Illariy is the name of an avatar created with AI to present news in Quechua for a university newscast in Peru.
To make Illariy speak Quechua fluently, its creators used D-ID, a platform capable of making images “speak” using text-to-speech technologies. Since Quechua is not among the languages supported by D-ID, Illariy’s creators fed the platform Spanish text written phonetically so that, when read aloud by the system, it emulated Quechua words.
Optical Character Recognition (OCR) is a technology that uses specialized algorithms to convert images of text, such as scanned pages or photographed documents, into editable text. This technology allows the digitization of printed documents and enables computer systems to read and process the text contained in images or scanned files.
The systems behind OCR can be computer vision algorithms, pattern recognition algorithms or deep learning algorithms (Deep Learning OCR). The latter are commonly based on neural networks and are particularly useful in the recognition of handwritten texts.
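A minimal sketch of OCR in practice, assuming the open source Tesseract engine is installed along with the pytesseract and Pillow packages; the image file name is a placeholder.

```python
# Minimal sketch: extracting editable text from a scanned page with Tesseract OCR.
# Requires the Tesseract engine plus the pytesseract and Pillow Python packages.
# "scanned_page.png" is a placeholder file name.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")      # load the scanned document image
text = pytesseract.image_to_string(image)   # run OCR to recover the text
print(text)
```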
Example in journalism: The International Consortium of Investigative Journalists (ICIJ) created Datashare, an open source tool that lets users upload huge amounts of data and search it for information. The tool includes OCR technology to digitize text from scanned documents and handwritten documents.
One of the most notable uses of Datashare was in the cross-border, crowdsourced mega-investigation “Pandora Papers,” where the tool helped organize and analyze 11.9 million leaked documents. The OCR feature allowed the journalists to automatically detect names and organizations from the documents, as well as extract entities into databases.
Data mining is an analytical process that uses machine learning techniques, statistical methods, and database systems to explore large data sets and discover meaningful patterns, trends and relationships that are not immediately visible.
Example in journalism: Journalistic investigations based on leaks of millions of documents require the analysis of large amounts of data. Journalists called upon by the ICIJ for mega investigations such as the Panama Papers, the Paradise Papers or the Pandora Papers have turned to RapidMiner, a tool that uses data mining algorithms.
RapidMiner helps process the text and metadata of leaked documents and create clusters based on common words and phrases, making searches across the files faster.
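The clustering idea behind this kind of processing can be sketched with scikit-learn rather than RapidMiner; this is not the ICIJ's actual pipeline, and the document snippets below are invented for illustration.

```python
# Minimal sketch of clustering documents by shared vocabulary, a common data mining step.
# Uses scikit-learn; the documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "offshore company registered in Panama",
    "shell company registered in the British Virgin Islands",
    "minister met with lobbyists over mining contract",
    "mining contract awarded without public tender",
]

# Represent each document as a weighted word-frequency vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Group the documents into two clusters based on the words they share.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for doc, label in zip(documents, labels):
    print(label, doc)
```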
Predictive analytics in AI is an approach based on statistical models, machine learning algorithms, and historical data to predict future events. It aims to identify patterns and trends in large volumes of data to optimize decision making.
Predictive analytics algorithms process historical data through a workflow that includes collecting, cleaning and structuring the data. The system then selects the most relevant variables and applies mathematical models such as regression, decision trees or neural networks to generate predictions based on probabilities and patterns detected in the data.
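As a toy illustration of this workflow, the sketch below fits a logistic regression on invented reader data to estimate the probability of a subscription conversion; the variables and numbers are made up and are not drawn from any outlet's real data.

```python
# Minimal sketch of predictive analytics: predict whether a reader converts to a
# subscriber from simple behavioral variables. All data here is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical data (invented): [articles read per week, newsletter sign-up (0/1)].
X = np.array([[2, 0], [15, 1], [4, 0], [20, 1], [1, 0], [12, 1], [8, 0], [18, 1]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = became a subscriber

# Fit a regression model on the historical data.
model = LogisticRegression().fit(X, y)

# Estimated probability that a new reader (10 articles/week, newsletter sign-up) converts.
print(model.predict_proba([[10, 1]])[0][1])
```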
Example in journalism: The New York Times has used predictive analytics algorithms for some of its business and editorial strategies. The newspaper used this technique to analyze how readers become subscribers and how that process could be influenced to increase conversions.
Predictive analytics also helped the outlet understand which topics generated the most reader engagement, so marketing teams could know what types of articles to promote on social media.
Semantic search is an advanced data search technique that focuses on understanding the meaning and intent behind a query, rather than simply looking for literal keyword matches.
This technology uses NLP and machine learning to analyze natural language and capture nuances such as context and user intent, resulting in more accurate and relevant search results.
Example in journalism: AskFT is a generative AI tool developed by the British newspaper Financial Times in 2024, initially only available to its corporate clients. The tool uses an LLM that, through semantic search, finds articles relevant to the user's question.
To do this, articles are broken into segments and converted into vectors. The tool then summarizes the articles returned by the search in a response generated by the LLM, which includes citations of the source articles.
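A minimal sketch of the vectorize-and-search idea, not Ask FT's actual pipeline, assuming the sentence-transformers library and a small general-purpose embedding model; the article segments and query are invented.

```python
# Minimal sketch of semantic search: article segments are converted into vectors and
# a query is matched by meaning rather than exact keywords. Texts are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose embedding model

segments = [
    "The central bank raised interest rates to curb inflation.",
    "The football club announced a new head coach.",
    "Consumer prices rose faster than analysts expected.",
]

query = "Why is the cost of living going up?"

# Encode segments and query into vectors, then rank segments by cosine similarity.
segment_vectors = model.encode(segments, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_vector, segment_vectors)[0]

for score, segment in sorted(zip(scores.tolist(), segments), reverse=True):
    print(round(score, 3), segment)
```

Note that neither segment shares a keyword with the query; the ranking relies on meaning, which is the point of semantic search.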
Named Entity Recognition (NER) is a subfield of NLP that focuses on identifying and classifying named entities in unstructured text. These entities may include names of people, organizations, places, quantities, monetary values, percentages, etc.
NER uses algorithms, statistical models and deep learning to detect and classify entities, facilitating the analysis and processing of unstructured text.
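A minimal sketch of NER using the open source spaCy library and its small English model; the sentence is invented, and the labels printed (ORG, GPE and so on) are spaCy's own entity categories.

```python
# Minimal sketch: named entity recognition with spaCy's small English model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The BBC and Reuters analyzed 30,000 articles about the war in Ukraine.")

# Each detected entity comes with a label such as ORG (organization), GPE (place)
# or CARDINAL (number).
for ent in doc.ents:
    print(ent.text, ent.label_)
```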
Example in journalism: In 2021, journalists from La Nación (Argentina), Ojo Público (Peru), CLIP (Costa Rica) and MuckRock (United States) developed DockIns, a tool that, using machine learning and NLP techniques, analyzes and classifies documents with unstructured content, extracts information from these documents and identifies topics and entities.
To refine the tool, the team tested several NER models, including spaCy and DocumentCloud, with more than 10,000 documents containing unstructured data from bids, contracts and purchases from the Argentine Ministry of Security.
Retrieval-Augmented Generation (RAG) is an AI technique that combines natural language generation with information retrieval from sources other than the data used to train the model. This helps improve the accuracy of the generated responses and avoid “hallucinations.”
A RAG system, consisting of an LLM and an information retrieval system, can even connect to real-time or frequently updated information sources, such as social media or news sites, to generate up-to-date responses.
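The pattern can be sketched as follows. This is a generic illustration, not any outlet's production system: retrieval reuses a sentence-embedding model, generate_answer is a placeholder standing in for a real LLM call, and the documents and query are invented.

```python
# Minimal sketch of the RAG pattern: retrieve relevant documents first, then pass them
# to a language model as context. `generate_answer` is a placeholder for a real LLM call.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Fact check: the viral video of the flooded stadium is from 2019, not this week.",
    "Fact check: the ministry did not announce a new fuel subsidy.",
]

def retrieve(query, k=1):
    """Return the k documents most similar in meaning to the query."""
    doc_vectors = model.encode(documents, convert_to_tensor=True)
    query_vector = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores.tolist(), documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

def generate_answer(query, context):
    # Placeholder: a real system would send this prompt to an LLM and return its reply.
    return f"Answer '{query}' using only this verified material:\n{context}"

query = "Is the flooded stadium video real?"
context = "\n".join(retrieve(query))
print(generate_answer(query, context))
```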
Example in journalism: To avoid “hallucinations” as much as possible, FátimaGPT, the fact-checking chatbot from Brazilian outlet Aos Fatos, uses a RAG model. This model links the chatbot’s large language model to a specific, reliable database from which it extracts the information it needs to generate responses.
This database is made up of all the journalistic and verification material that Aos Fatos has produced.
Prompt design is the process of creating and optimizing the instructions given to AI models so that they generate accurate, coherent responses appropriate to the user's needs.
Good prompt design not only improves the accuracy of responses from LLMs like ChatGPT or Gemini, but also allows users to take full advantage of the capabilities of these tools.
Prompt design involves different techniques, depending on the situation and the user's context. These include providing examples, trial and error, role assignment and scope delimitation, among others.
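As an invented illustration, the snippet below assembles a prompt that combines role assignment, an example and scope delimitation; the wording and scenario are made up and are not taken from any published guide.

```python
# Illustrative sketch of a prompt combining several techniques: role assignment,
# an example, and scope delimitation. The wording is invented.

role = "You are a copy editor at a local news outlet."

task = "Rewrite the reader comment below as a neutral one-sentence summary."

example = (
    'Example:\n'
    'Comment: "The potholes on Main Street have wrecked two of my tires!"\n'
    "Summary: A resident reports repeated tire damage from potholes on Main Street."
)

scope = "Only summarize. Do not add opinions or information that is not in the comment."

comment = "The new bus schedule left my neighborhood with no service after 8 p.m."

prompt = f"{role}\n\n{task}\n\n{example}\n\n{scope}\n\nComment: \"{comment}\"\nSummary:"
print(prompt)
```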
Example in journalism: Joe Amditis, a journalist and professor at Montclair State University in New Jersey, published “Beginner’s prompt handbook: ChatGPT for local news publishers” in 2023. In it, he teaches how journalists can design effective prompts and how to calibrate them for specific use cases in their newsroom.
The handbook presents the parts of a good prompt, discusses the importance of clarity, specificity and contextualization in prompt design, and offers several techniques journalists can use. It also addresses the importance of iteration and experimentation in improving the accuracy and relevance of responses.