
Our search yielded a total of 796 database hits. After removal of duplicates, 738 records underwent title and abstract screening, and 158 full texts were assessed. In total, 53 records were included in the dataset, encompassing 23 original articles25,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56, including theoretical or empirical work, 11 letters57,58,59,60,61,62,63,64,65,66,67, six editorials68,69,70,71,72,73, four reviews8,74,75,76, three comments24,77,78, one report79 and five unspecified articles80,81,82,83,84. The flow of records through the review process is shown in Fig. 1. Most works focus on applications of ChatGPT across various healthcare fields, as indicated in Table 1. Regarding the affiliation of the first authors, 25 articles come from North America, 11 from Europe, six from West Asia, four from East Asia, three from South Asia and four from Australia.

Fig. 1: Diagram following PRISMA guidelines showing the flow of records through the screening process.
During analysis, four general themes emerged in our dataset, which we used to structure our reporting. These themes include clinical applications, patient support applications, support of health professionals, and public health perspectives. Table 2 provides exemplary scenarios for each theme derived from the dataset.
Clinical applications
Several authors discuss the use of LLMs for predictive patient analysis and risk assessment in or prior to clinical situations as a potentially transformative application74,80, supporting the initial diagnosis and triaging of patients39,52. The role of LLMs in this scenario is described as that of a “co-pilot” that uses available patient information to flag areas of concern or to predict diseases and risk factors44.
Currie, in line with most authors, notes that predicting health outcomes and relevant patterns is very likely to improve patient outcomes and contribute to patient benefit80. For example, overcrowded emergency departments present a serious issue worldwide and have a significant impact on patient outcomes. From a perspective of harm avoidance, using LLMs with triage notes could lead to reduced length of stay and a more efficient utilization of time in the waiting room52.
All authors note, however, that such applications might also be problematic and require close human oversight39,44,51,80. Although LLMs might be able to reveal connections between disparate knowledge40, generating inaccurate information would have severe negative consequences44,74. This could lead to direct harm to patients or provide clinicians with false and dangerous justifications and rationales for their decisions74. These problems are tightly connected to inherent biases in LLMs, their tendency to “hallucinate” and their lack of transparency52. The term “hallucination” refers to an LLM generating plausible and often confident statements that are factually incorrect in the sense of not being grounded in the data85. In addition, uncertainties are increased by the use of unstructured data. Medical notes often differ from the data on which pretrained models were trained, which makes it difficult to predict the accuracy of outputs when such data are used in prompts or for fine-tuning LLMs52. The interpretability of results and recommendations introduces additional complexity and sources of potential harm52. Currie notes that despite such difficulties, the use of LLMs proceeds largely in the absence of guidelines, recommendations and control. The outcome, hence, ultimately depends on clinicians’ ability to interpret findings and identify inaccurate information80.
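To make the oversight requirement concrete, the following is a minimal sketch, assuming a hypothetical call_llm client, of how an LLM-generated triage suggestion could be handled strictly as a draft awaiting clinician confirmation; it is an illustration under stated assumptions, not an implementation drawn from the reviewed literature.

```python
# Minimal sketch of a human-oversight gate for LLM-assisted triage.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# no specific vendor API is implied.
from dataclasses import dataclass


@dataclass
class TriageSuggestion:
    acuity: str          # e.g. an acuity label proposed by the model
    rationale: str       # model-generated justification
    needs_review: bool   # always True: the suggestion is a draft, not a decision


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual client."""
    raise NotImplementedError


def suggest_triage(triage_note: str) -> TriageSuggestion:
    prompt = (
        "You are assisting an emergency department clinician.\n"
        "Given the triage note below, suggest an acuity level and explain "
        "which findings support it. If information is missing or ambiguous, "
        "say so explicitly instead of guessing.\n\n"
        f"Triage note:\n{triage_note}"
    )
    raw = call_llm(prompt)
    # The output is never acted on automatically: it is returned as a draft
    # that a clinician must confirm, in line with the oversight requirement
    # stressed in the literature.
    return TriageSuggestion(acuity="UNCONFIRMED", rationale=raw, needs_review=True)
```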
In patient consultation and communication, LLMs can offer a novel approach to patient-provider interaction by facilitating informational exchange and bridging gaps between clinical and preclinical settings, such as self-management measures or community aids8. This includes easing the transition between settings by removing barriers to communication44,60,80,83 or removing barriers in the clinical workflow to facilitate timely and efficient support. It is suggested that LLMs can collect information from patients or provide additional information, enabling well-informed decisions and increasing patient satisfaction56,60,80. Provision of language translation and simplification of medical jargon may allow patients to become more engaged in the process and enhance patient-provider communication80,83. However, it remains unclear in our dataset what such applications would look like in practice, specifically where, when and how LLMs could actually be integrated.
These suggestions necessitate consideration of ethically relevant boundaries regarding the protection of patient data and safety36,60,77,83, potentially unjust disparities36,60,83, and the broader dimensions of care, such as the therapeutic relationship36,59,61,64,77. Robust measures to avoid incorrect information in the technological mediation of communication and the need to strike a balance with “the human touch” of care60 are stressed. With regard to the former, Buzzaccarini et al. argue for robust expert oversight. Regarding the latter, Li et al. note a potential shift in power dynamics between patients and providers, in which providers might lose their authoritative position and might be seen as less knowledgeable64. Others fear a loss of personal care that should be avoided36,61,77 and a lack of contextual understanding of individual health challenges42,77. Open communication and consent to the technical mediation of patient-provider communication are required to promote trust but might be difficult to achieve69,78.
Many studies in our dataset discuss the possible use of LLMs for diagnosis8,36,39,44,59,61,66,67,74,75,78,80. It is suggested that the LLMs’ ability to analyze large amounts of unstructured data provides pathways to timely, efficient and more accurate diagnosis to the benefit of patients35,36,67,75,78. It might also enable the discovery of hidden patterns39 and reduce healthcare costs36,49.
An ethical problem emerges with potentially negative effects on patient outcomes due to biases in the training data36,39,41,74,75,78, especially as the lack of diverse datasets risks the underrepresentation of marginalized or vulnerable groups. Biased models may result in unfair treatment of disadvantaged groups, leading to disparities in access, exacerbating existing inequalities, or harming persons through selective accuracy41. Based on an experimental study setup, Yeung et al. provide an instructive example showing that ChatGPT and Foresight NLP exhibit racial bias towards Black patients28. Issues of interpretability, hallucinations, and falsehood mimicry exacerbate these risks35,36,44,74. With regard to transparency, two sources suggest that LLM-supported diagnoses hamper the process of providing adequate justification due to their opacity36,74. This is understood to threaten the authoritative position of professionals, leaving them at risk of not being able to provide a rationale for a diagnosis35, and might lead to an erosion of trust between both parties. This is in line with others noting that LLMs are not able to replicate the process of clinical reasoning in general and, hence, fail to comprehend the complexity of the process44,59,75. Based on the principle of harm avoidance, it is an important requirement to subject all generated output to clinical validation as well as to develop “ethical and legal systems” to mitigate these problems36,39,59.
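To make the notion of selective accuracy tangible, the sketch below compares how often LLM-generated diagnoses agree with a reference standard separately for each patient group; the records are fabricated placeholders and the procedure is a simplified assumption, not an audit reported in the reviewed studies.

```python
# Illustrative sketch of a "selective accuracy" audit: comparing how often
# LLM-generated diagnoses match a reference standard across patient groups.
# The records below are fabricated placeholders, not data from the review.
from collections import defaultdict

records = [
    {"group": "A", "llm_diagnosis": "asthma",    "reference": "asthma"},
    {"group": "A", "llm_diagnosis": "pneumonia", "reference": "pneumonia"},
    {"group": "B", "llm_diagnosis": "asthma",    "reference": "copd"},
    {"group": "B", "llm_diagnosis": "pneumonia", "reference": "pneumonia"},
]

hits, totals = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["group"]] += 1
    hits[r["group"]] += int(r["llm_diagnosis"] == r["reference"])

for group in totals:
    accuracy = hits[group] / totals[group]
    print(f"group {group}: accuracy {accuracy:.2f} (n={totals[group]})")
# Large gaps between groups would indicate the selective accuracy problem
# described above and call for closer scrutiny of the model and its training data.
```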
It needs to be noted, however, that the technically unaided process of diagnosis is also known to be subjective and prone to error67. This implies that an ethical evaluation should be carried out in terms of relative reliability and effectiveness compared to existing alternatives. Whether and under what circumstances LLM-supported diagnosis would compare favorably is a question that is not addressed in the dataset.
Six studies in our dataset highlight the use of LLMs to provide personalized recommendations for treatment regimens or to support clinicians in treatment decisions based on electronic patient information or history58,60,61,66,67,80, offering clinicians and patients a quick and reliable course of action. However, as with diagnostic applications, biases and the perpetuation of existing stereotypes and disparities are a constantly discussed theme60,61,67. Ferrara also cautions that LLMs will likely prioritize certain types of treatments or interventions over others, disproportionately benefiting certain groups and disadvantaging others41.
Additionally, it is highlighted that processing patient data raises ethical questions regarding confidentiality, privacy, and data security58,60,61,66,67. This especially applies to commercial and publicly available models such as ChatGPT. Inaccuracies in potential treatment recommendations are also noted as a concerning source of harm58,60,61,66,67. In a broader context, several authors suggest that for some LLMs, the absence of internet access, insufficient domain-specific data, limited access to treatment guidelines, lack of knowledge about local or regional characteristics of the healthcare system, and outdated research significantly heighten the risk of inaccurate recommendations24,37,38,40,47,55.
Patient support applications
Almost all authors concerned with patient-facing applications highlight the benefits of rapid and timely information access that users experience with state-of-the-art LLMs. Kavian et al. compare patients’ use of chatbots with the shifts that accompanied the emergence of the internet as a patient information source69. Such access can improve laypersons’ health literacy by providing needs-oriented access to comprehensible medical information68, which is regarded as an important precondition of autonomy, allowing more independent health-related decisions8,74. In their work on the use of ChatGPT 4 in overcoming language barriers, Yeo et al. highlight an additional benefit: LLMs could provide cross-lingual translation and thus contribute to reducing healthcare and racial disparities56.
Regarding ethical concerns and risks, biases are seen as a significant source of harm8,39,74,75. The literature also highlights a crucial difference in the ethical acceptability of patient support applications, with a more critical stance when LLMs are used by laypersons rather than by health professionals28,53. However, ethical acceptability varies across fields; for instance, studies in otolaryngology and infectious disease find that ChatGPT’s responses to patients lack detail but are not harmful53, whereas studies in pharmacology and mental health indicate greater potential risks67,68.
LLMs can offer laypersons personalized guidance, such as lifestyle adjustments during illness80, self-assessment of symptoms61,63, self-triaging, and emergency management steps8,57. Although current systems seem to perform well and generate compelling responses8,47,63, a general lack of situational awareness is noted as a common problem that might lead to severe harm8,61,63. Situational awareness refers to the ability to generate responses based on contextual criteria such as a person’s circumstances, medical history or social situation. The inability of most current LLMs to seek clarification by asking questions and their lack of sensitivity to query variations can lead to imprecise answers45,63. For instance, research by Knebel et al. on self-triaging in ophthalmologic emergencies indicates that ChatGPT’s responses cannot reliably prioritize urgency, which reduces their usefulness45.
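As an illustration of how the missing situational awareness might be partly addressed at the prompt level, the sketch below instructs a model to ask clarifying questions before offering any self-triage advice; the prompt wording and the call_llm client it would be passed to are assumptions, not measures evaluated in the reviewed studies.

```python
# Sketch of a prompt pattern intended to mitigate the lack of situational
# awareness noted above: the model is instructed to ask clarifying questions
# before giving any self-triage advice. `call_llm` is again a hypothetical
# stand-in for a chat-completion client.
def build_self_triage_prompt(user_message: str) -> str:
    return (
        "You are a cautious health-information assistant, not a clinician.\n"
        "Before giving any advice, first list the clarifying questions you "
        "would need answered (e.g. symptom duration, age, relevant history).\n"
        "If the situation could be an emergency, tell the user to contact "
        "emergency services instead of answering.\n\n"
        f"User message:\n{user_message}"
    )


# Example usage, assuming some call_llm implementation exists:
# response = call_llm(build_self_triage_prompt(
#     "My eye suddenly hurts and my vision is blurry"))
```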
Support of health professionals and researchers
LLMs could automate administrative or documentation tasks such as medical reporting80 or the summarization of patient interactions8, including the automatic population of forms or discharge summaries. The consensus is that LLMs could streamline clinical workflows8,36,43,51,52,60,68,74,80,81,83, offering time savings for health professionals currently burdened with extensive administrative duties68,83. By automating these repetitive tasks, professionals could dedicate more time to high-quality medical tasks83. Crucially, such applications would require the large-scale integration of LLMs into existing clinical data systems49.
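A hedged sketch of how such documentation support could be embedded in a workflow follows: the model drafts structured discharge-summary fields, simple checks flag missing entries, and the draft is routed for clinician sign-off rather than filed automatically. The field names and the call_llm parameter are illustrative assumptions.

```python
# Sketch of wrapping automatic discharge-summary drafting with basic checks
# and mandatory clinician sign-off. Field names are illustrative; `call_llm`
# is any function that returns the model's text output for a prompt.
import json

REQUIRED_FIELDS = ["diagnosis", "medications", "follow_up"]


def draft_discharge_summary(encounter_notes: str, call_llm) -> dict:
    prompt = (
        "Draft a discharge summary as JSON with the keys "
        f"{REQUIRED_FIELDS}, using only information present in the notes. "
        "Use null for anything not documented.\n\n"
        f"Notes:\n{encounter_notes}"
    )
    draft = json.loads(call_llm(prompt))
    # Flag gaps instead of letting the model invent content for them.
    missing = [key for key in REQUIRED_FIELDS if not draft.get(key)]
    return {
        "draft": draft,
        "missing_fields": missing,
        # The draft is never filed directly; a clinician reviews and signs off.
        "status": "awaiting_clinician_signoff",
    }
```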
In health research, LLMs are suggested to support the summarization of text, evidence or data54,64,82, identify research targets8,61,72,83, design experiments or studies72,83, facilitate knowledge sharing between collaborators37,70,80, and communicate results74. This highlights the potential for accelerating research46,79 and relieving researchers of workload8,40,64,74,75,83, leading to more efficient research workflows and allowing researchers to spend less time on burdensome routine work8,80. According to certain authors, this could involve condensing crucial aspects of their work, such as crafting digestible research documents for ethics reviews or consent forms82. However, LLMs’ capacities are also critically examined, with Tang et al. emphasizing ChatGPT’s tendency to produce attribution and misinterpretation errors, potentially distorting original source information. This echoes concerns over interpretability, reproducibility, uncertainty handling, and transparency54,74.
Some authors fear that using LLMs could compromise research integrity by disrupting traditional trust factors such as source traceability, factual consistency, and process transparency24. Additionally, concerns about overreliance and deskilling are raised, as LLMs might diminish researchers’ skills and overly shape research outcomes46. Given that using such technologies inevitably introduces biases and distortions into the research process, Page et al. argue that researchers must remain vigilant against such undue influence, advocating strict human oversight and revalidation of outputs70.
Public health perspectives
The dataset encompasses studies that explore the systemic implications of LLMs, especially from a public health perspective50,61,75. This includes using LLMs in public health campaigns, for monitoring news and social media for signs of disease outbreaks61, and for targeted communication strategies50. Additionally, research examines the potential for improving health literacy or access to health information, especially in low-resource settings. Access to health information through LLMs can be maintained free of charge or at very low cost for laypersons55. In the case of mental health, low- and middle-income countries in particular might benefit71. These countries often face a substantial treatment gap driven by a shortage of professionals or inequitable resource distribution. Using LLMs could mitigate accessibility and affordability issues, potentially offering a more favorable alternative to the current lack of access71.
However, a number of authors raise doubts about overly positive expectations. Schmälzle & Wilcox highlight the risks of dual use of LLMs50. While LLMs might further equal access to information, malicious actors can use them, and appear already to be doing so, to spread false information and devise health messages at an unprecedented scale that is harmful to societies50,51,75. De Angelis et al. take this concern one step further, presenting the concept of an AI-driven infodemic46, in which the overwhelming spread of imprecise, unclear, or false information leads to disorientation and potentially harmful behavior among recipients. Health authorities have often seen AI technologies as solutions to information overload. However, the authors caution that an AI-driven infodemic could exacerbate future health threats. While infodemic issues in social media and grey literature are noted, AI-driven infodemics could also inundate scientific journals with low-quality, excessively produced content46.
The commercial nature of most current LLM systems presents another critical consideration. The profit-driven nature of the field can lead to concentrations of power among a limited number of companies and a lack of transparency. This economic model, as highlighted by several studies, can have negative downstream effects on accessibility and affordability24,36,43. Developing, using, or refining models can be expensive, limiting accessibility and customization for marginalized communities. Power concentration also means that pricing control lies with LLM companies, with revenues predominantly directed towards them44. These questions are also mirrored in the selection of training data and knowledge bases24, which typically encompass knowledge from well-funded, English-speaking countries and thus significantly underrepresent knowledge from other regions. This could exacerbate health disparities by reinforcing biases rather than alleviating them.