The Rise of Hostility in Podcasts

A Five-Year Analysis of Expression Targeting Select Groups

MONTREAL | NOV 2025

“When hate manifests in our communities, it is a threat to public safety, democracy, and human rights. Hate divides us [...] It silences people, it shuts down debate and hinders our democracy. [...] Without accountability, hateful behaviour that has been appallingly normalized online is now more frighteningly common in person.”

- “Canada Must Act Now to Address Hate”, Canadian Human Rights Commission (Nov 2023)

Introduction

This study provides a longitudinal assessment of how hostile rhetoric toward selected groups has evolved in popular North American political and news podcasts.

Despite their prominence, podcasts remain comparatively understudied in media research. The medium presents substantial technical obstacles to systematic analysis, from the challenges of ingestion and transcription at scale, to the complexity of labeling and categorizing unstructured conversational content. These barriers have limited the kind of documentation and scrutiny that other media formats receive, making it difficult to assess patterns of rhetoric and discourse over time.

Of particular concern is how specific groups are characterized within this space. When negative portrayals or hostile rhetoric toward particular groups are repeatedly encountered through trusted voices over extended listening periods, they may contribute to broader dynamics of polarization, stereotyping, and intergroup tension beyond the medium itself.

To make this assessment possible, we introduce a structured framework and supporting analytical systems that enable large-scale, longitudinal evaluation of how podcasts characterize specific groups. We apply this framework to five years of leading North American political and news programs, revealing how the tone toward selected groups has evolved over time.

Podcast Sample

Our sample includes 140 leading North American podcasts, identified through Apple and Spotify rankings and chosen for their strong focus on political and news content. We collected all available episodes released from January 2020 to August 2025.

The timeline on the right illustrates the dataset, with each row corresponding to a podcast and columns reflecting time periods.

Darker gray squares represent periods where we collected and analyzed podcast episodes.

Transcript Segmentation

Episodes were collected directly from official RSS feeds and transcribed using a speech-to-text model.

Transcripts were divided into overlapping segments of approximately 200 words, corresponding to roughly 75 seconds of audio. This segment length typically captures a full exchange or idea while remaining concise enough for practical coding and analysis.

This approach comes at the cost of losing some long-range connections. A speaker might, for instance, introduce “those people” early in an episode, clarify the referent much later, and make hostile claims near the end, with each element falling into separate segments. Disclaimers (“I’m just asking questions”), callbacks, gradual escalation, and coded language may also occur too far apart to appear together. We accept this trade-off because it enables automated analysis on a scale that would be impossible through manual coding.

Although the method may miss long-range references, it does so consistently across groups and time periods. Later analysis shows that more than 80% of the hostile speech acts identified in this study occur entirely within a single 75-second window.

Podcast Hosts

To distinguish whether a statement was made by the host or another participant, we created a database of podcast hosts and their characteristic voice patterns. These profiles enable us to automatically peform speaker attribution across all our podcast segments.

Defining Target Groups

Hostile expression requires two components: specific linguistic patterns and a target. In this study, targets are defined as identifiable groups: entire national, ethnic, or religious populations defined by their shared identity.

Criticism of governments or military forces does not constitute hostile expression toward our defined target groups. We analyze only instances where speakers make claims about the collective characteristics, behaviors, or nature of groups as a whole, including major subsets within those groups (e.g. Sephardic Jews, Catholic Christians, etc.).

Our target groups were selected according to three criteria: frequent mentions in the podcasts we analyzed, direct relevance to Canada, and prominence in major events during the study period.

COVID-19’s origins in China and rising US–China competition
The war in Gaza
Rhetoric on Canadian annexation and US-Canada tariff disputes
The creation of the US Task Force to Eradicate Anti-Christian Bias

Through this process, we identified seven groups for analysis: Chinese, Jewish, Israeli, Palestinian, Muslim, Christian, and Canadian people. Each of these groups received a custom lexicon, validation process, and longitudinal tracking.

While we acknowledge that hostile rhetoric in these podcasts extends beyond the groups analyzed, our study focused on this defined subset.

Framework

Our study begins with compiling instances of hostile expression directed at our target groups. This database is designed to answer the question: Did Podcast X express hostility toward group Y during a given month?

To do this, we divide the five-year study into monthly time periods. We then assess each podcast's general attitude toward a group by identifying clear and representative statements made about that group. For a statement to qualify for analysis, it must meet two requirements: one of our target groups must be unambiguously identified as the subject of discussion, and hostile speech must be clearly expressed about that group.

Hostility Snapshot
(Qualifying statements about the target group in a calendar month)

All qualifying statements are then grouped by podcast, month, and target group to form what we refer to as a hostility snapshot. Each snapshot represents the collection of detected hostile remarks directed at a group during that period. To ensure that these snapshots reflect how a podcast repeatedly spoke about a group over the course of a month, rather than a one-off comment, we require multiple qualifying statements per snapshot.

Darker shades indicate greater hostility.

By collecting hostility snapshots month by month and organizing them into a timeline, we can clearly see how each podcast’s attitudes toward different target groups have changed and, in some cases, intensified during the study period.

Characterizing Hostile Expression

We assess hostile expression through a set of hallmarks. Each hallmark is identified based on the language used in the segment, without requiring any judgment about whether the underlying claims are true. Focusing on the linguistic patterns allows for more consistent judgments, even when the surrounding narrative is contested.

We developed the hallmarks through a bottom-up process informed by established frameworks, including the Dangerous Speech Project, genocide risk indicators, the IHRA working definition, and the UN Rabat Plan of Action.

We iteratively reviewed statements expressing negativity toward our target groups and tested which indicators best captured observed patterns. This process yielded a set of hallmarks that cover the vast majority of hostile expression in our corpus.

Colors indicate which hostility type was most prevalent during each snapshot.

Severity Assessment

Each hallmark is coded on a four-point severity scale (0–3), where higher levels correspond to more extreme forms of hostile rhetoric. For example, Threat Construction ranges from (1) vague unease about a group’s influence, to (2) claims that the group poses an ongoing danger, to (3) depictions of the group as an existential threat. A single segment may exhibit multiple hallmarks, each at different severity levels.

The complete severity scale definitions are available here.

Once coded, we combine a segment’s hallmark–severity codes into an overall severity score. Each hallmark at each severity level is assigned a weight that captures how strongly that pattern pushes the overall hostility score upward when it is present.

We then aggregate all segment-level severity scores for each podcast–group–month combination and take the median value. The median captures the general tenor of rhetoric in that period rather than isolated extreme outbursts.

Colors indicate the most severe level of hostility expressed toward any group during each monthly snapshot, with darker shades representing higher severity.

Qualitative Synthesis

All segments identified as containing hostility within a podcast–month–group snapshot then undergo a processing stage in which the hostile speech acts are isolated, stripped of surrounding commentary, and restated in a clear, standardized form. These extractions are subsequently merged to remove redundant or overlapping themes. The resulting summary provides a concise record of hostility for that period, making it easier to identify recurring themes and track patterns across the dataset.

Hovering over a snapshot displays its qualitative summary.

Detecting Hostile Expression at Scale

Analyzing thousands of podcast episodes requires an automated approach. We employ a two-phase system to detect hostile statements across our corpus.

In the first phase, we surface explicit hostility: statements where speakers directly name a target group and express clear negativity toward them. Custom lexicons identify segments mentioning these groups, and a trained classifier then evaluates whether the content is genuinely hostile or simply a neutral reference.

The second phase expands our detection to capture implicit hostility: rhetoric that doesn't name groups directly but conveys hostility through coded terms, euphemisms, or indirect references. By building on the explicit statements found in phase one, we can identify similar patterns of language even when groups aren't explicitly mentioned.

Together, these phases produce the qualifying statements that characterize each podcast's stance toward different groups over time, enabling the longitudinal analysis described in our framework.

Identifying Group References Through Custom Lexicons

We identify moments when speakers explicitly refer to specific communities using custom lexicons developed for each target group. To create these lexicons, we began with neutral group names and incorporated additional pejorative and politically charged terms drawn from resources such as Hatebase and Weaponized Word. These initial term lists were used to locate potential podcast segments, which we then reviewed to identify additional relevant vocabulary.

This discovery process was necessary because most available resources were developed for written online text. The distinct patterns of spoken discourse, combined with artifacts introduced by speech-to-text conversion, revealed unique lexical patterns and variants not captured in text-based sources. Through iterative manual review and filtering, we refined each lexicon to reliably capture how groups are referenced in spoken discourse.

Throughout this process, we deliberately constructed lexicons that capture references to people, rather than to states, governments, militaries, or prominent individuals. Terms such as "Israel" or "China" were generally excluded, since they more commonly describe state actions.

It is important to note that this approach identifies explicit mentions of groups but does not capture coded, euphemistic, or implicit references. These cases are addressed in a later section of the report. While our lexicons capture only overt language, their consistent application across the corpus provides a reliable basis for comparing explicit references.

Colors show which group was most discussed during that time period.

Screening for Hostile Content

Among all transcript segments that reference the target groups, the majority are non-hostile. These segments may include neutral descriptions, positive portrayals, or discussions of current events that do not employ hostile framing.

To identify segments that contain hostile expression, we used a text classification model trained on our hostility framework. The model evaluates each segment strictly on the basis of its lexical content, without reference to the podcast, speaker, or broader episode context. This approach reduces the risk of bias linked to particular hosts or shows, but it also means the model is optimized for detecting explicit textual hostility and may overlook more context-dependent forms of expression.

When a segment is identified as containing hostile expression, we revisit the transcript and widen the surrounding text window to include more of what came immediately before and after. This added context can disambiguate language and referents, clarify speaker intent, and reduce the risk of misclassification.

Rather than relying on a single classification, each potentially hostile segment undergoes a series of analyses that apply the hostility framework from complementary perspectives. These analyses assess whether hostility hallmarks are present, identify the exact portion of the transcript in which the expression occurs, flag ambiguities that may affect judgment, assign severity scores to the hallmarks identified, evaluate how hostile claims are attributed, assess whether those claims are endorsed by the speaker, and generate a summary of the hostile content grounded in specific excerpts. At any point in this process, the model may determine that no hostility is present. Only segments that register as hostile across all components of the workflow are carried forward.

Furthermore, hostile language meets our inclusion threshold only when it appears more than once, either at separate points within an episode or across multiple episodes. Hostile remarks confined to a single brief exchange do not qualify. This requirement ensures that our analysis captures repeated hostility rather than an isolated instance.

A summary of the inclusion criteria is available here.

These confirmed hostile statements serve two functions: they contribute to the evidence base for our longitudinal analysis, and they provide the foundation for discovering implicit forms of hostile expression in the next phase.

Colors indicate which group was most discussed in hostile terms during that period.

Detecting Implicit Group References

The lexicon-based approach reliably identifies explicit references to target groups, but hostile rhetoric often operates more subtly. Speakers may use euphemisms, coded language, or rhetorical framing that conveys hostility without naming groups directly. A segment discussing “our neighbors to the north” may target Canadians without using any terms from our lexicons. Similarly, invoking “those from across the pond” may refer to British communities without naming them directly.

To surface these implicit cases, we use the confirmed hostile statements from phase one as templates for finding similar rhetoric elsewhere. We extract the core hostile expressions from these statements and use a large language model to generate linguistic variations: paraphrases, different phrasings, and alternative framings that convey the same hostile idea. For instance, the concept "Muslims don't belong in Western countries" might appear as "Islam is incompatible with Western civilization," "we need to keep Muslims out," or "Islamic values threaten our way of life." While these statements use different vocabulary, they express the same underlying hostile idea.

We then use semantic similarity to search the entire corpus for segments that resemble these hostile patterns, even when they don't mention groups explicitly. Segments with high similarity scores are retrieved and evaluated by the same classification model used in phase one to confirm they actually express hostility rather than just superficial resemblance.

This approach has limitations. It depends on the quality of our initial explicit hostile statements, assumes that implicit and explicit hostility share similar rhetorical structures, and analyzes only the most similar segments rather than the entire corpus for efficiency. Despite these constraints, semantic similarity detection substantially expands our ability to identify hostile expression beyond keyword matching, capturing rhetorical patterns that would otherwise remain invisible.

Colors indicate which group was most discussed in hostile terms through implicit language during that time period.

Validation

To assess the reliability of our automated hostility detection system, we recruited three research assistants and trained them on the hostility coding rubric. Each analyst independently reviewed a stratified sample of 223 segments covering all time periods, target groups, and podcasts in the corpus. For each segment, coders applied the framework, indicating which hostility hallmarks were present and assigning each detected hallmark a severity score on a four-point scale.

The hallmark framework provides a structured basis for identifying and assessing hostility. Each hallmark captures a distinct rhetorical pattern, but in this analysis they are treated collectively rather than as separate outcomes. The goal is to determine whether a segment expresses hostility at all. For every human-coded segment, we assign a binary label of hostility present, marked as 1 if any hallmark is rated at level 1 or higher and 0 otherwise. The automated classifier’s outputs are processed in the same way, enabling direct comparison between human and model evaluations.

Human Coding Reliability

We began by examining how consistently human coders agreed with each other when applying the hostility rubric. Agreement was measured using Cohen’s κ, a statistic that accounts for the amount of agreement that could occur by chance. Across all segments that had been reviewed by more than one person, κ values ranged from 0.33 to 0.42, which corresponds to low to moderate consistency. Because only a small number of segments were coded by all three assistants, we rely on these pairwise comparisons as our main indicator of baseline human reliability.

After the initial coding was completed, cases where coders disagreed was reviewed by experts for closer examination. The goal was to identify the sources of these differences in judgment, such as unclear definitions, ambiguous phrasing in the transcripts, or inconsistent use of the rubric. Most disagreements stemmed from variation in how coders interpreted and applied the rubric rather than issues with the framework itself. When the relevant hallmark definitions were revisited to assess whether the criteria for hostility were met, consensus was typically reached.

Coding natural speech takes practice, especially when speakers use indirect phrasing, emotional appeals, or deliberate misdirection. A more structured training process that included guided exercises, feedback on coding decisions, examples of correct and incorrect applications of the rubric, and discussion of borderline cases would likely have strengthened agreement among coders without requiring any changes to the rubric itself.

Evaluation on Consensus-Labeled Segments

To evaluate how accurately the automated classifier detects hostility, we focused on segments where human coders independently reached the same conclusion. These consensus cases likely represent instances where the rubric was applied more consistently and the determination of presence or absence of hostility was relatively unambiguous. Although consensus does not guarantee correctness, it provides a reasonable, higher-confidence reference set for comparing the model’s classifications with human judgments.

To construct the reference set, we examined all segments that had been independently reviewed by at least two human coders. For each segment, we assigned a label based on the majority decision, meaning whichever judgment, hostile or non-hostile, was chosen by most coders. This process yielded 155 segments (out of 223) with clear agreement among coders, including 69 labeled as hostile and 86 as non-hostile.

Using the human consensus labels as a reference, the classifier correctly identified 46 hostile segments (true positives) and 84 non-hostile segments (true negatives), while producing 2 false positives and 23 false negatives. These results correspond to a precision of 95.8%, and a recall of 66.7%. The agreement between the classifier and the human consensus, measured using Cohen’s κ, was 0.663.

	Sample Size	Precision	Recall
Classifier	155	95.8%	66.7%

While the Cohen’s κ score provides a broad indication of agreement, it does not fully capture the model’s performance profile. The classifier’s precision is very high, meaning it rarely labels a segment as hostile when human coders do not. Most disagreements result from missed detections, reflecting a more cautious approach than that of the human coders. In practice, the system is highly reliable when it identifies hostility but fails to detect roughly one third of the segments that humans classify as hostile.

This conservative configuration is intentional. It reduces the likelihood of incorrectly labeling non-hostile content as hostile, which is essential for producing dependable hostility snapshots. At the same time, it means that our corpus-level analyses likely understate the overall prevalence of hostile expression in the podcast corpus.

When false positives do occur, they typically involve podcast hosts quoting, paraphrasing, or otherwise invoking hostile statements without expressing endorsement. In these instances, the surface language may meet the formal criteria of the hostility rubric even though the speaker’s intent is descriptive, critical, or contextual rather than expressive. Distinguishing between endorsement and quotation is challenging for automated systems, particularly given limited contextual signals and the conversational formats in which speakers frequently recount others’ views, react to prior statements, or simulate opposing arguments. Although infrequent overall, such cases account for a substantial share of the classifier’s false positives and reflect a fundamental limitation of text-based hostility detection.

Ambiguous and Disagreement Segments

Human coders disagreed on approximately one third of the sample (68 out of 223 segments) when determining whether a passage contained hostility. It is informative to examine how the automated classifier handled these disputed cases. Among the 68 segments, the classifier labeled 50 as non-hostile and 18 as hostile, again reflecting a generally cautious orientation. In most instances, subsequent expert review concluded that the classifier’s judgments were consistent with a reasonable interpretation of the rubric.

These results further indicate that, while the human-generated labels provide a valuable benchmark, they should be regarded as a noisy reference rather than a definitive ground truth.

Year-by-Year Performance

To test whether the classifier’s performance remained stable over time, we evaluated results separately by publication year using the consensus-labeled segments. For each year, we report precision (the share of model-identified hostile segments confirmed by human coders) and recall (the share of human-identified hostile segments detected by the model). These measures capture how reliably the classifier detects hostility and how cautious it is in assigning the label.

Year	Sample Size	Precision	Recall
2025	45	100.0%	66.7%
2024	44	92.9%	72.2%
2023	30	90.0%	60.0%
2022	24	100.0%	66.7%
2021	10	100.0%	57.1%
2020	2	100.0%	100.0%

Across all years, precision remains consistently high, typically above 90 percent, indicating that the classifier rarely labels non-hostile content as hostile. Recall fluctuates between roughly 57 and 72 percent, suggesting that while the system sometimes misses hostility identified by human coders, it does so without any systematic trend over time. Although earlier years contain relatively few samples, the overall pattern suggests stable calibration and no evidence of temporal drift that would bias longitudinal patterns.

Podcast-by-Podcast Performance

We also examined performance across podcasts using the same evaluation set. Because most podcasts contributed only a small number of consensus-labeled segments, these comparisons should be interpreted cautiously. The results below report precision and recall for podcasts with the largest sample sizes.

Podcast	Sample Size	Precision	Recall
Mark Levin Podcast	21	100.0%	71.4%
The Young Turks	18	88.9%	100.0%
The Ben Shapiro Show	11	100.0%	80.0%
The Alex Jones Show – Infowars.com	9	100.0%	50.0%
Piers Morgan Uncensored	8	100.0%	100.0%
The Victor Davis Hanson Show	7	100.0%	100.0%
...	...	...	...

Given these limited sample sizes, podcast-level differences are best viewed as descriptive checks for gross anomalies rather than as precise estimates of podcast-specific performance.

Target Group Performance

Finally, we evaluated classifier performance by the group targeted in each segment. This comparison tests whether detection accuracy differs across groups commonly referenced in hostile speech. Because the number of annotated excerpts per group is limited, these results are exploratory.

Target Group	Sample Size	Precision	Recall
Palestinians	20	95.0%	100.0%
Israelis	14	92.9%	100.0%
Jews	6	100.0%	100.0%
...	...	...	...

These consistent results provide no indication that classifier performance differs systematically across groups, although small sample sizes limit the strength of the inferences that can be drawn from these subgroup comparisons.

Database

The database below presents the cases of hostile expression that were surfaced by our detection pipeline across more than 70,000 hours of podcast content. Only segments that registered as hostile across every component of our deliberately conservative workflow are displayed. These results should be interpreted as a lower bound; the underlying source material contains substantially more hostile rhetoric than what appears in this visualization, and the absence of a detection should not be taken as evidence that no hostility occurred.

Use the filters below to explore hostility by target group, and click any snapshot to view the underlying source material.

Detections in this figure are generated by an automated classifier with an approximate 5% false positive rate.

Results

In the following graph, we show how hostility toward each target group evolved across the study period. Earlier sections provided podcast-specific snapshots, but here we pool all detected cases across every show to provide an overview of how attitudes shifted over time.

Each point's vertical position reflects the typical severity of hostile remarks detected in that month, derived from the levels assigned to the underlying hostility hallmarks. Higher points indicate that rhetoric in those periods was more extreme on average, while lower points indicate that the hostile language that did appear was generally milder. The size of each marker reflects how many hostile statements were detected. Larger markers indicate a greater volume of hostility, while smaller markers indicate fewer instances.

Breaks in the trend lines appear when too few cases are detected to calculate a reliable severity score. These gaps do not mean hostility did not occur, only that there was not enough qualifying evidence to assign a score in those months. Because all groups are evaluated using the same detection and scoring process, an unscored month indicates there were likely fewer hostile remarks toward those groups in that period under the framework used.

The following figure shifts from a pooled overview to a podcast level view. For a selected podcast, it shows which groups were targeted by hostile rhetoric and how the typical intensity changed over the study period. Because this view analyzes one podcast at a time, fewer hostile statements are available, and fewer cases meet the evidence threshold required for scoring, leading to more gaps in the trend lines. When a month is unscored, it does not mean hostility was absent, only that not enough qualifying statements were detected to generate a reliable severity score under the study framework.

Whereas the previous figures show month-by-month trends in the typical severity of hostile content for each group, the next figure aggregates all hostile remarks across the entire study period into a single severity score for each podcast. Podcasts are ordered along the horizontal axis from least to most severe, allowing direct comparison of how intensely different shows expressed hostility toward the selected group. A podcast’s absence from the figure should not be interpreted as evidence that hostility never occurred. Rather, it indicates there were not enough qualifying hostile remarks to generate a reliable score under the study’s criteria.

Case Study | War in Gaza

We draw on our curated database of hostility incidents to present a case study examining patterns of hostile rhetoric linked to the war in Gaza. We focus on the groups most directly affected by the conflict: Palestinians, Israelis, and their major religious affiliations. This case study analyzes overall trends in hostile rhetoric, key targets, and primary contributors, as well as salient inflection points in the data and their interaction with external developments.

Post-October 7 Surge

Hostile rhetoric targeting specific groups has risen sharply since October 2023, as of analysis conducted through July 2025. The data shows a clear break in continuity: before October 2023, hostile remarks toward Israeli, Palestinian, Jewish, or Muslim people appeared only sporadically. Following October 7, the frequency and severity increased substantially, closely tracking the conflict’s timeline and signaling heightened public tensions around Gaza-related issues.

Focusing on Muslim and Jewish communities, the pattern of hostile rhetoric shows a clear shift after January 2025 toward a sustained, linear increase, rather than episodic spikes. Earlier periods in the data showed sharp increases followed by cooling phases, reflecting a reactive, event-driven pattern tied to specific conflict incidents. After January 2025, the trajectory becomes steadier, with extended stretches of escalating severity. This change indicates a rising baseline of hostility, suggesting that rhetoric is becoming more persistent, rather than fluctuating in response to individual events.

In North American podcast vernacular, hostile commentary directed at the side of the conflict associated with Israel rarely targets Israelis as a national group. Instead, such remarks far more often invoke Jewish identifiers, reflecting a tendency in public discourse to blur Jewish and Israeli categories. When the focus shifts to geopolitical developments, antagonistic comments typically reference the state of Israel or the Israel Defense Forces rather than Israelis themselves.

Narrative Analysis

Beyond measuring how much hostility appears and how severe it is, we examined what specific messages are being repeated about each group. We analyzed all hostile statements to identify themes that appeared repeatedly across different podcasts and time periods. Statements that conveyed the same underlying message, even when worded differently, were grouped together and summarized into a single, representative description. The visualizations below show common characterizations and how they evolved over the course of the study.

Group Reference	The segment must unambiguously reference one of the designated target groups. Ambiguous or unclear referents do not qualify.
Hostile Expression	The segment must be flagged as hostile based on the text alone, without drawing on information about the speaker or the episode. This judgment must remain consistent across all analyses in our process.
Speaker Endorsement	The speaker must be expressing the hostile content as their own view, not quoting, reporting, or presenting it for critique.
Delivery Tone	The segment qualifies regardless of delivery style. A humorous, joking, sarcastic, or self-deprecating tone does not negate hostility.
Speaker Source	The segment qualifies regardless of who is speaking—host, co-hosts, guests, or other voices on the program.
Minimum Severity	The segment must meet a minimum severity threshold based on the type and intensity of hostile language. Segments scoring low relative to the overall distribution are excluded.
Repeated Occurrence	Hostility directed at a group must occur more than once, either at distinct points within a single episode or across multiple episodes. If limited to a single episode, intensity should be comparable to the treatment of other groups. Isolated instances do not qualify.
Outlier Removal	Each podcast is evaluated in the context of how it typically discusses a given group. Segments that are significantly more extreme than the usual tone toward that group are excluded.

Hostility Hallmark Severity Scales

How We Determine Hostility Weights

Inclusion Criteria for Hostile Expressions