Holly Herndon, PROTO, 2019. Photo: Boris Camaca.

Vocal Aesthetics, AI Imaginaries

Reconfiguring Smart Interfaces

Alex Borkowski, 29.10.2024, Article, Issue 01

Introduction

Smart technology seems to be gripped by something of a fervor for voice. Intelligent agents such as Amazon’s Alexa, Apple’s Siri, and Google Assistant, which respond to a variety of user queries and execute tasks within the broader network of the smart home, have become increasingly central to daily life as one of the most ubiquitous applications of natural language processing and synthesis. While voice assistants are inarguably popular, there is something curious about the rhetorical invocation of paradigm shifts and sensory revolutions—the aural poised to overcome the visual and haptic. Vocal interfaces have long been a part of the cultural imaginary, as evinced by the myriad conversational computers that appear in science fiction. Yet today’s voice assistants are posited, by virtue of their vocal nature as such, as entirely new and disruptive. Indeed, Amazon promises that “experiences designed around the human voice will fundamentally improve the way people use technology.”1 This framing of the turn to voice as novel and groundbreaking by tech companies, commentators, and researchers alike might be understood alongside the tendency in scholarly perspectives to situate sound as a radical modality that offers a challenge to dominant visual paradigms. Rather than situating the contemporary technoscape as one in transition from text-based to voice-based interaction, I suggest that the very promise of such a transition itself constitutes a vocal imaginary, one laden with ideological baggage regarding communication, agency, and the parameters of human subjectivity.

As daily experience is ever more permeated with interfaces, voice provides a mode to insert yet another point of contact with digital platforms without displacing existing ones. Users can engage with Amazon and Google aurally, while simultaneously typing and scrolling; thus, this hands-free interaction afforded by smart speakers is posited as a way to eliminate friction in,2 and introduce new channels for, consumers’ engagement with the digital marketplace. In this respect, as one among many interfaces, voice assistants might be understood in accordance with Alexander R. Galloway’s framing of the interface as a threshold device, a passage or point of transition, whose success is measured by its transparency: “for every moment of virtuosic immersion and connectivity … of inopacity, the threshold becomes one notch more invisible, one notch more inoperable.”3 Indeed, this logic is evident in claims that voice offers a more organic, transparent, and accessible mode of engaging with digital technology, as voice-operated devices are posited as qualitatively distinct from their visual counterparts—even more inoperable, achieving more by doing less, by virtue of their recourse to spoken language as an inherently “natural user interface.”4 Embedded in such propositions are assumptions regarding the innate or universal traits of voice, many of which are predicated upon a belief in voice’s inherent “naturalness” or proper relationship to a human subject.5 Potent metaphors abound regarding finding and giving voice, such that it is granted exceptional status as a marker of agency, self-possession, and unmediated self-presence. Voice is necessarily personal—stable, essential, and singular—as well as necessary to one’s civic personhood; it is “the ticket to entrance into the human community.”6 Indeed, Amazon’s rhetorical invocation of “the human voice” is striking, as one never hears technologies that involve manual typing or swiping described as experiences designed around “the human touch.”

Voice assistants might, therefore, be understood as a sociotechnical nexus in which this vocal imaginary is entwined with what Claudia Schmuckli calls an “AI imaginary”—“the trove of images and symbols derived from the metaphors that guide and describe the design, operations, and applications of AI.”7 Such metaphors flourish precisely because of the slipperiness of the term “AI” itself;8 it proliferates through cultural conversations regarding the power and possibilities of artificial intelligence—both utopian and dystopian—that rarely ask what precisely it is.9 Meredith Broussard writes regarding popular representations and perceptions of AI that “it’s easy to confuse what we imagine and what is real.”10 While users would certainly be mistaken in perceiving voice assistants’ conversational abilities as genuine understanding, as Broussard notes,11 such imaginings are “real” in the sense that they are operative in structuring day-to-day engagements with such technologies.12 I therefore suggest that it is precisely the looseness of the term “AI” in its popular usage that facilitates its ideological power. It is the overall absence of a simple definition of AI that makes it possible for convictions about the innateness, intimacy, and authenticity of voice to become so readily wedded to smart technologies as a testament to their purported impartiality, efficiency, and accuracy. Given the copious evidence attesting to the coalescence of machine learning with surveillance capitalism, the biases embedded in algorithmic processes, and the ways that such seemingly immaterial tech is built upon natural resource extraction and exploited human labor, this paper proposes that critical thinking about AI might be advanced and elaborated by simultaneously thinking critically about voice—how are these imaginaries mobilized in harmony with one another and to what ends?

In what follows, I interrogate a synchronicity between theories of voice and an apocryphal origin story of AI, locating their entangled conceptual roots in the late eighteenth century. Specifically, the pairing of speaking and “thinking” automata prefigures the ways that vocal imaginaries continue to be mobilized as a means to lend credibility to purportedly intelligent machines today. Yet this historical antecedent simultaneously begins to unravel the humanist paradigms on which such an imaginary relies. Attending to voice in its affective, extra-communicative, and even uncanny dimensions opens up a counter-discourse that disrupts the purported seamlessness afforded by vocal interfaces. This obverse framework, which emerges from and remains embedded in synthesized voices, provides the theoretical ground on which I examine Holly Herndon’s electropop album PROTO (2019). Herndon’s work undermines the claim that voice comprises the most natural, and therefore most invisible, interface, instead deploying voice as a mode of digging into, rather than glossing over, the complexities, glitches, and limitations of machine learning.

Holly Herndon, PROTO, 2019. Photo: Boris Camaca.

Invisible Voices and Speaking Machines

The work of Wolfgang von Kempelen, a Hungarian inventor who debuted several mechanical curiosities in the courts of the Habsburg Empire, recurs as a touchstone for both scholars of AI and theorists of voice. Von Kempelen rose to fame in 1770 with the invention of an elaborate chess-playing automaton, composed of a life-sized wooden figure seated behind a cabinet, costumed in fur-trimmed robes and a turban. In contrast to other popular automata of its day, the Mechanical Turk was purportedly capable not only of automated movement, but of autonomous thought.13 Von Kempelen described his automaton as a “thinking machine,” capable of deciding its own moves and masterfully executing a winning chess game on the basis of its own intelligence. The chess player was in fact an elaborate illusion controlled by a concealed human operator, yet this invention is nonetheless significant for the questions it inspired regarding the possibility of artificial intelligence as such.

The continued citation of the Mechanical Turk comprises an ongoing, if backhanded, disclosure that the concealment of human labor is integral to the functioning of seemingly intelligent machines.14 Most prominently, in 2005 Amazon publicly launched its Mechanical Turk (MTurk) platform, which comprises a massive invisible workforce (500,000 people worldwide as of 2015)15 that completes simple “human intelligence tasks” (HITs)—such as image tagging, audio transcription, copywriting, data verification, and de-duplication16—that exceed the abilities of an algorithm. Seemingly automated operations are thus propped up by piecemeal human cognitive labor in a phenomenon that Jeff Bezos has cheekily described as “artificial artificial intelligence.”17 The platform—which reduces business clients to “requesters” rather than employers, and workers to anonymous “Turkers”—thus cultivates the expectation of inexpensive and frictionless completion of tasks that necessarily treats humans as machines.18 Mary L. Gray and Siddharth Suri have suggested that MTurk, as one of the first commercially available platforms for crowdsourced labor, set the standards for what they term “ghost work,” a veiled “digital assembly line [that] aggregates the collective input of distributed workers.”19 The evaporation of accountability facilitated by platforms such as MTurk creates conditions of “algorithmic cruelty”;20 indeed, a 2018 study revealed that Turkers earn a median wage of approximately $2 per hour.21

Yet such disclosures can hardly be considered revelations, since the very allusion to the Mechanical Turk brings ghost workers into the light of day. Elizabeth Stephens draws a parallel between the open secret of the Mechanical Turk and the ways in which Amazon puts forward the term “artificial artificial intelligence” as a “distractingly interesting concept” that invites a “gentle puzzlement,”22 redirecting attention from MTurk’s fundamentally exploitative business model. Observing that von Kempelen’s claims regarding his thinking machine’s cognitive abilities were met with public skepticism from the moment it debuted, Stephens argues that the exoticized characterization of the chess-playing figure was meant to aesthetically connote its fakery, like “a kind of magic trick whose success lay in fooling an audience aware they were being hoodwinked.”23 In this respect, Amazon’s callback to the Mechanical Turk alludes not only to the integral role of concealed human labor in seemingly intelligent machines, but also to the aesthetic and political conditions of that concealment—a kind of hiding in plain sight. It is precisely this exhibition modality that is constitutive of an AI imaginary—a generative dissemblance that produces new illusions and affects.

Voice assistants, despite their promise to accomplish all manner of administrative and household tasks like magic, are also better understood as aggregates and coordinators of human labor. Vocal interfaces enlist the labor of their users to refine their own language-processing skills, as recorded speech inputs and outputs provide ample linguistic data for machine learning.24 Further, as is common practice in natural language processing, Alexa, Siri, and Google Assistant rely upon “thousands of low-paid humans who annotate sound snippets”25 in order to refine their conversational abilities.26 There are always humans in the assemblage that comprises the nonhuman speech of digital assistants, and indeed users have likely consented to participate through an end-user license agreement. Yet, as with the Mechanical Turk, such technologies vacillate between transparency and obfuscation, all in the service of ever more frictionless interfacing with digital platforms.

Beyond the positing of speech as a purportedly natural user interface, Thao Phan suggests that the success of voice assistants requires “the perfect mimesis of the social order within the speech acts of the algorithm itself.”27 As digital assistants strive to facilitate frictionless encounters between users and cloud platforms, “the category of the invisible becomes, then, a performance of the socially invisible.”28 Numerous scholars have indeed suggested that the prevalent use of feminized voices in such interfaces aligns with ingrained gendered stereotypes regarding power relations in domestic and professional contexts—recalling a plucky personal assistant or submissive domestic laborer—and thereby mollifies users’ anxieties regarding surveillance and data mining.29 Phan elsewhere comments upon the ways in which the utterances of digital assistants “evade specific identifying cultural inflections,”30 while adhering to American, British, or Australian national accents (which rarely reflect actual regional specificities), thus advancing an aesthetic that conflates neutrality with a generalized whiteness. Referring specifically to Amazon, Phan suggests that all manner of tasks behind the production and functioning of Alexa are performed by predominantly racialized workers—from assembly line workers building smart devices to gig economy service workers realizing consumers’ demands—and that their labor is obfuscated by the interface’s white and feminized voice. The invisibility of vocal interfaces is therefore further bolstered by the mobilization of social invisibility, which is itself undergirded by the logics of whiteness and hetero-patriarchy. Thus, while the human labor integral to the functioning of voice assistants is to some degree hidden in plain sight, the stakes of these aesthetic conventions are nonetheless political.

Wolfgang von Kempelen: The Turkish Chess Player. Copper engraving from the book: Karl Gottlieb von Windisch, Briefe über den Schachspieler des Hrn. von Kempelen, nebst drei Kupferstichen die diese berühmte Maschine vorstellen. 1783. Public Domain.

The functionality of voice in this paradigm is both evinced and complicated by the pairing of von Kempelen’s Turk with his “speaking machine,” unveiled in 1783. In contrast to the illusions and trickery upon which he relied with the Mechanical Turk, the speaking machine was a meticulous attempt to replicate the acoustic productions of the human vocal apparatus. An accordion-like pump referred to as the “bellows” acted as the lungs, creating a “wind” that flowed into a “windchest” containing various mechanisms that could be manipulated with levers to produce different consonants.31 While the mechanics within the windchest were concealed in a wooden box, the presentation of the device bore none of the anthropomorphism or exoticism of the Mechanical Turk. Von Kempelen adamantly advocated for the scientific significance of the machine, publishing a book in 1791 titled Mechanism of Human Speech and Language that detailed his research and experiments. However, Mladen Dolar crucially notes that the speaking machine and the Mechanical Turk were often publicly exhibited together as a kind of double bill or “double device” when von Kempelen toured across Europe in the 1780s.32 The speaking machine was often presented first, as a kind of prelude to the thinking machine: “The former made the latter plausible, acceptable, endowed with an air of credibility.”33

Thus, in this particular origin story of artificial artificial intelligence, voice is already an integral component in sustaining the illusion of such intelligence. Jessica Riskin describes how in eighteenth-century debates regarding the possibility and limitations of mechanical imitations of life, spoken language, along with perpetual motion, was situated “at the crux of the distinction between animate and inanimate, human and nonhuman.”34 While the Turk itself never spoke, it performed on the epistemological stage established by the demonstration of mechanical speech.

The fact that machine intelligence appeared plausible when coupled with the innately human characteristic of speech attests to the power of the vocal imaginary, as if the agential properties of voice could be transferred to the Mechanical Turk by proximity. Yet this demonstration also unsettled these very distinctions, defying dominant beliefs that voice was too organic a process ever to be simulated. This ambivalence inherent to von Kempelen’s double device is proffered as an exemplary anecdote for Dolar’s theory of voice—a framework that problematizes the emphasis on innateness and invisibility outlined above. Despite the transparency of the material components that together generate the speaking machine’s utterance, and their alignment with the familiar elements of the human speech apparatus, the device was nonetheless received by the public as an eerie enigma. As Dolar describes,

“There is an uncanniness in the gap which enables a machine, by purely mechanical means, to produce something so uniquely human as voice and speech. It is as if the effect could emancipate itself from its mechanical origin, and start functioning as a surplus—indeed, as the ghost in the machine; as if there were an effect without a proper cause, an effect surpassing its explicable cause.”35

This observation regarding the strangeness of the nonhuman voice crucially exposes, for Dolar, a vocal topology shared with the human voice. In every instance, voice is always irreducible to the means of its production, whether by the fleshy apparatus of the lungs and larynx or other mechanical means.36 The very fact that voice can be produced by machines situates it in “a zone of undecidability, of a between-the-two, an intermediacy,” which marks “one of the paramount features of the voice.”37 The relationship between synthesized speech and thinking machines is thus, in this telling, more convoluted than it initially appears: more than a design feature that obfuscates, naturalizes, and lends credibility to the latter, this presentation of voice generates unintended and uncanny affects that inspire a renewed consideration of the metaphysics of voice as such.

Similar, perhaps, to the ways in which voice is deployed in smart technology as an invisible interface, Dolar’s formulation describes voice as a “vanishing mediator,”38 a material support that disappears in the meaning that emerges through it. Yet voice simultaneously refuses the reduction to meaning, always leaving audible remainders such as timbre, accent, and intonation—what Dolar calls an “excrement of the signifier.”39 Further, it is precisely the purging of these extra-signifying elements from mechanical voices that paradoxically allows their “disturbing and uncanny nature”40 to emerge. Voice can, therefore, never quite vanish; it operates in remainders and excesses that unsettle the perception of speech as frictionless communication. As an interface, it is always unworkable. While the selection of white, feminine, or otherwise “socially invisible” voices might be understood as an attempt to suture the uncanny valley that opens up as humanoid machines approach realism,41 it is crucial that such affects remain and indeed flourish. Indeed, popular counter-discourses highlight communicative glitches and anomalous utterances produced by voice assistants,42 such as spontaneous outbursts of laughter.43

I do not wish to install Dolar’s metaphysical claims as a more accurate or entirely unproblematic way to think about voices,44 but his perspective does offer an intriguing point of entry to consider these affective perturbations and their implications for a vocal imaginary as it is wedded to an AI imaginary. Rather than considering synthesized voices as impoverished or falsified renderings of a “real” voice, this framework places the vocalizations of human and nonhuman actors in tandem, thereby rattling the ideological foundations that exceptionalize the voice as a privileged and innately human modality. Herndon’s work with and around AI takes up this invitation and further explores the possibilities afforded through a consideration of voice as a transindividual process;45 voices are not stable, singular entities, but collectively and constantly formed and reformed in relation to a sociotechnical milieu.

Design for parts of a speaking machine from a 1791 treatise by Wolfgang von Kempelen. The bellows act as lungs feeding air into the voice box, which is fitted with a vibrating reed whose sounds are shaped by opening and closing valves. Not pictured here is a rubber “mouth” attachment, which connects to “o” with a flange that bears holes resembling nostrils. Source: https://digital.slub-dresden.de/werkansicht/dlf/11112/1?tx_dlf%5Bpagegrid%5D=1&cHash=b4db56391c1bb332830d76c2e3d08615.
Holly Herndon, PROTO Cover Art, 2019.

“ ... Better Stories about AI”46

Herndon’s album PROTO was created in collaboration with a voice-processing neural network built with her partner/fellow artist Mat Dryhurst and developer Jules LaPlace. Herndon’s process involved creating an original piece of music, which she then recorded or taught to a choral ensemble. Her composition then acted as a dataset to train the neural net using sonic inputs, which in turn generated its own audible outputs.47 Over the course of creating the album, a group of vocalists met to “feed” Herndon’s custom AI housed in a gaming PC, which she called “Spawn” and assigned she/her pronouns. The members of the choral ensemble sang and talked to Spawn, who likewise sang and spoke back. Herndon concedes that the process of training Spawn was arduous, remarking that “everything sounded like ass” for the first six months of experimentation.48 The final results as they are heard on PROTO are eclectic—buzzing choral melodies, thunderous clashing beats, and indecipherable spoken-word samples from human and nonhuman collaborators alike.49
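The technical specifics of Spawn are not documented in this account, so the following is only a schematic, hypothetical sketch of the call-and-response training pattern described above: recorded voices become training data for a small model, and the model’s imperfect reconstructions become its “voice.” The toy convolutional autoencoder, the tensor shapes, and every parameter below are illustrative stand-ins, not Spawn’s actual architecture.

```python
# Hypothetical sketch only: a tiny autoencoder "fed" with placeholder vocal
# spectrograms, whose lossy reconstructions stand in for Spawn's garbled echoes.
import torch
import torch.nn as nn


class TinyVocalAutoencoder(nn.Module):
    """Compresses a mel-spectrogram-like input and reconstructs it imperfectly."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


# Placeholder "choral" data: in practice these would be spectrograms of the
# ensemble's recorded singing; random tensors stand in so the sketch runs as-is.
training_voices = torch.rand(8, 1, 80, 128)  # (clips, channel, mel bins, frames)

model = TinyVocalAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):  # each pass is one round of "feeding" the model
    optimizer.zero_grad()
    echo = model(training_voices)          # the model "sings back"
    loss = loss_fn(echo, training_voices)  # how far the echo strays from its source
    loss.backward()
    optimizer.step()

# After training, the reconstruction is close to, but never identical with, the
# input: that residual gap is the machinic "difference" heard in Spawn's replies.
with torch.no_grad():
    spawn_response = model(training_voices[:1])
```

A full generative audio system would, of course, resynthesize waveforms from such representations, but even this toy reconstruction loop suggests why the outputs drop notes and garble lyrics: the model can only ever approximate what it has heard.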

Although Herndon’s work hardly resembles the conversational tone or linguistic clarity of a voice assistant, a comparison might be made between the ways that they operate, at their most basic level, as voice-processing interfaces. Voice assistants convert users’ utterances into executable commands and respond to them, as well as capture these utterances as training data to further advance their vocal capabilities. Spawn likewise listens, responds, and cultivates her voice, albeit using sonic data inputs from a small group of consenting and compensated professional singers and musicians rather than a vast number of mostly unwitting consumers of smart speakers. Further, by attending to musical parameters and audible vocal traits rather than linguistic meaning, Spawn exceeds the command/response model that dominates interactions with digital assistants50 and generates unexpected outputs. Herndon takes up a malleable approach to working with machine learning; Spawn acts as a compositional tool, a musical instrument, and a performer within a broader ensemble, with these roles shifting and recombining in different ways on different tracks on PROTO. “Canaan” and “Evening Shades” are both described as “live training” sessions; the former a lyrical a cappella performance by three singers, including Herndon, offering up their voices as data, and the latter a call-and-response between a large choral ensemble and Spawn, who sings back in a hissing echo. For “Godmother,” by contrast, Spawn was trained with percussion tracks by footwork electronic musician and producer Jlin and generates her own stuttering beats using Herndon’s voice.

In several respects, Herndon approaches, yet convolutes, the metaphors and conventions that characterize popular understandings of AI. The anthropomorphizing of Spawn certainly resonates with the ways in which vocal interfaces are also assigned feminized personas. Yet this figuration is crucially different: Herndon refers to Spawn as her “inhuman child,”51 and metaphors of AI babies abound in reviews and interviews surrounding the release of PROTO. Unlike Siri, Alexa, or Google Assistant, who present themselves as fully formed, already “smart,” and ready at their users’ beck and call, Spawn more readily reveals the constant care and human attention that such systems demand.52 She “requires a community to raise her”53 and is entirely susceptible to the aesthetic intentions, inputs, and biases of that human community. Insofar as the voices heard on PROTO comprise “a weird hybrid ensemble where people are singing with models of themselves,”54 Spawn might also be interpreted in relation to “statistical doubles”55 of human consumers, which are generated through algorithmic data capture and designed to predict our desires and purchasing patterns. For Schmuckli, such doubles reside in what she calls a “reconfigured uncanny valley”56 that exceeds the feeling of disorientation caused by humanoid automata and is instead “mapped by the inscrutable calculations of algorithms that are designed to mine and analyze humans’ behavior and project it into tradable futures.”57 Indeed, Spawn holds up a vocal mirror to her trainers, singing their own voices back with a difference.58 This dynamic is clearly audible in “Evening Shades,” where the chorus sings and Spawn sings back, dropping notes and garbling lyrics. Spawn is incapable of perfectly recreating the complexity of human voices, just as statistical doubles can never entirely reflect one’s complete self, motivations, and desires—although they might get eerily close. PROTO might, therefore, be understood as dwelling within this uncanny doubling, reveling in the gaps and discrepancies between Spawn and her trainers, as much as the resemblances.

In this respect, Herndon’s approach is also distinguished from approaches to generative art that treat fidelity as an affirmation of intelligence. Indeed, Herndon is critical of applications, such as Amper or Jukebox, that involve statistical analysis of musical scores in order to emulate an artist or style; for her, this constitutes “an aesthetic cul-de-sac”59 in which the measure of AI’s “creativity” is its ability to mimic that which already exists. Instead, Herndon’s emphasis upon Spawn’s role as part of a collaborative musical ensemble seems to draw an unexpected parallel between training an algorithm and the vocal training undertaken by human singers and choruses. The effect is not to endow Spawn with a kind of elevated or anthropomorphic agency associated with human voices, but rather to highlight that all musical expressions of voice are necessarily technological. Indeed, Western vocal pedagogy, which is crucially informed by medical measuring and imaging practices such as laryngoscopy, advances an understanding of voice as an instrument.60 A somewhat paradoxical presumption emerges from this discourse: that one’s authentic voice can only be “discovered” through instruction, rehearsal, and exercise.61 As Herndon remarks, “the voice isn’t necessarily individual; it belongs to a community, to a culture, to a society.”62 Spawn’s voice, like those of her human trainers and fellow performers, is ever becoming entangled with all manner of material and discursive objects and relations. PROTO is thus also emblematic of a transindividual approach to voice—a “process that happens between bodies, locations, affective and discursive histories.”63

Holly Herndon & Jlin (feat. Spawn)—Godmother (Official Video), 2019, Screenshot.

Conclusion

I have argued that voice and the ideological convictions that surround it play a crucial role in the obfuscation central to the very functioning of AI. Smart devices rely upon an imaginary predicated upon the naturalness of speech—mobilized alongside mythical whiteness and femininity—to render vocal interfaces, and the human labor that supports them, invisible. Voice is situated as a frictionless, unmediated medium, one that lends credence to the seamless integration, intelligent capacities, utility, and neutrality of algorithms more broadly. Therefore, artistic practices that situate voice in varied and reciprocal relations to machine learning offer fecund ground for thinking with and beyond the contemporary convergence of vocal and AI imaginaries. Rather than mobilizing voice as a means to render interfaces transparent, vocal creations like Spawn amplify frictions within this zone of undecidability. By presenting voice as a transindividual process—the property of neither human subjects nor machines, but a relational site that emerges between them—it is possible to unsettle the ideological foundations upon which such ebullient stories about voice, and indeed about AI, rely, thus creating an opening for thinking otherwise.

This is a revised version of an article initially published in Afterimage 50, no. 2 (2023): 129–49. https://doi.org/10.1525/aft.2023.50.2.129.

Alex Borkowski (she/her) is a PhD candidate in the joint graduate program in Communication & Culture at York University and Toronto Metropolitan University.
