Holly Herndon, PROTO, 2019. Photo: Boris Camaca.

Vocal Aesthetics, AI Imaginaries

Reconfiguring Smart Interfaces

Alex Borkowski, 29.10.2024, Article, Issue 01

Introduction

Smart technology seems to be gripped by something of a fervor for voice. Intelligent agents such as Amazon’s Alexa, Apple’s Siri, and Google Assistant, which respond to a variety of user queries and execute tasks within the broader network of the smart home, have become increasingly central to daily life as one of the most ubiquitous applications of natural language processing and synthesis. While voice assistants are inarguably popular, there is something curious about the rhetorical invocation of paradigm shifts and sensory revolutions—the aural poised to overcome the visual and haptic. Vocal interfaces have long been a part of the cultural imaginary, as evinced by the myriad conversational computers that appear in science fiction. Yet today’s voice assistants are posited, by virtue of their vocal nature as such, as entirely new and disruptive. Indeed, Amazon promises that “experiences designed around the human voice will fundamentally improve the way people use technology.”1 This framing of the turn to voice as novel and groundbreaking by tech companies, commentators, and researchers alike might be understood alongside the tendency in scholarly perspectives to situate sound as a radical modality that offers a challenge to dominant visual paradigms. Rather than situating the contemporary technoscape as one in transition from text-based to voice-based interaction, I suggest that the very promise of such a transition itself constitutes a vocal imaginary, one laden with ideological baggage regarding communication, agency, and the parameters of human subjectivity.

As daily experience is ever more permeated with interfaces, voice provides a mode to insert yet another point of contact with digital platforms without displacing existing ones. Users can engage with Amazon and Google aurally, while simultaneously typing and scrolling; thus, this hands-free interaction afforded by smart speakers is posited as a way to eliminate friction in,2 and introduce new channels for, consumers’ engagement with the digital marketplace. In this respect, as one among many interfaces, voice assistants might be understood in accordance with Alexander R. Galloway’s framing of the interface as a threshold device, a passage or point of transition, whose success is measured by its transparency: “for every moment of virtuosic immersion and connectivity … of inopacity, the threshold becomes one notch more invisible, one notch more inoperable.”3 Indeed, this logic is evident in claims that voice offers a more organic, transparent, and accessible mode of engaging with digital technology, as voice-operated devices are posited as qualitatively distinct from their visual counterparts—even more inoperable, achieving more by doing less, by virtue of their recourse to spoken language as an inherently “natural user interface.”4 Embedded in such propositions are assumptions regarding the innate or universal traits of voice, many of which are predicated upon a belief in voice’s inherent “naturalness” or proper relationship to a human subject.5 Potent metaphors abound regarding finding and giving voice, such that it is granted exceptional status as a marker of agency, self-possession, and unmediated self-presence. Voice is necessarily personal—stable, essential, and singular—as well as necessary to one’s civic personhood; it is “the ticket to entrance into the human community.”6 Indeed, Amazon’s rhetorical invocation of “the human voice” is striking, as one never hears technologies that involve manual typing or swiping described as experiences designed around “the human touch.”

Voice assistants might, therefore, be understood as a sociotechnical nexus in which this vocal imaginary is entwined with what Claudia Schmuckli calls an “AI imaginary”—“the trove of images and symbols derived from the metaphors that guide and describe the design, operations, and applications of AI.”7 Such metaphors flourish precisely because of the slipperiness of the term “AI” itself;8 it proliferates through cultural conversations regarding the power and possibilities of artificial intelligence—both utopian and dystopian—that rarely ask what precisely it is.9 Meredith Broussard writes regarding popular representations and perceptions of AI that “it’s easy to confuse what we imagine and what is real.”10 While users would certainly be mistaken in perceiving voice assistants’ conversational abilities as genuine understanding, as Broussard notes,11 such imaginings are “real” in the sense that they are operative in structuring day-to-day engagements with such technologies.12 I therefore suggest that it is precisely the looseness of the term “AI” in its popular usage that facilitates its ideological power. It is the overall absence of a simple definition of AI that makes it possible for convictions about the innateness, intimacy, and authenticity of voice to become so readily wedded to smart technologies as a testament to their purported impartiality, efficiency, and accuracy. Given the copious evidence attesting to the coalescence of machine learning with surveillance capitalism, the biases embedded in algorithmic processes, and the ways that such seemingly immaterial tech is built upon natural resource extraction and exploited human labor, this paper proposes that critical thinking about AI might be advanced and elaborated by simultaneously thinking critically about voice—how are these imaginaries mobilized in harmony with one another and to what ends?

In what follows, I interrogate a synchronicity between theories of voice and an apocryphal origin story of AI, locating their entangled conceptual roots in the late eighteenth century. Specifically, the pairing of speaking and “thinking” automata prefigures the ways that vocal imaginaries continue to be mobilized as a means to lend credibility to purportedly intelligent machines today. Yet this historical antecedent simultaneously begins to unravel the humanist paradigms on which such an imaginary relies. Attending to voice in its affective, extra-communicative, and even uncanny dimensions opens up a counter-discourse that disrupts the purported seamlessness afforded by vocal interfaces. This obverse framework, which emerges from and remains embedded in synthesized voices, provides the theoretical ground on which I examine Holly Herndon’s electropop album PROTO (2019). Herndon’s work undermines the claim that voice comprises the most natural, and therefore most invisible, interface, instead deploying voice as a mode of digging into, rather than glossing over, the complexities, glitches, and limitations of machine learning.

Holly Herndon, PROTO, 2019. Photo: Boris Camaca.

Invisible Voices and Speaking Machines

The work of Wolfgang von Kempelen, a Hungarian inventor who debuted several mechanical curiosities in the courts of the Habsburg Empire, recurs as a touchstone for both scholars of AI and theorists of voice. Von Kempelen rose to fame in 1770 with the invention of an elaborate chess-playing automaton, composed of a life-sized wooden figure seated behind a cabinet, costumed in fur-trimmed robes and a turban. In contrast to other popular automata of its day, the Mechanical Turk was purportedly capable not only of automated movement, but of autonomous thought.13 Von Kempelen described his automaton as a “thinking machine,” capable of deciding its own moves and masterfully executing a winning chess game on the basis of its own intelligence. The chess player was in fact an elaborate illusion controlled by a concealed human operator, yet this invention is nonetheless significant for the questions it inspired regarding the possibility of artificial intelligence as such.

The continued citation of the Mechanical Turk comprises an ongoing, if backhanded, disclosure that the concealment of human labor is integral to the functioning of seemingly intelligent machines.14 Most prominently, in 2005 Amazon publicly launched its Mechanical Turk (MTurk) platform, which comprises a massive invisible workforce (500,000 people worldwide as of 2015)15 that completes simple “human intelligence tasks” (HITs)—such as image tagging, audio transcription, copywriting, data verification, and de-duplication16—that exceed the abilities of an algorithm. Seemingly automated operations are thus propped up by piecemeal human cognitive labor in a phenomenon that Jeff Bezos has cheekily described as “artificial artificial intelligence.”17 The platform—which reduces business clients to “requesters” rather than employers, and workers to anonymous “Turkers”—thus cultivates the expectation of inexpensive and frictionless completion of tasks that necessarily treats humans as machines.18 Mary L. Gray and Siddharth Suri have suggested that MTurk, as one of the first commercially available platforms for crowdsourced labor, set the standards for what they term “ghost work,” a veiled “digital assembly line [that] aggregates the collective input of distributed workers.”19 The evaporation of accountability facilitated by platforms such as MTurk creates conditions of “algorithmic cruelty”;20 indeed, a 2018 study revealed that Turkers earn a median wage of approximately $2 per hour.21

Yet such disclosures can hardly be considered revelations, since the very allusion to the Mechanical Turk brings ghost workers into the light of day. Elizabeth Stephens draws a parallel between the open secret of the Mechanical Turk and the ways in which Amazon puts forward the term “artificial artificial intelligence” as a “distractingly interesting concept” that invites a “gentle puzzlement,”22 redirecting attention from MTurk’s fundamentally exploitative business model. Observing that von Kempelen’s claims regarding his thinking machine’s cognitive abilities were met with public skepticism from the moment it debuted, Stephens argues that the exoticized characterization of the chess-playing figure was meant to aesthetically connote its fakery, like “a kind of magic trick whose success lay in fooling an audience aware they were being hoodwinked.”23 In this respect, Amazon’s callback to the Mechanical Turk alludes not only to the integral role of concealed human labor in seemingly intelligent machines, but also to the aesthetic and political conditions of that concealment—a kind of hiding in plain sight. It is precisely this exhibition modality that is constitutive of an AI imaginary—a generative dissemblance that produces new illusions and affects.

Voice assistants, despite their promise to accomplish all manner of administrative and household tasks like magic, are also better understood as aggregates and coordinators of human labor. Vocal interfaces enlist the labor of their users to refine their own language-processing skills, as recorded speech inputs and outputs provide ample linguistic data for machine learning.24 Further, as is common practice in natural language processing, Alexa, Siri, and Google Assistant rely upon “thousands of low-paid humans who annotate sound snippets”25 in order to refine their conversational abilities.26 There are always humans in the assemblage that comprises the nonhuman speech of digital assistants, and indeed users have likely consented to participate through an end-user license agreement. Yet, as with the Mechanical Turk, such technologies vacillate between transparency and obfuscation, all in the service of ever more frictionless interfacing with digital platforms.

Beyond the positing of speech as a purportedly natural user interface, Thao Phan suggests that the success of voice assistants requires “the perfect mimesis of the social order within the speech acts of the algorithm itself.”27 As digital assistants strive to facilitate frictionless encounters between users and cloud platforms, “the category of the invisible becomes, then, a performance of the socially invisible.”28 Numerous scholars have indeed suggested that the prevalent use of feminized voices in such interfaces aligns with ingrained gendered stereotypes regarding power relations in domestic and professional contexts—recalling a plucky personal assistant or submissive domestic laborer—and thereby mollifies users’ anxieties regarding surveillance and data mining.29 Phan elsewhere comments upon the ways in which the utterances of digital assistants “evade specific identifying cultural inflections,”30 while adhering to American, British, or Australian national accents (which rarely reflect actual regional specificities), thus advancing an aesthetic that conflates neutrality with a generalized whiteness. Referring specifically to Amazon, Phan suggests that all manner of tasks behind the production and functioning of Alexa are performed by predominantly racialized workers—from assembly line workers building smart devices to gig economy service workers realizing consumers’ demands—and that their labor is obfuscated by the interface’s white and feminized voice. The invisibility of vocal interfaces is therefore further bolstered by the mobilization of social invisibility, which is itself undergirded by the logics of whiteness and hetero-patriarchy. Thus, while the human labor integral to the functioning of voice assistants is to some degree hidden in plain sight, the stakes of these aesthetic conventions are nonetheless political.

Wolfgang von Kempelen: The Turkish Chess Player. Copper engraving from the book: Karl Gottlieb von Windisch, Briefe über den Schachspieler des Hrn. von Kempelen, nebst drei Kupferstichen die diese berühmte Maschine vorstellen. 1783. Public Domain.

The functionality of voice in this paradigm is both evinced and complicated by the pairing of von Kempelen’s Turk with his “speaking machine,” unveiled in 1783. In contrast to the illusions and trickery upon which he relied with the Mechanical Turk, the speaking machine was a meticulous attempt to replicate the acoustic productions of the human vocal apparatus. An accordion-like pump referred to as the “bellows” acted as the lungs, creating a “wind” that flowed into a “windchest” containing various mechanisms that could be manipulated with levers to produce different consonants.31 While the mechanics within the windchest were concealed in a wooden box, the presentation of the device bore none of the anthropomorphism or exoticism of the Mechanical Turk. Von Kempelen adamantly advocated for the scientific significance of the machine, publishing a book in 1791 titled Mechanism of Human Speech and Language that detailed his research and experiments. However, Mladen Dolar crucially notes that the speaking machine and the Mechanical Turk were often publicly exhibited together as a kind of double bill or “double device” when von Kempelen toured across Europe in the 1780s.32 The speaking machine was often presented first, as a kind of prelude to the thinking machine: “The former made the latter plausible, acceptable, endowed with an air of credibility.”33

Thus, in this particular origin story of artificial artificial intelligence, voice is already an integral component in sustaining the illusion of such intelligence. Jessica Riskin describes how in eighteenth-century debates regarding the possibility and limitations of mechanical imitations of life, spoken language, along with perpetual motion, was situated “at the crux of the distinction between animate and inanimate, human and nonhuman.”34 While the Turk itself never spoke, it performed on the epistemological stage established by the demonstration of mechanical speech.

The fact that machine intelligence appeared plausible when coupled with the innately human characteristic of speech attests to the power of the vocal imaginary, as if the agential properties of voice could be transferred to the Mechanical Turk by proximity. Yet this demonstration also unsettled these very distinctions, defying dominant beliefs that voice was too organic a process ever to be simulated. This ambivalence inherent to von Kempelen’s double device is proffered as an exemplary anecdote for Dolar’s theory of voice—a framework that problematizes the emphasis on innateness and invisibility outlined above. Despite the transparency of the material components that together generate the speaking machine’s utterance, and their alignment with the familiar elements of the human speech apparatus, the device was nonetheless received by the public as an eerie enigma. As Dolar describes,

“There is an uncanniness in the gap which enables a machine, by purely mechanical means, to produce something so uniquely human as voice and speech. It is as if the effect could emancipate itself from its mechanical origin, and start functioning as a surplus—indeed, as the ghost in the machine; as if there were an effect without a proper cause, an effect surpassing its explicable cause.”35

This observation regarding the strangeness of the nonhuman voice crucially exposes, for Dolar, a vocal topology shared with the human voice. In every instance, voice is always irreducible to the means of its production, whether by the fleshy apparatus of the lungs and larynx or other mechanical means.36 The very fact that voice can be produced by machines situates it in “a zone of undecidability, of a between-the-two, an intermediacy,” which marks “one of the paramount features of the voice.”37 The relationship between synthesized speech and thinking machines is thus, in this telling, more convoluted than it initially appears: more than a design feature that obfuscates, naturalizes, and lends credibility to the latter, this presentation of voice generates unintended and uncanny affects that inspire a renewed consideration of the metaphysics of voice as such.

Similar, perhaps, to the ways in which voice is deployed in smart technology as an invisible interface, Dolar’s formulation describes voice as a “vanishing mediator,”38 a material support that disappears in the meaning that emerges through it. Yet voice simultaneously refuses the reduction to meaning, always leaving audible remainders such as timbre, accent, and intonation—what Dolar calls an “excrement of the signifier.”39 Further, it is precisely the purging of these extra-signifying elements from mechanical voices that paradoxically allows their “disturbing and uncanny nature”40 to emerge. Voice can, therefore, never quite vanish; it operates in remainders and excesses that unsettle the perception of speech as frictionless communication. As an interface, it is always unworkable. While the selection of white, feminine, or otherwise “socially invisible” voices might be understood as an attempt to suture the uncanny valley that opens up as humanoid machines approach realism,41 it is crucial that such affects remain and indeed flourish. Indeed, popular counter-discourses highlight communicative glitches and anomalous utterances produced by voice assistants,42 such as spontaneous outbursts of laughter.43

I do not wish to install Dolar’s metaphysical claims as a more accurate or entirely unproblematic way to think about voices,44 but his perspective does offer an intriguing point of entry to consider these affective perturbations and their implications for a vocal imaginary as it is wedded to an AI imaginary. Rather than considering synthesized voices as impoverished or falsified renderings of a “real” voice, this framework places the vocalizations of human and nonhuman actors in tandem, thereby rattling the ideological foundations that exceptionalize the voice as a privileged and innately human modality. Herndon’s work with and around AI takes up this invitation and further explores the possibilities afforded through a consideration of voice as a transindividual process;45 voices are not stable, singular entities, but collectively and constantly formed and reformed in relation to a sociotechnical milieu.

Design for parts of a speaking machine from a 1791 treatise by Wolfgang von Kempelen. The bellows act as lungs feeding air into the voice box, which is fitted with a vibrating reed whose sounds are shaped by opening and closing valves. Not pictured here is a rubber “mouth” attachment, which connects to “o” with a flange that bears holes resembling nostrils. Source: https://digital.slub-dresden.de/werkansicht/dlf/11112/1?tx_dlf%5Bpagegrid%5D=1&cHash=b4db56391c1bb332830d76c2e3d08615.
Holly Herndon, PROTO Cover Art, 2019.

“ ... Better Stories about AI”46

Herndon’s album PROTO was created in collaboration with a voice-processing neural network built with her partner/fellow artist Mat Dryhurst and developer Jules LaPlace. Herndon’s process involved creating an original piece of music, which she then recorded or taught to a choral ensemble. Her composition then acted as a dataset to train the neural net using sonic inputs, which in turn generated its own audible outputs.47 Over the course of creating the album, a group of vocalists met to “feed” Herndon’s custom AI housed in a gaming PC, which she called “Spawn” and assigned she/her pronouns. The members of the choral ensemble sang and talked to Spawn, who likewise sang and spoke back. Herndon concedes that the process of training Spawn was arduous, remarking that “everything sounded like ass” for the first six months of experimentation.48 The final results as they are heard on PROTO are eclectic—buzzing choral melodies, thunderous clashing beats, and indecipherable spoken-word samples from human and nonhuman collaborators alike.49
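The technical specifics of Spawn are not documented in this account, so the following is only a schematic, hypothetical sketch of the call-and-response training pattern described above: recorded voices become training data for a small model, and the model’s imperfect reconstructions become its “voice.” The toy convolutional autoencoder, the tensor shapes, and every parameter below are illustrative stand-ins, not Spawn’s actual architecture.

```python
# Hypothetical sketch only: a tiny autoencoder "fed" with placeholder vocal
# spectrograms, whose lossy reconstructions stand in for Spawn's garbled echoes.
import torch
import torch.nn as nn


class TinyVocalAutoencoder(nn.Module):
    """Compresses a mel-spectrogram-like input and reconstructs it imperfectly."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


# Placeholder "choral" data: in practice these would be spectrograms of the
# ensemble's recorded singing; random tensors stand in so the sketch runs as-is.
training_voices = torch.rand(8, 1, 80, 128)  # (clips, channel, mel bins, frames)

model = TinyVocalAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):  # each pass is one round of "feeding" the model
    optimizer.zero_grad()
    echo = model(training_voices)          # the model "sings back"
    loss = loss_fn(echo, training_voices)  # how far the echo strays from its source
    loss.backward()
    optimizer.step()

# After training, the reconstruction is close to, but never identical with, the
# input: that residual gap is the machinic "difference" heard in Spawn's replies.
with torch.no_grad():
    spawn_response = model(training_voices[:1])
```

A full generative audio system would, of course, resynthesize waveforms from such representations, but even this toy reconstruction loop suggests why the outputs drop notes and garble lyrics: the model can only ever approximate what it has heard.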

Although Herndon’s work hardly resembles the conversational tone or linguistic clarity of a voice assistant, a comparison might be made between the ways that they operate, at their most basic level, as voice-processing interfaces. Voice assistants convert users’ utterances into executable commands and respond to them, as well as capture these utterances as training data to further advance their vocal capabilities. Spawn likewise listens, responds, and cultivates her voice, albeit using sonic data inputs from a small group of consenting and compensated professional singers and musicians rather than a vast number of mostly unwitting consumers of smart speakers. Further, by attending to musical parameters and audible vocal traits rather than linguistic meaning, Spawn exceeds the command/response model that dominates interactions with digital assistants50 and generates unexpected outputs. Herndon takes up a malleable approach to working with machine learning; Spawn acts as a compositional tool, a musical instrument, and a performer within a broader ensemble, with these roles shifting and recombining in different ways on different tracks on PROTO. “Canaan” and “Evening Shades” are both described as “live training” sessions; the former a lyrical a cappella performance by three singers, including Herndon, offering up their voices as data, and the latter a call-and-response between a large choral ensemble and Spawn, who sings back in a hissing echo. For “Godmother,” by contrast, Spawn was trained with percussion tracks by footwork electronic musician and producer Jlin and generates her own stuttering beats using Herndon’s voice.

In several respects, Herndon approaches, yet convolutes, the metaphors and conventions that characterize popular understandings of AI. The anthropomorphizing of Spawn certainly resonates with the ways in which vocal interfaces are also assigned feminized personas. Yet this figuration is crucially different: Herndon refers to Spawn as her “inhuman child,”51 and metaphors of AI babies abound in reviews and interviews surrounding the release of PROTO. Unlike Siri, Alexa, or Google Assistant, who present themselves as fully formed, already “smart,” and ready at their users’ beck and call, Spawn more readily reveals the constant care and human attention that such systems demand.52 She “requires a community to raise her”53 and is entirely susceptible to the aesthetic intentions, inputs, and biases of that human community. Insofar as the voices heard on PROTO comprise “a weird hybrid ensemble where people are singing with models of themselves,”54 Spawn might also be interpreted in relation to “statistical doubles”55 of human consumers, which are generated through algorithmic data capture and designed to predict our desires and purchasing patterns. For Schmuckli, such doubles reside in what she calls a “reconfigured uncanny valley”56 that exceeds the feeling of disorientation caused by humanoid automata and is instead “mapped by the inscrutable calculations of algorithms that are designed to mine and analyze humans’ behavior and project it into tradable futures.”57 Indeed, Spawn holds up a vocal mirror to her trainers, singing their own voices back with a difference.58 This dynamic is clearly audible in “Evening Shades,” where the chorus sings and Spawn sings back, dropping notes and garbling lyrics. Spawn is incapable of perfectly recreating the complexity of human voices, just as statistical doubles can never entirely reflect one’s complete self, motivations, and desires—although they might get eerily close. PROTO might, therefore, be understood as dwelling within this uncanny doubling, reveling in the gaps and discrepancies between Spawn and her trainers, as much as the resemblances.

In this respect, Herndon’s approach is also distinguished from approaches to generative art that treat fidelity as an affirmation of intelligence. Indeed, Herndon is critical of applications, such as Amper or Jukebox, that involve statistical analysis of musical scores in order to emulate an artist or style; for her, this constitutes “an aesthetic cul-de-sac”59 in which the measure of AI’s “creativity” is its ability to mimic that which already exists. Instead, Herndon’s emphasis upon Spawn’s role as part of a collaborative musical ensemble seems to draw an unexpected parallel between training an algorithm and the vocal training undertaken by human singers and choruses. The effect is not to endow Spawn with a kind of elevated or anthropomorphic agency associated with human voices, but rather to highlight that all musical expressions of voice are necessarily technological. Indeed, Western vocal pedagogy, which is crucially informed by medical measuring and imaging practices such as laryngoscopy, advances an understanding of voice as an instrument.60 A somewhat paradoxical presumption emerges from this discourse: that one’s authentic voice can only be “discovered” through instruction, rehearsal, and exercise.61 As Herndon remarks, “the voice isn’t necessarily individual; it belongs to a community, to a culture, to a society.”62 Spawn’s voice, like those of her human trainers and fellow performers, is ever becoming entangled with all manner of material and discursive objects and relations. PROTO is thus also emblematic of a transindividual approach to voice—a “process that happens between bodies, locations, affective and discursive histories.”63

Holly Herndon & Jlin (feat. Spawn)—Godmother (Official Video), 2019, Screenshot.

Conclusion

I have argued that voice and the ideological convictions that surround it play a crucial role in the obfuscation central to the very functioning of AI. Smart devices rely upon an imaginary predicated upon the naturalness of speech—mobilized alongside mythical whiteness and femininity—to render vocal interfaces, and the human labor that supports them, invisible. Voice is situated as a frictionless, unmediated medium, one that lends credence to the seamless integration, intelligent capacities, utility, and neutrality of algorithms more broadly. Therefore, artistic practices that situate voice in varied and reciprocal relations to machine learning offer fecund ground for thinking with and beyond the contemporary convergence of vocal and AI imaginaries. Rather than mobilizing voice as a means to render interfaces transparent, vocal creations like Spawn amplify frictions within this zone of undecidability. By presenting voice as a transindividual process—the property of neither human subjects nor machines, but a relational site that emerges between them—it is possible to unsettle the ideological foundations upon which such ebullient stories about voice, and indeed about AI, rely, thus creating an opening for thinking otherwise.

This is a revised version of an article initially published in Afterimage 50, no. 2 (2023): 129–49. https://doi.org/10.1525/aft.2023.50.2.129.

Alex Borkowski (she/her) is a PhD candidate in the joint graduate program in Communication & Culture at York University and Toronto Metropolitan University.
