- Giovanna Manzotti, "Andante Con Moto": Liliana Moro at PAC Padiglione d'Arte Contemporanea in Milan
- Ramona Ponzini, "The Sound of Water": Yuko Mohri and the Reconfiguration of Liquidity in "Compose"
- Johanna Hardt, "The Machine Monologs": Eli Cortiñas at Fotografiska in Berlin
- Nina Sun Eidsheim, "It's Intimate, and It's Not Mine…": Talking About Sound with Camille Norment
- Bernardo Follini, Diana Anselmo, "Audism and the Origins of Cinema": Conversation between Bernardo Follini and Diana Anselmo
- Alex Borkowski, "Vocal Aesthetics, AI Imaginaries": Reconfiguring Smart Interfaces
- Giulia Deval, "Some Margin Notes on 'PITCH'": A Performance Lecture on Vocal Intonation
- Anna Bromley, Caterina Gobbi, "Listening Nearby": Conversation between Anna Bromley and Caterina Gobbi
- Luïza Luz, Johanna Hardt, "Music for Wild Angels": Conversation between Luïza Luz and Johanna Hardt
- Amina Abbas-Nazari, High Pitch, "How Voice is Seen by AI": Conversation with Amina Abbas-Nazari
- Jessica Feldman, "What Happens to the Speaking Subject When the Listener is a Computer?": Affective Computing of the Speaking Voice
- Esther Ferrer, Elena Biserna, "I'll Tell You About My Life": A Score by Esther Ferrer
- Giulia Zompa, "Talk Shop": Hanne Lippard at Settantaventidue in Milan
London-based designer, researcher, and performer Amina Abbas-Nazari explores the materiality of speech and how the body, technology, and environment shape human and non-human voices. She works with choirs but also with her own voice as a site of investigation to analyze conversational AI systems.
In her work Polyphonic Embodiment(s), she conducts a material experiment that critiques voice analysis technologies. By recreating an AI that claims to infer facial features from vocal characteristics, Abbas-Nazari exposes how these systems attempt to map voice onto identity, questioning the assumption embedded in such technologies that the voice is a biometric marker: an aspect of the body that can be used for immediate identification and long-term assessment.
Polyphonic Embodiment(s) is based on Speech2Face, an AI developed at the Massachusetts Institute of Technology that correlates specific facial characteristics, including the height of the upper lip, the height of the nose, and the width of the jaw, with the acoustic properties of the voice (T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik, "Speech2Face: Learning the Face Behind a Voice," preprint, submitted May 23, 2019). Unlike other voice recognition systems that classify individuals according to predefined categories such as age, nationality, or gender, Speech2Face focuses on constructing facial profiles, highlighting a different approach to linking voice and identity.
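To make concrete what "correlating" voice with facial measurements involves, here is a minimal, hypothetical sketch of that style of pipeline, not a reproduction of MIT's actual model: a clip is reduced to a fixed vector of acoustic statistics, and a regressor maps it onto facial targets. It assumes the librosa and scikit-learn libraries; the function names and file paths are illustrative.

```python
# Illustrative sketch only: the shape of a voice-to-face pipeline, not the
# Speech2Face code. Assumes librosa and scikit-learn are installed.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

def acoustic_features(path: str) -> np.ndarray:
    """Reduce an audio clip to a fixed-length vector of acoustic statistics."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # timbre
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # brightness
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch track
    f0 = f0[~np.isnan(f0)]                                    # keep voiced frames
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean(), f0.mean() if f0.size else 0.0],
    ])

# The contested leap: regressing "facial" targets (upper-lip height, nose
# height, jaw width) from that vector. Untrained placeholder here; such a
# model would be fit on paired voice/face data.
face_model = Ridge()
```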
Conversation with Amina Abbas-Nazari
HPM: What are the ethical problems with using sound to guess what someone looks like? Don’t we all do this daily?
AAN: AI voice analysis algorithms are often developed without rigorous real-world testing, yet they influence critical decisions about policing, home ownership, insurance, immigration, surveillance, and job opportunities. For example, the commercial company Clearspeed develops AI voice analysis tools for businesses that want to screen job applicants for trustworthiness. Potential employees are called and asked a series of yes-no questions via an automated questionnaire; an assessment is then made from the audio-derived data of these monosyllabic utterances.
AI voice analysis relies on normative and stereotypical representations of human beings, which causes actual harm to people, especially those marginalized in society. It works by categorizing and labeling people and human characteristics so they can be correlated with the data set on which the system was built. This perpetuates normative expectations within an AI system and its real-world application. Anyone who does not fit into such a system risks being marginalized, misrepresented, and/or harmed. From an intersectional position, this could include, for example, BAME, queer, trans, disabled, and displaced people. The pitch of an adult woman, for instance, is typically said to fall between 165 and 255 Hz, but this is not always the case. Voice and identity should not be understood as directly linked and congruent.
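To make the brittleness of that heuristic concrete, here is a deliberately naive sketch (an editorial illustration, not any vendor's code) of a pitch-band rule built on commonly cited ranges; the 85–155 Hz "male" band is an added assumption, as is the whole idea that such bands can classify anyone.

```python
# The stereotype encoded as code: a pitch band is not an identity.
# Bands are commonly cited figures, not a valid basis for inference.
def naive_label(median_pitch_hz: float) -> str:
    if 165 <= median_pitch_hz <= 255:
        return "adult woman (per the stereotype)"
    if 85 <= median_pitch_hz <= 155:
        return "adult man (per the stereotype)"
    return "outside the model's categories"

# One speaker morphing their voice, as in Polyphonic Embodiment(s),
# lands in three different boxes:
for pitch_hz in (210.0, 140.0, 320.0):
    print(pitch_hz, "->", naive_label(pitch_hz))
```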
HPM: What interests you in the relationship between voice and identity as a design researcher?
AAN: I’m a designer, researcher, and vocal performer. The combination of these fields has led me to research the voice in conjunction with emerging technologies, primarily through practice-led work, since 2008. I’ve recently completed a PhD on the sound and sounding of voices in artificially intelligent conversational systems. During my PhD research, I became interested in how AI and the AI industry define, design, and understand voice (both human and synthesized). It became apparent that the field of AI perpetuates the idea that one person has one voice and that analysis of that voice can provide a detailed description of their identity. AI analysis makes far-ranging claims, including the ability to measure someone’s age, hormone levels, heart rate, blood pressure, dominance, leadership, and public and private behavior (Rita Singh, Profiling Humans from Their Voice, Springer, 2019). Edward B. Kang, Assistant Professor of Critical Digital Studies at NYU, affirms that this industry relies on understanding the voice as a “fixed, extractable, and measurable ‘sound object’ located within the body” (Edward B. Kang, “Biometric Imaginaries: Formatting Voice, Body, Identity to Data,” Social Studies of Science 52, no. 4, 2022).
However, as a singer, I felt compelled to question these claims and AI’s particular comprehension of voice. For me, a voice is an ever-changing, shape-shifting, malleable material that I sculpt in co-creation with my body and the environment around me. In sound and music practice, and in the work of female experimental vocalists, I have found support for these ideas. For example, experimental vocalist Jennifer Walshe describes her body as a “staging area” for all the things she has heard and all the places she has lived: “I don’t have a voice. I have many, many voices” (Jennifer Walshe, “Ghosts of the Hidden Layer,” talk given at the Darmstädter Ferienkurse, July 25, 2018, posted on Milker Corporation). I’m also drawn to the writing of Professor of Musicology Nina Sun Eidsheim, who describes how “a specific voice’s sonic potentiality […] [in] its execution can exceed imagination” and discusses voices as having “an infinity of unrealized manifestations” (Nina Sun Eidsheim, The Race of Sound, Duke University Press, 2019).
HPM: You developed the AI with the help of Sitraka Rakotoniaina. What datasets were used for training?
AAN: Polyphonic Embodiment(s) used the same dataset as MIT’s Speech2Face: AVSpeech, which comprises three-to-ten-second audiovisual clips drawn from 290,000 YouTube videos, in which the audible sound belongs to a single speaking person. We didn’t create an original dataset because the project was not an intervention at the dataset level; it addressed the AI itself, as a complete system that tries to construct an individual’s facial appearance from the sounding of their voice. Since we had limited resources and computing power, 2,500 samples were randomly selected from AVSpeech, aiming to imitate the recognition capabilities of Speech2Face.
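As a rough illustration of that subsampling step, the sketch below draws 2,500 entries at random from the AVSpeech index. The column layout (YouTube ID, segment start and end, face-center coordinates) follows the dataset's published CSV format; the file names and the fixed seed are assumptions.

```python
# Hedged sketch of the subsampling described above. File names are
# hypothetical; the column layout follows AVSpeech's published CSV index.
import csv
import random

random.seed(42)  # make the draw reproducible

with open("avspeech_train.csv", newline="") as f:
    rows = list(csv.reader(f))  # yt_id, start_s, end_s, face_x, face_y

subset = random.sample(rows, k=2500)  # 2,500 clips from the full index

with open("avspeech_subset_2500.csv", "w", newline="") as f:
    csv.writer(f).writerows(subset)
```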
HPM: In the video, you use only your own voice for the experiment. Why is that, and how does it affect the project's results?
AAN: Polyphonic means “many voices.” By exploring the expressive possibilities of one voice, I wanted to highlight vocal diversity, creativity, and potential. Through this exploration, I tried to resist the claims of AI's voice analysis while exposing the system's shortcomings. This challenges the understanding of voice that AI uses and promotes. I believe emancipation from AI's oppressive ordering and categorization of humans can be achieved by valuing humans and their voices as polyphonic.
HPM: You used very basic household objects to morph your voice. This seems like a counterpoint to the “complexity” of AI technology.
AAN: AI is a top-down system billed as a panacea supertechnology. However, as a society, we have little control over how companies use our data, and there is currently inadequate testing and regulation of how and where AI is employed. For example, AI facial recognition has received much criticism for its significant racial bias and poor performance, yet there are frequent reports of this technology being used in UK policing (Matt Burgess, “Police Use of Face Recognition Is Sweeping the UK,” Wired, November 9, 2023).
In Polyphonic Embodiment(s), the devices represent a bottom-up, low/no-tech, DIY intervention to expose and resist AI’s inadequacies while giving the voice greater agency and authority within AI systems. I used simple materials and household objects to shape my body, morphing my voice and vocal qualities as an exaggerated expression of vocal materiality and materialism. The recreated Speech2Face AI then analysed the experiments exploring my many voices.
The video documentation and final output of Polyphonic Embodiment(s) show the ten devices in use, with the face produced by our recreated AI on the left side of the frame. Some images arguably resemble my face (e.g., Device #8); some might be deemed more masculine (e.g., Device #10); and some are just plain disturbing (e.g., Device #4)! Together, they show that the AI failed to comprehend my voice, and in turn my face, once I utilized the polyphonic potential of my vocal materiality.
Amina Abbas-Nazari (she/her) is a designer, researcher, and vocal performer based in London.