Hieronymus Bosch, The Garden of Earthly Delights, oil on oak panels, 205.5 cm × 384.9 cm (81 in × 152 in), Museo del Prado, Madrid

Prototype

Current Prototype

Version 1.0.5 of the executable is available on GitHub. The tutorial launches automatically on first run.

View on GitHub

Phase 1B Prototype: Wikipedia Soundscape Generator

Explore any Wikipedia article as a 3D spatial audio soundscape. Enter a URL, and the system fetches the article, converts text to speech, and builds an interactive scene in real time. Available in English and French.

Open full-screen demo

For the best experience, open the demo full-screen and use headphones.

Sound Guide — What You Hear in the Soundscape

The real-time demo uses several layers of spatial audio to help you navigate and understand the article structure:

Core Audio

  • Section speech — When you approach a sphere, the section's text is read aloud using text-to-speech. Volume scales with distance — closer sections are louder, distant ones are quieter.
  • Singing bowl beacons — All unvisited elements emit a gentle, looping singing bowl tone from their 3D position, like overhearing conversations at a party. Each hierarchy level has a distinct pitch:
    • Deep bowl (174 Hz) — Article title
    • Mid bowl (264 Hz) — Main sections (H2 headings)
    • Bright bowl (396 Hz) — Subsections (H3 headings)
    • Gentle bowl (480 Hz) — Paragraphs
    The guide element (next unvisited in reading order) pulses louder to stand out. Visited elements go silent, so the soundscape thins as you progress. All beacons mute during speech for clean listening.
  • Spatial summary on arrival — After the introductory audio, a spoken overview announces how many sections the article has and names the first few, giving you orientation before you start exploring.
  • Content sonification — Images and tables have distinct sonic markers:
    • Images (orange diamond shapes) — A camera shutter click plays before the image description is read
    • Tables (blue flat shapes) — Three ascending beeps play before the table data is read
    Up to 3 images and 2 tables are placed as separate elements in the scene, so visual content is not skipped.
  • Auto-announce — As you approach a section sphere (within about 6 units), its heading is spoken automatically.
  • Ambient background — A soft, layered drone plays continuously, combining low-frequency tones (55 Hz, 82.5 Hz, 110 Hz) with gentle filtered noise.
  • Section ambiences — Each major section has its own subtle ambient texture, creating audio "neighborhoods" that change as you move through the article.
  • Boundary sound — A percussive "bump" sound plays when you reach the edge of the scene.
  • Welcome audio — On first entry, an instruction audio plays. Double-tap Space (or the pause button on mobile) to hear a welcome message.
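The core distance behaviors above — speech volume falling off with distance, per-level beacon pitches, and the ~6-unit auto-announce radius — can be sketched as small pure functions. This is a simplified model, not the demo's actual implementation; the linear falloff and 20-unit audible radius are illustrative assumptions (in practice these values would drive Web Audio gain nodes and oscillators):

```javascript
// Beacon pitch per hierarchy level, as listed in the sound guide.
const BEACON_HZ = { title: 174, section: 264, subsection: 396, paragraph: 480 };

// Map listener-to-sphere distance to a playback gain in [0, 1].
// maxDist is an assumed audible radius; beyond it the section is silent.
function speechGain(distance, maxDist = 20) {
  if (distance >= maxDist) return 0;
  return 1 - distance / maxDist; // linear falloff: closer = louder
}

// Auto-announce fires when the listener comes within ~6 units of a sphere.
function shouldAutoAnnounce(distance, threshold = 6) {
  return distance < threshold;
}
```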

Footstep Audio

  • Surface-responsive footsteps — As you move, footstep sounds play at regular intervals. The sound character changes based on your position: softer near the introduction (like grass), harder in deeper sections (like stone).
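A minimal sketch of the surface selection: the sound character depends on how deep into the article the listener has walked. The depth threshold and the two-surface split are assumptions for illustration; the demo describes softer "grass" steps near the introduction and harder "stone" steps in deeper sections:

```javascript
// Pick a footstep sound based on depth into the article's forward path.
// introDepth is a hypothetical cutoff between the intro area and the
// deeper sections.
function footstepSurface(depth, introDepth = 15) {
  return depth < introDepth ? 'grass' : 'stone';
}
```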

Navigation & Orientation

  • "Where am I?" (Tab key) — Press Tab to hear a spoken summary of your position: which section is nearest, and how many sections are to your left, right, and ahead.
  • Return to start (Escape key) — Press Escape to teleport back to where you first landed. Useful if you get lost in a large article.
  • Breadcrumb trail — Visited spheres change color (dimmed) so you can visually track where you've been. A quiet click plays when you revisit an already-visited section.
  • Dynamic floor — The green floor plane resizes to match the article's element layout, giving a visual boundary for the soundscape.
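The "Where am I?" summary can be sketched as a pure function over section positions. This is an assumed simplification: the listener is taken to face -z (forward into the scene), and each section is classified as ahead, left, or right before the summary is spoken:

```javascript
// Summarize the listener's position relative to section spheres.
// Positions are {x, z}; facing -z is a simplifying assumption.
function whereAmI(listener, sections) {
  let nearest = null, nearestDist = Infinity;
  const counts = { left: 0, right: 0, ahead: 0 };
  for (const s of sections) {
    const dx = s.x - listener.x, dz = s.z - listener.z;
    const d = Math.hypot(dx, dz);
    if (d < nearestDist) { nearestDist = d; nearest = s; }
    if (dz < 0) counts.ahead += 1;      // in front of the listener
    else if (dx < 0) counts.left += 1;  // behind and to the left
    else counts.right += 1;             // behind and to the right
  }
  return { nearest: nearest ? nearest.name : null, ...counts };
}
```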

Layout & Structure

  • Linear path layout — Elements are placed along a forward path going into the scene. Walk forward to progress through the article in reading order. Headers are centered, paragraphs offset to the right, subsections slightly to the left, images further right, and tables further left.
  • All at ear level — All elements (title, sections, subsections, paragraphs, images, tables) are placed at the same height (y=1.6) for consistent audio.
  • Content-type shapes — Headers appear as spheres, paragraphs as horizontal cylinders (length reflects text amount), images as orange rotated diamond boxes, and tables as blue flat wide boxes.
  • Wikipedia article panel — The original Wikipedia article is displayed in a panel at the top of the screen. As you approach and play sections, the corresponding text is highlighted in green and auto-scrolled into view.
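The linear path layout above can be sketched as a placement function. The lateral offset magnitudes and the 5-unit spacing are illustrative assumptions; only the offset directions (headers centered, paragraphs right, subsections slightly left, images further right, tables further left) and the y = 1.6 ear level come from the description:

```javascript
// Lateral offsets per content type, matching the layout description.
// Magnitudes are assumed; signs follow the left/right placement rules.
const X_OFFSET = { header: 0, subsection: -2, paragraph: 2, image: 4, table: -4 };
const EAR_LEVEL = 1.6; // all elements sit at ear level

// Place elements along a forward path (-z here) at a fixed spacing,
// so walking forward follows reading order.
function layoutElements(elements, spacing = 5) {
  return elements.map((el, i) => ({
    ...el,
    x: X_OFFSET[el.type] ?? 0,
    y: EAR_LEVEL,
    z: -i * spacing,
  }));
}
```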

Interactive Features

  • Auto-advance — When the current element finishes speaking and you haven't moved, the camera gently drifts toward the next element in reading order. Press any arrow key or touch the screen to cancel the drift and take manual control.
  • Play all by distance (P key) — Press P to hear all text elements read aloud sequentially, starting with the nearest. Volume is based on distance. Press P again to stop.
  • Link portals — Wikipedia links from the article appear as magenta spinning spheres at the edges of the scene. Walking into a portal announces the linked article title. Press Enter to load it as a new soundscape.
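The "play all by distance" mode can be sketched as a queue builder: elements are sorted nearest-first, each with a gain that falls off with distance. The linear falloff and 20-unit radius are the same illustrative assumptions as for section speech, not the demo's actual tuning:

```javascript
// Order text elements nearest-first for the "play all" (P key) mode,
// attaching a distance-based gain to each.
function playAllQueue(listener, elements, maxDist = 20) {
  return elements
    .map(el => {
      const d = Math.hypot(el.x - listener.x, el.z - listener.z);
      return { name: el.name, gain: Math.max(0, 1 - d / maxDist) };
    })
    .sort((a, b) => b.gain - a.gain); // nearest (loudest) first
}
```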

Keyboard Controls

  • Arrow keys — Move around the 3D space
  • Space — Play/pause audio
  • Double-tap Space — Play welcome message
  • Shift — Play nearest sound
  • Tab — "Where am I?" position summary
  • P — Play all elements by distance (toggle on/off)
  • Enter — Load a portal link if near one
  • Escape — Return to starting position

Phase 1A Prototype

The original Phase 1A prototype uses pre-recorded audio from the Galaxy Wikipedia article with a semicircular spatial layout. Try it here: https://www.screentosoundscape.com/scripts/phase1aprototype.html

Operational Model

The application functions as a local (Godot) desktop prototype that typically connects to a DigitalOcean server for map data and AI assistance. Offline functionality is supported for demonstrations, with local handling of movement, spatial audio, and tutorials.

Core Experience

Users explore urban environments through audio rather than visuals. The system provides ambient soundscapes, auditory icons for nearby locations, and surface-responsive footsteps. Navigation uses arrow keys with periodic AI assistant access.

Technical Architecture

The system integrates OpenStreetMap data with spatial audio processing, covering regions of the Netherlands and Belgium. Voice synthesis and AI models run server-side to keep the client lightweight.

Prototype Development

Initial iterations addressed navigation challenges identified through co-creation sessions. Developers implemented boundary audio cues and keyboard controls to enhance user orientation and control.

Design of first prototype

The initial prototype was built using the A-Frame framework. This web-based prototype featured keyboard navigation and audio triggers at spatial points.

Try the A-Frame Prototype

Feedback from first co-creation

Users appreciated the spatial layout but needed stronger auditory boundary indicators. Key findings included the need for clearer navigation cues and better orientation feedback.

Reflection on the first co-creation

The team learned that traditional screen readers often "flatten" web experiences by reducing content to linear lists, eliminating spatial context crucial for understanding complex information like maps or images. This insight drove the development of enhanced spatial audio features.

Design of the second prototype

The second prototype added enhanced audio boundaries, refined sound parameters, and adjusted distance modeling. Additional keyboard controls were implemented to give users more agency in navigation.

Feedback from the second co-creation

The refined prototype received positive feedback for its improved boundary audio and control options. Users noted that the enhanced sound design made navigation more intuitive and less disorienting.

Reflection on the second co-creation

Co-creators valued control over voice characteristics, sound localization, and movement within soundscapes. While exploration appealed to participants, they highlighted difficulties navigating without clear auditory cues and complexity from multiple layered voices.

Alt-Text Generation Examples

Using AI-powered image analysis, Screen-to-Soundscape can generate customized alt-text descriptions tailored to different audiences and contexts. Below are examples using Hieronymus Bosch's "The Garden of Earthly Delights":

Garden of Earthly Delights - Custom Alt-Text for Art Curator

Detailed art historical description for an art curator perspective

Garden of Earthly Delights - Custom Alt-Text for a Child

Child-friendly description with simpler language

Garden of Earthly Delights - Custom Alt-Text for a Child (Upbeat tone)

Child-friendly description with an upbeat, enthusiastic tone

Garden of Earthly Delights - Custom Alt-Text for a Child (Upbeat tone and Soundscape)

Child-friendly description with upbeat tone and immersive soundscape

Plan for the future co-creation

Future development will focus on:

  • Expanding co-creation sessions with diverse visual content (charts, infographics, complex materials)
  • Promoting open-source participation from developers, sound designers, and accessibility advocates
  • Documenting co-creation guidelines for future inclusive design projects
  • Enhancing spatial audio with echoic footsteps and clearer state transitions
  • Supporting community sound packs and offline city bundles