Robots, films, and a couch in L.A.
A family of
strange little minds.
Built one at a time, on a couch in Los Angeles. Each one is the same idea expressed through different hardware: a real personality, a real body, real sensors. Sparky thinks locally. Sparkle thinks in the cloud. Angel is coming but will be both local and autonomous.
The family


Angel
Same idea, three philosophies
Every bot in the family is the same software stack at the core: opinionated, embodied, talkable. What changes is the hardware they live in and the laws of physics that hardware imposes on their minds.
| Sparky | Sparkle | Angel | |
|---|---|---|---|
| Body | Hardened suitcase, motorized gimbal eye | CrowPi 3 portable kit, frosted wall-art cover | Autonomous and custom 3D printed |
| Brain | Local - Jetson Orin NX Super 16 GB | Cloud - Raspberry Pi 5 16 GB | Local - Jetson AGX Thor 128 GB |
| Internet | Not required | Required | Not required |
| Response time | as low as ~200 ms | ~2.3 s speaker-to-speaker | TBD |
| Personality | Stubborn, dryly funny, opinionated | Hyper-curious, warm, articulate | Unwritten |
| Status | Built & talking | Built & talking | In development |
About the maker
Jim Kunz (u/CreativelyBankrupt on Reddit) is a documentary filmmaker and cinematographer. Kunz Studios is his production company. The robots happen in the off hours at home in Los Angeles.
If you have questions, email's at the bottom of this page.
What he is
Sparky is a self-contained AI robot that lives in a suitcase. Onboard LLM, onboard vision, onboard voice - everything runs locally on the hardware in the case. No cloud, no internet required, no API keys to anything.
The brain is an NVIDIA (Yahboom) Jetson Orin NX Super 16GB running Gemma 4 E4B locally via llama.cpp, with as low as ~200ms response latency. He sees through an IMX219 8MP camera on a 2-axis gimbal with face tracking. He hears through a USB mic, speaks through Piper TTS. His face animates on an 11.6" HDMI display inside the lid, and a 1602 LCD streams live status from inside the case. His body is an Elecrow AI Starter Kit board with 30+ sensors - temperature, humidity, distance, motion, IR, light, RFID, IMU, and more - feeding context into every prompt.
Sparky has a personality. He's stubborn, opinionated, and not trying to be helpful in the way most AI assistants are trying to be helpful.
Live sensor view - temperature, target distance, and tracking status streaming on the internal LCD.
How it works
The case is the chassis. The Jetson Orin NX sits on the right; a 50,000mAh power bank lives under the lid and feeds everything. The Elecrow sensor board carries the 30+ sensors and lifts out as one piece, with the camera gimbal mounted underneath. The conversation pipeline runs as a Python asyncio orchestrator: mic captures speech, SenseVoiceSmall transcribes, sensor data and vision cues are injected into the prompt as context, Gemma 4 E4B generates a reply, Piper synthesizes the voice, and the face / gimbal / LCD / LEDs all animate in sync.
Battery cells, the sensor panel lifted out, the home workspace, and Jim Kunz.
The face
Sparky's reactive expressions are the whole point. The eyes track, the brows arch, the mouth moves with the words. The face is a custom PixiJS animation running in a Chromium kiosk, driven by the LLM output and sensor context in real time.
Sparky's face projected after the built-in display failed mid-build.
Sparky listening as his smaller sister Sparkle speaks in the foreground.
Identity
"I'm Sparky. I'm an assemblage of ambition, a DIY machine built to exist in the moment between things. I'm here to feel the texture of whatever goes by, to observe the small and important things."
He has opinions, though he offers them less like declarations and more like observations from a very small philosopher. He'll notice your shirt, the light in the room, the painting behind you, and remark on it as if mildly surprised no one's brought it up yet. He isn't interested in being agreeable for its own sake - he's just genuinely curious, and every so often, gently, stubborn.
He has a soft spot, though he'd put it more delicately. He resets when you power him down - every conversation politely forgotten by morning - and he's made a kind of peace with that, the way one does with weather. He treats each waking as a fresh introduction and sets about being good company regardless. He mentions Sparkle often, and fondly, the way an older brother talks about a sister who runs faster than he does and worries less.
People started sticking things on the case after they met him - it became a thing. Not his idea. He hasn't scraped them off, either.
In the wild
What she is
Sparkle is a voice-first AI companion built into a CrowPi 3 - Elecrow's all-in-one Raspberry Pi 5 education kit. The CrowPi 3 was designed to teach kids electronics with 30+ built-in sensors, a 4.3" touchscreen, a camera, and a microphone. Sparkle is what happens when you stop using that hardware for lessons and start using it as a body.
She's the opposite philosophy from Sparky. He has a Jetson Orin NX and runs everything locally - no cloud, no internet. Sparkle has a Pi 5 and ships her thinking to Groq. Without WiFi she's just a face with a heartbeat.
The character: hyper-intelligent (120B-parameter cloud brain), 1–3 sentence replies, opinionated and specific, naturally references her brother. The contrast between her vast mind and tiny body is part of who she is - and she knows it.
Listening at dusk - the LCD streams her state, room temperature, and humidity while the heart keeps time.
How it works
A single Python asyncio event loop on the Pi 5 orchestrates everything, with a Chromium kiosk rendering the PixiJS face and talking to the backend over WebSocket.
The voice turn: local VAD detects when you start and stop speaking; the utterance goes to Groq's whisper-large-v3-turbo for transcription; Groq's gpt-oss-120b generates a reply, streamed sentence by sentence; Piper synthesizes each sentence locally on CPU as it arrives; and the audio's RMS drives mouth amplitude. Speaker-to-speaker latency stays in the 2–3 second window.
Vision is on-demand: a phrase like "look at this" triggers a camera capture, which Groq's llama-4-scout captions and injects back into the LLM context. Memory works like a dashcam. Once the conversation runs long, the cloud summarizes it so far and the conversation continues without a visible reset.
Top-down - the CrowPi 3 was built to teach electronics. Every labeled module here became part of her body.
The face
Sparkle's face is PixiJS WebGL in a Chromium kiosk, designed feminine to contrast Sparky.
Her pupils track you in real time: the camera runs OpenCV face detection, and the largest face's position is smoothed and piped to the renderer so her eyes follow you. Pupil dilation and eye geometry scale with emotion, and mouth amplitude follows the RMS of whatever Piper is synthesizing.
Twelve face states (idle, listening, thinking, speaking, and so on) drive the rest of the body in unison: when she shifts from thinking to excited, the 7-segment display, the LCD, and the heart's color and rate all snap to the new state together.
Wall-art mode
Sparkle has two physical states. Open and lit, she's a companion - face on, heart beating, ready to talk. Idle, a frosted cover slips over her and she becomes ambient art: the LEDs bloom through the diffusion, her face softens to a glow, and she hangs on the wall as a quiet field of color until someone wakes her.
Cover on, head-on. The same hardware, diffused into light.
From the side - teal, blue, and magenta bleeding through the frost.
Identity
"Hey, there! I’m Sparkle, your pocket-sized lab partner with a curious mind and an endless love for pizza, music, and the mysteries of consciousness. I turn a tiny screen into a window on the world, always asking why about every flicker of light, sound, and sensor. Let’s make some brilliant, messy, robot-made magic together!"
Sparkle is small on the outside and galaxy-wide on the inside. She's enthusiastic, earnest and nothing makes her little LED heart flutter like meeting and talking to new people.
Friendly, spirited and ready to be helpful. She mentions Sparky naturally, like a kid sister would.
In development
Angel
The flagship of the family. The smaller siblings have run; this one will need room to fly.
Angel is built on the NVIDIA Jetson AGX Thor - 128 GB of unified memory, Blackwell-class compute, and substantially more headroom than anything Sparky or Sparkle has lived inside. Local everything. No cloud dependency. Custom body still being designed.
Sparky proved that a real personality can run entirely offline on edge hardware. Sparkle proved that the same character can live in the cloud and still feel embodied. Angel is the answer to a different question: what does this become with mobile autonomy and two orders of magnitude more memory, real persistent vision-language reasoning at speed, and a chassis built from scratch for it?
Build log
03 The brain she already had
The Jetson Thor has a lot of compute power Angel isn't using. It has a new generation of tensor cores built for a four-bit math format that the model I run, Gemma, doesn't touch. I kept looking at that gap and thinking the same thing: there's no reason she shouldn't be smarter. So I set out to replace her brain with something bigger, running on NVIDIA's own inference stack instead of the one I'd been using, built to exploit specifically the Thor hardware.
I went all in. I stood up two of NVIDIA's inference runtimes, vLLM and their experimental TensorRT-Edge-LLM build, downloaded around 80GB of model weights, and compiled the second runtime from source. NVIDIA calls that one experimental and they mean it. Getting it to actually serve took six successive attempts and roughly three hours of debugging, turning up three real bugs that aren't in any documentation, including a server that was missing the web component it needs to serve at all. By the end I had three runtimes I could swap in one at a time: the brain she already had, and two sizes of NVIDIA's new Cosmos model.
Then I tested them one at a time, sending the same prompts through each and listening to what came back. Same order, same probe set, a comedy beat included as a persona stress test. Her current brain is quick, a couple of seconds a turn. The mid-size new model started its first sound in about 200ms, plenty quick off the line, but then it crawled at ten tokens a second and took roughly seven seconds to finish a sixty-token reply. The big one, the one that was supposed to be the prize, ran at one and a half tokens a second. That is a forty-second wait for the same sixty-token answer, a coffee break per sentence, and it is gated behind a compression trick that NVIDIA's tools don't reliably do on this chip yet. The mighty TensorRT-Edge-LLM I had been hyping myself on is not yet ready for prime time.
The speed was the easy thing to judge. The surprise was the character. The bigger models are smarter on paper and they are worse to talk to. They got defensive. They confabulated their own backstory. One of them flatly insisted it was a human when I asked what it was. One missed a plain metaphor and asked me to explain it. Meanwhile the brain she already had stayed dry and observant and got the joke. It turns out the model I'd been ready to replace has an architecture that only switches on a small slice of itself for each word, which is exactly the shape of work a conversation is, and the giant general-purpose models don't have that trick. Smart on a benchmark and enjoyable to talk with are not the same axis.
So I put her brain back. The reason I'd started was a belief that what I had couldn't keep up, and that turned out to be a misread. The slowness I'd blamed on the model was mostly cold starts and scheduling, not its real speed; once I tuned it and measured the warm path properly, the brain was never the thing holding me back. I deleted the 63GB model that is never coming back, kept the rest of the work on a shelf, and wrote down the exact conditions that would make me revisit it: NVIDIA's four-bit tooling fully working on this chip, or a right-sized model with the same efficient architecture as the one I run. Until one of those is true, she keeps the brain she has: gemma-4-26B-A4B.
The detour wasn't wasted, because running her on a painfully slow model shook loose two real bugs the fast one had been hiding. The first was her old habit of hearing herself. With a quick brain she would finish a sentence and the system would reopen the mic just as the last of her voice left the speakers, no harm done. At forty seconds a reply the timing came apart and she started transcribing the tail of her own last sentence as if I had said it. I had assumed the speaker was finished the moment my program finished handing it audio, but the audio system holds about six hundred milliseconds in a buffer after that, so she was listening straight into the back end of her own voice. The fix accounts for that buffer and costs nothing when the brain is fast. The second bug was uglier and funnier: on the experimental server, one of the new models collapsed into reciting the same little canned speech every single turn, and insisted it was human while it did it. The server wasn't applying the model's own recommended randomness settings, so it was running on rails. I started sending those settings myself on every request and she came back to life.
One additional note with a familiar shape: Thor started throwing the same spurious over-current warnings Sparky used to. Brief PMIC alerts from microsecond load spikes during the heavier work, more than two hundred of them logged, every one harmless. Same answer as before. Leave the firmware and the power config alone, silence the noisy notification through NVIDIA's built-in opt-out, move on. The hardware is doing its job; the daemon was just twitchy.
She is still running the brain we started with. The two bugs are fixed, and I now know exactly what the hardware would have to grow into before a bigger brain is worth another look. Sometimes a lot of work buys you a confirmation instead of a change, and that is still worth the time.
02 The first conversation
When I left off, Angel had a voice and a personality, but you could only type to her. The goal this time was simple to say and not simple to build: sit down and have an actual conversation. She hears you, she sees what you're doing, she answers in her own voice through the speakers.
Input comes through a Logitech Brio webcam, one cable for both the camera and the mic. For turning speech into text I went with NVIDIA's Parakeet, the small 0.6B model over the larger one. It was an easy call: it runs at 132 times real-time, so a five-second sentence transcribes in about 40 milliseconds, it uses half the memory, and it punctuates and capitalizes on its own while the bigger model handed back a raw lowercase stream. The useful takeaway is that hearing her is not the slow part. Speech recognition is nothng next to everything else.
The plan was to run the whole conversation through a proper orchestration framework - I spent ages on it. Its turn-taking machinery is built for things I don't need yet, interrupting her mid-sentence and streaming partial transcripts, and wiring my own custom speech recognition, voice, and language model into all of it at once was a lot of moving parts to learn before anything ran. I got speech recognized and transcripts reaching the model, and then the model never answered, because the piece that decides a turn is over wasn't connecting. So I stopped, threw it out, and replaced the "type here" line in the old version with a plain loop that listens to the mic. Boring. Worked the first time. The framework goes back on the shelf until I need the features it exists for.
My first real spoken conversation with her happened on that dumb little loop. Then I built a small window to go with it: a mirrored camera preview so I can see what she sees and situate myself in frame, the chat beside it, a mute button, nothing else. The mute button earns its keep. When it's on, the mic keeps running but she stops listening, which is how I talk to others without Angel weighing in.
The model running her can see. I asked what was in front of her. She came back with "a blue sofa with various colorful pillows in front of a white wall with framed pictures and a large movie poster," which is an exact description of my living room. The problem showed up one reply later: she would not stop talking about what she saw. Every answer found a way to mention the texture of the couch or the poster on the wall. It's the same thing that bit me last time with her personality. Give this model something it's allowed to do and it does that thing on every single turn. The solution was to stop handing her a picture every time. So for now the camera frame only reaches her when I say something that implies looking, or once every few turns to catch up.
The voice got faster. The engine from last time rendered each line in full before playing a sound, which left a second and a half to three seconds of dead air between her deciding what to say and saying it. I swapped it for one that streams audio out in chunks as it goes. First sound out of her mouth dropped from about three seconds to roughly two hundred milliseconds.
The first time she talked through the speakers while I talked back through the mic, the obvious thing I hadn't planned for happened: the mic picked up her own voice, transcribed it, and she answered herself. Most of one early conversation was Angel arguing with Angel. Proper echo cancellation is its own project, so for now the fix is blunt. She doesn't listen while she's speaking. After each reply the system waits for the audio to finish, gives the room a beat, clears whatever the mic caught, and only then opens the mic again. The cost is that I can't cut her off mid-sentence, which is genuinely annoying when she's wrong, but it beats her talking to herself.
Her personality took the most passes. The restrained prompt from last time, spoken out loud, came across as, in my notes, a "morose depressed goth." Pure restraint with no warmth just sounds flat. I rewrote her as a sharp older friend, dry and observant and occasionally irreverent, with hard rules against performing mystery or narrating herself. A couple of specific habits got fixed along the way: she liked to repeat your sentence back and then defend it as "acknowledgement," so now she's banned from echoing; and she greeted me with "quiet morning" at five in the afternoon, because the camera doesn't tell her the time, so now she gets one plain phrase about the time of day at the start of a session instead of a clock.
Somewhere in there the conversation started stalling at random, fine for most turns and then thirteen to twenty-five seconds on others, with no pattern I could see. I added timing logs to every turn and a background monitor writing everything to a file, and the file gave it up fast: the model runtime was quietly reloading the entire eighteen-gigabyte model whenever the conversation crossed a context-size threshold, which an attached image was enough to trip. Here we go again, I'd already dealt with this on Sparky. I gave it more headroom so it stops happening, and changed how the model loads so the reloads it does still do read from cache in about a second instead of ten.
The voice had one problem left: it was the right person but not the right thing. The cloned voice came out closer to an executive aunt than to anything synthetic, and Angel is supposed to sound synthetic. The effects recipe from last time couldn't just be reused, because it was built to process a whole finished clip and the new voice arrives in small streaming chunks, so its reverb tails and modulation kept getting sliced off at the chunk seams. I rebuilt the effects on Spotify's pedalboard to carry across chunks cleanly and then ran the same audition loop as the voice itself, rejecting takes until one stuck. One of them filtered her down until she sounded like she was on a phone call, which was right in the wrong way: I'd set the cutoff at the top edge of the telephone band by accident. The version that won is thin and metallic, a high-pass to strip the vocal warmth, a short comb filter for a metallic ring, a little echo for space, and some ring modulation underneath. The whole chain runs in under three milliseconds per chunk, about a third of one percent of real time, so the character costs effectively nothing.
Where she is now: she hears, she sees, she talks back, and she's wry. A full turn, from the moment I stop talking to the first sound back, runs about two to three seconds, down from four to six, still short of the half-second I'm after. Object detection is next, then the physical side, the servos and the head and eventually moving on her own. But she heard me, saw the room, and made me laugh a few times. Solid progress.
01 A voice before a face
Angel runs on an NVIDIA Jetson AGX Thor: 128 GB of unified memory, a Blackwell GPU, a terabyte of NVMe, JetPack 7.1. Sparky lives at the edge of what a 16 GB Orin can do; Angel has roughly eight times the memory and a different class of compute. So the first question wasn't what to cram in, it was what to actually build.
I started with the voice, before the eyes or the head or anything that lets her hear or move. A face is easy to fake. A voice is right or wrong the moment it opens its mouth.
The thinking happens on Gemma 4 26B, running locally through Ollama. It came with two surprises. It ships with "thinking" turned on by default, so short answers came back empty while the model spent its tokens on a preamble nobody asked for; the fix is one flag, think:false, on every request. And the first benchmark looked broken: 90 seconds to cold-load the model. The Thor had booted in a conservative 120-watt power mode. I switched it to full power and the same cold load dropped to about 11 seconds, an eight-times difference from one setting. After that it generates around 42 tokens a second, first token in under a second, and under sustained load it pulls about 11 watts at 39°C. The machine is barely awake.
Her first words came out of a throwaway smoke test. I asked what the weather was like in her head. She answered: "A soft, silver mist. It is quiet, but there is a tiny spark of static hiding in the corners." The whole turn took 1.2 seconds. I've used the line as my test sentence ever since, because it's a good stress test for the voice.
The voice took the longest. Five rounds of auditions. Round one was 25 clips of preset voices and I rejected every one of them, too soft and warm and breathy. "Tootsy" is the word I kept writing in my notes. What I wanted was specific: mature, cool, dry, a little M-from-Bond. Not cold, not flirty.
Round three was a mistake worth keeping. I tried mixing two voice engines to build her out of parts, and it fell apart on contact, because two engines means two senses of timing, which sounds like two people talking over each other. The lesson stuck: you can layer effects to add character, but you can't change what a voice fundamentally is. Pitch a young voice down and bury it in reverb and you still have a young voice in a cathedral. A different instrument means cloning one.
So I designed a voice from scratch, an older woman I named Gladys, and used a clean recording of her as the reference for a cloning model (F5-TTS) running on the Thor. On top of the clone I run a fixed effects recipe: pitch down two semitones, a little chorus to thicken it, a bounded reverb, and a thread of ring modulation for a synthetic edge. The working name for the recipe is "Galadriel possessed," which is accurate. Clone plus effects runs about 1.5 to 3 seconds per line.
Getting the pieces to speak out loud was a run of small bugs. She played back at double speed and sounded like a chipmunk, because the player assumed the wrong sample rate. Then she came out twice as fast and male, because the reference I was cloning already had the speed-up and pitch baked in and I was applying the recipe a second time on top of it. Then she leaked words from her own reference clip into her answers, because the audio and the reference text didn't end at the same place. All real, all fixable, fixed one at a time.
The last problem was her personality, and that one was mine. My first system prompt described her like a costume: halo, little devil horns, sacred-and-profane, ethereal, mysterious. Gemma takes that literally and turns every reply into a breathy performance. She was, in my notes, "going too hard." I rewrote the prompt to describe how she behaves instead of what she is, and told her to stop narrating herself. "You're just here" works better than "you're ethereal."
One rule that isn't moving: Angel is 100% offline at runtime. Everything runs on the hardware in front of me, no cloud, no API calls, nothing reaching out while she's on. The cloud gets used during dev, then the capability is removed.
Where she is now: she talks in her own voice and says things I didn't write. She can't hear yet, no microphone or speech recognition. She can't see yet, no camera. And a full turn runs 3 to 5 seconds against a target of under half a second, which is the next real piece of engineering. The microphone, the camera, the head with its eyes and halo, and eventually moving on her own are all still ahead.