environment they want. And you're going to be limited in changing the style unless you have a lot of data, unless you have examples of Japanese style and train a model on what Japanese style is. So I think initially you're not going to be able to change the style until you have a sufficient data set to train a model. So we probably want to crawl some data and get it into our data set.

What data are you going to crawl? Where are you going to get that 3D data? Maybe not 3D. Can we do 2D? Yeah, 2D, but then you can't change the camera angle and all that, because of consistency.

Google Maps. People meticulously went and modeled their own cities. The people in it? Yeah, if you have the data. Well, yeah, at peak Google Maps there were models for large urban areas, and I think they still exist. The KMZs were crazy. People would just randomly go out and model their own house and the two houses next to them because they thought it was fun. I need to go look at how dense that data set is. I remember for San Francisco and Hong Kong it was quite impressive. So you get a full KMZ model of Hong Kong that you can go walk around in. Yeah, if you can get that data, then you're good.

I think the thing with AI models is that there are a lot of 2D images available, and that's why diffusion models have been successful for 2D images, but not as successful for video and 3D, because there's less data available and training takes a lot longer. So it's much easier if you just start with capturing your own data, Gaussian splat a 3D environment, offer that to your users, and then either acquire the data somewhere else or scan it yourself.

I'm thinking. There's no camera feature in the KMZ, and there's no camera feed here, so you can't tell that I'm thinking. So I'm going to turn on my camera so that you guys can all see. Justin can see that I'm, in fact, thinking. I can see your shoulder. Look, I'm thinking.

Well, what you want is to walk through an environment which is in the style of Tokyo. You don't want actual Tokyo. You want something which looks like Tokyo. I don't know how much training data is needed to do this. Like, if you take a bunch of videos of Tokyo and label them all "Tokyo," will the model learn what Tokyo is? It seems hard, because Tokyo-ness is much less well-defined than, say, human-ness. So I have no idea.

Like, if you ask a model to... hang on. I have this capability. I'm sure you could do 2D: put that in your prompt and it'll come out. All right, no one is... We can use my own app to generate some images, at least. You know, image generation will have no problem, but... Yeah, I'm just curious what the behavior looks like.

So this is SDXL, a very reasonable model. Yeah. So this is what SDXL gives for "downtown Tokyo, high quality." Should do, like, street level, because you're talking about walking through it. No, it's a walkthrough. DNG, 35mm, f/1.4? Should we add "trending on ArtStation"? OK. Hey, Kelly, is this downtown Tokyo? Ignore the bad text on the signs. It's kind of, yeah, kind of. I have no idea what downtown Tokyo looks like. Yeah, it could pass. It could pass as downtown.

Yeah, so you can make an image of it, but then making a video or 3D is much harder. OK, OK. This is pretty decent. So you can learn concepts like this. Sadly, the animate endpoint isn't up; otherwise we could animate it and see what we get. I mean, maybe the answer is to do it in two stages, right? If what you want is... It also depends on how long the videos are.
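For reference, here is a minimal sketch of the experiment and the two-stage idea described above, assuming the Hugging Face diffusers library, the public SDXL base checkpoint, and Stable Video Diffusion as a stand-in for the animate endpoint that was down. The prompt wording and parameters are illustrative, not what was actually typed during the meeting.

```python
# Minimal sketch of the two-stage idea: generate a street-level "downtown Tokyo"
# still with SDXL, then animate it with an image-to-video model.
# Assumes the Hugging Face diffusers library and public checkpoints; Stable Video
# Diffusion is an assumed stand-in for the unavailable "animate" endpoint.
import torch
from diffusers import StableDiffusionXLPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

device = "cuda"

# Stage 1: text-to-image with SDXL.
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)
image = t2i(
    prompt="street-level view of downtown Tokyo, walking perspective, "
           "35mm, f/1.4, high quality, trending on ArtStation",
    height=576,
    width=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("tokyo_still.png")

# Stage 2: image-to-video from the generated still.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to(device)
frames = i2v(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "tokyo_walk.mp4", fps=7)
```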
Yeah, Kelly, how long do we want the videos to be? Like, the walkthrough, if someone says "show me a walkthrough of Tokyo" or something, how long? I assume more than six seconds. Maybe 10 seconds. 10 seconds, OK. 10 seconds for now; later it can be longer if the person pays more, something like that.

I mean, one way might be to just train a model to predict walks through cities, which is basically a self-driving-type model, right, for which there are large amounts of data. Then generate the first image and use that model to predict the motion. The self-driving models are really good because there's so much data; you can get so much data. So for this, you would crop the dash-cam view down to a more modest field of view, not a fisheye, then train to predict the next frame. Could be interesting.

You want to put people in. You can, you can. You want to put yourself in. Wait, do you want yourself to be in the walkthrough? That's what I thought. Some of those could be; some of those don't have to be. It could be separate, because some of those could be separate. It sounded to me like the whole thing was: you capture yourself, and then you can insert yourself into it. So that is one of the features.

You're going to have to pick one, because who's going to make all these features? You've got how many, nine features? Who's going to make them all? Nine? Nine features, let me see. So it's you? Do you have a team of programmers you're assigning them to? Oh, OK, you and Bailey, another person helping. OK, so I've picked out one feature I'm tackling. It'll probably take me three months. The walkthrough of the city is something I can look into. This is actually an interesting thing. Let's go to Hugging Face.

Bailey, you said you were going to capture Gaussian splats. Yeah, but now it seems like you're splatting. Should I continue splatting? I can continue splatting. You're saying I'm splatting? Are you? I'm not sure. I think that's one of the useful things about this meeting, that we can figure this out. OK, I'll continue splatting then. I'm not, because from what you said last time, it sounded to me like the main feature was going to be: capture 3D environments as Gaussian splats, combine them with a user who has recorded their likeness, let them insert themselves into the 3D environment, and be able to change the camera, time of day, and weather, and animate yourself. That's what it sounded like to me last meeting.

I'm trying to find... yes, OpenDriveLab/Vista. Predict the future. Wait, this is a fun model. We should go play with it. Look at this. It generates these nice, long driving videos. So could you generate a video and reverse it? To go backwards, or drive backwards? Drive backwards, and then you put a person in front. You put yourself in front, like you're walking. Oh, god, so it's like green screen in a bad movie from 1960. There's an animation of a man running, except it's you, because you've scanned your face. This may be a very challenging demo.

Is that all you get? Is it only streets? I mean, you only get streets. But check it out, Kelly, these are pretty good streets. OK, OK. These are generated by a model that is only trained to predict driving videos. I see, I see. Tell it to walk. We can go add some additional bouncing up and down to make it feel like walking. Make it slower. Yeah. But it's only in the street.
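A minimal sketch of the dash-cam data-prep idea mentioned above: center-crop wide frames to approximate a more modest field of view and emit consecutive-frame pairs for next-frame prediction. It assumes OpenCV for decoding; the file path, crop fraction, and frame stride are placeholder choices, and a true fisheye would need proper undistortion rather than a simple crop.

```python
# Sketch: crop dash-cam frames to a narrower apparent field of view and yield
# (frame_t, frame_t+stride) pairs for training a next-frame predictor.
# Paths, crop fraction, and stride are illustrative placeholders.
import cv2  # OpenCV for video decoding

def crop_to_modest_fov(frame, keep_fraction=0.6):
    """Center-crop, keeping the middle `keep_fraction` of width and height,
    which roughly narrows the apparent field of view (no lens undistortion)."""
    h, w = frame.shape[:2]
    ch, cw = int(h * keep_fraction), int(w * keep_fraction)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    return frame[y0:y0 + ch, x0:x0 + cw]

def next_frame_pairs(video_path, stride=2, keep_fraction=0.6):
    """Yield (current frame, future frame) pairs, both cropped.
    Loads the whole clip into memory, which is fine for short dash-cam clips."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(crop_to_modest_fov(frame, keep_fraction))
        ok, frame = cap.read()
    cap.release()
    for t in range(len(frames) - stride):
        yield frames[t], frames[t + stride]

# Usage sketch: for x, y in next_frame_pairs("dashcam_clip.mp4"): train_step(x, y)
```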
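And a sketch of the post-processing tricks floated for the Vista-style driving clips: reverse the clip, slow it down, and add a small vertical bob so it feels more like walking than driving. The bob amplitude, period, and slowdown factor are made-up values, not anything tuned.

```python
# Sketch: turn a generated driving clip into something that feels like walking by
# reversing it, duplicating frames to slow it down, and adding a sinusoidal bob.
# All constants are illustrative assumptions.
import math
import numpy as np

def driving_to_walking(frames, slowdown=2, bob_pixels=6, bob_period=24):
    """frames: list of HxWx3 uint8 arrays. Returns a reversed, slowed, bobbing clip."""
    out = []
    reversed_frames = frames[::-1]                  # "drive backwards"
    for i, frame in enumerate(reversed_frames):
        for rep in range(slowdown):                 # duplicate frames to slow down
            t = i * slowdown + rep
            dy = int(round(bob_pixels * math.sin(2 * math.pi * t / bob_period)))
            out.append(np.roll(frame, dy, axis=0))  # crude vertical bob (wraps at edges)
    return out
```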
But the person's going to be walking in the street, and they should be walking on a sidewalk. Shit. Can you shift the person over a little bit? Can we, like, fine-tune it to predict videos shot from the side? I want to look into this; it seems promising.

Oh, a friend of mine has a self-driving car company with a little model called VideoGen2 that's focused on self-driving data. They sell the data to Toyota. It really stresses me out to think about cars that are trained on the outputs of models that are trained on the outputs of cars. Seems like a step in the wrong direction. But yeah, the data is valuable. You need to hire minimum-wage people to go capture data so you can resell it. Oh, that's a good call. I see. Yeah, that's actually one of our plans, by the way. We're trying to use crypto to get people doing that, coming up very soon. Yeah. Like in Africa. Even Southeast Asia is too expensive. Yeah, OK.

So I'm going to continue to work on capturing the user's likeness so we can put it into a new environment. Got you, got you. Cool. And you can also take a look into the factory remix and the street view features if you have any ideas; we can talk about it. Otherwise, Fei will research those two things as well. I will add getting this driving model running to my bucket, because it seems fairly straightforward, and potentially a lot of reward for a small amount of work. Also, that's where you'd change the camera angle. We can probably fine-tune the model to be off to one side, because a lot of the physics, a lot of the internal consistency, stays the same.

Extending canvas. So for text-to-video and image-to-video we're just using the existing video model. I don't see the value of extend canvas, because everybody else can do that. Yeah, there's no value; it just brings us up to the average. Even if users can do it somewhere else, they may still need it here. It's just an add-on feature. Yeah, it's not a value feature. It's like how every face app can make the girl's eyes bigger, and every app can make your face whiter. So you have to have that, because everyone else has it. And no one else can add a hairband; only one guy can add a hairband. But if you can't make the face whiter, people are still not going to use it, because the face comes out dark. It's one of those features. It's not much value; we just want a working method to tie it together. Yeah. OK.

Yeah, I think if you can get 3D data, then instead of generating a video from car driving videos, you could just get the 3D data of the city. Yeah, I think we can do that. I'm still screen sharing. Oh, I'm glad I didn't visit anything embarrassing. Is there any, like, open source for this? Yeah, yeah. Download the Google Earth model of the city. Download. Or use the latest Microsoft Flight Simulator. I don't think that has the street-level view. Maybe if they had a Microsoft driving simulator. OK.

This is unpleasant, but it's potentially possible to download cities. I really don't like looking at people's YouTube videos, because then the man gets views, but this really should be a tutorial. Maps Models Importer. I see. Interesting. So how do I get the model? Chrome shortcut in the dock. Voidtools. OK, so you like to use Chrome. Google Maps. Check it. So zoom out. So OK, the key is this mysterious bootleg GitLab script, Maps Models Importer. Maps. Oh, god. Start. Oh, my god. Where's my potato thing? Wow. Holy crap, look at how this thing works. Are you reading this, Justin? It scrapes your GPU.
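A sketch of the "green screen" compositing step with the person shifted toward the sidewalk side, as discussed above. It assumes an RGBA cut-out of the scanned user already exists (for example, from the likeness-capture feature); the filenames, scale, and offsets are hypothetical placeholders, using Pillow.

```python
# Sketch: paste an RGBA cut-out of the scanned user over a generated street frame,
# shifted toward one side so they appear on the sidewalk rather than mid-road.
# The cut-out file, scale, and offsets are hypothetical placeholders.
from PIL import Image

def composite_person(background_path, person_cutout_path, out_path,
                     x_shift_frac=0.25, person_height_frac=0.6):
    bg = Image.open(background_path).convert("RGBA")
    person = Image.open(person_cutout_path).convert("RGBA")  # needs an alpha matte

    # Scale the person relative to the frame height.
    target_h = int(bg.height * person_height_frac)
    target_w = int(person.width * target_h / person.height)
    person = person.resize((target_w, target_h))

    # Shift off-center toward the sidewalk side, anchored at the bottom of the frame.
    x = int(bg.width * (0.5 + x_shift_frac)) - target_w // 2
    x = min(max(x, 0), bg.width - target_w)   # keep the cut-out inside the frame
    y = bg.height - target_h
    bg.alpha_composite(person, dest=(x, y))
    bg.convert("RGB").save(out_path)

# Usage sketch with hypothetical filenames:
composite_person("street_frame_000.png", "user_cutout.png", "composited_000.png")
```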
Where is my potato thing? Oh, it's behind your laptop. Wow. It scrapes the triangles out of your GPU. This is bullshit. This is not OK. It scrapes the GPU. It doesn't download a file; it scrapes the mesh out of your GPU. Cool. What lengths will you go to to get your data? Data is worth money, man. All right, investigate this. We'll add this. My recorder wasn't running; otherwise it would have saved an action item. I mean, we should add this to the action items.

Or just send somebody to each city with a drone and capture it. Yeah, how long would it take to traverse all of the streets in a city, in both directions, using a drone? I guess you could have multiple cameras. If the drone is light and flies very close to the ground, no one will be able to stop you. OK.

This is cool. This is unreasonable. What is RenderDoc? I'm still just overwhelmed by the bullshitness of this tool.

And you just go on some international Fiverr-type site where you contract a person, go to Eastern Europe and hire a random guy to just walk around with his phone and capture. If all you want is to be able to generate videos of Kazakhstan, this is a great idea. This is probably the one case where crowdsourced data doesn't work, because your crowd is all going to live in Indonesia and Africa. So you're going to get the capital of Ghana, small villages, and Jakarta. You'll have the world's best 3D model of Jakarta. Every person will be modeled. So if you have the capture pipeline figured out, then either fly there, go to these cities yourself, or hire somebody cheap to generate videos for you.

This thing is beautiful. We should definitely go try it. Here, we can go. Let's take a look.
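A back-of-envelope answer to the drone question raised above. Every number is an assumption for illustration only; the real street-network length, capture speed, and battery endurance would need to be checked.

```python
# Back-of-envelope estimate for "how long to traverse every street in a city, in both
# directions, with a drone?" All inputs are assumptions, not measured figures.
street_km          = 2000      # assumed total street length for one large city
both_directions_km = 2 * street_km
speed_kmh          = 20        # assumed low-altitude capture speed
battery_minutes    = 25        # assumed usable flight time per battery

flight_hours   = both_directions_km / speed_kmh
battery_swaps  = flight_hours * 60 / battery_minutes
days_at_8h_day = flight_hours / 8

print(f"{flight_hours:.0f} flight hours, ~{battery_swaps:.0f} battery swaps, "
      f"~{days_at_8h_day:.0f} eight-hour days for a single drone")
# With these assumptions: ~200 flight hours, ~480 batteries, ~25 working days for one
# drone; multiple drones or multiple cameras per pass scale this down roughly linearly.
```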