{ "hq": [ { "speaker": 0, "text": "Environment they want, and you're gonna be limited in changing the style unless you have a lot, unless you have examples of Japanese style and a model of what", "start": 0.08, "end": 13.535 }, { "speaker": 1, "text": "Japanese style is. So I think initially, you're not going", "start": 13.535, "end": 13.654999 }, { "speaker": 0, "text": "to be able to change the style until you have sufficient data.", "start": 13.654999, "end": 28.11 }, { "speaker": 2, "text": "Yeah. So it'd probably crawl. It's probably gonna crawl some data. And we could, in our data,", "start": 28.33, "end": 34.17 }, { "speaker": 1, "text": "how are", "start": 34.17, "end": 34.489998 }, { "speaker": 0, "text": "you gonna crawl? Where are you gonna get that 3D data?", "start": 34.489998, "end": 37.165 }, { "speaker": 2, "text": "Maybe not 3D. How about 2D? And", "start": 37.785, "end": 41.865 }, { "speaker": 0, "text": "even then", "start": 41.865, "end": 42.585 }, { "speaker": 1, "text": "you can't change the camera angle and all that", "start": 42.585, "end": 42.806538 }, { "speaker": 0, "text": "because it's 2D.", "start": 42.806538, "end": 50.18 }, { "speaker": 3, "text": "In Google Maps, people would meticulously go model their own cities.", "start": 51.199997, "end": 54.992 }, { "speaker": 1, "text": "At the peak, yeah. If you have data", "start": 54.992, "end": 55.12 }, { "speaker": 3, "text": "Well, yeah. At peak Google Maps, for large urban areas, and I think the models still exist. The KMZs were crazy. People would just randomly go out and model their own house and the two houses next to them because they thought it was fun. I need to look at how dense the dataset is. I remember for San Francisco and, like, Hong Kong, it was quite impressive. So you get, like, a full KMZ model of Hong Kong that you can go walk around in.", "start": 55.135998, "end": 86.21 }, { "speaker": 0, "text": "Yeah. 
If you can get that data, then you're good. The thing with AI models is there are a lot of 2D images available. And so that's why diffusion models have been successful for 2D images, but not as successful for movies and", "start": 86.99, "end": 103.87 }, { "speaker": 3, "text": "3D, because less data is available or training takes a lot longer.", "start": 103.87, "end": 104.94 }, { "speaker": 0, "text": "It's much easier if you just start with capturing your own data as a Gaussian splat 3D environment, offer that to your users, and then either buy the data from somewhere else or scan it yourself.", "start": 115.625, "end": 134.31 }, { "speaker": 3, "text": "Yeah. I'm thinking there's no camera feed, so you can't tell that I'm thinking. I'm gonna turn on my camera so that you guys, and Justin, can all see that I'm thinking.", "start": 138.61, "end": 147.535 }, { "speaker": 0, "text": "Your shoulder. Look, I'm thinking.", "start": 147.695, "end": 149.31787 }, { "speaker": 1, "text": "Well, what you", "start": 149.31787, "end": 150.41501 }, { "speaker": 3, "text": "want is to walk through an environment which is in the style of Tokyo. Right? Like, you don't want actual Tokyo. You want something which looks like Tokyo. I don't know how much data is needed to do this. Like, if you take a bunch of videos of Tokyo and label them all Tokyo, will the model learn", "start": 150.41501, "end": 177.43 }, { "speaker": 1, "text": "that it's Tokyo? It seems hard because Tokyo", "start": 177.43, "end": 177.55 }, { "speaker": 3, "text": "is much less well defined than, say, humanness. So I have no idea. Like, if you ask a model to show, like, a woman. Hang on. I have this capability.", "start": 177.55, "end": 200.345 }, { "speaker": 0, "text": "If you do it, I'm sure you could do 2D. 
Put that in your prompt.", "start": 201.125, "end": 204.965 }, { "speaker": 3, "text": "Alright. No one is... it'll come out. Or we can use my own app to generate some images at least.", "start": 204.965, "end": 210.105 }, { "speaker": 0, "text": "You know, image generation, well, no problem.", "start": 211.605, "end": 214.69 }, { "speaker": 1, "text": "But", "start": 214.69, "end": 214.85 }, { "speaker": 3, "text": "Yeah. I'm just curious what the behavior is like.", "start": 214.93001, "end": 216.546 }, { "speaker": 1, "text": "Downtown. So this", "start": 216.594, "end": 216.738 }, { "speaker": 3, "text": "is SDXL, a very reasonable model. Yeah. So this is what SDXL gives. Image of downtown Tokyo, right, high quality. It should do,", "start": 216.738, "end": 231.365 }, { "speaker": 1, "text": "like, street level, because you're talking about walking through it. It's fine. DSLR, 35mm, f/1.4. Should we add trending on ArtStation?", "start": 232.945, "end": 241.92 }, { "speaker": 3, "text": "Okay. Hey, Kelly. Is this downtown Tokyo? Mhmm. Like, the words, the bad text, and, like, the", "start": 244.62, "end": 263.595 }, { "speaker": 1, "text": "general... Kind of. Yeah. Kind of. I have no idea", "start": 263.595, "end": 263.70166 }, { "speaker": 2, "text": "what downtown Tokyo looks like. It can flow. It can flow, though. Yeah.", "start": 263.70166, "end": 269.63998 }, { "speaker": 3, "text": "So you can... Yeah. Okay.", "start": 269.63998, "end": 271.4 }, { "speaker": 1, "text": "You can", "start": 271.4, "end": 271.56534 }, { "speaker": 3, "text": "make an image of it, but then making a video or 3D is much harder. Okay. Okay. This is pretty decent. So you can learn concepts like this. Sadly, the animate endpoint has been down. Otherwise, we could animate it and see what we get. I mean, maybe the answer is to do it in two stages, right? What you want also depends on how long the videos are. 
Yeah, Kelly, how long do we want the videos to be? When do we go", "start": 271.56534, "end": 306.54 }, { "speaker": 1, "text": "to the", "start": 306.54, "end": 306.7 }, { "speaker": 3, "text": "Like, the walkthrough, like, if you want, show me a walkthrough of Tokyo or something, like, how long? I assume more than 6 seconds.", "start": 306.86, "end": 315.675 }, { "speaker": 2, "text": "Maybe 10 seconds?", "start": 316.055, "end": 317.01498 }, { "speaker": 3, "text": "10 seconds. Okay.", "start": 317.01498, "end": 317.815 }, { "speaker": 2, "text": "Later on it can be longer if a person pays more, or something like that.", "start": 317.895, "end": 321.195 }, { "speaker": 3, "text": "I mean, what might work is to just train a model to predict walks through cities, which is generally, like, a self-driving-type model, right, for which there are large amounts of", "start": 321.495, "end": 328.5709 }, { "speaker": 1, "text": "data, then generate the first", "start": 328.5709, "end": 328.69263 }, { "speaker": 3, "text": "image and then use your model to predict the motion. Like, the self-driving models are really good because there's so much data. You can get so much data. So for this, you would, like, crop the dash cam view to a more modest field of view, like, not a fisheye, then", "start": 328.69263, "end": 352.18234 }, { "speaker": 1, "text": "train it to predict the next frame. Could be interesting.", "start": 352.18234, "end": 352.30942 }, { "speaker": 3, "text": "You wanna put people in? Do you", "start": 352.30942, "end": 352.96 }, { "speaker": 0, "text": "wanna put yourself in?", "start": 360.795, "end": 362.33502 }, { "speaker": 3, "text": "Wait. 
Do you want yourself to be in the walkthrough?", "start": 362.55502, "end": 365.695 }, { "speaker": 2, "text": "Some of", "start": 366.475, "end": 366.795 }, { "speaker": 1, "text": "those, though.", "start": 366.795, "end": 367.45502 }, { "speaker": 2, "text": "So some of those could be, some of those don't have to be. It could be separate. It's because some", "start": 367.515, "end": 371.915 }, { "speaker": 4, "text": "to me.", "start": 372.155, "end": 372.815 }, { "speaker": 0, "text": "Like, the whole thing was you capture yourself, and then you can insert yourself into", "start": 373.26, "end": 378.22 }, { "speaker": 2, "text": "So that is wild. That is wild. That is a wild feature.", "start": 378.94, "end": 382.54 }, { "speaker": 0, "text": "You're gonna have to pick one, because who's gonna make all these features? How many, nine features? Who's gonna make them all?", "start": 382.54, "end": 392.245 }, { "speaker": 2, "text": "No. Not all of those features. Let me see. It's, so it's", "start": 392.465, "end": 395.785 }, { "speaker": 1, "text": "Do you", "start": 395.905, "end": 396.065 }, { "speaker": 0, "text": "have a team of programmers that you're assigning them to?", "start": 396.065, "end": 400.32498 }, { "speaker": 2, "text": "Oh, hey. You are Bailey, and that are personally healthy.", "start": 400.705, "end": 402.4684 }, { "speaker": 1, "text": "Okay. So I picked out one feature I'm tackling. It'll probably take me 3 months. The walkthrough, the", "start": 402.4684, "end": 404.4 }, { "speaker": 3, "text": "is something I can look into. This is actually an interesting thing. Let's go to Hugging Face.", "start": 414.615, "end": 418.63498 }, { "speaker": 0, "text": "Haley, you said you were gonna capture Gaussian splats?", "start": 418.775, "end": 421.675 }, { "speaker": 3, "text": "Yeah. But now it seems like you're splatting. Should I continue splatting? 
I can continue splatting.", "start": 422.29498, "end": 427.755 }, { "speaker": 0, "text": "Saying I'm splatting?", "start": 428.31, "end": 429.53 }, { "speaker": 3, "text": "Are you? I'm not sure. I think one of the useful things about this meeting is to decide. If we can figure it out, okay, I'll continue splatting, Lynn.", "start": 429.75, "end": 436.25 }, { "speaker": 0, "text": "I'm not, because I thought, yeah, you told me last time, like, the main feature was gonna be: capture 3D environments as Gaussian", "start": 436.31, "end": 444.319 }, { "speaker": 1, "text": "splats, combine them with a user who has recorded their likeness and is inserting themselves into 3D", "start": 444.319, "end": 444.54938 }, { "speaker": 0, "text": "environments, and being able to change the camera, time of day, weather,", "start": 444.54938, "end": 459.76 }, { "speaker": 4, "text": "and animate yourself. Okay. That's what you said to me last meeting. I'm trying to find, like, a, yeah,", "start": 460.86002, "end": 461.69 }, { "speaker": 3, "text": "OpenDriveLab slash Vista. Predict the future. Wait. This is a fun model we should go play with. Look at this object. It generates these nice, long videos of driving. So", "start": 470.445, "end": 485.6755 }, { "speaker": 1, "text": "could you generate a video and reverse it? To go backwards, or drive backwards? Drive backwards and then", "start": 485.6755, "end": 485.78586 }, { "speaker": 0, "text": "you put a person in front. You put yourself in front, like, you're walking.", "start": 485.78586, "end": 500.76498 }, { "speaker": 3, "text": "Oh god. So it's, like, green screen in, like, a bad movie from 1960. There's, like, an animation of a man running.", "start": 500.76498, "end": 506.465 }, { "speaker": 0, "text": "Yeah. 
Except it's you, because you've scanned your face.", "start": 508.925, "end": 513.425 }, { "speaker": 3, "text": "This may be a very challenging model,", "start": 513.96497, "end": 515.23334 }, { "speaker": 1, "text": "demo. All you", "start": 515.26, "end": 515.33997 }, { "speaker": 3, "text": "can are. Is it only streets? I mean, you only get streets, but check it out, Kelly.", "start": 515.33997, "end": 534.265 }, { "speaker": 1, "text": "These are pretty good streets. Okay. Okay. Like, these are generated by a model which is only trained", "start": 534.265, "end": 534.409 }, { "speaker": 3, "text": "to predict driving videos.", "start": 534.409, "end": 542.39 }, { "speaker": 2, "text": "I see. I see. How does it work?", "start": 542.77, "end": 545.19 }, { "speaker": 3, "text": "We can go add some, like, additional bouncing up and down to make it feel like walking.", "start": 546.29004, "end": 550.85004 }, { "speaker": 2, "text": "Slower.", "start": 550.93, "end": 551.43 }, { "speaker": 3, "text": "Yeah.", "start": 551.57, "end": 552.07 }, { "speaker": 0, "text": "But it's only in the street. The person's gonna be walking on", "start": 554.945, "end": 558.86505 }, { "speaker": 3, "text": "the street when they... yeah. Yeah. Walking on the sidewalk. Shit. Uh-huh. Yeah. The person... Can we, like, fine-tune it to predict videos that were shot from the side?", "start": 558.86505, "end": 570.75214 }, { "speaker": 5, "text": "I'll look into this, but this seems promising. 
Oh, there's a friend of mine, a self", "start": 570.75214, "end": 574.01 }, { "speaker": 2, "text": "driving car company, that has a little model called VideoGen 2 that is focused on self-driving data.", "start": 580.965, "end": 587.705 }, { "speaker": 3, "text": "I'm", "start": 588.085, "end": 588.48505 }, { "speaker": 2, "text": "So the data to Toyota.", "start": 588.565, "end": 590.025 }, { "speaker": 3, "text": "It really stresses me out to, like, think about cars which are trained on the outputs of models which are trained on the outputs of cars. Seems like a step in the wrong direction. Yeah. The data is valuable. Mhmm. You need to hire minimum", "start": 590.16504, "end": 599.68 }, { "speaker": 0, "text": "wage people to go capture data so you could", "start": 609.655, "end": 613.575 }, { "speaker": 3, "text": "resell it. Oh, it's good. Cool. I see. Yeah. Like,", "start": 613.575, "end": 613.81946 }, { "speaker": 1, "text": "that's a long-term plan, by", "start": 613.81946, "end": 613.9528 }, { "speaker": 0, "text": "the way.", "start": 613.9528, "end": 614.47504 }, { "speaker": 2, "text": "Like, we're trying to use crypto to have people doing that upcoming virus too. Yeah. Like, Africa, even Southeast Asia is too expensive.", "start": 614.615, "end": 625.76 }, { "speaker": 0, "text": "K. So I'm gonna continue to work on capturing the user's likeness to be able to put", "start": 637.375, "end": 640.5751 }, { "speaker": 1, "text": "it into a new environment. Gotcha. Gotcha. Cool. And you found him he could", "start": 640.5751, "end": 641.635 }, { "speaker": 2, "text": "into the factory mix and the street view. If you have any ideas, we can talk about it. Otherwise, we will research into those two", "start": 650.79, "end": 660.14996 }, { "speaker": 3, "text": "as well. 
I will add to my bucket getting this driving model running, because it seems fairly straightforward and seems like potentially a lot of reward for a", "start": 660.14996, "end": 669.295 }, { "speaker": 1, "text": "small amount of work.", "start": 669.295, "end": 669.975 }, { "speaker": 2, "text": "Change the camera angle.", "start": 670.095, "end": 671.395 }, { "speaker": 3, "text": "We can probably fine-tune the model to be, like, off to one side, because a lot of the physics, a lot of the, like, internal consistency, is the same.", "start": 671.615, "end": 679.20996 }, { "speaker": 2, "text": "Extending canvas. So for text-to-video and image-to-video, we're just using", "start": 679.82996, "end": 685.11 }, { "speaker": 0, "text": "the existing video model. I don't see the value of extending the canvas, because everybody else can do that.", "start": 685.11, "end": 691.365 }, { "speaker": 2, "text": "There's no value. It's just, like, to bring us up to the average. Even if I'm doing it at a place, they may need it. It's just, like, a known feature. Yeah. It's not a value feature. It's like, every official app can make the girl's eyes bigger, and everyone is like, I wish I could make your face wider, so you have to have that because everyone else has it. But, like, if no one else can add a hairband, and only one guy can add a hairband but it can't make the face wider, people are still going to, like, not use it because, like, the face is dark. It's, like, each of those features. Yeah. It's not much value. We just want to find a working method, like, just, like, tie it together. Yeah. Okay.", "start": 691.605, "end": 736.7 }, { "speaker": 0, "text": "Yeah. I think if you can get 3D data, instead of generating a video using car driving videos, you could just get the 3D data of a city.", "start": 736.7, "end": 750.565 }, { "speaker": 3, "text": "Yeah. I think we can do it for, store sharing. 
Oh, I mean, I didn't visit anything in Paris.", "start": 751.825, "end": 760.085 }, { "speaker": 0, "text": "Any, like, options?", "start": 761.22, "end": 762.27997 }, { "speaker": 3, "text": "Yeah. Yeah. Download the Google Earth", "start": 762.66, "end": 764.81995 }, { "speaker": 1, "text": "model of a city.", "start": 764.81995, "end": 765.22797 }, { "speaker": 3, "text": "Downloading. Or use the latest Microsoft Flight Simulator? Could... I don't think it has the street-level view, it sounds like. Maybe if they have a Microsoft driving simulator. Okay. This is unpleasant, but it's potentially possible to download cities. I really don't like looking at people's YouTube videos because then the man gets views, but this really shouldn't be a tutorial. Maps Models Importer.", "start": 765.22797, "end": 803.76373 }, { "speaker": 1, "text": "I see. Interesting. So how do I get the model?", "start": 803.76373, "end": 807.035 }, { "speaker": 5, "text": "Chrome shortcut under dot. Roid tools. Okay.", "start": 816.83997, "end": 820.312 }, { "speaker": 3, "text": "So you'd like to use Chrome data.", "start": 820.312, "end": 820.888 }, { "speaker": 1, "text": "Google Maps. Okay. Check it. So zoom out. So okay. The key is this mysterious bootleg GitLab script. Maps Models", "start": 820.888, "end": 834.53503 }, { "speaker": 5, "text": "Importer. Oh my god. Where's my potato thing?", "start": 846.27997, "end": 847.52 }, { "speaker": 1, "text": "Wow. Holy crap. Look at how this thing works.", "start": 847.52, "end": 848.63 }, { "speaker": 3, "text": "Are you reading this, Justin? It scrapes your GPU.", "start": 861.195, "end": 864.575 }, { "speaker": 2, "text": "My potato thing.", "start": 865.03503, "end": 866.97504 }, { "speaker": 3, "text": "So it's behind your laptop. Okay. Wow. It scrapes the triangles out of your GPU. This is bullshit. This is not okay. This is not okay. What are you? It scrapes the, it doesn't download the file. 
It scrapes the mesh out of your GPU.", "start": 867.275, "end": 884.69 }, { "speaker": 0, "text": "Cool. What lengths will you go to to get your data?", "start": 888.84503, "end": 891.825 }, { "speaker": 3, "text": "Data's worth money, man. Yeah. Alright. Investigate this. I usually was not recording. Otherwise, it would be an action item. I mean, we should add this to the action items. Or just", "start": 891.885, "end": 905.11664 }, { "speaker": 1, "text": "send somebody to each city with a drone", "start": 905.11664, "end": 905.22327 }, { "speaker": 5, "text": "and capture it. Yeah. How long would it, how long", "start": 905.22327, "end": 906.14996 }, { "speaker": 3, "text": "does it take to traverse all of the streets in a city in both directions using a drone? I guess you can have, like, multiple cameras. If the drone is light and flies very close to the ground, no one will be able to stop you. Okay. This is cool. This is", "start": 915.015, "end": 931.79004 }, { "speaker": 1, "text": "not unreasonable.", "start": 935.29004, "end": 935.87 }, { "speaker": 3, "text": "What is RenderDoc? I'm just, like, overwhelmed by the bullshitness of this tool.", "start": 936.89, "end": 942.28503 }, { "speaker": 0, "text": "And you just go on some, like, international Fiverr-type site where you contract a person. So go to, like, Eastern Europe, and hire a random guy to just walk around with his phone and", "start": 943.385, "end": 957.08997 }, { "speaker": 3, "text": "In Houston? Yeah. Yeah. If all you wanna do is generate videos of Kazakhstan, this is a great idea. This is probably the one thing where, like, crowdsourced data doesn't work, because your crowd's all gonna live in Indonesia and Africa. So you're gonna get, like, the capital of Ghana, small villages, and Jakarta. Like, you'll have the world's best 3D model of Jakarta. Like, every person will be modeled.", "start": 958.05, "end": 984.44 }, { "speaker": 1, "text": "Yeah. Yeah. 
So if you have a capture pipeline", "start": 985.06, "end": 986.19904 }, { "speaker": 0, "text": "figured out, then just... yeah. Either fly, go to these cities yourself, or hire somebody cheap to generate videos for you. This thing is beautiful. We should definitely go try it. Here, we can go. Let's take a look.", "start": 986.19904, "end": 1004.955 } ] }