gothhabiba https://gothhabiba.tumblr.com/post/719672953828147200/id-first-image-is-an-anonymous-ask-reading-oh-i :
tim-official https://tim-official.tumblr.com/post/719653063051378688/yeah-this-is-a-good-point-ai-literally-doesnt :
resumbrarum https://resumbrarum.tumblr.com/post/719652453546672128/honestly-i-think-part-of-the-problem-is-that :
tim-official https://tim-official.tumblr.com/post/719620304406822912/this-post-had-over-10k-notes-and-lots-of-people-in :
tim-official https://tim-official.tumblr.com/post/719399441820434432/all-the-frothing-at-the-mouth-posts-about-how :
all the frothing-at-the-mouth posts about how “don’t you dare put a fic writer’s work into chatGPT or an artist’s work into stable diffusion” are. frustrating
that isn’t how big models are made. it takes an absurd amount of compute power and coordination between many GPUs to re-train a model with billions of parameters. they are not dynamically crunching up anything you put into a web interface.
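to make that distinction concrete, here's a toy sketch (assuming PyTorch, with a tiny linear model standing in for an actual LLM) of the difference between running a model on your input and actually training it. using the model doesn't touch the weights; training is a separate, deliberate process run over an enormous dataset:

```python
# Rough sketch of the difference between *using* a model and *training* it.
# Assumes PyTorch; the tiny linear model is just a stand-in for an LLM.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)          # stand-in for a multi-billion-parameter model
prompt = torch.randn(1, 8)       # stand-in for "a fic pasted into a chat box"

# Inference: what happens when you use a web interface.
# No gradients, no optimizer, and the weights do not change.
before = model.weight.detach().clone()
with torch.no_grad():
    output = model(prompt)
assert torch.equal(before, model.weight)    # the model is exactly the same afterwards

# Training: a separate process, with an optimizer updating the weights each step.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
target = torch.randn(1, 8)
loss = nn.functional.mse_loss(model(prompt), target)
loss.backward()
optimizer.step()                             # only now do the weights actually move
assert not torch.equal(before, model.weight)
```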
chances are, if you have something published on a fanfic site, or your art is on deviantart or any publicly available repository, it’s already in the enormous datasets that they are using to train. and if it isn’t in now, it will be in the future: the increases in performance from GPT-2 to 3 to 4 were gained not through novel machine-learning architectures or anything but by ramping up the amount of training data by orders of magnitude. if it can be scraped, just assume it will be. if you’re an artist you can make your work harder to use for training with Glaze (https://glaze.cs.uchicago.edu/), but for the written word there’s nothing you can do.
not to be cynical but the genie is already far more out of the bottle than most anti-AI people realize, i think. there is nothing you can do to stop these models from being made and getting more powerful. only the organizing power of labor has a shot at mitigating some of the effects we’re all worried about
this post had over 10k notes and lots of people in the replies getting very angry and panicky and threatening imaginary bad actors and begging people not to put their fics into chatgpt. the reply is authoritatively saying “anything that is given to AI it can use later to draw from.” no source! like - i don’t know if they save your prompts. they probably do for some other nefarious purposes. but:
these are the sizes of the training sets used to train gpt-3. as a rule of thumb in natural language processing, one English word is on average about 1.3 tokens. the common crawl dataset alone works out to around 300 billion words; for gpt-3 they don’t even manage to use all of it. this is the scale of the data they need. they are not re-training their model with the little prompts you put in, and even if they did, it’s like… a drop of water in the ocean. it’s not gonna have an effect on how the model behaves.
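for a sense of proportion, the back-of-envelope arithmetic looks something like this (the 300-billion-token figure is from the table above; the fic length and tokens-per-word ratio are just illustrative assumptions):

```python
# Back-of-envelope: how big is one pasted-in fic next to the GPT-3 training mix?
# The training-run size comes from the table above; the fic size and
# tokens-per-word ratio are illustrative assumptions.
TOKENS_PER_WORD = 1.3          # common rule of thumb for English text
fic_words = 300_000            # a very long fanfic
fic_tokens = fic_words * TOKENS_PER_WORD

training_tokens = 300e9        # tokens actually seen while training GPT-3

share = fic_tokens / training_tokens
print(f"one {fic_words:,}-word fic is ~{fic_tokens:,.0f} tokens")
print(f"that's {share:.10f} of the training run, i.e. about 1 part in {1/share:,.0f}")
# -> roughly 1 part in 770,000, and that's assuming it gets into the data at all
```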
i think people are, on a gut level, still understanding these models as “collage machines.” they’re not. they are not borg-assimilating all your best ideas from your fics to frankenstein them back together. they are statistical models. they are compressing gargantuan amounts of data down into smaller (still huge, but much smaller) models of that data by looking at trends and likelihoods and repetitions. i’m not saying you’re a great person if you use gpt to autocomplete old fics but even if they were for some reason adding your prompts to their datasets, it’s not gonna have an effect.
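if it helps, here’s the general shape of “a statistical model of text,” in toy form. real LLMs are neural networks, not lookup tables, but the point is the same: what gets stored is statistics about the text, not the text itself:

```python
# Toy illustration of "statistical model of text" vs "collage machine".
# Real LLMs are neural networks, not lookup tables -- this just makes the
# point that the model stores statistics about text, not the texts themselves.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count which word tends to follow which word.
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

# The original sentences are gone; all that's left are counts.
print(dict(follows["the"]))   # {'cat': 1, 'mat': 1, 'dog': 1, 'rug': 1}

# "Generation": repeatedly sample a likely next word.
word, out = "the", ["the"]
for _ in range(6):
    choices = follows[word]
    word = random.choices(list(choices), weights=choices.values())[0]
    out.append(word)
print(" ".join(out))
```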
the culture on here about anti-ai stuff has approached, like, mythology - making up shit about what they can do, talking about how scary they are, ghost stories, moral panic. this wild overstatement about what they can do only benefits the companies selling them, and those trying to use them as a pretext to undermine labor.
Honestly, I think part of the problem is that we’ve allowed the companies shilling these models to call them “AI” with relatively little pushback. Remember when one- and two-wheeled personal conveyors - like a Segway without the handles - were rebranded as “hoverboards” in 2015 as a Back to the Future reference? It’s the same thing. And the problem here is that just like hoverboards don’t hover, “AI” isn’t intelligent. They’re just statistical learning models with sophisticated outputs.
But allowing the companies to own the branding on them, and allowing that branding to be “AI”, invokes all the science fiction we’ve ever read. If you’ve been on TV Tropes for ten seconds you’ve seen the “AI Is a Crapshoot” page (https://tvtropes.org/pmwiki/pmwiki.php/Main/AIIsACrapshoot), and that’s kind of how society is treating these tools, when, honestly, they’re just web scrapers - fundamentally the same web scrapers people have been using for decades - and statistical models.
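For what it’s worth, the scraping step really is decades-old technology. A minimal sketch, using requests and BeautifulSoup (the URL is just a placeholder):

```python
# Minimal sketch of the data-collection step: fetch a public page, strip the
# markup, keep the text. This is decades-old technology; the only thing new
# is the scale at which it's done. The URL is just an example placeholder.
import requests
from bs4 import BeautifulSoup

def scrape_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):   # drop non-prose elements
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

print(scrape_text("https://example.com")[:200])
```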
yeah, this is a good point. “AI” literally doesn’t mean anything.
It has referred to a range of different technologies since the 50s, some of them involving no machine learning at all. I forget who coined it (it’s often attributed to Larry Tesler), but there’s a lovely quote about how AI is just “whatever problem computers can’t totally solve yet”: as soon as it’s considered acceptably solved, the moniker moves on to the next big thing. (example: voice recognition systems, like the ones you talk to when calling tech support. didn’t use to be a thing! used to be fancy and unreliable! now totally invisible, taken for granted)
[ID: First image is an anonymous ask reading “oh I wasn’t aware it was feeding the AI. I’ve inserted hundreds of fics into ChatGPT for their continuation or for a different plot within the same context just for fun and out of curiosity… but I’ve never posted any of them….” the response is “Indeed, anything that is given to AI it can use later to draw from. That’s why it doesn’t matter if you post them or not as it now has access to those writers’ texts without their permission.” Second image shows a table charting five “Datasets” against their “Quantity (tokens),” “Weight in training mix,” and “Epochs elapsed when training for 300B tokens.” The table is labelled “Table 2.2: Datasets used to train GPT-3.” The largest dataset is “Common Crawl (filtered),” with 410 billion tokens, a weight of 60%, and 0.44 epochs elapsed. The other datasets are WebText2 (19 billion tokens), Books1 (12 billion), Books2 (55 billion), and Wikipedia (3 billion). The table’s caption reads: “Weight in training mix refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.” end ID]
all this said, it’s still fucking gross to contemplate someone reading your 300k-word fanfic and deciding they want a sequel so they put it into chatGPT and somehow think whatever they’re getting out of that is going to be in any way related to what you had intended to say.