the learning machine (archived post)

Multiple times in my series on digital citizenship, I’ve referred to “Machine Learning models.” I want to unpack what they are and why they affect the ways we behave online.

I’ve written about this topic before, but I’d like to start by summarizing what Machine Learning is, perhaps from a different perspective than that article took.

What is Machine Learning?

Long before computers were powerful enough to attempt it, we dreamt of programs that could make intelligent decisions. One common term for this is “Artificial Intelligence” (though I prefer “Synthetic Intelligence,” which avoids the misconceptions people often attach to AI).

To allow machines to successfully mimic intelligence, we need to teach them to learn. Computer scientists have built systems called “neural networks,” loosely inspired by how the brain makes connections between concepts, which have been a significant step toward computers that can learn. Once trained, the finished program is known as a “Machine Learning model,” which can be reused in future scenarios to solve a common problem.

To create a Machine Learning model, we give a neural network many example questions together with their answers and ask it to work out the connection between the two. For example, we might plug in health data, labeling each file with whether or not the person involved developed cancer. The computer could then find relationships in that data that help us detect cancer sooner.
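To make the “question and answer” idea concrete, here is a minimal sketch in Python. It is not a real neural network or a medical tool: it trains a single artificial neuron (logistic regression) on six invented data points, and every name and number in it is an assumption made up for illustration.

```python
import math

# Made-up training data: each "question" is a pair of measurements and
# each "answer" is 1 (condition developed) or 0 (it did not).
# These numbers are invented for illustration only.
QUESTIONS = [(0.2, 0.1), (0.4, 0.3), (0.3, 0.5), (0.8, 0.9), (0.9, 0.6), (0.7, 0.8)]
ANSWERS = [0, 0, 0, 1, 1, 1]


def predict(weights, bias, features):
    """Return the model's estimated probability that the answer is 1."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-total))


def train(questions, answers, steps=2000, learning_rate=0.5):
    """Nudge the weights after every example so predictions move toward the known answers."""
    weights = [0.0] * len(questions[0])
    bias = 0.0
    for _ in range(steps):
        for features, answer in zip(questions, answers):
            error = predict(weights, bias, features) - answer
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(QUESTIONS, ANSWERS)

# The trained "model" is just these numbers; it can now be asked new questions.
low_risk = predict(weights, bias, (0.25, 0.2))
high_risk = predict(weights, bias, (0.85, 0.8))
print(f"low-risk example: {low_risk:.2f}, high-risk example: {high_risk:.2f}")
```

Nobody tells the program which measurements matter. It only sees examples with known answers and adjusts its weights until its guesses line up with them, which is the same idea that real models apply at a vastly larger scale.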

Uses of Machine Learning

While there are amazing uses of Machine Learning, such as detecting when patients are at risk of cancer, there are also far less altruistic uses. In fact, training Machine Learning models is one of the primary reasons websites collect so much of your data.

Google

Take YouTube (owned by Google), for example. YouTube collects detailed data about whether or not people play the next videos recommended to them. YouTube can then train a Machine Learning model on that data, so that it learns which recommendations are most likely to get you to watch another video. The underlying goal is to keep you watching so that you see as many advertisements as possible. Most social media websites use Machine Learning for similar purposes.

Gmail has recently begun using Machine Learning for a novel purpose: suggesting replies to emails. I’ve found it disturbingly accurate at times, matching my voice extremely well.

Here’s my problem with that: it means Google is letting a script read and process all of my email, trying to learn how to mimic me. I find that amount of power disturbing. I may have been fine with Google having my emails, but I was not expecting them to use them without asking.

GitHub/Microsoft

Disturbing uses of Machine Learning abound; GitHub (owned by Microsoft) recently launched Copilot, which is able to code for you, given instructions.

Now, GitHub may hold more code than any other site; most open source projects are hosted on it, and many large organizations use it for collaboration. To get data to train Copilot, GitHub fed all public code on the site into a neural network.

This is primarily concerning to me for copyright reasons; even though most of this code was open source, many open source licenses don’t simply allow use for any purpose. For example, my preferred license for large projects, the GNU General Public License or GPL, requires anyone who distributes software built from my code to release it as open source under the same license. Linux, one of the most successful Open Source projects ever, is able to thrive in part due to its GPL-2.0 license, which requires those who distribute modified versions of the kernel, as the makers of Ubuntu, Fedora, and Android do, to publish their kernel changes under the same license.

Copilot, on the other hand, uses all of the code from public projects without paying attention to the license. While the output of Machine Learning models may not technically count as copyright infringement in most nations, this practice certainly violates the spirit of Open Source. Code from Verbose Guacamole, my Free and Open Source novel editor, might show up in some nonfree software because Copilot made it available.

There are other cans of worms related to Copilot, including that it sometimes seems to reproduce code verbatim rather than only learning from it. But even the use of my code to train a model disappoints me, and it has me seriously considering moving all my own code elsewhere.

Data

The crucial piece of a Machine Learning model is data: generally, the more data you have, the more accurate the results will be. On the other hand, processing more data takes the neural network more time and, by extension, more money. This limits the most effective business solutions to big companies:

The more customers a company has, the more data they can collect.
The more funding a company has, the more data they can afford to process.
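The trade-off between data and accuracy can be seen even in a toy setting. The sketch below is not a neural network: it simply estimates one made-up number (a hypothetical 30% click rate) from simulated users and compares how far off the estimate is with 10 users versus 1000. The names and the click rate are assumptions for illustration.

```python
import random

TRUE_CLICK_RATE = 0.3  # hypothetical: the real chance that a user clicks


def estimate_click_rate(sample_size, rng):
    """Observe `sample_size` simulated users and return the share who clicked."""
    clicks = sum(1 for _ in range(sample_size) if rng.random() < TRUE_CLICK_RATE)
    return clicks / sample_size


def average_error(sample_size, trials=200, seed=0):
    """Average distance between the estimate and the truth over many trials."""
    rng = random.Random(seed)
    errors = [
        abs(estimate_click_rate(sample_size, rng) - TRUE_CLICK_RATE)
        for _ in range(trials)
    ]
    return sum(errors) / trials


small_data_error = average_error(10)
big_data_error = average_error(1000)
print(f"10 users per estimate:   average error {small_data_error:.3f}")
print(f"1000 users per estimate: average error {big_data_error:.3f}")
```

The larger sample gives a noticeably closer estimate, but it also means 100 times as many observations to collect and process. Real models are far more complicated, and more data does not help if the data is poor, but the basic pull toward "collect more" is the same.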

Machine Learning has been a viable tool in businesses’ toolboxes for years now, and they’ve honed some strategies to get hold of more data:

Give a service away! — If a company allows people to use their product for free with an account, they attract more users. The data from all these users can then train machine learning models, helping them design their products in a way that traps users’ attention, increasing the number of ads seen, driving up revenue. A common saying in our digital age is this: “If you aren’t paying for a service, you are the product.” You’re paying with your time and attention, and eventually by clicking on ads.
Use third-party tools — Often, businesses will use tools such as Google Analytics to help them track their users more efficiently. The massive amount of data at Google’s disposal means that they can help websites find out more about their customers. On the other hand, this strategy means that Google (or whatever company is behind the analytics) is also receiving your data.

On a positive note, I read that data protection authorities in several European Union countries, beginning with Austria, have ruled that websites using Google Analytics violated the GDPR because of how the service transfers personal data to the United States. I’m hoping this will cause change in both directions: Google will mend some of its policies, and fewer websites will use its services.

Metadata — Even services that “encrypt” your messages can still collect data about you. WhatsApp, a chat program owned by the-company-I-refuse-to-refer-to-as-Meta, encrypts messages but still takes your metadata (perhaps the name “Meta” is apt after all—Meta takes your metadata). This means Facebook stores and uses all the data about who you’re chatting with and when, your phone number, and more, even though they can’t access the actual content of your messages. Under the false guise of a private platform, WhatsApp gains users and data for Facebook to sell to companies and hone their ads.
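To illustrate how much metadata alone can reveal, here is a small sketch in Python. The record layout and names are hypothetical and are not WhatsApp’s actual data format; the point is only that the analysis never touches the encrypted message body.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MessageRecord:
    """Hypothetical record a chat service might keep for each message."""
    sender: str
    recipient: str
    hour_sent: int     # 0-23, visible to the service
    ciphertext: bytes  # encrypted body, unreadable to the service


RECORDS = [
    MessageRecord("alice", "bob", 23, b"\x8f\x1a\x07"),
    MessageRecord("alice", "bob", 23, b"\x02\xd4\x5c"),
    MessageRecord("bob", "alice", 0, b"\x9b\x33\xe1"),
    MessageRecord("alice", "clinic", 9, b"\x41\x0c\x7e"),
]


def contact_counts(records):
    """Count messages per pair of people, ignoring direction and content."""
    return Counter(frozenset((r.sender, r.recipient)) for r in records)


def late_night_senders(records):
    """List people who sent messages between 22:00 and 04:59."""
    return sorted({r.sender for r in records if r.hour_sent >= 22 or r.hour_sent < 5})


counts = contact_counts(RECORDS)
closest_pair = counts.most_common(1)[0]
night_owls = late_night_senders(RECORDS)
print(sorted(closest_pair[0]), closest_pair[1])
print(night_owls)
```

Without decrypting anything, this already shows who talks to whom most often, who is awake late at night, and that one person contacted a clinic. That kind of information is useful input for ad targeting.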

Not all data used in Machine Learning models is user data, but models using user data are often the most harmful. Google takes data from reCAPTCHA, for example, and uses it to train their self-driving cars. That’s not directly harmful to users, though it does incentivize Google to push a false view of reCAPTCHA’s actual usefulness. For users, the more harmful aspect is that reCAPTCHA lets Google see which sites you visit wherever it is installed, tracking what you’re interested in so they can train their Machine Learning models to give you better ads and waste your money.

The Good

Despite all this, Machine Learning is a very good thing. I don’t want to leave you with a wrong impression, so I’ll finish with some good things Machine Learning has helped us achieve.

Self-driving cars — While self-driving cars are not yet used in any significant capacity, the future is bright. I’m enthusiastic that self-driving technology will help change the world for the better over the next few decades. (I might try writing an article about self-driving technology and some misconceptions about it.)
Autocomplete and autocorrect — Typing on my phone would be an awful experience without the wonders of autocorrect, despite some embarrassing mistakes it makes with certain tpyos.
Dictation software — It’s an amazing thing that my computer can now listen to me talking and understand what I’m saying. Sure, there are often abuses of privacy related to this, but overall it’s an excellent innovation when used properly.
Generative ML — I’m always intrigued when Machine Learning models learn to create something new. Sometimes it’s a story, sometimes a song, sometimes art, and sometimes something I never imagined was possible. Again, there are some concerning implications, but I’m glad to see the direction this kind of software is taking. I’ve done some experimenting myself in this area, and am currently preparing for a large hobby project in it.
Grammar correction — Grammarly is an amazing platform, and it uses Machine Learning. I hope that one day there will be a similar open source alternative, but until then I am extremely impressed and use it on my blog posts.

I could go on, but I’ll cut the list off there for now. I don’t want to spread FUD about Machine Learning; it really is an amazing technology, and I’m immensely excited to see where it takes us in the coming decades, despite its negative privacy consequences.

Conclusions

Machine Learning is an amazing field of software that allows computers to learn in ways that mimic our own remarkable brains. Its positive uses are overwhelming, but as conscientious digital citizens we need to be aware of the ways companies use it against us:

Machine Learning incentivizes companies to take more of your private information so they can build more accurate predictions of your behavior.
Companies use these predictions to discover new ways to waste your time and take your money.
Machine Learning is most effectively used in the hands of big companies, making it hard for small businesses, especially ones that respect the rights of their users, to compete.
Machine Learning raises serious questions about copyright that currently go unanswered. I’m planning to talk about my views on copyright law soon (in short, I think it needs massive reform), but I still believe in protecting the rights that the government gives to authors, even if it gives too many.

Remember those issues with Machine Learning, because it’s important to know what companies are doing wrong so that you know what to avoid. But never forget that there are plenty of companies doing amazing things with it. It’s our responsibility to make sure companies do the right thing and use Machine Learning responsibly.

The future is bright, but the box says “Some Assembly Required.”

Application

First off, I recommend you do some of your own research on Machine Learning. I’m not an expert, and it’s perfectly possible I’ve made some incorrect inferences. Look at the information available and decide for yourself what you think about Machine Learning.

If you want to try making a Machine Learning model yourself, with no coding knowledge required, Google made an amazing tool named Teachable Machine that’s worth checking out.

And yes, I am recommending a Google product. Not everything they make is bad.

Perhaps that’s the moral of this article: not everything is bad.