Movies are a great tool for learning English. If you type ‘best movies to learn English’ into Google, countless recommendations will show up. Most of those recommendations come from personal choices, which is good; however, as a data science enthusiast, I turn to data to answer questions. This time is no exception, so I’ll analyze 3000 movies to answer the question, ‘What are the best movies to learn English?’

To answer this question, we might ask for help from a huge movie fan who watches tons of movies regularly. Such people may be hard to find; however, movie dialogues transcribed by fans are available online. Once I had collected the transcripts, I could have read them one by one to find the best options, but reading 3000 transcripts would take months. Fortunately, there are tools like CountVectorizer that can tokenize each transcript for us. Tokenization breaks the raw text into words called tokens. For example, Arnold Schwarzenegger’s famous line ‘Hasta la vista, baby’ in the movie Terminator 2 can be split into four tokens: ‘hasta’, ‘la’, ‘vista’, ‘baby’. In this way, it’s possible to build a table that tells us the vocabulary used in each movie. This is easily explained in the following picture.

Each cell represents the number of times a word was mentioned in a movie. In the table above, the term ‘able’ was mentioned once in Movies 1 and 3 but not at all in Movie 2, so the value assigned there is 0. That’s how we find all the words said in each movie, but to determine the best movies to learn English, we also need to know the vocabulary level used in the movies’ dialogues.
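The counting table can be sketched in a few lines of Python. This is a minimal standard-library stand-in for what CountVectorizer does, not the code behind the analysis: the names `tokenize` and `count_table` and the one-line ‘transcripts’ are made up for illustration, and the token pattern copies CountVectorizer’s default of keeping words with two or more characters.

```python
import re
from collections import Counter

# Words of 2+ characters, mirroring CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_table(transcripts):
    """Return (vocabulary, rows) with one row of word counts per movie."""
    counts = {title: Counter(tokenize(text)) for title, text in transcripts.items()}
    # The columns are every distinct word seen in any transcript, sorted.
    vocabulary = sorted(set().union(*counts.values()))
    rows = {title: [c[word] for word in vocabulary] for title, c in counts.items()}
    return vocabulary, rows


# Hypothetical one-line "transcripts", for illustration only.
transcripts = {
    "Movie 1": "I was able to see the ship.",
    "Movie 2": "Hasta la vista, baby.",
    "Movie 3": "You are able to fly.",
}

vocabulary, rows = count_table(transcripts)
print(tokenize("Hasta la vista, baby"))
print(vocabulary)
for title, row in rows.items():
    print(title, row)
```

With real data, each transcript would be a full movie script and the table would have thousands of columns, which is why a sparse representation like the one CountVectorizer returns is the practical choice.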

Vocabulary is one of the most critical factors in achieving enough comprehension of movies, so this analysis aims to determine the movies covering most of the vocabulary at the English levels A1-A2 (A) and B1-B2 (B), according to Oxford’s word lists. For the analysis, I used the Oxford 5000 list, which classifies the words every English learner should know by level, from A1 to C1.
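One way to read ‘vocabulary coverage’ is the share of words in a dialogue that appear in a given word list. The sketch below follows that reading. It is an assumption, since the exact definition (for example, whether repeated words count every time, or how inflected forms are handled) is not spelled out here. The function name `coverage` and the two tiny word sets are hypothetical stand-ins and do not reflect the real Oxford level of any word.

```python
import re

# Lowercase words, optionally with an apostrophe part (e.g. "don't").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def coverage(transcript, known_words):
    """Share of word tokens in the transcript that belong to known_words."""
    tokens = TOKEN_PATTERN.findall(transcript.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Hypothetical word sets standing in for the A and B levels (not real Oxford data).
a_level = {"i", "the", "from", "of", "protect"}
b_level = {"threat", "invasion"}

line = "I protect the galaxy from the threat of invasion"

print(f"A coverage:   {coverage(line, a_level):.1%}")
print(f"A+B coverage: {coverage(line, a_level | b_level):.1%}")
```

Under this definition, the A+B coverage can never be lower than the A coverage, because the A+B word set contains every A-level word.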

Movies that often use A-level vocabulary in their dialogues would be easy to understand for learners with at least a B1 level, while movies with the highest A+B vocabulary coverage would be perfect for learners at C1 level. That being said, let’s find the top 50 movies for learners at B1 and C1 levels!

First, let’s dive into the criteria I’m using to compare the 3000 movies.

Which movie is easier to understand: ‘Titanic’ or ‘Toy Story’?

You might’ve heard that children’s movies like Toy Story are easier to understand than movies from other genres. We’ll check whether this is true by comparing the vocabulary covered in the movies Toy Story, Batman Begins, Harry Potter and the Goblet of Fire, Titanic and Stuart Little. The heatmap below shows the vocabulary coverage for the five movies.

In the first column, we see that the box for Toy Story has the lightest color, indicating that it has the lowest A-level vocabulary coverage (87.4%) among the five movies. Although Toy Story is meant for children, the character Buzz Lightyear tends to use unusual phrases throughout the movie, like this one:

I’m stationed up in the Gamma Quadrant of Sector Four. As a member of the elite Universe Protection Unit of the Space Ranger Corps, I protect the galaxy
from the threat of invasion from the evil Emperor Zurg, sworn enemy of the Galactic Alliance.

So it’s no surprise that Toy Story has the lowest A-level vocabulary coverage. The results also suggest that the movies with the highest coverage at the A and A+B levels are Titanic (89.8%) and Stuart Little (96.1%), respectively. That is, a learner at B1 level will find Titanic easier to understand than the other films listed, while Stuart Little would be the best movie to watch for people at C1 level. Also, note that when we add the C1 vocabulary in the third column, the coverage barely changes, since C1-level vocabulary is rarely used in movies.

If you checked the heatmap carefully, you might’ve noticed that the coverage values at the A and A+B levels aren’t far apart across movies (a 2.4% gap in the A column and 4.7% in the A+B column). However, the effect that even a 1% difference in vocabulary coverage has on English comprehension might surprise you.

Why is 1% coverage so relevant?

According to the National Center for Voice and Speech, the average conversation rate for English speakers in the United States is about 150 words per minute. Each percentage point of lower coverage means 1 more unknown word in every 100 words. That is, the 2.4% gap between Toy Story and Titanic in A-level vocabulary coverage indicates that Toy Story’s dialogues have roughly 3 to 4 more unknown words per minute than Titanic’s (2.4% of 150 is 3.6). The percentage may seem small, but it could determine whether you have fun watching a movie or a hard time trying to figure out what a character is saying.
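The arithmetic behind that estimate is simple enough to check by hand, and the short snippet below does it for any coverage gap. The function name is made up for illustration, and the 150 words per minute is the conversation rate cited above; real movie dialogue may be faster or slower.

```python
WORDS_PER_MINUTE = 150  # average conversation rate cited above


def extra_unknown_words_per_minute(coverage_gap_percent, words_per_minute=WORDS_PER_MINUTE):
    """Convert a coverage gap in percentage points into unknown words per minute."""
    return coverage_gap_percent / 100 * words_per_minute


# A 1% gap, then the 2.4% gap between Toy Story (87.4%) and Titanic (89.8%).
for gap in (1.0, 2.4):
    rate = extra_unknown_words_per_minute(gap)
    print(f"{gap}% gap -> about {round(rate, 1)} more unknown words per minute")
```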

Now it’s time to show you what movies cover the highest percentage of the A and B levels!

The best movies for people at B1 level

Using Python, I computed the vocabulary coverage for all the movies collected, just as we did for the five movies above. The distribution of the A-level vocabulary coverage for the 3000 movies is presented in the following histogram.

According to the histogram above, many movies reach around 90% A-level vocabulary coverage, so people at B1 level would recognize about 90% of the words in those movies’ dialogues. However, as we said before, every percentage point of coverage matters, so it’s better to recommend the movies with the highest coverage (represented by the shortest bar on the right). Note that this bar contains about 200 movies with at least 93% coverage, but not all of them would be the best option for you (say, you just don’t like the movie’s genre or the movie doesn’t appeal to you). For that reason, I had to make a trade-off between vocabulary coverage and the movies’ popularity (based on IMDb ratings) to come up with the top 50 movies. The following are the 10 most popular movies with at least 93% A-level vocabulary coverage.

Are any of these movies good enough for you to watch?

The rest of the top 50 movies are listed at the end of the post, in case you want to check if your favorite movie is in it.

The best movies for people at C1 level

The second histogram shows that more than 1000 movies cover around 94% A+B-level vocabulary. However, the most attractive group of movies to watch at C1 level offers at least 96% coverage. The following are the 10 most popular movies with at least 96% A+B-level vocabulary coverage.

As you might expect, many movies already recommended for B1 level would also qualify for C1 level, but I didn’t include them, to avoid repetition in the top 50. If your level is C1, feel free to watch any of the movies previously recommended for B1 level.
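The selection described in the last two sections could be sketched as follows: keep the movies above a coverage threshold, rank them by rating, and leave the B1 picks out of the C1 list. This is only one plausible reading of the trade-off, not the code behind the analysis. The function name `top_movies`, the field names and every value in the sample rows are hypothetical.

```python
def top_movies(movies, level, min_coverage, n=50, exclude=()):
    """Titles of up to n movies at or above min_coverage for `level`, best rated first."""
    excluded = set(exclude)
    eligible = [
        m for m in movies
        if m[level] >= min_coverage and m["title"] not in excluded
    ]
    ranked = sorted(eligible, key=lambda m: m["rating"], reverse=True)
    return [m["title"] for m in ranked[:n]]


# Hypothetical rows: coverage in percent, rating on a 0-10 scale (made-up values).
movies = [
    {"title": "Movie A", "a": 93.5, "ab": 96.8, "rating": 8.4},
    {"title": "Movie B", "a": 91.0, "ab": 96.2, "rating": 7.9},
    {"title": "Movie C", "a": 94.1, "ab": 97.0, "rating": 6.5},
    {"title": "Movie D", "a": 89.7, "ab": 94.3, "rating": 9.0},
]

b1_list = top_movies(movies, "a", 93.0)
# Skip titles already recommended for B1 to avoid repetition.
c1_list = top_movies(movies, "ab", 96.0, exclude=b1_list)

print("B1:", b1_list)
print("C1:", c1_list)
```

Note that the highest-rated sample movie (Movie D) appears in neither list because it falls below both thresholds; in this sketch, coverage acts as a filter and rating only orders what passes it.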

Final note

Finally, a few observations on how I worked with the data:

If you’d like to see the code behind this analysis, you can find it here. Keep in mind that vocabulary is not the only factor in obtaining a good understanding of movies. Pronunciation, accent and speech pace might increase or decrease your level of comprehension as well.

Top 50 movies for learners at B1-B2 level (at least 93% A-level vocabulary coverage)

Memento (2000), Goodfellas (1990), Joker (2019), Scarface (1983), Prisoners (2013), Drive (2011), Taken (2008), Her (2013), The Notebook (2004), The Bourne Identity (2002), La La Land (2016), 500 Days Of Summer (2009), The Pursuit Of Happyness (2006), War Of The Worlds (2005), Saw (2004), Room (2015), A Star Is Born (2018), The Ring (2002), Raging Bull (1980), The Deer Hunter (1978), The Book Of Eli (2010), Jumper (2008), Insidious (2010), Bird Box (2018), Before Sunrise (1995), Manchester By The Sea (2016), Dog Day Afternoon (1975), Dawn Of The Dead (2004), The Descendants (2011), Nocturnal Animals (2016), Before Sunset (2004), I Am Number Four (2011), A History Of Violence (2005), Paranormal Activity (2007), No Strings Attached (2011), Big (1988), It Chapter Two (2019), The Impossible (2012), Revolutionary Road (2008), The Babadook (2014), The Darjeeling Limited (2007), About A Boy (2002), Desperado (1995), The Girl On The Train (2016), The Drop (2014), The Lake House (2006), Lars And The Real Girl (2007), Dear John (2010), The Gift (2015), Never Let Me Go (2010)

Top 50 movies for learners at C1 level (at least 96% A+B-level vocabulary coverage)

In Time (2011), Red (2010), 10 Cloverfield Lane (2016), The Adjustment Bureau (2011), Vicky Cristina Barcelona (2008), Saw II (2005), Marriage Story (2019), Warm Bodies (2013), The Vow (2012), The Next Three Days (2010), Side Effects (2013), Melancholia (2011), Rec (2007), First Man (2018), Road Trip (2000), Dangal (2016), Searching (2018), The Score (2001), Sex And The City (2008), The Giver (2014), Triple Frontier (2019), Mirrors (2008), The Commuter (2018), The 5th Wave (2016), Primer (2004), Tootsie (1982), Collateral Beauty (2016), Monsters (2010), Elektra (2005), Unthinkable (2010), Haywire (2011), Solaris (2002), Gemini Man (2019), Abduction (2011), The Fourth Kind (2009), I Am Mother (2019), Quarantine (2008), Searching For Sugar Man (2012), Trouble With The Curve (2012), Cocoon (1985), Crimes And Misdemeanors (1989), Original Sin (2001), Nine Queens (2000), Labor Day (2013), Morgan (2016), Regression (2015), Sabrina (1995), Dressed To Kill (1980), The Shack (2017), Boy Erased (2018)

