Where Not to Eat in New York City
Investigate health inspection results for New York City’s 27,000 restaurants using Python

A week ago, I saw a tweet from Ashley Willis (@ashleymcnamara) in which she asked, “What’s the best tech talk you’ve ever seen?” The response was overwhelming. Make sure to check out the replies for some of the most excellent tech talks.
Since then, I have been watching some of the recommended talks. The first one I watched was “Discovering Python” by David Beazley, an exceptional talk in which he uses Python to support a patent litigation case.
As I searched for more talks by David Beazley, I found “Builtin Superheroes.” In this presentation, he used standard Python data types to analyze food inspections. An excellent and fun performance to watch!
It inspired me to write this article, in which I use the standard Python data structures to investigate the health inspection results for New York City’s 27,000 restaurants.
Reading the data and initial inspection
The health inspection results for New York City’s 27,000 restaurants are publicly available on the Open Data site of New York City. You can download the data from the site; press the export button to download it as a CSV file.
I use the Python 3.8 command line to analyze the data, and you can follow along.
I use the DictReader class from the csv module to read the file; it turns every row into a dictionary keyed by the column names. I wrap the result of the DictReader in a standard list, so each item of the list is a dictionary with the data of a single inspection.
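A minimal sketch of that step (I assume here that the export was saved as inspections.csv in the working directory; use whatever name your download has):

import csv

# Read the exported file; csv.DictReader turns every row into a dictionary
# keyed by the column names from the header row.
with open('inspections.csv', encoding='utf-8') as csv_file:
    rows = list(csv.DictReader(csv_file))

print(len(rows))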
As you can see, the CSV with the inspection results contains 400,305 records. Your number may differ, as the data set is updated frequently. Each record contains the details of a single inspection. We print the contents of the first record to show what a single record looks like.
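Each item in rows is a plain dictionary, so printing the first one shows all column names and their values:

# Every row maps column names to string values.
print(rows[0])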
There are several interesting things to note. For example, the VIOLATION CODE field indicates the type of violation, and the DESCRIPTION field describes the result of the inspection. Another interesting field, CRITICAL FLAG, indicates whether the result of the inspection was critical. I am not entirely sure what critical means here, but I can imagine.
It would be interesting to see how many different violation codes there are present in this data set. We can filter the violation code using the following Set Comprehension.
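The comprehension can look roughly like this:

# Collect every distinct value of the VIOLATION CODE column.
violation_codes = {row['VIOLATION CODE'] for row in rows}
print(len(violation_codes))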
The Set Comprehension selects the VIOLATION CODE from each record and creates a set from the result. Because a set can’t contain duplicates, they are removed automatically. As you can see, there are 105 different violation codes in the data set.
Now, how many inspections got a positive review? We can count them by selecting the records that don’t have a violation code.
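Something like the following, assuming that an inspection without a violation shows up as an empty VIOLATION CODE field:

# Inspections without a violation code recorded.
positive = [row for row in rows if not row['VIOLATION CODE']]
print(len(positive))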
Five thousand eight hundred fifty-five inspections got a positive review.
Selecting the worst restaurants
Now, it would be nice if we could select and show the restaurants that have the highest number of inspections with a negative result.
Let us first see how many critical inspections there are in this data set using the following statement.
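A sketch of that statement, assuming the CRITICAL FLAG column literally contains the value 'Critical' for critical inspections:

# Inspections that have a violation code and are flagged as critical.
critical = [row for row in rows
            if row['VIOLATION CODE'] and row['CRITICAL FLAG'] == 'Critical']
print(len(critical))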
This List Comprehension creates a list of all the inspections that contain a violation code and are marked as critical. We use the len function to show the number of critical inspections. There are 218,375 critical inspections. This is almost 55%!
Count the number of negative inspections per restaurant
The DBA field contains the name of the company; it means Doing Business As. We use the Counter class from the collections module to count the number of critical inspections per company using the following statement.
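Roughly like this:

from collections import Counter

# Count the critical inspections per company name and show the top 10.
worst_dba = Counter(row['DBA'] for row in critical)
for name, count in worst_dba.most_common(10):
    print(name, count)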
The previous code prints the following top 10 results. I reformatted it a bit.
DUNKIN’ 1995
SUBWAY 1442
MCDONALD’S 915
STARBUCKS 749
KENNEDY FRIED CHICKEN 696
CROWN FRIED CHICKEN 597
DUNKIN’, BASKIN ROBBINS 567
BURGER KING 466
POPEYES 450
GOLDEN KRUST CARIBBEAN BAKERY & GRILL 375
There are a couple of things that stand out to me. First, these are well-known restaurants. Secondly, there is still a bit of cleanup to do. As you can see, different names are used for Dunkin’ Donuts.
Also, the result is biased: the more locations a brand has, the more inspections it gets, and with that the number of negative inspections grows as well.
To zoom in and see which individual restaurant has the most negative inspections, we can use the CAMIS field. This number, according to the data dictionary, is a ten-digit unique identifier for the restaurant. If we execute the selection again using the CAMIS field, we get the following result.
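The statement only changes in the field it counts:

# Count the critical inspections per individual restaurant (CAMIS number).
worst_camis = Counter(row['CAMIS'] for row in critical)
print(worst_camis.most_common(10))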
Now, these are the restaurants in New York City that had the most negative inspections. Let’s see if we can add the names of the restaurants to the result.
We add the names by iterating over the top 10 restaurant CAMIS numbers. For each CAMIS number, we look up the DBA field by searching through the critical list.
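A sketch of that lookup:

# For every CAMIS number in the top 10, find one matching record and take its DBA field.
for camis, count in worst_camis.most_common(10):
    name = next(row['DBA'] for row in critical if row['CAMIS'] == camis)
    print(name, count)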
These are the names of the restaurants that have the most negative inspection results. One thing that we did not take into account yet is the date of the inspections.
If, for example, a restaurant improved significantly over the last year, that improvement would not show up in the previous selection. Let’s try to fix that with the next selection.
The worst restaurants per year

First, let’s see which inspection years are in the data set, using a Python Set Comprehension. I use sorted to sort the output.
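I assume here that the date column is called INSPECTION DATE and contains dates such as 06/14/2019, so the last four characters are the year:

# Collect the distinct inspection years; the year is the last part of the date string.
years = sorted({row['INSPECTION DATE'][-4:] for row in rows})
print(years)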
So the data set contains the inspection results from 2013 until 2020. I don’t expect that there were inspections in the year 1900. The data dictionary clears this up, as it states, “Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection.”
Now, what if we wanted to see the top 10 restaurants that had the most critical inspections in 2019?
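We can reuse the critical list and keep only the 2019 inspections (same assumed INSPECTION DATE format as above):

# Critical inspections from 2019 only, counted per restaurant.
critical_2019 = [row for row in critical if row['INSPECTION DATE'].endswith('2019')]
worst_2019 = Counter(row['CAMIS'] for row in critical_2019)
print(worst_2019.most_common(10))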
And to get the name of the restaurant, we can use the following.
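The same lookup as before, now against the 2019 list:

# Translate the top 10 CAMIS numbers of 2019 into restaurant names.
for camis, count in worst_2019.most_common(10):
    name = next(row['DBA'] for row in critical_2019 if row['CAMIS'] == camis)
    print(name, count)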
Now, if I ever wanted to eat at a restaurant in New York City, I would pass on these restaurants.
The worst category restaurants
The last interesting detail I want to extract from this data is which category of restaurants has the most negative inspections.
There is a field in the data called CUISINE DESCRIPTION that describes the type of cuisine of the restaurant. So, let’s first see how many different types of cuisine there are in the dataset. We do this using a Set Comprehension.
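The comprehension follows the same pattern as before:

# Collect the distinct cuisine types.
cuisines = {row['CUISINE DESCRIPTION'] for row in rows}
print(len(cuisines))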
The dataset contains 84 different types of cuisine. We can count the number of negative inspections per cuisine using Counter.
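For example:

# Count the critical inspections per cuisine and show the top 10.
worst_cuisine = Counter(row['CUISINE DESCRIPTION'] for row in critical)
print(worst_cuisine.most_common(10))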
Here we see that the most negative inspections are in American cuisine.
Now, there is a significant bias in these numbers as we don’t take into account the number of restaurants. So let’s count the number of restaurants per cuisine.
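A sketch that counts the distinct CAMIS numbers per cuisine:

# Group the unique restaurants (CAMIS numbers) per cuisine...
restaurants_per_cuisine = {}
for row in rows:
    restaurants_per_cuisine.setdefault(row['CUISINE DESCRIPTION'], set()).add(row['CAMIS'])

# ...and count how many restaurants every cuisine has.
cuisine_count = {cuisine: len(camis_set)
                 for cuisine, camis_set in restaurants_per_cuisine.items()}
print(sorted(cuisine_count.items(), key=lambda item: item[1], reverse=True)[:10])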
If we combine the worst_cuisine and cuisine_count results, we should be able to calculate a ratio that makes more sense than the cuisine_count alone.
I calculate the ratio by dividing the number of critical inspections per cuisine by the number of restaurants of that cuisine.
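A sketch of that calculation:

# Critical inspections per restaurant for every cuisine.
ratio = {cuisine: worst_cuisine[cuisine] / cuisine_count[cuisine]
         for cuisine in worst_cuisine}

# Show the ten cuisines with the highest number of critical inspections per restaurant.
for cuisine, value in sorted(ratio.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(cuisine, round(value, 2))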
Conclusion
With this article, I showed that it is possible to extract useful information from a sizeable data set using only the standard Python data structures. I used dictionaries, lists, and Set and List Comprehensions to find the worst restaurants in New York City.
If you have some time on your hands and want to see what people consider the best tech talks, check out the list that David Heath created from the replies to Ashley Willis’s tweet.

Thank you for reading.