
🍪 Get structured data from recipe websites

I created an open-source engine for extracting and tagging ingredients from any culinary recipe found on the web.

Extract and tag ingredients from a website

From the command line:

$ curl 'https://ingredients.schollz.now.sh/?url=https://cooking.nytimes.com/recipes/12320-apple-pie'

Issue?

Please file a report on GitHub.

This blog post is related to a new Go library I wrote to extract and tag ingredients from recipes: ingredients.

Recipes are a curiosity to me. Recipes are essentially programs - a set of ingredients to be combined in a specific way using a series of instructions. Recipes have been around a lot longer than programs, though. Some have existed for thousands of years. The oldest recipes still exist today with little modification, but newer recipes are constantly evolving. In modern times, a single recipe may take on many forms as technologies progress, ingredients change, and cultural diets (paleo / dairy-free) fluctuate.

Culinary recipes are everywhere on the Internet. I want to harness them to understand more about recipes, with a tool that can draw on this vast trove. What if you could instantly search and compare thousands of chocolate chip cookie recipes?

If there were a database of all known recipes that you could easily search and index, you could compare ingredients, compare ratios of ingredients, and look at variations of ingredients. I’ve been working on making such a tool - a sort of “Google” for recipes - a huge indexed dataset of all recipes and their ingredients. This tool requires two components: first, a way to extract an ingredient list from any website, and second, a way to tag the ingredients, measurements, and amounts in each line of the list.

Ingredient tagging

Ingredient tagging is where you take a line of ingredients and determine the ingredient name, measure, and amount. Here is an example of an ingredient line:

4 tablespoons melted nonhydrogenated margarine, melted coconut oil or canola oil

Tagging this ingredient line should yield the amount, measure, and name for the ingredient, for example:

{
	"amount": 4,
	"measure": "tablespoons",
	"name": "margarine"
}

This way, the ingredient can later be compared to other ingredients based on its name and measure/amount.

The prior art of ingredient tagging comes from the New York Times (NYT). The NYT wanted to extract ingredient data for use with their recipe website. Their approach was to use Natural Language Processing, specifically linear-chain conditional random fields. The NYT succeeded in creating a tool that leverages structured prediction for ingredient tagging.

I spent a long time trying to improve on the NYT’s ingredient tagger. I felt like the NLP approach was too general, especially when applied to something like ingredient lists. It’s like taking a bazooka to kill a fly. I thought that there could be a simpler solution.

There are only a finite number of ingredients (on the order of thousands) and ingredient lists are highly structured, 95% of the time. The structure is highly pervasive: an ingredient list contains several ingredient lines, and each ingredient line includes an ingredient and usually a measure and an amount (which usually come in a fixed order: amount, measure, ingredient). Using this insight I made a content-based extractor and parser, first in Python, then rewritten and improved in Go, and now improved further with ingredients.

Tagging of ingredients is done in the simplest possible way: greedy search and contextual finding. Greedy search means taking the first possible choice, and contextual finding means each search starts from the position where the previous piece was found. For example, consider 1 1/2 cup (12 oz) mini chocolate chips. To tag this I first find the “amount”, then the “measure”, and finally the “ingredient”. The “amount” is the first run of consecutive numbers; here that is 1 1/2 (later computed as 1.5). Next, the “measure” is the first measurement string after the location of the numbers. Out of a list of measurement strings (e.g. cup, tablespoon, etc.), the first one that shows up is cup, so that is selected. Finally, I use a corpus of ingredient names, sorted in decreasing order of length, so the first match is chocolate chips. Though chocolate also matches, it is shorter, and the greedy search takes the longest match. From this simple set of guidelines we can easily match the majority of ingredient strings!
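The steps above can be sketched in a few dozen lines of Go. This is a minimal illustration, not the code from the ingredients library: the Tagged type, the Tag and parseNumber functions, and the tiny measure and name lists are all assumptions made up for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Tagged is the result of tagging one ingredient line (hypothetical type).
type Tagged struct {
	Amount  float64
	Measure string
	Name    string
}

// Tiny illustrative vocabularies (assumptions); a real corpus is far larger.
var measures = []string{"cup", "tablespoon", "teaspoon", "oz"}
var names = []string{"chocolate", "butter", "chocolate chips", "oats"}

// parseNumber converts tokens such as "1", "1.5" or "1/2" to a float.
func parseNumber(tok string) (float64, bool) {
	if tok == "" || tok[0] < '0' || tok[0] > '9' {
		return 0, false
	}
	if num, den, isFraction := strings.Cut(tok, "/"); isFraction {
		n, errN := strconv.ParseFloat(num, 64)
		d, errD := strconv.ParseFloat(den, 64)
		if errN != nil || errD != nil || d == 0 {
			return 0, false
		}
		return n / d, true
	}
	v, err := strconv.ParseFloat(tok, 64)
	return v, err == nil
}

// Tag finds the amount, then the measure, then the name. Each search
// starts where the previous one stopped (the "contextual" part).
func Tag(line string) Tagged {
	var t Tagged
	words := strings.Fields(strings.ToLower(line))
	pos := 0

	// Amount: skip to the first numeric token, then sum the consecutive run.
	i := 0
	for i < len(words) {
		if _, ok := parseNumber(words[i]); ok {
			break
		}
		i++
	}
	for i < len(words) {
		v, ok := parseNumber(words[i])
		if !ok {
			break
		}
		t.Amount += v
		i++
		pos = i
	}

	// Measure: the first known measure word after the amount.
	for j := pos; j < len(words) && t.Measure == ""; j++ {
		w := strings.TrimSuffix(strings.Trim(words[j], "(),."), "s")
		for _, m := range measures {
			if w == m {
				t.Measure, pos = m, j+1
				break
			}
		}
	}

	// Name: try corpus entries longest-first, so "chocolate chips"
	// is found before "chocolate".
	sorted := append([]string(nil), names...)
	sort.SliceStable(sorted, func(a, b int) bool { return len(sorted[a]) > len(sorted[b]) })
	rest := strings.Join(words[pos:], " ")
	for _, n := range sorted {
		if strings.Contains(rest, n) {
			t.Name = n
			break
		}
	}
	return t
}

func main() {
	fmt.Printf("%+v\n", Tag("1 1/2 cup (12 oz) mini chocolate chips"))
}
```

For the example line this yields an amount of 1.5, the measure cup, and the name chocolate chips. Note that the (12 oz) is skipped only because the measure was already found at cup; a line that puts the parenthetical first would need extra handling.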

Ingredient extraction from HTML

I hope to take a website - a set of HTML content - and extract only the ingredient list. An ingredient list is the list of ingredients and their amounts that should be used to prepare the recipe. It usually comes before the directions and is sometimes encoded in HTML like the following:

<h2>Ingredients</h2>
<ul>
	<li>1 cup chocolate chips</li>
	<li>1/2 cup melted butter</li>
	<li>1 cup oats</li>
</ul>

However, there is absolutely no guarantee that any random website will use ul tags or li tags, or any other tag. The task is further complicated by the fact that websites are bloated with other content that you do not want to extract.

Website with ingredients

My first attempt at this was to use content extraction. Basically I converted the HTML to text, went line by line, computed how many “ingredient”-like words there were on each line, and then used the cluster of lines with ingredient words as a proxy for the ingredient list.
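As a rough sketch of that first approach, assuming the "cluster" is simply the longest run of consecutive lines that contain at least one ingredient-like word (the original clustering rule is not spelled out here, so this is a guess). The word list and the function names countIngredientWords and longestIngredientRun are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical vocabulary of "ingredient"-like words (an assumption).
var ingredientLike = map[string]bool{
	"cup": true, "cups": true, "tablespoon": true, "teaspoon": true,
	"chocolate": true, "butter": true, "oats": true, "flour": true,
}

// countIngredientWords counts the words on a line that appear in the vocabulary.
func countIngredientWords(line string) int {
	count := 0
	for _, w := range strings.Fields(strings.ToLower(line)) {
		if ingredientLike[strings.Trim(w, ".,:;()")] {
			count++
		}
	}
	return count
}

// longestIngredientRun returns the half-open range [start, end) of the longest
// run of consecutive lines that each contain at least one ingredient-like word.
func longestIngredientRun(lines []string) (start, end int) {
	runStart := -1
	for i := 0; i <= len(lines); i++ {
		hit := i < len(lines) && countIngredientWords(lines[i]) > 0
		switch {
		case hit && runStart < 0:
			runStart = i // a new run begins here
		case !hit && runStart >= 0:
			if i-runStart > end-start {
				start, end = runStart, i // longest run so far
			}
			runStart = -1
		}
	}
	return start, end
}

func main() {
	text := "Welcome to my blog\nIngredients\n1 cup chocolate chips\n" +
		"1/2 cup melted butter\n1 cup oats\nDirections\nMix and bake"
	lines := strings.Split(text, "\n")
	start, end := longestIngredientRun(lines)
	fmt.Println(lines[start:end])
}
```

A single stray line without a known word splits the run in two, which is one way an approach like this can miss part of a list.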

In practice this worked well, but not great. What could be improved?

Well, I realized recently that with this method I’m essentially throwing away part of the information - the structure of the HTML. Even though I have no idea if the HTML encodes ingredient lists using <li> tags or <div> tags or whatever, I can still look for any parent tag that contains child tags that have the right kinds of words.

What are the right kinds of words that designate a given line as part of an ingredient list? This is where heuristics come in. I made a few simple observations, and for every affirmative one I add +1 to a score for that line. A line with a high score is more likely an “ingredient line”.

Heuristics for establishing an “ingredient line”

  • Does the line contain an ingredient word (apple, chocolate, etc.)?
  • Does the line contain a measure word (cups, tablespoon, etc.)?
  • Does the line contain a number?
  • Does the number come before the measure?
  • Does the measure come before the ingredient?
  • Is the line shorter than 100 characters?

Most lines in an ingredient list will answer “yes” to these questions.
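The six questions translate almost directly into a scoring function. The sketch below is an illustration only: the Score and firstIndex names and the short word lists are assumptions, and it uses plain substring matching, so "cup" would also match inside "cupcake".

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Small illustrative word lists (assumptions); real lists are much longer.
var ingredientWords = []string{"apple", "chocolate", "butter", "oats"}
var measureWords = []string{"cup", "tablespoon", "teaspoon"}

// firstIndex returns the earliest byte offset of any word in the line, or -1.
func firstIndex(line string, words []string) int {
	best := -1
	for _, w := range words {
		if i := strings.Index(line, w); i >= 0 && (best < 0 || i < best) {
			best = i
		}
	}
	return best
}

// Score adds one point for each heuristic the line satisfies (0 to 6).
func Score(line string) int {
	l := strings.ToLower(line)
	ing := firstIndex(l, ingredientWords)
	mea := firstIndex(l, measureWords)
	num := strings.IndexFunc(l, unicode.IsDigit)

	score := 0
	if ing >= 0 { // contains an ingredient word
		score++
	}
	if mea >= 0 { // contains a measure word
		score++
	}
	if num >= 0 { // contains a number
		score++
	}
	if num >= 0 && mea >= 0 && num < mea { // number before measure
		score++
	}
	if mea >= 0 && ing >= 0 && mea < ing { // measure before ingredient
		score++
	}
	if len([]rune(line)) < 100 { // shorter than 100 characters
		score++
	}
	return score
}

func main() {
	fmt.Println(Score("1 cup chocolate chips"))
	fmt.Println(Score("Sign up for our newsletter"))
}
```

The ingredient line answers "yes" to all six questions, while the newsletter line earns a point only for being short.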

Thus, to find an ingredient list I can just traverse the HTML tree with a depth-first search and calculate a score for each parent based on how well its children answer each of those questions. Each parent above some threshold is designated as a container for the ingredient list, after which the ingredient lines are quite easily determined.
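Here is a minimal sketch of that traversal. To stay self-contained it walks a hand-built tree rather than parsed HTML, and it simplifies the per-line score to a yes/no check (a digit plus a measure word). The Node type, FindContainers, looksLikeIngredientLine, sampleTree, and the threshold of 2 are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Node is a stand-in for a parsed HTML element (hypothetical type).
type Node struct {
	Tag      string
	Text     string // text directly inside the element
	Children []*Node
}

// looksLikeIngredientLine is a simplified stand-in for the line score:
// the text must contain a digit and a measure word.
func looksLikeIngredientLine(s string) bool {
	l := strings.ToLower(s)
	if strings.IndexFunc(l, unicode.IsDigit) < 0 {
		return false
	}
	for _, m := range []string{"cup", "tablespoon", "teaspoon"} {
		if strings.Contains(l, m) {
			return true
		}
	}
	return false
}

// FindContainers visits the tree depth-first and returns every node whose
// number of ingredient-like children reaches the threshold.
func FindContainers(root *Node, threshold int) []*Node {
	var found []*Node
	var visit func(n *Node)
	visit = func(n *Node) {
		score := 0
		for _, c := range n.Children {
			if looksLikeIngredientLine(c.Text) {
				score++
			}
		}
		if score >= threshold {
			found = append(found, n)
		}
		for _, c := range n.Children {
			visit(c)
		}
	}
	visit(root)
	return found
}

// sampleTree mirrors the small HTML example from earlier in the post.
func sampleTree() *Node {
	return &Node{Tag: "body", Children: []*Node{
		{Tag: "h2", Text: "Ingredients"},
		{Tag: "ul", Children: []*Node{
			{Tag: "li", Text: "1 cup chocolate chips"},
			{Tag: "li", Text: "1/2 cup melted butter"},
			{Tag: "li", Text: "1 cup oats"},
		}},
		{Tag: "p", Text: "Preheat the oven to 350 degrees."},
	}}
}

func main() {
	for _, n := range FindContainers(sampleTree(), 2) {
		fmt.Println("container:", n.Tag)
		for _, c := range n.Children {
			fmt.Println(" -", c.Text)
		}
	}
}
```

Nothing here depends on the tag being ul or li; a div full of span children would be found the same way, which is the point of scoring parents by their children.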

Ingredient extraction from JSON

There is a special case: nowadays, with progressive web apps, the recipe may be contained in JSON inside a <script> tag and used to fill the HTML DOM dynamically when the JavaScript runs. To account for these pages I can do the same thing as above. First I get the JSON (by trial and error), then I parse it and look for an array of lines that look like ingredient lines.
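The JSON search can be sketched with the standard encoding/json package by decoding into generic values and walking them recursively. This assumes the JSON text has already been pulled out of the <script> tag. The function names, the "at least half the entries" rule, and the sample document with its key names are illustrative assumptions; the search itself never looks at key names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode"
)

// looksLikeIngredientLine is a simplified check: a digit plus a measure word.
func looksLikeIngredientLine(s string) bool {
	l := strings.ToLower(s)
	if strings.IndexFunc(l, unicode.IsDigit) < 0 {
		return false
	}
	for _, m := range []string{"cup", "tablespoon", "teaspoon"} {
		if strings.Contains(l, m) {
			return true
		}
	}
	return false
}

// findIngredientArray walks decoded JSON depth-first and returns an array of
// strings in which at least half of the entries look like ingredient lines.
// Map iteration order is random in Go, so if several arrays qualify, which
// one is returned is not deterministic.
func findIngredientArray(v any) []string {
	switch x := v.(type) {
	case map[string]any:
		for _, child := range x {
			if got := findIngredientArray(child); got != nil {
				return got
			}
		}
	case []any:
		lines := make([]string, 0, len(x))
		hits := 0
		for _, item := range x {
			if s, ok := item.(string); ok {
				lines = append(lines, s)
				if looksLikeIngredientLine(s) {
					hits++
				}
			}
		}
		if len(x) > 0 && len(lines) == len(x) && hits*2 >= len(x) {
			return lines
		}
		for _, item := range x {
			if got := findIngredientArray(item); got != nil {
				return got
			}
		}
	}
	return nil
}

// extractFromJSON decodes raw JSON text and searches it for an ingredient array.
func extractFromJSON(raw string) ([]string, error) {
	var decoded any
	if err := json.Unmarshal([]byte(raw), &decoded); err != nil {
		return nil, err
	}
	return findIngredientArray(decoded), nil
}

// sampleJSON is made-up data for the example.
const sampleJSON = `{
  "name": "Cookies",
  "recipeIngredient": ["1 cup chocolate chips", "1/2 cup melted butter", "1 cup oats"],
  "recipeInstructions": ["Mix everything.", "Bake for 12 minutes."]
}`

func main() {
	lines, err := extractFromJSON(sampleJSON)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The instructions array is skipped because none of its entries contains both a digit and a measure word.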

Next steps

In the future I will use this library to test whether I can collect enough information about enough recipes to draw some interesting conclusions and ask questions about average recipes, recipe variations, and principal component analysis of recipe types. I have thousands of chocolate chip cookie recipes ready to understand.

July 16, 2019
