Monday, 30 January 2017

Combine documents with other data in Azure Search

Azure Search has built-in support for indexing documents in blob storage, which works great. But what if you want to combine the documents with other data, for example if you are building a recruitment system and want to search on, say, location and CV content at the same time?

TL;DR; Indexers in Azure Search uses a "create or update" approach; as long as your different indexers use the same id, they can all write data to the same document in the index.

Terminology in Azure Search

(deliberately simplified and made Azure Search specific)

An index in Azure Search is more like a table in an RDBMS or a collection in a document-based database. It has a specific "schema" (the index definition) that specifies the structure of the documents stored in the index.

A document is a single entity in Azure Search, think "record" or (indeed) document in a document database. It has an id that is unique within that index.

An indexer is something that takes data from a source, like Blob storage, SQL or something else, re-formats it and writes it to documents in an index in Azure Search.

A data source is a definition in Azure Search of somewhere that an indexer can read data from. It's sort of like a connection string.

Scenario

The specific scenario I have is the need to build a database of candidates for a recruitment system. I want to be able to search the contents of CVs, but I also want to be able to filter and use faceted search on various bits of data about the candidate themselves, such as location, qualification and so on.

Azure Search provides built-in functionality to read the contents from a range of files stored in Azure Blob Storage, so that sounded pretty ideal; just upload the CVs to blob storage, point Azure search to it and be done. Of course, that would not combine it with the other candidate data. Initially I thought about several options;
  • I could extract the content of the files myself and write code to combine it with candidate data in code. But extracting content out of files is not a trivial problem.
  • I could let Azure Search index the files, then write code to extract the content back out of Azure Search and write it back to another index. But that seemed like a very long-winded solution.
  • I asked Microsoft if I could somehow access the content-extraction feature directly. But you can't.

The solution

Liam Cavanagh gave me the outline solution with this statement;
Leverage Blob Storage with Data Store (such as Azure SQL or DocumentDB):  If you store your files in Azure Blob storage, and your structured data in something like Azure SQL or DocumentDB, you can use both the Azure Search Blob Indexer as well as the Azure Search SQL or DocumentDB Indexer together.  The only trick is to make sure that the unique key that is defined for each document matches between Blob and the structured data store.  For example, if you choose to use the Base64 Encoded filename as the key, then you would need to make sure that the matching content in your structured datastore also contains this value.  That way the Indexer which does something called MergeOrUpdate, will first take a row from say the Blob Storage and insert it as a new row in Azure Search and then it will bring in the data from the structured data store and update the existing row with the new fields.

With a bit more help from Liam I put together the sample solution outlined below. I should also mention that in this solution I am using the built-in tools in Azure Search to read data directly from SQL and Blob storage. If you wanted to use code to manually write to the index instead of letting Azure Search read from SQL then you can do that too, of course. Or read from some of the other supported data sources, or combine it all or...

Implementation

You'll need to create an Azure Search account and whatever data sources you want to use. You can do a number of the steps directly in the Azure Portal, though not all of them. For the ones you can't I will show the appropriate JSON payload you need to send. It is ridiculously easy with something like Postman. You just need to make sure you add two headers to your requests;
  • Content-Type : application/json
  • api-key : [an admin key for your Azure Search instance]
In summary, this is what we are going to build:



Create the Index

You can create a new index directly in the Azure Portal, which is what I did. It has a very nice editor that makes it quite easy. I ended up with something looking like this;

Please note the "content" field; When Azure Search indexes files, it will place the content of those files in the content field. Id is there by default and we need that, so leave it alone, but the remaining fields can be whatever you like. I put a name in there for the Candidate Name and the last two fields are some fields I added so I could experiment with faceted search.

Create the data sources

Next we need to tell Azure Search where it can get the data - we need to create the Data Sources.
You can't create a Data Source on it's own in the Portal; there is an option to "import data" but that combines setting up the data source with the index and the indexer so we can't use it for our purposes.
Time to start posting JSON (see above).

POST these to https://yoursearchservice.search.windows.net/datasources?api-version=2016-09-01 and replace the XXX with valid connection strings.
{
    "name" : "blobcvs",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "XXX" },
    "container" : { "name" : "cvs" }
}

{
    "name" : "candidates",
    "type" : "azuresql",
    "credentials" : { "connectionString" : "XXX" },
    "container" : { "name" : "Candidates" }
} 
This tells Azure Search how to access your data.

Create the indexers

POST these to https://yoursearchservice.search.windows.net/indexers?api-version=2016-09-01
{
    "name" : "candidateindexer",
    "dataSourceName" : "candidates",
    "targetIndexName" : "candidates",
    "fieldMappings" : [ { "sourceFieldName" : "Thingiemajics", 
                          "mappingFunction" : { "name" : "jsonArrayToStringCollection" } 
                      } ]
}
This tells Azure Search to take the data in the SQL database specified in the SQL data source and create a document in Azure Search for each row. Azure Search will automatically match fields with the same names; I've got an Id field as well as Name, Type and Thingiemajics columns in SQL. The only one that is a bit special is Thingiemajics; I'm storing an array of tag values in that field in SQL in the format ["red", "white", "blue"] and putting that mapping function in there tells Azure Search to make them individual tags that can be filtered individually. See the docs for more details.

Before I create the indexer for the files, let me just take a little detour. If you remember, the original statement was that for this to work, the two different indexers need to use the same id for data about the same candidate. The SQL indexer in my example uses the database ID of the candidate and we need to ensure that when Azure Search indexes the CV for a candidate it returns the same index. By default Azure Search will use the filename, which is obviously no good in this situation. The way I solved this was to add a custom meta data property to the blob when I uploaded it to Azure Blob Storage, something like this:

using (var fileStream = System.IO.File.OpenRead(file))
{
   await blob.UploadFromStreamAsync(fileStream);
}
blob.Metadata.Add("mykey", identifier);
await blob.SetMetadataAsync();
Here I have called it "mykey", but it could be called anything.

On to the indexer, which is created with this:
{
    "name" : "cvindexer",
    "dataSourceName" : "blobcvs",
    "targetIndexName" : "candidates",
    "fieldMappings" : [ { "sourceFieldName" : "mykey", "targetFieldName" : "id" } ],
    "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}
The most important thing here is that I tell Azure Search to take the metadata property "mykey" and map it to the "id" field in my index. That's all that is required to ensure the contents of the CV ends up in the same search document as the other Candidate information.

Notes

In my solution here I added an ID property to the blob meta data. I could, of course, have gone the other way instead and added the blob reference to the SQL data and then remapped that reference to "id" in the SQL indexer. Both approaches are equally valid, both have pros and cons, including the need to manually maintain uniqueness of the ID property and the need to base64 encode filenames.

You should also note that this works because the different datasources do not have any of the same field names; if the two indexers both return the same field, one will win and delete the data from the other one.

2 comments: