Monday, 30 January 2017

Combine documents with other data in Azure Search

Azure Search has built-in support for indexing documents in blob storage, which works great. But what if you want to combine the documents with other data, for example if you are building a recruitment system and want to search on, say, location and CV content at the same time?

TL;DR; Indexers in Azure Search uses a "create or update" approach; as long as your different indexers use the same id, they can all write data to the same document in the index.

Terminology in Azure Search

(deliberately simplified and made Azure Search specific)

An index in Azure Search is more like a table in an RDBMS or a collection in a document-based database. It has a specific "schema" (the index definition) that specifies the structure of the documents stored in the index.

A document is a single entity in Azure Search, think "record" or (indeed) document in a document database. It has an id that is unique within that index.

An indexer is something that takes data from a source, like Blob storage, SQL or something else, re-formats it and writes it to documents in an index in Azure Search.

A data source is a definition in Azure Search of somewhere that an indexer can read data from. It's sort of like a connection string.

Scenario

The specific scenario I have is the need to build a database of candidates for a recruitment system. I want to be able to search the contents of CVs, but I also want to be able to filter and use faceted search on various bits of data about the candidate themselves, such as location, qualification and so on.

Azure Search provides built-in functionality to read the contents from a range of files stored in Azure Blob Storage, so that sounded pretty ideal; just upload the CVs to blob storage, point Azure search to it and be done. Of course, that would not combine it with the other candidate data. Initially I thought about several options;
  • I could extract the content of the files myself and write code to combine it with candidate data in code. But extracting content out of files is not a trivial problem.
  • I could let Azure Search index the files, then write code to extract the content back out of Azure Search and write it back to another index. But that seemed like a very long-winded solution.
  • I asked Microsoft if I could somehow access the content-extraction feature directly. But you can't.

The solution

Liam Cavanagh gave me the outline solution with this statement;
Leverage Blob Storage with Data Store (such as Azure SQL or DocumentDB):  If you store your files in Azure Blob storage, and your structured data in something like Azure SQL or DocumentDB, you can use both the Azure Search Blob Indexer as well as the Azure Search SQL or DocumentDB Indexer together.  The only trick is to make sure that the unique key that is defined for each document matches between Blob and the structured data store.  For example, if you choose to use the Base64 Encoded filename as the key, then you would need to make sure that the matching content in your structured datastore also contains this value.  That way the Indexer which does something called MergeOrUpdate, will first take a row from say the Blob Storage and insert it as a new row in Azure Search and then it will bring in the data from the structured data store and update the existing row with the new fields.

With a bit more help from Liam I put together the sample solution outlined below. I should also mention that in this solution I am using the built-in tools in Azure Search to read data directly from SQL and Blob storage. If you wanted to use code to manually write to the index instead of letting Azure Search read from SQL then you can do that too, of course. Or read from some of the other supported data sources, or combine it all or...

Implementation

You'll need to create an Azure Search account and whatever data sources you want to use. You can do a number of the steps directly in the Azure Portal, though not all of them. For the ones you can't I will show the appropriate JSON payload you need to send. It is ridiculously easy with something like Postman. You just need to make sure you add two headers to your requests;
  • Content-Type : application/json
  • api-key : [an admin key for your Azure Search instance]
In summary, this is what we are going to build:



Create the Index

You can create a new index directly in the Azure Portal, which is what I did. It has a very nice editor that makes it quite easy. I ended up with something looking like this;

Please note the "content" field; When Azure Search indexes files, it will place the content of those files in the content field. Id is there by default and we need that, so leave it alone, but the remaining fields can be whatever you like. I put a name in there for the Candidate Name and the last two fields are some fields I added so I could experiment with faceted search.

Create the data sources

Next we need to tell Azure Search where it can get the data - we need to create the Data Sources.
You can't create a Data Source on it's own in the Portal; there is an option to "import data" but that combines setting up the data source with the index and the indexer so we can't use it for our purposes.
Time to start posting JSON (see above).

POST these to https://yoursearchservice.search.windows.net/datasources?api-version=2016-09-01 and replace the XXX with valid connection strings.
{
    "name" : "blobcvs",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "XXX" },
    "container" : { "name" : "cvs" }
}

{
    "name" : "candidates",
    "type" : "azuresql",
    "credentials" : { "connectionString" : "XXX" },
    "container" : { "name" : "Candidates" }
} 
This tells Azure Search how to access your data.

Create the indexers

POST these to https://yoursearchservice.search.windows.net/indexers?api-version=2016-09-01
{
    "name" : "candidateindexer",
    "dataSourceName" : "candidates",
    "targetIndexName" : "candidates",
    "fieldMappings" : [ { "sourceFieldName" : "Thingiemajics", 
                          "mappingFunction" : { "name" : "jsonArrayToStringCollection" } 
                      } ]
}
This tells Azure Search to take the data in the SQL database specified in the SQL data source and create a document in Azure Search for each row. Azure Search will automatically match fields with the same names; I've got an Id field as well as Name, Type and Thingiemajics columns in SQL. The only one that is a bit special is Thingiemajics; I'm storing an array of tag values in that field in SQL in the format ["red", "white", "blue"] and putting that mapping function in there tells Azure Search to make them individual tags that can be filtered individually. See the docs for more details.

Before I create the indexer for the files, let me just take a little detour. If you remember, the original statement was that for this to work, the two different indexers need to use the same id for data about the same candidate. The SQL indexer in my example uses the database ID of the candidate and we need to ensure that when Azure Search indexes the CV for a candidate it returns the same index. By default Azure Search will use the filename, which is obviously no good in this situation. The way I solved this was to add a custom meta data property to the blob when I uploaded it to Azure Blob Storage, something like this:

using (var fileStream = System.IO.File.OpenRead(file))
{
   await blob.UploadFromStreamAsync(fileStream);
}
blob.Metadata.Add("mykey", identifier);
await blob.SetMetadataAsync();
Here I have called it "mykey", but it could be called anything.

On to the indexer, which is created with this:
{
    "name" : "cvindexer",
    "dataSourceName" : "blobcvs",
    "targetIndexName" : "candidates",
    "fieldMappings" : [ { "sourceFieldName" : "mykey", "targetFieldName" : "id" } ],
    "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}
The most important thing here is that I tell Azure Search to take the metadata property "mykey" and map it to the "id" field in my index. That's all that is required to ensure the contents of the CV ends up in the same search document as the other Candidate information.

Notes

In my solution here I added an ID property to the blob meta data. I could, of course, have gone the other way instead and added the blob reference to the SQL data and then remapped that reference to "id" in the SQL indexer. Both approaches are equally valid, both have pros and cons, including the need to manually maintain uniqueness of the ID property and the need to base64 encode filenames.

You should also note that this works because the different datasources do not have any of the same field names; if the two indexers both return the same field, one will win and delete the data from the other one.

Friday, 6 January 2017

How many ways can I host a web app in Azure?

I got talking to a colleague about how many ways there are to host web apps in Azure and even managed to surprise myself by just how many ways I could think of. It inspired me to compile this list, which is just off the top of my head. I'm sure there are more - if you can think of other ways, please leave a comment.

For the purposes of this, I am defining a web app as something that has a user-facing UI and some server-side functionality. I have listed a few additional options at the bottom if you only need one of those things.

I put this list together real quick like and the description of each service is just a very quick summary, mainly just from memory so please do not view this as a definitive or highly accurate document; It's mainly just a fun exercise. That said, do please point out any factual errors in the comments so I can correct them.

Web Apps

This is the default option and probably what you should choose if in doubt. It's getting a lot of love from Microsoft at the moment and is constantly getting new features. You get fail-over and auto-scaling by default, plans from free to expensive and it is available in both Windows and Linux flavours (in preview). It's very easy to get started with and supports pretty much anything from static pages to complex deployment processes, staging slots and even a built-in basic CD pipeline.
The main thing to be aware of is that you don't get admin access to the server, so if you need to, say, customise which SSL protocols are available or you need to install fonts, you are out of luck. From experience, it is very rare that you need this, though. 

Cloud Services

This is the original Platform-as-a-Service option in Azure. It doesn't get much love these days, but I am still fond of it for those few situations where I need more than Web Apps can give me. Essentially, you provide Azure with an application package and Azure will take care of deploying that to one or more servers for you. You do get full admin access to the servers so you can do what you like - as long as you script it as part of your package. Patching, fail-over, load-balancing, auto-scaling, health monitoring etc is all taken care of for you.

Functions

This isn't really meant for doing a proper web app, but you can write a collection of functions that acts as an API or return some HTML etc, so you could certainly do it if you really wanted. Understand me right, though, Functions are brilliant for what they are meant to do, even if building whole web apps isn't it. That said, if you just need a couple of simple APIs to support a front-end app then it is certainly something you should consider.

Service Fabric

Not very well known, but Azure provides a pretty advanced micro-services framework that you can use for building sophisticated, large-scale applications. It supports both a service model and a basic Actor model out of the box. It's got a pretty high base cost relative to other Azure services due to the minimum number of nodes you have to use, but if you have a need for lots of scale then you should definitely look at this. Azure uses it to power a lot of their own architecture, including both SQL Azure and DocumentDb. 

Virtual Machine

You can, of course, deploy good old-fashined virtual machines and run your web app on them. You are then responsible for patching and some level of maintenance yourself. You can define images so Azure can do auto scaling for you as well. I personally try to avoid using VMs as far as I can as I don't like to be responsible for patching and maintenance etc - yet, I still have about 40 of them :).

Container Services / Docker

If you are one of those cool kids who like Docker, you are in luck. Azure has native support for Docker and support DC/OS, Docker Swarm and Kubernetes out of the box.


So, how many was that? Six, I think, though I probably shouldn't count the Functions one :)

Related

As if all the above wasn't enough, there are a couple of other technologies that can also be used to deal with web sites or APIs.

Blob Storage / CDN

Blob storage is Azure's file storage system (okay, it's more than that, but let that suffice for now). You can share a container publicly and put html, js, css and whatever other files you like in it. I quite often use it for static assets, though in theory you could host a whole website in there. Azure CDN can sit on top of blob storage and gives you geo-replication of the files. You can map custom domains to blob storage as well. Of course, this is all static so I couldn't quite include it in my list above. 

Traffic Manager

Traffic Manager doesn't host anything, but it is worth understanding where it fits in. Pretty much all of the above options include a load balancer (automatic, you don't really have to worry about it) and fail-over within a data centre. If you need fail-over between data centres or want to route traffic to servers close to your users, you can use Traffic Manager. It works at the DNS level and is used to direct user requests for a given domain name to the nearest available data centre. 

API Management

Not strongly related to web app hosting, but I just thought I'd mention it; Basically, if you develop an API and you want to give external users access to it you probably want to do things like controlling who can use it (maybe so you can bill them), rate limiting, authentication and so on. Azure API management deals with all that as a service that just sits in front of your naked API so you don't have to write it all yourself.

Takeaway

When I started this list, I really didn't expect it would be this long. I think it's great there is so much choice and I know that each option has it's own set of strengths and weaknesses. We use most of the technologies listed here on different projects and for different reasons, and I'm very happy that I can choose. At the same time, I think it is probably quite tough for someone new to Azure to even get started on figuring out which of the many options are right for their particular scenario. And I have only dealt with web app hosting here, not the multitude of other things you can do.

I just meant for this post to be a fun little exercise but, having written it, I should mention that at NewOrbit we have recently started helping other companies move to the cloud and sharing our years of Azure experience with them. 

If you have any questions or want to talk more, ping me on twitter or add a comment below.