What's New in the ML.NET CLI

The ML.NET CLI has gotten some interesting updates. This post will go over the main items that are new.

For a video version of this post, check below.

New Install Name

The first thing to note is that the newer versions of the ML.NET CLI install under a new name. Since the file size got too big for a single .NET tool, it is now split into multiple installs depending on what operating system and CPU architecture you're running.

Getting the newest version will therefore require a new install even if you have the older version installed. In fact, I would recommend going ahead and uninstalling the older version of the CLI if you already have it installed. This can be done with the dotnet tool uninstall mlnet --global command.

Which package you install depends on your machine. I have an M1 MacBook Pro, so I would install the mlnet-osx-arm version. If you're on Windows, you will probably be installing the mlnet-win-x64 version.

If you want to update a previously installed newer version, you can use the dotnet tool update command.
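
As a quick recap in command form, switching from the old tool to one of the new OS-specific packages might look like the below on a Windows x64 machine (using the package names mentioned above; swap in the one that matches your OS and CPU):

dotnet tool uninstall mlnet --global
dotnet tool install mlnet-win-x64 --global
dotnet tool update mlnet-win-x64 --global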

Train with an mbconfig File

The new CLI release comes with a couple of new commands. The first we'll go over is the train command. It takes a single required argument, an mbconfig file, and uses the information in that file to perform another training run.

This can be good for a few scenarios, including continuous integration where the mbconfig file is checked into version control and can be run each day to see if a new model can be discovered.
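
As a rough sketch, a training run from an existing mbconfig file might look like the below, where sentiment.mbconfig is just a placeholder for your own file. I'm assuming the file is passed with the --training-config option here; check mlnet train --help for the exact argument name in your version of the CLI.

mlnet train --training-config ./sentiment.mbconfig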

Forecasting

Along with the train command a new scenario has been added - forecasting. Forecasting is primarily used for time series data to forecast values in the future. Similar to the other scenarios, we have a few arguments we can pass in.

The dataset and label-col arguments are similar to the other scenarios, but forecasting has a couple of others that are required - horizon and time-col.

The horizon argument is simply how many time periods into the future you want the forecasting algorithm to predict.

The time-col argument is just the column that has the time or dates that the algorithm can use.

And we can run this like other scenarios with the below command. We'll let it run only for 10 seconds with the --train-time argument. The data can be found here if you want to run it as well.

mlnet forecasting --dataset C:/dev/wind_gen.txt --horizon 3 --label-col 1 --time-col 0 --train-time 10


These are a couple of big additions to the CLI, and I'm sure more are coming. It is nice to see that the ML.NET team is continuing to keep the CLI's features on par with Model Builder.

Introduction to QnA Maker

Suppose you have a FAQ page that has a lot of data and want to use it as a first line of customer service support for a chat bot on your main page. How can you integrate this with minimal effort? Enter Microsoft QnA Maker.

For the video version of this post, check below.

What is Microsoft QnA Maker

The official docs state that...

QnA Maker is a cloud-based Natural Language Processing (NLP) service that allows you to create a natural conversational layer over your data.

So, basically, this service lets you build question answering on top of the data that you have.

Creating the Azure Resource

First thing, like for all Cognitive Services items, we need to create the Azure resource. When creating a new resource, you can search for "QnA" and select "QnA Maker".

After clicking "Create" on that, we're on a screen where we have to enter a few things. First, we will supply the subscription, resource group, name, and pricing tier. Note that this does have a free tier so this can be used for proof of concepts or to simply try out the service to see if it meets your needs.

Next, it will ask for details for Azure Search, which is used to host the data that you give to QnA Maker. Only the location and pricing tier are needed for this. This has a free tier, as well.

Last, it will ask for details for an App Service. This is used to host the QnA Maker runtime, which is what serves queries against your data. The app name and location are required for this.

You can optionally enable app insights for this service as well, but feel free to disable that since it's not really needed unless you are creating this for actual business purposes and want to see if anything goes wrong.

With all that, we can click the "Review and Create" button to create and deploy the resources.

Creating a QnA Knowledge Base

With the resource created we can now go to it. Per usual with the Cognitive Services items, we have a "Keys and Endpoint" item on the left where we can get the key and endpoint. This will be used later when using the API.

But first, we need to create our QnA Knowledge Base and to do that we need to go to the "Overview" tab on the left navigation. A bit down it has a link where we can go to the QnA portal to create a new knowledge base.

We can skip step one since we already created the Azure resource. In step two, we will connect the QnA portal to our Azure resource. Simply give it the subscription and the Azure service name. Luckily, all of this pre-populates so they are all dropdowns.

In step three, we will give our knowledge base a name. We'll name it "ikea" since we will use the Ikea FAQ to populate the knowledge base.

Step four is where we'll populate the knowledge base. If you already have a FAQ on your website you can put the URL in. Since I'm using the Ikea FAQ we can do that. You can also add a file for this. If you have neither, you can leave this blank and fill out the questions and answers manually on the next page.

Below this you can customize your bot with a chit-chat. This just helps give your QnA bot a bit more personality based on what you select. Here's a screenshot of an example from the docs of each of the chit-chat items that you can choose from.

For step five, we can create our knowledge base by clicking on the button.

Training and Testing

Once we have our knowledge base created, we can see that it read in the Ikea FAQ quite well. To see just how well QnA Maker does, we can immediately click on the "Save and train" button to train a model on our data.

Once that finishes we can click on the "Test" button to give the model a test. This is an integrated chat bot where we can ask it questions and will receive answers based on the model.

So we can ask "What is the return policy?" and QnA Maker will give us the best answer based on our data.

Early on we get some good results from QnA Maker. But what if we want to add a new pair?

Adding a New QnA Pair

If we want to add new QnA pairs to our existing knowledge base, just click on the Add QnA Pair button.

We can add an alternate phrasing such as "This is a question". A phrasing is essentially the question a user would type into the system, which then gets sent to QnA Maker. We can input an answer as well, such as "This is an answer". Notice that we have a rich text editor which can be toggled in the upper left. With this, we can add graphics, links, and emojis. Let's add a smile emoji to our answer.

Now we can click the "Save and train" button to train a new model on what we just added. We can then do another test, input "This is a question", and we should get the answer that we put in as the output.

Using the API

Before we can actually use the API we need to publish our knowledge base. Simply click the "Publish" tab near the top and then the "Publish" button.

Once that completes, you can either create a bot that will use this knowledge base, or you can use the API directly. We'll use the API, and the publish page shows how you can call it using Postman or curl. We'll use Postman here so we can easily test the API out.

To build the URL for the API, take the "Host" item from the deployment details and append the path shown at the start of the first line, right after "POST".

And since it does say "POST", we will make this a POST request.

Next, we need to set the authorization header. In the Authorization tab in Postman, set the type to "API Key". The "Key" item will be "Authorization" and the "Value" will be the API key which is the third part of the deployment details.

Now we can add in the JSON body. The simplest JSON we can send has only one item, a "question" item, which is the prompt a user would send to QnA Maker. Let's use the question that we added earlier.

{
 "question":"This is a question"
}

Once we hit "Send" in Postman, we will get the below response.

{
    "answers": [
        {
            "questions": [
                "This is a question"
            ],
            "answer": "This is an answer😀",
            "score": 100.0,
            "id": 176,
            "source": "Editorial",
            "isDocumentText": false,
            "metadata": [],
            "context": {
                "isContextOnly": false,
                "prompts": []
            }
        }
    ],
    "activeLearningEnabled": false
}

The main part to notice here is the "answer", which is what we expect to get back.
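
If you would rather call the endpoint from code instead of Postman, here is a minimal Python sketch of the same request. The host, knowledge base ID, and endpoint key below are placeholders; take the real values from the deployment details on the publish page, where the key is shown prefixed with "EndpointKey".

import requests

# Placeholder values - replace these with the items from your deployment details
runtime_host = "https://your-qna-app.azurewebsites.net/qnamaker"  # the "Host" item
kb_id = "your-knowledge-base-id"                                  # the GUID in the POST path
endpoint_key = "your-endpoint-key"                                # the key from the Authorization line

url = f"{runtime_host}/knowledgebases/{kb_id}/generateAnswer"
headers = {"Authorization": f"EndpointKey {endpoint_key}"}
body = {"question": "This is a question"}

response = requests.post(url, headers=headers, json=body)
response.raise_for_status()

print(response.json()["answers"][0]["answer"])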


Hopefully, this showed how useful QnA Maker can be, especially if you already have a FAQ page with questions and answers. With QnA Maker, that FAQ can be turned into a chat bot or plugged into any other automation tool where users or customers need to ask questions.

Use Bing Image Search to Get Training Image Data

When going through the FastAI book, Deep Learning for Coders, I noticed that in one of the early chapters they mention using the Bing Image Search API to retrieve images for training data. While they have a nice wrapper for the API, I thought I'd dive into the API as well and use it to build my own way to download training image data.

Let's suppose we need to make an image classification model to determine what kind of car is in an image. We'd need quite a few different images for this model, so let's use Bing Image Search to gather images of Aston Martin cars and start building our data.

Check out the below for a video version of this post.

Before going into the technical side of Bing Image Search, let's go over why we'd use it in the first place. Why not just download the images ourselves?

Bing Image Search has a few features that we can utilize in our code when getting our images, and some of them are important to take into account.

Automation

I'll be honest, I'm lazy and if I can script something to do a task for me then I'll definitely spend the time to build the script rather than do the task manually. Rather than manually finding images and downloading them, we can use the Bing Image Search API to do this for us.

Image License

We can't always just take an image from a website and use it however we want. A lot of images online are copyrighted, and if we use one without a license or permission we are in violation of the creator's copyright. If the creator finds out, they can, more than likely, take legal action against us.

However, with Bing Image Search, we have an option to specify what license the returned images must have. We can do this with the license query parameter in our API call, which is based on Creative Commons licenses. We can specify exactly what type of license we want; here we'll ask for public images where the copyright is fully waived. There are many Creative Commons license types that Bing Image Search supports, and there's a full list here.

Image Type

There are quite a few image types that we could download from Bing Image Search. For our purposes, though, we only want photos of Aston Martin cars, so we can set the image type in our API call to just photo. If we don't specify this, we could get back animated GIFs, clip art, or drawings of Aston Martin cars.

Safe Content

When downloading images from the internet you never really know what you're going to get. Bing Image Search can help ease that worry by specifying that you want only safe content to be returned.

Bing can do this filtering for us so we don't have to worry about it when we make our API call. This is one less thing to worry about and, because it's the internet, it's definitely something worth worrying about when downloading images.

Create Azure Resource

Before we can use the Bing Image Search API we need to create the resource for it. In the Azure Portal create a new resource and search for "Bing Search". Then, click on the "Bing Search v7" resource to create it.

When creating the resource, give it a name, a resource group, and a subscription. For the pricing tier, there is a free tier that lets you try the service out for evaluation or a proof of concept. Once that is complete, click "Create".

When the deployment completes, we can explore the resource page a bit. There's a "Try me" tab where we can try the Bing Search API and see what results we get, some sample code that quickly shows how to use the API, and a lot of tutorials if we want to look at something more specific, such as the image or video search APIs.

Retrieve Key and Endpoint

To use the API in our code we will need the API key and the endpoint to call. There are a couple of ways we can get to it. First, on the "Overview" page of the resource there's a link that says to "click here to manage keys". Clicking that will take you to another page where you can get the API keys and the endpoint URL.

You can also click on the "Keys and Endpoint" section on the left navigation.

Now save the API key and the endpoint since we'll need those to access the API in the code.

Using the API

Now we get to the fun stuff where we can write some code. I'll be using Python, but you're very welcome to use the language of your choice since this is a simple API call. I'm also using Azure ML since it's very easy to get a JupyterLab instance running, and most machine learning and data science packages come already installed.

Imports

First, we need to import some modules. We have five that we will need.

  • JSON: This will be used to read in a config file for the API key and endpoint.
  • Requests: Will be used to make the API calls. This comes pre-installed in an Azure ML Jupyter instance, so you may need to run pip install requests if you are using another environment.
  • Time: Used to delay API calls so the server doesn't get hit with too many requests.
  • OS: Used to save and help clean image data on the local machine.
  • PPrint: Used to format JSON when printing.

import json
import requests
import time
import os
import pprint

The API Call

Now, we can start building and making the API call to get the image data.

Building the Endpoint

To start building the call, we need to get the API key, which is kept in a JSON file for security reasons. We'll use the open function to open the file for reading and the json module to load it. This gives us a dictionary whose keys are the JSON keys, which we can use to get the values.

config = json.load(open("config.json"))

api_key = config["apiKey"]
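
For reference, config.json here is assumed to be nothing more than a small JSON file holding the key, something like the below (the apiKey name just needs to match what the code reads):

{
    "apiKey": "<your-bing-search-api-key>"
}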

Now that we have the API key, we can build up the URL to make the API call. We can use the endpoint that we got from the Azure Portal to help build up the URL.

endpoint = "https://api.bing.microsoft.com/"

With the endpoint, we have to append a path to it to tell the service that we want the Image Search API. To learn more about the exact endpoints we're using here, this doc has a lot of good information.

url = f"{endpoint}v7.0/images/search"

Building the Headers and Query Parameters

Some more information we need to add to our call are the headers and the query parameters. The headers are where we supply the API key, and the query parameters describe what images we want returned.

Requests makes it easy to specify the headers, which is done as a dictionary. We need to supply the Ocp-Apim-Subscription-Key header for the API key.

headers = { "Ocp-Apim-Subscription-Key": api_key }

The query parameters are also supplied as a dictionary. We'll set the license, image type, and safe search parameters here. Those are optional, but the q parameter, which holds the search query, is required. For our query here, we'll search for aston martin cars.

params = {
    "q": "aston martin", 
    "license": "public", 
    "imageType": "photo",
    "safeSearch": "Strict",
}

Making the API Call

With everything ready, we can now make the API call and get the results. With requests we can just call the get method, passing in the URL, the headers, and the parameters. We use the raise_for_status method to throw an exception if the status code isn't successful. Then, we get the JSON of the response and store it in a variable. Finally, we use pretty print to print the JSON response.

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()

result = response.json()

pprint.pprint(result)

And here's a snapshot of the response. There's quite a bit here but we'll break it down some later in this post.

{'_type': 'Images',
 'currentOffset': 0,
 'instrumentation': {'_type': 'ResponseInstrumentation'},
 'nextOffset': 38,
 'totalEstimatedMatches': 475,
 'value': [{'accentColor': 'C6A105',
            'contentSize': '1204783 B',
            'contentUrl': 'https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg',
            'creativeCommons': 'PublicNoRightsReserved',
            'datePublished': '2021-02-06T20:45:00.0000000Z',
            'encodingFormat': 'jpeg',
            'height': 1530,
            'hostPageDiscoveredDate': '2021-01-12T00:00:00.0000000Z',
            'hostPageDisplayUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&picture=aston-martin-car',
            'hostPageFavIconUrl': 'https://www.bing.com/th?id=ODF.lPqrhQa5EO7xJHf8DMqrJw&pid=Api',
            'hostPageUrl': 'https://www.publicdomainpictures.net/view-image.php?image=376994&picture=aston-martin-car',
            'imageId': '38DBFEF37523B232A6733D7D9109A21FCAB41582',
            'imageInsightsToken': 'ccid_WTqn9r3a*cp_74D633ADFCF41C86F407DFFCF0DEC38F*mid_38DBFEF37523B232A6733D7D9109A21FCAB41582*simid_608053462467504486*thid_OIP.WTqn9r3aKv5TLZxszieEuQHaF5',
            'insightsMetadata': {'availableSizesCount': 1,
                                 'pagesIncludingCount': 1},
            'isFamilyFriendly': True,
            'name': 'Aston Martin Car Free Stock Photo - Public Domain '
                    'Pictures',
            'thumbnail': {'height': 377, 'width': 474},
            'thumbnailUrl': 'https://tse2.mm.bing.net/th?id=OIP.WTqn9r3aKv5TLZxszieEuQHaF5&pid=Api',
            'webSearchUrl': 'https://www.bing.com/images/search?view=detailv2&FORM=OIIRPO&q=aston+martin&id=38DBFEF37523B232A6733D7D9109A21FCAB41582&simid=608053462467504486',
            'width': 1920}]}

A few things to note from the response:

  • nextOffset: This will help us page items to perform multiple requests.
  • value.contentUrl: This is the actual URL of the image. We will use this URL to download the images.

Paging Through Results

For a single API call we may get around 30 items or so by default. How do we get more images with the API? We page through the results. The way to do this is to use the nextOffset item in the API response: we pass this value in as the offset query parameter to get the next page of results.

So if I only want at most 200 images, I can use the below code to page through the API results.

# Collect the image URLs across pages
contentUrls = []
new_offset = 0

while new_offset <= 200:
    print(new_offset)
    params["offset"] = new_offset

    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()

    result = response.json()

    # Wait a second between calls so we don't hammer the API
    time.sleep(1)

    new_offset = result["nextOffset"]

    for item in result["value"]:
        contentUrls.append(item["contentUrl"])

We start with an empty list for the content URLs and initialize the offset to 0, so the initial call gives the first page of results. The while loop limits the offset to 200 so we get at most around 200 images. Within the loop we set the offset parameter to the current offset, which will be 0 initially. Then we make the API call, sleep for one second, set new_offset to the nextOffset from the results, and save the contentUrl items from the results into our list. Then we do it again until the offset reaches our limit.

Downloading the Images

In the previous API calls all we did was capture the contentUrl items from each of the images. In order to get the images as training data we need to download them. Before we do that, let's set up our paths to be ready for images to be downloaded to them. First we set the path and then we use the os module to check if the path exists. If it doesn't, we'll create it.

dir_path = "./aston-martin/train/"

if not os.path.exists(dir_path):
    os.makedirs(dir_path)

Generally, we could just use the below code: loop through all of the content URLs, build a file path for each one with the os.path.join method so we get the correct path for the system we're on, open that path with the open function, download the image with requests.get, and write the response contents to the file.

for url in contentUrls:
    # Naive approach: use the full URL as the file name
    path = os.path.join(dir_path, url)

    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)

            f.write(image_data.content)
    except OSError:
        pass

However, this is a bit more complicated than we would hope it would be.

Cleaning the Image Data

If we print all of the image URLs we get back, they would look something like this:

https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg
https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=750&amp;w=1260
https://images.pexels.com/photos/2811239/pexels-photo-2811239.jpeg?cs=srgb&amp;dl=pexels-tadas-lisauskas-2811239.jpg&amp;fm=jpg
https://get.pxhere.com/photo/car-vehicle-classic-car-sports-car-vintage-car-coupe-antique-car-land-vehicle-automotive-design-austin-healey-3000-aston-martin-db2-austin-healey-100-69398.jpg
https://get.pxhere.com/photo/car-automobile-vehicle-automotive-sports-car-supercar-luxury-expensive-coupe-v8-martin-vantage-aston-land-vehicle-automotive-design-luxury-vehicle-performance-car-aston-martin-dbs-aston-martin-db9-aston-martin-virage-aston-martin-v8-aston-martin-dbs-v12-aston-martin-vantage-aston-martin-v8-vantage-2005-aston-martin-rapide-865679.jpg
https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d

Do you notice anything in the URLs? While most of them end in a .jpg extension, there are a few with extra parameters on the end. If we try to download with those URLs as-is we won't get usable image files, so we need to do a little bit of data cleaning here.

Luckily, there are two patterns we can check for: a ? in the URL and a ! in the URL. With those patterns we can update our download loop as below to get a clean file name for every image.

for url in contentUrls:
    # Use the last segment of the URL as the file name
    split = url.split("/")
    last_item = split[-1]

    # Strip anything after a "?" (e.g. "photo.jpeg?auto=compress&cs=tinysrgb")
    second_split = last_item.split("?")
    if len(second_split) > 1:
        last_item = second_split[0]

    # Strip anything after a "!" (e.g. "photo.jpg!d")
    third_split = last_item.split("!")
    if len(third_split) > 1:
        last_item = third_split[0]

    print(last_item)
    path = os.path.join(dir_path, last_item)

    try:
        with open(path, "wb") as f:
            image_data = requests.get(url)

            f.write(image_data.content)
    except OSError:
        pass

With this cleaning in place we can download and save all of the images.

Conclusion

While this probably isn't as sophisticated as the wrapper that FastAI has, it should help if you need to get training images from Bing Image Search yourself. You can also tweak it as needed.

Using Bing Image Search is a great way to get quality and license appropriate images for training data.

The ML.NET Deep Learning Plans

One of the most requested features for ML.NET is the ability to create neural network models from scratch to perform deep learning in ML.NET. The ML.NET team has taken that feedback, along with the feedback from the customer survey, and has come out with a plan to start implementing this feature.

Current State of Deep Learning in ML.NET

Currently, in ML.NET, there isn't a way to create neural networks and train deep learning models from scratch. There is great support for taking an existing deep learning model and using it for predictions, however. If you have a TensorFlow or ONNX model, it can be used in ML.NET to make predictions.

There is also great support for transfer learning in ML.NET. This allows you to take your own data, retrain a pretrained model on it, and end up with a model of your own.

However, as mentioned earlier, ML.NET does not yet have the capability to let you create your own deep learning models from scratch. Let's take a look at what the plans are for this.

Future Deep Learning Plans

In the ML.NET GitHub repo there is an issue that was fairly recently created that goes over the plans to implement creating deep learning models in ML.NET.

There are two reasons for this:

  1. Communicate to the community about what the plans are and that this is being worked on.
  2. Get feedback from the community on the current plan.

While we'll touch on the main points in the issue in this post, I would highly encourage you to go through it and give any feedback or questions about the plans you may have to help the ML.NET team in their planning or implementation.

The issue details three parts in order to deliver creating deep learning models in ML.NET:

  1. Make consuming of ONNX models easier
  2. Support TorchSharp and make it production ready
  3. Create an API in ML.NET to support TorchSharp

Let's go into each of these in more detail.

Easier Use of ONNX Models

While you can currently use ONNX models in ML.NET, you do have to know the input and output names in order to use them. Right now we rely on the Netron application to load an ONNX model and show us the input and output names. While this isn't bad, the team wants to provide a built-in way to get these instead of having to rely on a separate application.
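
In the meantime, if you'd rather script that lookup than open Netron, a short Python sketch with the onnx package can list the names. This is just my own workaround, not the planned ML.NET API, and it assumes onnx is installed via pip and that model.onnx is your model file.

import onnx

# Load the ONNX model and print the graph's input and output names
model = onnx.load("model.onnx")
print([graph_input.name for graph_input in model.graph.input])
print([graph_output.name for graph_output in model.graph.output])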

Of course, along with the new way to get the input and output names for ONNX models, the documentation will definitely be updated to reflect this. I believe, not only documentation, but examples would follow to show how to do this.

Supporting TorchSharp

TorchSharp is at the heart of how ML.NET will implement deep learning. Similar to how TensorFlow.NET supports scoring TensorFlow models in ML.NET, TorchSharp provides .NET access to the native library that powers PyTorch. PyTorch is starting to lead the way in building deep learning models in research and in industry, so it makes sense to support it in ML.NET.

In fact, one of the popular libraries to build deep learning models is FastAI. Not only is FastAI one of the best courses to take when learning deep learning, but the Python library is one of the best in terms of building deep learning models. Under the hood, though, FastAI uses PyTorch to actually build the models that it produces. This isn't by accident. The FastAI developers decided that PyTorch was the way to go for this.

TensorFlow support is great for making predictions with existing models, but for building new ones from scratch I really think PyTorch, through TorchSharp, is the preferred way, and it will help ML.NET lead the way here.

Implementing TorchSharp into ML.NET

The final stage is, once TorchSharp has been made production ready, to create a high-level API in ML.NET to train deep learning models from scratch.

This will be like when Keras came along for TensorFlow. It was an API on top of TensorFlow to help make building the models much easier. I believe ML.NET can do that for TorchSharp.

This will probably be a big undertaking, but it is definitely worth doing. This will be the API people use to build their models, so taking the time to get it right will pay off: it will let us build our models in the most straightforward way possible, which will make us more productive in the long run.

Conclusion

Creating deep learning models from scratch is, by far, one of the most requested features for ML.NET, and this plan should reach that goal. In fact, I think it will surpass it, since the backend will be PyTorch, which is where research and industry are leaning.

If you have any feedback or questions, definitely feel free to comment on the GitHub issue.

AI Ethics and Fairness Resources

AI and data ethics and fairness is becoming a very hot topic lately. From computer vision models that can't see everyone equally to the debacle at Google's AI division, it's something we all need to look out for when doing any type of work with data.

With that, I'd like to share some resources I have found useful when researching this topic. Some are videos that go over how bias can get into data, and others are research papers that go over how to help mitigate it.

For a video version of this post, check below:

Videos

There are quite a lot of videos that go over AI ethics. Below are a few of my favorites that have a good amount of information in them.

  • The Trouble with Bias by Kate Crawford - This talk was given at the Neural Information Processing Systems (NIPS) conference in 2017. Not only does Kate go over what exactly bias in machine learning models is, but she also goes over the harms it can cause.

  • Machine Learning and Fairness by Hanna Wallach and Jennifer Wortman Vaughan - This is actually one of my favorite resources on the list. This video goes into several aspects of fairness in machine learning, including types of bias that can be in your data as well as ways to help mitigate them, such as the Datasheets for Datasets paper that's linked in the papers section.

  • Transparency and Intelligibility Throughout the Machine Learning Life Cycle by Jennifer Wortman Vaughan - This goes through the entire machine learning life cycle and shows how best to incorporate transparency at each stage.

Courses

There are a couple of courses that go over AI ethics and I believe more will be on the way as time goes on.

  • FastAI Ethics - FastAI's ethics course is probably one of the most comprehensive out there. It has several lectures and each lecture has supplemental materials such as articles and even research papers.

Books

Just like courses are coming to teach people about AI ethics, books are also appearing that do the same and show how you can prevent bias from creeping into your models.

  • Interpretable AI by Ajay Thampi - One of the first books I've seen on this subject, this book helps you understand why interpretable models are needed and shows how to build them.

Papers and Documents

A lot of the information in the other categories comes from earlier research on data bias and AI ethics. Some documents have also come out of that research to help people who create models mitigate the amount of bias in their data.

  • Manipulating and Measuring Model Interpretability - This paper goes into how to measure model interpretability. It also helps answer the question about what is interpretability in terms of a machine learning model.

  • Datasheets for Datasets - In electronics, there is a datasheet accompanied by each component that describes its characteristics, any testing done on it, etc. This paper proposes the idea of having the same for machine learning data.

  • AI Fairness Checklist - This document has a checklist that one can follow throughout the lifecycle of creating a model to look out for fairness.

Tools

Thankfully, there are some tools out there that can help us interpret how models are making their predictions as well as assessing fairness within the models.

  • Microsoft Fairlearn - This Python tool helps assess the fairness of your models. There is a demo available that helps show how it works.

  • Microsoft InterpretML - Another Python tool to help interpret machine learning models. This one also has a demo available.

Hopefully, this list gave you a good idea about data and AI ethics and fairness. There are definitely many more resources out there and I have been partial to Microsoft for their research and resources.

There will be more posts on ethics and fairness in the future as well, especially covering the two tools from Microsoft, Fairlearn and InterpretML.

What's New in ML.NET Version 1.6

Another new release of ML.NET is now out! The release notes for version 1.6 have all the details, but this post will highlight the more interesting updates from this version. I'll also include the pull request for each item in case you want to see more details or learn how something was implemented.

There were a lot of things added to this release, but the team notes that none of them are breaking changes.

For the video version of this post, check below.

Support for ARM

Perhaps the most exciting part of this update is the new support for ARM architectures. This allows most ML.NET training and inference scenarios to run on ARM devices.

Why is this update useful? Well, ARM architectures are almost everywhere. As mentioned in the June update blog post, ARM architectures are found in mobile and embedded devices, which opens up a whole world of opportunities for ML.NET on mobile phones and IoT devices.

DataFrame Updates

The DataFrame API is probably one of the more exciting packages currently in its early stages. Why? Well, .NET doesn't have much that competes with pandas in Python for data analysis or the data wrangling you may need before you send data into ML.NET to make a model.

Why am I including DataFrame updates in an ML.NET update? Well, the DataFrame API has been moved into the ML.NET repository! The code used to be in the CoreFx Lab repository as an experimental package, but it's no longer experimental and is now part of ML.NET. This is great news since many more updates to this API are planned.

Other DataFrame updates include:

  • GroupBy operation extended - While the DataFrame API already had a GroupBy operation, this update adds new property groupings and makes it act more like LINQ's GroupBy operation.
  • Improved CSV parsing - Implemented the TextFieldParser that can be used when loading a CSV file. This allows the handling of quotes in columns.
  • Convert IDataView to DataFrame - We already had a way to convert a DataFrame object into an IDataView so that data loaded with the DataFrame API could be used in ML.NET. Now we can do the opposite: load data in ML.NET and convert it into a DataFrame object to perform further analysis on it.
  • Improved DateTime parsing - This allows for better parsing of date time data.
  • Improvements to the Sort and Merge methods - These updates allow for better handling of null fields when performing a sort or merge.

By the way, if you're looking for a way to help contribute to the ML.NET repository, helping with the DataFrame API is a great way to get involved. They have quite a few issues already that you can take a look at and help out with. It would be awesome if we got this package on par with pandas to help make C# a great ecosystem to perform data analysis.

You can use the Microsoft.Data.Analysis label on the issues to filter them out so you can see what all they need help with.

Code Enhancements

Quite a few of the enhancements were code quality updates. In fact, feiyun0112 submitted several pull requests that improved the code quality of the repo, helping to make it easier for folks to read and maintain.

Miscellaneous Updates

There were also quite a lot of updates that didn't really tie in to a single theme. Here are some of the more interesting ones.

These are just a few of the changes in this release. Version 1.6 has a lot of stuff in it so I encourage you to go through the full release notes to see all the items that I didn't include in this post.


What was your favorite update in this release? Was it ARM support or the new DataFrame enhancements? Let me know in the comments!