How to use IoT datasets in #AI applications (full stack)

Introduction
Recently, Google launched Dataset Search, which is a great resource for finding datasets. In this post, I list some IoT datasets that can be used for machine learning or deep learning applications. But finding datasets is only part of the story. A static dataset is not enough for IoT, because some of the most interesting analysis happens in streaming mode. Creating an end-to-end streaming implementation from a given dataset requires full stack skills, which are more complex (and in high demand). In this post, I therefore describe both the datasets and a full stack implementation. An end-to-end implementation is described in the book Agile Data Science 2.0 by Russell Jurney. I use this book in my teaching on the Data Science for Internet of Things course at the University of Oxford, and I outline the implementation from the book below. The views expressed here are my own.
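As a rough illustration of the gap between a static dataset and streaming mode, the sketch below replays the rows of a CSV file one at a time, in timestamp order. The sample data, the column names and the replay function are all hypothetical, made up for this example. A real implementation would publish each row to a message broker rather than print it.

```python
import csv
import io
from typing import Dict, Iterator

# Hypothetical sample standing in for a downloaded dataset file.
SAMPLE_CSV = """timestamp,sensor_id,power_w
2018-01-01T00:02:00,meter-1,310.0
2018-01-01T00:00:00,meter-1,300.5
2018-01-01T00:01:00,meter-1,305.2
"""


def replay(csv_text: str, time_field: str = "timestamp") -> Iterator[Dict[str, str]]:
    """Yield rows one at a time, ordered by timestamp, as a stream would deliver them."""
    # Assumption: the dataset is small enough to load and sort in memory.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ISO 8601 timestamps in one consistent format sort correctly as plain strings.
    rows.sort(key=lambda row: row[time_field])
    for row in rows:
        yield row


stream = replay(SAMPLE_CSV)
first = next(stream)
print(first["timestamp"], first["power_w"])
```

Because replay is a generator, the consumer pulls one reading at a time, which is close to how a streaming consumer would see the data.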
In understanding an end-to-end application, the first problem is how to capture data from a wide range of IoT devices. The protocol typically used for this is MQTT, a lightweight publish-subscribe messaging protocol used in IoT applications to manage large numbers of devices that often have limited connectivity, bandwidth and power. MQTT integrates with Apache Kafka, a highly scalable distributed streaming platform. Kafka provides high scalability, longer-term storage and easy integration with legacy systems; it ingests, stores, processes and forwards high volumes of data from thousands of IoT devices. (Source: Kai Waehner)
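To make the publish-subscribe idea concrete, here is a minimal sketch in plain Python. It is a toy in-process stand-in, not a real MQTT client or broker: TinyBroker and topic_matches are hypothetical names. The matching follows MQTT's topic filter wildcards ('+' for one level, '#' for all remaining levels) in simplified form, assuming well-formed filters and ignoring the special '$' topics.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT-style topic filter matches a concrete topic."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard: the parent and everything below match
        if i >= len(topic_levels):
            return False  # the filter has more levels than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # a literal level that does not match
    return len(filter_levels) == len(topic_levels)


class TinyBroker:
    """Hypothetical in-memory broker that routes messages to matching subscribers."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Callable[[str, str], None]]] = defaultdict(list)

    def subscribe(self, topic_filter: str, callback: Callable[[str, str], None]) -> None:
        self._subscriptions[topic_filter].append(callback)

    def publish(self, topic: str, payload: str) -> int:
        """Deliver the payload to every matching subscriber; return the delivery count."""
        delivered = 0
        for topic_filter, callbacks in self._subscriptions.items():
            if topic_matches(topic_filter, topic):
                for callback in callbacks:
                    callback(topic, payload)
                    delivered += 1
        return delivered


broker = TinyBroker()
received = []
broker.subscribe("home/+/temperature", lambda topic, payload: received.append((topic, payload)))
broker.publish("home/kitchen/temperature", "21.5")
broker.publish("home/kitchen/humidity", "40")
print(received)
```

The publisher never needs to know who is listening, which is what lets many constrained devices send data without being coupled to the consumers.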
 
Full stack – End to End
With this background, let us try to understand the end-to-end (full stack) implementation of an IoT dataset. This section is adapted from the book Agile Data Science 2.0.

Image source:  Agile Data Science, 2.0 by Russell Jurney
We have the following components:

Events: each event represents an occurrence with a relevant timestamp. Events can represent various things (e.g. logs from a server). In our case, they represent time series data from sensors, typically encoded as JSON objects.

Collectors are event aggregators that collect events from various sources and queue them for action by real-time workers. Typically, Kafka or Azure Event Hubs may be used at this stage.

Bulk storage represents a file system capable of high I/O, for example S3 or HDFS.

Distributed document store, e.g. MongoDB

A web application server, e.g. Flask or Node.js
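The first two components, events and collectors, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the SensorEvent fields and the Collector class are hypothetical, and the standard library queue stands in for a real collector such as Kafka.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SensorEvent:
    """One timestamped sensor reading (the field names are assumptions)."""
    sensor_id: str
    timestamp: str  # ISO 8601 string, e.g. "2018-01-01T00:00:00"
    value: float


def to_json(event: SensorEvent) -> str:
    """Serialise an event as a JSON object."""
    return json.dumps(asdict(event))


class Collector:
    """Toy in-process stand-in for a collector such as Kafka."""

    def __init__(self) -> None:
        self._queue = queue.Queue()

    def collect(self, raw_event: str) -> None:
        """Queue one JSON-encoded event for later processing."""
        self._queue.put(raw_event)

    def drain(self, max_events: int) -> List[dict]:
        """Return up to max_events decoded events, oldest first."""
        batch = []
        while len(batch) < max_events:
            try:
                raw_event = self._queue.get_nowait()
            except queue.Empty:
                break  # nothing left to hand to the worker
            batch.append(json.loads(raw_event))
        return batch


collector = Collector()
for minute, value in enumerate([300.5, 305.2, 310.0]):
    event = SensorEvent("meter-1", f"2018-01-01T00:0{minute}:00", value)
    collector.collect(to_json(event))

first_batch = collector.drain(max_events=2)
second_batch = collector.drain(max_events=2)
print(len(first_batch), len(second_batch))
```

Draining in bounded batches mirrors how a worker would pull a limited number of events at a time instead of waiting for the whole dataset.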
The data processing is done with Spark. PySpark is used for the machine learning (with either the scikit-learn or Spark MLlib libraries) and the results are stored in MongoDB. Apache Airflow can be used for scheduling.
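The shape of this step (process a batch of events, then store the results as documents) can be sketched without a cluster. The example below is not Spark or MongoDB code and does no machine learning: it swaps in a plain Python per-sensor aggregation and a dict-backed store, and the names summarise and DocumentStore are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional


def summarise(events: List[dict]) -> List[dict]:
    """Group events by sensor and return one summary document per sensor."""
    values_by_sensor: Dict[str, List[float]] = defaultdict(list)
    for event in events:
        values_by_sensor[event["sensor_id"]].append(event["value"])
    return [
        {"_id": sensor_id, "count": len(values), "mean": mean(values), "max": max(values)}
        for sensor_id, values in sorted(values_by_sensor.items())
    ]


class DocumentStore:
    """Dict-backed stand-in for a document store, keyed by the _id field."""

    def __init__(self) -> None:
        self._documents: Dict[str, dict] = {}

    def upsert(self, document: dict) -> None:
        """Insert the document, replacing any existing one with the same _id."""
        self._documents[document["_id"]] = document

    def find_one(self, document_id: str) -> Optional[dict]:
        """Return the stored document, or None if the _id is unknown."""
        return self._documents.get(document_id)


# Hypothetical batch of already-decoded events.
batch = [
    {"sensor_id": "meter-1", "timestamp": "2018-01-01T00:00:00", "value": 300.0},
    {"sensor_id": "meter-1", "timestamp": "2018-01-01T00:01:00", "value": 310.0},
    {"sensor_id": "meter-2", "timestamp": "2018-01-01T00:00:00", "value": 42.0},
]

store = DocumentStore()
for document in summarise(batch):
    store.upsert(document)

print(store.find_one("meter-1"))
```

A web application server can then look up one document per sensor, which is the role the document store plays in the architecture above.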
 
Code
From the GitHub repository of Agile Data Science 2.0:
https://github.com/rjurney/Agile_Data_Code_2/tree/training
The EC2 scripts: https://github.com/rjurney/Agile_Data_Code_2/blob/training/aws/ec2_bootstrap.sh
The real-time notebook with Spark ML/Streaming: https://github.com/rjurney/Agile_Data_Code_2/blob/training/ch08/Deploying%20Predictive%20Systems.ipynb
 
Finally, below are some of the reference datasets you can use with IoT.

To conclude
To conclude, using the strategy and code described here, you could in principle create an end-to-end streaming IoT application.
IoT datasets
Utilities
Gas Sensor Array Drift Dataset Data Set
Water Treatment Plant Data Set
Internet Usage Data Data Set
Commercial Building Energy Dataset
Individual household electric power consumption Data Set
AMPds2: The Almanac of Minutely Power dataset (Version 2)
Commercial Building Energy Dataset (Energy, Smart building): Energy-related data set from a commercial building, where data is sampled more than once a minute.
Individual household electric power consumption (Energy, Smart home): One-minute sampling rate over a period of almost 4 years.
AMPds (Energy, Smart home): Contains electricity, water, and natural gas measurements at one-minute intervals for 2 years of monitoring.
UK Domestic Appliance-Level Electricity (Energy, Smart home): Power demand from five houses.
Gas sensors for home activity monitoring (Smart home): Recordings of 8 gas sensors.
 
 
Smart cities
Traffic Sign Recognition Testsets
Pollution Measurements for the City of Brasov in Romania
GNFUV Unmanned Surface Vehicles Sensor Data Data Set
CGIAR dataset (Agriculture, Climate): High-resolution climate datasets for a variety of fields, including agriculture.
Uber trip data (Transportation): About 20 million Uber pickups in New York City over 12 months.
Traffic Sign Recognition (Transportation)
Malaga datasets (Smart city): A broad range of categories such as energy, ITS, weather, industry, sport, etc.
CityPulse Dataset Collection (Smart city): Road traffic data, pollution data, weather, parking.
Open Data Institute – node Trento (Smart city): Weather, air quality, electricity, telecommunication.
Taxi Service Trajectory (Transportation): Trajectories performed by all 442 taxis running in the city of Porto, Portugal.
T-Drive trajectory data (Transportation)
Chicago Bus Traces data (Transportation): Bus traces from the Chicago Transport Authority for 18 days.
CityPulse Dataset Collection
Taxi service trajectories
 
Health and home activity
Educational Process Mining (Education): Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator.
PhysioBank databases (Healthcare): Archive of over 80 physiological datasets.
Saarbruecken Voice Database (Healthcare): A collection of voice recordings from more than 2000 persons for pathological voice detection.
CASAS datasets for activities of daily living (Smart home): Several public datasets related to Activities of Daily Living (ADL) performance in a two-story home, an apartment, and an office setting.
ARAS Human Activity Dataset (Smart home): Human activity recognition datasets collected from two real houses with multiple residents during two months.
MERLSense Data (Smart home, building): Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records.
SportVU (Sport): Video of basketball and soccer games captured from 6 cameras.
RealDisp (Sport): Includes a wide range of physical activities (warm up, cool down and fitness exercises).
GeoLife GPS Trajectories (Transportation): GPS trajectories, each represented by a sequence of time-stamped points.
Various sensor driving datasets
IoT Network Dataset
Various MHEALTH / physical activity datasets
 
 
Source for some of the datasets: Deep Learning for IoT Big Data and Streaming Analytics: A Survey
 
