There is no bigger data than telecommunications and Multiple System Operator (MSO) data. There is no data which affects the operations of these industries more than network telemetry. And there is no data that is less standard.
Telecommunications companies face one of the most complex data problems imaginable. Getting a whole-of-network view is hard, because it is one of the largest data roll-up exercises there is.
This article will show how using neural networks can shorten this roll-up dramatically—we have seen reductions in Time to Data of up to 99.998% for business-as-usual data pipelines.
One of Datalogue’s largest customers, a top US telco, has employed these techniques to improve time to data and increase the number of errors which are automatically responded to—leading to savings on the order of tens of millions of dollars per year in reduced operational costs.
Let’s say, for example, that a network provider wants to access handset telemetry from across their various data sources.
Traditionally, a company would ingest a data store, mapping the data to a schema that makes sense to them.
The company would then transform the data as necessary to feed it into the output format required.
Then, a second source appears. Another mapping process. Another pipeline.
A third source. Same again.
(If you are getting bored, imagine how the engineers feel.)
Then the structure of the first source changes without warning. The company would have to catch that, delete the mis-processed data, remap and repipe, and start again.
A painful cycle, and one that doesn’t benefit from economies of scale. A user might get marginally faster at building a data pipeline, but it is still about one unit of labor for each new source.
… and each change in a destination too. Each iteration in the output requires more of the above as well.
That’s a lot of upfront work. A lot of maintenance work. And a lot of thankless work.
And critically, this data grooming does not support good business process. The people who know, own, and produce the data, the people who live with it every day, are not the ones making sense of it here. By leaving that work to the data engineering, analysis, or operations teams, the domain-specific knowledge of the data owners is largely left by the wayside. It is lost knowledge.
That means errors are only caught once the data product loop is complete: once the data org has massaged the data according to its own understanding, analyzed it according to that understanding, and provided insights based on that understanding. Only when an insight is delivered, and is counterintuitive enough to alarm the experts, would an error be caught.
Any new approach would be worthwhile only if it addressed the flaws in the existing processes.
This is where a neural network based solution really shines.
A neural network based workflow would capture the domain experts' knowledge in an ontology and training data, classify incoming data automatically regardless of source schema, and remain resilient to changes in source structure.
I’ll outline the above by walking through this example.
We were given three files to work from: a training dataset, a US telemetry dataset, and a European telemetry dataset.
The training data was well known and understood by the data producers, and structured according to their wont. Crucially, this data represented their work in solving a subset of the data problem.
Designing the ontology and training a neural network is like uploading expertise
The US and EU telemetry datasets were unseen—we were to standardize all three files to the same format.
To do so, we created a notebook (see appendix for code snippets) that utilized the Datalogue SDK to:
An ontology is a taxonomy or hierarchy of data classes that fall within a subject matter area. You can think of an ontology as a “master” schema for all the data in a particular domain.
Ontologies are powerful tools for capturing domain knowledge from subject matter experts. That knowledge can then be utilized to unify data sources, build data pipelines, and train machine learning models.
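To make the idea concrete, an ontology can be pictured as a small tree of data classes. The sketch below is ours, not the Datalogue SDK's API; the class names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OntologyNode:
    """A class of data in the domain, e.g. 'telemetry' or 'first name'."""
    name: str
    sensitive: bool = False          # drives downstream obfuscation
    children: List["OntologyNode"] = field(default_factory=list)

    def leaves(self) -> List["OntologyNode"]:
        """Leaf classes are the targets a classifier would predict."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# A toy slice of a telecom ontology (names are illustrative, not the
# customer's actual schema).
ontology = OntologyNode("telecom", children=[
    OntologyNode("subscriber", children=[
        OntologyNode("first_name", sensitive=True),
        OntologyNode("family_name", sensitive=True),
    ]),
    OntologyNode("telemetry", children=[
        OntologyNode("connection_start"),
        OntologyNode("connection_end"),
        OntologyNode("bytes_transferred"),
    ]),
])

leaf_names = [leaf.name for leaf in ontology.leaves()]  # the classifier's targets
```

Note how business purpose (the `sensitive` flag) and context (the nesting) live directly on the tree, which is exactly what lets downstream tooling act on the experts' knowledge.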
In this example, the producers of the data helped us create an ontology that explained the data we were seeing: which data are sensitive and which can be freely shared, which are handset data and which are network data, and which telemetry field belongs where.
The first few nodes of the ontology
This allows the data operators to understand and contextualize the data. Both business purpose (here, obfuscation) and context (here, field descriptions and nesting according to the subject matter) are embedded directly into the data ontology.
The next step is adding training data to the ontology. This data is used to train a neural network to understand and identify each class in the ontology.
This further embeds the domain experts’ knowledge of the data into the process (as their knowledge of the training data set allowed them to perfectly map that data to the ontology).
The above-created ontology showing attached training data: 2,099 data points from, amongst other sources, “British Names” and “US Names” datasets.
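Attaching training data amounts to labeling each leaf class with examples. A hypothetical helper (the function name and threshold are ours, not the Datalogue SDK's) might validate that every class has enough examples to learn from:

```python
def attach_training_data(leaf_names, examples, min_examples=3):
    """Attach labeled examples to each leaf class, refusing classes
    too thin to train on. Purely illustrative of the workflow."""
    attached = {}
    for leaf in leaf_names:
        samples = examples.get(leaf, [])
        if len(samples) < min_examples:
            raise ValueError(f"class {leaf!r} needs at least {min_examples} examples")
        attached[leaf] = list(samples)
    return attached

training = attach_training_data(
    ["first_name", "family_name"],
    {
        "first_name": ["Amelia", "Marcus", "Priya"],   # e.g. from a "US Names" dataset
        "family_name": ["Okafor", "Nguyen", "Smith"],  # e.g. from a "British Names" dataset
    },
)
```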
Once training data is attached to each leaf node in the ontology, the user can automatically train a neural network model to classify data into these classes.
The default option trains a model that takes each string in as a series of character embeddings (a matrix representing the string), and uses a very deep convolutional neural network to learn the character distributions of these classes of data.
This model also heeds the context of the datastore—where the data themselves are ambiguous, other elements are considered, such as neighboring data points and column headers.
This “on rails” option will be sufficient for most classification problems, and allows a non-technical user to quickly create performant models.
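The production model is a deep convolutional network over character embeddings, which is too heavy to reproduce here; but the core intuition, that each class has a characteristic character distribution, can be sketched in a few lines of plain Python. All class names and training strings below are toy data, not the real model.

```python
from collections import Counter

def char_profile(strings):
    """Normalized character frequencies across a class's examples."""
    counts = Counter(ch for s in strings for ch in s.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def similarity(profile, s):
    """Average per-character probability of s under a class profile."""
    return sum(profile.get(ch, 0.0) for ch in s.lower()) / max(len(s), 1)

TRAIN = {  # tiny illustrative training set per ontology leaf
    "first_name": ["Amelia", "Marcus", "Priya", "Chen"],
    "connection_start": ["2021-03-04T10:22:31Z", "2021-03-05T08:01:12Z"],
    "bytes_transferred": ["10432", "998211", "52"],
}
PROFILES = {cls: char_profile(xs) for cls, xs in TRAIN.items()}

def classify(value):
    """Assign the class whose character distribution best matches value."""
    return max(PROFILES, key=lambda cls: similarity(PROFILES[cls], value))
```

Timestamps score highest under the `connection_start` profile (digits plus "-", ":", "t", "z"), names under `first_name`, and bare numbers under `bytes_transferred`; the real model additionally uses neighboring values and column headers to break ties, as described above.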
Just click TRAIN
Where there is a bit more time and effort available, and where more experimentation and better results may be required, an ML engineer can use “science mode” to experiment with hyperparameter tuning, and generally have more control over the training process.
Once the model has been trained, the user can see its performance on the validation and test sets, with statistics such as precision, recall, and f1 score.
Model metrics report for the model trained above, showing an f1 score of ~0.83, with the majority of the confusion coming from non-telemetry classes like `first name` v `family name` v `city` v `state`.
As you can see, with little work this model disambiguates the telemetry classes effectively.
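As a quick sanity check on the ~0.83 figure in the report, the f1 score is the harmonic mean of precision and recall; the confusion counts below are made up purely to reproduce that score.

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 83 true positives, 17 false positives, 17 false negatives
score = f1(tp=83, fp=17, fn=17)  # precision = recall = 0.83, so f1 = 0.83
```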
Now that we have a model, we can use it to create pipelines that work for both the European and US data stores, and are resilient to changes in the incoming schemata of these sources.
This pipeline has some novel concepts:
The marginal cost of adding a new source, or of remediating a changed source schema, is now just the cost of verifying the model’s results—no new manual mapping or pipelining required.
The results of the classification showing that for completely differently structured sources, the neural network is able to assign the correct classes. See for example “Connection_Start” v “main_ConnectionStart”—a simple but illustrative example.
The classification-based pipeline that standardizes the European data to the identical format.
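The mechanics of such a classification-based pipeline can be sketched as: classify a sample of each column's values, vote on the winning ontology class, and rename columns accordingly. The column names echo the "Connection_Start" v "main_ConnectionStart" example above; the classifier here is a rule-based stand-in for the trained model, and everything else is our illustrative naming.

```python
import re
from collections import Counter

def classify_columns(rows, classify):
    """Map each source column name to its predicted ontology class
    by majority vote over the column's values."""
    votes = {col: Counter() for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            votes[col][classify(str(value))] += 1
    return {col: counter.most_common(1)[0][0] for col, counter in votes.items()}

def standardize(rows, classify):
    """Rename columns to their predicted classes: source-agnostic output."""
    mapping = classify_columns(rows, classify)
    return [{mapping[col]: v for col, v in row.items()} for row in rows]

def toy_classify(value):
    """Stand-in for the trained model (illustrative rules only)."""
    if re.match(r"\d{4}-\d{2}-\d{2}T", value):
        return "connection_start"
    if value.isdigit():
        return "bytes_transferred"
    return "unknown"

# Two differently structured sources, one standardized output.
eu_rows = [{"main_ConnectionStart": "2021-03-04T10:22:31Z", "octets": "10432"}]
us_rows = [{"Connection_Start": "2021-03-05T08:01:12Z", "bytes": "998211"}]

eu_std = standardize(eu_rows, toy_classify)
us_std = standardize(us_rows, toy_classify)
```

Because column names never appear in the output logic, a renamed or restructured source only changes what the classifier sees, not the pipeline itself—which is why the marginal cost of a new source collapses to verification.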
From completely differently named and structured sources, we now have a standardized output:
European data format
US data format
A single standardized output that is agnostic to source format.
Now you have a clean dataset to use for analytics. One datastore to be used for:
Neural network based data pipelines were measured as being 99.998% faster than traditional methods
That faster time to data means:
And for the aforementioned telco, tens of millions of dollars in savings per year.
The above was a simple example used to highlight the model creation, deployment and pipelining.
It used pipelines from just three sources.
In deployment, for the above telco, more than 100k pipelines are created each month, and that number is growing exponentially.
100,000 pipelines created per month