A lot of people have asked us, and challenged us on, what we mean when we say ‘proprietary UGC technology’. What do we really mean? Have we really developed something of worth, or are we just a glorified online photo album?
It’s hard to answer this in the right frame, as everyone I know (data scientists, developers, investors and journalists) has asked it, and they each have their own depth of technical ability.
Regardless, I’ll start at the top and work down.
Artificial Intelligence System
At the heart of our system is something called a ‘rule engine’ or ‘expert system’ (wikipedia here). Rule engines tend to exist in systems where many factors come into play in making a decision. Typical applications are in medical diagnostics and financial lending, i.e. areas where numerous pieces of data are combined to form a final decision.
In essence, a rule engine is a clever way to store and manage many different predefined decisions in a single place.
As a basic, high-level example of a decision (or ‘rule’, to use the correct term): Newslinn has a rule whereby if the image is too small, it is not accepted into our photo stream. That’s a very simple rule and very easy to program.
But rules can get far more complex and involve a large number of data points. As another example (without going too extreme), below is an actual rule we currently use. It validates the dimensions and the ‘dpi’ of a photo, and also whether the user is known to us.
This is an example of a single rule. We currently have over 23 rules (as of September 2015) and growing; our aim is to have about 70-100. Each rule caters for a particular use case and, for us, declines or accepts photos. The rules can be as creative and inventive as we like, and use a mixture of data points from user internet devices, user session data, image data, photo metadata, photo object data, etc.
What the rule engine allows us to do is create a practically unlimited number of decisions, tie them together, and manage them in a really efficient way.
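To make the idea concrete, here is a minimal sketch in Python of how rules might be stored and evaluated in one place. It is not our production rule or engine: the `Photo` fields, the rule names and the thresholds are all hypothetical, chosen only to mirror the three checks mentioned above (dimensions, dpi and whether the user is known).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Photo:
    """Hypothetical bundle of data points gathered for one upload."""
    width: int
    height: int
    dpi: int
    user_is_known: bool


@dataclass(frozen=True)
class Rule:
    """A named condition; a photo fails the rule when `passes` returns False."""
    name: str
    passes: Callable[[Photo], bool]


# Illustrative thresholds only; these are assumptions, not real values.
MIN_WIDTH, MIN_HEIGHT, MIN_DPI = 640, 480, 72

RULES: Sequence[Rule] = (
    Rule("min_dimensions", lambda p: p.width >= MIN_WIDTH and p.height >= MIN_HEIGHT),
    Rule("min_dpi", lambda p: p.dpi >= MIN_DPI),
    Rule("known_user", lambda p: p.user_is_known),
)


def evaluate(photo: Photo, rules: Sequence[Rule] = RULES) -> Tuple[bool, List[str]]:
    """Run every rule and return (accepted, names of the rules that failed)."""
    failed = [rule.name for rule in rules if not rule.passes(photo)]
    return (not failed, failed)


good = Photo(width=1920, height=1080, dpi=300, user_is_known=True)
tiny = Photo(width=320, height=240, dpi=72, user_is_known=False)

print(evaluate(good))  # (True, [])
print(evaluate(tiny))  # (False, ['min_dimensions', 'known_user'])
```

Because every rule lives in one list, adding a new use case is a matter of appending another `Rule`, and the list of failed rule names explains why a photo was declined.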
There are two major challenges with rule engines.
1) Gathering all the little data points you need. We address this using our own proprietary image data extraction system, which ranges from basic metadata collection to ELA (error level analysis) and face detection.
2) Identifying the rules you need ahead of time. This is the magic of the rule engine: it needs ‘experts’ who understand the domain and how the rule engine is applied in the system. This is what we actively do each time we talk to journalists or analyse what photos people are sharing, and it’s the focus of our ‘Dublin Social Project’.
Machine Learning System
The rule engine aids in fraud measures and validation of photos. Sitting outside of the rule engine is our machine learning application. It is part of the validation workflow, but can be thought of as separate from the rule engine.
So while the rule engine manages fraud, the machine learning system manages ‘classification’: whether a photo is ‘more like’ a ‘good’ photo or ‘more like’ a ‘bad’ photo.
To do this we’re using something called ‘supervised classification’, which is a top-level term for a method of grouping things together by comparing them with other things that you already know about (wikipedia here).
This is part of our further research into applying machine learning to news validation and UGC, so I’ll go into it further in another post.
We are currently experimenting with scikit-learn’s SGDClassifier and LinearSVC.
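For readers who want to see what ‘more like a good photo’ can mean in practice, here is a rough sketch of supervised classification using a nearest-centroid classifier written in plain Python. This is a deliberately simpler stand-in, not the SGD or linear SVC models mentioned above, and the two features (a sharpness score and a resolution score, both scaled 0 to 1) and all the numbers are made up for illustration.

```python
from math import dist
from typing import Dict, List, Sequence

Features = Sequence[float]


def centroid(points: List[Features]) -> List[float]:
    """Average each feature across a list of labelled examples."""
    return [sum(column) / len(points) for column in zip(*points)]


def train(labelled: Dict[str, List[Features]]) -> Dict[str, List[float]]:
    """Learn one 'typical' point (centroid) per label from known examples."""
    return {label: centroid(examples) for label, examples in labelled.items()}


def classify(model: Dict[str, List[float]], features: Features) -> str:
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training data: [sharpness, resolution], each scaled to 0..1.
training_data = {
    "good": [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7)],
    "bad": [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3)],
}

model = train(training_data)

print(classify(model, (0.75, 0.6)))  # good
print(classify(model, (0.25, 0.4)))  # bad
```

The ‘supervised’ part is the labelled training data: the classifier only knows what ‘good’ and ‘bad’ look like because we showed it examples of each first.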
Exact Technologies we use