Big data has become one of the most notorious buzzwords of the past 20 years. We are living in an era where internet-centric compute and data storage platforms are making their way into traditional corporations.
For any corporate employee—within IT or any line of business, in any department, and at any level from front-line worker to CxO—the most important thing to understand about big data is where it came from and how it applies to any company’s operations to produce value.
The Short History of Big Data - Big Data 101
While other big data histories catalog key scientific articles or chart a large number of milestones in a long-term, interactive format, the history below focuses on the most relevant and recent trends that have brought big data to corporations across the globe and helped them monetize it.
Big data really began to show up when the internet first became popular along with search engines. In the mid-1990s, companies like Excite, Yahoo!, and Lycos developed web crawlers to index all the websites and web pages on the World Wide Web so that someone could type in a search term and get a list of relevant webpages. These capabilities had to deal with a quantity of data that had never really been addressed before, except in scientific circles.
Back then, big data was about capturing and crunching web page content and transforming it into search applications. Algorithms would parse billions of web pages of raw text, track the pages, count link relationships, determine and store keywords, and ultimately score each page on how well it matched certain search terms. Advertisers could then bid on advertising against those keywords. Essentially, search engines had to consume data in a wide variety of unstructured formats and make sense of it in a more structured way.
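The parse, count, and score steps described above can be illustrated with a minimal sketch. It is not how any particular search engine worked: the page contents, the URLs, the function names, and the 0.5 weighting per inbound link are all assumptions made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy "crawl": URL -> (raw text, outgoing links). Not real data.
PAGES = {
    "a.example": ("Big data history and search engines", ["b.example", "c.example"]),
    "b.example": ("Search engines index web pages", ["c.example"]),
    "c.example": ("Web pages about big data and big servers", []),
}

def tokenize(text):
    """Lowercase the raw text and split it into alphabetic keywords."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Return (inverted index, inbound link counts) for the crawled pages."""
    index = defaultdict(Counter)   # keyword -> {url: occurrences}
    inbound = Counter()            # url -> number of pages linking to it
    for url, (text, links) in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
        for target in links:
            inbound[target] += 1
    return index, inbound

def search(query, index, inbound):
    """Score pages by keyword matches, plus a small bonus per inbound link."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Illustrative weighting only: each inbound link adds 0.5 to the score.
    ranked = {url: s + 0.5 * inbound[url] for url, s in scores.items()}
    return sorted(ranked, key=lambda u: (-ranked[u], u))

index, inbound = build_index(PAGES)
results = search("big data", index, inbound)
```

Here unstructured text goes in and a structured keyword-to-page index comes out, which is the transformation the early search engines had to perform at the scale of billions of pages.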
Of course, Google began to take precedence with their PageRank algorithm and brought new methods to the infant search industry. During this period, the volume of data grew along with the number of internet users and daily use of search engines. According to Internet Live Stats:
- In 1998, when Google was founded, it was serving 10,000 search queries per day.
- In 1999, Google was answering 3.5 million queries per day.
- By the time Google IPO-ed in 2004, it supported 200 million queries per day.
- In 2012, Google was supporting 3.3 billion searches per day and crawling 30 trillion unique URLs.
The Rise of Hadoop in Solving Big Data Problems
During this time, the cost of computing was steadily declining, making processing on commodity servers less and less expensive. Given this economic trajectory and explosive growth of the Internet, Google, Amazon, and other internet-centric software companies brought new methods of computer science to the table to process big data. These methods dealt with the scale and volume of the world’s largest applications, data sets, and data centers.
Solutions included distributed application workloads and distributed data storage platforms. These took parallel processing to an extreme and utilized hundreds of thousands of servers. One of the most widely known of these big data technologies is MapReduce, which Google described in a paper in 2004. In 2005, this concept was ultimately transformed into the open source project known as Apache Hadoop™. The internet applications built on platforms like these were not like the applications used behind corporate firewalls; they supported hundreds of millions of users and petabytes of data. These “internet grade” architectures went way beyond the ability of the client-server applications, databases, data warehouses, and simple internet applications used by most traditional corporations.
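The MapReduce idea mentioned above can be sketched in a few lines. This is a single-process illustration of the programming model only, not Hadoop's API and not Google's implementation; the function names and sample documents are assumptions. In a real deployment, the map and reduce steps would run in parallel across many servers, with the framework handling the grouping in between.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input; each string stands in for one document on one server.
DOCS = ["big data big value", "data science finds value"]
counts = map_reduce(DOCS)
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, neither step depends on the others, which is what lets the work spread across commodity servers.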
Big Data Challenges in the Enterprise
At the same time that the search companies were dealing with the voracious growth of the Internet, enterprises also continued to see data grow within their data centers and on external web resources. Analytics, data marts, business intelligence (BI), and data warehouses became critical applications for management and sprang up in myriad places, often duplicating data.
In addition to the steady growth of traditional ERP, SCM, and CRM systems, unstructured data in emails, documents, images, and videos was outpacing the growth of structured data. Self-service forums proliferated. Call centers supported emails and chats, not just calls. Email use grew. Internet application use grew. These trends form the basis of big data in the enterprise. Today, the world continues to buy 10 million servers a year, and storage has often been cited as a top enterprise pain point during the past decade.
Beyond the applications of the 1990s and 2000s, there are also many new sources of data for corporations to capture. Recently, mobile phone and social media use have both grown significantly. Security and access data is tracked more than ever. With the Internet of Things (IoT), companies are adding sensors to all kinds of objects, from consumer goods to utility meters to cars.
Meanwhile, regulations are forcing more data to be stored for historical purposes, for reporting, and for legal discovery. Every business is facing an environment where petabytes of data are regularly generated. As examples of innovative uses of big data in traditional industries emerged, enterprises began to recognize the operational and competitive advantages that could come from analyzing their growing datasets alongside external data (e.g., social media, weather, and traffic).
These two worlds, the internet and traditional enterprises, are converging, but many challenges remain. Companies are taking technologies developed to solve Internet challenges and applying them to traditional industries, sometimes disrupting longtime industry leaders. But rather than serving one large search application, these big data technologies must now solve new problems and support new applications. The challenge is determining how big data can help a company operate: growing its top line, improving its bottom line, reducing risk, and increasing customer loyalty through data science.
Big Data and Data Science—Getting Value from the Needles in the Haystack
Most commonly, big data is described, following a META Group report, as data with three "V"s: a large volume, a wide variety, and a high velocity, meaning it arrives in an enterprise at a speed that requires it to be captured, analyzed, and put to use very quickly. Dozens of articles describe it, and many people have articulated their own version and meaning of the term big data. However, there is consensus that simply storing large data volumes is not, in fact, "big data"; what is important is that big data can be used to do something of value.
This is where data science comes into play. Most importantly, data science uses a very wide set of algorithms, even borrowing from a wide variety of sciences, to crunch through massive amounts of data and find new insights. The data scientist has much in common with those who use statistics and math, but also adds machine learning techniques, pattern recognition, natural language processing, predictive analytics, deep learning, and other methods to understand broader sets of data in different ways.
As a comparison, business intelligence (BI) apps might use a data warehouse to look at a metric, such as revenue, across 100 dimensions like continent, country, region, territory, state, store, month, quarter, year, product line, product, category, price, type, and other characteristics. Big data, on the other hand, would capture the data from an individual consumer and behavioral level, then tie the consumer to their web traffic data, in-store traffic, purchase history, mobile app use, coupon use, call center incidents, returns, and social media use to determine the most important drivers of revenue and loyalty for a very specific micro-segment. This type of data set would include a lot more information than a traditional market-basket analysis or BI report.
In this example, the data scientist would find patterns across more than 1,000 distinct elements or features of the data, even predicting a consumer’s next purchase. Similarly, data science can be used with big data in B2B settings. For the maintenance of equipment like planes or power generators, big data might capture what 100 sensors are doing at one-second intervals, but data science would look at the history of events to predict when maintenance is needed and in what areas. Data science can also connect records of data, like financial trades, to unstructured email text to flag problematic situations for compliance.
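Returning to the retail comparison above, the difference between a BI roll-up and a consumer-level data set can be shown with a small sketch. All records, field names, and function names here are hypothetical, and a real data set would have far more features per consumer than this toy one.

```python
from collections import defaultdict

# Hypothetical purchase records: (consumer_id, country, month, revenue).
PURCHASES = [
    ("c1", "US", "2024-01", 40.0),
    ("c2", "US", "2024-01", 25.0),
    ("c1", "US", "2024-02", 60.0),
    ("c3", "DE", "2024-01", 30.0),
]
# Hypothetical behavioral events: (consumer_id, event_type).
EVENTS = [
    ("c1", "web_visit"), ("c1", "coupon_use"), ("c1", "web_visit"),
    ("c2", "call_center"), ("c3", "web_visit"),
]

def bi_rollup(purchases):
    """BI-style view: total revenue per (country, month) dimension pair."""
    totals = defaultdict(float)
    for _, country, month, revenue in purchases:
        totals[(country, month)] += revenue
    return dict(totals)

def consumer_profiles(purchases, events):
    """Consumer-level view: one feature dictionary per individual."""
    profiles = defaultdict(lambda: defaultdict(float))
    for consumer, _, _, revenue in purchases:
        profiles[consumer]["revenue"] += revenue
        profiles[consumer]["purchases"] += 1
    for consumer, event_type in events:
        profiles[consumer][event_type] += 1
    return {c: dict(f) for c, f in profiles.items()}

rollup = bi_rollup(PURCHASES)
profiles = consumer_profiles(PURCHASES, EVENTS)
```

The roll-up answers questions about a metric along fixed dimensions, while each consumer profile is a row of features that a data scientist could feed into a predictive model.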
Conclusions: Big Data Drives Big Value with Big Algorithms
When we look back at the history of search engines, Google innovations, and the rise of Hadoop, we should also add the context of their motivations. To monetize their search engines and show revenue growth, these companies needed to sell search advertising. To do that, they needed to make search results, including the paid-for search ads, as relevant and high quality as possible. In other words, they needed to optimize the process of matching searchers to advertising bids that monetize through ad clicks. Data science algorithms allowed them to continue optimizing this process.
Data science is what takes big data and turns it into something valuable for companies. But companies have struggled with shortages in data science skills and questions about how to leverage existing investments in BI tools. As these challenges are overcome, companies can embrace data science and big data to transform into data-driven enterprises.