Microsoft Fabric Terminology

The following are the basic terms used inside the Microsoft Fabric ecosystem. They are drawn from the official Fabric documentation and will serve as a reference for our future articles on Fabric.

Generic Terms

  1. Capacity: It’s a dedicated set of resources available for use. It defines how much work a resource can handle. Different tasks consume different amounts of capacity. Fabric offers capacity through Fabric SKUs and trials.
  2. Experience: Think of it as a bundle of capabilities focused on a specific task. In Fabric, experiences include things like Synapse Data Warehouse, Data Engineering, Data Science, Real-Time Analytics, Data Factory, and Power BI.
  3. Item: An item is a specific set of capabilities within an experience. Users can create, edit, or delete items. For example, in the Data Engineering experience, you’ll find items like lakehouses, notebooks, and Spark job definitions.
  4. Tenant: A tenant is like a single instance of Fabric for an organization. It’s tied to a Microsoft Entra ID.
  5. Workspace: Picture it as a collaborative environment where different functionalities come together. It’s like a container that uses capacity for executing work. In a workspace, users create reports, notebooks, semantic models, and more.

Synapse Data Engineering

  1. Lakehouse: A lakehouse is a collection of files, folders, and tables representing a database over a data lake. It’s used by the Apache Spark and SQL engines for big data processing. Lakehouses include enhanced capabilities for ACID transactions using open-source Delta formatted tables. They’re hosted within unique workspace folders in Microsoft OneLake, containing files in various formats organized in folders and subfolders.
  2. Notebook: A Fabric notebook is a multi-language interactive programming tool. It allows users to author code and markdown, run and monitor Spark jobs, view and visualize results, and collaborate with teams. Data engineers and data scientists use notebooks to explore and process data, build machine learning experiments with both code and low-code experiences, and easily transform them into pipeline activities for orchestration. A minimal PySpark sketch follows this list.
  3. Spark Application: An Apache Spark application is a program written by a user using Spark’s API languages (Scala, Python, Spark SQL, or Java) or Microsoft-added languages (.NET with C# or F#). When an application runs, it’s divided into one or more Spark jobs that run in parallel to process data faster.
  4. Spark Job: A Spark job is part of a Spark application, running in parallel with other jobs. Each job consists of multiple tasks.
  5. Spark Job Definition: A set of user-defined parameters indicating how a Spark application should run. It allows submission of batch or streaming jobs to the Spark cluster.
  6. V-Order: A write-time optimization for the Parquet file format that enables fast reads and improves performance.
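
Here is a minimal sketch tying these pieces together, as it might run in a Fabric notebook cell. It assumes a lakehouse is attached to the notebook (so the `spark` session and the `Files/` and `Tables/` areas are available); the CSV path, column names, and table name are hypothetical, and the V-Order setting name is taken from the Fabric runtime documentation, so verify it for your runtime version.

```python
# A sketch only: assumes this runs in a Fabric notebook with a lakehouse
# attached, where the `spark` session is predefined.
from pyspark.sql import functions as F

# Enable V-Order on write (setting name assumed from the Fabric runtime
# documentation; verify for your runtime version).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Read raw files from the lakehouse Files area (hypothetical path).
df = spark.read.option("header", "true").csv("Files/raw/sales.csv")

# A small transformation: keep recent rows and stamp the load time.
df = (
    df.where(F.col("order_date") >= "2024-01-01")
      .withColumn("loaded_at", F.current_timestamp())
)

# Persist as a Delta table in the lakehouse Tables area so that both the
# Spark engine and the SQL analytics endpoint can query it.
df.write.format("delta").mode("overwrite").saveAsTable("sales_clean")
```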

Data Factory

  1. Connector: Data Factory offers a rich set of connectors that allow you to connect to different types of data stores. Once connected, you can transform the data. 
  2. Data pipeline: In Data Factory, a data pipeline is used for orchestrating data movement and transformation. These pipelines are different from the deployment pipelines in Fabric. A sketch of triggering a pipeline run programmatically follows this list.
  3. Dataflow Gen2: Dataflows provide a low-code interface for ingesting data from hundreds of data sources and transforming your data. Dataflows in Fabric are referred to as Dataflow Gen2. Dataflow Gen2 offers extra capabilities compared to Dataflows in Azure Data Factory or Power BI. You can’t upgrade from Gen1 to Gen2. 
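
The Fabric REST API exposes a job scheduler that can run a pipeline on demand. The sketch below is a hedged illustration: the route and `jobType` value reflect my reading of the REST docs, the IDs are placeholders, and the bearer token is assumed to come from Microsoft Entra (for example via the `azure-identity` package), so check the official API reference before relying on it.

```python
import requests

WORKSPACE_ID = "<workspace-guid>"      # placeholder
PIPELINE_ID = "<pipeline-item-guid>"   # placeholder
TOKEN = "<entra-access-token>"         # assumed to come from Microsoft Entra

# Assumed route: the Fabric job scheduler's on-demand run endpoint.
url = (
    "https://api.fabric.microsoft.com/v1"
    f"/workspaces/{WORKSPACE_ID}/items/{PIPELINE_ID}"
    "/jobs/instances?jobType=Pipeline"
)

resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# The run is acknowledged asynchronously; the Location header points at
# the job instance that can be polled for status.
print(resp.status_code, resp.headers.get("Location"))
```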

Synapse Data Science

  1. Data Wrangler: It’s a notebook-based tool for exploratory data analysis. Users can display data in a grid, perform data-cleansing operations, and generate reusable code snippets.
  2. Experiment: A machine learning experiment is the main organizational unit for related machine learning runs. It helps manage and control different runs.
  3. Model: A machine learning model is a trained file that recognizes specific patterns. It uses an algorithm to reason over and learn from a dataset.
  4. Run: A run corresponds to a single execution of model code. In MLflow, tracking is based on experiments and runs. A short MLflow sketch follows this list.
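
The experiment/run/model vocabulary maps directly onto the MLflow API, for which Fabric notebooks preconfigure a tracking backend. A minimal sketch, assuming scikit-learn is available; the experiment name and model are hypothetical.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_experiment("demo-experiment")   # the Experiment item

X, y = load_iris(return_X_y=True)

with mlflow.start_run():                   # one Run = one execution
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")   # the trained Model artifact
```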

Synapse Data Warehousing

  1. SQL Analytics Endpoint:
    • It’s like a special gateway for querying data within a Lakehouse.
    • You can use T-SQL (Transact-SQL) to query the data stored in Delta tables.
    • Think of it as a way to explore and analyze your lake data using familiar SQL commands (see the sketch after this list).
  2. Synapse Data Warehouse:
    • This is like a classic data warehouse, but with a twist.
    • It supports full transactional T-SQL capabilities (that’s the language used for talking to databases).
    • So, you can do all the usual data warehousing stuff—like creating tables, loading data, and managing objects.
    • It’s like having a powerful toolbox for organizing and analyzing your enterprise data.
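
Here is a sketch of querying a lakehouse through its SQL analytics endpoint with T-SQL over ODBC. The server name comes from the endpoint’s connection details in the Fabric portal; the values below are placeholders, Microsoft Entra interactive sign-in is assumed, and the table is the hypothetical `sales_clean` from the earlier notebook sketch.

```python
import pyodbc

# Placeholders: copy the real server name from the SQL analytics endpoint's
# connection details in the Fabric portal.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<your-lakehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# Plain T-SQL against the Delta tables the endpoint exposes.
for row in conn.execute(
    "SELECT TOP 10 order_id, order_date "
    "FROM sales_clean ORDER BY order_date DESC"
):
    print(row.order_id, row.order_date)
```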

Synapse Real-Time Analytics

  1. KQL database: The KQL database holds data in a format that you can execute KQL queries against.
  2. KQL Queryset: The KQL Queryset is the item used to run queries, view results, and manipulate query results on data from your KQL database. The queryset includes the databases and tables, the queries, and the results. The KQL Queryset allows you to save queries for future use, or export and share queries with others. A sketch of running a KQL query programmatically follows this list.
  3. Event stream: The Microsoft Fabric event streams feature provides a centralized place in the Fabric platform to capture, transform, and route real-time events to destinations with a no-code experience. An event stream consists of various streaming data sources, ingestion destinations, and an event processor when the transformation is needed. 
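
Here is a sketch of running a KQL query against a Fabric KQL database from Python with the `azure-kusto-data` package. The query URI is copied from the KQL database item in the portal; the cluster URI, database name, and table name below are placeholders.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://<your-cluster>.kusto.fabric.microsoft.com"  # placeholder

# Authenticate however suits your setup; Azure CLI sign-in is one option.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)

# KQL: count events per hour over the last day (table name assumed).
query = """
Events
| where Timestamp > ago(1d)
| summarize count() by bin(Timestamp, 1h)
"""

result = client.execute("<your-kql-database>", query)
for row in result.primary_results[0]:
    print(row)
```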

OneLake

Shortcut: Shortcuts are embedded references within OneLake that point to other file store locations. They provide a way to connect to existing data without having to copy it directly. A short notebook sketch follows.
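
Once a shortcut exists under a lakehouse, it reads like any other folder; no data is copied. A minimal notebook sketch, assuming a lakehouse is attached and a shortcut hypothetically named `external_sales` points at Parquet files elsewhere in OneLake or in external storage.

```python
# `spark` is predefined in a Fabric notebook; the shortcut path is a placeholder.
df = spark.read.parquet("Files/external_sales/2024/")
df.show(5)
```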
